fix(annotator,types): bound untrusted strings flowing into the YAML sidecar (#232) #243

Merged
williamzujkowski merged 1 commit into main from fix/bound-yaml-strings-232
Jun 30, 2026

@williamzujkowski

Collaborator

Summary

caseName, citation, statuteVersionRef, and statuteVersionNote were unbounded z.string()s. They originate from the CourtListener API and are written into a hand-rolled YAML sidecar, so an arbitrarily long (hostile or accidental) value could bloat output and widen the injection surface.

Fix

  • Schema caps (fail-closed backstop): CaseAnnotationSchema now bounds caseName ≤ 500, citation ≤ 200, statuteVersionRef ≤ 100, statuteVersionNote ≤ 500 (holdingSummary was already 500). Nothing over-cap can ever be serialized.
  • Code-unit-safe truncation: generalized the snippet truncation into truncate(value, max) — slices by UTF-16 code unit (matching the schema's code-unit caps) and drops a trailing lone high surrogate so a split never leaves a broken character. Applied to caseName and citation (alongside holdingSummary), so legitimately long values are clamped rather than dropped by validation.
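The truncation described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `ELLIPSIS` constant, the `isHighSurrogate` helper, and the choice to count the ellipsis toward `max` are all assumptions made for the example.

```typescript
// Hypothetical sketch of a code-unit-safe truncate(value, max).
// Assumption: the result, including the ellipsis, must fit within `max`
// UTF-16 code units so that it still passes a schema cap of the same size.
const ELLIPSIS = "…"; // assumed marker; a single UTF-16 code unit

// True when the code unit is the first half of a surrogate pair.
function isHighSurrogate(codeUnit: number): boolean {
  return codeUnit >= 0xd800 && codeUnit <= 0xdbff;
}

function truncate(value: string, max: number): string {
  // String.prototype.length counts UTF-16 code units, not code points.
  if (value.length <= max) return value;
  if (max < ELLIPSIS.length) return "";

  // Reserve room for the ellipsis, then slice by code unit.
  let end = max - ELLIPSIS.length;

  // If the cut lands between the two halves of a surrogate pair, the last
  // kept unit is a lone high surrogate; drop it so no broken character remains.
  if (end > 0 && isHighSurrogate(value.charCodeAt(end - 1))) {
    end -= 1;
  }
  return value.slice(0, end) + ELLIPSIS;
}

// "😀" occupies two code units, so "ab😀cd" has length 6.
const sample = "ab\u{1F600}cd";
console.log(truncate("abc", 5));   // "abc" (already within the cap)
console.log(truncate(sample, 4));  // "ab…" (the split emoji is dropped whole)
console.log(truncate(sample, 5));  // "ab😀…" (the emoji fits intact)
```

Because the guard only ever removes one extra code unit, the output can be one unit shorter than `max` but never longer, which is what lets a clamped value pass a length cap measured in the same units.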

Tests

  • schemas.test.ts: each cap rejects over-length and accepts at-cap.
  • annotator.test.ts: over-cap untrusted fields are clamped so the annotation still validates; truncation that splits an emoji surrogate pair never leaves a lone high surrogate before the ellipsis.

Full local CI (build + typecheck + lint + test) green.

Closes #232

…idecar (#232)

caseName, citation, statuteVersionRef and statuteVersionNote were unbounded
z.string()s. They originate from CourtListener and are written into a
hand-rolled YAML sidecar, so an arbitrarily long value could bloat output and
widen the injection surface.

- Add length caps to CaseAnnotationSchema: caseName(500), citation(200),
  statuteVersionRef(100), statuteVersionNote(500) — a fail-closed backstop so
  nothing over-cap is ever serialized.
- Generalize the snippet truncation into truncate(value, max): it slices by
  UTF-16 code unit (matching the schema's code-unit caps) and drops a trailing
  lone high surrogate so a split never leaves a broken character. Apply it to
  caseName and citation (in addition to holdingSummary) so legitimately long
  values are clamped rather than dropped by validation.

Adds schema cap tests and annotator truncation/surrogate-guard tests.

Closes #232

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@williamzujkowski williamzujkowski requested a review from a team as a code owner June 30, 2026 01:33
@williamzujkowski williamzujkowski merged commit 66b7e3b into main Jun 30, 2026
3 checks passed
@williamzujkowski williamzujkowski deleted the fix/bound-yaml-strings-232 branch June 30, 2026 01:48

Development

Successfully merging this pull request may close these issues.

hardening(annotator/types): bound untrusted strings flowing into hand-rolled YAML (schema maxlen + surrogate-pair split)