
docs: capture/perf week report, PR plan + 1.0.0 readiness assessment #86

Open
jbachorik wants to merge 76 commits into main from pr/docs-2026-06-18-capture-perf

@jbachorik

Collaborator

What does this PR do?

Adds documentation for the 2026-06-11→18 capture-correctness and performance effort: the week research/achievement report, the 5-PR decomposition plan (#81-#85), the Reggie 1.0.0 public-release readiness assessment, and the design/plan docs from the design→adversarial-review→plan loops.

Motivation

Capture what was achieved, how the work is being landed (the stacked PR series), and an evidence-backed go/no-go for a public 1.0.0.

Related Issue(s)

Documents the work in #81, #82, #83, #84, #85.

Change Type

  • Bug fix
  • New feature
  • Performance improvement
  • Refactoring (no functional change)
  • Documentation
  • Test improvement
  • Build/CI change

Checklist

  • I have read the CONTRIBUTING.md guidelines
  • All existing tests pass (./gradlew build)
  • I have added tests for my changes
  • I have updated documentation (if applicable)
  • My commits are signed

Performance Impact

None (documentation only).

Additional Notes

Docs-only, based on main (independent of the #81-#85 code stack). The 1.0.0 readiness assessment concludes that Reggie is not clean for a public 1.0 yet; see its punch list.

🤖 Generated with Claude Code

jbachorik and others added 30 commits June 11, 2026 17:28
Eliminates B10/B15 FallbackPatternDetector predicates and partially
eliminates B16 by routing the affected DFA_*_WITH_GROUPS patterns to
PIKEVM_CAPTURE before the DFA state-count ladder:

- B10: optional prefix before capturing group (e.g. -?(-?.{3}).)
- B15: capturing group in quantified alternation (e.g. (a|b){2,})
- B16 (partial): nullable outer quantifier on capturing group with
  non-nullable content (e.g. (a)?); patterns where both the outer
  quantifier and group content are nullable (e.g. (0*-?){0,}) still
  fall back to JDK via the new hasNullableGroupContentWithNullableQuantifier
  predicate.

Both the capture-ambiguous TDFA path and the non-ambiguous DFA-with-groups
path now have the three gates before the DFA strategy ladder. Fuzz gate:
findings=0 (9530 patterns, 76240 inputs).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add PIKEVM gate inside the capturing TDFA isAnchorConditionDiluted()
block: patterns where both branches share a leading character but one
branch carries a start-anchor guard (e.g. ^x|x(y)) now route to
PIKEVM_CAPTURE instead of the JDK fallback. PikeVM evaluates ^/\A
correctly against the search-region origin since commit 0acfc66.

Patterns with optional quantifiers, nullable branches, or leading
end-anchors still fall through to the anchorConditionDiluted JDK path.
Fuzz gate confirms zero divergences with the new routing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- NFABytecodeGenerator: add zero-length early-accept before bounds/regionMatches
  in generateBackreferenceCheck; groupLen==0 trivially succeeds (vacuous match)
- FallbackPatternDetector: replace broad hasNullableBackrefGroup B7 guard with
  narrowed hasAmbiguouslyNullableBackrefGroup that only falls back when the group
  body can capture strings of length > 1 (unbounded contamination risk); groups
  with max capture length <= 1 (e.g. a?, [x]?) are safe with the early-accept
- BackrefEngineGapsTest: enable b7_nullableBackrefGroupInOptimizedNfa

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
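
The zero-length early-accept from the first bullet can be sketched in plain Java rather than generated bytecode. This is only an illustration: `backrefMatchesAt` and its parameter layout are hypothetical names, not taken from `NFABytecodeGenerator`, and the sketch assumes the referenced group has participated in the match.

```java
final class BackrefCheckSketch {
    /**
     * Checks a backreference to the captured span [groupStart, groupEnd) at pos.
     * Returns the position after the backreference, or -1 if it does not match.
     * Hypothetical helper; assumes the group is set (groupStart >= 0).
     */
    static int backrefMatchesAt(CharSequence input, int pos, int groupStart, int groupEnd) {
        int groupLen = groupEnd - groupStart;
        // Zero-length capture: the backreference matches the empty string,
        // so accept before any bounds check or region comparison.
        if (groupLen == 0) {
            return pos;
        }
        // Not enough input left to hold a copy of the captured text.
        if (pos + groupLen > input.length()) {
            return -1;
        }
        // Region comparison, character by character.
        for (int i = 0; i < groupLen; i++) {
            if (input.charAt(groupStart + i) != input.charAt(pos + i)) {
                return -1;
            }
        }
        return pos + groupLen;
    }
}
```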
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- RuntimeCompiler: replace CapturePolicy import with ReggieOption/UnsupportedPatternException
- Add cacheKeyFor() helper (flag-aware cache key) and fallbackOrThrow() helper
- Gate all 6 JavaRegexFallbackMatcher construction sites behind ALLOW_JDK_FALLBACK flag
- compileHybrid() receives ReggieOptions to propagate fallback policy
- UnsupportedPatternException propagates through catch(Exception) via explicit re-throw
- 34 test files updated: add allowJdkFallback() for patterns requiring JDK fallback
- New FallbackPolicyTest: throwsByDefault, delegatesWhenFallbackEnabled, nativePatternUnaffected

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…otations

ReggieOption moved from reggie-runtime to reggie-annotations so the
annotation type can reference it without a circular dependency.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
jbachorik and others added 17 commits June 15, 2026 16:41
…suffix char

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…fier

SWAR/first-char optimizer was advancing tryPos past 0 after the start-anchor
check passed, causing matchesAtStart to be invoked at non-zero offsets.
The DFA transition entryGuard (STRING_START) is not checked at runtime, so the
DFA accepted matches at positions where \A cannot hold.

Fix: suppress SWAR and first-char optimizations when requiresStartAnchor or
hasMultilineStart is true — the anchor check already pins the loop to position 0
(or after-newline), making position-skipping via first-char filter incorrect.

Also corrects a stale comment at the anchor optimization site that referred to
an unused hasStringStartAnchor field.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
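
A simplified sketch of the fix described above, assuming a single-character first-char filter and ignoring the multiline (after-newline) case. `AnchoredFindSketch`, `MatchAt`, and `matchesX` are hypothetical stand-ins for the generated find loop and `matchesAtStart`; the real code is emitted as bytecode.

```java
final class AnchoredFindSketch {
    /** Hypothetical stand-in for the generated matchesAtStart(input, pos). */
    interface MatchAt {
        boolean test(CharSequence input, int pos);
    }

    /** Returns the start of the first match, or -1. */
    static int find(CharSequence input, boolean requiresStartAnchor, char firstChar, MatchAt matcher) {
        if (requiresStartAnchor) {
            // \A can only hold at position 0, so try exactly once and never skip ahead.
            return matcher.test(input, 0) ? 0 : -1;
        }
        for (int tryPos = 0; tryPos < input.length(); tryPos++) {
            // First-char filter: safe only because no anchor constrains the start position.
            if (input.charAt(tryPos) != firstChar) {
                continue;
            }
            if (matcher.test(input, tryPos)) {
                return tryPos;
            }
        }
        return -1;
    }

    /** Matcher body for the literal "x" that, like the DFA described above, does not re-check the anchor. */
    static boolean matchesX(CharSequence input, int pos) {
        return pos < input.length() && input.charAt(pos) == 'x';
    }
}
```

With the anchor flag set, `find("ax", true, 'x', AnchoredFindSketch::matchesX)` reports no match; without the guard, the filter would skip to offset 1 and the anchor-blind matcher would accept there.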
…rough

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixed 32 findings across 7 generators. Remaining 18: 3 Task-3
(OPTIMIZED_NFA_WITH_BACKREFS, pending design decision), 15 new/unmasked
patterns for follow-up investigation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
T1.2 skips find() start positions whose char cannot begin a match
(SQL_ANSI no-match ~38x, now faster than JDK); T1.5 makes the
per-position accept check O(1). Zero new fuzz divergences.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
T1.4: find() for anchor/assertion/backref-free patterns uses a lazy DFA
whose step re-injects the start closure each char (implicit .*? prefix),
giving O(n) single-pass unanchored search. QOBF find ~29x->~7.7x faster
than JDK; spans/findMatch stay on the thread sim. Zero new fuzz divergences.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
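
The re-injection idea can be illustrated with a toy lazy DFA over a hand-built, epsilon-free NFA for `ab+c`. This is a sketch under those assumptions, not Reggie's implementation: the class name, the bitmask state sets, and the `HashMap` transition cache are all choices made for the example.

```java
import java.util.HashMap;
import java.util.Map;

final class LazyDfaFindSketch {
    // Toy NFA for "ab+c": 0 -a-> 1, 1 -b-> 2, 2 -b-> 2, 2 -c-> 3 (accept).
    private static final int[][] EDGES = { {0, 'a', 1}, {1, 'b', 2}, {2, 'b', 2}, {2, 'c', 3} };
    private static final long START_SET = 1L << 0;   // start closure (no epsilons here)
    private static final long ACCEPT_MASK = 1L << 3;

    // Lazily built DFA: (NFA state set, char) -> next NFA state set.
    private final Map<Long, Map<Character, Long>> cache = new HashMap<>();

    private long step(long set, char c) {
        return cache.computeIfAbsent(set, k -> new HashMap<>())
                .computeIfAbsent(c, k -> {
                    long next = 0;
                    for (int[] e : EDGES) {
                        if ((set & (1L << e[0])) != 0 && e[1] == c) {
                            next |= 1L << e[2];
                        }
                    }
                    // Re-inject the start set so a new attempt can begin at the next
                    // char; this plays the role of an implicit .*? prefix.
                    return next | START_SET;
                });
    }

    /** Single left-to-right pass; true if the pattern occurs anywhere in input. */
    boolean find(CharSequence input) {
        long set = START_SET;
        for (int i = 0; i < input.length(); i++) {
            set = step(set, input.charAt(i));
            if ((set & ACCEPT_MASK) != 0) {
                return true;
            }
        }
        return false;
    }
}
```

Each input character is consumed once, so the scan is O(n) after the reachable DFA states have been cached; the sketch answers only the boolean question and does not track match spans.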
Baseline for the Phase-2 go/no-go: PikeVM span extraction is 10-68x
slower than JDK on the capture-bearing IAST patterns.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Hot path in PikeVM/closure transition scans (20-29% of capture-path
samples). Branchless bitmap test for ASCII; ranges binary search retained
for non-ASCII. equals/hashCode unchanged (range-based) so the structural
cache is unaffected. Zero new fuzz divergences.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
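
One way such a two-tier membership test could look is sketched below: a 128-bit bitmap answers ASCII lookups with a shift and mask, and a binary search over sorted ranges handles everything else. `AsciiBitmapCharSet` and its fields are hypothetical names for this example; equality would still be defined on the ranges, as the commit message notes.

```java
final class AsciiBitmapCharSet {
    // Sorted, non-overlapping inclusive pairs: lo0, hi0, lo1, hi1, ...
    private final int[] ranges;
    // Bit (c & 63) of asciiBits[c >>> 6] is set iff ASCII char c is in the set.
    private final long[] asciiBits = new long[2];

    AsciiBitmapCharSet(int... ranges) {
        this.ranges = ranges.clone();
        for (int i = 0; i < ranges.length; i += 2) {
            for (int c = ranges[i]; c <= Math.min(ranges[i + 1], 127); c++) {
                asciiBits[c >>> 6] |= 1L << (c & 63);
            }
        }
    }

    boolean contains(char c) {
        if (c < 128) {
            // ASCII fast path: index, shift, mask; no range scan.
            return ((asciiBits[c >>> 6] >>> (c & 63)) & 1L) != 0;
        }
        return containsByBinarySearch(c);
    }

    private boolean containsByBinarySearch(int c) {
        int lo = 0;
        int hi = ranges.length / 2 - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (c < ranges[2 * mid]) {
                hi = mid - 1;
            } else if (c > ranges[2 * mid + 1]) {
                lo = mid + 1;
            } else {
                return true;
            }
        }
        return false;
    }
}
```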
matches() is whole-input (priority-independent), so a strict lazy DFA's
yes/no equals the thread sim. Many-optional-group matches() ~323->38000
ops/ms (now beats re2j and JDK). Zero new fuzz divergences.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extends the boolean DFA fast path to patterns whose only anchors are
leading START/STRING_START, via an initial closure (pos 0, anchor crossed)
vs re-inject/step closure (pos>0, anchor blocked). Eligibility declines
non-leading ^ (reachable after a consume, e.g. inside a loop) where the
pos-0-only model is unsound. URL find() now 13-20x faster than re2j
(was 0.27-0.35x). Zero new fuzz divergences.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
IastTokenizerDrainBenchmark mirrors dd-trace SensitiveTokenizerBenchmark
(512-2048B malformed payloads, full match-drain) vs reggie/re2j/jdk — the
workload that actually decides the re2j replacement (our tiny-input
IastRegexpBenchmark was JDK's best case). Reveals Reggie's O(n^2) drain
(repeated findMatchFrom re-scan) and lazy-quantifier compile failures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- findMatchResultFrom: one continuing left-to-right pass with lowest-priority
  re-seeding (kills the per-start O(n^2) retry); anchor origin pinned to
  absolute 0, fixing ^/\A findAll re-anchoring.
- Fast-reject: when the boolean find DFA (exact, or an over-approximating
  reject DFA that crosses anchors as epsilon) proves no match at/after pos,
  skip the thread sim. Adversarial IAST drains now beat re2j ~10-31x.
- MultiGroupGreedy: ^ anchors to absolute 0, not the scan start.
- RegexFuzzOracle: add findAll group-span differential (first oracle to check
  group spans on the find path); budget 18->78 for the pre-existing find-path
  capture debt it now surfaces (tracked, to ratchet to 0).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
SPECIALIZED_MULTI_GROUP_GREEDY has no inter-group backtracking, so a greedy
capturing group whose charset overlaps what follows (e.g. (\w+)0 on "ab00",
(\d+)5 on "1235") returned NO_MATCH where JDK matches. Decline such patterns
(requiresBacktrackingForGroups) so they fall through to the backtracking-capable
routing (the requiresBacktrackingForGroups -> RECURSIVE_DESCENT guard), which
produces correct spans. GREEDY_BACKTRACK already handles the (.*)literal shape.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
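
The reference behavior for the two examples above can be checked against `java.util.regex` directly. The class and method names below are illustrative only; the snippet exercises the JDK, not Reggie.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GreedyGroupBacktrackDemo {
    /** Returns group 1 of a whole-input JDK match, or null if the input does not match. */
    static String groupOne(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The greedy group first takes the whole input, then gives back one char
        // so that the trailing literal can match.
        System.out.println(groupOne("(\\w+)0", "ab00")); // ab0
        System.out.println(groupOne("(\\d+)5", "1235")); // 123
    }
}
```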
A nullable capturing group in an alternation branch (e.g. 1|()b, ()b|x) is
committed as a zero-width span by the TDFA/group-action capture path even when
the priority-winning branch bypasses it (binds g1=[0,0); JDK leaves it -1).
Detect it (hasNullableCapturingGroupInAlternationBranch) and route to
PIKEVM_CAPTURE, which gives correct, O(n)/ReDoS-safe spans. A non-nullable group
like (a) in (a)|b never leaks and stays on the fast DFA. Fuzz budget 78->69.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
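
The JDK behavior that the routing change aims to reproduce can be observed with a small `java.util.regex` probe. The helper name is made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NullableGroupSpanDemo {
    /** Returns {start(1), end(1)} of the first JDK match, or null if there is no match. */
    static int[] groupOneSpan(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return new int[] { m.start(1), m.end(1) };
    }

    public static void main(String[] args) {
        // Branch "1" wins, so group 1 never participates and stays unset.
        int[] bypassed = groupOneSpan("1|()b", "1");
        System.out.println(bypassed[0] + "," + bypassed[1]); // -1,-1
        // Branch "()b" wins, so group 1 is a genuine zero-width capture.
        int[] taken = groupOneSpan("1|()b", "b");
        System.out.println(taken[0] + "," + taken[1]);       // 0,0
    }
}
```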
…VM (Class E)

Two capturing alternations in sequence where the first has shared-prefix branches
(e.g. (a|ab)(c|bcd)) leave the first group's span ambiguous until the second
resolves it — which the single-register TDFA mis-tracks ((a|ab)(c|bcd) on "abcd"
gave g1=[0,2) vs JDK [0,1)). Detect the pair (hasInteractingCapturingAlternations)
and route to PIKEVM_CAPTURE for correct, O(n)/ReDoS-safe spans. A lone capturing
alternation or one followed by a fixed element (e.g. (a|ab)\d, (a|b)(c|d)) stays
on the fast DFA.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
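
For reference, the expected spans for the interacting-alternation example can be read off `java.util.regex`. The class and method names are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InteractingAlternationDemo {
    /** Returns {start(1), end(1), start(2), end(2)} of the first JDK match, or null. */
    static int[] groupSpans(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return new int[] { m.start(1), m.end(1), m.start(2), m.end(2) };
    }

    public static void main(String[] args) {
        // The first alternation tries "a" before "ab"; "bcd" then completes the
        // match, so group 1 is [0,1) and group 2 is [1,4).
        int[] s = groupSpans("(a|ab)(c|bcd)", "abcd");
        System.out.println(s[0] + "," + s[1] + " " + s[2] + "," + s[3]); // 0,1 1,4
    }
}
```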
@jbachorik jbachorik added the AI Generated or assisted by AI label Jun 18, 2026
jbachorik and others added 2 commits June 18, 2026 22:01
…esign docs

Adds the 2026-06-11..18 achievement report (capture-correctness + O(n) drain
work), the PR decomposition plan for that work (PRs #81-#85), the Reggie 1.0.0
public-release readiness assessment, and the design/plan docs from the
design->adversarial-review->plan loops.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The lookbehind+lookahead sandwich (?<=\[)[^\]]+(?=\]) compiles native and
silently returns no-match (verified HEAD + origin/main; pre-existing). Downgrade
"boolean correctness" to a gap with a P1 punch-list item to fix or decline it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
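
A regression check for this gap could start from the JDK's answer for the same pattern. The sketch below uses only `java.util.regex`; the class name and the sample inputs are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookaroundSandwichDemo {
    // The lookbehind+lookahead sandwich from the commit message.
    static final Pattern BRACKETED = Pattern.compile("(?<=\\[)[^\\]]+(?=\\])");

    /** Returns the first bracketed body found by the JDK, or null if there is none. */
    static String firstBracketed(String input) {
        Matcher m = BRACKETED.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstBracketed("x[abc]y")); // abc
    }
}
```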
@jbachorik jbachorik force-pushed the pr/docs-2026-06-18-capture-perf branch from ae2a58c to a3da96a Compare June 18, 2026 20:02
@jbachorik jbachorik changed the base branch from main to pr/5-on-drain-capture-routing June 18, 2026 20:02
@codecov-commenter


Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (pr/5-on-drain-capture-routing@f06e984).

Additional details and impacted files
@@                       Coverage Diff                       @@
##             pr/5-on-drain-capture-routing     #86   +/-   ##
===============================================================
  Coverage                                 ?   82.2%           
  Complexity                               ?       1           
===============================================================
  Files                                    ?     127           
  Lines                                    ?   36872           
  Branches                                 ?    4716           
===============================================================
  Hits                                     ?   30327           
  Misses                                   ?    5050           
  Partials                                 ?    1495           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jbachorik jbachorik marked this pull request as ready for review June 18, 2026 20:11

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a3da96ab9c


@@ -0,0 +1,91 @@
# PR decomposition plan — `feat/pikevm-capture-cost` → 5 stacked PRs

Status: PLAN (awaiting approval before any branch/PR is created). Base: `origin/main` = `a924e431`


P2: Keep temporary plans out of committed docs

This file is explicitly a `Status: PLAN` PR/branch logistics document, but AGENTS.md (lines 269-286) requires all temporary task-oriented documents (implementation plans, performance notes, temporary research notes, and action checklists) to live under the gitignored `doc/temp/` and never be committed unless promoted to permanent reference material. Committing this plan, along with the other dated plan/report files added under `doc/` and `docs/superpowers/plans/`, makes draft workflow artifacts part of the permanent documentation set. Move them to `doc/temp/`, or extract only the durable reference content.


Base automatically changed from pr/5-on-drain-capture-routing to pr/4-perf-boolean-dfa-bitmap June 19, 2026 17:04
Base automatically changed from pr/4-perf-boolean-dfa-bitmap to pr/3-capture-correctness-wave June 19, 2026 17:35
Base automatically changed from pr/3-capture-correctness-wave to pr/2-anchor-alt-routing-backref June 19, 2026 20:19
Base automatically changed from pr/2-anchor-alt-routing-backref to main June 19, 2026 23:06