claude-code-chat-browser: schema drift detection for upstream JSONL c…#108
Conversation
…hanges (#5) Fingerprint known Claude Code JSONL field paths against a committed schema_baseline.json, warn on drift during parsing, expose GET /api/schema-report, and surface a dismissible amber banner on the session list page. Warnings only - parsing is never blocked.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughAdds schema drift tracking from JSONL parsing through an API endpoint and session UI banner, backed by a committed baseline, drift tests, and updated benchmarks. ChangesSchema Drift Detection
Estimated code review effort: 4 (Complex) | ~45 minutes Possibly related PRs
Suggested reviewers: 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (3)
utils/schema_drift.py (2)
90-121: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick winBaseline JSON is re-read and re-parsed from disk on every
parse_session()call.
load_baseline_fields()does file I/O +json.loadsand is invoked unconditionally bydiff_against_baseline(), which per the parser integration snippet runs once per session file. On a session-list page rendering many sessions, this repeats disk I/O + parsing for a file that never changes at runtime. Consider loading/parsing the baseline once (module import or lazily-cached, e.g. viafunctools.lru_cache) instead of on every parse.♻️ Example: cache the parsed baseline
+from functools import lru_cache + + +@lru_cache(maxsize=1) +def load_baseline_fields() -> dict[str, SchemaFieldSpec]: """Load ``schema_baseline.json`` field specs keyed by dotted path.""" raw = json.loads(BASELINE_PATH.read_text(encoding="utf-8"))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@utils/schema_drift.py` around lines 90 - 121, The baseline JSON is being re-read and re-parsed on every call through diff_against_baseline() and load_baseline_fields(), which adds repeated disk I/O for data that does not change at runtime. Cache the parsed baseline once instead of loading it per session, either by memoizing load_baseline_fields() with a lazy cache such as functools.lru_cache or by initializing the baseline at module import, and keep diff_against_baseline() using the cached result.
71-87: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value
collect_field_paths_with_typesand baselineexpected_typeare unused in the drift diff.
diff_against_baselineonly compares field paths, and nothing in this module reads observed types. If type-drift checks aren’t coming next, remove the helper andexpected_typeplumbing for now to keep this focused.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@utils/schema_drift.py` around lines 71 - 87, The observed type-tracking code in collect_field_paths_with_types is not used by diff_against_baseline, so the drift logic should be simplified to only compare field paths. Remove the unused type-collection helper and any expected_type plumbing from the baseline handling unless you are adding type-drift checks in this same change, and keep the remaining symbols centered around collect_field_paths, baseline, and diff_against_baseline.static/js/sessions.js (1)
85-88: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick winParallelize independent fetches to reduce latency.
fetchSchemaDriftBannerHtml()and the sessions fetch are independent; awaiting them sequentially adds the schema-report round trip to the workspace load time.⚡ Suggested refactor
- const schemaBannerHtml = await fetchSchemaDriftBannerHtml(); - - const res = await fetch(`/api/projects/${encodeURIComponent(projectName)}/sessions`); - state.cachedSessions = await res.json(); + const [schemaBannerHtml, res] = await Promise.all([ + fetchSchemaDriftBannerHtml(), + fetch(`/api/projects/${encodeURIComponent(projectName)}/sessions`), + ]); + state.cachedSessions = await res.json();🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@static/js/sessions.js` around lines 85 - 88, The workspace load is waiting on two independent requests in sequence, which adds unnecessary latency. Update the sessions-loading flow in sessions.js so fetchSchemaDriftBannerHtml() and the /api/projects/${encodeURIComponent(projectName)}/sessions fetch start in parallel, then await both results together before using schemaBannerHtml and state.cachedSessions. Keep the change localized to the sessions-loading logic and preserve the existing assignment behavior once both promises resolve.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@utils/jsonl_parser.py`:
- Around line 236-239: `parse_session` currently calls
`record_parse_drift(observed_field_paths)` unguarded, so drift tracking can
abort parsing if `diff_against_baseline()` or `load_baseline_fields()` fails on
a missing or malformed `schema_baseline.json`. Update `record_parse_drift` (or
its call site in `parse_session`) to catch and suppress non-fatal drift-tracking
errors, while still allowing `validate_session_dict(...)` and the rest of
parsing to proceed normally; keep the fix localized around `record_parse_drift`,
`diff_against_baseline`, and `load_baseline_fields`.
---
Nitpick comments:
In `@static/js/sessions.js`:
- Around line 85-88: The workspace load is waiting on two independent requests
in sequence, which adds unnecessary latency. Update the sessions-loading flow in
sessions.js so fetchSchemaDriftBannerHtml() and the
/api/projects/${encodeURIComponent(projectName)}/sessions fetch start in
parallel, then await both results together before using schemaBannerHtml and
state.cachedSessions. Keep the change localized to the sessions-loading logic
and preserve the existing assignment behavior once both promises resolve.
In `@utils/schema_drift.py`:
- Around line 90-121: The baseline JSON is being re-read and re-parsed on every
call through diff_against_baseline() and load_baseline_fields(), which adds
repeated disk I/O for data that does not change at runtime. Cache the parsed
baseline once instead of loading it per session, either by memoizing
load_baseline_fields() with a lazy cache such as functools.lru_cache or by
initializing the baseline at module import, and keep diff_against_baseline()
using the cached result.
- Around line 71-87: The observed type-tracking code in
collect_field_paths_with_types is not used by diff_against_baseline, so the
drift logic should be simplified to only compare field paths. Remove the unused
type-collection helper and any expected_type plumbing from the baseline handling
unless you are adding type-drift checks in this same change, and keep the
remaining symbols centered around collect_field_paths, baseline, and
diff_against_baseline.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 118cf0f8-8beb-4e8a-ab11-f5c572bfcdac
📒 Files selected for processing (9)
api/schema_report.pyapp.pyschema_baseline.jsonstatic/css/style.cssstatic/js/sessions.jstests/fixtures/jsonl/unknown_field.jsonltests/test_schema_drift.pyutils/jsonl_parser.pyutils/schema_drift.py
…108) Cache schema_baseline.json with lru_cache and make record_parse_drift non-fatal on baseline I/O or parse errors so parsing never aborts. Fetch /api/schema-report after sessions load so the banner reflects drift from the current parse run. Add vitest coverage for banner rendering and fetch ordering; extend pytest for malformed baseline. Raise benchmark baselines for per-entry field-path fingerprinting.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
utils/jsonl_parser.py (1)
178-195: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winPreserve the documented “0 means uncapped” behavior.
schema_drift_sample_limit()documents0as disabling the sampling cap, but this loop only samples whileschema_samples_remaining > 0, so settingCLAUDE_CODE_CHAT_BROWSER_SCHEMA_DRIFT_SAMPLE=0disables drift collection entirely.Proposed fix
- schema_samples_remaining = schema_drift_sample_limit() if is_schema_drift_enabled() else 0 + schema_drift_enabled = is_schema_drift_enabled() + sample_limit = schema_drift_sample_limit() if schema_drift_enabled else 0 + schema_samples_remaining: int | None = ( + None if schema_drift_enabled and sample_limit == 0 else sample_limit + ) @@ - if schema_samples_remaining > 0: + if schema_samples_remaining is None or schema_samples_remaining > 0: observed_field_paths |= collect_field_paths(entry) - schema_samples_remaining -= 1 + if schema_samples_remaining is not None: + schema_samples_remaining -= 1 @@ - if is_schema_drift_enabled() and observed_field_paths: + if schema_drift_enabled and observed_field_paths: record_parse_drift(observed_field_paths)Also applies to: 241-242
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@utils/jsonl_parser.py` around lines 178 - 195, The schema drift sampling logic in the JSONL parser treats `schema_drift_sample_limit()` returning 0 as “don’t sample at all” instead of the documented uncapped mode. Update the sampling condition in the loop that uses `schema_samples_remaining` so that 0 is handled as unlimited collection while positive values still decrement, and apply the same fix to the other schema-drift sampling loop referenced by the same behavior.
♻️ Duplicate comments (1)
utils/schema_drift.py (1)
77-80: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winValidate the baseline root before calling
.get().If
schema_baseline.jsonis valid JSON but not an object,raw.get(...)raisesAttributeError, which is not caught byrecord_parse_drift()and can still abort parsing.Proposed fix
- raw = json.loads(BASELINE_PATH.read_text(encoding="utf-8")) + raw = json.loads(BASELINE_PATH.read_text(encoding="utf-8")) + if not isinstance(raw, dict): + raise ValueError("schema_baseline.json: root must be an object") fields = raw.get("fields", {})🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@utils/schema_drift.py` around lines 77 - 80, The baseline parsing in record_parse_drift currently assumes json.loads(BASELINE_PATH.read_text(...)) returns an object and calls .get() immediately, which can raise AttributeError for valid non-object JSON. Add a root-type check right after loading the baseline in record_parse_drift (before accessing fields) and raise a ValueError with the existing schema_baseline.json validation pattern if the root is not a dict; then keep the existing fields validation unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@utils/schema_drift.py`:
- Around line 125-147: The schema drift merge in _last_report is not atomic
because prior_new/merged_new are computed in one critical section and the state
is written in a later one, allowing concurrent calls to drop newly discovered
fields. Update the logic in schema_drift.py around _lock and _last_report so the
read-compute-write of new_fields happens in a single locked section. Keep the
warning logging based on the merged result from that same protected update, and
ensure _last_report is assigned once with the fully merged state.
---
Outside diff comments:
In `@utils/jsonl_parser.py`:
- Around line 178-195: The schema drift sampling logic in the JSONL parser
treats `schema_drift_sample_limit()` returning 0 as “don’t sample at all”
instead of the documented uncapped mode. Update the sampling condition in the
loop that uses `schema_samples_remaining` so that 0 is handled as unlimited
collection while positive values still decrement, and apply the same fix to the
other schema-drift sampling loop referenced by the same behavior.
---
Duplicate comments:
In `@utils/schema_drift.py`:
- Around line 77-80: The baseline parsing in record_parse_drift currently
assumes json.loads(BASELINE_PATH.read_text(...)) returns an object and calls
.get() immediately, which can raise AttributeError for valid non-object JSON.
Add a root-type check right after loading the baseline in record_parse_drift
(before accessing fields) and raise a ValueError with the existing
schema_baseline.json validation pattern if the root is not a dict; then keep the
existing fields validation unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f74ac9b5-8fe0-4ece-91a5-7955c7a52bd6
📒 Files selected for processing (6)
schema_baseline.jsonstatic/js/sessions.jsstatic/js/sessions.test.jstests/test_schema_drift.pyutils/jsonl_parser.pyutils/schema_drift.py
🚧 Files skipped from review as they are similar to previous changes (3)
- static/js/sessions.test.js
- schema_baseline.json
- static/js/sessions.js
There was a problem hiding this comment.
🧹 Nitpick comments (1)
utils/schema_drift.py (1)
44-49: 🚀 Performance & Scalability | 🔵 Trivial | 💤 Low valueNegative sample values collapse to "no cap".
max(0, int(raw))maps any negative value to0, which downstream (schema_drift_sample_limit()==0→Noneinparse_session) means unlimited fingerprinting of every record. A user setting-1intending to disable/limit sampling would unexpectedly scan all records. Consider treating negatives as invalid and falling back to the default3instead.♻️ Optional adjustment
raw = os.environ.get("CLAUDE_CODE_CHAT_BROWSER_SCHEMA_DRIFT_SAMPLE", "3").strip() try: - return max(0, int(raw)) + value = int(raw) except ValueError: return 3 + return value if value >= 0 else 3🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@utils/schema_drift.py` around lines 44 - 49, The schema drift sample limit parsing in schema_drift_sample_limit() treats negative values as 0, which later becomes unlimited sampling in parse_session. Update the environment parsing so only non-negative integers are accepted, and make any negative or otherwise invalid input fall back to the default of 3 instead of collapsing to no cap.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@utils/schema_drift.py`:
- Around line 44-49: The schema drift sample limit parsing in
schema_drift_sample_limit() treats negative values as 0, which later becomes
unlimited sampling in parse_session. Update the environment parsing so only
non-negative integers are accepted, and make any negative or otherwise invalid
input fall back to the default of 3 instead of collapsing to no cap.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 126a8690-fcce-46b3-bff9-48ecbea3acd0
📒 Files selected for processing (3)
tests/test_schema_drift.pyutils/jsonl_parser.pyutils/schema_drift.py
🚧 Files skipped from review as they are similar to previous changes (1)
- utils/jsonl_parser.py
Closes #103
Summary
Detect upstream Claude Code JSONL schema drift by fingerprinting field paths during
parse_session()and diffing against a committedschema_baseline.json. New or missing required paths emit warnings on theclaude_code_chat_browser.schema_driftlogger; the web UI shows a dismissible amber banner on the session list page;GET /api/schema-reportreturns{known_fields, new_fields, missing_fields, has_drift}.Closes sprint item 5 (July Week 1 Thursday, 5 pt).
Changes
utils/schema_drift.py— baseline loader, recursive path collection, drift diff, process-wide reportutils/jsonl_parser.py—_collect_field_paths()wrapper; drift check at parse completionschema_baseline.json— 98 known(json_path, expected_type)pairs; onlytypeis requiredapi/schema_report.py—GET /api/schema-reportstatic/js/sessions.js+static/css/style.css— dismissible amber warning banner (sessionStorage fingerprint)tests/fixtures/jsonl/unknown_field.jsonl— syntheticFutureToolXYZ/ unknowntoolkey fixturetests/test_schema_drift.py— 10 tests covering warnings, API, false-positive guard, merge behaviorOut of scope
Test plan
pytest tests/ -k schema -qpytest -q(443 passed)mypy -p api -p utils -p modelsruff check .tests/fixtures/jsonl/unknown_field.jsonl— confirm warning logged and amber banner in UISummary by CodeRabbit
/api/schema-reportendpoint so the UI can retrieve current drift status.