claude-code-chat-browser: schema drift detection for upstream JSONL c… by clean6378-max-it · Pull Request #108 · cppalliance/claude-code-chat-browser

clean6378-max-it · 2026-07-02T15:58:24Z

Closes #103

Summary

Detect upstream Claude Code JSONL schema drift by fingerprinting field paths during parse_session() and diffing against a committed schema_baseline.json. New or missing required paths emit warnings on the claude_code_chat_browser.schema_drift logger; the web UI shows a dismissible amber banner on the session list page; GET /api/schema-report returns {known_fields, new_fields, missing_fields, has_drift}.

Closes sprint item 5 (July Week 1 Thursday, 5 pt).

Changes

utils/schema_drift.py — baseline loader, recursive path collection, drift diff, process-wide report
utils/jsonl_parser.py — _collect_field_paths() wrapper; drift check at parse completion
schema_baseline.json — 98 known (json_path, expected_type) pairs; only type is required
api/schema_report.py — GET /api/schema-report
static/js/sessions.js + static/css/style.css — dismissible amber warning banner (sessionStorage fingerprint)
tests/fixtures/jsonl/unknown_field.jsonl — synthetic FutureToolXYZ / unknown tool key fixture
tests/test_schema_drift.py — 10 tests covering warnings, API, false-positive guard, merge behavior

Out of scope

Auto-updating baseline on every parse (updates are deliberate)
Blocking parse on drift
CLI command (API endpoint chosen instead)

Test plan

pytest tests/ -k schema -q
pytest -q (443 passed)
mypy -p api -p utils -p models
ruff check .
Manual: parse tests/fixtures/jsonl/unknown_field.jsonl — confirm warning logged and amber banner in UI

Summary by CodeRabbit

New Features
- Added a schema drift warning banner in the workspace (lists up to 5 newly detected fields and up to 5 missing required fields).
- Added a /api/schema-report endpoint so the UI can retrieve current drift status.
Bug Fixes
- Banner dismissal is persisted per drift fingerprint to avoid repeated alerts.
- Drift detection is more robust to baseline/reporting issues and won’t interrupt session parsing.
Style
- Updated warning styling for light/dark themes, including alert layout and a dismiss button.

…hanges (#5) Fingerprint known Claude Code JSONL field paths against a committed schema_baseline.json, warn on drift during parsing, expose GET /api/schema-report, and surface a dismissible amber banner on the session list page. Warnings only - parsing is never blocked.

coderabbitai · 2026-07-02T15:58:47Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2f0ac298-06be-4a7b-9301-3ed267aff3b7

📥 Commits

Reviewing files that changed from the base of the PR and between f8bfa1c and ae2b493.

📒 Files selected for processing (1)

schema_baseline.json

📝 Walkthrough

Walkthrough

Adds schema drift tracking from JSONL parsing through an API endpoint and session UI banner, backed by a committed baseline, drift tests, and updated benchmarks.

Changes

Schema Drift Detection

Layer / File(s)	Summary
Baseline field catalog `schema_baseline.json`	Defines the committed field-path catalog, including required flags for known JSONL structures.
Schema drift detection module `utils/schema_drift.py`	Collects dotted field paths, loads the baseline, computes known/new/missing sets, accumulates reports, and supports reset/cache clearing.
Parser and API wiring `utils/jsonl_parser.py`, `api/schema_report.py`, `app.py`	Records observed field paths during session parsing, exposes `/api/schema-report`, and registers the new blueprint in the app.
UI drift banner `static/js/sessions.js`, `static/css/style.css`	Fetches the schema report, renders a dismissible banner when drift exists, persists dismissal state, and adds warning styles.
Tests and benchmarks `tests/test_schema_drift.py`, `static/js/sessions.test.js`, `tests/fixtures/jsonl/unknown_field.jsonl`, `benchmarks/baselines.json`	Adds fixture coverage, schema drift tests, UI call-order/banner tests, and updated benchmark baselines.

Estimated code review effort: 4 (Complex) | ~45 minutes

Possibly related PRs

cppalliance/claude-code-chat-browser#38: Both PRs touch utils/jsonl_parser.py around parse_session and parser behavior.

Suggested reviewers: timon0305, wpak-ai

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Linked Issues check	⚠️ Warning	The PR covers drift fingerprinting, warnings, UI banner, API, tests, and fixture, but the baseline omits expected_type pairs requested by `#103`.	Restore expected_type metadata in schema_baseline.json, or update the issue if path-only matching is the intended contract.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: adding schema drift detection for upstream JSONL parsing.
Out of Scope Changes check	✅ Passed	The CSS, JS, API, utils, fixture, tests, and benchmark updates all directly support the schema drift feature; no unrelated changes stand out.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/schema-drift-detection

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

utils/schema_drift.py (2)
90-121: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Baseline JSON is re-read and re-parsed from disk on every parse_session() call.

load_baseline_fields() does file I/O + json.loads and is invoked unconditionally by diff_against_baseline(), which per the parser integration snippet runs once per session file. On a session-list page rendering many sessions, this repeats disk I/O + parsing for a file that never changes at runtime. Consider loading/parsing the baseline once (module import or lazily-cached, e.g. via functools.lru_cache) instead of on every parse.
♻️ Example: cache the parsed baseline
+from functools import lru_cache
+
+
+@lru_cache(maxsize=1)
+def load_baseline_fields() -> dict[str, SchemaFieldSpec]:
     """Load ``schema_baseline.json`` field specs keyed by dotted path."""
     raw = json.loads(BASELINE_PATH.read_text(encoding="utf-8"))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@utils/schema_drift.py` around lines 90 - 121, The baseline JSON is being
re-read and re-parsed on every call through diff_against_baseline() and
load_baseline_fields(), which adds repeated disk I/O for data that does not
change at runtime. Cache the parsed baseline once instead of loading it per
session, either by memoizing load_baseline_fields() with a lazy cache such as
functools.lru_cache or by initializing the baseline at module import, and keep
diff_against_baseline() using the cached result.
71-87: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

collect_field_paths_with_types and baseline expected_type are unused in the drift diff.

diff_against_baseline only compares field paths, and nothing in this module reads observed types. If type-drift checks aren’t coming next, remove the helper and expected_type plumbing for now to keep this focused.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@utils/schema_drift.py` around lines 71 - 87, The observed type-tracking code
in collect_field_paths_with_types is not used by diff_against_baseline, so the
drift logic should be simplified to only compare field paths. Remove the unused
type-collection helper and any expected_type plumbing from the baseline handling
unless you are adding type-drift checks in this same change, and keep the
remaining symbols centered around collect_field_paths, baseline, and
diff_against_baseline.
static/js/sessions.js (1)
85-88: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Parallelize independent fetches to reduce latency.

fetchSchemaDriftBannerHtml() and the sessions fetch are independent; awaiting them sequentially adds the schema-report round trip to the workspace load time.
⚡ Suggested refactor
-        const schemaBannerHtml = await fetchSchemaDriftBannerHtml();
-
-        const res = await fetch(`/api/projects/${encodeURIComponent(projectName)}/sessions`);
-        state.cachedSessions = await res.json();
+        const [schemaBannerHtml, res] = await Promise.all([
+            fetchSchemaDriftBannerHtml(),
+            fetch(`/api/projects/${encodeURIComponent(projectName)}/sessions`),
+        ]);
+        state.cachedSessions = await res.json();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@static/js/sessions.js` around lines 85 - 88, The workspace load is waiting on
two independent requests in sequence, which adds unnecessary latency. Update the
sessions-loading flow in sessions.js so fetchSchemaDriftBannerHtml() and the
/api/projects/${encodeURIComponent(projectName)}/sessions fetch start in
parallel, then await both results together before using schemaBannerHtml and
state.cachedSessions. Keep the change localized to the sessions-loading logic
and preserve the existing assignment behavior once both promises resolve.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@utils/jsonl_parser.py`:
- Around line 236-239: `parse_session` currently calls
`record_parse_drift(observed_field_paths)` unguarded, so drift tracking can
abort parsing if `diff_against_baseline()` or `load_baseline_fields()` fails on
a missing or malformed `schema_baseline.json`. Update `record_parse_drift` (or
its call site in `parse_session`) to catch and suppress non-fatal drift-tracking
errors, while still allowing `validate_session_dict(...)` and the rest of
parsing to proceed normally; keep the fix localized around `record_parse_drift`,
`diff_against_baseline`, and `load_baseline_fields`.

---

Nitpick comments:
In `@static/js/sessions.js`:
- Around line 85-88: The workspace load is waiting on two independent requests
in sequence, which adds unnecessary latency. Update the sessions-loading flow in
sessions.js so fetchSchemaDriftBannerHtml() and the
/api/projects/${encodeURIComponent(projectName)}/sessions fetch start in
parallel, then await both results together before using schemaBannerHtml and
state.cachedSessions. Keep the change localized to the sessions-loading logic
and preserve the existing assignment behavior once both promises resolve.

In `@utils/schema_drift.py`:
- Around line 90-121: The baseline JSON is being re-read and re-parsed on every
call through diff_against_baseline() and load_baseline_fields(), which adds
repeated disk I/O for data that does not change at runtime. Cache the parsed
baseline once instead of loading it per session, either by memoizing
load_baseline_fields() with a lazy cache such as functools.lru_cache or by
initializing the baseline at module import, and keep diff_against_baseline()
using the cached result.
- Around line 71-87: The observed type-tracking code in
collect_field_paths_with_types is not used by diff_against_baseline, so the
drift logic should be simplified to only compare field paths. Remove the unused
type-collection helper and any expected_type plumbing from the baseline handling
unless you are adding type-drift checks in this same change, and keep the
remaining symbols centered around collect_field_paths, baseline, and
diff_against_baseline.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 118cf0f8-8beb-4e8a-ab11-f5c572bfcdac

📥 Commits

Reviewing files that changed from the base of the PR and between 4345d69 and 8a7f170.

📒 Files selected for processing (9)

api/schema_report.py
app.py
schema_baseline.json
static/css/style.css
static/js/sessions.js
tests/fixtures/jsonl/unknown_field.jsonl
tests/test_schema_drift.py
utils/jsonl_parser.py
utils/schema_drift.py

…108) Cache schema_baseline.json with lru_cache and make record_parse_drift non-fatal on baseline I/O or parse errors so parsing never aborts. Fetch /api/schema-report after sessions load so the banner reflects drift from the current parse run. Add vitest coverage for banner rendering and fetch ordering; extend pytest for malformed baseline. Raise benchmark baselines for per-entry field-path fingerprinting.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

utils/jsonl_parser.py (1)

178-195: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Preserve the documented “0 means uncapped” behavior.

schema_drift_sample_limit() documents 0 as disabling the sampling cap, but this loop only samples while schema_samples_remaining > 0, so setting CLAUDE_CODE_CHAT_BROWSER_SCHEMA_DRIFT_SAMPLE=0 disables drift collection entirely.

Proposed fix

-    schema_samples_remaining = schema_drift_sample_limit() if is_schema_drift_enabled() else 0
+    schema_drift_enabled = is_schema_drift_enabled()
+    sample_limit = schema_drift_sample_limit() if schema_drift_enabled else 0
+    schema_samples_remaining: int | None = (
+        None if schema_drift_enabled and sample_limit == 0 else sample_limit
+    )
@@
-            if schema_samples_remaining > 0:
+            if schema_samples_remaining is None or schema_samples_remaining > 0:
                 observed_field_paths |= collect_field_paths(entry)
-                schema_samples_remaining -= 1
+                if schema_samples_remaining is not None:
+                    schema_samples_remaining -= 1
@@
-    if is_schema_drift_enabled() and observed_field_paths:
+    if schema_drift_enabled and observed_field_paths:
         record_parse_drift(observed_field_paths)

Also applies to: 241-242

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@utils/jsonl_parser.py` around lines 178 - 195, The schema drift sampling
logic in the JSONL parser treats `schema_drift_sample_limit()` returning 0 as
“don’t sample at all” instead of the documented uncapped mode. Update the
sampling condition in the loop that uses `schema_samples_remaining` so that 0 is
handled as unlimited collection while positive values still decrement, and apply
the same fix to the other schema-drift sampling loop referenced by the same
behavior.

♻️ Duplicate comments (1)

utils/schema_drift.py (1)

77-80: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Validate the baseline root before calling .get().

If schema_baseline.json is valid JSON but not an object, raw.get(...) raises AttributeError, which is not caught by record_parse_drift() and can still abort parsing.

Proposed fix

-        raw = json.loads(BASELINE_PATH.read_text(encoding="utf-8"))
+        raw = json.loads(BASELINE_PATH.read_text(encoding="utf-8"))
+        if not isinstance(raw, dict):
+            raise ValueError("schema_baseline.json: root must be an object")
         fields = raw.get("fields", {})

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@utils/schema_drift.py` around lines 77 - 80, The baseline parsing in
record_parse_drift currently assumes json.loads(BASELINE_PATH.read_text(...))
returns an object and calls .get() immediately, which can raise AttributeError
for valid non-object JSON. Add a root-type check right after loading the
baseline in record_parse_drift (before accessing fields) and raise a ValueError
with the existing schema_baseline.json validation pattern if the root is not a
dict; then keep the existing fields validation unchanged.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@utils/schema_drift.py`:
- Around line 125-147: The schema drift merge in _last_report is not atomic
because prior_new/merged_new are computed in one critical section and the state
is written in a later one, allowing concurrent calls to drop newly discovered
fields. Update the logic in schema_drift.py around _lock and _last_report so the
read-compute-write of new_fields happens in a single locked section. Keep the
warning logging based on the merged result from that same protected update, and
ensure _last_report is assigned once with the fully merged state.

---

Outside diff comments:
In `@utils/jsonl_parser.py`:
- Around line 178-195: The schema drift sampling logic in the JSONL parser
treats `schema_drift_sample_limit()` returning 0 as “don’t sample at all”
instead of the documented uncapped mode. Update the sampling condition in the
loop that uses `schema_samples_remaining` so that 0 is handled as unlimited
collection while positive values still decrement, and apply the same fix to the
other schema-drift sampling loop referenced by the same behavior.

---

Duplicate comments:
In `@utils/schema_drift.py`:
- Around line 77-80: The baseline parsing in record_parse_drift currently
assumes json.loads(BASELINE_PATH.read_text(...)) returns an object and calls
.get() immediately, which can raise AttributeError for valid non-object JSON.
Add a root-type check right after loading the baseline in record_parse_drift
(before accessing fields) and raise a ValueError with the existing
schema_baseline.json validation pattern if the root is not a dict; then keep the
existing fields validation unchanged.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f74ac9b5-8fe0-4ece-91a5-7955c7a52bd6

📥 Commits

Reviewing files that changed from the base of the PR and between 252429c and aa6b31f.

📒 Files selected for processing (6)

schema_baseline.json
static/js/sessions.js
static/js/sessions.test.js
tests/test_schema_drift.py
utils/jsonl_parser.py
utils/schema_drift.py

🚧 Files skipped from review as they are similar to previous changes (3)

static/js/sessions.test.js
schema_baseline.json
static/js/sessions.js

…e root check (#108)

coderabbitai

🧹 Nitpick comments (1)

utils/schema_drift.py (1)
44-49: 🚀 Performance & Scalability | 🔵 Trivial | 💤 Low value

Negative sample values collapse to "no cap".

max(0, int(raw)) maps any negative value to 0, which downstream (schema_drift_sample_limit()==0 → None in parse_session) means unlimited fingerprinting of every record. A user setting -1 intending to disable/limit sampling would unexpectedly scan all records. Consider treating negatives as invalid and falling back to the default 3 instead.
♻️ Optional adjustment
     raw = os.environ.get("CLAUDE_CODE_CHAT_BROWSER_SCHEMA_DRIFT_SAMPLE", "3").strip()
     try:
-        return max(0, int(raw))
+        value = int(raw)
     except ValueError:
         return 3
+    return value if value >= 0 else 3
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@utils/schema_drift.py` around lines 44 - 49, The schema drift sample limit
parsing in schema_drift_sample_limit() treats negative values as 0, which later
becomes unlimited sampling in parse_session. Update the environment parsing so
only non-negative integers are accepted, and make any negative or otherwise
invalid input fall back to the default of 3 instead of collapsing to no cap.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@utils/schema_drift.py`:
- Around line 44-49: The schema drift sample limit parsing in
schema_drift_sample_limit() treats negative values as 0, which later becomes
unlimited sampling in parse_session. Update the environment parsing so only
non-negative integers are accepted, and make any negative or otherwise invalid
input fall back to the default of 3 instead of collapsing to no cap.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 126a8690-fcce-46b3-bff9-48ecbea3acd0

📥 Commits

Reviewing files that changed from the base of the PR and between aa6b31f and f8bfa1c.

📒 Files selected for processing (3)

tests/test_schema_drift.py
utils/jsonl_parser.py
utils/schema_drift.py

🚧 Files skipped from review as they are similar to previous changes (1)

utils/jsonl_parser.py

coderabbitai Bot reviewed Jul 2, 2026

View reviewed changes

Comment thread utils/jsonl_parser.py

clean6378-max-it requested a review from timon0305 July 2, 2026 18:15

timon0305 requested changes Jul 2, 2026

View reviewed changes

clean6378-max-it self-assigned this Jul 2, 2026

Address timon's feedback

aa6b31f

coderabbitai Bot reviewed Jul 2, 2026

View reviewed changes

Comment thread utils/schema_drift.py Outdated

fix(schema-drift): atomic drift merge, unlimited sample mode, baselin…

f8bfa1c

…e root check (#108)

coderabbitai Bot reviewed Jul 2, 2026

View reviewed changes

clean6378-max-it requested a review from timon0305 July 2, 2026 19:59

timon0305 approved these changes Jul 2, 2026

View reviewed changes

Comment thread utils/schema_drift.py

fix(schema-drift): remove unused expected_type from baseline (#108)

ae2b493

clean6378-max-it requested a review from wpak-ai July 2, 2026 20:18

wpak-ai approved these changes Jul 2, 2026

View reviewed changes

wpak-ai merged commit 6465d4a into master Jul 2, 2026
16 checks passed

wpak-ai deleted the feat/schema-drift-detection branch July 2, 2026 20:44

Uh oh!

Conversation

clean6378-max-it commented Jul 2, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Out of scope

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

clean6378-max-it commented Jul 2, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading