claude-code-chat-browser: schema drift detection for upstream JSONL c…#108

Merged
wpak-ai merged 5 commits into master from feat/schema-drift-detection on Jul 2, 2026

clean6378-max-it (Collaborator) commented Jul 2, 2026

Closes #103

Summary

Detect upstream Claude Code JSONL schema drift by fingerprinting field paths during parse_session() and diffing against a committed schema_baseline.json. New or missing required paths emit warnings on the claude_code_chat_browser.schema_drift logger; the web UI shows a dismissible amber banner on the session list page; GET /api/schema-report returns {known_fields, new_fields, missing_fields, has_drift}.
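
The fingerprint-and-diff idea above can be sketched in a few lines. This is a minimal illustration, not the code merged in this PR: the function names mirror those mentioned in the review, but the signatures, the handling of lists, and the `path -> required` baseline shape are assumptions made for the example.

```python
from typing import Any


def collect_field_paths(value: Any, prefix: str = "") -> set[str]:
    """Return the dotted path of every key reachable in a JSON-like value."""
    paths: set[str] = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= collect_field_paths(child, path)
    elif isinstance(value, list):
        # Assumption: list items share the parent's path (no index segment).
        for item in value:
            paths |= collect_field_paths(item, prefix)
    return paths


def diff_against_baseline(observed: set[str], baseline: dict[str, bool]) -> dict[str, Any]:
    """Compare observed paths with a baseline mapping path -> required flag.

    The baseline shape is hypothetical; the report keys match the ones
    listed for GET /api/schema-report in the summary.
    """
    baseline_paths = set(baseline)
    new = observed - baseline_paths
    missing = {p for p, required in baseline.items() if required and p not in observed}
    return {
        "known_fields": sorted(observed & baseline_paths),
        "new_fields": sorted(new),
        "missing_fields": sorted(missing),
        "has_drift": bool(new or missing),
    }
```

Under these assumptions, an entry carrying a key that is absent from the baseline shows up in `new_fields`, and a required baseline path that never appears shows up in `missing_fields`.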

Closes sprint item 5 (July Week 1 Thursday, 5 pt).

Changes

  • utils/schema_drift.py — baseline loader, recursive path collection, drift diff, process-wide report
  • utils/jsonl_parser.py — _collect_field_paths() wrapper; drift check at parse completion
  • schema_baseline.json — 98 known (json_path, expected_type) pairs; only type is required
  • api/schema_report.py — GET /api/schema-report
  • static/js/sessions.js + static/css/style.css — dismissible amber warning banner (sessionStorage fingerprint)
  • tests/fixtures/jsonl/unknown_field.jsonl — synthetic FutureToolXYZ / unknown tool key fixture
  • tests/test_schema_drift.py — 10 tests covering warnings, API, false-positive guard, merge behavior

Out of scope

  • Auto-updating baseline on every parse (updates are deliberate)
  • Blocking parse on drift
  • CLI command (API endpoint chosen instead)

Test plan

  • pytest tests/ -k schema -q
  • pytest -q (443 passed)
  • mypy -p api -p utils -p models
  • ruff check .
  • Manual: parse tests/fixtures/jsonl/unknown_field.jsonl — confirm warning logged and amber banner in UI

Summary by CodeRabbit

  • New Features
    • Added a schema drift warning banner in the workspace (lists up to 5 newly detected fields and up to 5 missing required fields).
    • Added a /api/schema-report endpoint so the UI can retrieve current drift status.
  • Bug Fixes
    • Banner dismissal is persisted per drift fingerprint to avoid repeated alerts.
    • Drift detection is more robust to baseline/reporting issues and won’t interrupt session parsing.
  • Style
    • Updated warning styling for light/dark themes, including alert layout and a dismiss button.

…hanges (#5)

Fingerprint known Claude Code JSONL field paths against a committed
schema_baseline.json, warn on drift during parsing, expose GET
/api/schema-report, and surface a dismissible amber banner on the
session list page. Warnings only - parsing is never blocked.

coderabbitai Bot commented Jul 2, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2f0ac298-06be-4a7b-9301-3ed267aff3b7

📥 Commits

Reviewing files that changed from the base of the PR and between f8bfa1c and ae2b493.

📒 Files selected for processing (1)
  • schema_baseline.json

📝 Walkthrough

Adds schema drift tracking from JSONL parsing through an API endpoint and session UI banner, backed by a committed baseline, drift tests, and updated benchmarks.

Changes

Schema Drift Detection

| Layer / File(s) | Summary |
| --- | --- |
| Baseline field catalog: schema_baseline.json | Defines the committed field-path catalog, including required flags for known JSONL structures. |
| Schema drift detection module: utils/schema_drift.py | Collects dotted field paths, loads the baseline, computes known/new/missing sets, accumulates reports, and supports reset/cache clearing. |
| Parser and API wiring: utils/jsonl_parser.py, api/schema_report.py, app.py | Records observed field paths during session parsing, exposes /api/schema-report, and registers the new blueprint in the app. |
| UI drift banner: static/js/sessions.js, static/css/style.css | Fetches the schema report, renders a dismissible banner when drift exists, persists dismissal state, and adds warning styles. |
| Tests and benchmarks: tests/test_schema_drift.py, static/js/sessions.test.js, tests/fixtures/jsonl/unknown_field.jsonl, benchmarks/baselines.json | Adds fixture coverage, schema drift tests, UI call-order/banner tests, and updated benchmark baselines. |

Estimated code review effort: 4 (Complex) | ~45 minutes

Suggested reviewers: timon0305, wpak-ai

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Linked Issues check | ⚠️ Warning | The PR covers drift fingerprinting, warnings, UI banner, API, tests, and fixture, but the baseline omits expected_type pairs requested by #103. | Restore expected_type metadata in schema_baseline.json, or update the issue if path-only matching is the intended contract. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: adding schema drift detection for upstream JSONL parsing. |
| Out of Scope Changes check | ✅ Passed | The CSS, JS, API, utils, fixture, tests, and benchmark updates all directly support the schema drift feature; no unrelated changes stand out. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
utils/schema_drift.py (2)

90-121: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Baseline JSON is re-read and re-parsed from disk on every parse_session() call.

load_baseline_fields() does file I/O + json.loads and is invoked unconditionally by diff_against_baseline(), which per the parser integration snippet runs once per session file. On a session-list page rendering many sessions, this repeats disk I/O + parsing for a file that never changes at runtime. Consider loading/parsing the baseline once (module import or lazily-cached, e.g. via functools.lru_cache) instead of on every parse.

♻️ Example: cache the parsed baseline
+from functools import lru_cache
+
+
+@lru_cache(maxsize=1)
+def load_baseline_fields() -> dict[str, SchemaFieldSpec]:
     """Load ``schema_baseline.json`` field specs keyed by dotted path."""
     raw = json.loads(BASELINE_PATH.read_text(encoding="utf-8"))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@utils/schema_drift.py` around lines 90 - 121, The baseline JSON is being
re-read and re-parsed on every call through diff_against_baseline() and
load_baseline_fields(), which adds repeated disk I/O for data that does not
change at runtime. Cache the parsed baseline once instead of loading it per
session, either by memoizing load_baseline_fields() with a lazy cache such as
functools.lru_cache or by initializing the baseline at module import, and keep
diff_against_baseline() using the cached result.
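
A self-contained sketch of the caching suggestion follows. It is illustrative only: the temporary baseline file and the per-field `{"required": ...}` shape are assumptions for the demo, while `functools.lru_cache`, `cache_info()`, and `cache_clear()` are standard-library behavior.

```python
import json
import tempfile
from functools import lru_cache
from pathlib import Path

# Hypothetical baseline location and contents, created only for this demo.
BASELINE_PATH = Path(tempfile.mkdtemp()) / "schema_baseline.json"
BASELINE_PATH.write_text(
    json.dumps({"fields": {"type": {"required": True}}}), encoding="utf-8"
)


@lru_cache(maxsize=1)
def load_baseline_fields() -> dict:
    """Read and parse the baseline once; later calls return the cached dict."""
    raw = json.loads(BASELINE_PATH.read_text(encoding="utf-8"))
    return raw.get("fields", {})


first = load_baseline_fields()   # reads the file
second = load_baseline_fields()  # served from the cache, no disk I/O
```

Two caveats apply to this pattern: the cached dict is shared, so callers should treat it as read-only, and tests that swap the baseline file need to call `load_baseline_fields.cache_clear()` first. `lru_cache` does not cache raised exceptions, so a failed load is retried on the next call.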

71-87: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

collect_field_paths_with_types and baseline expected_type are unused in the drift diff.

diff_against_baseline only compares field paths, and nothing in this module reads observed types. If type-drift checks aren’t coming next, remove the helper and expected_type plumbing for now to keep this focused.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@utils/schema_drift.py` around lines 71 - 87, The observed type-tracking code
in collect_field_paths_with_types is not used by diff_against_baseline, so the
drift logic should be simplified to only compare field paths. Remove the unused
type-collection helper and any expected_type plumbing from the baseline handling
unless you are adding type-drift checks in this same change, and keep the
remaining symbols centered around collect_field_paths, baseline, and
diff_against_baseline.
static/js/sessions.js (1)

85-88: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Parallelize independent fetches to reduce latency.

fetchSchemaDriftBannerHtml() and the sessions fetch are independent; awaiting them sequentially adds the schema-report round trip to the workspace load time.

⚡ Suggested refactor
-        const schemaBannerHtml = await fetchSchemaDriftBannerHtml();
-
-        const res = await fetch(`/api/projects/${encodeURIComponent(projectName)}/sessions`);
-        state.cachedSessions = await res.json();
+        const [schemaBannerHtml, res] = await Promise.all([
+            fetchSchemaDriftBannerHtml(),
+            fetch(`/api/projects/${encodeURIComponent(projectName)}/sessions`),
+        ]);
+        state.cachedSessions = await res.json();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@static/js/sessions.js` around lines 85 - 88, The workspace load is waiting on
two independent requests in sequence, which adds unnecessary latency. Update the
sessions-loading flow in sessions.js so fetchSchemaDriftBannerHtml() and the
/api/projects/${encodeURIComponent(projectName)}/sessions fetch start in
parallel, then await both results together before using schemaBannerHtml and
state.cachedSessions. Keep the change localized to the sessions-loading logic
and preserve the existing assignment behavior once both promises resolve.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@utils/jsonl_parser.py`:
- Around line 236-239: `parse_session` currently calls
`record_parse_drift(observed_field_paths)` unguarded, so drift tracking can
abort parsing if `diff_against_baseline()` or `load_baseline_fields()` fails on
a missing or malformed `schema_baseline.json`. Update `record_parse_drift` (or
its call site in `parse_session`) to catch and suppress non-fatal drift-tracking
errors, while still allowing `validate_session_dict(...)` and the rest of
parsing to proceed normally; keep the fix localized around `record_parse_drift`,
`diff_against_baseline`, and `load_baseline_fields`.
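
One way to make the drift check non-fatal is sketched below. It is a hedged illustration rather than the fix that landed: the stand-in `diff_against_baseline` always raises to simulate a malformed baseline, and the boolean return value of `record_parse_drift` is an assumption added so the outcome is observable.

```python
import json
import logging

logger = logging.getLogger("claude_code_chat_browser.schema_drift")


def diff_against_baseline(observed: set[str]) -> dict:
    """Stand-in that simulates a malformed schema_baseline.json."""
    raise json.JSONDecodeError("bad baseline", "", 0)


def record_parse_drift(observed_field_paths: set[str]) -> bool:
    """Run the drift check; return False instead of raising if it fails."""
    try:
        diff_against_baseline(observed_field_paths)
    except (OSError, ValueError) as exc:
        # json.JSONDecodeError subclasses ValueError; OSError covers a
        # missing or unreadable baseline file.
        logger.warning("schema drift check skipped: %s", exc)
        return False
    return True
```

Catching only `OSError` and `ValueError` keeps genuine programming errors visible while letting parsing continue when the baseline itself is the problem.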

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 118cf0f8-8beb-4e8a-ab11-f5c572bfcdac

📥 Commits

Reviewing files that changed from the base of the PR and between 4345d69 and 8a7f170.

📒 Files selected for processing (9)
  • api/schema_report.py
  • app.py
  • schema_baseline.json
  • static/css/style.css
  • static/js/sessions.js
  • tests/fixtures/jsonl/unknown_field.jsonl
  • tests/test_schema_drift.py
  • utils/jsonl_parser.py
  • utils/schema_drift.py

…108)

Cache schema_baseline.json with lru_cache and make record_parse_drift
non-fatal on baseline I/O or parse errors so parsing never aborts.
Fetch /api/schema-report after sessions load so the banner reflects
drift from the current parse run. Add vitest coverage for banner
rendering and fetch ordering; extend pytest for malformed baseline.
Raise benchmark baselines for per-entry field-path fingerprinting.
@clean6378-max-it clean6378-max-it self-assigned this Jul 2, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
utils/jsonl_parser.py (1)

178-195: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Preserve the documented “0 means uncapped” behavior.

schema_drift_sample_limit() documents 0 as disabling the sampling cap, but this loop only samples while schema_samples_remaining > 0, so setting CLAUDE_CODE_CHAT_BROWSER_SCHEMA_DRIFT_SAMPLE=0 disables drift collection entirely.

Proposed fix
-    schema_samples_remaining = schema_drift_sample_limit() if is_schema_drift_enabled() else 0
+    schema_drift_enabled = is_schema_drift_enabled()
+    sample_limit = schema_drift_sample_limit() if schema_drift_enabled else 0
+    schema_samples_remaining: int | None = (
+        None if schema_drift_enabled and sample_limit == 0 else sample_limit
+    )
@@
-            if schema_samples_remaining > 0:
+            if schema_samples_remaining is None or schema_samples_remaining > 0:
                 observed_field_paths |= collect_field_paths(entry)
-                schema_samples_remaining -= 1
+                if schema_samples_remaining is not None:
+                    schema_samples_remaining -= 1
@@
-    if is_schema_drift_enabled() and observed_field_paths:
+    if schema_drift_enabled and observed_field_paths:
         record_parse_drift(observed_field_paths)

Also applies to: 241-242

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@utils/jsonl_parser.py` around lines 178 - 195, The schema drift sampling
logic in the JSONL parser treats `schema_drift_sample_limit()` returning 0 as
“don’t sample at all” instead of the documented uncapped mode. Update the
sampling condition in the loop that uses `schema_samples_remaining` so that 0 is
handled as unlimited collection while positive values still decrement, and apply
the same fix to the other schema-drift sampling loop referenced by the same
behavior.
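
The "0 means uncapped" rule can be isolated in a small helper, shown here as a sketch. The helper name `sample_entries` and its list-returning shape are hypothetical; the real loop in parse_session() collects field paths rather than entries.

```python
from typing import Iterable, Optional


def sample_entries(entries: Iterable[dict], sample_limit: int) -> list[dict]:
    """Return the entries that would be fingerprinted under sample_limit.

    A limit of 0 is treated as "no cap"; a positive limit takes that many
    entries from the start of the stream.
    """
    remaining: Optional[int] = None if sample_limit == 0 else sample_limit
    sampled: list[dict] = []
    for entry in entries:
        if remaining is None or remaining > 0:
            sampled.append(entry)
            if remaining is not None:
                remaining -= 1
    return sampled
```

Representing "unlimited" as `None` rather than a sentinel integer keeps the decrement logic from ever touching the uncapped case.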
♻️ Duplicate comments (1)
utils/schema_drift.py (1)

77-80: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Validate the baseline root before calling .get().

If schema_baseline.json is valid JSON but not an object, raw.get(...) raises AttributeError, which is not caught by record_parse_drift() and can still abort parsing.

Proposed fix
-        raw = json.loads(BASELINE_PATH.read_text(encoding="utf-8"))
+        raw = json.loads(BASELINE_PATH.read_text(encoding="utf-8"))
+        if not isinstance(raw, dict):
+            raise ValueError("schema_baseline.json: root must be an object")
         fields = raw.get("fields", {})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@utils/schema_drift.py` around lines 77 - 80, The baseline parsing in
record_parse_drift currently assumes json.loads(BASELINE_PATH.read_text(...))
returns an object and calls .get() immediately, which can raise AttributeError
for valid non-object JSON. Add a root-type check right after loading the
baseline in record_parse_drift (before accessing fields) and raise a ValueError
with the existing schema_baseline.json validation pattern if the root is not a
dict; then keep the existing fields validation unchanged.
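
The root-type check can be exercised in isolation with a sketch like the following. The helper name `parse_baseline` and the second error message are assumptions for illustration; only the root-object message is taken from the proposed fix.

```python
import json


def parse_baseline(text: str) -> dict:
    """Parse baseline JSON text and return its "fields" mapping.

    Raises ValueError for invalid JSON, a non-object root, or a
    non-object "fields" value, so callers can catch one exception type.
    """
    raw = json.loads(text)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(raw, dict):
        raise ValueError("schema_baseline.json: root must be an object")
    fields = raw.get("fields", {})
    if not isinstance(fields, dict):
        # Hypothetical message; the real wording is not shown in this thread.
        raise ValueError("schema_baseline.json: 'fields' must be an object")
    return fields
```

With every failure mode surfaced as `ValueError`, a caller that already tolerates `ValueError` no longer risks an uncaught `AttributeError` from `raw.get(...)`.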
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@utils/schema_drift.py`:
- Around line 125-147: The schema drift merge in _last_report is not atomic
because prior_new/merged_new are computed in one critical section and the state
is written in a later one, allowing concurrent calls to drop newly discovered
fields. Update the logic in schema_drift.py around _lock and _last_report so the
read-compute-write of new_fields happens in a single locked section. Keep the
warning logging based on the merged result from that same protected update, and
ensure _last_report is assigned once with the fully merged state.
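
A minimal sketch of an atomic read-compute-write merge follows. The names `_lock` and `_last_report` come from the comment above, but the report shape, the helper `merge_new_fields`, and the accessor `current_new_fields` are assumptions made for the example.

```python
import threading

_lock = threading.Lock()
# Hypothetical report shape: only the accumulated new fields are tracked here.
_last_report: dict[str, set[str]] = {"new_fields": set()}


def merge_new_fields(new_fields: set[str]) -> set[str]:
    """Merge new_fields into the shared report; return the first-seen ones.

    The read, the union, and the write all happen inside one critical
    section, so concurrent callers cannot overwrite each other's additions.
    """
    with _lock:
        prior = _last_report["new_fields"]
        first_seen = new_fields - prior
        _last_report["new_fields"] = prior | new_fields
    # Logging can happen outside the lock, using the value computed inside it.
    return first_seen


def current_new_fields() -> set[str]:
    """Return a copy of the accumulated new fields."""
    with _lock:
        return set(_last_report["new_fields"])
```

Returning only the first-seen fields gives the caller what it needs to emit a warning once per newly discovered path without re-reading shared state.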

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f74ac9b5-8fe0-4ece-91a5-7955c7a52bd6

📥 Commits

Reviewing files that changed from the base of the PR and between 252429c and aa6b31f.

📒 Files selected for processing (6)
  • schema_baseline.json
  • static/js/sessions.js
  • static/js/sessions.test.js
  • tests/test_schema_drift.py
  • utils/jsonl_parser.py
  • utils/schema_drift.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • static/js/sessions.test.js
  • schema_baseline.json
  • static/js/sessions.js

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
utils/schema_drift.py (1)

44-49: 🚀 Performance & Scalability | 🔵 Trivial | 💤 Low value

Negative sample values collapse to "no cap".

max(0, int(raw)) maps any negative value to 0, which downstream (schema_drift_sample_limit() == 0 → None in parse_session) means unlimited fingerprinting of every record. A user setting -1 intending to disable/limit sampling would unexpectedly scan all records. Consider treating negatives as invalid and falling back to the default 3 instead.

♻️ Optional adjustment
     raw = os.environ.get("CLAUDE_CODE_CHAT_BROWSER_SCHEMA_DRIFT_SAMPLE", "3").strip()
     try:
-        return max(0, int(raw))
+        value = int(raw)
     except ValueError:
         return 3
+    return value if value >= 0 else 3
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@utils/schema_drift.py` around lines 44 - 49, The schema drift sample limit
parsing in schema_drift_sample_limit() treats negative values as 0, which later
becomes unlimited sampling in parse_session. Update the environment parsing so
only non-negative integers are accepted, and make any negative or otherwise
invalid input fall back to the default of 3 instead of collapsing to no cap.
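
The suggested parsing rule, written out as a runnable sketch. The environment variable name and the default of 3 come from the snippet above; the constants `ENV_VAR` and `DEFAULT_SAMPLE_LIMIT` are names introduced here for readability.

```python
import os

ENV_VAR = "CLAUDE_CODE_CHAT_BROWSER_SCHEMA_DRIFT_SAMPLE"
DEFAULT_SAMPLE_LIMIT = 3


def schema_drift_sample_limit() -> int:
    """Return the sample limit: 0 means uncapped, invalid input means default."""
    raw = os.environ.get(ENV_VAR, str(DEFAULT_SAMPLE_LIMIT)).strip()
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_SAMPLE_LIMIT
    # Negative values are rejected rather than clamped to 0, because 0 is
    # the "no cap" setting and would scan every record.
    return value if value >= 0 else DEFAULT_SAMPLE_LIMIT
```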

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 126a8690-fcce-46b3-bff9-48ecbea3acd0

📥 Commits

Reviewing files that changed from the base of the PR and between aa6b31f and f8bfa1c.

📒 Files selected for processing (3)
  • tests/test_schema_drift.py
  • utils/jsonl_parser.py
  • utils/schema_drift.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • utils/jsonl_parser.py

@clean6378-max-it clean6378-max-it requested a review from wpak-ai July 2, 2026 20:18
@wpak-ai wpak-ai merged commit 6465d4a into master Jul 2, 2026
16 checks passed
@wpak-ai wpak-ai deleted the feat/schema-drift-detection branch July 2, 2026 20:44

Development

Successfully merging this pull request may close these issues.

claude-code-chat-browser: Schema drift detection — mechanism to detect upstream Claude Code JSONL changes

3 participants