Skip to content

[BUG-P1] store_scan_result idempotency broken — scan_timestamp in result_hash causes cache bloat #82

Description

@Wolfvin

Summary

Follow-up to PR #73 (which closed #31). The store_scan_result method is correctly implemented and the P0 bug is fixed, but the idempotency claim is misleading due to scan_timestamp being included in the result_hash.

Root cause

In scripts/persistent_registry.py:store_scan_result(), the result_hash is computed from:

subset = {
    "files_scanned": files_scanned,
    "frontend_counts": {...},
    "backend_counts": {...},
    "frameworks": frameworks,
    "scan_timestamp": time.time(),   # <-- this changes every call
    "total_symbols": total_symbols,
}
result_json = json.dumps(subset, ..., sort_keys=True)
result_hash = hashlib.sha256(result_json.encode("utf-8")).hexdigest()

The scan_timestamp field is included in the hashed JSON, so it changes on every call (even 1 second apart). The idempotency check:

existing = conn.execute(
    "SELECT 1 FROM analysis_cache WHERE command = ? AND file_set_hash = ? AND result_hash = ?",
    ("scan", file_set_hash, result_hash),
).fetchone()

...will NOT find an existing row because result_hash is different each time. A new row is inserted on every scan.

Impact

  • Cache bloat: every codelens scan inserts a new row into analysis_cache. After 1000 scans (e.g., CI runs over a year), the table has 1000 rows for the same file set.
  • Misleading test: test_store_scan_result_is_idempotent passes only because it patches time.time() to return a fixed value. In production, idempotency does NOT hold.
  • Trend tracking works (each row has a different timestamp), but deduplication does not.

Severity

P1 (not P0): the P0 bug from #31 IS fixed (method exists, analysis_cache is populated, sqlite_persisted flag is set, success message prints). The idempotency issue is a cache-bloat concern, not a correctness concern.

Suggested fix

Exclude scan_timestamp from the result_hash computation. Use a separate deterministic hash that only covers the stable subset:

# Build the stable subset (no timestamp) for hashing
stable_subset = {
    "files_scanned": files_scanned,
    "frontend_counts": {...},
    "backend_counts": {...},
    "frameworks": frameworks,
    "total_symbols": total_symbols,
}
stable_json = json.dumps(stable_subset, ensure_ascii=False, default=str, sort_keys=True)
result_hash = hashlib.sha256(stable_json.encode("utf-8")).hexdigest()

# Build the full subset (with timestamp) for storage
subset = {**stable_subset, "scan_timestamp": time.time()}
result_json = json.dumps(subset, ensure_ascii=False, default=str, sort_keys=True)

This way:

  • result_hash is stable across re-scans of the same file set with the same counts → idempotency check finds existing row → no duplicate insert
  • scan_timestamp is still stored in the row (for trend tracking) but doesn't affect dedup

Acceptance criteria

  • result_hash computation excludes scan_timestamp
  • scan_timestamp still stored in result_json
  • test_store_scan_result_is_idempotent passes WITHOUT patching time.time() (remove the patch, call twice with real time, assert 1 row)
  • New test: test_store_scan_result_updates_timestamp_on_rescan — call twice, assert 1 row but timestamp column updated to latest

Files

  • scripts/persistent_registry.py (store_scan_result method, ~L700-780)
  • tests/test_persistent_registry.py (TestStoreScanResult class, update idempotency test + add new test)

Related

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions