Summary
Follow-up to PR #73 (which closed #31). The store_scan_result method is correctly implemented and the P0 bug is fixed, but the idempotency claim is misleading due to scan_timestamp being included in the result_hash.
Root cause
In scripts/persistent_registry.py:store_scan_result(), the result_hash is computed from:
subset = {
"files_scanned": files_scanned,
"frontend_counts": {...},
"backend_counts": {...},
"frameworks": frameworks,
"scan_timestamp": time.time(), # <-- this changes every call
"total_symbols": total_symbols,
}
result_json = json.dumps(subset, ..., sort_keys=True)
result_hash = hashlib.sha256(result_json.encode("utf-8")).hexdigest()
The scan_timestamp field is included in the hashed JSON, so it changes on every call (even 1 second apart). The idempotency check:
existing = conn.execute(
"SELECT 1 FROM analysis_cache WHERE command = ? AND file_set_hash = ? AND result_hash = ?",
("scan", file_set_hash, result_hash),
).fetchone()
...will NOT find an existing row because result_hash is different each time. A new row is inserted on every scan.
Impact
- Cache bloat: every
codelens scan inserts a new row into analysis_cache. After 1000 scans (e.g., CI runs over a year), the table has 1000 rows for the same file set.
- Misleading test:
test_store_scan_result_is_idempotent passes only because it patches time.time() to return a fixed value. In production, idempotency does NOT hold.
- Trend tracking works (each row has a different timestamp), but deduplication does not.
Severity
P1 (not P0): the P0 bug from #31 IS fixed (method exists, analysis_cache is populated, sqlite_persisted flag is set, success message prints). The idempotency issue is a cache-bloat concern, not a correctness concern.
Suggested fix
Exclude scan_timestamp from the result_hash computation. Use a separate deterministic hash that only covers the stable subset:
# Build the stable subset (no timestamp) for hashing
stable_subset = {
"files_scanned": files_scanned,
"frontend_counts": {...},
"backend_counts": {...},
"frameworks": frameworks,
"total_symbols": total_symbols,
}
stable_json = json.dumps(stable_subset, ensure_ascii=False, default=str, sort_keys=True)
result_hash = hashlib.sha256(stable_json.encode("utf-8")).hexdigest()
# Build the full subset (with timestamp) for storage
subset = {**stable_subset, "scan_timestamp": time.time()}
result_json = json.dumps(subset, ensure_ascii=False, default=str, sort_keys=True)
This way:
result_hash is stable across re-scans of the same file set with the same counts → idempotency check finds existing row → no duplicate insert
scan_timestamp is still stored in the row (for trend tracking) but doesn't affect dedup
Acceptance criteria
Files
scripts/persistent_registry.py (store_scan_result method, ~L700-780)
tests/test_persistent_registry.py (TestStoreScanResult class, update idempotency test + add new test)
Related
Summary
Follow-up to PR #73 (which closed #31). The
store_scan_resultmethod is correctly implemented and the P0 bug is fixed, but the idempotency claim is misleading due toscan_timestampbeing included in theresult_hash.Root cause
In
scripts/persistent_registry.py:store_scan_result(), theresult_hashis computed from:The
scan_timestampfield is included in the hashed JSON, so it changes on every call (even 1 second apart). The idempotency check:...will NOT find an existing row because
result_hashis different each time. A new row is inserted on every scan.Impact
codelens scaninserts a new row intoanalysis_cache. After 1000 scans (e.g., CI runs over a year), the table has 1000 rows for the same file set.test_store_scan_result_is_idempotentpasses only because it patchestime.time()to return a fixed value. In production, idempotency does NOT hold.Severity
P1 (not P0): the P0 bug from #31 IS fixed (method exists,
analysis_cacheis populated,sqlite_persistedflag is set, success message prints). The idempotency issue is a cache-bloat concern, not a correctness concern.Suggested fix
Exclude
scan_timestampfrom theresult_hashcomputation. Use a separate deterministic hash that only covers the stable subset:This way:
result_hashis stable across re-scans of the same file set with the same counts → idempotency check finds existing row → no duplicate insertscan_timestampis still stored in the row (for trend tracking) but doesn't affect dedupAcceptance criteria
result_hashcomputation excludesscan_timestampscan_timestampstill stored inresult_jsontest_store_scan_result_is_idempotentpasses WITHOUT patchingtime.time()(remove the patch, call twice with real time, assert 1 row)test_store_scan_result_updates_timestamp_on_rescan— call twice, assert 1 row buttimestampcolumn updated to latestFiles
scripts/persistent_registry.py(store_scan_resultmethod, ~L700-780)tests/test_persistent_registry.py(TestStoreScanResultclass, update idempotency test + add new test)Related
store_scan_result, closed [BUG-01] scan: pr.store_scan_result(result) calls non-existent method — SQLite analysis_cache never populated #31)