[BUG-P1] store_scan_result idempotency broken — scan_timestamp in result_hash causes cache bloat

## Summary

Follow-up to PR #73 (which closed #31). The `store_scan_result` method now exists and the P0 bug is fixed, but the **idempotency claim is misleading**: `scan_timestamp` is included in `result_hash`, so repeated scans of an unchanged file set are never deduplicated.

## Root cause

In `scripts/persistent_registry.py:store_scan_result()`, the `result_hash` is computed from:

```python
subset = {
    "files_scanned": files_scanned,
    "frontend_counts": {...},
    "backend_counts": {...},
    "frameworks": frameworks,
    "scan_timestamp": time.time(),   # <-- this changes every call
    "total_symbols": total_symbols,
}
result_json = json.dumps(subset, ..., sort_keys=True)
result_hash = hashlib.sha256(result_json.encode("utf-8")).hexdigest()
```

The `scan_timestamp` field is included in the hashed JSON, so `result_hash` changes on every call, even for calls made moments apart. The idempotency check:

```python
existing = conn.execute(
    "SELECT 1 FROM analysis_cache WHERE command = ? AND file_set_hash = ? AND result_hash = ?",
    ("scan", file_set_hash, result_hash),
).fetchone()
```

...will NOT find an existing row because `result_hash` is different each time. A new row is inserted on every scan.
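A minimal, standalone sketch of the effect described above. The `hash_payload` helper and the sample counts are hypothetical and only mimic the sorted-key JSON plus SHA-256 pattern shown in the snippet; this is not the project's code.

```python
import hashlib
import json
import time


def hash_payload(payload: dict) -> str:
    """Hash a dict as sorted-key JSON followed by SHA-256 (hypothetical helper)."""
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


# Hypothetical scan counts, identical for both calls.
counts = {"files_scanned": 12, "total_symbols": 340}

first = hash_payload({**counts, "scan_timestamp": time.time()})
time.sleep(0.05)  # let the wall clock advance between the two "scans"
second = hash_payload({**counts, "scan_timestamp": time.time()})

# With the timestamp in the payload, identical scans hash differently.
hashes_differ = first != second
# Without the timestamp, identical scans hash identically.
stable_match = hash_payload(counts) == hash_payload(dict(counts))
```

Because the lookup filters on `result_hash`, the differing hashes mean the `SELECT` returns no row and the insert always runs.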

## Impact

- **Cache bloat**: every `codelens scan` inserts a new row into `analysis_cache`. After 1000 scans (e.g., CI runs over a year), the table has 1000 rows for the same file set.
- **Misleading test**: `test_store_scan_result_is_idempotent` passes only because it patches `time.time()` to return a fixed value. In production, idempotency does NOT hold.
- **Trend tracking works** (each row has a different timestamp), but deduplication does not.

## Severity

**P1** (not P0): the P0 bug from #31 IS fixed (method exists, `analysis_cache` is populated, `sqlite_persisted` flag is set, success message prints). The idempotency issue is a cache-bloat concern, not a correctness concern.

## Suggested fix

Exclude `scan_timestamp` from the `result_hash` computation. Use a separate deterministic hash that only covers the stable subset:

```python
# Build the stable subset (no timestamp) for hashing
stable_subset = {
    "files_scanned": files_scanned,
    "frontend_counts": {...},
    "backend_counts": {...},
    "frameworks": frameworks,
    "total_symbols": total_symbols,
}
stable_json = json.dumps(stable_subset, ensure_ascii=False, default=str, sort_keys=True)
result_hash = hashlib.sha256(stable_json.encode("utf-8")).hexdigest()

# Build the full subset (with timestamp) for storage
subset = {**stable_subset, "scan_timestamp": time.time()}
result_json = json.dumps(subset, ensure_ascii=False, default=str, sort_keys=True)
```

This way:
- `result_hash` is stable across re-scans of the same file set with the same counts → idempotency check finds existing row → no duplicate insert
- `scan_timestamp` is still stored in the row (for trend tracking) but doesn't affect dedup
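The behaviour above can be checked end to end with an in-memory SQLite database. This is a sketch under assumptions: the four-column `analysis_cache` schema, the `store_scan_result_sketch` function, and the sample values are hypothetical stand-ins, since the real table and method signature are not shown in this issue.

```python
import hashlib
import json
import sqlite3
import time


def store_scan_result_sketch(conn, file_set_hash, stable_subset):
    """Insert a row unless an identical result exists; return True if inserted."""
    # Hash only the stable fields (no timestamp).
    stable_json = json.dumps(stable_subset, ensure_ascii=False, default=str, sort_keys=True)
    result_hash = hashlib.sha256(stable_json.encode("utf-8")).hexdigest()

    existing = conn.execute(
        "SELECT 1 FROM analysis_cache WHERE command = ? AND file_set_hash = ? AND result_hash = ?",
        ("scan", file_set_hash, result_hash),
    ).fetchone()
    if existing:
        return False  # same file set, same counts: skip the duplicate insert

    # The stored JSON still carries the timestamp.
    subset = {**stable_subset, "scan_timestamp": time.time()}
    result_json = json.dumps(subset, ensure_ascii=False, default=str, sort_keys=True)
    conn.execute(
        "INSERT INTO analysis_cache (command, file_set_hash, result_hash, result_json) "
        "VALUES (?, ?, ?, ?)",
        ("scan", file_set_hash, result_hash, result_json),
    )
    return True


conn = sqlite3.connect(":memory:")
# Hypothetical schema; the real analysis_cache table may have more columns.
conn.execute(
    "CREATE TABLE analysis_cache "
    "(command TEXT, file_set_hash TEXT, result_hash TEXT, result_json TEXT)"
)
scan = {"files_scanned": 12, "frameworks": ["react"], "total_symbols": 340}
inserted_first = store_scan_result_sketch(conn, "abc123", scan)
inserted_second = store_scan_result_sketch(conn, "abc123", scan)
row_count = conn.execute("SELECT COUNT(*) FROM analysis_cache").fetchone()[0]
```

The second call returns `False` and the table keeps a single row, while the stored `result_json` still contains `scan_timestamp` from the first scan.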

## Acceptance criteria

- [ ] `result_hash` computation excludes `scan_timestamp`
- [ ] `scan_timestamp` still stored in `result_json`
- [ ] `test_store_scan_result_is_idempotent` passes WITHOUT patching `time.time()` (remove the patch, call twice with real time, assert 1 row)
- [ ] New test: `test_store_scan_result_updates_timestamp_on_rescan` — call twice, assert 1 row but `timestamp` column updated to latest (this implies the dedup path updates the existing row instead of skipping the write)
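One possible shape for the new rescan test, as a sketch only. The `upsert_scan` helper, the simplified schema, and the update-then-insert strategy are assumptions made for illustration; the real test would call `store_scan_result` against the project's actual table.

```python
import hashlib
import json
import sqlite3
import time


def upsert_scan(conn, file_set_hash, stable_subset):
    """Hypothetical dedup path: refresh `timestamp` on a match, insert otherwise."""
    payload = json.dumps(stable_subset, sort_keys=True, default=str)
    result_hash = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    now = time.time()
    updated = conn.execute(
        "UPDATE analysis_cache SET timestamp = ? "
        "WHERE command = ? AND file_set_hash = ? AND result_hash = ?",
        (now, "scan", file_set_hash, result_hash),
    ).rowcount
    if updated == 0:  # no matching row yet, so this is the first scan
        conn.execute(
            "INSERT INTO analysis_cache (command, file_set_hash, result_hash, timestamp) "
            "VALUES (?, ?, ?, ?)",
            ("scan", file_set_hash, result_hash, now),
        )


def test_store_scan_result_updates_timestamp_on_rescan():
    conn = sqlite3.connect(":memory:")
    # Simplified, hypothetical schema for the sketch.
    conn.execute(
        "CREATE TABLE analysis_cache "
        "(command TEXT, file_set_hash TEXT, result_hash TEXT, timestamp REAL)"
    )
    scan = {"files_scanned": 12, "total_symbols": 340}

    upsert_scan(conn, "abc123", scan)
    first_ts = conn.execute("SELECT timestamp FROM analysis_cache").fetchone()[0]
    time.sleep(0.05)  # real clock, no patching of time.time()
    upsert_scan(conn, "abc123", scan)

    rows = conn.execute("SELECT timestamp FROM analysis_cache").fetchall()
    assert len(rows) == 1          # still one row for the same file set
    assert rows[0][0] > first_ts   # timestamp refreshed to the later scan
```

Note that refreshing the timestamp in place keeps only the latest scan time per row, so whether this fits the trend-tracking use case is a design decision for the fix.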

## Files

- `scripts/persistent_registry.py` (`store_scan_result` method, ~L700-780)
- `tests/test_persistent_registry.py` (`TestStoreScanResult` class, update idempotency test + add new test)

## Related

- PR #73 (implemented `store_scan_result`, closed #31)
- Issue #31 (original P0 bug)

