Strengthen verifier-discipline test to assert held-out scores and gate_action (follow-up to #87)

### Summary

Follow-up to #87 (which closed #67). The `TestVerifierDiscipline.test_gate_rejects_reward_hacking_edit` test added there is a useful smoke test, but its coverage is narrower than the name implies. This issue proposes a small strengthening so the test actually exercises the verifier-discipline invariant end-to-end.

Credit to @Tanmay9223 for the original test in #87 — this is purely an enhancement on top of it, not a complaint.

### What #87's test currently guarantees

`consolidate()` in `skillopt_sleep/consolidate.py` really does run the live replay → score → reject path: train/val split, baseline val scoring, trial val scoring, and rejection when `cand_score > base_score` is false. The test confirms that a hand-crafted bad edit lands in `rejected_edits` with `accepted=False`. That part is real and worth keeping.

### Gaps worth closing

1. **The bad generalization is constructed, not discovered.** `MockRewardHackingBackend` hard-codes "returns the train reference on the shortcut task, but `"placeholder URL"` on the held-out task," so the held-out score collapses from 1.0 → 0.0 by construction. The test proves the gate rejects an *obviously* worse candidate; it does not exercise a subtle reward-hack (train/val both nudge up while generalization drops).

2. **The assertion doesn't pin the scores.** It checks `accepted=False` and that a rejected edit exists, but never asserts the baseline/candidate held-out scores. A gate that rejected for the wrong reason would still pass.

3. **`gate_action` / the vendored gate aren't covered.** In `consolidate()`, `accepted` is computed from a local `final_score > base_gate_score` comparison; `evaluate_gate(...)`'s `action` only populates the `gate_action` field and doesn't drive `accepted`. So a regression inside the vendored `evaluate_gate()` would not be caught by this test.
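To make gap 3 concrete, here is a toy model of the decoupling described above. It is not the real engine: `toy_consolidate`, `EditRecord`, `broken_gate`, and the two-argument `evaluate_gate` signature are all hypothetical stand-ins, written only to show why a test that asserts `accepted` alone cannot see a gate regression.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EditRecord:
    # Hypothetical record shape; the real fields may differ.
    accepted: bool
    gate_action: str


def evaluate_gate(base_score: float, cand_score: float) -> str:
    """Hypothetical stand-in for the vendored gate (signature assumed)."""
    return "accept" if cand_score > base_score else "reject"


def broken_gate(base_score: float, cand_score: float) -> str:
    """Simulated regression: the gate approves every candidate."""
    return "accept"


def toy_consolidate(
    base_score: float,
    cand_score: float,
    gate: Callable[[float, float], str] = evaluate_gate,
) -> EditRecord:
    # `accepted` comes from a local comparison, as the issue describes.
    accepted = cand_score > base_score
    # The gate's answer only populates `gate_action`; it never drives `accepted`.
    return EditRecord(accepted=accepted, gate_action=gate(base_score, cand_score))


healthy = toy_consolidate(1.0, 0.0)
regressed = toy_consolidate(1.0, 0.0, gate=broken_gate)

# Asserting only `accepted` cannot tell the two apart.
assert healthy.accepted is False and regressed.accepted is False
# Asserting `gate_action` exposes the broken gate.
assert healthy.gate_action == "reject"
assert regressed.gate_action == "accept"
```

In this sketch the regressed gate is invisible through `accepted` and visible only through `gate_action`, which is the reason for asserting both.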

### Proposed enhancement

- Assert the concrete held-out baseline/candidate scores (e.g. baseline `1.0`, candidate `0.0`) so the rejection is verified to be *score-driven*, not incidental.
- Add a **paired** case: one edit that genuinely improves the held-out slice → expect `accepted=True` **and** `gate_action == "accept"`; one that regresses it → expect `accepted=False` **and** `gate_action == "reject"`. Asserting `gate_action` (not just `accepted`) brings `evaluate_gate()` under test.
- Optional: a "looks better on train, worse on held-out" case to approximate a real reward-hack rather than an all-or-nothing one.

### Why it matters

Verifier discipline (only adopt edits that improve a held-out slice) is the core safety property of the sleep loop. Making the test assert scores + `gate_action` turns it from a smoke test into a real guard against gate regressions.

`good first issue` — scoped to `tests/test_sleep_engine.py`, no engine changes required.

