Summary
Follow-up to #87 (which closed #67). The TestVerifierDiscipline.test_gate_rejects_reward_hacking_edit test added there is a useful smoke test, but its coverage is narrower than the name implies. This issue proposes a small strengthening so the test actually exercises the verifier-discipline invariant end-to-end.
Credit to @Tanmay9223 for the original test in #87 — this is purely an enhancement on top of it, not a complaint.
What #87's test currently guarantees
consolidate() really does run the live replay → score → reject path (train/val split, baseline val scoring, trial val scoring, rejection when cand_score > base_score is false — skillopt_sleep/consolidate.py). The test confirms that a hand-crafted bad edit lands in rejected_edits with accepted=False. That part is real and worth keeping.
Gaps worth closing
-
The bad generalization is constructed, not discovered. MockRewardHackingBackend hard-codes "returns the train reference on the shortcut task, but "placeholder URL" on the held-out task," so the held-out score collapses from 1.0 → 0.0 by construction. The test proves the gate rejects an obviously worse candidate; it does not exercise a subtle reward-hack (train/val both nudge up while generalization drops).
-
The assertion doesn't pin the scores. It checks accepted=False and that a rejected edit exists, but never asserts the baseline/candidate held-out scores. A gate that rejected for the wrong reason would still pass.
-
gate_action / the vendored gate aren't covered. In consolidate(), accepted is computed from a local final_score > base_gate_score comparison; evaluate_gate(...)'s action only populates the gate_action field and doesn't drive accepted. So a regression inside the vendored evaluate_gate() would not be caught by this test.
Proposed enhancement
- Assert the concrete held-out baseline/candidate scores (e.g. baseline
1.0, candidate 0.0) so the rejection is verified to be score-driven, not incidental.
- Add a paired case: one edit that genuinely improves the held-out slice → expect
accepted=True and gate_action == "accept"; one that regresses it → expect accepted=False and gate_action == "reject". Asserting gate_action (not just accepted) brings evaluate_gate() under test.
- Optional: a "looks better on train, worse on held-out" case to approximate a real reward-hack rather than an all-or-nothing one.
Why it matters
Verifier discipline (only adopt edits that improve a held-out slice) is the core safety property of the sleep loop. Making the test assert scores + gate_action turns it from a smoke test into a real guard against gate regressions.
good first issue — scoped to tests/test_sleep_engine.py, no engine changes required.
Summary
Follow-up to #87 (which closed #67). The
TestVerifierDiscipline.test_gate_rejects_reward_hacking_edittest added there is a useful smoke test, but its coverage is narrower than the name implies. This issue proposes a small strengthening so the test actually exercises the verifier-discipline invariant end-to-end.Credit to @Tanmay9223 for the original test in #87 — this is purely an enhancement on top of it, not a complaint.
What #87's test currently guarantees
consolidate()really does run the live replay → score → reject path (train/val split, baseline val scoring, trial val scoring, rejection whencand_score > base_scoreis false —skillopt_sleep/consolidate.py). The test confirms that a hand-crafted bad edit lands inrejected_editswithaccepted=False. That part is real and worth keeping.Gaps worth closing
The bad generalization is constructed, not discovered.
MockRewardHackingBackendhard-codes "returns the train reference on the shortcut task, but"placeholder URL"on the held-out task," so the held-out score collapses from 1.0 → 0.0 by construction. The test proves the gate rejects an obviously worse candidate; it does not exercise a subtle reward-hack (train/val both nudge up while generalization drops).The assertion doesn't pin the scores. It checks
accepted=Falseand that a rejected edit exists, but never asserts the baseline/candidate held-out scores. A gate that rejected for the wrong reason would still pass.gate_action/ the vendored gate aren't covered. Inconsolidate(),acceptedis computed from a localfinal_score > base_gate_scorecomparison;evaluate_gate(...)'sactiononly populates thegate_actionfield and doesn't driveaccepted. So a regression inside the vendoredevaluate_gate()would not be caught by this test.Proposed enhancement
1.0, candidate0.0) so the rejection is verified to be score-driven, not incidental.accepted=Trueandgate_action == "accept"; one that regresses it → expectaccepted=Falseandgate_action == "reject". Assertinggate_action(not justaccepted) bringsevaluate_gate()under test.Why it matters
Verifier discipline (only adopt edits that improve a held-out slice) is the core safety property of the sleep loop. Making the test assert scores +
gate_actionturns it from a smoke test into a real guard against gate regressions.good first issue— scoped totests/test_sleep_engine.py, no engine changes required.