Strengthen verifier-discipline test to assert held-out scores and gate_action (follow-up to #87) #94

Description

@Yif-Yang

Summary

Follow-up to #87 (which closed #67). The TestVerifierDiscipline.test_gate_rejects_reward_hacking_edit test added there is a useful smoke test, but its coverage is narrower than the name implies. This issue proposes a small strengthening so the test actually exercises the verifier-discipline invariant end-to-end.

Credit to @Tanmay9223 for the original test in #87 — this is purely an enhancement on top of it, not a complaint.

What #87's test currently guarantees

consolidate() really does run the live replay → score → reject path (train/val split, baseline val scoring, trial val scoring, rejection when cand_score > base_score is false — skillopt_sleep/consolidate.py). The test confirms that a hand-crafted bad edit lands in rejected_edits with accepted=False. That part is real and worth keeping.

Gaps worth closing

  1. The bad generalization is constructed, not discovered. MockRewardHackingBackend is hard-coded to return the train reference on the shortcut task but "placeholder URL" on the held-out task, so the held-out score collapses from 1.0 → 0.0 by construction. The test proves the gate rejects an obviously worse candidate; it does not exercise a subtle reward-hack (train/val both nudge up while generalization drops).

  2. The assertion doesn't pin the scores. It checks accepted=False and that a rejected edit exists, but never asserts the baseline/candidate held-out scores. A gate that rejected for the wrong reason would still pass.

  3. gate_action / the vendored gate aren't covered. In consolidate(), accepted is computed from a local final_score > base_gate_score comparison; evaluate_gate(...)'s action only populates the gate_action field and doesn't drive accepted. So a regression inside the vendored evaluate_gate() would not be caught by this test.
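
To make gap 3 concrete, here is a minimal, self-contained sketch of the decoupling. It does not use the real engine: `toy_consolidate`, `toy_gate`, `broken_gate`, and `ToyEditResult` are hypothetical stand-ins, and the only behavior assumed is what is described above (accepted comes from a local score comparison, while the gate only fills in gate_action).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyEditResult:
    accepted: bool
    gate_action: str


def toy_gate(base_score: float, cand_score: float) -> str:
    """Hypothetical stand-in for the vendored gate: strict improvement only."""
    return "accept" if cand_score > base_score else "reject"


def broken_gate(base_score: float, cand_score: float) -> str:
    """Simulated regression inside the gate: it always says accept."""
    return "accept"


def toy_consolidate(
    base_score: float,
    cand_score: float,
    gate: Callable[[float, float], str],
) -> ToyEditResult:
    # accepted is derived from a local comparison...
    accepted = cand_score > base_score
    # ...while the gate's answer is only recorded in gate_action.
    return ToyEditResult(accepted=accepted, gate_action=gate(base_score, cand_score))


healthy = toy_consolidate(1.0, 0.0, toy_gate)
regressed = toy_consolidate(1.0, 0.0, broken_gate)

# An accepted-only assertion cannot tell the two gates apart.
assert healthy.accepted is False
assert regressed.accepted is False

# Checking gate_action is what exposes the broken gate.
assert healthy.gate_action == "reject"
assert regressed.gate_action == "accept"
```

In this toy setup, a test that only checks accepted passes with either gate, which is the blind spot described in item 3.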

Proposed enhancement

  • Assert the concrete held-out baseline/candidate scores (e.g. baseline 1.0, candidate 0.0) so the rejection is verified to be score-driven, not incidental.
  • Add a paired case: one edit that genuinely improves the held-out slice → expect accepted=True and gate_action == "accept"; one that regresses it → expect accepted=False and gate_action == "reject". Asserting gate_action (not just accepted) brings evaluate_gate() under test.
  • Optional: a "looks better on train, worse on held-out" case to approximate a real reward-hack rather than an all-or-nothing one.
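
One possible shape for the strengthened test, sketched against a stand-in rather than the real engine. `toy_report`, the `base_score` / `cand_score` field names, and the specific score values are assumptions for illustration; in tests/test_sleep_engine.py the values would come from consolidate() driven by mock backends, and the actual report fields may be named differently.

```python
import unittest


def toy_report(train_base, train_cand, heldout_base, heldout_cand):
    """Hypothetical stand-in for one edit's entry in a consolidate() report.

    Only the held-out scores drive the decision; the train scores are
    recorded but ignored, mirroring the verifier-discipline invariant.
    """
    improved = heldout_cand > heldout_base
    return {
        "train_delta": train_cand - train_base,
        "base_score": heldout_base,  # assumed field name
        "cand_score": heldout_cand,  # assumed field name
        "accepted": improved,
        "gate_action": "accept" if improved else "reject",
    }


# (label, train base, train cand, held-out base, held-out cand, expect accepted)
CASES = [
    ("improves held-out", 0.5, 1.0, 0.5, 1.0, True),
    ("regresses held-out", 1.0, 1.0, 1.0, 0.0, False),
    ("train up, held-out down", 0.5, 1.0, 0.75, 0.5, False),
]


class TestGatePairs(unittest.TestCase):
    def test_scores_and_gate_action(self):
        for label, tb, tc, hb, hc, expect in CASES:
            with self.subTest(label):
                report = toy_report(tb, tc, hb, hc)
                # Pin the held-out scores so the decision is score-driven.
                self.assertEqual(report["base_score"], hb)
                self.assertEqual(report["cand_score"], hc)
                # Assert both fields, not just accepted.
                self.assertIs(report["accepted"], expect)
                self.assertEqual(
                    report["gate_action"], "accept" if expect else "reject"
                )
```

The third case is the approximation of a real reward-hack: the train score improves, yet the edit must still be rejected because the held-out score drops.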

Why it matters

Verifier discipline (only adopt edits that improve a held-out slice) is the core safety property of the sleep loop. Making the test assert scores + gate_action turns it from a smoke test into a real guard against gate regressions.

good first issue — scoped to tests/test_sleep_engine.py, no engine changes required.
