<|meta|> reasoning (verify · redirect · state confidence). We never tell it
what to write. Instead we compare two of its own rollouts — meta on vs. off, gold vs. decoy,
before vs. after — and turn the difference into a reward. The contrast is the teacher.
That self-supervised loop is the engine of self-evolving reasoning.
The north star is self-evolving reasoning: a model that gets better at math by learning about its own thinking, using only contrasts it can draw from its own rollouts.
Every reward we try is one self-distillation signal — a contrast mined from the model's own attempts, never an external teacher or an injected answer. A gold-conditioned teacher (SDPO) can prime the meta, but imitating it suppresses the epistemic hedging that powers real reasoning — our contrast-from-self-rollouts never primes, so that voice is preserved.
One picture drives everything below. A horizontal axis: the left end is the decoy (a wrong answer), the right end is the gold (the correct answer). A dot marks the model's current belief — how much it favors gold over decoy. As reasoning streams, the dot moves.
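One way to make the dot concrete is sketched below. It assumes the belief is read off as the log-odds of gold over decoy and squashed onto the axis with a sigmoid; that mapping, the function name `belief_position`, and the sample log-probabilities are all illustrative assumptions, not the page's actual plotting code.

```python
import math


def belief_position(logp_gold: float, logp_decoy: float) -> float:
    """Map gold-vs-decoy log-odds to a position on the axis.

    0.0 is the decoy end, 1.0 is the gold end, 0.5 means undecided.
    The sigmoid mapping is an assumption made for illustration.
    """
    margin = logp_gold - logp_decoy
    return 1.0 / (1.0 + math.exp(-margin))


# Hypothetical (log p(gold), log p(decoy)) pairs, scored after successive
# chunks of reasoning have streamed in.
trajectory = [(-2.0, -1.0), (-1.5, -1.5), (-0.5, -3.0)]
positions = [belief_position(g, d) for g, d in trajectory]

for step, pos in enumerate(positions):
    print(f"chunk {step}: dot at {pos:.3f}")
```

In this toy trajectory the dot starts on the decoy side, passes through the midpoint, and ends on the gold side.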
We tried five self-distillation signals (PMI, CF, gm, asym_cf and PMI-shift), each a different contrast mined from the model's own rollouts. PMI-shift won: it rewards the causal gold−decoy belief shift across the meta block. Full method cards, formulas and the experiment ladder live in the archive.
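The archive holds the actual formula. As a rough reading of "belief shift across the meta block", the sketch below scores the gold-minus-decoy log-probability margin just before and just after the block and rewards the difference. The names `margin` and `pmi_shift_reward`, and the example numbers, are hypothetical.

```python
def margin(logp_gold: float, logp_decoy: float) -> float:
    """Gold-minus-decoy log-probability margin at one point in the rollout."""
    return logp_gold - logp_decoy


def pmi_shift_reward(before: tuple[float, float], after: tuple[float, float]) -> float:
    """Reward the change in margin caused by the meta block.

    `before` and `after` are (log p(gold), log p(decoy)) pairs scored with the
    prefix ending just before and just after the meta block. The sign is
    bidirectional: positive if the block moved belief toward gold, negative
    if it moved belief toward the decoy.
    """
    return margin(*after) - margin(*before)


# Hypothetical numbers: the meta block catches an error and flips the belief.
helpful = pmi_shift_reward(before=(-3.0, -1.0), after=(-0.5, -2.5))
# The same numbers reversed: the meta block talks the model out of the gold answer.
harmful = pmi_shift_reward(before=(-0.5, -2.5), after=(-3.0, -1.0))
print(helpful, harmful)
```

Because only the difference across the block is rewarded, a rollout that already favored gold before the meta block earns nothing extra for merely staying there.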
The computation, token by token. PMI and gm run on a shared illustrative example (7 − 3·(2 − 5), sign slip → decoy −2, caught → gold 16); the PMI-shift tab replays a real rollout step by step.
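The arithmetic behind the shared example can be checked directly. The exact form of the sign slip is an assumption: dropping the sign of the inner difference is one slip that reproduces the stated decoy.

```python
# Correct evaluation: (2 - 5) = -3, so 7 - 3 * (-3) = 7 + 9.
gold = 7 - 3 * (2 - 5)

# Assumed sign slip: the inner difference is treated as +3, so 7 - 3 * 3 = 7 - 9.
decoy = 7 - 3 * abs(2 - 5)

print(gold, decoy)
```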
The old base-vs-meta comparison was confounded: the two arms used different SFT and RL data (the old-base mystery). The twin base is pmishift minus only the meta mechanism: same data, same hyperparameters, same grader, all audited.
Δ = pmishift − base at shared val steps (held-in)
| gs (global step) | base | pmishift |
|---|---|---|
| 25 | 0.659 | 0.680 |
| 50 | 0.669 | 0.676 |
| 75 | 0.589 | 0.693 |
| 100 | 0.642 | 0.716 |
| 125 | 0.626 | 0.722 |
| 150 | — | 0.717 |
| 175 | 0.619 | 0.738 |
| 200 | — | 0.752 |
| 225 | — | 0.753 |
| 250 | — | 0.746 |
| 275 | — | 0.752 |
| 300 | — | 0.740 |
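The Δ values follow from the table by subtraction at the steps where both arms have a val score. A small sketch using only the numbers above (the dictionary names are illustrative):

```python
# Held-in val scores copied from the table; steps with "—" for base are omitted.
base = {25: 0.659, 50: 0.669, 75: 0.589, 100: 0.642, 125: 0.626, 175: 0.619}
pmishift = {
    25: 0.680, 50: 0.676, 75: 0.693, 100: 0.716, 125: 0.722, 150: 0.717,
    175: 0.738, 200: 0.752, 225: 0.753, 250: 0.746, 275: 0.752, 300: 0.740,
}

# Only steps present in both arms can be compared.
shared = sorted(base.keys() & pmishift.keys())
delta = {gs: round(pmishift[gs] - base[gs], 3) for gs in shared}

for gs in shared:
    print(f"gs {gs:>3}: delta = {delta[gs]:+.3f}")
```

The gap is positive at all six shared steps, smallest at gs 50 (+0.007) and largest at gs 175 (+0.119). The gs 75 gap (+0.104) partly reflects the base arm's dip to 0.589 at that step.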
✓ same val problems: both arms use verl_val_meta_mix.parquet (identical file)
✓ same grader: math_verify _check_correctness, ±1 scale, both modes (verl_sdc.py:1249,1601; rewards.py:927)
✓ non-meta RL hyperparams identical: lr / batch64 / mini8 / n=8 / temp0.6 / clip0.2–0.28 / kl0 / len4096 / 300steps (both inherit verl_e4_selfdistill)
✓ VANILLA_GRPO clean: early-return, no teacher/meta path leaks into base (verl_sdc.py:2996, verl_sdc_utils.py:615)
✓ SFT twins 1:1: SFT-1 4264=4264 rows metadata-identical, only messages differ (meta stripped); SFT-2 1763 twin, same wrong-prefix loss-masking, same vocab 151671
✓ extraction confound ruled out: meta-strip regrade 0% flip (230 rows)
☐ held-out gs300 1030 eval PENDING (make-or-break, esp. AIME)
☐ single seed per arm (base val noisy: gs75 dip 0.589)
☐ SFT-1 LR-schedule nuance: pipeline base = 1-epoch fully-annealed cosine vs meta = epoch-1 snapshot of a 3-epoch schedule (same steps/data/peak-lr; direction favors BASE → conservative for the pro-meta claim)
☐ save_freq 5 vs 10 (operational only)
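A twin check like the "SFT twins 1:1" item can be sketched as a row-by-row comparison that ignores the one field allowed to differ. This is a minimal illustration, not the audit script that was actually run; the function name `twin_mismatches` and the field names in the toy rows are hypothetical.

```python
def twin_mismatches(rows_a, rows_b, ignore=("messages",)):
    """Return indices of rows that differ outside the ignored fields.

    An empty result means the two datasets are metadata-identical twins:
    same row count, same order, and the same value in every non-ignored field.
    """
    if len(rows_a) != len(rows_b):
        raise ValueError(f"row count differs: {len(rows_a)} vs {len(rows_b)}")

    def strip(row):
        # Drop the fields that are allowed to differ between the twins.
        return {k: v for k, v in row.items() if k not in ignore}

    return [
        i for i, (a, b) in enumerate(zip(rows_a, rows_b))
        if strip(a) != strip(b)
    ]


# Toy rows with hypothetical field names; only "messages" differs.
meta_rows = [{"id": 1, "source": "math", "messages": "with meta block"}]
base_rows = [{"id": 1, "source": "math", "messages": "meta stripped"}]
print(twin_mismatches(meta_rows, base_rows))
```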
This does not overturn the held-out results above: the gs300 held-out eval remains the make-or-break test, especially on AIME.
The earlier meta-vs-base comparison used a confounded base (different SFT and RL data), so its gaps are not attributable to the meta mechanism; the matched-base twin above removes exactly that.
One line survives every re-grade: gsm8k ≈ base everywhere (old base 93.4 vs new matched-base 89–92 — a lineage/measurement artifact, not capability loss).
→ Full result archive: experiment ladder, 1030 tables, grader fragility, mechanisms ↗
Conditional reward — the asym_cf gate and PMI-shift's bidirectional belief signal — reduces the harm meta does on hard problems.
No robust absolute win yet: meta ≈ base on easy/medium but collapses on hard (AIME 0.37 → 0.13). New (PRELIMINARY · held-in val · held-out gs300 pending): vs the matched, audited twin, pmishift leads at every same-step val point — the gs300 held-out eval decides.
(a) preserve base capability (drop easy-only meta data); (b) inject independence (tools / code-check / multi-sample) to break the self-verification ceiling.