Priming-free self-distillation

Self-Evolving Reasoning via Metacognition

A model improves its own reasoning by distilling a learning signal from contrasts in its own attempts — no external teacher, no injected priming.
The idea. The model writes <|meta|> reasoning (verify · redirect · state confidence). We never tell it what to write. Instead we compare two of its own rollouts — meta on vs. off, gold vs. decoy, before vs. after — and turn the difference into a reward. The contrast is the teacher. That self-supervised loop is the engine of self-evolving reasoning.
Figure D · The loop — self-evolving, without a teacher
no external teacher writes the meta
The engine. The model generates two attempts (meta on / meta off), a contrast scores the difference, the reward updates the model, and the loop repeats. Over iterations the model's resting belief inches from wrong toward right. The crossed-out teacher / priming in the figure is the point: the signal comes entirely from the model's own rollouts, which is what priming-free self-distillation means.

What we are training

The north star is self-evolving reasoning: a model that gets better at math by learning about its own thinking, using only contrasts it can draw from its own rollouts.

Every reward we try is a self-distillation signal: a contrast mined from the model's own attempts, never an external teacher or an injected answer. A gold-conditioned teacher (SDPO) can prime the meta, but imitating it suppresses the epistemic hedging that powers real reasoning. Our contrast-from-self-rollouts never primes, so that voice is preserved.

The belief bar

One picture drives everything below. A horizontal axis: the left end is the decoy (a wrong answer), the right end is the gold (the correct answer). A dot marks the model's current belief — how much it favors gold over decoy. As reasoning streams, the dot moves.

Legend: gold / correct · decoy / wrong · <|meta|> block · plain reasoning
Figure A · The model's belief = log p(gold) − log p(decoy)
The dot's position is log p(gold) − log p(decoy). Right of center = leaning correct, left = leaning wrong. Every figure on this page reads belief off this same axis.
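
A minimal sketch of how the dot's position could be computed from per-token log-probs. The source does not say whether log p is summed or length-normalized over the answer tokens; this sketch assumes a plain sum. The function names and the numbers are hypothetical.

```python
import math

def sequence_logprob(token_logprobs):
    """Sum per-token log-probs into one sequence log-prob (assumed: no length normalization)."""
    return math.fsum(token_logprobs)

def belief(gold_token_logprobs, decoy_token_logprobs):
    """Position on the belief bar: log p(gold) - log p(decoy)."""
    return sequence_logprob(gold_token_logprobs) - sequence_logprob(decoy_token_logprobs)

# Hypothetical per-token log-probs for the two candidate answers.
gold = [-0.25, -0.125]
decoy = [-1.5, -1.0]

position = belief(gold, decoy)
print(position > 0)  # positive means the dot sits right of center (leaning gold)
```

Swapping the arguments flips the sign, which matches the axis: left of center is leaning decoy.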

Method

We tried five self-distillation signals — PMI, CF, gm, asym_cf and PMI-shift — each a different contrast mined from the model's own rollouts. PMI-shift won: reward the causal gold−decoy belief shift across the meta block. Full method cards, formulas and the experiment ladder live in the archive ↗.
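
Read literally, "the gold−decoy belief shift across the meta block" suggests a reward of roughly the following shape. This is a sketch under assumptions, not the formula from the method cards. The symbols are notation introduced here: c is the context before the meta block, m is the meta block, g is the gold answer and d is the decoy answer.

```latex
r_{\text{shift}}
  = \bigl[\log p(g \mid c, m) - \log p(d \mid c, m)\bigr]
  - \bigl[\log p(g \mid c) - \log p(d \mid c)\bigr]
```

Each bracket is the belief from Figure A, measured with and without the meta block in context. Under this reading the reward is positive when the meta block moves belief toward gold, negative when it moves belief toward the decoy, and zero when gold and decoy shift by the same amount.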

Reward calculator — on one real sentence

The computation, token by token. PMI and gm run on a shared illustrative example (7 − 3·(2 − 5), sign slip → decoy −2, caught → gold 16); the PMI-shift tab replays a real rollout step by step.

body (sign slip) "2 − 5 = −3, 3·(−3) = −9, 7 − 9 = −2"
<|meta|> "wait — I'm SUBTRACTING 3·(2−5) = −9, so 7 − (−9) = 7 + 9 = 16, not 7 − 9. Sign error."
continuation "So the answer is 16."
PMI · with-arm vs without-arm on the model's OWN continuation
illustrative per-token log-probs · directionally correct
What each reward catches vs misses. Flip the meta to decorative ("confidence 0.78, looks right, will verify"): PMI stays positive (the model's own answer still flows) — that is its always-on weakness — while gm ≈ 0 and shift ≈ 0 (gold and decoy move together; the gap doesn't move). Only gm and PMI-shift tell a real catch from decoration.
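
The contrast between PMI and the belief shift can be sketched with two toy cases. This assumes PMI is the with-arm minus without-arm log-prob of the model's own continuation, and that the shift is the change in the gold−decoy gap between the same two arms. All class names, function names and log-prob values are hypothetical; gm is left out because its formula is not given here.

```python
from dataclasses import dataclass

@dataclass
class Arm:
    """Sequence log-probs measured in one arm (with or without the meta block)."""
    logp_own: float    # model's own continuation
    logp_gold: float   # gold answer
    logp_decoy: float  # decoy answer

def pmi(with_meta: Arm, without_meta: Arm) -> float:
    """How much the meta block raises the log-prob of the model's own continuation."""
    return with_meta.logp_own - without_meta.logp_own

def belief_shift(with_meta: Arm, without_meta: Arm) -> float:
    """Change in the gold-minus-decoy gap caused by the meta block."""
    gap_with = with_meta.logp_gold - with_meta.logp_decoy
    gap_without = without_meta.logp_gold - without_meta.logp_decoy
    return gap_with - gap_without

# Real catch: the meta block flips the model from the decoy to gold.
catch_with = Arm(logp_own=-0.25, logp_gold=-0.25, logp_decoy=-5.0)
catch_without = Arm(logp_own=-4.0, logp_gold=-4.0, logp_decoy=-0.5)

# Decorative meta: everything gets slightly more likely, the gap does not move.
deco_with = Arm(logp_own=-0.25, logp_gold=-4.25, logp_decoy=-0.25)
deco_without = Arm(logp_own=-0.75, logp_gold=-4.75, logp_decoy=-0.75)

for name, w, wo in [("catch", catch_with, catch_without),
                    ("decorative", deco_with, deco_without)]:
    print(name, "PMI:", pmi(w, wo), "shift:", belief_shift(w, wo))
```

In the decorative case PMI is still positive while the shift is zero, which is the always-on weakness described above.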

Matched-base: the de-confounded twin

PRELIMINARY · held-in val · held-out gs300 pending
The base twin is still training: currently at gs220, next val at gs225, target gs300.

The old base-vs-meta comparison was confounded — different SFT and RL data (the old-base mystery ↗). The twin base = pmishift minus only the meta mechanism: same data, same hyperparameters, same grader, audited.

val accuracy by global step — matched twins, held-in
faint thin lines = other reward arms (gm_rlsd, asymcf_v2, stage3b) — not the base twin; the headline claim is vs base only

Δ = pmishift − base at shared val steps (held-in)

PRELIMINARY · held-in val · held-out gs300 pending. pmishift is above base at every same-step val point, and the gap widens from +0.021 at gs25 to +0.119 at gs175, though not monotonically. Macro accuracy over 9 val domains (verl_val_meta_mix), runs merged by global_step.
Raw per-step val numbers (fallback table)
gs     base     pmishift
25     0.659    0.680
50     0.669    0.676
75     0.589    0.693
100    0.642    0.716
125    0.626    0.722
150    -        0.717
175    0.619    0.738
200    -        0.752
225    -        0.753
250    -        0.746
275    -        0.752
300    -        0.740
(- = no base val at that step)
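
The Δ series can be reproduced from the raw numbers with a join on global step. The sketch below is an illustration, not the page's plotting code. It copies only a subset of rows from the fallback table: the steps where both arms list a value, plus the gs225 to gs300 pmishift rows to show how unshared steps drop out. The gs150 and gs200 rows, which list a single value, are left out.

```python
# Macro val accuracy by global step, copied from the fallback table.
base = {25: 0.659, 50: 0.669, 75: 0.589, 100: 0.642, 125: 0.626, 175: 0.619}
pmishift = {25: 0.680, 50: 0.676, 75: 0.693, 100: 0.716, 125: 0.722,
            175: 0.738, 225: 0.753, 250: 0.746, 275: 0.752, 300: 0.740}

# Only steps present in both runs count as same-step comparisons.
shared_steps = sorted(base.keys() & pmishift.keys())

# Delta = pmishift - base, rounded to the table's three decimals.
delta = {gs: round(pmishift[gs] - base[gs], 3) for gs in shared_steps}

for gs in shared_steps:
    print(f"gs{gs}: {delta[gs]:+.3f}")

print("pmishift ahead at every shared step:", all(d > 0 for d in delta.values()))
```

The smallest gap is at gs50 and the largest at gs175, so the widening is a trend across the run rather than a step-by-step increase.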
gs175 same-step per-domain accuracy — base (gray) vs pmishift (green)
Sorted by gap. pmishift is ahead on all 8 domains shown (8 of the 9 val domains). The largest gains are number_theory (+0.26) and intermediate_algebra (+0.16); precalculus and counting_and_probability tie at +0.14. PRELIMINARY · held-in val · held-out gs300 pending.
what pmishift actually does — gs300, 230 rollouts (4k eval), regex-cue based (approximate)
The aggregate meta-vs-no-meta accuracy is a Simpson-style artifact. Emission rises monotonically with difficulty (10%→98%), so pooled accuracy compares easy no-meta problems against hard meta ones. Stratified, the picture flips by band: on mid-hard problems (Q3) meta-bearing rollouts WIN (+0.16); on the hardest band meta is nearly the only behavior. Small strata (n≤11) are noisy; length is a difficulty proxy; the causal verdict stays with the same-problem twin comparison above.
Meta fires on ~51% of rollouts (avg_meta_blocks 1.25); among meta rollouts, verify dominates (categories overlap, so bars don't sum to 1). PRELIMINARY · held-in val · held-out gs300 pending.
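
The Simpson-style reversal is easy to reproduce with toy counts. The numbers below are hypothetical and chosen only to mimic the pattern described: meta emission is rare on easy problems and common on hard ones. They are not the gs300 rollout data, and the two-band split is a simplification.

```python
# (difficulty band, has_meta) -> (n_rollouts, n_correct); hypothetical counts.
counts = {
    ("easy", False): (90, 72),
    ("easy", True): (10, 9),
    ("hard", False): (10, 2),
    ("hard", True): (90, 27),
}

def accuracy(cells):
    """Accuracy over a list of (n_rollouts, n_correct) cells."""
    total = sum(n for n, _ in cells)
    correct = sum(k for _, k in cells)
    return correct / total

def pooled_gap(table):
    """Meta minus no-meta accuracy, ignoring difficulty."""
    meta = accuracy([v for (_, has_meta), v in table.items() if has_meta])
    no_meta = accuracy([v for (_, has_meta), v in table.items() if not has_meta])
    return meta - no_meta

def stratified_gaps(table):
    """Meta minus no-meta accuracy within each difficulty band."""
    bands = sorted({band for band, _ in table})
    return {b: accuracy([table[(b, True)]]) - accuracy([table[(b, False)]])
            for b in bands}

print("pooled gap:", round(pooled_gap(counts), 3))
print("per-band gaps:", {b: round(g, 3) for b, g in stratified_gaps(counts).items()})
```

Pooled, meta looks far worse because most meta rollouts sit in the hard band; within each band the toy meta rollouts are ahead. A pooled gap of this kind says more about where meta fires than about whether it helps.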

Audit ledger — is the twin really matched?

✓ Verified (audited)

✓ same val problems: both arms use verl_val_meta_mix.parquet (identical file)
✓ same grader: math_verify _check_correctness, ±1 scale, both modes (verl_sdc.py:1249,1601; rewards.py:927)
✓ non-meta RL hyperparams identical: lr / batch64 / mini8 / n=8 / temp0.6 / clip0.2–0.28 / kl0 / len4096 / 300steps (both inherit verl_e4_selfdistill)
✓ VANILLA_GRPO clean: early-return, no teacher/meta path leaks into base (verl_sdc.py:2996, verl_sdc_utils.py:615)
✓ SFT twins 1:1: SFT-1 4264=4264 rows metadata-identical, only messages differ (meta stripped); SFT-2 1763 twin, same wrong-prefix loss-masking, same vocab 151671
✓ extraction confound ruled out: meta-strip regrade 0% flip (230 rows)
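
A check like the "SFT twins 1:1" item can be sketched as a row-by-row comparison that ignores only the messages field. This is an illustration, not the audit script: the row schema, the field names other than messages, and the function name are hypothetical.

```python
def twin_mismatches(base_rows, meta_rows, ignore=("messages",)):
    """Return indices of rows that differ in any field outside `ignore`."""
    if len(base_rows) != len(meta_rows):
        raise ValueError(f"row count differs: {len(base_rows)} vs {len(meta_rows)}")

    def metadata(row):
        # Keep every field except the ones allowed to differ between twins.
        return {key: value for key, value in row.items() if key not in ignore}

    return [i for i, (b, m) in enumerate(zip(base_rows, meta_rows))
            if metadata(b) != metadata(m)]

# Hypothetical two-row datasets: same metadata, different messages.
base_rows = [
    {"id": 0, "domain": "algebra", "messages": ["plain reasoning"]},
    {"id": 1, "domain": "geometry", "messages": ["plain reasoning"]},
]
meta_rows = [
    {"id": 0, "domain": "algebra", "messages": ["reasoning with <|meta|> block"]},
    {"id": 1, "domain": "geometry", "messages": ["reasoning with <|meta|> block"]},
]

print(twin_mismatches(base_rows, meta_rows))  # an empty list means the twins match
```

A non-empty result lists the row indices to inspect, and a row-count mismatch fails early instead of silently truncating the comparison.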

☐ Open

☐ held-out gs300 1030 eval PENDING (make-or-break, esp. AIME)
☐ single seed per arm (base val noisy: gs75 dip 0.589)
☐ SFT-1 LR-schedule nuance: pipeline base = 1-epoch fully-annealed cosine vs meta = epoch-1 snapshot of a 3-epoch schedule (same steps/data/peak-lr; direction favors BASE → conservative for the pro-meta claim)
☐ save_freq 5 vs 10 (operational only)

This does not overturn the earlier held-out results: the gs300 held-out eval is the make-or-break, especially on AIME.

Previous results — superseded

The earlier meta-vs-base comparison used a confounded base (different SFT and RL data), so its gaps are not attributable to the meta mechanism; the matched-base twin above removes exactly that.

One line survives every re-grade: gsm8k ≈ base everywhere (old base 93.4 vs new matched-base 89–92 — a lineage/measurement artifact, not capability loss).

→ Full result archive: experiment ladder, 1030 tables, grader fragility, mechanisms ↗

Honest status / now

✓ Confirmed

Conditional reward — the asym_cf gate and PMI-shift's bidirectional belief signal — reduces the harm meta does on hard problems.

✗ Open

No robust absolute win yet: meta ≈ base on easy/medium but collapses on hard (AIME 0.37 → 0.13). New (PRELIMINARY · held-in val · held-out gs300 pending): vs the matched, audited twin, pmishift leads at every same-step val point — the gs300 held-out eval decides.