Priming-free self-distillation

Self-Evolving Reasoning via Metacognition

A model improves its own reasoning by distilling a learning signal from contrasts in its own attempts — no external teacher, no injected priming.

The idea. The model writes <|meta|> reasoning (verify · redirect · state confidence). We never tell it what to write. Instead we compare two of its own rollouts — meta on vs. off, gold vs. decoy, before vs. after — and turn the difference into a reward. The contrast is the teacher. That self-supervised loop is the engine of self-evolving reasoning.

Figure D · The loop — self-evolving, without a teacher

no external teacher writes the meta

The engine. The model generates two attempts (meta on / meta off), a contrast scores the difference, the reward updates the model, and we loop. Over iterations the model's resting belief inches from wrong toward right. The crossed-out teacher / priming is the point: the signal comes entirely from the model's own rollouts, which is what we mean by priming-free self-distillation.

What we are training

The north-star is self-evolving reasoning: a model that gets better at math by learning about its own thinking, using only contrasts it can draw from its own rollouts.

Every reward we try — PMI, CF, gm, PMI-shift, asym_cf — is one self-distillation signal: a different contrast the model mines from its own attempts. None of them imports an external teacher or primes the model with the answer. The novelty is exactly this: the learning signal for "good metacognition" is distilled from the model itself.

Why not a gold-conditioned teacher (SDPO)? A teacher that has seen the answer can prime the meta, but imitating it suppresses the model's epistemic hedging (the wait / hmm / let me recheck) — and that hedging is what powers real reasoning. Our contrast-from-self-rollouts never primes, so the epistemic voice is preserved. That is the difference between priming and self-distillation.

The belief bar

One picture drives everything below. A horizontal axis: the left end is the decoy (a wrong answer), the right end is the gold (the correct answer). A dot marks the model's current belief — how much it favors gold over decoy. As reasoning streams, the dot moves.

Legend: gold / correct · decoy / wrong · <|meta|> block · plain reasoning

Figure A · The model's belief = logp(gold) − logp(decoy)

The dot's position is $\;\log p(\text{gold}) - \log p(\text{decoy})\;$. Right of center = leaning correct, left = leaning wrong. Every figure on this page reads belief off this same axis.
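As a minimal sketch of how the belief dot could be computed, the snippet below takes teacher-forced token log-probs for the two candidate answers and returns the gap. The function names and the probabilities are hypothetical, and summing token log-probs to score a multi-token answer is an assumption here, not something this page specifies for Figure A.

```python
import math

def sequence_logp(token_logps: list[float]) -> float:
    """Log-probability of a multi-token answer as the sum of its token log-probs (assumption)."""
    return sum(token_logps)

def belief(logp_gold: float, logp_decoy: float) -> float:
    """Position of the belief dot: positive leans gold, negative leans decoy."""
    return logp_gold - logp_decoy

# Hypothetical per-token probabilities for the two candidate answers.
gold = sequence_logp([math.log(0.6), math.log(0.9)])   # e.g. the tokens of "16"
decoy = sequence_logp([math.log(0.2), math.log(0.5)])  # e.g. the tokens of "-2"

gap = belief(gold, decoy)
print("leaning correct" if gap > 0 else "leaning wrong")
```

Because the gap is a difference of log-probs, it equals the log of the probability ratio between gold and decoy, so zero is the natural center of the bar.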

Contrastive = compare two worlds

This is the heart. The reward is not "did the model do meta?" — it is "what difference did the meta make?" To get that, we run the same problem twice and contrast the two rollouts.

Figure B · Two worlds — meta OFF vs. meta ON (the CF counterfactual)

reward = (✓ with meta) − (✗ without meta)

Same reasoning, two endings. Without meta the belief dot drifts to red and the answer is wrong (✗). With meta, the moment the blue meta block appears the dot leaps to green (✓). The bracket is the credit: the meta is rewarded for the outcome difference it caused — not for existing.

The lens got sharper

Our methodological through-line: the same rollout, measured four progressively sharper ways. Each step isolates causation a little more — the same contrastive tool, aimed more precisely.

Figure C · Contrastive evolution — correlation → outcome → preference → causal shift

The through-line. correlation → outcome → answer-preference → causal belief-shift. Same tool (contrast), progressively isolating causation — and every step reads the model's own rollouts, never an external label.

What makes a meta "good"?

A good meta is one that moves the belief dot. Most meta in our data does not — it decorates an already-correct answer. Both panels below are illustrative.

BAD META · decoration (the data is full of this)

If 2x − y = 5 and x + 2y = 5, what is x?

→ solve: x = 3 ✓ (already correct)

<|meta|> confidence: 0.78 · the approach seems right · verify by working backwards <|/meta|>

● belief dot does not move — meta added nothing

GOOD META · illustrative · catches a slip

Compute 7 − 3(2 − 5).

→ first pass: 7 − 3·(−3) = 7 − 9 = −2 ✗ (sign slip)

<|meta|> check: substitute back — 3·(2−5) = −9, so 7 − (−9) = 16, not −2. correct it. <|/meta|>

→ corrected: 16 ✓

✓ belief dot moves red → green — meta did real work

Real saves like the right panel are rare in our data (hence "illustrative"). The reward methods below all exist to find and amplify the rare good meta while not paying for the common decoration.

The five self-distillation signals

Each method is one way to mine a contrast from the model's own rollouts. Formula + a one-line intuition + a colored Why (mechanism) callout — the Why is the substance.

The three rewards differ on two axes: TARGET × CONTRAST

TARGET \ CONTRAST	meta on / off	meta vs placebo (condition)	before vs after meta (time)
own continuation	PMI: does the meta lift MY own following answer?	—	—
gold vs decoy	—	gm: does meta favor gold over a decoy?	PMI-shift: did the gold−decoy gap move?

Columns = CONTRAST axis (what is held against what); rows = TARGET (whose probability is scored).

Same contrastive idea, three knobs. PMI targets the model's own continuation under meta on/off; gm and PMI-shift both target the gold–decoy gap, but gm contrasts a condition (meta vs placebo) while PMI-shift contrasts time (before vs after the meta).

⚠ "PMI" means two different things on this page. In the PMI method it is the with-meta vs without-meta log-prob of the model's own continuation. In PMI-shift, "PMI_open / PMI_close" is the gold−decoy belief gap measured at two positions of one rollout. Same name, different quantities — keep them apart.

PMI

self-distill: own continuation, meta on vs off · always-on

Intent: reward meta that makes the model's OWN following answer likelier.

$$R = \operatorname{mean\_min}_{t\in C}\Big[\log p(c_t\mid \text{prefix}+\text{meta}) - \log p(c_t\mid \text{prefix})\Big]\;-\;\text{placebo}$$

Scores the model's OWN continuation (e.g. "So the answer is 16"), not a teacher-forced gold answer: with-arm = prefix+meta+continuation, without-arm = prefix+continuation, over the byte-identical C-span. On the worked example, the slip-correcting meta lifts every continuation token → positive PMI.

Why · Always-on: A confident meta raises the whole continuation everywhere, so PMI fires even for decorative meta. It never checks whether the meta was NEEDED, so the emission rate drifts to ~0.96 and the meta breaks problems the model had already solved. Held-out Δ +0.010.
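The per-token contrast in the PMI formula can be sketched as follows. This is illustrative only: the page does not define `mean_min`, so the sketch ASSUMES it is the average of the mean and the minimum per-token lift, and it treats the placebo term as a precomputed scalar. All names and numbers are hypothetical.

```python
def pmi_reward(with_meta: list[float], without_meta: list[float], placebo: float = 0.0) -> float:
    """Lift of the model's own continuation tokens under meta, minus a placebo baseline.

    with_meta[t]    = log p(c_t | prefix + meta)
    without_meta[t] = log p(c_t | prefix)
    Both lists must cover the same continuation span C.
    """
    if not with_meta or len(with_meta) != len(without_meta):
        raise ValueError("both arms must score the same non-empty continuation span")
    lifts = [w - wo for w, wo in zip(with_meta, without_meta)]
    mean_lift = sum(lifts) / len(lifts)
    min_lift = min(lifts)
    # ASSUMPTION: mean_min blends the average lift with the worst-token lift.
    return 0.5 * (mean_lift + min_lift) - placebo

# Hypothetical log-probs for the continuation "So the answer is 16" (3 tokens shown).
with_arm = [-0.2, -0.1, -0.3]
without_arm = [-1.0, -0.4, -0.9]
reward = pmi_reward(with_arm, without_arm, placebo=0.1)
print(reward > 0)
```

Note that nothing in this computation asks whether the answer was already correct without the meta, which is the always-on weakness described above.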

CF ★

self-distill: meta ON vs OFF · selectivity

Intent: reward meta only when it changes the OUTCOME.

$$R_{\text{meta}} \propto \mathrm{acc}(\text{with meta}) - \mathrm{acc}(\text{without meta}),\quad \text{per group; over-penalty on already-solved}$$

The counterfactual contrast (Figure B): does the meta actually flip the model's own outcome?

Why · Selectivity: Already-solved ⇒ acc_with = acc_without ⇒ zero reward for emitting ⇒ the over-penalty suppresses waste ⇒ meta fires only on the ~20% where it flips wrong→right. Best held-out Δ +0.040 (the only significant positive), emit 0.20, hard +0.052. The gain is small because the meta content is still correlated with the model's own errors: the self-verification ceiling (mechanism 5).
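A minimal sketch of the counterfactual reward for one group of rollouts on the same problem, under stated assumptions: the size of the over-penalty (`over_penalty=0.1`) and the rule that "already solved" means the no-meta arm is fully correct are both hypothetical choices, since the page gives only the proportional form.

```python
def accuracy(outcomes: list[int]) -> float:
    """Fraction of correct rollouts; outcomes are 1 (correct) or 0 (wrong)."""
    return sum(outcomes) / len(outcomes)

def cf_reward(correct_with: list[int], correct_without: list[int],
              emitted: bool, over_penalty: float = 0.1) -> float:
    """Counterfactual meta reward for one group: acc(with meta) - acc(without meta).

    ASSUMPTION: a problem counts as already solved when every no-meta rollout
    is correct; emitting meta there costs an extra `over_penalty`.
    """
    acc_with = accuracy(correct_with)
    acc_without = accuracy(correct_without)
    reward = acc_with - acc_without
    if emitted and acc_without == 1.0:
        reward -= over_penalty  # meta could not have changed the outcome
    return reward

# Meta flips most rollouts from wrong to right: rewarded.
flip = cf_reward([1, 1, 1, 1], [0, 0, 1, 0], emitted=True)
# Problem was already solved: emitting meta is taxed, abstaining is free.
waste = cf_reward([1, 1], [1, 1], emitted=True)
abstain = cf_reward([1, 1], [1, 1], emitted=False)
print(flip, waste, abstain)
```

The design point is that the reward is zero or negative whenever the meta did not change the outcome, so decoration earns nothing.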

gm — additive vs multiplicative

self-distill: gold vs decoy × meta vs placebo · always-on · coupling

Intent: self-distill via a decoy contrast; additive keeps meta as an independent head, multiplicative shackles it to correctness.

$$gm = \big[\log p(\text{gold}\mid\text{meta}) - \log p(\text{gold}\mid\text{plac})\big] - \big[\log p(\text{decoy}\mid\text{meta}) - \log p(\text{decoy}\mid\text{plac})\big]$$

$$\text{additive: } R = R_{\text{corr}} \oplus gm \qquad\qquad \text{multiplicative: } R_{\text{meta}} = R_{\text{corr}}\cdot \exp\!\big(\mathrm{sign}(A_{\text{corr}})\cdot gm\big)$$

Teacher-forces both answers and asks: which does the meta favor — gold (16) or the decoy (−2)? On the worked example the slip-correcting meta lifts 16 a lot and −2 little/negative → the difference-in-differences is positive. A decorative meta lifts both equally → gm ≈ 0. Two failure modes, one per variant.

(A) Additive · Always-on: The independent head keeps the meta signal separate, so epistemic hedging is preserved. But with no necessity check, meta still fires everywhere: held-out Δ −0.003 (neutral). The failure is mechanism (2), not coupling.

(B) Multiplicative · Correctness-coupling: Shackling meta to the correctness sign punishes good meta on wrong rollouts → the model drops hedging and asserts confidently → epistemic verbalization collapses (epi-words/meta 0.2 vs additive 1.4) → reasoning degrades (the SDPO failure mode). Held-out Δ −0.029 (net harm).
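The difference-in-differences and the multiplicative coupling can be sketched as below. The log-prob values are hypothetical, and the additive combination ⊕ is left out because the page does not define that operator. The sketch only shows how the same gm value is scaled up or down depending on the sign of the correctness advantage.

```python
import math

def gm(lp_gold_meta: float, lp_gold_plac: float,
       lp_decoy_meta: float, lp_decoy_plac: float) -> float:
    """Difference-in-differences: how much more the meta lifts gold than the decoy."""
    return (lp_gold_meta - lp_gold_plac) - (lp_decoy_meta - lp_decoy_plac)

def coupling_factor(a_corr: float, gm_value: float) -> float:
    """Multiplicative variant: exp(sign(A_corr) * gm)."""
    sign = (a_corr > 0) - (a_corr < 0)
    return math.exp(sign * gm_value)

def r_meta_multiplicative(r_corr: float, a_corr: float, gm_value: float) -> float:
    """R_meta = R_corr * exp(sign(A_corr) * gm)."""
    return r_corr * coupling_factor(a_corr, gm_value)

# Slip-correcting meta (hypothetical): lifts gold 16, pushes the decoy -2 down.
catch = gm(-0.5, -3.0, -4.0, -1.0)
# Decorative meta (hypothetical): lifts both answers equally.
decoration = gm(-1.0, -1.5, -2.0, -2.5)

# The same good meta is amplified on a positive-advantage rollout
# and shrunk on a negative-advantage one.
up = coupling_factor(+1.0, 1.0)
down = coupling_factor(-1.0, 1.0)
print(catch, decoration, up > 1 > down)
```

The last line is the coupling described in (B): whether a given meta is scaled up or down depends on the rollout's correctness sign, not only on what the meta did.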

asym_cf (now)

self-distill: asymmetric ON/OFF gate · always-on · inert-centering

Intent: a GATE that penalizes harmful meta harder than it rewards helpful meta (β > α).

$$R_{\text{gate}} = \alpha\cdot\max(0,\,c_1-c_0)\;-\;\beta\cdot\max(0,\,c_0-c_1)\;-\;\gamma\cdot\mathbf{1}[c_0\approx c_1\approx 1],\qquad \beta>\alpha$$

An asymmetric counterfactual: harmful meta costs more than helpful meta earns, so the drift to "meta everywhere" is taxed.

Why · Inert-centering (bug, now fixed): The gate value is group-CONSTANT. Centered within its own member mask, its mean equals itself, so it centered to 0 and the suppression signal silently vanished (rmeta_neg_rate = 0); emit kept drifting up to 0.92. Fix: center over the WHOLE group (non-emitting rows carry 0, breaking the symmetry), so the negative survives and emit stops drifting up. Confirmed live; suppression is still weak (γ↑ next).
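The inert-centering bug is easy to reproduce in a few lines. The sketch below is a toy reconstruction, not the training code: it assumes c1 and c0 are the group's accuracy with and without meta, and the values of α, β, γ and the group layout are hypothetical.

```python
def gate(c1: float, c0: float, alpha: float = 1.0, beta: float = 2.0,
         gamma: float = 0.5, tol: float = 1e-6) -> float:
    """Asymmetric gate: harmful meta (c0 > c1) costs beta, helpful meta earns alpha < beta."""
    both_solved = abs(c0 - 1.0) <= tol and abs(c1 - 1.0) <= tol
    return alpha * max(0.0, c1 - c0) - beta * max(0.0, c0 - c1) - gamma * float(both_solved)

def center_within_members(rewards: list[float], emitted: list[bool]) -> list[float]:
    """Buggy centering: subtract the mean taken over meta-emitting rows only."""
    members = [r for r, e in zip(rewards, emitted) if e]
    mu = sum(members) / len(members) if members else 0.0
    return [r - mu if e else 0.0 for r, e in zip(rewards, emitted)]

def center_whole_group(rewards: list[float]) -> list[float]:
    """Fixed centering: subtract the mean taken over every row in the group."""
    mu = sum(rewards) / len(rewards)
    return [r - mu for r in rewards]

# One group of four rollouts; the first two emit meta, and meta is harmful here.
g = gate(c1=0.5, c0=0.75)              # group-constant gate value
emitted = [True, True, False, False]
rewards = [g if e else 0.0 for e in emitted]

buggy = center_within_members(rewards, emitted)
fixed = center_whole_group(rewards)
print(buggy)
print(fixed)
```

With the member-only mean, every emitting row equals the mean it is centered against, so the penalty cancels exactly. Including the non-emitting rows (which carry 0) shifts the mean and leaves the emitting rows negative.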

PMI-shift (now)

self-distill: before vs after the meta · form ≠ behavior · ceiling

Intent: reward meta that causally MOVES belief toward the right answer.

$$\text{PMI}_t = \sum \big[\log p(\text{gold}) - \log p(\text{decoy})\big]@t,\qquad \text{shift} = \text{PMI}_{\text{close}} - \text{PMI}_{\text{open}}$$

$$\text{reward sign-reversal:}\quad (\text{decoy}\!\to\!\text{gold}) \;+\;,\qquad (\text{gold}\!\to\!\text{decoy}) \;-$$

A temporal contrast of the model's own belief: the gold−decoy gap at meta-OPEN vs meta-CLOSE. On the worked example the body's sign slip leaves the dot on −2 (gap negative); the meta fixes the sign and the dot leaps to 16 (gap positive) — a decoy→gold reversal, the rewarded case (Figure C, step 4).

Why · breaks Form ≠ Behavior (content level): It measures the meta's CAUSAL belief update, before vs after the meta. Re-derivation / decoration produces no shift; a genuine catch produces a decoy→gold reversal. Rewarding only the reversal targets real cognition, not shape. Live: bidirectional signal (neg_rate ~0.10), acc maintained.

Why · Self-verification ceiling (residual): The meta is still generated by the SAME model that erred, so it inherits the blind spot: the shift can confirm rather than catch. Before/after differencing only PARTLY cancels this: it removes the shared baseline, not the shared blind spot.
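A minimal sketch of the temporal contrast, assuming the gold−decoy gap has already been measured at the meta-open and meta-close positions of one rollout. The ±1 reward magnitudes and the function names are hypothetical; the page specifies only the signs of the two reversal cases.

```python
def pmi_shift(gap_open: float, gap_close: float) -> float:
    """shift = PMI_close - PMI_open, where each PMI is the gold-decoy gap at that position."""
    return gap_close - gap_open

def reversal_reward(gap_open: float, gap_close: float) -> float:
    """Reward only sign reversals of the belief gap across the meta block.

    ASSUMPTION: unit magnitudes; no reward when the sign does not flip.
    """
    if gap_open < 0 < gap_close:
        return 1.0    # decoy -> gold: the meta caught the slip
    if gap_close < 0 < gap_open:
        return -1.0   # gold -> decoy: the meta derailed a correct belief
    return 0.0        # decoration or re-derivation: belief stayed on the same side

# Worked example: the sign slip leaves belief on the decoy, the meta moves it to gold.
catch = reversal_reward(gap_open=-2.0, gap_close=3.0)
derail = reversal_reward(gap_open=3.0, gap_close=-2.0)
decoration = reversal_reward(gap_open=1.0, gap_close=1.2)
print(pmi_shift(-2.0, 3.0), catch, derail, decoration)
```

Gating on the sign flip rather than on the raw shift is what keeps a meta that merely nudges an already-correct belief from being paid.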

Reward calculator: on one illustrative sentence

The belief bar gives the intuition; this shows the computation. One shared illustrative example flows through all three rewards, token by token. Problem: 7 − 3·(2 − 5). The body makes a sign slip leaning to the decoy −2; the meta catches it and the continuation lands on gold 16.

body (sign slip) "2 − 5 = −3, 3·(−3) = −9, 7 − 9 = −2"

<|meta|> "wait — I'm SUBTRACTING 3·(2−5) = −9, so 7 − (−9) = 7 + 9 = 16, not 7 − 9. Sign error."

continuation "So the answer is 16."

PMI · with-arm vs without-arm on the model's OWN continuation

illustrative per-token log-probs · directionally correct

What each reward catches vs misses. Flip the meta to decorative ("confidence 0.78, looks right, will verify"): PMI stays positive (the model's own answer still flows) — that is its always-on weakness — while gm ≈ 0 and shift ≈ 0 (gold and decoy move together; the gap doesn't move). Only gm and PMI-shift tell a real catch from decoration.

Results

The sweet spot

"Meta helps on medium" is a result, not the intent: meta HURTS on easy (decoration), HELPS on medium (catches slips), HURTS on hard / AIME (derails). Reward shaping must select for the middle.

Experiment ladder

math_verify-graded on the 1030-eval; held-out Δ = acc_with − acc_without on the 594 confidence-rv set.

Name	Contrast (self-distill)	Result	Why
Base GRPO	final-answer correctness only (no meta)	gsm8k 0.93 · aime 0.37 · math500 0.63–0.72 (grader-sensitive)	reference
e21 (Meta-CoT)	meta inside <think> + GRPO on correctness	aime 0.13 · gsm8k 0.93 (tie) · math500 0.64–0.76	always-on: aime collapses 0.37→0.13 (robust); overall gap within grader noise
r10v2 / SDC	KL toward a gold-conditioned (priming) teacher	steering AUC ≈ 0.49 (≈ chance)	coupling: priming suppresses epistemic hedging (SDPO); not self-distill
PMI	gold vs no-meta	held-out Δ +0.010 · emit 0.96	always-on
CF ★	meta ON vs OFF; over-penalty on solved	held-out Δ +0.040 (best) · emit 0.20 · hard +0.052	selectivity
cf_group	score the presence / shape of meta	meta emission → 0 (full abstention)	form ≠ behavior: CF correctly found form added nothing
gm additive	gold vs decoy × meta vs placebo; R = R_corr ⊕ gm	held-out Δ −0.003 (neutral)	always-on: epistemic hedging preserved, no net gain
gm multiplicative	R_meta = R_corr · exp(sign(A_corr)·gm)	held-out Δ −0.029 (net-harm)	coupling: epi-words/meta 0.2 vs 1.4
asym_cf (now)	asymmetric ON/OFF gate, β>α	fix confirmed: blocks drift; suppression weak (γ↑)	always-on · inert-centering
PMI-shift (now)	gold vs decoy × before vs after meta	live: bidirectional neg_rate ~0.10 · acc maintained	form ≠ behavior · ceiling

Is meta below base? The overall gap is not robust

The sign of the base − meta overall gap flips depending on the grader (the math500 LaTeX checker swings ±10pp). So no single overall number is trustworthy. Below: base vs e21r-meta (matched data) under three consistent graders.

Grader	base	e21r-meta	overall gap
old check_correctness	0.770	0.817	meta +4.7pp
completion·math_verify	0.813	0.762	base +5.0pp
union (either correct)	0.815	0.821	meta +0.7pp (tie)

Honest reading: meta ≈ base on easy/medium; the only robust degradation is on hard (aime). The trustworthy reward-quality signal is the paired held-out Δ below, not these fragile absolutes.

Difficulty: where meta collapses

gsm8k ≈ tie · math500 within grader noise · aime collapses 0.13 vs base 0.37 (robust across every grader — the one clear degradation).

Held-out Δ ranking

Reward	Held-out Δ
CF	+0.040 (only significant +)
PMI	+0.010
gm-add	−0.003
gm-mult	−0.029

The grader is unreliable on LaTeX answers — in both directions. The math500 checker scored correct completions wrong (e.g. \frac{14}{3}, (3,π/2)) and sometimes the reverse. On base, math500 ranges 0.63 (old checker) → 0.72 (completion·math_verify). Because of this ±10pp swing, every absolute 1030 number carries a grading caveat — which is exactly why the overall base-vs-meta gap above isn't robust, and why we lean on the paired held-out Δ and the grader-independent aime collapse instead.

Why it succeeded / failed

Seven recurring mechanisms. Every method maps to ≥1. Colors are reused across the page.

(1) Selectivity ✅

Emit meta only when needed → helps. (CF)

(2) Always-on ❌

No necessity check → meta fires everywhere → breaks already-solved problems. (PMI, gm)

(3) Form ≠ Behavior ❌

Reward the SHAPE of meta, not real cognition → decoration with no causal substance. (cf_group)

(4) Correctness-coupling ❌

Tie meta reward to the correctness sign → over-confident meta → suppresses epistemic hedging → worse. (gm mult; SDPO)

(5) Self-verification ceiling ❌

Meta comes from the SAME model that erred → inherits the blind spot → confirms the error. (root cause)

(6) Easy data ❌

SFT on easy / decorative meta degrades the base's hard-math capability + no save signal. (rv data)

(7) Inert-centering ⚙️

A group-CONSTANT reward, centered within its own group, averages to itself → signal vanishes silently. (asym_cf, fixed)

▸ The data we trained on (SFT evolution)

base_sft legacy · ~5k · meta stripped

Plain solutions.

v8 series matched difficulty

v8_base_matched (6329 rows, hard 24%) vs v8_meta_inside (4264 rows, hard 31%, meta inside <think>). e21 trained here.

rv_redirect_verify 1763 rows · hard 0%

Redirect / verify scenarios + confidence labels. CF / gm / asym_cf trained here. Mechanism (6): all easy/medium.

Honest status / now

✓ Confirmed

Conditional reward — the asym_cf gate (drift blocked) and PMI-shift (bidirectional belief signal) — REDUCES the harm meta does on hard problems and stops the derail. The gate fix is live; suppression is still weak (γ↑ next).

✗ Open

Meta is roughly on par with base on easy/medium (gsm8k tie, math500 within grader noise) but robustly collapses on hard (AIME 0.37 → 0.13). That AIME drop is a CAPABILITY gap (the easy-only meta SFT degraded hard-math ability) plus the self-verification ceiling — reward shaping cannot create a capability the model lacks. We do not yet show a robust absolute WIN anywhere. No overclaiming.

→ Next levers (beyond reward shaping)

(a) preserve base capability — train meta WITHOUT easy-only data / from a strong base.
(b) inject INDEPENDENCE — external tools, code-check, multi-sample consistency — to break the self-verification ceiling.

▸ Why the ceiling caps everything

Meta is generated by the SAME model that erred, drawn from the SAME distribution, so it inherits the blind spot and tends to confirm the error rather than catch it. CF's +0.04 is small for exactly this reason: even perfectly selective, self-distilled meta is only as good as the self-check the model can produce. Independence (tools / external checks / consistency across samples) is the lever reward shaping cannot reach.