Metacog — result archive

Note. Historical results & deep-dives. The headline comparison lives on the main page — these earlier comparisons used a CONFOUNDED base (different SFT/RL data) and are superseded by the matched-base twin.

Contrastive = compare two worlds

This is the heart. The reward is not "did the model do meta?" — it is "what difference did the meta make?" To get that, we run the same problem twice and contrast the two rollouts.

Figure B · Two worlds — meta OFF vs. meta ON (the CF counterfactual)

reward = (✓ with meta) − (✗ without meta)

Same reasoning, two endings. Without meta the belief dot drifts to red and the answer is wrong (✗). With meta, the moment the blue meta block appears the dot leaps to green (✓). The bracket is the credit: the meta is rewarded for the outcome difference it caused — not for existing.

The lens got sharper

Our methodological through-line: the same rollout, measured four progressively sharper ways. Each step isolates causation a little more — the same contrastive tool, aimed more precisely.

Figure C · Contrastive evolution — correlation → outcome → preference → causal shift

The through-line. correlation → outcome → answer-preference → causal belief-shift. Same tool (contrast), progressively isolating causation — and every step reads the model's own rollouts, never an external label.

What makes a meta "good"?

A good meta is one that moves the belief dot. Most meta in our data does not — it decorates an already-correct answer. Both panels below are illustrative.

BAD META · decoration (the data is full of this)

If 2x − y = 5 and x + 2y = 5, what is x?

→ solve: x = 3 ✓ (already correct)

<|meta|> confidence: 0.78 · the approach seems right · verify by working backwards <|/meta|>

● belief dot does not move — meta added nothing

GOOD META · illustrative · catches a slip

Compute 7 − 3(2 − 5).

→ first pass: 7 − 3·(−3) = 7 − 9 = −2 ✗ (sign slip)

<|meta|> check: substitute back — 3·(2−5) = −9, so 7 − (−9) = 16, not −2. correct it. <|/meta|>

→ corrected: 16 ✓

✓ belief dot moves red → green — meta did real work

Real saves like the right panel are rare in our data (hence "illustrative"). The reward methods below all exist to find and amplify the rare good meta while not paying for the common decoration.

The five self-distillation signals — full detail

Each method is one way to mine a contrast from the model's own rollouts. Each entry gives the formula, a one-line intuition, and a Why (mechanism) callout; the Why is the substance.

The three rewards differ on two axes — TARGET × CONTRAST

TARGET \ CONTRAST	meta on / off	meta vs placebo (condition)	before vs after meta (time)
own continuation	PMI: does the meta lift MY own following answer?	(none)	(none)
gold vs decoy	(none)	gm: does meta favor gold over a decoy?	PMI-shift: did the gold−decoy gap move?

Columns = CONTRAST axis (what is held against what); rows = TARGET (whose probability is scored).

Same contrastive idea, three knobs. PMI targets the model's own continuation under meta on/off; gm and PMI-shift both target the gold–decoy gap, but gm contrasts a condition (meta vs placebo) while PMI-shift contrasts time (before vs after the meta).

⚠ Two "PMI"s: the PMI method scores the model's own continuation (meta on/off); PMI-shift's PMI_open/PMI_close is the gold−decoy gap at two positions of one rollout.

PMI

self-distill: own continuation, meta on vs off · always-on

Intent: reward meta that makes the model's OWN following answer likelier.

$$R = \operatorname{mean\_min}_{t\in C}\Big[\log p(c_t\mid \text{prefix}+\text{meta}) - \log p(c_t\mid \text{prefix})\Big]\;-\;\text{placebo}$$

Scores the model's own continuation over the byte-identical span, with-meta vs without-meta; the slip-correcting meta lifts every token → positive PMI.

Why · Always-on. No necessity check: confident meta lifts everything, emission drifts to ~0.96, and already-solved problems break. Held-out Δ +0.010.
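The PMI formula above can be sketched in a few lines. This is a minimal illustration, not the training code: the per-token log-probabilities are made-up numbers, `placebo_lift` is a hypothetical pre-computed baseline, and because the source does not spell out the `mean_min` aggregator, the sketch takes the aggregator as a parameter (`agg`) and defaults to a plain mean.

```python
from statistics import mean
from typing import Callable, Sequence


def pmi_reward(
    logp_with_meta: Sequence[float],
    logp_without_meta: Sequence[float],
    placebo_lift: float = 0.0,
    agg: Callable[[Sequence[float]], float] = mean,
) -> float:
    """Aggregate the per-token log-prob lift of the same continuation, minus a placebo baseline.

    Both sequences must score the identical continuation tokens: once with the
    meta block in the prefix, once without it.
    """
    if not logp_with_meta or len(logp_with_meta) != len(logp_without_meta):
        raise ValueError("both sequences must score the same non-empty token span")
    lifts = [w - wo for w, wo in zip(logp_with_meta, logp_without_meta)]
    return agg(lifts) - placebo_lift


# Hypothetical log-probs for a three-token continuation.
with_meta = [-0.25, -0.5, -0.25]
without_meta = [-1.0, -1.0, -0.75]

print(pmi_reward(with_meta, without_meta, placebo_lift=0.25))           # mean aggregator
print(pmi_reward(with_meta, without_meta, placebo_lift=0.25, agg=min))  # worst-token aggregator
```

Nothing in this score checks whether the answer was already right without the meta, which is the always-on weakness named in the callout: any confident meta that lifts the continuation is paid.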

CF ★

self-distill: meta ON vs OFF · selectivity

Intent: reward meta only when it changes the OUTCOME.

$$R_{\text{meta}} \propto \mathrm{acc}(\text{with meta}) - \mathrm{acc}(\text{without meta}),\quad \text{per group; over-penalty on already-solved}$$

The counterfactual contrast (Figure B): does the meta actually flip the model's own outcome?

Why · Selectivity. Zero reward on already-solved + over-penalty ⇒ meta fires only where it flips wrong→right (emit 0.20). Best held-out Δ +0.040 (hard +0.052); capped by the self-verification ceiling (mechanism 5).
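A rough sketch of the CF contrast, under stated assumptions: each problem has a group of rollouts with meta and a matched group without, the reward is the accuracy difference, and a problem counts as "already solved" when the no-meta group is fully correct. The `over_penalty` value and the `solved_threshold` are hypothetical; the source gives only the proportional form.

```python
from typing import Sequence


def cf_reward(
    with_meta_correct: Sequence[bool],
    without_meta_correct: Sequence[bool],
    over_penalty: float = 0.5,      # hypothetical size of the tax on already-solved problems
    solved_threshold: float = 1.0,  # hypothetical: "already solved" = no-meta accuracy at or above this
) -> float:
    """Counterfactual meta reward for one problem group: acc(with meta) - acc(without meta)."""
    if not with_meta_correct or not without_meta_correct:
        raise ValueError("both rollout groups must be non-empty")
    acc_with = sum(with_meta_correct) / len(with_meta_correct)
    acc_without = sum(without_meta_correct) / len(without_meta_correct)
    delta = acc_with - acc_without
    if acc_without >= solved_threshold:
        # Meta cannot help here (delta <= 0), and emitting it anyway is taxed.
        return delta - over_penalty
    return delta


# Meta flips wrong -> right on a hard problem: positive reward.
print(cf_reward([True, True, True, False], [False, False, True, False]))  # 0.5
# Decoration on an already-solved problem: zero gain, minus the over-penalty.
print(cf_reward([True, True, True, True], [True, True, True, True]))      # -0.5
```

The design choice this illustrates is that the reward is attached to the outcome difference, so a meta block that merely accompanies a correct answer earns nothing.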

gm — additive vs multiplicative

self-distill: gold vs decoy × meta vs placebo · always-on · coupling

Intent: self-distill via a decoy contrast; additive keeps meta as an independent head, multiplicative shackles it to correctness.

$$gm = \big[\log p(\text{gold}\mid\text{meta}) - \log p(\text{gold}\mid\text{plac})\big] - \big[\log p(\text{decoy}\mid\text{meta}) - \log p(\text{decoy}\mid\text{plac})\big]$$

$$\text{additive: } R = R_{\text{corr}} \oplus gm \qquad\qquad \text{multiplicative: } R_{\text{meta}} = R_{\text{corr}}\cdot \exp\!\big(\mathrm{sign}(A_{\text{corr}})\cdot gm\big)$$

Teacher-force both answers: a real catch lifts gold, not the decoy (gm > 0); decoration lifts both equally (gm ≈ 0). Two variants, two failure modes.

(A) Additive · Always-on. The independent head keeps the meta signal separate, so epistemic hedging is preserved. But with no necessity check, meta still fires everywhere: held-out Δ −0.003 (neutral). Failure is mechanism (2), not coupling.

(B) Multiplicative · Correctness-coupling. Coupling meta to the correctness sign collapses epistemic hedging (epi-words/meta 0.2 vs 1.4) → held-out Δ −0.029 (net-harm; SDPO effect).
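The gm contrast and the multiplicative variant can be written out directly from the formulas above. The log-probabilities below are invented for illustration, and the additive combinator ⊕ is left out because the source does not define it.

```python
import math


def gm(
    logp_gold_meta: float,
    logp_gold_plac: float,
    logp_decoy_meta: float,
    logp_decoy_plac: float,
) -> float:
    """Lift the meta gives the gold answer (vs placebo) minus the lift it gives the decoy."""
    return (logp_gold_meta - logp_gold_plac) - (logp_decoy_meta - logp_decoy_plac)


def multiplicative_meta_reward(r_corr: float, a_corr: float, gm_value: float) -> float:
    """R_meta = R_corr * exp(sign(A_corr) * gm), as in the multiplicative formula."""
    sign = (a_corr > 0) - (a_corr < 0)  # -1, 0 or +1
    return r_corr * math.exp(sign * gm_value)


# A real catch: the meta lifts gold and pushes the decoy down.
catch = gm(-0.5, -2.0, -3.0, -1.0)
# Decoration: the meta lifts both answers by the same amount.
decoration = gm(-0.5, -1.0, -1.5, -2.0)
print(catch, decoration)  # 3.5 0.0

# With a negative correctness advantage the same gm shrinks the reward.
print(multiplicative_meta_reward(1.0, -0.5, 1.0))
```

The second function shows where the coupling enters: the sign of the correctness advantage decides whether gm amplifies or shrinks the reward, so the meta signal no longer stands on its own.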

asym_cf now

self-distill: asymmetric ON/OFF gate · always-on · inert-centering

Intent: a GATE that penalizes harmful meta harder than it rewards helpful meta (β > α).

$$R_{\text{gate}} = \alpha\cdot\max(0,\,c_1-c_0)\;-\;\beta\cdot\max(0,\,c_0-c_1)\;-\;\gamma\cdot\mathbf{1}[c_0\approx c_1\approx 1],\qquad \beta>\alpha$$

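An asymmetric counterfactual: harmful meta costs more than helpful meta earns, so the drift to "meta everywhere" is taxed. Read from the formula, c_1 and c_0 are the correctness with and without meta, and the γ term charges meta on problems that are solved either way.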

Why · Inert-centering (bug, now fixed). A group-constant gate, centered within its own mask, averaged to 0, so the suppression signal silently vanished (rmeta_neg_rate = 0, emit → 0.92). Fixed by whole-group centering; suppression is still weak (raising γ is next).
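The gate formula above translates almost term by term into code. In this sketch c1 and c0 are taken to be the correctness with and without meta (an interpretation of the formula, not stated in the source), and the default α, β, γ and the tolerance for "≈" are hypothetical values chosen only so that β > α holds.

```python
def gate_reward(
    c1: float,
    c0: float,
    alpha: float = 1.0,  # hypothetical weight on helpful meta
    beta: float = 2.0,   # hypothetical weight on harmful meta (must exceed alpha)
    gamma: float = 0.5,  # hypothetical tax on meta for problems solved either way
    tol: float = 1e-6,   # hypothetical tolerance for the "approximately equal" test
) -> float:
    """R_gate = alpha*max(0, c1-c0) - beta*max(0, c0-c1) - gamma*1[c0 ~ c1 ~ 1]."""
    if not beta > alpha:
        raise ValueError("the gate is asymmetric only when beta > alpha")
    gain = alpha * max(0.0, c1 - c0)
    harm = beta * max(0.0, c0 - c1)
    solved_either_way = abs(c0 - c1) <= tol and abs(c1 - 1.0) <= tol
    return gain - harm - (gamma if solved_either_way else 0.0)


print(gate_reward(1.0, 0.0))  # meta fixed the answer:      1.0
print(gate_reward(0.0, 1.0))  # meta broke the answer:     -2.0
print(gate_reward(1.0, 1.0))  # solved with or without it: -0.5
print(gate_reward(0.0, 0.0))  # wrong with or without it:   0.0
```

The four cases show the asymmetry: breaking an answer costs twice what fixing one earns, and meta on an already-solved problem is charged rather than ignored.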

PMI-shift now

self-distill: before vs after the meta · form ≠ behavior · ceiling

Intent: reward meta that causally MOVES belief toward the right answer.

$$\text{PMI}_t = \sum \big[\log p(\text{gold}) - \log p(\text{decoy})\big]@t,\qquad \text{shift} = \text{PMI}_{\text{close}} - \text{PMI}_{\text{open}}$$

$$\text{reward sign-reversal:}\quad (\text{decoy}\!\to\!\text{gold}) \;+\;,\qquad (\text{gold}\!\to\!\text{decoy}) \;-$$

The gold−decoy gap at meta-OPEN vs meta-CLOSE: a decoy→gold reversal is the rewarded case. Step through a real rollout in the calculator on the main page.

Why · breaks Form ≠ Behavior (content level). It measures the meta's CAUSAL belief update, before vs after the meta. Re-derivation / decoration produces no shift; a genuine catch produces a decoy→gold reversal. Rewarding only the reversal targets real cognition, not shape. Live: bidirectional signal (neg_rate ~0.10), acc maintained.

Why · Self-verification ceiling (residual). The meta is still generated by the SAME model that erred, so it inherits the blind spot, and the shift can confirm rather than catch. Before/after differencing only PARTLY cancels this: it removes the shared baseline, not the shared blind spot.
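The PMI-shift quantities can be sketched as follows. The assumptions are loud ones: the sum in PMI_t is taken over the answer tokens of gold and decoy separately (they may differ in length), a "reversal" is read as the gap changing sign between meta-open and meta-close, and the function returns only the sign of the reward because the source does not say how magnitude is used. All log-probabilities are invented.

```python
from typing import Sequence


def gold_decoy_gap(logp_gold: Sequence[float], logp_decoy: Sequence[float]) -> float:
    """PMI_t: summed log-prob of the gold answer minus summed log-prob of the decoy."""
    return sum(logp_gold) - sum(logp_decoy)


def pmi_shift(pmi_open: float, pmi_close: float) -> float:
    """shift = PMI_close - PMI_open, measured within one rollout."""
    return pmi_close - pmi_open


def reversal_sign(pmi_open: float, pmi_close: float) -> int:
    """+1 for a decoy->gold reversal, -1 for gold->decoy, 0 when the preferred answer is unchanged."""
    if pmi_open < 0 < pmi_close:
        return 1
    if pmi_open > 0 > pmi_close:
        return -1
    return 0


# At meta-open the model prefers the decoy; at meta-close it prefers gold.
pmi_open = gold_decoy_gap([-2.0, -1.5], [-0.5, -0.5])     # -2.5
pmi_close = gold_decoy_gap([-0.25, -0.25], [-2.0, -1.0])  #  2.5
print(pmi_shift(pmi_open, pmi_close), reversal_sign(pmi_open, pmi_close))  # 5.0 1

# Decoration: gold was already preferred and stays preferred, so no reversal.
print(reversal_sign(1.0, 1.5))  # 0
```

The second call is the decoration case from the Why callout: the gap may drift, but the preferred answer never changes, so nothing is rewarded.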

Results

The sweet spot

"Meta helps on medium" is a result, not the intent — reward shaping must select for the middle.

Experiment ladder

Accuracies are math_verify-graded on the 1030-problem eval; held-out Δ = acc_with − acc_without on the 594-item confidence-rv set.

Name	Contrast (self-distill)	Result	Why
Base GRPO	final-answer correctness only (no meta)	gsm8k 0.93 · aime 0.37 · math500 0.63–0.72 (grader-sensitive)	reference
e21 (Meta-CoT)	meta inside <think> + GRPO on correctness	aime 0.13 · gsm8k 0.93 (tie) · math500 0.64–0.76	always-on: aime collapses 0.37→0.13 (robust); overall gap within grader noise
r10v2 / SDC	KL toward a gold-conditioned (priming) teacher	steering AUC ≈ 0.49 (≈ chance)	coupling: priming suppresses epistemic (SDPO); not self-distill
PMI	gold vs no-meta	held-out Δ +0.010 · emit 0.96	always-on
CF ★	meta ON vs OFF; over-penalty on solved	held-out Δ +0.040 (best) · emit 0.20 · hard +0.052	selectivity
cf_group	score the presence / shape of meta	meta emission → 0 (full abstention)	form ≠ behavior: CF correctly found form added nothing
gm additive	gold vs decoy × meta vs placebo; R = R_corr ⊕ gm	held-out Δ −0.003 (neutral)	always-on: epistemic preserved, no net gain
gm multiplicative	R_meta = R_corr · exp(sign(A_corr)·gm)	held-out Δ −0.029 (net-harm)	coupling: epi-words/meta 0.2 vs 1.4
asym_cf now	asymmetric ON/OFF gate, β>α	fix confirmed: blocks drift; suppression weak (γ↑)	always-on · inert
PMI-shift now	gold vs decoy × before vs after meta	held-out Δ +0.019 (2nd) · 1030 overall 0.786 (closest to base) · gsm8k 0.94 ≥ base · matched-base twin: above base at every same-step val point (PRELIMINARY, held-in)	selectivity · ceiling: narrows the gap most; hard-math still lost
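The held-out Δ in the Result column is a paired quantity: the same items are graded with and without meta. A small sketch, using invented per-item outcomes, shows the definition and a useful identity: Δ equals (saves − breaks) / n, where a save is a wrong→right flip and a break is a right→wrong flip. The function name and return shape are illustrative only.

```python
from typing import Sequence, Tuple


def paired_delta(
    with_meta: Sequence[bool], without_meta: Sequence[bool]
) -> Tuple[float, int, int]:
    """Return (acc_with - acc_without, saves, breaks) over the same items."""
    if not with_meta or len(with_meta) != len(without_meta):
        raise ValueError("outcomes must be paired item by item")
    n = len(with_meta)
    saves = sum(w and not wo for w, wo in zip(with_meta, without_meta))   # wrong -> right
    breaks = sum(wo and not w for w, wo in zip(with_meta, without_meta))  # right -> wrong
    delta = (sum(with_meta) - sum(without_meta)) / n
    return delta, saves, breaks


# Hypothetical outcomes on five paired items.
with_meta = [True, True, False, True, False]
without_meta = [True, False, True, False, False]

delta, saves, breaks = paired_delta(with_meta, without_meta)
print(delta, saves, breaks)  # 0.2 2 1
```

Reporting saves and breaks next to Δ makes it visible whether a small positive Δ comes from a few clean saves or from many saves cancelled by many breaks.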

Absolute accuracy on 1030, by method & step (math_verify re-grade)

Same 1030 benchmark, math_verify re-grade. None beats old Base GRPO overall; the gap is all hard-math; PMI-shift narrows it most (−4.3pp) at matched step 300. But this base carries a lineage confound — see the matched-base twin on the main page.

Method	step	gsm8k	math500	aime	overall	vs base
Base GRPO	300	0.934	0.752	0.367	0.829	reference
CF	130	0.920	0.604	0.100	0.743	−8.6pp
PMI	200	0.930	0.632	0.133	0.762	−6.7pp
gm additive	170	0.930	0.608	0.200	0.752	−7.7pp
gm multiplicative	301	0.928	0.638	0.233	0.767	−6.2pp
PMI-shift ★	300	0.940	0.666	0.233	0.786	−4.3pp

Caveats: steps differ (not a clean method ranking; the clean signal is the paired held-out Δ) and aime is n=30 (±1 problem = ±3.3pp). Robust: gsm8k ≈ base everywhere; every method collapses on aime vs base 0.37.
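As a sanity check on the table above, the overall column can be reproduced from the per-suite accuracies, assuming "overall" is the pooled accuracy over 500 gsm8k + 500 math500 + 30 aime problems (the 1030 composition stated elsewhere on this page). The pooling rule is an assumption; it does match the printed numbers to three decimals.

```python
SUITE_SIZES = {"gsm8k": 500, "math500": 500, "aime": 30}  # sums to 1030

# Per-suite accuracies copied from the table above.
TABLE = {
    "Base GRPO": {"gsm8k": 0.934, "math500": 0.752, "aime": 0.367},
    "CF": {"gsm8k": 0.920, "math500": 0.604, "aime": 0.100},
    "PMI": {"gsm8k": 0.930, "math500": 0.632, "aime": 0.133},
    "gm additive": {"gsm8k": 0.930, "math500": 0.608, "aime": 0.200},
    "gm multiplicative": {"gsm8k": 0.928, "math500": 0.638, "aime": 0.233},
    "PMI-shift": {"gsm8k": 0.940, "math500": 0.666, "aime": 0.233},
}


def pooled_accuracy(per_suite: dict) -> float:
    """Size-weighted accuracy over all suites."""
    total = sum(SUITE_SIZES.values())
    return sum(per_suite[name] * n for name, n in SUITE_SIZES.items()) / total


def gap_pp(method: str, base: str = "Base GRPO") -> float:
    """Gap to base in percentage points, rounded to one decimal."""
    return round(100 * (pooled_accuracy(TABLE[method]) - pooled_accuracy(TABLE[base])), 1)


for name in TABLE:
    print(f"{name:18s} overall={pooled_accuracy(TABLE[name]):.3f} vs base={gap_pp(name):+.1f}pp")

# One aime problem out of 30 moves aime accuracy by about 3.3pp.
print(round(100 / SUITE_SIZES["aime"], 1))
```

Because aime contributes only 30 of 1030 problems, even the large aime collapse moves the pooled number by about a point; most of each method's overall gap to base comes from math500.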

Is meta below base? The overall gap is not robust

The sign of the base − meta gap flips with the grader (the math500 LaTeX checker swings ±10pp). Results under three consistent graders are below.

Grader	base	e21r-meta	overall gap
old check_correctness	0.770	0.817	meta +4.7pp
completion·math_verify	0.813	0.762	base +5.0pp
union (either correct)	0.815	0.821	−0.7pp (tie)

Honest reading: meta ≈ base on easy/medium; the only robust degradation is on hard (aime). The trustworthy reward-quality signal is the paired held-out Δ below, not these fragile absolutes.

Difficulty: where meta collapses

gsm8k ≈ tie · math500 within grader noise · aime collapses 0.13 vs base 0.37 (robust across every grader — the one clear degradation).

Held-out Δ ranking

Method	held-out Δ	note
CF	+0.040	only significant positive Δ
PMI-shift	+0.019	2nd · step 300
PMI	+0.010	
gm-add	−0.003	
gm-mult	−0.029	

The math500 checker mis-scores LaTeX answers in both directions (base math500 ranges 0.63–0.72), so every absolute 1030 number carries a grading caveat — hence we lean on the paired held-out Δ and the grader-independent aime collapse.

Why does the new base look "worse" than the old base GRPO? A measurement artifact, not capability loss.

metric	old base GRPO (0410)	new matched-base
gsm8k	0.934	0.89–0.92
overall	0.770 (1030 held-out: gsm500+math500+aime30, 16k)	0.62 macro (9-domain held-in incl. precalc 0.24 / int-alg 0.38)

The old base also used a different SFT (legacy base_sft 4996) and different RL data (redirect_base 2935, 0% gsm8k) — the old meta-vs-base comparison was confounded; the matched-base twin (↗) removes exactly that.

Why it succeeded / failed

Seven recurring mechanisms. Every method maps to ≥1. Colors are reused across the page.

Selectivity ✅

Emit meta only when needed → helps. (CF)

Always-on ❌

No necessity check → meta fires everywhere → breaks already-solved problems. (PMI, gm)

Form ≠ Behavior ❌

Reward the SHAPE of meta, not real cognition → decoration with no causal substance. (cf_group)

Correctness-coupling ❌

Tie meta reward to the correctness sign → over-confident meta → suppresses epistemic hedging → worse. (gm mult; SDPO)

Self-verification ceiling ❌

Meta comes from the SAME model that erred → inherits the blind spot → confirms the error. (root cause)

Easy data ❌

SFT on easy / decorative meta degrades the base's hard-math capability + no save signal. (rv data)

Inert-centering ⚙️

A group-CONSTANT reward, centered within its own group, has a mean equal to itself, so every centered value is 0 → signal vanishes silently. (asym_cf, fixed)
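The inert-centering failure is easy to reproduce on toy numbers. The sketch below assumes a gate value that is the same for every meta-emitting rollout in a group, and compares centering inside the mask with one possible reading of "whole-group centering" (non-emitting rollouts contribute 0 to the mean). Both function names and the −0.5 gate value are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def center_within_mask(gate: float, mask: Sequence[bool]) -> List[float]:
    """Subtract the mean computed over masked (meta-emitting) rollouts only."""
    masked = [gate for m in mask if m]
    mu = mean(masked) if masked else 0.0
    return [gate - mu if m else 0.0 for m in mask]


def center_whole_group(gate: float, mask: Sequence[bool]) -> List[float]:
    """Subtract the mean computed over every rollout in the group."""
    raw = [gate if m else 0.0 for m in mask]
    mu = mean(raw)
    return [r - mu for r in raw]


mask = [True, True, False, False]  # two rollouts emitted meta, two did not
gate = -0.5                        # same suppression penalty for every emitter

print(center_within_mask(gate, mask))  # [0.0, 0.0, 0.0, 0.0] -> the penalty disappears
print(center_whole_group(gate, mask))  # [-0.25, -0.25, 0.25, 0.25] -> emitters are pushed down
```

With the mean taken over the whole group, the emitters sit below the non-emitters, so the suppression signal survives the centering step.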

The data we trained on (SFT evolution)

base_sft legacy · ~5k · meta stripped

Plain solutions.

v8 series matched difficulty

v8_base_matched (6329 rows, hard 24%) vs v8_meta_inside (4264 rows, hard 31%, meta inside <think>). e21 trained here.

rv_redirect_verify 1763 rows · hard 0%

Redirect / verify scenarios + confidence labels. CF / gm / asym_cf trained here. Mechanism (6): all easy/medium.