Result archive

Metacog — result archive

Note. Historical results & deep-dives. The headline comparison lives on the main page — these earlier comparisons used a CONFOUNDED base (different SFT/RL data) and are superseded by the matched-base twin.

Contrastive = compare two worlds

This is the heart. The reward is not "did the model do meta?" — it is "what difference did the meta make?" To get that, we run the same problem twice and contrast the two rollouts.

Figure B · Two worlds — meta OFF vs. meta ON (the CF counterfactual)
reward = (✓ with meta) − (✗ without meta)
Same reasoning, two endings. Without meta the belief dot drifts to red and the answer is wrong (✗). With meta, the moment the blue meta block appears the dot leaps to green (✓). The bracket is the credit: the meta is rewarded for the outcome difference it caused — not for existing.

The lens got sharper

Our methodological through-line: the same rollout, measured four progressively sharper ways. Each step isolates causation a little more — the same contrastive tool, aimed more precisely.

Figure C · Contrastive evolution — correlation → outcome → preference → causal shift
The through-line. correlation → outcome → answer-preference → causal belief-shift. Same tool (contrast), progressively isolating causation — and every step reads the model's own rollouts, never an external label.

What makes a meta "good"?

A good meta is one that moves the belief dot. Most meta in our data does not — it decorates an already-correct answer. Both panels below are illustrative.

BAD META · decoration (the data is full of this)
If 2x − y = 5 and x + 2y = 5, what is x?
→ solve: x = 3 ✓ (already correct)
<|meta|> confidence: 0.78 · the approach seems right · verify by working backwards <|/meta|>
● belief dot does not move — meta added nothing
GOOD META · illustrative · catches a slip
Compute 7 − 3(2 − 5).
→ first pass: 7 − 3·(−3) = 7 − 9 = −2 ✗ (sign slip)
<|meta|> check: substitute back — 3·(2−5) = −9, so 7 − (−9) = 16, not −2. correct it. <|/meta|>
→ corrected: 16
✓ belief dot moves red → green — meta did real work

Real saves like the right panel are rare in our data (hence "illustrative"). The reward methods below all exist to find and amplify the rare good meta while not paying for the common decoration.

The five self-distillation signals — full detail

Each method is one way to mine a contrast from the model's own rollouts. Formula + a one-line intuition + a colored Why (mechanism) callout — the Why is the substance.

The three rewards differ on two axes: TARGET × CONTRAST

| TARGET \ CONTRAST | meta on / off, meta vs placebo (condition) | before vs after meta (time) |
|---|---|---|
| own continuation | PMI: does the meta lift MY own following answer? | |
| gold vs decoy | gm: does meta favor gold over a decoy? | PMI-shift: did the gold−decoy gap move? |

Columns = CONTRAST axis (what is held against what); rows = TARGET (whose probability is scored).
Same contrastive idea, three knobs. PMI targets the model's own continuation under meta on/off; gm and PMI-shift both target the gold−decoy gap, but gm contrasts a condition (meta vs placebo) while PMI-shift contrasts time (before vs after the meta).
⚠ Two "PMI"s: the PMI method scores the model's own continuation (meta on/off); PMI-shift's PMI_open/PMI_close is the gold−decoy gap at two positions of one rollout.

PMI

self-distill: own continuation, meta on vs off · always-on
Intent: reward meta that makes the model's OWN following answer likelier.
$$R = \operatorname{mean\_min}_{t\in C}\Big[\log p(c_t\mid \text{prefix}+\text{meta}) - \log p(c_t\mid \text{prefix})\Big]\;-\;\text{placebo}$$

Scores the model's own continuation over the byte-identical span, with-meta vs without-meta; the slip-correcting meta lifts every token → positive PMI.

Why · Always-on. No necessity check: confident meta lifts everything, emission drifts to ~0.96 and breaks already-solved problems. Held-out Δ +0.010.
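A minimal sketch of the PMI formula above, assuming per-token log-probabilities for the continuation span are already in hand. The page does not define `mean_min`, so it is read here as a blend of mean and min with a hypothetical weight `lam`, and the placebo term is read as the same lift measured with a placebo block in place of the meta. Both readings are assumptions, not the trained implementation.

```python
from statistics import fmean


def mean_min(values, lam=0.5):
    """Blend of mean and min over per-token values (assumed reading of mean_min)."""
    return lam * fmean(values) + (1.0 - lam) * min(values)


def pmi_reward(logp_with_meta, logp_without, logp_with_placebo, lam=0.5):
    """PMI-style reward over one continuation span C.

    Each argument is a list of log p(c_t | context) for the same tokens c_t,
    scored under prefix+meta, prefix alone, and prefix+placebo.
    """
    if not (len(logp_with_meta) == len(logp_without) == len(logp_with_placebo)):
        raise ValueError("all three lists must score the same tokens")
    # Per-token lift from the meta, and from the placebo, over the bare prefix.
    meta_lift = [m - b for m, b in zip(logp_with_meta, logp_without)]
    placebo_lift = [p - b for p, b in zip(logp_with_placebo, logp_without)]
    return mean_min(meta_lift, lam) - mean_min(placebo_lift, lam)
```

Nothing in this score asks whether the problem was already solved without meta, which is the always-on failure described above.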

CF ★

self-distill: meta ON vs OFF · selectivity
Intent: reward meta only when it changes the OUTCOME.
$$R_{\text{meta}} \propto \mathrm{acc}(\text{with meta}) - \mathrm{acc}(\text{without meta}),\quad \text{per group; over-penalty on already-solved}$$

The counterfactual contrast (Figure B): does the meta actually flip the model's own outcome?

Why · Selectivity. Zero reward on already-solved + over-penalty ⇒ meta fires only where it flips wrong→right (emit 0.20). Best held-out Δ +0.040 (hard +0.052); capped by the self-verification ceiling (mechanism 5).
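The counterfactual credit can be sketched per problem group as follows. This is an illustration under assumptions: the page gives the reward only up to proportionality, and the `over_penalty` constant and the exact handling of already-solved groups are hypothetical.

```python
def cf_reward(correct_with, correct_without, over_penalty=0.5):
    """Counterfactual credit for one problem group.

    correct_with / correct_without: lists of 0/1 outcomes for rollouts of the
    same problem, sampled with the meta block and with it removed.
    over_penalty: hypothetical constant charged when the problem was already
    solved without meta.
    """
    acc_with = sum(correct_with) / len(correct_with)
    acc_without = sum(correct_without) / len(correct_without)
    delta = acc_with - acc_without
    if acc_without == 1.0:
        # Already solved: meta cannot flip anything to correct, so it earns
        # no positive credit (delta <= 0 here) and pays the over-penalty.
        return delta - over_penalty
    return delta


# Meta flips two of four rollouts from wrong to right.
print(cf_reward([1, 1, 1, 0], [0, 0, 1, 0]))  # 0.5
# Decoration on an already-solved problem is taxed.
print(cf_reward([1, 1], [1, 1]))              # -0.5
```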

gm — additive vs multiplicative

self-distill: gold vs decoy × meta vs placebo · always-on coupling
Intent: self-distill via a decoy contrast; additive keeps meta as an independent head, multiplicative shackles it to correctness.
$$gm = \big[\log p(\text{gold}\mid\text{meta}) - \log p(\text{gold}\mid\text{plac})\big] - \big[\log p(\text{decoy}\mid\text{meta}) - \log p(\text{decoy}\mid\text{plac})\big]$$
$$\text{additive: } R = R_{\text{corr}} \oplus gm \qquad\qquad \text{multiplicative: } R_{\text{meta}} = R_{\text{corr}}\cdot \exp\!\big(\mathrm{sign}(A_{\text{corr}})\cdot gm\big)$$

Teacher-force both answers: a real catch lifts gold, not the decoy (gm > 0); decoration lifts both equally (gm ≈ 0). Two variants, two failure modes.

(A) Additive · Always-on. The independent head keeps the meta signal separate, so epistemic hedging is preserved. But with no necessity check, meta still fires everywhere: held-out Δ −0.003 (neutral). Failure is mechanism (2), not coupling.
(B) Multiplicative · Correctness-coupling. Coupling meta to the correctness sign collapses epistemic hedging (epi-words/meta 0.2 vs 1.4) → held-out Δ −0.029 (net-harm; SDPO effect).
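A small sketch of the gm score and its two couplings, assuming the four teacher-forced answer log-probabilities are given as scalars. The page does not spell out the ⊕ operator; it is read here as a weighted sum with a hypothetical `weight`, which is an assumption.

```python
import math


def gm_score(lp_gold_meta, lp_gold_plac, lp_decoy_meta, lp_decoy_plac):
    """Lift the meta gives gold over placebo, minus the lift it gives the decoy."""
    return (lp_gold_meta - lp_gold_plac) - (lp_decoy_meta - lp_decoy_plac)


def additive_reward(r_corr, gm, weight=1.0):
    """Additive variant; the oplus is assumed to be a weighted sum."""
    return r_corr + weight * gm


def multiplicative_reward(r_corr, a_corr, gm):
    """Multiplicative variant: R_corr * exp(sign(A_corr) * gm)."""
    sign = (a_corr > 0) - (a_corr < 0)
    return r_corr * math.exp(sign * gm)


# A real catch lifts gold only; decoration lifts gold and decoy equally.
print(gm_score(-1.0, -2.0, -3.0, -3.0))  # 1.0
print(gm_score(-1.0, -2.0, -2.0, -3.0))  # 0.0
```

In the multiplicative form the sign of the correctness advantage decides whether gm scales the reward up or down, which is the coupling that (B) blames for the loss of hedging.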

asym_cf now

self-distill: asymmetric ON/OFF gate · always-on inert-centering
Intent: a GATE that penalizes harmful meta harder than it rewards helpful meta (β > α).
$$R_{\text{gate}} = \alpha\cdot\max(0,\,c_1-c_0)\;-\;\beta\cdot\max(0,\,c_0-c_1)\;-\;\gamma\cdot\mathbf{1}[c_0\approx c_1\approx 1],\qquad \beta>\alpha$$

An asymmetric counterfactual: harmful meta costs more than helpful meta earns, so the drift to "meta everywhere" is taxed.

Why · Inert-centering (bug, now fixed). Group-constant gate centered within its own mask averaged to 0, so the suppression signal silently vanished (rmeta_neg_rate = 0, emit → 0.92). Fixed by whole-group centering; suppression still weak (γ↑ next).
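The gate formula can be sketched directly. The page does not define c_0 and c_1; this sketch assumes c_0 is correctness without meta and c_1 correctness with meta, each in [0, 1]. The default values for `alpha`, `beta`, `gamma` and the tolerance `tol` are hypothetical.

```python
def gate_reward(c0, c1, alpha=1.0, beta=2.0, gamma=0.5, tol=1e-6):
    """Asymmetric counterfactual gate.

    c0: correctness without meta, c1: correctness with meta (assumed meaning).
    alpha, beta, gamma, tol: illustrative values only; beta must exceed alpha.
    """
    if not beta > alpha:
        raise ValueError("the gate requires beta > alpha")
    helped = max(0.0, c1 - c0)   # meta flipped wrong -> right
    harmed = max(0.0, c0 - c1)   # meta flipped right -> wrong
    # Indicator for "already solved either way": c0 and c1 both about 1.
    inert_on_solved = 1.0 if abs(c0 - 1.0) <= tol and abs(c1 - 1.0) <= tol else 0.0
    return alpha * helped - beta * harmed - gamma * inert_on_solved


print(gate_reward(0.0, 1.0))  # 1.0   helpful meta earns alpha
print(gate_reward(1.0, 0.0))  # -2.0  harmful meta costs beta
print(gate_reward(1.0, 1.0))  # -0.5  meta on a solved problem pays gamma
```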

PMI-shift now

self-distill: before vs after the meta · form ≠ behavior ceiling
Intent: reward meta that causally MOVES belief toward the right answer.
$$\text{PMI}_t = \sum \big[\log p(\text{gold}) - \log p(\text{decoy})\big]\Big|_{t},\qquad \text{shift} = \text{PMI}_{\text{close}} - \text{PMI}_{\text{open}}$$
$$\text{reward sign-reversal:}\quad (\text{decoy}\!\to\!\text{gold}) \mapsto +,\qquad (\text{gold}\!\to\!\text{decoy}) \mapsto -$$

The gold−decoy gap at meta-OPEN vs meta-CLOSE: a decoy→gold reversal is the rewarded case. Step through a real rollout in the calculator on the main page.

Why · breaks Form ≠ Behavior (content level). It measures the meta's CAUSAL belief update before vs after the meta. Re-derivation / decoration produces no shift; a genuine catch produces a decoy→gold reversal. Rewarding only the reversal targets real cognition, not shape. Live: bidirectional signal (neg_rate ~0.10), acc maintained.
Why · Self-verification ceiling (residual). The meta is still generated by the SAME model that erred, so it inherits the blind spot, and the shift can confirm rather than catch. Before/after differencing only PARTLY cancels this: it removes the shared baseline, not the shared blind spot.
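A sketch of the before/after measurement, assuming teacher-forced token log-probabilities for the gold and decoy answers at the meta-open and meta-close positions. The sum in PMI_t is read as a sum over answer tokens, and the ±1 reward magnitudes are hypothetical; the page fixes only the signs.

```python
def pmi_at(lp_gold_tokens, lp_decoy_tokens):
    """Gold-minus-decoy log-prob gap at one position (sum over answer tokens)."""
    return sum(lp_gold_tokens) - sum(lp_decoy_tokens)


def pmi_shift(open_gold, open_decoy, close_gold, close_decoy):
    """Return (PMI_open, PMI_close, shift) for one rollout."""
    pmi_open = pmi_at(open_gold, open_decoy)
    pmi_close = pmi_at(close_gold, close_decoy)
    return pmi_open, pmi_close, pmi_close - pmi_open


def reversal_reward(pmi_open, pmi_close):
    """+1 for a decoy->gold reversal, -1 for gold->decoy, 0 when the sign holds."""
    if pmi_open < 0 < pmi_close:
        return 1.0
    if pmi_close < 0 < pmi_open:
        return -1.0
    return 0.0


# Before the meta the decoy is preferred; after it, gold is.
o, c, s = pmi_shift([-2.0], [-1.0], [-0.5], [-2.5])
print(o, c, s, reversal_reward(o, c))  # -1.0 2.0 3.0 1.0
```

A meta that only restates an already-preferred gold answer leaves the sign unchanged and scores 0 here, matching the "no shift for decoration" reading above.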

Results

The sweet spot

"Meta helps on medium" is a result, not the intent — reward shaping must select for the middle.

[Figure: meta's effect on accuracy (0 = no effect) vs. problem difficulty. Easy: decoration → HURTS. Medium: catches slips → HELPS. Hard / AIME: derails → HURTS.]

Experiment ladder

math_verify-graded on the 1030-eval; held-out Δ = acc_with − acc_without on the 594 confidence-rv set.
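The paired held-out Δ defined above can be computed as in this sketch, assuming one 0/1 outcome per problem with and without meta on the same problem set. The save/break counts are an added convenience, not a metric reported on this page.

```python
def paired_held_out(correct_with, correct_without):
    """Paired delta: acc_with - acc_without over the same problems.

    Also counts saves (wrong -> right with meta) and breaks (right -> wrong).
    """
    if len(correct_with) != len(correct_without) or not correct_with:
        raise ValueError("need two non-empty, equally long outcome lists")
    n = len(correct_with)
    pairs = list(zip(correct_with, correct_without))
    saves = sum(1 for w, wo in pairs if w and not wo)
    breaks = sum(1 for w, wo in pairs if wo and not w)
    delta = (sum(correct_with) - sum(correct_without)) / n
    return {"delta": delta, "saves": saves, "breaks": breaks}


print(paired_held_out([1, 1, 0, 1], [1, 0, 0, 1]))
# {'delta': 0.25, 'saves': 1, 'breaks': 0}
```

Because the two runs share the same problems, delta equals (saves − breaks) / n, so problems where meta changes nothing cancel out.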


| Name | Contrast (self-distill) | Result | Why |
|---|---|---|---|
| Base GRPO | final-answer correctness only (no meta) | gsm8k 0.93 · aime 0.37; math500 0.63–0.72 (grader-sensitive) | reference |
| e21 (Meta-CoT) | meta inside <think> + GRPO on correctness | aime 0.13; gsm8k 0.93 (tie) · math500 0.64–0.76 | always-on: aime collapses 0.37→0.13 (robust); overall gap within grader noise |
| r10v2 / SDC | KL toward a gold-conditioned (priming) teacher | steering AUC ≈ 0.49 (≈ chance) | coupling: priming suppresses epistemic (SDPO); not self-distill |
| PMI | gold vs no-meta | held-out Δ +0.010; emit 0.96 | always-on |
| CF ★ | meta ON vs OFF; over-penalty on solved | held-out Δ +0.040 (best); emit 0.20 · hard +0.052 | selectivity |
| cf_group | score the presence / shape of meta | meta emission → 0 (full abstention) | form ≠ behavior: CF correctly found form added nothing |
| gm additive | gold vs decoy × meta vs placebo; R = R_corr ⊕ gm | held-out Δ −0.003 (neutral) | always-on: epistemic preserved, no net gain |
| gm multiplicative | R_meta = R_corr · exp(sign(A_corr)·gm) | held-out Δ −0.029 (net-harm) | coupling: epi-words/meta 0.2 vs 1.4 |
| asym_cf now | asymmetric ON/OFF gate, β>α | fix confirmed; blocks drift; suppression weak (γ↑) | always-on inert |
| PMI-shift now | gold vs decoy × before vs after meta | held-out Δ +0.019 (2nd); 1030 overall 0.786 (closest to base) · gsm8k 0.94 ≥ base; ↳ matched-base twin: above base at every same-step val point (PRELIMINARY, held-in) | selectivity ceiling: narrows the gap most; hard-math still lost |

Absolute accuracy on 1030, by method & step (math_verify re-grade)

Same 1030 benchmark, math_verify re-grade. None beats old Base GRPO overall; the gap is all hard-math; PMI-shift narrows it most (−4.3pp) at matched step 300. But this base carries a lineage confound — see the matched-base twin on the main page.


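| Method | step | gsm8k | math500 | aime | overall | vs base |
|---|---|---|---|---|---|---|
| Base GRPO | 300 | 0.934 | 0.752 | 0.367 | 0.829 | reference |
| CF | 130 | 0.920 | 0.604 | 0.100 | 0.743 | −8.6pp |
| PMI | 200 | 0.930 | 0.632 | 0.133 | 0.762 | −6.7pp |
| gm additive | 170 | 0.930 | 0.608 | 0.200 | 0.752 | −7.7pp |
| gm multiplicative | 301 | 0.928 | 0.638 | 0.233 | 0.767 | −6.2pp |
| PMI-shift ★ | 300 | 0.940 | 0.666 | 0.233 | 0.786 | −4.3pp |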

Caveats: steps differ (not a clean method ranking; the clean signal is the paired held-out Δ) and aime is n=30 (±1 problem = ±3.3pp). Robust: gsm8k ≈ base everywhere; every method collapses on aime vs base 0.37.
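The overall column can be reproduced from the per-benchmark columns, assuming the 1030 set is 500 gsm8k + 500 math500 + 30 aime problems (the split given elsewhere on this page as gsm500+math500+aime30) and that overall is the size-weighted mean. The weighting is our assumption; it matches the reported values for the two rows checked here.

```python
# Assumed benchmark sizes for the 1030 eval.
SIZES = {"gsm8k": 500, "math500": 500, "aime": 30}


def overall_accuracy(acc):
    """Size-weighted mean accuracy over the three benchmarks."""
    total = sum(SIZES.values())
    return sum(acc[name] * SIZES[name] for name in SIZES) / total


base = overall_accuracy({"gsm8k": 0.934, "math500": 0.752, "aime": 0.367})
shift = overall_accuracy({"gsm8k": 0.940, "math500": 0.666, "aime": 0.233})

print(round(base, 3), round(shift, 3))   # 0.829 0.786
print(round(100 * (shift - base), 1))    # -4.3  (gap in percentage points)
print(round(100 / SIZES["aime"], 1))     # 3.3   (one aime problem, in pp)
```

With only 30 aime problems in a 1030-problem total, one aime problem moves the overall number by about 0.1pp, so the overall gap is carried mostly by the 500-problem math500 column.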

Is meta below base? The overall gap is not robust

The base − meta gap's sign flips with the grader (math500 LaTeX checker swings ±10pp): three consistent graders below.

| Grader | base | e21r-meta | overall gap |
|---|---|---|---|
| old check_correctness | 0.770 | 0.817 | meta +4.7pp |
| completion·math_verify | 0.813 | 0.762 | base +5.0pp |
| union (either correct) | 0.815 | 0.821 | −0.7pp (tie) |

Honest reading: meta ≈ base on easy/medium; the only robust degradation is on hard (aime). The trustworthy reward-quality signal is the paired held-out Δ below, not these fragile absolutes.
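For reference, a "union (either correct)" score can be formed from two graders' per-problem verdicts as in this sketch. The verdict lists are toy data and the function names are hypothetical; nothing here calls the real graders.

```python
def accuracy(verdicts):
    """Fraction of problems marked correct."""
    return sum(1 for v in verdicts if v) / len(verdicts)


def union_verdicts(grader_a, grader_b):
    """A problem counts as correct if either grader accepts it."""
    if len(grader_a) != len(grader_b):
        raise ValueError("graders must score the same problems")
    return [bool(a) or bool(b) for a, b in zip(grader_a, grader_b)]


# Toy verdicts: the graders disagree on problems 0 and 2.
a = [1, 0, 0, 1]
b = [0, 0, 1, 1]
print(accuracy(a), accuracy(b), accuracy(union_verdicts(a, b)))  # 0.5 0.5 0.75
```

The union can only raise a score, so it removes false negatives from either checker but leaves any shared false positives in place.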

Difficulty: where meta collapses

gsm8k ≈ tie · math500 within grader noise · aime collapses 0.13 vs base 0.37 (robust across every grader — the one clear degradation).

[Figure: accuracy by difficulty, base vs. meta, for gsm8k (easy), math500 (medium), and aime (hard); on aime, base 37 vs. meta 13.]

Held-out Δ ranking

CF +0.040 · only significant +
PMI-shift +0.019 · 2nd · step 300
PMI +0.010
gm-add −0.003
gm-mult −0.029

The math500 checker mis-scores LaTeX answers in both directions (base math500 ranges 0.63–0.72), so every absolute 1030 number carries a grading caveat — hence we lean on the paired held-out Δ and the grader-independent aime collapse.
Why does the new base look "worse" than the old base GRPO? A measurement artifact, not capability loss.
| metric | old base GRPO (0410) | new matched-base |
|---|---|---|
| gsm8k | 0.934 | 0.89–0.92 |
| overall | 77.0% (1030 held-out: gsm500+math500+aime30, 16k) | 0.62 macro (9-domain held-in incl. precalc 0.24 / int-alg 0.38) |

The old base also used a different SFT (legacy base_sft 4996) and different RL data (redirect_base 2935, 0% gsm8k), so the old meta-vs-base comparison was confounded; the matched-base twin removes exactly that confound.

Why it succeeded / failed

Seven recurring mechanisms. Every method maps to ≥1. Colors are reused across the page.

1. Selectivity

Emit meta only when needed → helps. (CF)

2. Always-on

No necessity check → meta fires everywhere → breaks already-solved problems. (PMI, gm)

3. Form ≠ Behavior

Reward the SHAPE of meta, not real cognition → decoration with no causal substance. (cf_group)

4. Correctness-coupling

Tie meta reward to the correctness sign → over-confident meta → suppresses epistemic hedging → worse. (gm mult; SDPO)

5. Self-verification ceiling

Meta comes from the SAME model that erred → inherits the blind spot → confirms the error. (root cause)

6. Easy data

SFT on easy / decorative meta degrades the base's hard-math capability + no save signal. (rv data)

7. Inert-centering ⚙️

A group-CONSTANT reward, centered within its own group, equals its own mean, so the centered value is 0 → signal vanishes silently. (asym_cf, fixed)
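Mechanism 7 is easy to reproduce on toy numbers. The sketch below assumes the gate reward applies only to rollouts that emitted meta (the mask) and that "whole-group centering" means subtracting the mean over all rollouts in the group, with non-meta rollouts contributing 0. Both function names and that reading of the fix are assumptions.

```python
def center_within_mask(rewards, mask):
    """Buggy variant: subtract the mean taken over masked rollouts only."""
    masked = [r for r, m in zip(rewards, mask) if m]
    mean = sum(masked) / len(masked) if masked else 0.0
    return [r - mean if m else 0.0 for r, m in zip(rewards, mask)]


def center_whole_group(rewards, mask):
    """Assumed fix: zero out unmasked rollouts, then center over the whole group."""
    gated = [r if m else 0.0 for r, m in zip(rewards, mask)]
    mean = sum(gated) / len(gated)
    return [g - mean for g in gated]


# Four rollouts of one problem; two emitted meta and share the same penalty.
rewards = [-0.5, -0.5, -0.5, -0.5]
mask = [True, True, False, False]

print(center_within_mask(rewards, mask))  # [0.0, 0.0, 0.0, 0.0]
print(center_whole_group(rewards, mask))  # [-0.25, -0.25, 0.25, 0.25]
```

In the first variant the penalty never reaches the gradient; in the second, meta-emitting rollouts end up below the non-emitting ones, which is the suppression signal the gate was meant to deliver.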

The data we trained on (SFT evolution)
base_sft legacy · ~5k · meta stripped

Plain solutions.

v8 series matched difficulty

v8_base_matched (6329 rows, hard 24%) vs v8_meta_inside (4264 rows, hard 31%, meta inside <think>). e21 trained here.

rv_redirect_verify 1763 rows · hard 0%

Redirect / verify scenarios + confidence labels. CF / gm / asym_cf trained here. Mechanism (6): all easy/medium.