This is the heart of the method. The reward is not "did the model do meta?" but "what difference did the meta make?" To get that, we run the same problem twice, with and without the meta, and contrast the two rollouts.
Our methodological through-line: the same rollout, measured four progressively sharper ways. Each step isolates causation a little more — the same contrastive tool, aimed more precisely.
A good meta is one that moves the belief dot. Most meta in our data does not — it decorates an already-correct answer. Both panels below are illustrative.
Real saves like the right panel are rare in our data (hence "illustrative"). The reward methods below all exist to find and amplify the rare good meta while not paying for the common decoration.
Each method is one way to mine a contrast from the model's own rollouts. Each entry gives a formula, a one-line intuition, and a colored Why (mechanism) callout; the Why is the substance.
Scores the model's own continuation over the byte-identical span, with-meta vs without-meta; the slip-correcting meta lifts every token → positive PMI.
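A minimal sketch of the PMI score described above, assuming the per-token log-probabilities of the same continuation have already been computed under both contexts. The function name `pmi_score`, the choice to sum rather than average over tokens, and the toy numbers are illustrative assumptions, not taken from the training code.

```python
from typing import Sequence


def pmi_score(logp_with_meta: Sequence[float],
              logp_without_meta: Sequence[float]) -> float:
    """Sum of per-token log-prob differences over the shared span.

    Each entry is log p(token | prefix) for the SAME continuation tokens,
    scored once with the meta in context and once without it.
    """
    if len(logp_with_meta) != len(logp_without_meta):
        raise ValueError("both scorings must cover the identical span")
    return sum(w - wo for w, wo in zip(logp_with_meta, logp_without_meta))


# Toy numbers (made up): the meta makes every continuation token more likely.
with_meta = [-0.2, -0.1, -0.3]
without_meta = [-0.9, -0.6, -0.8]
score = pmi_score(with_meta, without_meta)
print(f"PMI = {score:.2f}")  # positive when the meta lifts the continuation
```

A meta that changes nothing yields identical log-probs in both passes and a score of zero.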
The counterfactual contrast (Figure B): does the meta actually flip the model's own outcome?
Teacher-force both answers: a real catch lifts gold, not the decoy (gm > 0); decoration lifts both equally (gm ≈ 0). Two variants, two failure modes.
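One plausible reading of gm is a difference of two lifts: how much the meta raises the teacher-forced gold answer, minus how much it raises the decoy. The sketch below assumes that form, with a placebo context as the no-meta baseline; the function name and all log-prob values are hypothetical.

```python
def gm_score(gold_meta: float, gold_placebo: float,
             decoy_meta: float, decoy_placebo: float) -> float:
    """Difference of lifts: (gold lift from meta) - (decoy lift from meta).

    Inputs are sequence log-probs of the teacher-forced answer under
    the real meta and under a placebo context (assumed baseline).
    """
    gold_lift = gold_meta - gold_placebo
    decoy_lift = decoy_meta - decoy_placebo
    return gold_lift - decoy_lift


# A real catch: the meta lifts gold and slightly lowers the decoy.
real_catch = gm_score(gold_meta=-2.0, gold_placebo=-5.0,
                      decoy_meta=-6.0, decoy_placebo=-5.5)

# Decoration: the meta lifts both answers by the same amount.
decoration = gm_score(gold_meta=-4.0, gold_placebo=-5.0,
                      decoy_meta=-4.5, decoy_placebo=-5.5)

print(real_catch, decoration)
```

Under this reading, any effect that raises both answers equally (for example, generic fluency) cancels out, which is why decoration scores near zero.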
An asymmetric counterfactual: harmful meta costs more than helpful meta earns, so the drift to "meta everywhere" is taxed.
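A sketch of how such an asymmetric gate could look, assuming a binary correct/incorrect outcome for the meta-ON and meta-OFF rollouts of the same problem. Only the constraint β > α comes from the description; the function name, the specific values of `alpha` and `beta`, and the zero reward for unchanged outcomes are assumptions.

```python
def asym_cf_reward(correct_with_meta: bool, correct_without_meta: bool,
                   alpha: float = 1.0, beta: float = 2.0) -> float:
    """Asymmetric counterfactual reward (illustrative form, beta > alpha).

    +alpha when the meta turns a wrong answer into a right one (a save),
    -beta  when the meta breaks an already-solved problem (harm),
    0      when the meta does not change the outcome.
    """
    if beta <= alpha:
        raise ValueError("asymmetry requires beta > alpha")
    if correct_with_meta and not correct_without_meta:
        return alpha
    if correct_without_meta and not correct_with_meta:
        return -beta
    return 0.0


# With equal save and harm rates, always-on meta has negative expected reward.
save_rate, harm_rate = 0.10, 0.10
expected = (save_rate * asym_cf_reward(True, False)
            + harm_rate * asym_cf_reward(False, True))
print(expected)
```

With these illustrative weights, meta must save more than twice as often as it harms before emitting it everywhere pays off.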
The gold−decoy gap at meta-OPEN vs meta-CLOSE: a decoy→gold reversal is the rewarded case. Step through a real rollout in the calculator on the main page.
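The open-versus-close comparison can be sketched as follows, assuming the gold and decoy answers are each scored (as log-probs) once at the position where the meta opens and once where it closes. The function name, the returned tuple, and the numbers are hypothetical.

```python
from typing import Tuple


def pmi_shift(gold_open: float, decoy_open: float,
              gold_close: float, decoy_close: float) -> Tuple[float, bool]:
    """Return (shift, reversal) for the gold-decoy gap across a meta span.

    shift    = gap at meta-CLOSE minus gap at meta-OPEN
    reversal = the model preferred the decoy before the meta
               and prefers gold after it
    """
    gap_open = gold_open - decoy_open
    gap_close = gold_close - decoy_close
    return gap_close - gap_open, (gap_open < 0 < gap_close)


# Made-up log-probs: decoy preferred at open, gold preferred at close.
shift, reversal = pmi_shift(gold_open=-4.0, decoy_open=-2.5,
                            gold_close=-1.0, decoy_close=-3.0)
print(shift, reversal)
```

A meta inserted into an already-correct rollout has a positive gap at both ends, so it can produce a shift but never a reversal.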
"Meta helps on medium" is a result, not the intent — reward shaping must select for the middle.
math_verify-graded on the 1030-eval; held-out Δ = acc_with − acc_without on the 594 confidence-rv set.
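The held-out Δ defined above is a paired difference: the same problems are graded with and without meta. A small sketch, assuming per-problem boolean grades are available; the function name and the five-problem example are illustrative only.

```python
from typing import Sequence, Tuple


def held_out_delta(correct_with: Sequence[bool],
                   correct_without: Sequence[bool]) -> Tuple[float, int, int]:
    """Return (delta, saves, breaks) for paired per-problem grades.

    delta  = acc_with - acc_without
    saves  = problems solved only with meta
    breaks = problems solved only without meta
    """
    if not correct_with or len(correct_with) != len(correct_without):
        raise ValueError("need two non-empty, equally long grade lists")
    n = len(correct_with)
    acc_with = sum(correct_with) / n
    acc_without = sum(correct_without) / n
    pairs = list(zip(correct_with, correct_without))
    saves = sum(w and not wo for w, wo in pairs)
    breaks = sum(wo and not w for w, wo in pairs)
    return acc_with - acc_without, saves, breaks


# Five toy problems: meta saves one and breaks none.
with_meta = [True, True, False, True, True]
without_meta = [True, False, False, True, True]
delta, saves, breaks = held_out_delta(with_meta, without_meta)
print(delta, saves, breaks)
```

Because the grades are paired, the delta always equals `(saves - breaks) / n`, which makes the net effect easy to audit problem by problem.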
| Name | Contrast (self-distill) | Result | Why |
|---|---|---|---|
| Base GRPO | final-answer correctness only (no meta) | gsm8k 0.93 · aime 0.37 · math500 0.63–0.72 (grader-sensitive) | reference |
| e21 (Meta-CoT) | meta inside <think> + GRPO on correctness | aime 0.13 · gsm8k 0.93 (tie) · math500 0.64–0.76 | always-on: aime collapses 0.37→0.13 (robust); overall gap within grader noise |
| r10v2 / SDC | KL toward a gold-conditioned (priming) teacher | steering AUC ≈ 0.49 (≈ chance) | coupling: priming suppresses epistemic (SDPO); not self-distill |
| PMI | gold vs no-meta | held-out Δ +0.010 · emit 0.96 | always-on |
| CF ★ | meta ON vs OFF; over-penalty on solved | held-out Δ +0.040 (best) · emit 0.20 · hard +0.052 | selectivity |
| cf_group | score the presence / shape of meta | meta emission → 0 (full abstention) | form ≠ behavior: CF correctly found form added nothing |
| gm additive | gold vs decoy × meta vs placebo; R = R_corr ⊕ gm | held-out Δ −0.003 (neutral) | always-on: epistemic preserved, no net gain |
| gm multiplicative | R_meta = R_corr · exp(sign(A_corr)·gm) | held-out Δ −0.029 (net-harm) | coupling: epi-words/meta 0.2 vs 1.4 |
| asym_cf (now) | asymmetric ON/OFF gate, β>α | fix confirmed: blocks drift; suppression weak (γ↑) | always-on · inert |
| PMI-shift (now) | gold vs decoy × before vs after meta | held-out Δ +0.019 (2nd) · 1030 overall 0.786 (closest to base) · gsm8k 0.94 ≥ base · matched-base twin: above base at every same-step val point (PRELIMINARY, held-in) | selectivity · ceiling: narrows the gap most; hard-math still lost |
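The multiplicative formula in the gm row, R_meta = R_corr · exp(sign(A_corr)·gm), can be written out literally as below. This is a sketch under stated assumptions: `a_corr` is taken to be the correctness advantage of the rollout, and the function and helper names are hypothetical.

```python
import math


def sign(x: float) -> int:
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def gm_multiplicative(r_corr: float, a_corr: float, gm: float) -> float:
    """R_meta = R_corr * exp(sign(A_corr) * gm), transcribed from the table."""
    return r_corr * math.exp(sign(a_corr) * gm)


gm = math.log(2)  # illustrative gm value
print(gm_multiplicative(1.0, +0.5, gm))  # positive advantage: reward doubled
print(gm_multiplicative(1.0, -0.5, gm))  # negative advantage: reward halved
```

The same gm value scales the reward up or down depending only on the sign of the correctness advantage, which is the tie between meta reward and correctness that the coupling mechanism describes.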
Same 1030 benchmark, math_verify re-grade. None beats old Base GRPO overall; the gap is all hard-math; PMI-shift narrows it most (−4.3pp) at matched step 300. But this base carries a lineage confound — see the matched-base twin on the main page.
| Method | step | gsm8k | math500 | aime | overall | vs base |
|---|---|---|---|---|---|---|
| Base GRPO | 300 | 0.934 | 0.752 | 0.367 | 0.829 | reference |
| CF | 130 | 0.920 | 0.604 | 0.100 | 0.743 | −8.6pp |
| PMI | 200 | 0.930 | 0.632 | 0.133 | 0.762 | −6.7pp |
| gm additive | 170 | 0.930 | 0.608 | 0.200 | 0.752 | −7.7pp |
| gm multiplicative | 301 | 0.928 | 0.638 | 0.233 | 0.767 | −6.2pp |
| PMI-shift ★ | 300 | 0.940 | 0.666 | 0.233 | 0.786 | −4.3pp |
Caveats: steps differ (not a clean method ranking; the clean signal is the paired held-out Δ) and aime is n=30 (±1 problem = ±3.3pp). Robust: gsm8k ≈ base everywhere; every method collapses on aime vs base 0.37.
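The overall column appears to be an example-weighted average over the three benchmarks. The sketch below assumes sizes of 500, 500, and 30 (read from the "gsm500+math500+aime30" note elsewhere on this page) and reproduces the base and PMI-shift rows; the helper name is hypothetical.

```python
# Assumed benchmark sizes, summing to the 1030-problem eval.
SIZES = {"gsm8k": 500, "math500": 500, "aime": 30}


def overall(acc: dict) -> float:
    """Example-weighted accuracy across the three benchmarks."""
    total = sum(SIZES.values())
    return sum(acc[name] * SIZES[name] for name in SIZES) / total


base = overall({"gsm8k": 0.934, "math500": 0.752, "aime": 0.367})
pmi_shift = overall({"gsm8k": 0.940, "math500": 0.666, "aime": 0.233})
gap_pp = (pmi_shift - base) * 100

# One AIME problem moves aime accuracy by 100/30 percentage points.
aime_step_pp = 100 / SIZES["aime"]

print(f"base={base:.3f} pmi_shift={pmi_shift:.3f} "
      f"gap={gap_pp:.1f}pp aime_step={aime_step_pp:.1f}pp")
```

With these weights aime contributes only 30 of 1030 examples, so the overall gap is driven mostly by math500 even though the aime drop is the largest per-benchmark change.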
The base − meta gap's sign flips with the grader (math500 LaTeX checker swings ±10pp): three consistent graders below.
| Grader | base | e21r-meta | overall gap |
|---|---|---|---|
| old check_correctness | 0.770 | 0.817 | meta +4.7pp |
| completion·math_verify | 0.813 | 0.762 | base +5.0pp |
| union (either correct) | 0.815 | 0.821 | −0.7pp (tie) |
Honest reading: meta ≈ base on easy/medium; the only robust degradation is on hard (aime). The trustworthy reward-quality signal is the paired held-out Δ below, not these fragile absolutes.
gsm8k ≈ tie · math500 within grader noise · aime collapses 0.13 vs base 0.37 (robust across every grader — the one clear degradation).
| metric | old base GRPO (0410) | new matched-base |
|---|---|---|
| gsm8k | 0.934 | 0.89–0.92 |
| overall | 0.770 (1030 held-out: gsm500+math500+aime30, 16k) | 0.62 macro (9-domain held-in incl. precalc 0.24 / int-alg 0.38) |
The old base also used a different SFT (legacy base_sft 4996) and different RL data (redirect_base 2935, 0% gsm8k), so the old meta-vs-base comparison was confounded; the matched-base twin removes exactly that confound.
Seven recurring mechanisms. Every method maps to ≥1. Colors are reused across the page.
(1) Emit meta only when needed → helps. (CF)
(2) No necessity check → meta fires everywhere → breaks already-solved problems. (PMI, gm)
(3) Reward the SHAPE of meta, not real cognition → decoration with no causal substance. (cf_group)
(4) Tie meta reward to the correctness sign → over-confident meta → suppresses epistemic hedging → worse. (gm mult; SDPO)
(5) Meta comes from the SAME model that erred → inherits the blind spot → confirms the error. (root cause)
(6) SFT on easy / decorative meta degrades the base's hard-math capability and gives no save signal. (rv data)
(7) A group-CONSTANT reward equals its own group mean, so centering it within the group leaves zero → signal vanishes silently. (asym_cf, fixed)
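The group-constant failure in the last mechanism is easy to reproduce. The sketch below assumes GRPO-style mean-centering of rewards within a group of rollouts for one prompt; the helper name and reward values are illustrative.

```python
from typing import List


def center_within_group(rewards: List[float]) -> List[float]:
    """Subtract the group mean from each reward (GRPO-style centering)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


correctness = [1.0, 0.0, 1.0, 0.0]   # per-rollout correctness rewards
bonus = 0.5                          # same bonus added to every rollout

with_bonus = center_within_group([r + bonus for r in correctness])
without_bonus = center_within_group(correctness)
bonus_alone = center_within_group([bonus] * len(correctness))

print(with_bonus)     # identical to without_bonus
print(without_bonus)
print(bonus_alone)    # all zeros: the constant term carries no gradient signal
```

Any reward term that takes the same value for every rollout in a group drops out after centering, so it has to vary within the group to have an effect.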
Plain solutions.
v8_base_matched (6329 rows, hard 24%) vs v8_meta_inside (4264 rows, hard 31%, meta inside <think>). e21 trained here.
Redirect / verify scenarios + confidence labels. CF / gm / asym_cf trained here. Mechanism (6): all easy/medium.