<|meta|> reasoning (verify · redirect · state confidence). We never tell it
what to write. Instead we compare two of its own rollouts — meta on vs. off, gold vs. decoy,
before vs. after — and turn the difference into a reward. The contrast is the teacher.
That self-supervised loop is the engine of self-evolving reasoning.
The north-star is self-evolving reasoning: a model that gets better at math by learning about its own thinking, using only contrasts it can draw from its own rollouts.
Every reward we try — PMI, CF, gm, PMI-shift, asym_cf — is one self-distillation signal: a different contrast the model mines from its own attempts. None of them imports an external teacher or primes the model with the answer. The novelty is exactly this: the learning signal for "good metacognition" is distilled from the model itself.
One picture drives everything below. A horizontal axis: the left end is the decoy (a wrong answer), the right end is the gold (the correct answer). A dot marks the model's current belief — how much it favors gold over decoy. As reasoning streams, the dot moves.
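As a rough way to make the dot concrete, the sketch below places it on a 0-to-1 axis from the gap between the log-probabilities of the gold and decoy answers. The sigmoid mapping and the name `belief_position` are illustrative assumptions; the page does not say how the dot is computed.

```python
import math

def belief_position(logp_gold: float, logp_decoy: float) -> float:
    """Map a gold-vs-decoy log-probability gap onto the [0, 1] belief axis.

    0.0 is the decoy end, 1.0 is the gold end, 0.5 means no preference.
    The sigmoid is an assumed mapping, chosen only for illustration.
    """
    gap = logp_gold - logp_decoy
    return 1.0 / (1.0 + math.exp(-gap))

# Hypothetical (gold, decoy) log-probabilities at three points in a rollout.
trajectory = [(-3.0, -1.0), (-2.0, -2.0), (-0.5, -4.0)]
positions = [belief_position(g, d) for g, d in trajectory]
```

With these made-up numbers the dot starts on the decoy side, passes through the midpoint, and ends on the gold side.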
This is the heart. The reward is not "did the model do meta?" — it is "what difference did the meta make?" To get that, we run the same problem twice and contrast the two rollouts.
Our methodological through-line: the same rollout, measured four progressively sharper ways. Each step isolates causation a little more — the same contrastive tool, aimed more precisely.
A good meta is one that moves the belief dot. Most meta in our data does not — it decorates an already-correct answer. Both panels below are illustrative.
Real saves like the right panel are rare in our data (hence "illustrative"). The reward methods below all exist to find and amplify the rare good meta while not paying for the common decoration.
Each method is one way to mine a contrast from the model's own rollouts. Formula + a one-line intuition + a colored Why (mechanism) callout — the Why is the substance.
Scores the model's OWN continuation (e.g. "So the answer is 16"), not a teacher-forced gold answer: with-arm = prefix+meta+continuation, without-arm = prefix+continuation, over the byte-identical C-span. On the worked example, the slip-correcting meta lifts every continuation token → positive PMI.
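A minimal sketch of that with-arm vs. without-arm comparison, assuming per-token log-probabilities for the shared C-span have already been extracted from both forward passes. Summing the per-token lifts (rather than averaging) and the name `pmi_reward` are assumptions, and the numbers are invented.

```python
from typing import Sequence

def pmi_reward(logp_with: Sequence[float], logp_without: Sequence[float]) -> float:
    """Total log-prob lift of the continuation tokens when the meta is present."""
    if len(logp_with) != len(logp_without):
        raise ValueError("both arms must score the same continuation tokens")
    return sum(w - wo for w, wo in zip(logp_with, logp_without))

# Hypothetical token log-probs for the continuation "So the answer is 16".
with_meta = [-0.2, -0.1, -0.1, -0.1, -0.3]      # prefix + meta + continuation
without_meta = [-0.4, -0.2, -0.3, -0.2, -2.5]   # prefix + continuation
reward = pmi_reward(with_meta, without_meta)
```

Here every token is more likely in the with-arm, so the reward comes out positive.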
The counterfactual contrast (Figure B): does the meta actually flip the model's own outcome?
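One way to read "flip the outcome" is as a four-way classification of the paired rollouts. The sketch below is an assumption about that bookkeeping, not the project's reward code; the labels and the ±1 values are hypothetical.

```python
from typing import Tuple

def cf_outcome(correct_on: bool, correct_off: bool) -> Tuple[str, float]:
    """Classify a meta-ON vs. meta-OFF pair and return (label, contrast)."""
    if correct_on and not correct_off:
        return "save", 1.0    # the meta turned a wrong answer into a right one
    if correct_off and not correct_on:
        return "break", -1.0  # the meta broke an answer that was already right
    return "no-op", 0.0       # same outcome either way: the meta changed nothing

label, contrast = cf_outcome(correct_on=True, correct_off=False)
```

Only the "save" and "break" cases carry a non-zero contrast, which is what makes the signal causal rather than a count of meta tokens.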
Teacher-forces both answers and asks which one the meta favors: gold (16) or the decoy (−2). On the worked example the slip-correcting meta lifts 16 a lot and lifts −2 only slightly, or even lowers it → the difference-in-differences is positive. A decorative meta lifts both equally → gm ≈ 0. There are two variants (additive and multiplicative), each with its own failure mode.
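The difference-in-differences can be sketched as follows, assuming four teacher-forced log-probabilities are available (gold and decoy, each scored after the meta and after a placebo). The function name and all numbers are hypothetical.

```python
def gm_score(gold_meta: float, gold_placebo: float,
             decoy_meta: float, decoy_placebo: float) -> float:
    """(lift of gold due to the meta) minus (lift of decoy due to the meta)."""
    gold_lift = gold_meta - gold_placebo
    decoy_lift = decoy_meta - decoy_placebo
    return gold_lift - decoy_lift

# Slip-correcting meta: gold rises sharply, the decoy falls.
corrective = gm_score(gold_meta=-0.5, gold_placebo=-4.0,
                      decoy_meta=-3.0, decoy_placebo=-0.6)

# Decorative meta: both answers rise by the same amount, so the lifts cancel.
decorative = gm_score(gold_meta=-0.7, gold_placebo=-1.0,
                      decoy_meta=-2.7, decoy_placebo=-3.0)
```

The second case is the point of the design: a meta that makes every answer look more likely earns nothing.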
An asymmetric counterfactual: harmful meta costs more than helpful meta earns, so the drift to "meta everywhere" is taxed.
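A minimal sketch of the asymmetry, assuming the reward scales the ON-minus-OFF score difference by α when the meta helps and by β when it hurts, with β > α. The default values and the function name are illustrative, not the project's settings.

```python
def asym_cf_reward(score_on: float, score_off: float,
                   alpha: float = 1.0, beta: float = 2.0) -> float:
    """Asymmetric ON/OFF contrast: losses are weighted more than gains."""
    if beta <= alpha:
        raise ValueError("the asymmetry requires beta > alpha")
    delta = score_on - score_off
    return alpha * delta if delta >= 0 else beta * delta

helped = asym_cf_reward(score_on=1.0, score_off=0.0)   # earns alpha
harmed = asym_cf_reward(score_on=0.0, score_off=1.0)   # costs beta
```

Because a harmful meta costs more than a helpful one earns, a policy that emits meta indiscriminately has negative expected reward unless its saves clearly outnumber its breaks.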
A temporal contrast of the model's own belief: the gold−decoy gap at meta-OPEN vs meta-CLOSE. On the worked example the body's sign slip leaves the dot on −2 (gap negative); the meta fixes the sign and the dot leaps to 16 (gap positive) — a decoy→gold reversal, the rewarded case (Figure C, step 4).
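The open-vs-close comparison can be sketched like this, assuming the gold and decoy log-probabilities are probed once at meta-OPEN and once at meta-CLOSE. Names and numbers are hypothetical.

```python
def gap(logp_gold: float, logp_decoy: float) -> float:
    """Gold-minus-decoy gap: negative means the belief sits on the decoy side."""
    return logp_gold - logp_decoy

def pmi_shift(gap_open: float, gap_close: float) -> float:
    """How far the meta moved the belief toward gold (negative = toward decoy)."""
    return gap_close - gap_open

def is_reversal(gap_open: float, gap_close: float) -> bool:
    """True for a decoy-to-gold flip, the case described as rewarded."""
    return gap_open < 0 < gap_close

gap_open = gap(logp_gold=-3.0, logp_decoy=-0.5)    # body's sign slip favors -2
gap_close = gap(logp_gold=-0.25, logp_decoy=-4.0)  # meta fixed the sign, favors 16
shift = pmi_shift(gap_open, gap_close)
```

The signal is bidirectional: a meta that talks the model out of a correct answer produces a negative shift.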
The belief bar gives the intuition; this shows the computation. One shared illustrative example flows through all three rewards, token by token. Problem: 7 − 3·(2 − 5). The body makes a sign slip leaning to the decoy −2; the meta catches it and the continuation lands on gold 16.
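For reference, the arithmetic behind the two answers can be checked directly. The specific slip shown (dropping the minus sign of 2 − 5) is an assumption; it is simply one sign error that reproduces the decoy −2.

```python
# Gold: evaluate the parenthesis with its sign intact.
inner = 2 - 5                 # -3
gold = 7 - 3 * inner          # 7 - (-9) = 16

# Decoy: the assumed slip treats (2 - 5) as +3 instead of -3.
slipped_inner = abs(2 - 5)    # 3
decoy = 7 - 3 * slipped_inner # 7 - 9 = -2
```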
"Meta helps on medium" is a result, not the intent: meta HURTS on easy (decoration), HELPS on medium (catches slips), HURTS on hard / AIME (derails). Reward shaping must select for the middle.
math_verify-graded on the 1030-eval; held-out Δ = acc_with − acc_without on the 594 confidence-rv set.
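The held-out Δ is a paired quantity: the same problems are graded with and without meta. A minimal sketch, assuming per-problem correctness flags for both arms are already available (the lists below are invented, not project data):

```python
from typing import Sequence

def held_out_delta(correct_with: Sequence[bool],
                   correct_without: Sequence[bool]) -> float:
    """acc_with - acc_without over the same set of problems."""
    if len(correct_with) != len(correct_without) or not correct_with:
        raise ValueError("need two non-empty, equally long verdict lists")
    n = len(correct_with)
    acc_with = sum(correct_with) / n
    acc_without = sum(correct_without) / n
    return acc_with - acc_without

# Hypothetical verdicts on 10 paired problems.
with_meta = [True, True, False, True, True, False, True, True, False, True]
without_meta = [True, True, False, True, False, False, True, True, False, True]
delta = held_out_delta(with_meta, without_meta)
```

Pairing matters because both arms see identical problems, so differences in problem difficulty cancel out of the comparison.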
| Name | Contrast (self-distill) | Result | Why |
|---|---|---|---|
| Base GRPO | final-answer correctness only (no meta) | gsm8k 0.93 · aime 0.37 · math500 0.63–0.72 (grader-sensitive) | reference |
| e21 (Meta-CoT) | meta inside `<think>` + GRPO on correctness | aime 0.13 · gsm8k 0.93 (tie) · math500 0.64–0.76 | always-on: aime collapses 0.37→0.13 (robust); overall gap within grader noise |
| r10v2 / SDC | KL toward a gold-conditioned (priming) teacher | steering AUC ≈ 0.49 (≈ chance) | coupling: priming suppresses epistemic (SDPO); not self-distill |
| PMI | gold vs no-meta | held-out Δ +0.010 · emit 0.96 | always-on |
| CF ★ | meta ON vs OFF; over-penalty on solved | held-out Δ +0.040 (best) · emit 0.20 · hard +0.052 | selectivity |
| cf_group | score the presence / shape of meta | meta emission → 0 (full abstention) | form ≠ behavior: CF correctly found form added nothing |
| gm additive | gold vs decoy × meta vs placebo; R = R_corr ⊕ gm | held-out Δ −0.003 (neutral) | always-on: epistemic preserved, no net gain |
| gm multiplicative | R_meta = R_corr · exp(sign(A_corr)·gm) | held-out Δ −0.029 (net-harm) | coupling: epi-words/meta 0.2 vs 1.4 |
| asym_cf (now) | asymmetric ON/OFF gate, β>α | fix confirmed · blocks drift; suppression weak (γ↑) | always-on · inert |
| PMI-shift (now) | gold vs decoy × before vs after meta | live: bidirectional · neg_rate ~0.10 · acc maintained | form ≠ behavior · ceiling |
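The two gm rows differ only in how the gm score is combined with the correctness reward. The sketch below contrasts them under stated assumptions: the ⊕ in the additive row is read as a weighted sum with a hypothetical `weight`, and the multiplicative form follows the formula in the table, R_meta = R_corr · exp(sign(A_corr)·gm).

```python
import math

def sign(x: float) -> int:
    """Return -1, 0, or +1."""
    return (x > 0) - (x < 0)

def reward_additive(r_corr: float, gm: float, weight: float = 0.5) -> float:
    """Assumed reading of R = R_corr (+) gm: gm enters independently of correctness."""
    return r_corr + weight * gm

def reward_multiplicative(r_corr: float, a_corr: float, gm: float) -> float:
    """R_meta = R_corr * exp(sign(A_corr) * gm): gm is tied to the correctness sign."""
    return r_corr * math.exp(sign(a_corr) * gm)

# Same gm score, opposite correctness advantages (hypothetical values).
amplified = reward_multiplicative(r_corr=1.0, a_corr=+0.5, gm=1.0)
shrunk = reward_multiplicative(r_corr=1.0, a_corr=-0.5, gm=1.0)
```

In the multiplicative form the same gm score pushes the reward up or down depending only on the sign of the correctness advantage, which is the coupling the Why column points to.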
The sign of the base − meta overall gap flips depending on the grader (the math500 LaTeX checker swings ±10pp). So no single overall number is trustworthy. Below: base vs e21r-meta (matched data) under three consistent graders.
| Grader | base | e21r-meta | overall gap |
|---|---|---|---|
| old check_correctness | 0.770 | 0.817 | meta +4.7pp |
| completion·math_verify | 0.813 | 0.762 | base +5.0pp |
| union (either correct) | 0.815 | 0.821 | −0.7pp (tie) |
Honest reading: meta ≈ base on easy/medium; the only robust degradation is on hard (aime). The trustworthy reward-quality signal is the paired held-out Δ below, not these fragile absolutes.
gsm8k ≈ tie · math500 within grader noise · aime collapses 0.13 vs base 0.37 (robust across every grader — the one clear degradation).
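The "union (either correct)" row of the grader table can be sketched as a per-item OR over two graders' verdicts. The verdict lists here are invented to show how two graders with the same accuracy can still disagree item by item.

```python
from typing import List, Sequence

def accuracy(verdicts: Sequence[bool]) -> float:
    """Fraction of items marked correct."""
    return sum(verdicts) / len(verdicts)

def union_verdicts(a: Sequence[bool], b: Sequence[bool]) -> List[bool]:
    """An item counts as correct if either grader accepts it."""
    if len(a) != len(b):
        raise ValueError("graders must score the same items")
    return [x or y for x, y in zip(a, b)]

# Hypothetical verdicts from two graders on the same four completions.
grader_a = [True, False, True, False]
grader_b = [True, True, False, False]
union = union_verdicts(grader_a, grader_b)
```

The union can only raise accuracy relative to either grader alone, so it removes false negatives from answer-format parsing but cannot catch a false positive that either grader lets through.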
Seven recurring mechanisms. Every method maps to ≥1. Colors are reused across the page.
(1) Emit meta only when needed → helps. (CF)
(2) No necessity check → meta fires everywhere → breaks already-solved problems. (PMI, gm)
(3) Reward the SHAPE of meta, not real cognition → decoration with no causal substance. (cf_group)
(4) Tie meta reward to the correctness sign → over-confident meta → suppresses epistemic hedging → worse. (gm mult; SDPO)
(5) Meta comes from the SAME model that erred → inherits the blind spot → confirms the error. (root cause)
(6) SFT on easy / decorative meta degrades the base's hard-math capability + no save signal. (rv data)
(7) A group-CONSTANT reward, centered within its own group, equals the group mean, so the centered value is zero → signal vanishes silently. (asym_cf, fixed)
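The last mechanism is easy to reproduce. The sketch below assumes GRPO-style centering, where each rollout's reward has its group mean subtracted; the reward values are made up.

```python
from typing import List, Sequence

def center_within_group(rewards: Sequence[float]) -> List[float]:
    """Subtract the group mean from every rollout's reward."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

# A reward term that takes the same value for every rollout in the group.
constant_group = center_within_group([0.5, 0.5, 0.5, 0.5])

# A reward term that varies across rollouts keeps its signal after centering.
varying_group = center_within_group([1.0, 0.0, 1.0, 0.0])
```

The constant group comes out as all zeros, so the term contributes no gradient and raises no error, which is why the failure is silent.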
Plain solutions.
v8_base_matched (6329 rows, hard 24%) vs v8_meta_inside (4264 rows, hard 31%, meta inside <think>). e21 trained here.
Redirect / verify scenarios + confidence labels. CF / gm / asym_cf trained here. Mechanism (6): all easy/medium.
Conditional reward — the asym_cf gate (drift blocked) and PMI-shift (bidirectional belief signal) — REDUCES the harm meta does on hard problems and stops the derail. The gate fix is live; suppression is still weak (γ↑ next).
Meta is roughly on par with base on easy/medium (gsm8k tie, math500 within grader noise) but robustly collapses on hard (AIME 0.37 → 0.13). That AIME drop is a CAPABILITY gap (the easy-only meta SFT degraded hard-math ability) plus the self-verification ceiling — reward shaping cannot create a capability the model lacks. We do not yet show a robust absolute WIN anywhere. No overclaiming.
(a) preserve base capability — train meta WITHOUT easy-only data / from a strong base.
(b) inject INDEPENDENCE — external tools, code-check, multi-sample consistency — to break the self-verification ceiling.
Meta is generated by the SAME model that erred, drawn from the SAME distribution, so it inherits the blind spot and tends to confirm the error rather than catch it. CF's +0.04 is small for exactly this reason: even perfectly selective, self-distilled meta is only as good as the self-check the model can produce. Independence (tools / external checks / consistency across samples) is the lever reward shaping cannot reach.
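Of the independence levers named above, consistency across samples is the simplest to sketch. This is a plain majority vote over final answers, offered as an illustration only; the function name is hypothetical, and samples drawn from the same model remain only partly independent.

```python
from collections import Counter
from typing import Sequence, Tuple

def consistency_check(answers: Sequence[str]) -> Tuple[str, float]:
    """Return the most common final answer and the fraction of samples agreeing."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)

# Hypothetical final answers from four sampled rollouts of the worked example.
majority, agreement = consistency_check(["16", "16", "-2", "16"])
```

A low agreement score could serve as a trigger for meta that does not rely on the single rollout's own self-check.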