Large Reasoning Models (LRMs) often reach a correct solution before their long Chain-of-Thought trace ends, yet continue with redundant verification, repeated attempts, or unnecessary exploration that wastes computation and can even overturn the correct answer. We frame this behavior as a latent productive-to-redundant transition and show that it is directly reflected in hidden states: around first-correct-solution (FCS) boundaries, late-layer representations separate efficient from overthinking tokens, while boundary-permutation and position-control baselines collapse. Based on this signal, we propose ROM, a model-agnostic streaming intervention framework that monitors frozen LRMs with a lightweight hidden-state detector and intervenes at well-formed reasoning boundaries. Counterfactual Self-Correction (CSC) augments supervision with balanced wrong→correct trajectories, preserving useful pre-FCS correction while labeling only post-FCS continuation as redundant. Across MATH500, GSM8K, AIME25, and MMLU-Pro, ROM improves the overall tradeoff on both Qwen3-8B and DeepSeek-R1-Distill-Qwen-32B (DS-32B): on Qwen3-8B, accuracy rises from 74.47% to 74.78% with response length reduced from 4262 to 3107 tokens; on DS-32B, accuracy rises from 68.60% to 68.72% with length reduced from 3062 to 2319 tokens. The same FCS-derived supervision transfers across scale and training origin. ROM is also compatible with the RL length-control method L1, removing a further 20.9–21.6% of tokens at zero accuracy loss; it generalizes to open-ended MMLU-Pro (+1.56 pp accuracy, 35.4% shorter responses) and reduces wall-clock latency by 46.5%.
Existing methods approximate the productive→redundant transition through length-RL objectives, hand-crafted decoding signals, or chunk-level answer-correctness. ROM directly supervises the boundary itself from token-level FCS labels, while remaining backbone-portable and operating at every token.
| Method | Control Signal | Backbone Portable | Token Level | Direct Transition Supervision |
|---|---|---|---|---|
| L1 | RL length reward | ✗ | ✗ | ✗ |
| O1-Pruner | RL/SFT length | ✗ | ✗ | ✗ |
| Certaindex | Answer stability | ✓ | ✗ | ✗ |
| DEER | Transition confidence | ✓ | ✗ | ✗ |
| EAT | Entropy trajectory | ✓ | ✗ | ✗ |
| SyncThink | `</think>` attention | ✓ | ✓ | ✗ |
| RCPD | Termination-token dynamics | ✓ | ✓ | ✗ |
| Reasoning Probing | Hidden states (chunk-level answer correctness) | ✓ | ✗ | ✗ |
| ROM (Ours) | Hidden states (online latent boundary) | ✓ | ✓ | ✓ |
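As a rough illustration of what token-level FCS labels could look like, the sketch below converts attempt-level correctness into per-token labels. The `Attempt` container, its field names, and the choice to place the boundary at the end of the first correct attempt are assumptions made for this example, not the authors' released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Attempt:
    # Hypothetical container: one solution attempt inside a reasoning trace.
    start: int     # index of the attempt's first token (inclusive)
    end: int       # index one past the attempt's last token (exclusive)
    correct: bool  # whether the attempt's answer matches the reference


def fcs_token_labels(num_tokens: int, attempts: List[Attempt]) -> List[int]:
    """Label tokens 0 (efficient) before the FCS boundary and 1 (overthinking) after it."""
    boundary = num_tokens  # no correct attempt: nothing is labeled redundant
    for attempt in sorted(attempts, key=lambda a: a.start):
        if attempt.correct:
            boundary = attempt.end  # assumed boundary: end of the first correct attempt
            break
    return [0 if t < boundary else 1 for t in range(num_tokens)]


# A wrong attempt, then the first correct one, then a redundant re-check.
trace = [Attempt(0, 4, False), Attempt(4, 7, True), Attempt(7, 10, True)]
labels = fcs_token_labels(10, trace)
print(labels)  # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
```

In this toy trace the initial wrong attempt stays labeled as efficient, which mirrors the idea that pre-FCS correction is useful and only post-FCS continuation counts as redundant.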
We align MATH500 traces at each overthinking boundary, take the last 20 efficient and first 20 overthinking tokens, and probe Qwen3-8B's late-layer hidden states with logistic regression. The two phases are linearly separable at 85.9% accuracy / AUROC 0.928 despite often discussing the same content; probe scores rise sharply at the true boundary but stay flat under permuted alignment; and the signal is not a position shortcut (position-only classifier reaches only 50.3%, and position-residualized hidden states still separate at 86.4%).
(a) t-SNE of late-layer hidden states separates efficient (pre-FCS) from overthinking (post-FCS) tokens.
(b) Boundary-aligned probe scores rise sharply at the true FCS; permuted alignment stays flat.
(c) Position-only and permuted-boundary controls collapse to ~50%, ruling out a position shortcut.
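The boundary-aligned windows and the permuted-boundary control described above can be sketched as follows. This is one plausible reading of the setup: the function names are hypothetical, integers stand in for hidden-state vectors, and the way the control boundary is drawn is an assumption rather than the documented procedure.

```python
import random
from typing import List, Sequence, Tuple


def boundary_window(states: Sequence, boundary: int, width: int = 20) -> Tuple[List, List[int]]:
    """Return up to `width` states on each side of `boundary`, labeled 0 (before) / 1 (after)."""
    before = list(states[max(0, boundary - width):boundary])
    after = list(states[boundary:boundary + width])
    return before + after, [0] * len(before) + [1] * len(after)


def permuted_boundary(num_tokens: int, true_boundary: int, width: int, rng: random.Random) -> int:
    """Control: draw a random boundary that leaves a full window and differs from the true one."""
    candidates = [b for b in range(width, num_tokens - width + 1) if b != true_boundary]
    return rng.choice(candidates)


states = list(range(100))  # stand-ins for per-token hidden-state vectors
x, y = boundary_window(states, boundary=60)
print(x[0], x[-1], sum(y))  # 40 79 20

rng = random.Random(0)
fake = permuted_boundary(len(states), 60, 20, rng)
x_ctrl, y_ctrl = boundary_window(states, fake)
```

A linear probe such as logistic regression would then be fit once on `(x, y)` and once on `(x_ctrl, y_ctrl)`; if the signal is tied to the true boundary, only the first fit should separate the two phases well.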
ROM has four components: (1) latent boundary supervision that converts attempt-level correctness into token-level labels at the First-Correct-Solution (FCS) boundary; (2) Counterfactual Self-Correction (CSC), which synthesizes balanced wrong→correct trajectories so the detector learns a semantic boundary around sufficiency rather than an attempt-index shortcut; (3) a streaming detector on frozen late-layer hidden states (Qwen3-8B L32/36; DS-32B L56/64) that emits a per-token overthinking score $p_t$ in lockstep with decoding; and (4) boundary-aware intervention that, once $p_t$ crosses the threshold, backtracks to the nearest clean sentence/solution boundary and prompts a final answer.
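Components (3) and (4) can be pictured with a minimal streaming loop. In this sketch the per-token scores $p_t$ are supplied directly instead of being computed from hidden states, and the boundary markers, the threshold of 0.5, and the wording of the final-answer prompt are all assumptions chosen for illustration.

```python
from typing import Iterable, List, Tuple

SENTENCE_END = (".", "!", "?", "\n")  # assumed markers of a clean boundary
FINAL_PROMPT = "\nFinal answer:"      # hypothetical wording of the answer prompt


def stream_with_intervention(steps: Iterable[Tuple[str, float]], threshold: float = 0.5) -> str:
    """Consume (token, p_t) pairs; on the first p_t >= threshold, cut back to the last boundary."""
    kept: List[str] = []
    last_boundary = 0  # number of kept tokens up to and including the last sentence end
    for token, p_t in steps:
        if p_t >= threshold:
            # Backtrack: drop the unfinished sentence, then prompt for the answer.
            return "".join(kept[:last_boundary]) + FINAL_PROMPT
        kept.append(token)
        if token.endswith(SENTENCE_END):
            last_boundary = len(kept)
    return "".join(kept)  # detector never fired: keep the full trace


steps = [("x", 0.1), (" =", 0.1), (" 4", 0.2), (".", 0.2),
         (" Wait", 0.4), (",", 0.6), (" let", 0.7)]
print(repr(stream_with_intervention(steps)))  # 'x = 4.\nFinal answer:'
```

Cutting at the last completed sentence rather than at the exact token where the score crosses the threshold keeps the retained reasoning well-formed before the model is asked to answer.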
We evaluate on MATH500, GSM8K, AIME25, and MMLU-Pro with two backbones: Qwen3-8B (RL post-trained) and DeepSeek-R1-Distill-Qwen-32B (DS-32B, distilled, 4× larger). ROMCSC denotes ROM trained with CSC-augmented supervision. Acc = accuracy (%), SL = response length (tokens), SE = Acc/SL×100. Bold = best per backbone, underlined = second best. Higher is better for Acc and SE; lower is better for SL.
| Backbone | Method | MATH500 Acc | MATH500 SL | MATH500 SE | GSM8K Acc | GSM8K SL | GSM8K SE | AIME25 Acc | AIME25 SL | AIME25 SE | MMLU-Pro Acc | MMLU-Pro SL | MMLU-Pro SE | Overall Acc | Overall SL | Overall SE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-8B | Vanilla | 89.00 | 4297 | 2.07 | 100.00 | 2041 | 4.90 | 32.22 | 7869 | 0.41 | 76.67 | 2840 | 2.70 | 74.47 | 4262 | 2.52 |
| Qwen3-8B | L1 | 89.00 | 2603 | 3.42 | 100.00 | 1259 | 7.94 | 36.67 | 6489 | 0.57 | 71.43 | 1401 | 5.10 | 74.27 | 2938 | 4.26 |
| Qwen3-8B | EAT | 89.00 | 4297 | 2.07 | 100.00 | 1594 | 6.27 | 31.11 | 6719 | 0.46 | 76.67 | 2052 | 3.74 | 74.20 | 3666 | 3.14 |
| Qwen3-8B | Cut2048 | 81.67 | 2562 | 3.19 | 98.33 | 1990 | 4.94 | 24.44 | 3663 | 0.67 | 79.05 | 2231 | 3.54 | 70.87 | 2612 | 3.09 |
| Qwen3-8B | ROM | 90.00 | 3013 | 2.99 | 100.00 | 1118 | 8.94 | 32.22 | 6698 | 0.48 | 76.19 | 2160 | 3.53 | 74.60 | 3247 | 3.99 |
| Qwen3-8B | ROMCSC | 88.33 | 2784 | 3.17 | 100.00 | 1060 | 9.43 | 32.22 | 6708 | 0.48 | 78.57 | 1875 | 4.19 | 74.78 | 3107 | 4.32 |
| DS-32B | Vanilla | 83.33 | 3481 | 2.39 | 92.49 | 459 | 20.15 | 33.33 | 6940 | 0.48 | 65.24 | 1368 | 4.77 | 68.60 | 3062 | 6.95 |
| DS-32B | EAT | 72.67 | 1809 | 4.02 | 91.66 | 478 | 19.18 | 30.00 | 6587 | 0.46 | 61.90 | 1064 | 5.82 | 64.06 | 2485 | 7.37 |
| DS-32B | Cut2048 | 76.67 | 3479 | 2.20 | 92.49 | 457 | 20.24 | 21.11 | 6937 | 0.30 | 61.90 | 1365 | 4.54 | 63.04 | 3060 | 6.82 |
| DS-32B | ROM | 78.00 | 2390 | 3.26 | 92.49 | 455 | 20.33 | 26.67 | 5469 | 0.49 | 69.05 | 587 | 11.76 | 66.55 | 2225 | 8.96 |
| DS-32B | ROMCSC | 80.33 | 2310 | 3.48 | 92.49 | 428 | 21.61 | 31.11 | 5901 | 0.53 | 70.95 | 638 | 11.12 | 68.72 | 2319 | 9.18 |
Latent-boundary control gives the strongest accuracy–efficiency tradeoff. ROMCSC raises Qwen3-8B SE from 2.52 to 4.32 (74.47%→74.78% accuracy), and gives DS-32B the best overall SE (9.18) and accuracy (68.72%). The same FCS-derived supervision transfers across a 4× scale gap and across RL-post-trained vs. distilled origins.
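For readers checking the numbers, the short script below recomputes SE = Acc/SL×100 for the Qwen3-8B Vanilla row. The aggregation rule is inferred from the reported values rather than stated in the text: the Overall columns appear to be unweighted means over the four benchmarks, with Overall SE being the mean of the per-benchmark SE values rather than Overall Acc divided by Overall SL.

```python
from statistics import mean

# Qwen3-8B Vanilla row from the results table: benchmark -> (Acc %, SL tokens)
vanilla = {
    "MATH500": (89.00, 4297),
    "GSM8K": (100.00, 2041),
    "AIME25": (32.22, 7869),
    "MMLU-Pro": (76.67, 2840),
}


def se(acc: float, sl: float) -> float:
    """Efficiency score as defined in the caption: SE = Acc / SL * 100."""
    return acc / sl * 100


per_benchmark = {name: se(acc, sl) for name, (acc, sl) in vanilla.items()}
overall_acc = mean(acc for acc, _ in vanilla.values())
overall_sl = mean(sl for _, sl in vanilla.values())
# Assumed aggregation: mean of per-benchmark SE, not overall_acc / overall_sl.
overall_se = mean(per_benchmark.values())

print(round(per_benchmark["MATH500"], 2))                              # 2.07
print(round(overall_acc, 2), round(overall_sl), round(overall_se, 2))  # 74.47 4262 2.52
```

Under this reading, benchmarks with short responses such as GSM8K weigh heavily in Overall SE, which is worth keeping in mind when comparing methods across backbones.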
End-to-end latency. On GSM8K with Qwen3-8B, ROMCSC reduces wall-clock time by 46.5% (53.3→28.5 s) with only +4.7% per-token overhead.
Setup: 64 non-numerical MMLU-Pro problems with the answer options removed; Qwen3-8B, $n{=}3$; GPT-4o judges the answer that follows `</think>`. SE = Acc/SL×100. The signal transfers to free-form answers, suggesting that it tracks a hidden-state phase rather than an answer-format match.
| Metric | Vanilla | ROMCSC | Δ |
|---|---|---|---|
| Accuracy (%) | 80.21±0.90 | 81.77±0.90 | +1.56 pp |
| Response Length | 2457±16 | 1587±25 | −35.4% |
| SE | 3.27 | 5.15 | +57.7% |
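ROMCSC is stacked on L1-Qwen3-8B-Max (a Qwen3-8B already RL-finetuned for length control) and evaluated on MATH500 and GSM8K (40 problems each, $n{=}3$). Accuracy is unchanged while a further 20.9–21.6% of tokens are removed, which suggests that per-instance latent-boundary detection is orthogonal to global length budgets.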
| Metric | MATH500 L1 | MATH500 +ROMCSC | MATH500 Δ | GSM8K L1 | GSM8K +ROMCSC | GSM8K Δ |
|---|---|---|---|---|---|---|
| Acc (%) | 90.83 | 90.83 | 0.00 pp | 100.00 | 100.00 | 0.00 pp |
| SL | 2684 | 2105 | −21.6% | 1197 | 947 | −20.9% |
| SE | 3.38 | 4.31 | +27.5% | 8.35 | 10.56 | +26.4% |
ROM is not sensitive to the choice of layer or threshold, and the latent-monitoring overhead is small enough that the shorter decoded sequence yields a net wall-clock speedup at deployment.
(a) Layer sensitivity. When the head is re-trained on each layer from late to final, accuracy stays within 0.40 pp (Qwen3-8B) / 1.50 pp (DS-32B) of Vanilla while 20–26% of tokens are cut.
(b) Threshold sensitivity. On GSM8K, thresholds 0.4–0.7 preserve 100% accuracy while token savings change smoothly from 56% to 40%.
(c) End-to-end latency. On GSM8K with Qwen3-8B, ROMCSC reduces wall-clock time by 46.5% (53.3→28.5 s) with only +4.7% per-token overhead.
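The reported speedup can be sanity-checked with a back-of-envelope calculation. The estimate below assumes that wall-clock time scales linearly with the number of decoded tokens and that the GSM8K response lengths from the main results table (2041 vs. 1060 tokens) apply to the latency run; neither assumption is confirmed by the text, so the estimate is only a rough cross-check.

```python
# Measured wall-clock times on GSM8K with Qwen3-8B (seconds), as reported.
t_vanilla, t_rom = 53.3, 28.5
measured_reduction = 1 - t_rom / t_vanilla

# Back-of-envelope estimate (assumption): time ~ tokens * per-token cost,
# using GSM8K response lengths from the main results table.
sl_vanilla, sl_rom = 2041, 1060
per_token_overhead = 0.047  # reported +4.7% per-token overhead
estimated_reduction = 1 - (sl_rom / sl_vanilla) * (1 + per_token_overhead)

print(round(measured_reduction * 100, 1))   # 46.5
print(round(estimated_reduction * 100, 1))  # 45.6
```

The two figures land within about one percentage point of each other, which is consistent with the claim that the shorter decoded sequence more than offsets the monitoring overhead.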
@misc{wang2026romrealtimeoverthinkingmitigation,
title={ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention},
author={Xinyan Wang and Xiaogeng Liu and Chaowei Xiao},
year={2026},
eprint={2603.22016},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2603.22016},
}