ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention

¹University of Wisconsin–Madison, ²Johns Hopkins University

Abstract

Large Reasoning Models (LRMs) often reach a correct solution before their long Chain-of-Thought trace ends, yet continue with redundant verification, repeated attempts, or unnecessary exploration that wastes computation and can even overturn the correct answer. We frame this behavior as a latent productive-to-redundant transition and show that it is directly reflected in hidden states: around first-correct-solution (FCS) boundaries, late-layer representations separate efficient from overthinking tokens, while boundary-permutation and position-control baselines collapse. Based on this signal, we propose ROM, a model-agnostic streaming intervention framework that monitors frozen LRMs with a lightweight hidden-state detector and intervenes at well-formed reasoning boundaries. Counterfactual Self-Correction (CSC) augments supervision with balanced wrong→correct trajectories, preserving useful pre-FCS correction while labeling only post-FCS continuation as redundant. Across MATH500, GSM8K, AIME25, and MMLU-Pro, ROM improves the overall tradeoff on both Qwen3-8B and DeepSeek-R1-Distill-Qwen-32B (DS-32B): on Qwen3-8B, accuracy rises from 74.47% to 74.78% while response length drops from 4262 to 3107 tokens; on DS-32B, accuracy rises from 68.60% to 68.72% while length drops from 3062 to 2319 tokens. The same FCS-derived supervision transfers across scale and training origin. ROM is also compatible with L1, removing another 20.9–21.6% of tokens at zero accuracy loss; it generalizes to open-ended MMLU-Pro (+1.56 pp, 35.4% shorter) and reduces wall-clock latency by 46.5%.

How ROM Differs from Prior Work

Existing methods approximate the productive→redundant transition through length-RL objectives, hand-crafted decoding signals, or chunk-level answer-correctness. ROM directly supervises the boundary itself from token-level FCS labels, while remaining backbone-portable and operating at every token.

| Method | Control Signal |
|---|---|
| L1 | RL length reward |
| O1-Pruner | RL/SFT length |
| Certaindex | Answer stability |
| DEER | Transition confidence |
| EAT | Entropy trajectory |
| SyncThink | `</think>` attention |
| RCPD | Termination-token dynamics |
| Reasoning Probing | Hidden states (chunk-level answer correctness) |
| ROM (Ours) | Hidden states (online latent boundary) |

Motivation: The Productive→Redundant Boundary Is a Decodable Latent Event

We align MATH500 traces at each overthinking boundary, take the last 20 efficient and first 20 overthinking tokens, and probe Qwen3-8B's late-layer hidden states with logistic regression. The two phases are linearly separable at 85.9% accuracy / AUROC 0.928 despite often discussing the same content; probe scores rise sharply at the true boundary but stay flat under permuted alignment; and the signal is not a position shortcut (position-only classifier reaches only 50.3%, and position-residualized hidden states still separate at 86.4%).
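The boundary-aligned windowing described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the use of plain Python lists as stand-ins for late-layer hidden states, and the way the permuted control draws a random boundary are all assumptions. The resulting features and labels would then be passed to any logistic-regression implementation.

```python
import random

def boundary_window(hidden, boundary, k=20):
    """Return (features, labels) around one boundary.

    hidden   : list of per-token feature vectors (stand-ins for hidden states)
    boundary : index of the first overthinking token
    Keeps the last k efficient tokens (label 0) and the first k
    overthinking tokens (label 1), clipped to the trace length.
    """
    lo = max(0, boundary - k)
    hi = min(len(hidden), boundary + k)
    feats = hidden[lo:hi]
    labels = [0] * (boundary - lo) + [1] * (hi - boundary)
    return feats, labels

def permuted_boundary(n_tokens, k=20, rng=None):
    """Control (assumed form): a random boundary that still leaves room
    for a full window of k tokens on each side. Requires n_tokens >= 2 * k."""
    rng = rng or random.Random(0)
    return rng.randrange(k, n_tokens - k + 1)

# Toy trace of 100 tokens with a one-dimensional "hidden state" per token.
trace = [[float(t)] for t in range(100)]
feats, labels = boundary_window(trace, boundary=60)
print(len(feats), sum(labels))  # 40 tokens in the window, 20 labeled overthinking
```

Under a permuted boundary the labels no longer line up with the true transition, so a probe trained on them should fall to chance if the signal is real.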

(a) t-SNE of late-layer hidden states around the FCS boundary: the projection separates efficient (pre-FCS) from overthinking (post-FCS) tokens.

(b) Probe score aligned at the FCS boundary: scores rise sharply at the true FCS, while permuted alignment stays flat.

(c) Position-control ablation: position-only and permuted-boundary controls collapse to ~50%, ruling out a position shortcut.

Method

ROM has four components: (1) latent boundary supervision that converts attempt-level correctness into token-level labels at the First-Correct-Solution (FCS) boundary; (2) Counterfactual Self-Correction (CSC), which synthesizes balanced wrong→correct trajectories so the detector learns a semantic boundary around sufficiency rather than an attempt-index shortcut; (3) a streaming detector on frozen late-layer hidden states (Qwen3-8B: layer 32 of 36; DS-32B: layer 56 of 64) that emits a per-token overthinking score $p_t$ in lockstep with decoding; and (4) boundary-aware intervention that, once $p_t$ crosses the threshold, backtracks to the nearest clean sentence/solution boundary and prompts a final answer.
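Component (1) can be illustrated with a small labeling routine. This is a sketch under stated assumptions: the `(num_tokens, is_correct)` attempt format and the function name are hypothetical, and traces with no correct attempt are simply labeled all-efficient here, which the source does not specify.

```python
def fcs_token_labels(attempts):
    """Turn attempt-level correctness into token-level labels.

    attempts : list of (num_tokens, is_correct) in generation order
    Returns one label per token: 0 (efficient) up to and including the
    first correct attempt, 1 (overthinking) for everything after it.
    """
    labels = []
    seen_correct = False
    for num_tokens, is_correct in attempts:
        labels.extend([1 if seen_correct else 0] * num_tokens)
        if is_correct:
            seen_correct = True
    return labels

# Direct solve: the boundary falls after attempt 1.
print(fcs_token_labels([(4, True), (3, True)]))
# Wrong -> correct trajectory (the CSC case): the boundary falls after attempt 2,
# so the corrective second attempt stays labeled as efficient.
print(fcs_token_labels([(5, False), (4, True), (3, True)]))
```

Mixing both kinds of trajectory means the boundary does not always follow the first attempt, which is the attempt-index shortcut that CSC is meant to remove.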

Figure: ROM framework overview.
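Components (3) and (4) amount to a loop that watches $p_t$ and cuts the trace back to a clean boundary once the threshold is crossed. The sketch below assumes precomputed scores, string tokens, and a simple "ends with a period or newline" boundary test; the function name, the default threshold of 0.5, and the boundary rule are illustrative choices, not the authors' implementation. The answer-eliciting prompt appended after the cut is left to the caller, since its wording is not given here.

```python
def stream_with_intervention(tokens, scores, threshold=0.5,
                             boundary_marks=(".", "\n")):
    """Consume (token, score) pairs in order and stop at the first
    score above the threshold.

    Returns (kept_tokens, triggered). On trigger, kept_tokens is cut back
    to the most recent token ending in a boundary mark, so the prefix ends
    on a complete sentence. If no boundary was seen, the prefix is empty.
    """
    kept = []
    last_boundary = 0  # length of the longest prefix ending on a boundary
    for tok, p in zip(tokens, scores):
        if p > threshold:
            return kept[:last_boundary], True
        kept.append(tok)
        if tok.endswith(boundary_marks):
            last_boundary = len(kept)
    return kept, False

tokens = ["x", "=", "4", ".", "Wait", ",", "check", "."]
scores = [0.1, 0.1, 0.2, 0.2, 0.3, 0.8, 0.9, 0.9]
prefix, triggered = stream_with_intervention(tokens, scores)
print(prefix, triggered)  # ['x', '=', '4', '.'] True
```

Backtracking to the last boundary rather than cutting at the triggering token keeps the retained reasoning well-formed before the model is asked for its final answer.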

Main Results

Evaluated on MATH500, GSM8K, AIME25, and MMLU-Pro across two backbones: Qwen3-8B (RL post-trained) and DeepSeek-R1-Distill-Qwen-32B (DS-32B, distilled, 4× larger). Acc = accuracy (%), SL = response length (tokens), SE = Acc/SL×100. Bold = best per backbone, underlined = second best. Higher is better for Acc and SE; lower is better for SL.

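| Backbone | Method | MATH500 Acc | MATH500 SL | MATH500 SE | GSM8K Acc | GSM8K SL | GSM8K SE | AIME25 Acc | AIME25 SL | AIME25 SE | MMLU-Pro Acc | MMLU-Pro SL | MMLU-Pro SE | Overall Acc | Overall SL | Overall SE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-8B | Vanilla | 89.00 | 4297 | 2.07 | 100.00 | 2041 | 4.90 | 32.22 | 7869 | 0.41 | 76.67 | 2840 | 2.70 | 74.47 | 4262 | 2.52 |
| Qwen3-8B | L1 | 89.00 | 2603 | 3.42 | 100.00 | 1259 | 7.94 | 36.67 | 6489 | 0.57 | 71.43 | 1401 | 5.10 | 74.27 | 2938 | 4.26 |
| Qwen3-8B | EAT | 89.00 | 4297 | 2.07 | 100.00 | 1594 | 6.27 | 31.11 | 6719 | 0.46 | 76.67 | 2052 | 3.74 | 74.20 | 3666 | 3.14 |
| Qwen3-8B | Cut2048 | 81.67 | 2562 | 3.19 | 98.33 | 1990 | 4.94 | 24.44 | 3663 | 0.67 | 79.05 | 2231 | 3.54 | 70.87 | 2612 | 3.09 |
| Qwen3-8B | ROM | 90.00 | 3013 | 2.99 | 100.00 | 1118 | 8.94 | 32.22 | 6698 | 0.48 | 76.19 | 2160 | 3.53 | 74.60 | 3247 | 3.99 |
| Qwen3-8B | ROM$_{\text{CSC}}$ | 88.33 | 2784 | 3.17 | 100.00 | 1060 | 9.43 | 32.22 | 6708 | 0.48 | 78.57 | 1875 | 4.19 | 74.78 | 3107 | 4.32 |
| DS-32B | Vanilla | 83.33 | 3481 | 2.39 | 92.49 | 459 | 20.15 | 33.33 | 6940 | 0.48 | 65.24 | 1368 | 4.77 | 68.60 | 3062 | 6.95 |
| DS-32B | EAT | 72.67 | 1809 | 4.02 | 91.66 | 478 | 19.18 | 30.00 | 6587 | 0.46 | 61.90 | 1064 | 5.82 | 64.06 | 2485 | 7.37 |
| DS-32B | Cut2048 | 76.67 | 3479 | 2.20 | 92.49 | 457 | 20.24 | 21.11 | 6937 | 0.30 | 61.90 | 1365 | 4.54 | 63.04 | 3060 | 6.82 |
| DS-32B | ROM | 78.00 | 2390 | 3.26 | 92.49 | 455 | 20.33 | 26.67 | 5469 | 0.49 | 69.05 | 587 | 11.76 | 66.55 | 2225 | 8.96 |
| DS-32B | ROM$_{\text{CSC}}$ | 80.33 | 2310 | 3.48 | 92.49 | 428 | 21.61 | 31.11 | 5901 | 0.53 | 70.95 | 638 | 11.12 | 68.72 | 2319 | 9.18 |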

Latent-boundary control gives the strongest accuracy–efficiency tradeoff. ROM$_{\text{CSC}}$ raises Qwen3-8B SE from 2.52 to 4.32 (74.47%→74.78% accuracy), and gives DS-32B the best overall SE (9.18) and accuracy (68.72%). The same FCS-derived supervision transfers across a 4× scale gap and across RL-post-trained vs. distilled origins.
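As a worked check of the SE definition, the snippet below recomputes the Qwen3-8B Vanilla row from its reported Acc and SL values. The variable names are illustrative. The reported Overall figures are consistent with an unweighted mean of the four per-benchmark columns, so Overall SE is the mean of the per-benchmark SE values rather than Overall Acc divided by Overall SL.

```python
def se(acc, sl):
    """Efficiency score: accuracy (%) / response length (tokens) x 100."""
    return acc / sl * 100

# Qwen3-8B Vanilla: (Acc, SL) per benchmark, copied from the results table.
vanilla = {
    "MATH500": (89.00, 4297),
    "GSM8K": (100.00, 2041),
    "AIME25": (32.22, 7869),
    "MMLU-Pro": (76.67, 2840),
}

per_bench = {name: round(se(acc, sl), 2) for name, (acc, sl) in vanilla.items()}
overall_acc = sum(acc for acc, _ in vanilla.values()) / len(vanilla)
overall_sl = sum(sl for _, sl in vanilla.values()) / len(vanilla)
overall_se = sum(per_bench.values()) / len(per_bench)  # mean of per-benchmark SE

print(per_bench)
print(round(overall_acc, 2), round(overall_sl), round(overall_se, 2))  # 74.47 4262 2.52
```

Computing SE directly from the Overall Acc and SL (74.47 / 4262 × 100) would give roughly 1.75 instead of 2.52, so the averaging order matters when comparing Overall SE across methods.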

End-to-end latency. On GSM8K with Qwen3-8B, ROM$_{\text{CSC}}$ reduces wall-clock time by 46.5% (53.3→28.5 s) with only +4.7% per-token overhead.
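The latency figure can be sanity-checked against the token counts in the results table. The sketch below assumes wall-clock time scales as tokens decoded times time per token, and that the latency run used the same GSM8K setting as the table; both are assumptions, so the second number is a rough prediction rather than a reported result.

```python
def relative_change(before, after):
    """Signed relative change in percent."""
    return (after - before) / before * 100

# Reported wall-clock seconds on GSM8K with Qwen3-8B.
wall_clock = relative_change(53.3, 28.5)

# Assumption: time ~ tokens x per-token cost.
token_ratio = 1060 / 2041                    # ROM_CSC SL / Vanilla SL on GSM8K
predicted = (token_ratio * 1.047 - 1) * 100  # apply the +4.7% per-token overhead

print(round(wall_clock, 1), round(predicted, 1))  # -46.5 -45.6
```

The two values agree to within about one percentage point, which is consistent with the speedup coming almost entirely from the shorter decoded sequence.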

Open-Ended MMLU-Pro

64 non-numerical MMLU-Pro problems with answer options removed; Qwen3-8B, $n{=}3$, judged by GPT-4o on the post-</think> answer. SE = Acc/SL×100. The signal transfers to free-form answers, suggesting it tracks hidden-state phase rather than answer-format match.

| Metric | Vanilla | ROM$_{\text{CSC}}$ | Δ |
|---|---|---|---|
| Accuracy (%) | 80.21±0.90 | 81.77±0.90 | +1.56 pp |
| Response Length | 2457±16 | 1587±25 | −35.4% |
| SE | 3.27 | 5.15 | +57.7% |

Compatibility with L1

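We stack ROM$_{\text{CSC}}$ on L1-Qwen3-8B-Max (a Qwen3-8B already RL-finetuned for length control) and evaluate on MATH500 and GSM8K (40 problems each, $n{=}3$). Token counts drop by a further 20.9–21.6% with no change in accuracy, indicating that per-instance latent-boundary detection is orthogonal to global length budgets.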

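| Metric | MATH500 L1 | MATH500 +ROM$_{\text{CSC}}$ | MATH500 Δ | GSM8K L1 | GSM8K +ROM$_{\text{CSC}}$ | GSM8K Δ |
|---|---|---|---|---|---|---|
| Acc (%) | 90.83 | 90.83 | 0.00 pp | 100.00 | 100.00 | 0.00 pp |
| SL | 2684 | 2105 | −21.6% | 1197 | 947 | −20.9% |
| SE | 3.38 | 4.31 | +27.5% | 8.35 | 10.56 | +26.4% |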

Robustness and Latency

ROM is not sensitive to the choice of layer or threshold, and the latent-monitoring overhead is small enough that the shorter decoded sequence yields a net wall-clock speedup at deployment.

(a) Layer sensitivity. Re-training the head across late-to-final layers stays within 0.40 pp (Qwen3-8B) / 1.50 pp (DS-32B) of Vanilla while cutting 20–26% of tokens.

(b) Threshold sensitivity. On GSM8K, thresholds 0.4–0.7 preserve 100% accuracy while token savings change smoothly from 56% to 40%.

(c) End-to-end latency. On GSM8K with Qwen3-8B, ROM$_{\text{CSC}}$ reduces wall-clock time by 46.5% (53.3→28.5 s) with only +4.7% per-token overhead.

BibTeX

@misc{wang2026romrealtimeoverthinkingmitigation,
      title={ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention},
      author={Xinyan Wang and Xiaogeng Liu and Chaowei Xiao},
      year={2026},
      eprint={2603.22016},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.22016},
}