ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention

Abstract

Large Reasoning Models (LRMs) often reach a correct solution before their long Chain-of-Thought trace ends, yet continue with redundant verification, repeated attempts, or unnecessary exploration that wastes computation and can even overturn the correct answer. We frame this behavior as a latent productive-to-redundant transition and show that it is directly reflected in hidden states: around first-correct-solution (FCS) boundaries, late-layer representations separate efficient from overthinking tokens, while boundary-permutation and position-control baselines collapse. Based on this signal, we propose ROM, a model-agnostic streaming intervention framework that monitors frozen LRMs with a lightweight hidden-state detector and intervenes at well-formed reasoning boundaries. Counterfactual Self-Correction (CSC) augments supervision with balanced wrong→correct trajectories, preserving useful pre-FCS correction while labeling only post-FCS continuation as redundant. Across MATH500, GSM8K, AIME25, and MMLU-Pro, ROM improves the overall tradeoff on both Qwen3-8B and DeepSeek-R1-Distill-Qwen-32B (DS-32B): on Qwen3-8B, accuracy rises from 74.47% to 74.78% with response length reduced from 4262 to 3107 tokens; on DS-32B, accuracy rises from 68.60% to 68.72% with length reduced from 3062 to 2319 tokens. The same FCS-derived supervision transfers across scale and training origin. ROM is also compatible with L1, removing a further 20.9–21.6% of tokens at zero accuracy loss; it generalizes to open-ended MMLU-Pro (+1.56 pp, 35.4% shorter) and reduces wall-clock latency by 46.5%.

How ROM Differs from Prior Work

Existing methods approximate the productive→redundant transition through length-RL objectives, hand-crafted decoding signals, or chunk-level answer correctness. ROM instead supervises the boundary itself using token-level FCS labels, while remaining backbone-portable and operating at every token.

Method	Control Signal	Backbone Portable	Token Level	Direct Transition Supervision
L1	RL length reward	✗	✗	✗
O1-Pruner	RL/SFT length	✗	✗	✗
Certaindex	Answer stability	✓	✗	✗
DEER	Transition confidence	✓	✗	✗
EAT	Entropy trajectory	✓	✗	✗
SyncThink	`</think>` attention	✓	✓	✗
RCPD	Termination-token dynamics	✓	✓	✗
Reasoning Probing	Hidden states (chunk-level answer correctness)	✓	✗	✗
ROM (Ours)	Hidden states (online latent boundary)	✓	✓	✓

Motivation: The Productive→Redundant Boundary Is a Decodable Latent Event

We align MATH500 traces at each overthinking boundary, take the last 20 efficient and first 20 overthinking tokens, and probe Qwen3-8B's late-layer hidden states with logistic regression. The two phases are linearly separable at 85.9% accuracy (AUROC 0.928), even though they often discuss the same content. Probe scores rise sharply at the true boundary but stay flat under permuted alignment. The signal is also not a position shortcut: a position-only classifier reaches only 50.3%, and position-residualized hidden states still separate at 86.4%.

t-SNE of late-layer hidden states around the FCS boundary

(a) t-SNE of late-layer hidden states separates efficient (pre-FCS) from overthinking (post-FCS) tokens.

(b) Boundary-aligned probe scores rise sharply at the true FCS; permuted alignment stays flat.

(c) Position-only and permuted-boundary controls collapse to ~50%, ruling out a position shortcut.
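The probing setup and its position-only control can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption for illustration: the 2-D "hidden states" are Gaussian stand-ins for real late-layer activations, and `make_traces`, `fit_logreg`, and the chosen separation are hypothetical, so the accuracies will not match the reported numbers.

```python
import math
import random

def fit_logreg(X, y, lr=0.5, steps=300):
    """Full-batch gradient descent for logistic regression; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * d, 0.0
        for x, t in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-(b + sum(wi * xi for wi, xi in zip(w, x)))))
            gb += p - t
            for j in range(d):
                gw[j] += (p - t) * x[j]
        b -= lr * gb / n
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
    return w, b

def accuracy(model, X, y):
    """Fraction of rows whose predicted class (logit >= 0) matches the label."""
    w, b = model
    hits = sum((b + sum(wi * xi for wi, xi in zip(w, x)) >= 0) == bool(t)
               for x, t in zip(X, y))
    return hits / len(y)

def make_traces(n_traces, rng, window=20, shift=2.0):
    """Synthetic boundary-aligned tokens as (hidden, position, label) rows."""
    rows = []
    for _ in range(n_traces):
        boundary = rng.randint(200, 4000)      # absolute index of the boundary (assumed range)
        for offset in range(-window, window):  # last 20 efficient, first 20 overthinking
            label = int(offset >= 0)           # 1 = overthinking (post-boundary)
            hidden = [shift * label + rng.gauss(0, 1), rng.gauss(0, 1)]
            rows.append((hidden, [(boundary + offset) / 4000.0], label))
    return rows

def split(rows, idx):
    """Select one feature view (0 = hidden, 1 = position) plus the labels."""
    return [r[idx] for r in rows], [r[2] for r in rows]

rng = random.Random(0)
train, test = make_traces(20, rng), make_traces(20, rng)

# Probe on the hidden-state stand-in vs. a control that only sees absolute position.
hidden_acc = accuracy(fit_logreg(*split(train, 0)), *split(test, 0))
position_acc = accuracy(fit_logreg(*split(train, 1)), *split(test, 1))
print(f"hidden probe: {hidden_acc:.3f}, position-only control: {position_acc:.3f}")
```

Because boundaries fall at very different absolute positions across traces, the position-only control has almost nothing to exploit in this toy setup and stays near chance, while the hidden-state probe does not.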

Method

ROM has four components: (1) latent boundary supervision that converts attempt-level correctness into token-level labels at the First-Correct-Solution (FCS) boundary; (2) Counterfactual Self-Correction (CSC), which synthesizes balanced wrong→correct trajectories so the detector learns a semantic boundary around sufficiency rather than an attempt-index shortcut; (3) a streaming detector on frozen late-layer hidden states (layer 32 of 36 for Qwen3-8B; layer 56 of 64 for DS-32B) that emits a per-token overthinking score $p_t$ in lockstep with decoding; and (4) boundary-aware intervention that, once $p_t$ crosses the threshold, backtracks to the nearest clean sentence/solution boundary and prompts a final answer.
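Components (1) and (4) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: `Attempt`, `fcs_token_labels`, `stream_with_intervention`, the punctuation-based boundary rule, and the `FINAL_PROMPT` string are all hypothetical, and the per-token scores $p_t$ are taken as given rather than produced by a trained detector.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

@dataclass
class Attempt:
    start: int     # index of the first token of this solution attempt
    end: int       # index of the last token (inclusive)
    correct: bool  # attempt-level correctness judgment

def fcs_token_labels(num_tokens: int, attempts: List[Attempt]) -> List[int]:
    """Token labels: 0 = efficient (through the first correct attempt), 1 = overthinking."""
    fcs_end = next((a.end for a in attempts if a.correct), None)
    if fcs_end is None:
        # Assumption: a trace with no correct attempt has no redundant suffix.
        return [0] * num_tokens
    return [0 if t <= fcs_end else 1 for t in range(num_tokens)]

SENTENCE_ENDINGS = (".", "?", "!", "\n")   # assumed proxy for a "clean" boundary
FINAL_PROMPT = "\n</think>\n\nFinal answer:"  # hypothetical wording

def stream_with_intervention(
    steps: Iterable[Tuple[str, float]], threshold: float = 0.5
) -> Tuple[str, bool]:
    """Consume (token_text, p_t) pairs; on a threshold crossing, backtrack and prompt.

    Returns the resulting text and whether an intervention happened.
    """
    kept: List[str] = []
    last_boundary = 0  # number of kept tokens up to the latest clean boundary
    for token, p_t in steps:
        if p_t >= threshold:
            # Drop the unfinished fragment after the last boundary, then ask for the answer.
            # If no boundary was seen yet, only the prompt is returned.
            return "".join(kept[:last_boundary]) + FINAL_PROMPT, True
        kept.append(token)
        if token.rstrip(" ").endswith(SENTENCE_ENDINGS):
            last_boundary = len(kept)
    return "".join(kept), False

# Usage: the second attempt is the first correct one, so tokens 4-5 are redundant.
labels = fcs_token_labels(
    6, [Attempt(0, 1, False), Attempt(2, 3, True), Attempt(4, 5, True)]
)
text, intervened = stream_with_intervention(
    [("x = 2", 0.1), (".", 0.1), (" So", 0.2), (" wait", 0.9), (" again", 0.95)]
)
```

Note that the wrong first attempt stays labeled as efficient: only continuation after the first correct solution counts as overthinking, which is the property CSC is meant to reinforce.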

Main Results

We evaluate on MATH500, GSM8K, AIME25, and MMLU-Pro across two backbones: Qwen3-8B (RL post-trained) and DeepSeek-R1-Distill-Qwen-32B (DS-32B, distilled, 4× larger). Acc = accuracy (%), SL = response length (tokens), SE = Acc/SL×100. Overall is the unweighted mean over the four benchmarks, with Overall SE averaging the per-benchmark SE values. Bold = best per backbone, underlined = second best. Higher is better for Acc and SE; lower is better for SL.

Backbone	Method	MATH500			GSM8K			AIME25			MMLU-Pro			Overall
Backbone	Method	Acc	SL	SE	Acc	SL	SE	Acc	SL	SE	Acc	SL	SE	Acc	SL	SE
Qwen3-8B	Vanilla	89.00	4297	2.07	100.00	2041	4.90	32.22	7869	0.41	76.67	2840	2.70	74.47	4262	2.52
	L1	89.00	2603	3.42	100.00	1259	7.94	36.67	6489	0.57	71.43	1401	5.10	74.27	2938	4.26
	EAT	89.00	4297	2.07	100.00	1594	6.27	31.11	6719	0.46	76.67	2052	3.74	74.20	3666	3.14
	Cut₂₀₄₈	81.67	2562	3.19	98.33	1990	4.94	24.44	3663	0.67	79.05	2231	3.54	70.87	2612	3.09
	ROM	90.00	3013	2.99	100.00	1118	8.94	32.22	6698	0.48	76.19	2160	3.53	74.60	3247	3.99
	ROM_CSC	88.33	2784	3.17	100.00	1060	9.43	32.22	6708	0.48	78.57	1875	4.19	74.78	3107	4.32
DS-32B	Vanilla	83.33	3481	2.39	92.49	459	20.15	33.33	6940	0.48	65.24	1368	4.77	68.60	3062	6.95
	EAT	72.67	1809	4.02	91.66	478	19.18	30.00	6587	0.46	61.90	1064	5.82	64.06	2485	7.37
	Cut₂₀₄₈	76.67	3479	2.20	92.49	457	20.24	21.11	6937	0.30	61.90	1365	4.54	63.04	3060	6.82
	ROM	78.00	2390	3.26	92.49	455	20.33	26.67	5469	0.49	69.05	587	11.76	66.55	2225	8.96
	ROM_CSC	80.33	2310	3.48	92.49	428	21.61	31.11	5901	0.53	70.95	638	11.12	68.72	2319	9.18

Latent-boundary control gives the strongest accuracy–efficiency tradeoff. ROM_CSC raises Qwen3-8B SE from 2.52 to 4.32 (74.47%→74.78% accuracy), and gives DS-32B the best overall SE (9.18) and accuracy (68.72%). The same FCS-derived supervision transfers across a 4× scale gap and across RL-post-trained vs. distilled origins.
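As a quick check on how the metrics fit together, the sketch below recomputes SE and the Overall columns for the Qwen3-8B Vanilla row from its per-benchmark entries. It assumes Overall is an unweighted mean over the four benchmarks (with Overall SE being the mean of per-benchmark SE), which is what the tabulated numbers are consistent with; the helper names `se` and `overall` are ours.

```python
def se(acc_pct: float, tokens: float) -> float:
    """SE = Acc / SL x 100, as defined for the results table."""
    return acc_pct / tokens * 100

def overall(rows):
    """Unweighted means of Acc, SL, and per-benchmark SE (assumed aggregation)."""
    n = len(rows)
    acc = sum(a for a, _ in rows) / n
    sl = sum(s for _, s in rows) / n
    mean_se = sum(se(a, s) for a, s in rows) / n
    return round(acc, 2), round(sl), round(mean_se, 2)

# (Acc %, SL tokens) for Qwen3-8B Vanilla, copied from the table.
qwen_vanilla = {
    "MATH500": (89.00, 4297),
    "GSM8K": (100.00, 2041),
    "AIME25": (32.22, 7869),
    "MMLU-Pro": (76.67, 2840),
}

per_benchmark_se = {name: round(se(a, s), 2) for name, (a, s) in qwen_vanilla.items()}
summary = overall(list(qwen_vanilla.values()))
print(per_benchmark_se)
print(summary)  # -> (74.47, 4262, 2.52), matching the Overall columns of that row
```

Averaging per-benchmark SE is not the same as dividing Overall Acc by Overall SL (74.47/4262×100 ≈ 1.75), so short-response benchmarks such as GSM8K carry a large share of the Overall SE.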

End-to-end latency. On GSM8K with Qwen3-8B, ROM_CSC reduces wall-clock time by 46.5% (53.3→28.5 s) with only +4.7% per-token overhead.
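A back-of-envelope model shows why a small per-token overhead still leaves a large net speedup. The sketch assumes wall-clock time scales linearly with the number of decoded tokens and ignores prefill; it plugs in the GSM8K response lengths from the main table, which may not be the exact lengths of the latency run. The function name is hypothetical.

```python
def estimated_time_reduction(tokens_before: float, tokens_after: float,
                             per_token_overhead: float) -> float:
    """Fraction of wall-clock time saved if decode time is linear in token count."""
    return 1.0 - (tokens_after * (1.0 + per_token_overhead)) / tokens_before

# GSM8K, Qwen3-8B: Vanilla SL = 2041, ROM_CSC SL = 1060, +4.7% per-token overhead.
estimate = estimated_time_reduction(2041, 1060, 0.047)

# Reduction implied by the reported timings (53.3 s -> 28.5 s).
measured = 1.0 - 28.5 / 53.3

print(f"estimated: {estimate:.1%}, measured: {measured:.1%}")
```

Under these assumptions the estimate lands within about one percentage point of the measured reduction, so most of the speedup is explained by the shorter decoded sequence.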

Open-Ended MMLU-Pro

We evaluate Qwen3-8B on 64 non-numerical MMLU-Pro problems with the answer options removed ($n{=}3$), with GPT-4o judging the answer that follows `</think>`. SE = Acc/SL×100. The signal transfers to free-form answers, suggesting that it tracks hidden-state phase rather than answer-format match.

Metric	Vanilla	ROM_CSC	Δ
Accuracy (%)	80.21±0.90	81.77±0.90	+1.56 pp
Response Length	2457±16	1587±25	−35.4%
SE	3.27	5.15	+57.7%

Compatibility with L1

We stack ROM_CSC on L1-Qwen3-8B-Max (a Qwen3-8B already RL-finetuned for length control) and evaluate on MATH500 and GSM8K (40 problems each, $n{=}3$). Per-instance latent-boundary detection is orthogonal to global length budgets.

Metric	MATH500			GSM8K
Metric	L1	+ROM_CSC	Δ	L1	+ROM_CSC	Δ
Acc (%)	90.83	90.83	0.00 pp	100.00	100.00	0.00 pp
SL	2684	2105	−21.6%	1197	947	−20.9%
SE	3.38	4.31	+27.5%	8.35	10.56	+26.4%

Robustness and Latency

ROM is not sensitive to the choice of layer or threshold, and the latent-monitoring overhead is small enough that the shorter decoded sequence yields a net wall-clock speedup at deployment.

(a) Layer sensitivity. Re-training the head across late-to-final layers stays within 0.40 pp (Qwen3-8B) / 1.50 pp (DS-32B) of Vanilla while cutting 20–26% of tokens.

(b) Threshold sensitivity. On GSM8K, thresholds 0.4–0.7 preserve 100% accuracy while token savings change smoothly from 56% to 40%.

(c) End-to-end latency. The shorter decoded sequence outweighs the +4.7% per-token monitoring overhead: on GSM8K with Qwen3-8B, ROM_CSC cuts wall-clock time by 46.5% (53.3→28.5 s).
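The smooth dependence of token savings on the threshold can be illustrated with a toy score trace. This sketch is purely synthetic: the sigmoid-shaped $p_t$ curve, its boundary position, and the first-crossing stop rule (no backtracking) are assumptions chosen for illustration, so the savings it prints are not the reported 56% to 40% range.

```python
import math

def score_trace(length: int = 1000, boundary: int = 500, sharpness: float = 50.0):
    """Synthetic overthinking scores that rise smoothly around an assumed boundary."""
    return [1.0 / (1.0 + math.exp(-(t - boundary) / sharpness)) for t in range(length)]

def token_savings(scores, threshold: float) -> float:
    """Fraction of tokens skipped when decoding stops at the first score >= threshold."""
    for t, p in enumerate(scores):
        if p >= threshold:
            return 1.0 - t / len(scores)
    return 0.0  # the threshold was never crossed, so nothing is saved

scores = score_trace()
sweep = {tau: token_savings(scores, tau) for tau in (0.4, 0.5, 0.6, 0.7)}
for tau, saved in sweep.items():
    print(f"threshold {tau:.1f}: {saved:.1%} of tokens saved")
```

A higher threshold waits longer before intervening, so savings shrink gradually rather than abruptly as long as the score rises smoothly through the boundary.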

BibTeX

@misc{wang2026romrealtimeoverthinkingmitigation,
      title={ROM: Real-time Overthinking Mitigation via Streaming Detection and Intervention},
      author={Xinyan Wang and Xiaogeng Liu and Chaowei Xiao},
      year={2026},
      eprint={2603.22016},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.22016},
}