Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States
Abstract
Reinforcement learning for large language models suffers from a capability ceiling because it relies on expanding action histories rather than compact Markov states. Explicitly incorporating structured state representations can overcome this ceiling and enable genuine reasoning capabilities.
Reinforcement learning (RL) has become a standard paradigm for post-training and aligning Large Language Models (LLMs), yet recent evidence suggests it faces a persistent "capability ceiling": unlike classical RL systems that discover novel strategies, RL for LLMs often acts as a mere refiner of patterns already latent in pre-trained weights. In this work, we identify a fundamental structural bottleneck: while classical RL relies on compact, informative Markov states, current LLM post-training formulations are tethered to an ever-expanding history of actions. We revisit a classical principle long central to RL yet absent from LLM post-training: explicit Markov states. Theoretically, we provide rigorous guarantees demonstrating that leveraging estimated Markov states can significantly reduce sample complexity. Empirically, we show that introducing Markov states consistently breaks the performance boundaries of standard RL post-training across a suite of complex logic puzzles. Our findings suggest that moving beyond "history-as-state" modeling in favor of structured Markovian representations is essential for unlocking open-ended discovery and genuinely new reasoning capabilities in Generative AI.
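To make the contrast between "history-as-state" and compact Markov states concrete, here is a minimal sketch on a toy grid world. The grid, the `step` function, and all names are assumptions chosen for illustration; they are not taken from the paper. The sketch counts how many distinct action histories exist at a given horizon versus how many distinct states those histories actually reach.

```python
from itertools import product

# Hypothetical toy setup (not from the paper): an agent on a 4x4 grid.
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
SIZE = 4


def step(state, action):
    """Toy transition: move one cell, clamped to stay inside the grid."""
    r, c = state
    dr, dc = MOVES[action]
    return (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))


def count_histories_and_states(horizon, start=(0, 0)):
    """Return (number of action histories, number of distinct end states)."""
    histories = 0
    states = set()
    for actions in product(MOVES, repeat=horizon):
        histories += 1
        s = start
        for a in actions:
            s = step(s, a)
        states.add(s)
    return histories, len(states)


if __name__ == "__main__":
    for horizon in (3, 6):
        n_hist, n_states = count_histories_and_states(horizon)
        print(f"horizon={horizon}: {n_hist} histories, {n_states} states")
```

In this toy setting the number of histories grows as 4^T with the horizon T, while the number of distinct states can never exceed the 16 grid cells. A policy conditioned on the state has far fewer distinct inputs to learn over than one conditioned on the full history.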
Community
We find that RL post-training for LLMs hits a persistent capability ceiling because models condition on ever-growing action histories instead of compact Markov states. We propose reintroducing explicit Markov state estimation into the training loop: at each step, a learned transition model compresses the history into a sufficient statistic. This simple mechanism comes with theoretical guarantees (an exponential reduction in sample complexity when moving from the action-history space to the compact state space) and performs well empirically. It beats standard action-sequence RL by large margins on Sokoban (2.5% → 76.1%) and Futoshiki (0.2% → 75.0%), improves on Sudoku (92.3% → 97.1%), and shows strong out-of-distribution generalization.
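The difference between the two conditioning schemes can be sketched as a rollout loop. This is an illustrative sketch only: `history_context`, `markov_context`, `toy_transition`, and `rollout` are hypothetical names, and the hand-written `toy_transition` stands in for the learned transition model described above. It is not the authors' implementation.

```python
from typing import Callable, List, Tuple

State = Tuple[int, int]

# Hypothetical move set on an unbounded integer lattice (assumption).
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def history_context(initial: State, actions: List[str]) -> str:
    """History-as-state: the context grows with every action taken."""
    return f"start={initial}; actions={','.join(actions)}"


def markov_context(state: State) -> str:
    """Markov state: the context holds only the current state."""
    return f"state={state}"


def toy_transition(state: State, action: str) -> State:
    """Stand-in for a learned transition model (assumption)."""
    dr, dc = MOVES[action]
    return (state[0] + dr, state[1] + dc)


def rollout(
    policy: Callable[[str], str],
    transition: Callable[[State, str], State],
    initial: State,
    steps: int,
) -> Tuple[State, List[str]]:
    """Roll out a policy that sees only the current estimated state."""
    state = initial
    actions: List[str] = []
    for _ in range(steps):
        action = policy(markov_context(state))
        actions.append(action)
        # Fold the new action into the state instead of appending to a history.
        state = transition(state, action)
    return state, actions


if __name__ == "__main__":
    final_state, taken = rollout(lambda ctx: "R", toy_transition, (0, 0), 5)
    print(markov_context(final_state))
    print(history_context((0, 0), taken))
```

The design point is that the policy input in `rollout` keeps the same form at every step, whereas the string built by `history_context` keeps growing with the number of actions.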