How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
Abstract
A teacher-student cooperation data synthesis framework addresses stylistic divergence in synthetic SFT data, improving fine-tuning performance.
A widely adopted strategy for model enhancement is to use synthetic data generated by a stronger model for supervised fine-tuning (SFT). However, for emerging reasoning models such as Qwen3-8B, this approach often fails to improve reasoning capabilities and can even cause a substantial drop in performance. In this work, we identify substantial stylistic divergence between teacher-generated data and the student's output distribution as a major factor degrading SFT. To bridge this gap, we propose a Teacher-Student Cooperation Data Synthesis framework (TESSY), which interleaves the teacher and student models so that they alternately generate non-style and style tokens, respectively. Consequently, TESSY produces synthetic sequences that inherit the advanced reasoning capabilities of the teacher while remaining stylistically consistent with the student's distribution. In code-generation experiments with GPT-OSS-120B as the teacher, fine-tuning Qwen3-8B on purely teacher-generated data drops performance by 3.25% on LiveCodeBench-Pro and 10.02% on OJBench, whereas TESSY achieves improvements of 11.25% and 6.68%, respectively.
Community
🚀 Motivation
Training reasoning models (e.g., Qwen3) is highly sensitive to the data distribution. We observe that:
❗ Using off-policy data (e.g., directly from a strong teacher model) for SFT can lead to severe catastrophic forgetting, especially for complex reasoning tasks.
💡 Key Idea
To address this critical issue, we propose TESSY, a novel Teacher-Student Cooperation Data Synthesis framework designed to generate on-policy training data. Instead of relying on a teacher model to fully generate training samples, TESSY decouples the generation process into two distinct parts:
- 🧠 Teacher model → specializes in generating capability tokens.
- ✍️ Student model → focuses on generating style tokens (e.g., "Hmm", "Wait").
This cooperative approach ensures:
- Alignment with student distribution (on-policy): The synthesized data is tailored to the student model's own generation patterns.
- Preservation of teacher reasoning quality: The teacher's advanced reasoning capabilities are effectively leveraged and maintained.
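Conceptually, the cooperative decoding above can be sketched as a routing loop: the student keeps control of stylistic discourse markers while the teacher supplies the substantive reasoning tokens. The sketch below is a toy illustration only; the style-token set, the routing rule, and both model stand-ins are assumptions for demonstration, not the paper's actual implementation, which would route between real LLMs at the token level.

```python
# Toy sketch of TESSY-style interleaved decoding.
# NOTE: STYLE_TOKENS, the routing rule, and both "models" below are
# hypothetical stand-ins; the real framework alternates between a
# strong teacher LLM and the student LLM during generation.

STYLE_TOKENS = {"Hmm,", "Wait,"}  # illustrative student-voice markers

def teacher_step(context):
    """Stand-in teacher: emits the next substantive (capability) token."""
    plan = ["compute", "the", "sum", "of", "squares"]
    n_capability = sum(1 for t in context if t not in STYLE_TOKENS)
    return plan[n_capability % len(plan)]

def student_step(context):
    """Stand-in student: occasionally interjects a style token."""
    return "Hmm," if len(context) % 4 == 0 else None

def tessy_decode(max_tokens=8):
    """Interleave student style tokens with teacher capability tokens."""
    seq = []
    for _ in range(max_tokens):
        style = student_step(seq)  # student gets first claim on each slot
        seq.append(style if style is not None else teacher_step(seq))
    return seq
```

In the resulting sequence, reasoning content comes from the teacher while the stylistic interjections stay in the student's own voice, which is the on-policy property the framework targets.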
The following papers were recommended by the Semantic Scholar API
- Student-in-the-Loop Chain-of-Thought Distillation via Generation-Time Selection (2026)
- On the Role of Reasoning Patterns in the Generalization Discrepancy of Long Chain-of-Thought Supervised Fine-Tuning (2026)
- Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs? (2026)
- Teaching Language Models How to Code Like Learners: Conversational Serialization for Student Simulation (2026)
- HintMR: Eliciting Stronger Mathematical Reasoning in Small Language Models (2026)
- Draft-Thinking: Learning Efficient Reasoning in Long Chain-of-Thought LLMs (2026)
- Learning from Partial Chain-of-Thought via Truncated-Reasoning Self-Distillation (2026)
Get this paper in your agent:
hf papers read 2604.14164
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash