DeepSeek-V3.1 Math RL (Tinker)
This LoRA adapter is a fine-tuned version of deepseek-ai/DeepSeek-V3.1, trained with Reinforcement Learning (RL) on the Hendrycks MATH dataset using the Tinker framework.
Model Details
- Base Model: deepseek-ai/DeepSeek-V3.1
- Training Method: Reinforcement Learning (likely a group-based policy-gradient method such as Group Relative Policy Optimization (GRPO), given the group-size configuration)
- Dataset: Hendrycks MATH (training split, with the MATH-500 test problems filtered out)
- Language: English
- Task: Mathematical Problem Solving
Training Configuration
The model was trained with the following hyperparameters:
- Environment: math (Hendrycks MATH)
- Group Size: 16 (Number of samples generated per prompt for advantage estimation)
- Groups Per Batch: 64
- Learning Rate: 2e-4
- Max Generation Tokens: 512
- Framework: Tinker Cookbook
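The group-based configuration above points to GRPO-style advantage estimation: for each prompt, `group_size` completions are sampled, and each completion's advantage is its reward normalized against the statistics of its own group. A minimal sketch of that idea, assuming binary correctness rewards (the function name is illustrative, not part of the Tinker API):

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage estimation: normalize each reward against
    the mean and standard deviation of its own group, where one group
    is the set of all samples drawn for a single prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# With group_size=16, a prompt where 4 of 16 samples solve the problem
# (reward 1.0) yields positive advantages for the solving samples and
# negative advantages for the rest.
rewards = [1.0] * 4 + [0.0] * 12
advs = group_advantages(rewards)
```

Because advantages are computed relative to the group, prompts that are always solved (or never solved) contribute zero learning signal, which is why sampling many completions per prompt matters.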
Usage
Prompt Format
The model expects the problem statement followed by a specific instruction suffix; a few-shot prompting strategy was used during training.
Suffix: " Write your answer in \boxed{} format."
Example:
- Problem: "How many r's are in strawberry?"
- Prompt: problem + " Write your answer in \boxed{} format."
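A small helper for building prompts in this format and pulling the final answer out of a completion. This is a hedged sketch: the regex-based extractor is a simplification that assumes the `\boxed{}` contents contain no nested braces.

```python
import re

# Instruction suffix described above, appended verbatim to each problem.
SUFFIX = " Write your answer in \\boxed{} format."

def build_prompt(problem: str) -> str:
    """Append the instruction suffix used during training."""
    return problem + SUFFIX

def extract_boxed(completion: str):
    """Return the contents of the last \\boxed{...} in the completion,
    or None if there is none. Assumes non-nested braces."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1] if matches else None

prompt = build_prompt("What is 2 + 2?")
answer = extract_boxed("Adding gives \\boxed{4}.")
```

Taking the last `\boxed{}` match lets the model show intermediate work before stating its final answer.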
Metrics during training

Results on the MATH-500 test set:
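One way such an evaluation could be scored is exact string match on the extracted `\boxed{}` answers. This is a simplified sketch, not the grader used here; real MATH graders typically also normalize mathematically equivalent forms, which this omits.

```python
import re

def extract_boxed(text):
    """Contents of the last \\boxed{...} (non-nested braces), or None."""
    m = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return m[-1] if m else None

def exact_match_accuracy(completions, references):
    """Fraction of completions whose boxed answer exactly matches
    the reference answer string."""
    hits = sum(extract_boxed(c) == ref
               for c, ref in zip(completions, references))
    return hits / len(references)

acc = exact_match_accuracy(
    ["So \\boxed{4}", "I am unsure.", "Thus \\boxed{7}"],
    ["4", "5", "7"],
)
```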
