GRPO + PRIME on DeepSeek-R1-Distill-Qwen-1.5B
This repository contains LoRA adapter checkpoints trained with PRIME (Process Reinforcement through Implicit Rewards) integrated into the GRPO framework on the Open-RS2 dataset.
Method Overview
We integrate PRIME's implicit process reward model (PRM) into Tina's GRPO + LoRA training pipeline. The key idea is to provide token-level dense rewards in addition to the sparse outcome reward (correct/incorrect), enabling better credit assignment during RL training.
Architecture
- GPU 0: Policy (LoRA training) + Implicit PRM (full-parameter, CPU time-sharing) + Reference model (frozen)
- GPU 1: vLLM rollout engine
PRIME Advantage Formula (Eq. 7)
The combined advantage at token position t is:

```
A_t = A_outcome + A_process_t

A_outcome   = (r_o(y) - mean(r_o)) / std(r_o)      # standard GRPO group normalization
A_process_t = sum_{s=t}^{T} normalized_r_phi(y_s)  # token-level cumulative sum (reward-to-go)
```

where the implicit process reward is:

```
r_phi(y_t) = beta * log( pi_phi(y_t | y_<t) / pi_ref(y_t | y_<t) )
```
The PRM is updated online with a binary cross-entropy loss on outcome labels (0/1 answer correctness).
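The two formulas above can be sketched in a few lines of NumPy. This is a minimal illustration, not the training code: the choice of group-wide normalization for the process rewards is an assumption (the repository's implementation may normalize differently), and the helper names are hypothetical.

```python
import numpy as np

def implicit_reward(logp_phi, logp_ref, beta=0.05):
    """Implicit process reward per token: r_phi(y_t) = beta * log(pi_phi / pi_ref).

    Takes log-probabilities, so the ratio becomes a difference.
    """
    return beta * (np.asarray(logp_phi) - np.asarray(logp_ref))

def prime_advantages(outcome_rewards, token_rewards, eps=1e-4):
    """Combine GRPO outcome advantages with PRIME process advantages (Eq. 7).

    outcome_rewards: shape (K,), one scalar 0/1 reward per rollout in the group.
    token_rewards:   list of K arrays with r_phi(y_t) per token (variable length).
    Returns a list of K per-token advantage arrays.
    """
    r_o = np.asarray(outcome_rewards, dtype=np.float64)
    # Outcome advantage: group-normalized, shared by every token of a rollout.
    a_outcome = (r_o - r_o.mean()) / (r_o.std() + eps)

    # One possible "normalized_r_phi": standardize over all tokens in the group.
    flat = np.concatenate([np.asarray(r, dtype=np.float64) for r in token_rewards])
    mu, sigma = flat.mean(), flat.std() + eps

    advantages = []
    for k, r in enumerate(token_rewards):
        norm = (np.asarray(r, dtype=np.float64) - mu) / sigma
        # Reward-to-go: A_process_t = sum_{s=t}^{T} normalized_r_phi(y_s)
        a_process = np.cumsum(norm[::-1])[::-1]
        advantages.append(a_outcome[k] + a_process)
    return advantages
```

A rollout with a correct answer (outcome reward 1) in a mixed group gets a positive outcome advantage at every token, while the process term redistributes credit toward the token positions the PRM favors.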
Training Details
Base Model
- Model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- Dataset: knoveleng/open-rs (7,000 samples)
LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 32 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, down_proj, up_proj, gate_proj |
| Trainable params | ~37M |
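The table above maps onto a PEFT `LoraConfig` along these lines (a sketch of the configuration, not the repository's actual recipe file):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                 # LoRA rank
    lora_alpha=128,       # scaling factor
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "down_proj", "up_proj", "gate_proj",
    ],
    task_type="CAUSAL_LM",
)
```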
Training Hyperparameters
| Parameter | Value |
|---|---|
| Max steps | 1500 |
| Per-device batch size | 6 |
| Gradient accumulation | 4 |
| Effective batch size | 6 × 4 = 24 completions/step; 24 / 6 generations = 4 prompts/step |
| Learning rate | 1e-6 (cosine with min_lr_rate=0.1) |
| Warmup ratio | 0.1 (~150 steps) |
| Num generations (K) | 6 |
| Temperature | 0.7 |
| Max prompt length | 512 |
| Max completion length | 3584 |
| Precision | BF16 |
| Seed | 42 |
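The effective-batch row follows directly from three of the values above:

```python
# Per-step counts implied by the hyperparameter table.
per_device_batch = 6    # completions per device per forward pass
grad_accum = 4          # gradient accumulation steps
num_generations = 6     # K completions sampled per prompt

completions_per_step = per_device_batch * grad_accum          # 24 completions
prompts_per_step = completions_per_step // num_generations    # 4 prompts
```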
PRIME-Specific Hyperparameters
| Parameter | Value |
|---|---|
| PRM beta | 0.05 |
| PRM learning rate | 1e-6 |
| PRM training | Full-parameter (CPU time-sharing) |
| PRM reference | Frozen (same init as PRM) |
| PRM micro-batch size | 2 |
| Process advantage std threshold | 1e-2 |
Reward Functions
| Function | Weight |
|---|---|
| Accuracy (math-verify) | 1.0 |
| Format (think tags) | 1.0 |
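The format reward can be sketched as a simple tag check. The exact tag convention below is an assumption (the repository's accuracy reward uses math-verify to compare the extracted answer against the gold solution, which is omitted here):

```python
import re

def format_reward(completion: str) -> float:
    """Return 1.0 if the completion wraps its reasoning in <think>...</think>
    and then produces a non-empty answer, else 0.0 (assumed tag convention)."""
    pattern = r"^<think>.*?</think>.*\S.*$"
    return 1.0 if re.match(pattern, completion, re.DOTALL) else 0.0
```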
Hardware
- Training: 2x NVIDIA A100-SXM4-80GB (RunPod)
- GPU 0: Policy LoRA + PRM (CPU-GPU time-sharing)
- GPU 1: vLLM rollout
- Training speed: ~150 seconds/step
- Evaluation: 1x NVIDIA L40 (48GB)
Checkpoints
| Checkpoint | Steps | Description |
|---|---|---|
| checkpoint-50 | 50 | Early training (warmup phase) |
| checkpoint-100 | 100 | Warmup phase |
| checkpoint-150 | 150 | Warmup ending |
| checkpoint-200 | 200 | Post-warmup |
| checkpoint-250 | 250 | Mid training |
| checkpoint-300 | 300 | Mid training |
| checkpoint-350 | 350 | Mid training |
| checkpoint-400 | 400 | Mid-late training |
| checkpoint-450 | 450 | Mid-late training |
| checkpoint-500 | 500 | Late training |
Training Logs
Full training curves available on Weights & Biases:
- WandB Project: Tina_train_model
Key metrics tracked:
- `rewards/accuracy_reward`: Math problem accuracy (0/1)
- `prm/ce_loss`: Implicit PRM binary cross-entropy loss
- `prm/classification_acc`: PRM accuracy at predicting correct/incorrect outcomes
- `train/loss`: Policy gradient loss
- `train/kl`: KL divergence from the reference policy
- `train/grad_norm`: Gradient norm
Usage
Install Dependencies
```shell
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install vllm==0.7.2 transformers==4.48.2 peft==0.14.0
pip install math-verify latex2sympy2-extended
```
Load and Merge LoRA Adapter
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    torch_dtype=torch.bfloat16,
)
# Checkpoints live in subfolders of the Hub repo, so pass subfolder explicitly.
model = PeftModel.from_pretrained(
    base_model, "whalexdfsa/open-rs2-PRIME", subfolder="checkpoint-500"
)
model = model.merge_and_unload()
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
```
Inference with vLLM (Recommended)
```python
from vllm import LLM, SamplingParams

# After merging and saving the model to a local directory
llm = LLM(model="path/to/merged_model", dtype="bfloat16", max_model_len=32768)
sampling = SamplingParams(max_tokens=32768, temperature=0.6, top_p=0.95)

prompt = "Solve: What is the sum of all positive integers n such that n^2 - 19n + 99 is a perfect square?"
messages = [{"role": "user", "content": prompt}]
formatted = llm.get_tokenizer().apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
output = llm.generate([formatted], sampling)
print(output[0].outputs[0].text)
```
Run Benchmark Evaluation
```shell
python scripts/eval/eval_prime.py \
    --repo_id whalexdfsa/open-rs2-PRIME \
    --checkpoints checkpoint-300 checkpoint-500
```
Code Repository
Full training and evaluation code: https://github.com/LYF22034/open-rs2-PRIME
Key files:
- `tina/post_train_hf/implicit_prm.py` - Implicit PRM module
- `tina/post_train_hf/grpo_trainer.py` - Modified GRPO trainer with PRIME integration
- `tina/post_train_hf/grpo.py` - Training entry point
- `scripts/eval/eval_prime.py` - Benchmark evaluation script
- `recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/train_model_open_rs2_prime.yaml` - Training config
References
```bibtex
@article{wang2025tina,
  title={Tina: Tiny Reasoning Models via LoRA},
  author={Wang, Shangshang and Asilis, Julian and others},
  journal={arXiv preprint arXiv:2504.15777},
  year={2025}
}

@article{cui2025prime,
  title={Process Reinforcement through Implicit Rewards},
  author={Cui, Ganqu and Yuan, Lifan and others},
  journal={arXiv preprint arXiv:2502.01456},
  year={2025}
}

@misc{deepseek-r1,
  title={DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning},
  author={DeepSeek-AI},
  year={2025}
}
```
License
This project is released under the Apache 2.0 License, following the base model and Tina framework licenses.