GRPO + PRIME on DeepSeek-R1-Distill-Qwen-1.5B
This repository contains LoRA adapter checkpoints trained with PRIME (Process Reinforcement through Implicit Rewards) integrated into the GRPO framework on the Open-RS2 dataset.
Method Overview
We integrate PRIME's implicit process reward model (PRM) into Tina's GRPO + LoRA training pipeline. The key idea is to provide token-level dense rewards in addition to the sparse outcome reward (correct/incorrect), enabling better credit assignment during RL training.
Architecture
- GPU 0: Policy (LoRA training) + Implicit PRM (full-parameter, CPU time-sharing) + Reference model (frozen)
- GPU 1: vLLM rollout engine
PRIME Advantage Formula (Eq. 7)
The combined advantage at token position t is:

```
A_t = A_outcome + A_process_t

A_outcome   = (r_o(y) - mean(r_o)) / std(r_o)      # standard GRPO group normalization
A_process_t = sum_{s=t}^{T} normalized_r_phi(y_s)  # token-level cumulative sum (reward-to-go)
```

where the implicit process reward is:

```
r_phi(y_t) = beta * log( pi_phi(y_t | y_<t) / pi_ref(y_t | y_<t) )
```
The PRM is updated online with a binary cross-entropy loss on outcome labels (0/1 answer correctness).
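The two formulas above can be sketched in a few lines of NumPy. This is a minimal illustration, not the training code: the choice of group-wide normalization for the process rewards is an assumption (the repository's implementation may normalize differently), and the helper names are hypothetical.

```python
import numpy as np

def implicit_reward(logp_phi, logp_ref, beta=0.05):
    """Implicit process reward per token: r_phi(y_t) = beta * log(pi_phi / pi_ref).

    Takes log-probabilities, so the ratio becomes a difference.
    """
    return beta * (np.asarray(logp_phi) - np.asarray(logp_ref))

def prime_advantages(outcome_rewards, token_rewards, eps=1e-4):
    """Combine GRPO outcome advantages with PRIME process advantages (Eq. 7).

    outcome_rewards: shape (K,), one scalar 0/1 reward per rollout in the group.
    token_rewards:   list of K arrays with r_phi(y_t) per token (variable length).
    Returns a list of K per-token advantage arrays.
    """
    r_o = np.asarray(outcome_rewards, dtype=np.float64)
    # Outcome advantage: group-normalized, shared by every token of a rollout.
    a_outcome = (r_o - r_o.mean()) / (r_o.std() + eps)

    # One possible "normalized_r_phi": standardize over all tokens in the group.
    flat = np.concatenate([np.asarray(r, dtype=np.float64) for r in token_rewards])
    mu, sigma = flat.mean(), flat.std() + eps

    advantages = []
    for k, r in enumerate(token_rewards):
        norm = (np.asarray(r, dtype=np.float64) - mu) / sigma
        # Reward-to-go: A_process_t = sum_{s=t}^{T} normalized_r_phi(y_s)
        a_process = np.cumsum(norm[::-1])[::-1]
        advantages.append(a_outcome[k] + a_process)
    return advantages
```

A rollout with a correct answer (outcome reward 1) in a mixed group gets a positive outcome advantage at every token, while the process term redistributes credit toward the token positions the PRM favors.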
Training Details
Base Model
- Model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
- Dataset: knoveleng/open-rs (7,000 samples)
LoRA Configuration
| Parameter | Value |
|---|---|
| Rank (r) | 32 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, down_proj, up_proj, gate_proj |
| Trainable params | ~37M |
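The table above maps onto a PEFT `LoraConfig` along these lines (a sketch of the configuration, not the repository's actual recipe file):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                 # LoRA rank
    lora_alpha=128,       # scaling factor
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "down_proj", "up_proj", "gate_proj",
    ],
    task_type="CAUSAL_LM",
)
```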
Training Hyperparameters
| Parameter | Value |
|---|---|
| Max steps | 1500 |
| Per-device batch size | 6 |
| Gradient accumulation | 4 |
| Effective batch size | 6 × 4 = 24 completions/step; 24 / 6 generations = 4 prompts/step |
| Learning rate | 1e-6 (cosine with min_lr_rate=0.1) |
| Warmup ratio | 0.1 (~150 steps) |
| Num generations (K) | 6 |
| Temperature | 0.7 |
| Max prompt length | 512 |
| Max completion length | 3584 |
| Precision | BF16 |
| Seed | 42 |
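The effective-batch row follows directly from three of the values above:

```python
# Per-step counts implied by the hyperparameter table.
per_device_batch = 6    # completions per device per forward pass
grad_accum = 4          # gradient accumulation steps
num_generations = 6     # K completions sampled per prompt

completions_per_step = per_device_batch * grad_accum          # 24 completions
prompts_per_step = completions_per_step // num_generations    # 4 prompts
```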
PRIME-Specific Hyperparameters
| Parameter | Value |
|---|---|
| PRM beta | 0.05 |
| PRM learning rate | 1e-6 |
| PRM training | Full-parameter (CPU time-sharing) |
| PRM reference | Frozen (same init as PRM) |
| PRM micro-batch size | 2 |
| Process advantage std threshold | 1e-2 |
Reward Functions
| Function | Weight |
|---|---|
| Accuracy (math-verify) | 1.0 |
| Format (think tags) | 1.0 |
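The format reward can be sketched as a simple tag check. The exact tag convention below is an assumption (the repository's accuracy reward uses math-verify to compare the extracted answer against the gold solution, which is omitted here):

```python
import re

def format_reward(completion: str) -> float:
    """Return 1.0 if the completion wraps its reasoning in <think>...</think>
    and then produces a non-empty answer, else 0.0 (assumed tag convention)."""
    pattern = r"^<think>.*?</think>.*\S.*$"
    return 1.0 if re.match(pattern, completion, re.DOTALL) else 0.0
```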
Hardware
- Training: 2x NVIDIA A100-SXM4-80GB (RunPod)
- GPU 0: Policy LoRA + PRM (CPU-GPU time-sharing)
- GPU 1: vLLM rollout
- Training speed: ~150 seconds/step
- Evaluation: 1x NVIDIA L40 (48GB)
Checkpoints
| Checkpoint | Steps | Description |
|---|---|---|
| checkpoint-50 | 50 | Early training (warmup phase) |
| checkpoint-100 | 100 | Warmup phase |
| checkpoint-150 | 150 | Warmup ending |
| checkpoint-200 | 200 | Post-warmup |
| checkpoint-250 | 250 | Mid training |
| checkpoint-300 | 300 | Mid training |
| checkpoint-350 | 350 | Mid training |
| checkpoint-400 | 400 | Mid-late training |
| checkpoint-450 | 450 | Mid-late training |
| checkpoint-500 | 500 | Late training |
Training Logs
Full training curves available on Weights & Biases:
- WandB Project: Tina_train_model
Key metrics tracked:
- `rewards/accuracy_reward`: Math problem accuracy (0/1)
- `prm/ce_loss`: Implicit PRM binary cross-entropy loss
- `prm/classification_acc`: PRM accuracy at predicting correct/incorrect outcomes
- `train/loss`: Policy gradient loss
- `train/kl`: KL divergence from the reference policy
- `train/grad_norm`: Gradient norm
Usage
Install Dependencies
```shell
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install vllm==0.7.2 transformers==4.48.2 peft==0.14.0
pip install math-verify latex2sympy2-extended
```
Load and Merge LoRA Adapter
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    torch_dtype=torch.bfloat16,
)
# Checkpoints live in subfolders of the Hub repo, so pass subfolder explicitly.
model = PeftModel.from_pretrained(
    base_model, "whalexdfsa/open-rs2-PRIME", subfolder="checkpoint-500"
)
model = model.merge_and_unload()
tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
```
Inference with vLLM (Recommended)
```python
from vllm import LLM, SamplingParams

# After merging and saving the model to a local directory
llm = LLM(model="path/to/merged_model", dtype="bfloat16", max_model_len=32768)
sampling = SamplingParams(max_tokens=32768, temperature=0.6, top_p=0.95)

prompt = "Solve: What is the sum of all positive integers n such that n^2 - 19n + 99 is a perfect square?"
messages = [{"role": "user", "content": prompt}]
formatted = llm.get_tokenizer().apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
output = llm.generate([formatted], sampling)
print(output[0].outputs[0].text)
```
Run Benchmark Evaluation
```shell
python scripts/eval/eval_prime.py \
    --repo_id whalexdfsa/open-rs2-PRIME \
    --checkpoints checkpoint-300 checkpoint-500
```
Code Repository
Full training and evaluation code: https://github.com/LYF22034/open-rs2-PRIME
Key files:
- `tina/post_train_hf/implicit_prm.py` - Implicit PRM module
- `tina/post_train_hf/grpo_trainer.py` - Modified GRPO trainer with PRIME integration
- `tina/post_train_hf/grpo.py` - Training entry point
- `scripts/eval/eval_prime.py` - Benchmark evaluation script
- `recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/train_model_open_rs2_prime.yaml` - Training config
References
```bibtex
@article{wang2025tina,
  title={Tina: Tiny Reasoning Models via LoRA},
  author={Wang, Shangshang and Asilis, Julian and others},
  journal={arXiv preprint arXiv:2504.15777},
  year={2025}
}

@article{cui2025prime,
  title={Process Reinforcement through Implicit Rewards},
  author={Cui, Ganqu and Yuan, Lifan and others},
  journal={arXiv preprint arXiv:2502.01456},
  year={2025}
}

@misc{deepseek-r1,
  title={DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning},
  author={DeepSeek-AI},
  year={2025}
}
```
License
This project is released under the Apache 2.0 License, following the base model and Tina framework licenses.