GPRA on DeepSeek-R1-Distill-Qwen-1.5B

This repository contains LoRA adapter checkpoints trained with PRIME (Process Reinforcement through Implicit Rewards) integrated into the GRPO framework on the Open-RS2 dataset.

Method Overview

We integrate PRIME's implicit process reward model (PRM) into Tina's GRPO + LoRA training pipeline. The key idea is to provide token-level dense rewards in addition to the sparse outcome reward (correct/incorrect), enabling better credit assignment during RL training.

Architecture

  • GPU 0: Policy (LoRA training) + Implicit PRM (full-parameter, CPU time-sharing) + Reference (frozen)
  • GPU 1: vLLM rollout engine

PRIME Advantage Formula (Eq. 7)

The combined advantage at token position t is:

A_t = A_outcome + A_process_t

A_outcome = (r_o(y) - mean(r_o)) / std(r_o)          # Standard GRPO
A_process_t = sum_{s=t}^{T} normalized_r_phi(y_s)     # Token-level cumulative sum

Where the implicit process reward is:

r_phi(y_t) = beta * log(pi_phi(y_t|y<t) / pi_ref(y_t|y<t))

The PRM is updated online with binary cross-entropy loss using outcome labels (0/1 accuracy).
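As a sketch, the advantage combination above can be written as follows. The within-group normalization of the process rewards and the epsilon for numerical stability are assumptions; the reverse cumulative sum implements the sum from position t to T in Eq. 7.

```python
import torch

def prime_advantages(outcome_rewards, token_rewards, eps=1e-4):
    """Combine GRPO outcome advantages with PRIME process advantages (Eq. 7 sketch).

    outcome_rewards: (G,) one scalar reward per rollout in the group
    token_rewards:   (G, T) implicit per-token rewards r_phi(y_t)
    """
    # Standard GRPO: normalize outcome rewards within the group
    a_outcome = (outcome_rewards - outcome_rewards.mean()) / (outcome_rewards.std() + eps)

    # Normalize process rewards, then take the reverse cumulative sum
    # so that A_process_t = sum_{s=t}^{T} normalized_r_phi(y_s)
    r = (token_rewards - token_rewards.mean()) / (token_rewards.std() + eps)
    a_process = r.flip(-1).cumsum(-1).flip(-1)

    # Broadcast the sequence-level outcome advantage to every token position
    return a_outcome.unsqueeze(-1) + a_process
```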

Training Details

Base Model

deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B

LoRA Configuration

| Parameter | Value |
| --- | --- |
| Rank (r) | 32 |
| Alpha | 128 |
| Dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, down_proj, up_proj, gate_proj |
| Trainable params | ~37M |
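This configuration corresponds to the following peft setup (a sketch reconstructed from the table; `task_type` is an assumption):

```python
from peft import LoraConfig

# LoRA configuration matching the table above
lora_config = LoraConfig(
    r=32,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "down_proj", "up_proj", "gate_proj",
    ],
    task_type="CAUSAL_LM",  # assumption: standard causal-LM fine-tuning
)
```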

Training Hyperparameters

| Parameter | Value |
| --- | --- |
| Max steps | 1500 |
| Per-device batch size | 6 |
| Gradient accumulation | 4 |
| Effective batch size | 24 completions/step (6 × 4), i.e. 4 prompts × 6 generations |
| Learning rate | 1e-6 (cosine with min_lr_rate=0.1) |
| Warmup ratio | 0.1 (~150 steps) |
| Num generations (K) | 6 |
| Temperature | 0.7 |
| Max prompt length | 512 |
| Max completion length | 3584 |
| Precision | BF16 |
| Seed | 42 |

PRIME-Specific Hyperparameters

| Parameter | Value |
| --- | --- |
| PRM beta | 0.05 |
| PRM learning rate | 1e-6 |
| PRM training | Full-parameter (CPU time-sharing) |
| PRM reference | Frozen (same init as PRM) |
| PRM micro-batch size | 2 |
| Process advantage std threshold | 1e-2 |
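The online PRM update with binary cross-entropy against outcome labels can be sketched as below. The sequence-level form (sum of beta-scaled log-ratios squashed through a sigmoid) follows the PRIME formulation described in this card; the exact implementation in the training code is an assumption.

```python
import torch
import torch.nn.functional as F

def prm_ce_loss(policy_logprobs, ref_logprobs, outcome_label, beta=0.05):
    """Binary cross-entropy update for the implicit PRM (sketch).

    policy_logprobs, ref_logprobs: (T,) per-token log-probs of the completion
    outcome_label: scalar tensor, 1.0 if the answer was correct, else 0.0
    """
    # Sequence-level implicit reward: sum of beta-scaled log-ratios
    logit = beta * (policy_logprobs - ref_logprobs).sum()
    # Train the implicit reward to predict the 0/1 outcome
    return F.binary_cross_entropy_with_logits(logit, outcome_label)
```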

Reward Functions

| Function | Weight |
| --- | --- |
| Accuracy (math-verify) | 1.0 |
| Format (think tags) | 1.0 |
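A minimal sketch of the format reward, assuming it simply checks for a completed `<think>...</think>` block; the exact pattern used in training is an assumption.

```python
import re

def format_reward(completion: str) -> float:
    """Return 1.0 if the completion wraps its reasoning in <think>...</think>
    tags, else 0.0 (illustrative; not the exact training implementation)."""
    return 1.0 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0
```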

Hardware

  • Training: 2x NVIDIA A100-SXM4-80GB (RunPod)
    • GPU 0: Policy LoRA + PRM (CPU-GPU time-sharing)
    • GPU 1: vLLM rollout
  • Training speed: ~150 seconds/step
  • Evaluation: 1x NVIDIA L40 (48GB)

Checkpoints

| Checkpoint | Steps | Description |
| --- | --- | --- |
| checkpoint-50 | 50 | Early training (warmup phase) |
| checkpoint-100 | 100 | Warmup phase |
| checkpoint-150 | 150 | Warmup ending |
| checkpoint-200 | 200 | Post-warmup |
| checkpoint-250 | 250 | Mid training |
| checkpoint-300 | 300 | Mid training |
| checkpoint-350 | 350 | Mid training |
| checkpoint-400 | 400 | Mid-late training |
| checkpoint-450 | 450 | Mid-late training |
| checkpoint-500 | 500 | Late training |

Training Logs

Full training curves are available on Weights & Biases.

Key metrics tracked:

  • rewards/accuracy_reward: Math problem accuracy (0/1)
  • prm/ce_loss: Implicit PRM binary cross-entropy loss
  • prm/classification_acc: PRM ability to predict correct/incorrect
  • train/loss: Policy gradient loss
  • train/kl: KL divergence from reference policy
  • train/grad_norm: Gradient norm

Usage

Install Dependencies

pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip install vllm==0.7.2 transformers==4.48.2 peft==0.14.0
pip install math-verify latex2sympy2-extended

Load and Merge LoRA Adapter

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(
    base_model,
    "whalexdfsa/open-rs2-PRIME",
    subfolder="checkpoint-500",
)
model = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

# Save the merged model so it can be served directly (e.g. with vLLM)
model.save_pretrained("path/to/merged_model")
tokenizer.save_pretrained("path/to/merged_model")

Inference with vLLM (Recommended)

from vllm import LLM, SamplingParams

# After merging and saving to a local directory
llm = LLM(model="path/to/merged_model", dtype="bfloat16", max_model_len=32768)
sampling = SamplingParams(max_tokens=32768, temperature=0.6, top_p=0.95)

prompt = "Solve: What is the sum of all positive integers n such that n^2 - 19n + 99 is a perfect square?"
messages = [{"role": "user", "content": prompt}]
formatted = llm.get_tokenizer().apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
output = llm.generate([formatted], sampling)
print(output[0].outputs[0].text)
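To score generations, the final answer can be pulled out of the completion. A minimal sketch, assuming the model follows the DeepSeek-R1 convention of placing the final answer in `\boxed{}`; the helper name is illustrative.

```python
import re

def extract_boxed(text: str):
    """Return the contents of the last \\boxed{...} in a completion,
    or None if no boxed answer is present (illustrative helper)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None
```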

Run Benchmark Evaluation

python scripts/eval/eval_prime.py \
    --repo_id whalexdfsa/open-rs2-PRIME \
    --checkpoints checkpoint-300 checkpoint-500

Code Repository

Full training and evaluation code: https://github.com/LYF22034/open-rs2-PRIME

Key files:

  • tina/post_train_hf/implicit_prm.py - Implicit PRM module
  • tina/post_train_hf/grpo_trainer.py - Modified GRPO trainer with PRIME integration
  • tina/post_train_hf/grpo.py - Training entry point
  • scripts/eval/eval_prime.py - Benchmark evaluation script
  • recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/train_model_open_rs2_prime.yaml - Training config

References

@article{wang2025tina,
  title={Tina: Tiny Reasoning Models via LoRA},
  author={Wang, Shangshang and Zheng, Julian and Chia, Yee Whye},
  journal={arXiv preprint arXiv:2504.15777},
  year={2025}
}

@article{cui2025prime,
  title={Process Reinforcement through Implicit Rewards},
  author={Cui, Ganqu and Li, Lifan and Xiang, Bingxiang and Wang, Yuling and others},
  journal={arXiv preprint arXiv:2502.01456},
  year={2025}
}

@misc{deepseek-r1,
  title={DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning},
  author={DeepSeek-AI},
  year={2025}
}

License

This project is released under the Apache 2.0 License, following the base model and Tina framework licenses.
