yyqoni
/

rlhflow-llama-3-sft-8b-v2-bandit-ppo-60k

Text Generation

text-generation-inference

Model card Files Files and versions

This is the bandit reward based ppo model introduced in the preprint Segmenting Text and Learning Their Rewards for Improved RLHF in Language Models (https://arxiv.org/abs/2501.02790). For more details, please visit our repository at https://github.com/yinyueqin/DenseRewardRLHF-PPO.

Downloads last month: -

Safetensors

Model size

8B params

Tensor type

BF16

·

Model tree for yyqoni/rlhflow-llama-3-sft-8b-v2-bandit-ppo-60k

Base model

RLHFlow/LLaMA3-SFT-v2

Finetuned

(5)

this model

Dataset used to train yyqoni/rlhflow-llama-3-sft-8b-v2-bandit-ppo-60k

Collection including yyqoni/rlhflow-llama-3-sft-8b-v2-bandit-ppo-60k

DenseRewardRLHF-PPO

This repository contains the released models for our paper Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model. • 18 items • Updated Jan 11, 2025 • 1

Paper for yyqoni/rlhflow-llama-3-sft-8b-v2-bandit-ppo-60k

Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model

Paper • 2501.02790 • Published Jan 6, 2025 • 8