
Online-DPO-R1

Introduction

We release unofficial checkpoints for PPO, iterative DPO, and rejection sampling (RAFT) models trained from Qwen2.5-MATH-7B-base with rule-based RL, building on the success of DeepSeek-R1-Zero and recent replications of the PPO approach. Evaluated on five widely adopted benchmarks (AIME 2024, MATH 500, AMC, Minerva Math, and OlympiadBench), our iterative DPO and RAFT models achieve significant improvements over the base model and are comparable to the PPO approach. Our models are trained using prompts from the MATH training set and Numina Math.

Moreover, we provide a detailed recipe to reproduce the model. Enjoy!
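For readers unfamiliar with R1-Zero-style rule-based rewards, the reward is simply a correctness check on the model's final answer rather than a learned reward model. Below is a minimal sketch of such a check; the `extract_boxed_answer` helper and the exact-match rule are illustrative assumptions on our part, not the released training code.

```python
import re

def extract_boxed_answer(text: str) -> str | None:
    # Illustrative helper: grab the content of the last \boxed{...}
    # in a response (no handling of nested braces, for brevity).
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None

def rule_based_reward(response: str, gold_answer: str) -> float:
    # Binary correctness reward: 1.0 if the final boxed answer matches
    # the ground truth, 0.0 otherwise. Real pipelines typically use a
    # math-equivalence checker instead of exact string match.
    pred = extract_boxed_answer(response)
    return 1.0 if pred is not None and pred == gold_answer.strip() else 0.0
```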

Model Releases

Dataset

Training Methods

  • Iterative DPO: Following the RLHF Workflow framework (https://arxiv.org/pdf/2405.07863), in each iteration we sample multiple responses from the most recently trained policy, rank them with the rule-based reward, and construct preference pairs. We then optimize the policy by minimizing the DPO loss and move on to the next iteration. Online iterative DPO effectively mitigates distribution shift and the limited coverage of offline data. More details can be found in our blog, and a sketch of one iteration is given below!
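The following sketch shows the two core pieces of an iteration: pair construction from reward-ranked samples, and the standard DPO loss. The pairing rule (every correct response paired with every incorrect one) and all function names are our own illustrative assumptions, not the released training code.

```python
import torch.nn.functional as F

def build_preference_pairs(prompt, responses, rewards):
    # Rank sampled responses by the rule-based reward and form
    # (chosen, rejected) pairs: here, each correct response is paired
    # with each incorrect one (an illustrative pairing rule).
    chosen = [r for r, s in zip(responses, rewards) if s > 0]
    rejected = [r for r, s in zip(responses, rewards) if s == 0]
    return [(prompt, c, w) for c in chosen for w in rejected]

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Standard DPO objective on summed response log-probabilities:
    # push the policy's chosen-vs-rejected log-ratio margin above the
    # frozen reference model's margin.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```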

Performance

| Model | AIME 2024 | MATH 500 | AMC | Minerva Math | OlympiadBench | Average |
|---|---|---|---|---|---|---|
| **Ours** | | | | | | |
| RLHFlow/Qwen2.5-7B-PPO-Zero | 43.3 (+26.6) | 79.4 (+27.0) | 62.5 (+10.0) | 33.1 (+20.2) | 40.7 (+24.3) | 51.8 (+21.6) |
| RLHFlow/Qwen2.5-7B-DPO-Zero | 26.8 (+10.1) | 76.8 (+24.4) | 62.5 (+10.0) | 30.9 (+18.0) | 37.9 (+21.5) | 47.0 (+16.8) |
| RLHFlow/Qwen2.5-7B-RAFT-Zero | 20.0 (+3.3) | 77.6 (+25.2) | 55.0 (+2.5) | 30.5 (+17.6) | 38.7 (+22.3) | 44.4 (+14.2) |
| **Baselines** | | | | | | |
| Qwen2.5-Math-7B-Base | 16.7 | 52.4 | 52.5 | 12.9 | 16.4 | 30.2 |
| Qwen2.5-Math-7B-Instruct | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 | 43.8 |
| Llama-3.1-70B-Instruct | 16.7 | 64.6 | 30.1 | 35.3 | 31.9 | 35.7 |
| Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
| GPT-4o | 9.3 | 76.4 | 45.8 | 36.8 | 43.3 | 43.3 |

Numbers in parentheses are improvements over Qwen2.5-Math-7B-Base.

Usage
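A minimal inference sketch with the transformers library is shown below. The chat-template call assumes the checkpoint inherits Qwen2.5-Math's chat template; that assumption, the generation settings, and the example prompt are ours.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RLHFlow/Qwen2.5-7B-DPO-Zero"  # or the PPO / RAFT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Assumed prompt convention: ask for the final answer in \boxed{} so the
# rule-based evaluation format carries over to inference.
messages = [{"role": "user",
             "content": "Find the sum of all positive divisors of 28. "
                        "Put your final answer in \\boxed{}."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```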

Citation
