Online-DPO-R1
We release unofficial checkpoints for PPO, iterative DPO, and rejection sampling (RAFT), trained from Qwen2.5-Math-7B-Base with rule-based RL and inspired by the success of DeepSeek-R1-Zero and recent replications of the PPO approach. Evaluated on five widely adopted benchmarks (AIME 2024, MATH 500, AMC, Minerva Math, and OlympiadBench), our iterative DPO and RAFT models improve substantially over the base model, raising the average score from 30.2 to 47.0 and 44.4 respectively, and are comparable to the PPO model (51.8 average). Values in parentheses in the table are absolute gains over the base model. Our models are trained on prompts from the MATH training set and Numina Math.
Moreover, we provide a detailed recipe to reproduce the models. Enjoy!
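
To make the three training signals concrete, here is a minimal sketch of how a rule-based reward could drive both RAFT and iterative DPO data selection. It is an illustration under assumptions, not the project's recipe: it assumes the reward is 1 when the final `\boxed{...}` answer exactly matches the reference string and 0 otherwise, and the helper names (`extract_boxed`, `rule_reward`, `raft_select`, `dpo_pair`) are hypothetical.

```python
from typing import List, Optional, Tuple


def extract_boxed(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} in text, allowing nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
        i += 1
    return None  # unbalanced braces: treat as no answer


def rule_reward(response: str, reference: str) -> float:
    """Assumed rule: 1.0 for an exact string match of the boxed answer, else 0.0."""
    answer = extract_boxed(response)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0


def raft_select(responses: List[str], reference: str) -> List[str]:
    """Rejection sampling: keep only the sampled responses with reward 1."""
    return [r for r in responses if rule_reward(r, reference) == 1.0]


def dpo_pair(responses: List[str], reference: str) -> Optional[Tuple[str, str]]:
    """Build one (chosen, rejected) pair from a correct and an incorrect sample."""
    correct = [r for r in responses if rule_reward(r, reference) == 1.0]
    wrong = [r for r in responses if rule_reward(r, reference) == 0.0]
    if correct and wrong:
        return correct[0], wrong[0]
    return None  # all samples agree, so this prompt yields no preference pair


samples = ["2 + 2 = \\boxed{4}", "I think \\boxed{5}", "no final answer"]
kept = raft_select(samples, "4")
pair = dpo_pair(samples, "4")
```

In this sketch RAFT fine-tunes only on `kept`, while iterative DPO would use `pair` as a preference example and repeat sampling with the updated policy. A real verifier would normally compare answers up to mathematical equivalence instead of exact string match.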
Model | AIME 2024 | MATH 500 | AMC | Minerva Math | OlympiadBench | Average |
---|---|---|---|---|---|---|
Ours | ||||||
RLHFlow/Qwen2.5-7B-PPO-Zero | 43.3 (+26.6) | 79.4 (+27.0) | 62.5 (+10.0) | 33.1 (+20.2) | 40.7 (+24.3) | 51.8 (+21.6) |
RLHFlow/Qwen2.5-7B-DPO-Zero | 26.8 (+10.1) | 76.8 (+24.4) | 62.5 (+10.0) | 30.9 (+18.0) | 37.9 (+21.5) | 47.0 (+16.8) |
RLHFlow/Qwen2.5-7B-RAFT-Zero | 20.0 (+3.3) | 77.6 (+25.2) | 55.0 (+2.5) | 30.5 (+17.6) | 38.7 (+22.3) | 44.4 (+14.2) |
Baselines | ||||||
Qwen2.5-Math-7B-Base | 16.7 | 52.4 | 52.5 | 12.9 | 16.4 | 30.2 |
Qwen2.5-Math-7B-Instruct | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 | 43.8 |
Llama-3.1-70B-Instruct | 16.7 | 64.6 | 30.1 | 35.3 | 31.9 | 35.7 |
Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
GPT-4o | 9.3 | 76.4 | 45.8 | 36.8 | 43.3 | 43.3 |
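
The Average column and the parenthesized gains for our three models can be re-derived from the per-benchmark scores. The sketch below assumes the average is the unweighted mean of the five benchmarks, rounded to one decimal, which is consistent with the rows for our models and the base model; the function names are illustrative.

```python
# Per-benchmark scores copied from the table, in the order:
# AIME 2024, MATH 500, AMC, Minerva Math, OlympiadBench.
BASE = [16.7, 52.4, 52.5, 12.9, 16.4]  # Qwen2.5-Math-7B-Base
OURS = {
    "RLHFlow/Qwen2.5-7B-PPO-Zero": [43.3, 79.4, 62.5, 33.1, 40.7],
    "RLHFlow/Qwen2.5-7B-DPO-Zero": [26.8, 76.8, 62.5, 30.9, 37.9],
    "RLHFlow/Qwen2.5-7B-RAFT-Zero": [20.0, 77.6, 55.0, 30.5, 38.7],
}


def average(scores):
    """Unweighted mean over the benchmarks, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


def gains(scores, base):
    """Absolute per-benchmark improvement over the base model."""
    return [round(s - b, 1) for s, b in zip(scores, base)]


summary = {
    name: {
        "average": average(scores),
        "gains": gains(scores, BASE),
        "average_gain": round(average(scores) - average(BASE), 1),
    }
    for name, scores in OURS.items()
}

for name, row in summary.items():
    print(name, row["average"], row["gains"], row["average_gain"])
```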