Open R1: Update #2

Community article · Published February 10, 2025


We are now two weeks into the Open R1 project which aims to reconstruct the missing pieces of DeepSeek R1—specifically, the training pipeline and synthetic data.

In this post, we are happy to share the construction of OpenR1-Math-220k: our first large-scale dataset for mathematical reasoning!

We also take a look at some exciting developments from the community towards curating small, high-quality datasets for fine-tuning, along with insights into how to control the length of the chain-of-thought from reasoning models at both train-time and inference-time.

Let’s dive in!

OpenR1-Math-220k dataset

One of the key advantages of DeepSeek R1 is its ability to transfer advanced reasoning capabilities to smaller models through distillation. The DeepSeek team demonstrated this by generating 600k reasoning traces and fine-tuning a series of Qwen and Llama models, showing that direct distillation from R1 can achieve competitive reasoning performance without reinforcement learning. Notably, DeepSeek-R1-Distill-Qwen-7B achieved 55.5% on AIME 2024, surpassing larger models like QwQ-32B-Preview.

However, the reasoning traces used for distillation have not been released publicly, prompting the community to independently recreate similar datasets. So far, multiple open datasets have been released by the community, including OpenThoughts-114k, Bespoke-Stratos-17k, Dolphin-R1, and LIMO.

🐳  Introducing OpenR1-Math-220k, a large-scale math reasoning dataset generated locally on 512 H100s, with multiple answers per problem. To create OpenR1-Math-220k, we collaborated with Numina, who have developed a brand new version of their popular NuminaMath-CoT dataset.

What’s new in OpenR1 dataset compared to existing datasets:

  • 800k R1 reasoning traces: We generate two answers for 400k problems using DeepSeek R1. The filtered dataset contains 220k problems with correct reasoning traces.
  • 512 H100s running locally: Instead of relying on an API, we leverage vLLM and SGLang to run generations locally on our science cluster, generating 180k reasoning traces per day.
  • Based on NuminaMath 1.5: we focus on math reasoning traces and generate answers for problems in NuminaMath 1.5, an improved version of the NuminaMath-CoT dataset.
  • Automated filtering: We apply Math Verify to retain only problems with at least one correct answer. We also leverage Llama-3.3-70B-Instruct as a judge to retrieve more correct examples (e.g. for cases with malformed answers that can’t be verified with a rules-based parser).
  • We match the performance of DeepSeek-Distill-Qwen-7B by fine-tuning Qwen2.5-Math-7B-Instruct on our dataset.

By demonstrating scalable, high-quality reasoning data generation, we hope this pipeline can be extended beyond math to domains like code generation.

Data generation

To build OpenR1-220k, we prompt DeepSeek R1 to generate solutions for 400k problems from NuminaMath 1.5. We follow the model card’s recommended parameters and prepend the following instruction to the user prompt:

"Please reason step by step, and put your final answer within \boxed{}."

We set a 16k token limit per generation, as our analysis showed that only 75% of problems could be solved in under 8k tokens, and most of the remaining problems required the full 16k tokens. Initially, we used vLLM for inference, achieving a throughput of 15 generations per hour per H100, and shared our generation scripts in previous updates and on the OpenR1 repo. Recently, we started experimenting with SGLang and we were able to generate 25 solutions per hour per H100 (almost 2x speedup!), enabling us to generate 300k problem solutions per day on 512 H100s. This allowed us to produce 800k reasoning traces in just a few days.
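
For illustration, the core of this generation step looks roughly like the sketch below. It is a minimal vLLM example: the dataset and model identifiers are the public ones referred to in this post, the sampling values follow the R1 model card as described above, and the parallelism setting is purely illustrative. The actual Slurm-based scripts are linked below.

```python
# Minimal sketch of the generation step (illustrative, not the production pipeline).
from datasets import load_dataset
from vllm import LLM, SamplingParams

INSTRUCTION = "Please reason step by step, and put your final answer within \\boxed{}."

problems = load_dataset("AI-MO/NuminaMath-1.5", split="train")

llm = LLM(model="deepseek-ai/DeepSeek-R1", tensor_parallel_size=8)  # R1 needs multiple GPUs
sampling = SamplingParams(
    temperature=0.6,   # recommended in the R1 model card
    top_p=0.95,
    max_tokens=16384,  # 16k token budget per generation
    n=2,               # two solutions per problem
)

conversations = [
    [{"role": "user", "content": f"{INSTRUCTION}\n\n{row['problem']}"}]
    for row in problems.select(range(100))  # small slice for illustration
]
outputs = llm.chat(conversations, sampling)
traces = [[choice.text for choice in out.outputs] for out in outputs]
```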

We generate two solutions per problem—and in some cases, four—to provide flexibility in filtering and training. This approach allows for rejection sampling, similar to DeepSeek R1’s methodology, and also makes the dataset suitable for preference optimisation methods like DPO.

The scripts for the data generation are available here: https://github.com/huggingface/open-r1/tree/main/slurm

Data Filtering

To retain only high-quality, correct reasoning traces, we leverage Math Verify, a robust mathematical expression evaluation system designed to assess LLM-generated answers. We extract the final answers from model generations and compare them against ground truth answers in the dataset.
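
In code, this rule-based check boils down to parsing both answers with Math-Verify and keeping a problem if any of its generations verifies. The sketch below uses the math-verify package with toy rows; the column names are illustrative, not the exact pipeline.

```python
# Sketch of the rule-based filtering step with Math-Verify (toy data, illustrative columns).
from datasets import Dataset
from math_verify import parse, verify

dataset = Dataset.from_list([
    {"answer": "$\\frac{1}{2}$", "generations": ["... so the answer is $\\boxed{\\frac{1}{2}}$"]},
    {"answer": "$3$",            "generations": ["... hence $\\boxed{4}$"]},
])

def has_correct_answer(row):
    """Keep a problem if at least one generated trace matches the gold answer."""
    gold = parse(row["answer"])            # ground-truth answer from NuminaMath 1.5
    for generation in row["generations"]:  # one or more R1 traces per problem
        predicted = parse(generation)      # extracts the final \boxed{...} expression
        if verify(gold, predicted):
            return True
    return False

filtered = dataset.filter(has_correct_answer)  # only the first toy row should survive
```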

We find that 55% of problems have at least one correct answer. However, some ground truth answers in NuminaMath 1.5 were empty or not in a verifiable format, making automatic validation challenging. While we have improved Math-Verify to handle these less common output formats more accurately (see the Math-Verify improvements below), we also explored an alternative method to recover valid solutions from rejected samples: using Llama-3.3-70B-Instruct as a judge on a subset of rejected problems. Before running this verification step, we filter out samples that are incomplete or that contain an empty ground truth answer, ensuring that only well-formed responses with a clearly boxed final answer are considered. This process successfully recovers 28,000 previously rejected problems.

We prompt Llama3.3-70B-Instruct as follows:

You are a mathematical answer validator. You will be provided with a mathematical problem and you need to compare the answer in the reference solution, and the final answer in a model's solution to determine if they are equivalent, even if formatted differently.

PROBLEM:

{problem}

REFERENCE SOLUTION:

{answer}

MODEL'S SOLUTION:

{generation}

Focus ONLY on comparing the final mathematical answer provided by the model while ignoring differences in:

- Formatting (e.g., \\boxed{{}} vs plain text)
- Multiple choice formatting (e.g., "A" vs full solution)
- Order of coordinate pairs or solutions
- Equivalent mathematical expressions or notation variations
- If the model's answer is nonsense, return "Verdict: AMBIGUOUS"

Start with a brief explanation of your comparison (2-3 sentences). Then output your final answer in one of the following formats:

- "Verdict: EQUIVALENT"
- "Verdict: DIFFERENT"
- "Verdict: AMBIGUOUS"

By combining rule-based verification (Math Verify) with LLM-based evaluation, we improve dataset quality while maintaining scale. The final dataset consists of 220k problems with verified reasoning traces, making it a valuable resource for training reasoning models. Providing multiple solutions per problem gives the community flexibility to filter for better generations and apply more targeted refinements based on NuminaMath data sources and problem types.


The dataset is available in two splits:

  • default (94k problems), which achieves the best performance after SFT.
  • extended (131k problems), which includes additional NuminaMath 1.5 sources like cn_k12, providing more reasoning traces. However, we observed that performance after SFT on this subset was lower than the default split, likely due to cn_k12 containing simpler questions compared to other sources.

For rows with multiple correct answers, we also tried applying a Reward Model (RM) as a final filter to select the best response. For each row with multiple correct R1 generations, we extracted the final answer by removing the thinking tokens (<think>…</think>) and then passed the problem plus the extracted answer to Qwen/Qwen2.5-Math-RM-72B, served with vLLM, to obtain a score. Using these scores, we ranked the correct responses within each row and selected the top-1 generation for the training dataset. Unfortunately, training ablations showed that this approach does not improve model performance compared with selecting a random correct generation. A possible improvement would be to include the reasoning trace, rather than just the final answer, when scoring with the RM.
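
A sketch of that scoring step is shown below; the serving details are omitted and the helper names are ours, but it captures the ranking logic and the fact that only the post-`</think>` answer is scored.

```python
# Sketch of RM-based selection: strip the <think> block, score the remaining answer,
# and keep the highest-scoring correct generation per problem (helper names illustrative).
import re

def strip_thinking(generation: str) -> str:
    """Remove the <think>...</think> block, keeping only the final answer."""
    return re.sub(r"<think>.*?</think>", "", generation, flags=re.DOTALL).strip()

def pick_best(problem: str, correct_generations: list[str], score_fn) -> str:
    """Rank correct generations with the reward model and return the top-1.

    `score_fn(problem, answer)` is assumed to wrap Qwen/Qwen2.5-Math-RM-72B served with vLLM.
    """
    answers = [strip_thinking(g) for g in correct_generations]
    scores = [score_fn(problem, a) for a in answers]
    best_idx = max(range(len(scores)), key=lambda i: scores[i])
    return correct_generations[best_idx]
```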

Performance Comparison with DeepSeek-Distill-Qwen-7B

We fine-tune Qwen2.5-Math-7B-Instruct for 3 epochs on the default split of the dataset using a learning rate of 5e-5. To extend the context length from 4k to 32k, we increase the RoPE base frequency to 300k. The training follows a linear learning rate schedule with a 10% warmup phase. The table below compares the performance of OpenR1-Qwen-7B to DeepSeek-Distill-Qwen-7B and OpenThinker-7B using lighteval.

| Model                    | MATH-500 | AIME24 | AIME25 |
|--------------------------|----------|--------|--------|
| DeepSeek-Distill-Qwen-7B | 91.6     | 43.3   | 40.0   |
| OpenR1-Qwen-7B           | 90.6     | 36.7   | 40.0   |
| OpenThinker-7B           | 89.6     | 30.0   | 33.3   |
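
For reference, the SFT recipe above maps onto a configuration roughly like the following. This is a sketch assuming TRL's SFTTrainer; argument names can differ across TRL versions, and the batch-size settings are guesses rather than our exact values.

```python
# Sketch of the SFT setup: 3 epochs, lr 5e-5, linear schedule with 10% warmup,
# 32k context via a larger RoPE base frequency (values from the text above).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTConfig, SFTTrainer

model_name = "Qwen/Qwen2.5-Math-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, rope_theta=300_000.0)
tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset = load_dataset("open-r1/OpenR1-Math-220k", "default", split="train")

config = SFTConfig(
    output_dir="OpenR1-Qwen-7B",
    num_train_epochs=3,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    max_seq_length=32_768,
    per_device_train_batch_size=1,   # batch-size values are illustrative
    gradient_accumulation_steps=8,
    bf16=True,
)
trainer = SFTTrainer(model=model, args=config, train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```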

This dataset represents an initial version, providing a foundation for further refinement. The community can explore additional filtering strategies to improve performance, such as rejection sampling, which was used in DeepSeek R1 to enhance quality.

Math-Verify improvements

We identified several failure cases in Math-Verify during our inspection of the verification results. To address these issues, we implemented significant improvements and fixes. We strongly recommend updating to the latest version (0.5.2) to benefit from these enhancements:

pip install math-verify==0.5.2

The following is a summary of the most important improvements (with a short usage example after the list):

  • Improved parsing and verification of text-only answers (e.g. $\text{E}$ == $E$)
  • Improved parsing of lists of answers (e.g. $1$ and $2$ and $3$ == $1,2,3$)
  • Fixed parsing of multiple boxed answers in a single LaTeX environment (e.g. $\boxed{1},\boxed{2}$ == {1,2})
  • Introduction of ordered tuples. Inferring whether a list is a tuple or a set is very hard, so we use the gold answer to guide us:
    • (1,2,3) ≠ {3,2,1}; 1,2,3 == {3,2,1}; {3,2,1} == {1,2,3}
  • Support for relational expressions (e.g. less than) in the gold answer and intervals in the prediction (e.g. $1 < x < 2$ == $(1,2)$)
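
A few of these cases, expressed with the math-verify API (illustrative; the exact behaviour depends on the extraction configs you pass):

```python
# Quick checks of the new Math-Verify behaviour described above (illustrative).
from math_verify import parse, verify

# Text-only answers: $\text{E}$ should now match $E$.
print(verify(parse(r"$\text{E}$"), parse(r"$E$")))

# Lists of answers: "$1$ and $2$ and $3$" vs "$1,2,3$".
print(verify(parse(r"$1,2,3$"), parse(r"$1$ and $2$ and $3$")))

# Ordered tuple in the gold answer vs a set in the prediction: should NOT verify.
print(verify(parse(r"$(1,2,3)$"), parse(r"$\{3,2,1\}$")))

# Relational gold answer vs interval prediction: $1 < x < 2$ vs $(1,2)$.
print(verify(parse(r"$1 < x < 2$"), parse(r"$(1,2)$")))
```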

Community highlights

This week saw the community explore GRPO from many different angles, while multiple research labs have shown that only ~1000 high quality training samples may be sufficient to elicit reasoning in existing open models.

GRPO in the wild

  • nrehiew showed that applying GRPO directly to the Qwen2.5-0.5B base model yields ~51% accuracy on the GSM8k benchmark, which is a 10 point improvement over the Qwen2.5-0.5B-Instruct model. Impressive results like these have prompted many discussions about the role of instruct data in pretraining, as people have not (yet) been able to obtain similar gains when applying GRPO to other base models like Llama 3. In particular, researchers at Sea AI Lab (SAIL) showed that base models can be easily prompted to produce self-reflection and that the “aha” moment from the DeepSeek-R1 paper may be more a symptom of the base model than the RL optimisation process.
  • Unsloth have applied their optimisation magic to enable models up to 15B parameters to be trained with GRPO with just 15GB VRAM 🤯. This means you can now use GRPO in Google Colab for free!
  • Wing Lian from Axolotl has shown that DoRA converges faster than both LoRA and full fine-tuning.
  • Alexander Doria found a way to craft reward functions for poetry. This is exciting as it provides one of the first public examples of GRPO being applied to a domain that is not conventionally treated as “verifiable”.

Evaluation

The first part of AIME 2025 was released this week; it consists of 15 difficult math problems that are used to train high school students for the International Math Olympiad. Over the past year, AIME 2024 has stood as the main benchmark for probing the mathematical capabilities of LLMs, and the community was excited to see how well models perform on a new set of unseen problems.

Do LLMs need to reason in natural language?


An interesting new research paper shows that by using a recurrent language model, it is possible to scale test-time compute by implicitly reasoning in latent space. This resembles Meta’s Coconut work on training language models in latent space, but now adapted to reasoning tasks. The advantage of these methods is that they are far more compute efficient: by reasoning in the latent space, the model does not need to generate huge amounts of “thinking” tokens to reach high performance.

A shift toward smaller, high-quality reasoning data?

While DeepSeek R1 leveraged 600k reasoning traces for distillation, recent work suggests that complex reasoning can emerge in language models not through massive-scale training, but through a small number of carefully curated samples.

One example of this approach is the s1K dataset. It consists of 1,000 carefully selected math questions with distilled reasoning traces from Gemini Flash. The selection approach focuses on difficulty, diversity, and quality. The authors fine-tune Qwen2.5-32B-Instruct on s1K and manage to exceed OpenAI’s o1-preview on competition math benchmarks by up to 27%.

Another dataset, LIMO, pushes this idea further, achieving strong performance on AIME and MATH benchmarks using only 817 training samples.  The authors hypothesize that when a model has already acquired extensive domain knowledge during pre-training, only a small number of well-structured examples may be needed to unlock advanced reasoning capabilities.

CoT length: budget forcing & reward shaping

One important ingredient allowing the fine-tuned Qwen2.5-32B-Instruct model from s1K to reach such strong performance is budget forcing, a test-time compute technique that either extends or truncates reasoning by appending “Wait” or an end-of-thinking token delimiter to the model’s generation, respectively. This tool allowed the authors to vary thinking time and conclude that their model exhibits test-time scaling: as thinking time increases, so does accuracy on different math benchmarks.
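
In code, budget forcing is a small wrapper around generation: stop at the end-of-thinking delimiter, optionally append “Wait” to keep the model thinking, and finally force the delimiter to get the answer. The sketch below assumes a vLLM-served copy of the s1 model; the delimiter string and model identifier are assumptions that depend on the chat template.

```python
# Sketch of budget forcing: suppress the end-of-thinking delimiter to extend reasoning,
# then force it to obtain the final answer (model id and delimiter are assumptions).
from vllm import LLM, SamplingParams

END_THINK = "</think>"  # end-of-thinking delimiter; depends on the model's chat template
llm = LLM(model="simplescaling/s1-32B", tensor_parallel_size=8)

def generate_with_budget(prompt: str, num_extensions: int = 1) -> str:
    text = prompt
    # Reasoning phase: each pass stops at the delimiter; appending "Wait" re-opens the thought.
    for step in range(num_extensions + 1):
        params = SamplingParams(max_tokens=8192, stop=[END_THINK])
        text += llm.generate([text], params)[0].outputs[0].text
        if step < num_extensions:
            text += "Wait"
    # Answer phase: append the delimiter so the model writes its final answer.
    text += END_THINK
    text += llm.generate([text], SamplingParams(max_tokens=512))[0].outputs[0].text
    return text
```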


Similarly, Demystifying Long Chain-of-Thought Reasoning in LLMs (Yeo et al.) also studies the effect of chain-of-thought (CoT) length on model performance. They introduce the Cosine Reward, a novel reward function that incentivizes shorter CoTs for correct generations and longer CoTs for wrong generations. This stabilizes RL training, particularly when the model has a relatively limited maximum context size and the average response length could otherwise explode. A repetition penalty is also employed when the model starts to show signs of reward hacking on hard questions by artificially increasing CoT length through repetition instead of attempting to solve the problem.
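
As we understand it from the paper, the reward interpolates along a cosine curve between a start and an end value as the CoT length approaches the maximum length, so that correct answers earn more when short and wrong answers are penalised less when long. The sketch below uses illustrative reward values, not the paper’s exact hyperparameters.

```python
# Sketch of a cosine length-scaled reward (reward values are illustrative).
import math

def cosine_interp(t: float, T: float, r_start: float, r_end: float) -> float:
    """Interpolate from r_start (at t=0) to r_end (at t=T) along a cosine curve."""
    return r_end + 0.5 * (r_start - r_end) * (1 + math.cos(math.pi * t / T))

def cosine_reward(is_correct: bool, gen_len: int, max_len: int = 16384) -> float:
    if gen_len >= max_len:
        return -1.0  # ran out of context budget
    if is_correct:
        return cosine_interp(gen_len, max_len, 2.0, 1.0)   # correct: shorter CoTs earn more
    return cosine_interp(gen_len, max_len, -1.0, -0.5)     # wrong: longer CoTs are penalised less

# A short correct answer beats a long correct one; a long wrong answer beats a short wrong one.
print(cosine_reward(True, 1_000), cosine_reward(True, 12_000))
print(cosine_reward(False, 1_000), cosine_reward(False, 12_000))
```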


What’s next?

Now that GRPO is humming in TRL, we are running an extensive set of experiments to understand which hyperparameters and reward functions have the greatest impact on training. You can follow our progress in the community tab, and we will write up our findings in the next update!
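
As a concrete starting point, a GRPO run on GSM8K with a simple accuracy reward looks roughly like the sketch below. It assumes TRL’s GRPOTrainer; the reward function, column handling, and hyperparameters are illustrative rather than our exact recipe.

```python
# Sketch of GRPO training with TRL on GSM8K using an exact-match accuracy reward.
import re
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def extract_gsm8k_answer(text: str):
    """GSM8K gold answers end with '#### <number>'."""
    match = re.search(r"####\s*(-?[\d,\.]+)", text)
    return match.group(1).replace(",", "") if match else None

dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(lambda x: {"prompt": x["question"], "target": extract_gsm8k_answer(x["answer"])})

def accuracy_reward(completions, target, **kwargs):
    """Reward 1.0 if the last number in the completion matches the gold answer, else 0.0."""
    rewards = []
    for completion, gold in zip(completions, target):
        numbers = re.findall(r"-?\d+\.?\d*", completion)
        rewards.append(1.0 if numbers and numbers[-1] == gold else 0.0)
    return rewards

config = GRPOConfig(output_dir="qwen-0.5b-grpo", num_generations=8, max_completion_length=512)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B",
    reward_funcs=accuracy_reward,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```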

If you want to contribute, check out the open-r1 repository on GitHub or follow the Hugging Face open-r1 org.

Community

Thanks for sharing your results and describing the background of what happened around GRPO research last week. Do you plan to test classical distillation, not just fine-tuning on reasoning traces?

Article author: Yes, that's something we might also try!

This is awesome! Great work!

A somewhat ambiguous point:

  • With Math Verify, the size of the filtered dataset is 220k (55% of 400k).
  • With LLM-based evaluation, 28k additional problems are retrieved from the rejected samples.

But the article claims:

"By combining rule-based verification (Math Verify) with LLM-based evaluation, we improve dataset quality while maintaining scale. The final dataset consists of 220k problems with verified reasoning traces, making it a valuable resource for training reasoning models."

Shouldn't the size be 248k? Otherwise, it seems like the LLM-based evaluation hasn't been included in the final dataset.

Article author: We only applied the Llama verification to the default subset; the samples rejected by Math Verify in the extended subset didn't go through a second verification step. We can release the unfiltered data with all 400k problems if the community wants to try different filtering.

Kind of wondering about the following statement:

"achieving a throughput of 15 generations per hour per H100"

Since DeepSeek-R1 can't fit on a single H100 (and based on Update #2, the model fits on 8xH100), how do you measure per-H100 throughput? Maybe 15 * 8 = 120 per 8xH100 node?

The model actually fits on two 8xH100 nodes: https://huggingface.co/blog/open-r1/update-1#synthetic-data-generation
The 15 generations per hour per H100 is the throughput measured on four nodes, divided by their 32 GPUs (we use four nodes to avoid the cache filling up).

Regarding the discussions about SFT data in pre-training: if I understand correctly, the idea is that models whose pretraining data contains some instruct data tend to learn to reason, while those without any instruct data never figure out how to reason.
