Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't
Abstract
Enhancing the reasoning capabilities of large language models (LLMs) typically relies on massive computational resources and extensive datasets, limiting accessibility for resource-constrained settings. Our study investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs, focusing on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a compact, high-quality mathematical reasoning dataset, we conducted three experiments to explore model behavior and performance. Our results demonstrate rapid reasoning gains (e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview) using only 7,000 samples and a $42 training cost, compared to thousands of dollars for baseline models. However, challenges such as optimization instability and length constraints emerged with prolonged training. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches. We release our code and datasets as open-source resources, providing insights into trade-offs and laying a foundation for scalable, reasoning-capable LLMs in resource-limited environments. All are available at https://github.com/knoveleng/open-rs.
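For readers unfamiliar with GRPO, the core idea is that each prompt's sampled completions are scored and normalized against one another, so no separate value model is needed. A minimal sketch of that group-relative advantage, written from the GRPO paper's description rather than the open-rs code (the reward values below are made up):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Normalize the rewards of one prompt's sampled completions within the group.

    GRPO samples G completions per prompt, scores each with a rule-based
    reward, and uses the group-normalized reward as the advantage, so no
    learned value model is required.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 completions for one math problem, rewarded 1.0 when the final
# boxed answer matches the reference (a simplified accuracy-style reward).
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(group_relative_advantages(rewards))
```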
Community
Very happy to share our work with the community!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models (2025)
- Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning (2025)
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models (2025)
- Pensez: Less Data, Better Reasoning -- Rethinking French LLM (2025)
- AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO (2025)
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning (2025)
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search (2025)
Hi team! Thank you for your work and for open-sourcing the models and dataset. I have a question about checkpoint selection.
Since all three experiments were somewhat unstable and failed to complete the full training process, I'm wondering how you selected the final checkpoints reported in Table 1 (steps 100, 50, and 50).
Did you choose the checkpoint with the highest AIME/AMC performance before training collapsed, or was it based on a validation set (which is not mentioned in your paper)?
Hi @zwhe99, thanks for your question. Here is my answer:
- We chose the final checkpoint based on the insight from Experiment 1, which is also shown in Figure 2 of the paper.
- Furthermore, we observed that the model's behavior degrades after 200-250 training steps, when languages other than English start appearing in its outputs.
You are welcome to ask more questions!
@zwhe99
Regarding the discrepancy you see between step 50 and step 100: in Experiment 1 we originally logged every 100 steps. We stopped that experiment at step 500 because the model could no longer generate English-only text beyond that point (a full training run would have taken about 1,500 steps). After Experiment 1 finished, we found we could reduce the data and the logging interval, so logging every 50 steps was more reasonable for observation than every 100.
I see. The performance fluctuates wildly during training. But how did you decide between 50 and 100 steps?
By the way, the reason I'm asking is that I found Open-RS3 performs very inconsistently across different test sets: it's very strong on AIME24 (ranked #1 among all the models I tested) but poor on AIME25 (ranked at the bottom of all the models I tested). Open-RS3's score on AIME25 is 22.7, even lower than R1-Distill-Qwen-1.5B's 24.4.
Can you share the AIME25 evaluation you ran? I will test it again.
So... I suspect this inconsistency might be due to some form of AIME24-specific checkpoint selection. Correct me if I’m wrong about anything.
I'm not sure. We just found out that performance differs significantly across different GPUs, so I don't know whether that's the cause. You can read more in this issue in the repo.
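As an aside, one common way to make scores on tiny test sets like AIME (30 problems per year) more stable across GPUs and runs is to average single-sample accuracy over several sampled generations per problem instead of trusting one run. A rough sketch, where `generate_answer` and `is_correct` are hypothetical stand-ins for the actual evaluation harness:

```python
import statistics

def averaged_pass_at_1(problems, generate_answer, is_correct, k: int = 16) -> float:
    """Average single-sample accuracy over k sampled generations per problem.

    With only ~30 problems, a single sampled run can swing by several points;
    averaging over k samples (and reporting the spread) makes comparisons
    across hardware, seeds, or inference stacks more meaningful.
    """
    per_problem = []
    for problem in problems:
        hits = sum(is_correct(problem, generate_answer(problem)) for _ in range(k))
        per_problem.append(hits / k)
    return statistics.mean(per_problem)
```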
I got you. However, I must point out that this is not a reasonable checkpoint selection methodology. The correct approaches are typically one of the following:
- Selecting the final checkpoint after training completion, or
- Choosing checkpoints based on validation set performance.
Under no circumstances should checkpoint selection be performed by examining test set performance (e.g., using methods like those in Figure 2).
You can also select the checkpoint with the highest test set performance to explore the ideal peak performance, but this should be clearly explained and clarified in the paper.
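A minimal sketch of the second option above (validation-based selection), assuming checkpoints are saved as `checkpoint-<step>` directories and that an `evaluate_accuracy` helper wraps the eval harness; both the helper and the paths are hypothetical, not part of the open-rs code:

```python
from pathlib import Path

def evaluate_accuracy(checkpoint_dir: Path, dataset: str) -> float:
    """Hypothetical helper: run the eval harness on `dataset`, return accuracy."""
    raise NotImplementedError

def select_checkpoint(run_dir: str, val_set: str = "held-out-math-validation") -> Path:
    # Score every saved checkpoint on the held-out validation split only;
    # the test sets (AIME24/25, AMC23) are evaluated once, after selection.
    checkpoints = sorted(Path(run_dir).glob("checkpoint-*"),
                         key=lambda p: int(p.name.split("-")[-1]))
    scores = {ckpt: evaluate_accuracy(ckpt, val_set) for ckpt in checkpoints}
    return max(scores, key=scores.get)
```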
Let me see if I can reproduce the result on AIME24.
Let me know when you're done with it.