Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't
Abstract
Enhancing the reasoning capabilities of large language models (LLMs) typically relies on massive computational resources and extensive datasets, limiting accessibility for resource-constrained settings. Our study investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs, focusing on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a compact, high-quality mathematical reasoning dataset, we conducted three experiments to explore model behavior and performance. Our results demonstrate rapid reasoning gains (e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview) using only 7,000 samples and a $42 training cost, compared to thousands of dollars for baseline models. However, challenges such as optimization instability and length constraints emerged with prolonged training. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches. We release our code and datasets as open-source resources, providing insights into trade-offs and laying a foundation for scalable, reasoning-capable LLMs in resource-limited environments. All are available at https://github.com/knoveleng/open-rs.
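For readers unfamiliar with GRPO, the core idea is that each prompt's sampled completions are scored and normalized against one another, so no separate value model is needed. A minimal sketch of that group-relative advantage, written from the GRPO paper's description rather than the open-rs code (the reward values below are made up):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Normalize the rewards of one prompt's sampled completions within the group.

    GRPO samples G completions per prompt, scores each with a rule-based
    reward, and uses the group-normalized reward as the advantage, so no
    learned value model is required.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 completions for one math problem, rewarded 1.0 when the final
# boxed answer matches the reference (a simplified accuracy-style reward).
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(group_relative_advantages(rewards))
```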
Community
Very happy to share our work with the community!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models (2025)
- Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning (2025)
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models (2025)
- Pensez: Less Data, Better Reasoning -- Rethinking French LLM (2025)
- AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO (2025)
- L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning (2025)
- Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search (2025)
Hi team! Thank you for your work and for open-sourcing the models and dataset. I have a question about checkpoint selection.
Since all three experiments were somewhat unstable and failed to complete the full training process, I'm wondering how you selected the final checkpoints reported in Table 1 (steps 100, 50, and 50).
Did you choose the checkpoint with the highest AIME/AMC performance before training collapsed, or was it based on a validation set (which is not mentioned in your paper)?
Hi @zwhe99, thanks for your question. Here is my answer:
- We chose the final checkpoint based on the insight from Experiment 1, which is also shown in Figure 2 of the paper.
- Furthermore, we observed that the model's behavior degrades after 200-250 training steps, when languages other than English start appearing in its outputs.
You are welcome to ask more questions!
@zwhe99
Regarding the discrepancy you see between step 50 and step 100: in Experiment 1 we originally logged every 100 steps. We stopped that experiment at step 500 because the model could no longer generate English-only text beyond that point (a full training run would have taken about 1,500 steps). After Experiment 1 finished, we found we could reduce the data and the logging interval, so logging every 50 steps was more reasonable for observation than every 100.
I see. The performance fluctuates wildly during training. But how did you decide between 50 and 100 steps?
By the way, the reason I'm asking is that I found Open-RS3 performs very inconsistently across different test sets: it's very strong on AIME24 (ranked #1 among all the models I tested) but poor on AIME25 (ranked at the bottom of all the models I tested). Open-RS3's score on AIME25 is 22.7, even lower than R1-Distill-Qwen-1.5B's 24.4.
Can you share the AIME25 evaluation you ran? I will test it again.
So... I suspect this inconsistency might be due to some form of AIME24-specific checkpoint selection. Correct me if I’m wrong about anything.
I'm not sure. We just found out that performance differs significantly across different GPUs, so I don't know whether that's the cause. You can read more in this issue in the repo.
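As an aside, one common way to make scores on tiny test sets like AIME (30 problems per year) more stable across GPUs and runs is to average single-sample accuracy over several sampled generations per problem instead of trusting one run. A rough sketch, where `generate_answer` and `is_correct` are hypothetical stand-ins for the actual evaluation harness:

```python
import statistics

def averaged_pass_at_1(problems, generate_answer, is_correct, k: int = 16) -> float:
    """Average single-sample accuracy over k sampled generations per problem.

    With only ~30 problems, a single sampled run can swing by several points;
    averaging over k samples (and reporting the spread) makes comparisons
    across hardware, seeds, or inference stacks more meaningful.
    """
    per_problem = []
    for problem in problems:
        hits = sum(is_correct(problem, generate_answer(problem)) for _ in range(k))
        per_problem.append(hits / k)
    return statistics.mean(per_problem)
```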
I got you. However, I must point out that this is not a reasonable checkpoint selection methodology. The correct approaches are typically one of the following:
- Selecting the final checkpoint after training completion, or
- Choosing checkpoints based on validation set performance.
Under no circumstances should checkpoint selection be performed by examining test set performance (e.g., using methods like those in Figure 2).
You can also select the checkpoint with the highest test set performance to explore the ideal peak performance, but this should be clearly explained and clarified in the paper.
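A minimal sketch of the second option above (validation-based selection), assuming checkpoints are saved as `checkpoint-<step>` directories and that an `evaluate_accuracy` helper wraps the eval harness; both the helper and the paths are hypothetical, not part of the open-rs code:

```python
from pathlib import Path

def evaluate_accuracy(checkpoint_dir: Path, dataset: str) -> float:
    """Hypothetical helper: run the eval harness on `dataset`, return accuracy."""
    raise NotImplementedError

def select_checkpoint(run_dir: str, val_set: str = "held-out-math-validation") -> Path:
    # Score every saved checkpoint on the held-out validation split only;
    # the test sets (AIME24/25, AMC23) are evaluated once, after selection.
    checkpoints = sorted(Path(run_dir).glob("checkpoint-*"),
                         key=lambda p: int(p.name.split("-")[-1]))
    scores = {ckpt: evaluate_accuracy(ckpt, val_set) for ckpt in checkpoints}
    return max(scores, key=scores.get)
```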
Let me see if I can reproduce the result on AIME24.
Let me know when you're done with it.