Jaward posted an update 14 days ago
The beauty of GRPO is that it doesn't care whether the rewards are rule-based or learned. The hack: let the data self-normalize. Trajectories in a batch compete against their group mean, so there's no value model and no extra parameters, just clean, efficient RL that cuts memory usage by 50% while maintaining SOTA performance. Btw, it was introduced 9 months prior to R1: arxiv.org/pdf/2402.03300
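
A minimal sketch of the group-relative advantage this refers to: rewards for the completions sampled from one prompt are normalized against that group's own mean (and std), with no value network. The tensor shape and the `eps` constant are my assumptions for illustration, not details from the post.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages: each sampled completion's reward is
    compared against the mean/std of its own group.
    `rewards` has shape (num_prompts, group_size); no value model needed."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each, rule-based 0/1 rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```

Because the baseline is just the batch statistics of the group, the critic that PPO would otherwise keep in memory disappears, which is where the memory savings come from.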

Yeah, the fun part is that I can use any QA dataset with GRPO just by instructing the model to follow a simple rule: place your answer in \boxed{} or ** ** tags. Then I extract it with a regex, and it simply works.
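
A sketch of what such a rule-based reward could look like. The comment only says "I do a regex", so the exact patterns and the `extract_answer` / `reward_fn` names below are illustrative assumptions, not the author's code.

```python
import re

def extract_answer(completion: str) -> str | None:
    """Pull the final answer out of \\boxed{...} or **...** markers
    (patterns assumed for illustration)."""
    m = re.search(r"\\boxed\{([^}]*)\}", completion)
    if m is None:
        m = re.search(r"\*\*(.+?)\*\*", completion)
    return m.group(1).strip() if m else None

def reward_fn(completion: str, gold: str) -> float:
    """Rule-based reward: 1.0 if the extracted answer matches the reference."""
    ans = extract_answer(completion)
    return 1.0 if ans is not None and ans == gold.strip() else 0.0

print(reward_fn("The result is \\boxed{42}.", "42"))      # 1.0
print(reward_fn("So the answer is **Paris**.", "Paris"))  # 1.0
```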