Reasoning Work
Collection
Models I've trained to reason like DeepSeek R1 using online reinforcement learning: Group Relative Policy Optimization (GRPO), the method introduced in the DeepSeekMath paper.
6 items
This Qwen2 model was trained 2x faster with Unsloth and Hugging Face's TRL library.
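For readers unfamiliar with GRPO, the core idea is that each sampled completion is scored relative to the other completions drawn for the same prompt, rather than against a learned value function. The snippet below is only a minimal sketch of that group-relative advantage step, not the training code used for these models. The function name `group_relative_advantages`, the `eps` parameter, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of completions for a single prompt.

    Hypothetical helper: each advantage is (reward - group mean) divided by
    (group standard deviation + eps). The eps term is an assumed safeguard
    against division by zero when every completion gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for four completions sampled from the same prompt
# (made-up values, e.g. 1.0 for a correct answer and 0.0 otherwise).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above the group average receive a positive advantage and are reinforced; those below receive a negative one. In practice a library such as TRL handles this step inside its GRPO trainer, so the sketch is for intuition only.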