# Granite GRPO Mathematical Reasoning Model
## Model Details
- Base Model: IBM Granite 3.1-2b-instruct (`ibm-granite/granite-3.1-2b-instruct`)
- Training Approach: Group Relative Policy Optimization (GRPO)
- Dataset: GSM8K (Grade School Math 8K)
- Training Progress: 400/1000 steps (40% complete)
- Training Configuration:
  - Learning rate: 3e-6
  - Batch size: 1 per device
  - Gradient accumulation steps: 8
  - Mixed precision: bfloat16
  - DeepSpeed ZeRO-3 optimization (see the config sketch below)
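
The exact DeepSpeed configuration file for this run is not published; the dict below is an illustrative sketch using standard DeepSpeed keys that correspond to the settings listed above:

```python
# Illustrative DeepSpeed ZeRO-3 config matching the settings above
# (an assumption, not the actual file used for this run).
ds_config = {
    "bf16": {"enabled": True},            # mixed precision: bfloat16
    "zero_optimization": {"stage": 3},    # ZeRO-3 shards params, grads, optimizer state
    "train_micro_batch_size_per_gpu": 1,  # batch size: 1 per device
    "gradient_accumulation_steps": 8,     # with 7 GPUs: 1 * 8 * 7 = 56 samples per step
}
```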
## Training Methodology

This model was trained using GRPO, which samples a group of completions per prompt and normalizes each completion's reward against the group average to form its advantage. The total reward combines multiple reward functions (see the sketch after this list):
- Correctness Reward (1.0 max): Exact match with reference answer
- Integer Format (0.5 max): Validates numerical answer format
- Strict Format (0.5 max): Enforces XML-style response structure
- Soft Format (0.5 max): Ensures basic response organization
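
The reward implementations themselves are not included in this card. The sketch below is one plausible reading of the four components, assuming completions follow the `<reasoning>`/`<answer>` layout described under Output Format:

```python
import re

ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)

def extract_answer(completion: str) -> str:
    """Return the text inside the <answer> tags, or an empty string."""
    match = ANSWER_RE.search(completion)
    return match.group(1).strip() if match else ""

def correctness_reward(completion: str, reference: str) -> float:
    """1.0 max: exact match with the reference answer."""
    return 1.0 if extract_answer(completion) == reference.strip() else 0.0

def integer_format_reward(completion: str) -> float:
    """0.5 max: the extracted answer is a plain integer."""
    return 0.5 if re.fullmatch(r"-?\d+", extract_answer(completion)) else 0.0

def strict_format_reward(completion: str) -> float:
    """0.5 max: the whole response follows the exact XML-style layout."""
    pattern = r"<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n?"
    return 0.5 if re.fullmatch(pattern, completion, re.DOTALL) else 0.0

def soft_format_reward(completion: str) -> float:
    """0.5 max: both tag pairs appear somewhere, spacing not enforced."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    return 0.5 if re.search(pattern, completion, re.DOTALL) else 0.0
```

Under this reading the maximum total reward is 1.0 + 0.5 + 0.5 + 0.5 = 2.5, which is consistent with the ~2.0 total reward reported at step 400.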
## Performance Metrics (at step 400)
- Correctness Reward: ~0.65-0.70 (improving)
- Format Rewards:
  - Strict format: ~0.40
  - Soft format: ~0.45
  - Integer format: ~0.45
- Total Reward: ~2.0 (stable)
## Input Format
The model expects inputs in the following format:
[Question text]
## Output Format
The model generates responses in the following structure:
<reasoning>
Step-by-step mathematical reasoning
</reasoning>
<answer>
Numerical answer
</answer>
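
A minimal inference sketch with the Hugging Face `transformers` library; the decoding settings (greedy, up to 512 new tokens) and the sample GSM8K question are illustrative choices, not documented defaults for this model:

```python
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "manavg/granite-grpo-gsm8k-40pct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

question = (
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether "
    "in April and May?"
)
inputs = tokenizer(question, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
completion = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Pull the numeric answer out of the <answer> block.
match = re.search(r"<answer>\s*(.*?)\s*</answer>", completion, re.DOTALL)
print(match.group(1).strip() if match else completion)
```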
## Limitations
- Training is incomplete (40% of planned steps)
- Experimental research model
- Performance may vary on complex mathematical problems
- Limited to grade-school level mathematics
- May occasionally produce incorrect reasoning despite correct answers
## Intended Use
- Mathematical problem-solving assistance
- Educational support for grade-school math
- Research in mathematical reasoning capabilities of language models
## Training Infrastructure
- Framework: DeepSpeed ZeRO-3
- Hardware: 7 GPUs
- Mixed Precision: bfloat16
## License
Apache 2.0
## Citation

If you use this model, please cite:

@software{granite-grpo-gsm8k,
  author = {Your Name},
  title = {Granite GRPO Mathematical Reasoning Model},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/manavg/granite-grpo-gsm8k-40pct}
}
## Acknowledgments
- IBM for the base Granite model
- OpenAI for the GSM8K dataset