Granite GRPO Mathematical Reasoning Model

Model Details

  • Base Model: IBM Granite 3.1-2b
  • Training Approach: Group Relative Policy Optimization (GRPO)
  • Dataset: GSM8K (Grade School Math 8K)
  • Training Progress: 400/1000 steps (40% complete)
  • Training Configuration (see the sketch after this list):
    • Learning rate: 3e-6
    • Batch size: 1 per device
    • Gradient accumulation steps: 8
    • Mixed precision: bfloat16
    • DeepSpeed ZeRO-3 optimization
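
The original training script is not published with this card. As a rough sketch, the configuration above maps onto TRL's GRPOConfig as follows; the output directory and DeepSpeed config filename are placeholders, not values published with the model:

```python
# Hypothetical mapping of the listed hyperparameters onto TRL's GRPOConfig.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="granite-grpo-gsm8k",    # placeholder output directory
    learning_rate=3e-6,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,                          # bfloat16 mixed precision
    max_steps=1000,                     # planned total; this checkpoint is step 400
    deepspeed="ds_zero3_config.json",   # placeholder path; see Training Infrastructure
)
```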

Training Methodology

This model was trained using GRPO with four reward functions (illustrative implementations are sketched after the list):

  1. Correctness Reward (1.0 max): Exact match with reference answer
  2. Integer Format (0.5 max): Validates numerical answer format
  3. Strict Format (0.5 max): Enforces XML-style response structure
  4. Soft Format (0.5 max): Ensures basic response organization
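
The card publishes only the reward names and maxima, not the implementations. The sketch below shows one plausible reconstruction using TRL's reward-function convention, where completions arrive as generated strings and dataset columns (here `answer`) arrive as keyword arguments; the function bodies are assumptions, not the exact training code:

```python
# Illustrative reconstructions of the four rewards described above.
import re

def extract_answer(text):
    """Return the contents of the <answer> tag, or an empty string."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return match.group(1) if match else ""

def correctness_reward(completions, answer, **kwargs):
    # 1.0 for an exact match with the reference answer, else 0.0.
    return [1.0 if extract_answer(c) == a else 0.0 for c, a in zip(completions, answer)]

def int_reward(completions, **kwargs):
    # 0.5 if the extracted answer is a bare (possibly negative) integer.
    return [0.5 if extract_answer(c).lstrip("-").isdigit() else 0.0 for c in completions]

def strict_format_reward(completions, **kwargs):
    # 0.5 only when the whole response is <reasoning>...</reasoning><answer>...</answer>.
    pattern = r"^<reasoning>.*?</reasoning>\s*<answer>.*?</answer>\s*$"
    return [0.5 if re.match(pattern, c, re.DOTALL) else 0.0 for c in completions]

def soft_format_reward(completions, **kwargs):
    # 0.5 when both tag pairs appear somewhere, in order, with any surrounding text.
    pattern = r"<reasoning>.*?</reasoning>.*?<answer>.*?</answer>"
    return [0.5 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]
```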

Performance Metrics (at step 400)

  • Correctness Reward: ~0.65-0.70 (improving)
  • Format Rewards:
    • Strict format: ~0.40
    • Soft format: ~0.45
    • Integer format: ~0.45
  • Total Reward: ~2.0 of a 2.5 maximum (stable)

Input Format

The model expects inputs in the following format:

[Question text]

Output Format

The model generates responses in the following structure:

<reasoning>
Step-by-step mathematical reasoning
</reasoning>
<answer>
Numerical answer
</answer>
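
A minimal usage sketch with transformers follows; the chat-template call and generation settings are assumptions rather than settings published with this checkpoint:

```python
# Minimal inference sketch; generation settings are illustrative defaults.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "manavg/granite-grpo-gsm8k-40pct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

question = ("Natalia sold clips to 48 of her friends in April, and then she sold "
            "half as many clips in May. How many clips did Natalia sell altogether "
            "in April and May?")
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512)
text = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

# Pull the final numerical answer out of the <answer> tag.
match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
print(match.group(1) if match else text)
```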

Limitations

  • Training is incomplete (40% of planned steps)
  • Experimental research model
  • Performance may vary on complex mathematical problems
  • Limited to grade-school level mathematics
  • May occasionally produce incorrect reasoning despite correct answers

Intended Use

  • Mathematical problem-solving assistance
  • Educational support for grade-school math
  • Research in mathematical reasoning capabilities of language models

Training Infrastructure

  • Framework: DeepSpeed ZeRO-3 (an example config is sketched below)
  • Hardware: 7 GPUs
  • Mixed Precision: bfloat16
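
The ZeRO-3 configuration file is not published with this card; a minimal config consistent with the settings above might look like this sketch:

```python
# Illustrative DeepSpeed ZeRO-3 config matching the settings listed above.
import json

ds_config = {
    "bf16": {"enabled": True},              # bfloat16 mixed precision
    "zero_optimization": {
        "stage": 3,                         # shard params, gradients, optimizer state
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "train_micro_batch_size_per_gpu": 1,    # per-device batch size from the card
    "gradient_accumulation_steps": 8,
}

with open("ds_zero3_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```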

License

Apache 2.0

Citation

If you use this model, please cite:

@software{granite-grpo-gsm8k,
  author = {Your Name},
  title = {Granite GRPO Mathematical Reasoning Model},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/manavg/granite-grpo-gsm8k-40pct}
}

Acknowledgments

  • IBM for the base Granite model
  • OpenAI for the GSM8K dataset