Granite GRPO Mathematical Reasoning Model

Model Details

  • Base Model: IBM Granite 3.1-2b
  • Training Approach: Group Relative Policy Optimization (GRPO)
  • Dataset: GSM8K (Grade School Math 8K)
  • Training Progress: 400/1000 steps (40% complete)
  • Training Configuration (see the sketch after this list):
    • Learning rate: 3e-6
    • Batch size: 1 per device
    • Gradient accumulation steps: 8
    • Mixed precision: bfloat16
    • DeepSpeed ZeRO-3 optimization
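
The original training script is not published with this card. As a rough sketch, the configuration above maps onto TRL's GRPOConfig as follows; the output directory and DeepSpeed config filename are placeholders, not values published with the model:

```python
# Hypothetical mapping of the listed hyperparameters onto TRL's GRPOConfig.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="granite-grpo-gsm8k",    # placeholder output directory
    learning_rate=3e-6,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,                          # bfloat16 mixed precision
    max_steps=1000,                     # planned total; this checkpoint is step 400
    deepspeed="ds_zero3_config.json",   # placeholder path; see Training Infrastructure
)
```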

Training Methodology

This model was trained using GRPO with four reward functions (illustrative implementations are sketched after the list):

  1. Correctness Reward (1.0 max): Exact match with reference answer
  2. Integer Format (0.5 max): Validates numerical answer format
  3. Strict Format (0.5 max): Enforces XML-style response structure
  4. Soft Format (0.5 max): Ensures basic response organization
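
The card publishes only the reward names and maxima, not the implementations. The sketch below shows one plausible reconstruction using TRL's reward-function convention, where completions arrive as generated strings and dataset columns (here `answer`) arrive as keyword arguments; the function bodies are assumptions, not the exact training code:

```python
# Illustrative reconstructions of the four rewards described above.
import re

def extract_answer(text):
    """Return the contents of the <answer> tag, or an empty string."""
    match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
    return match.group(1) if match else ""

def correctness_reward(completions, answer, **kwargs):
    # 1.0 for an exact match with the reference answer, else 0.0.
    return [1.0 if extract_answer(c) == a else 0.0 for c, a in zip(completions, answer)]

def int_reward(completions, **kwargs):
    # 0.5 if the extracted answer is a bare (possibly negative) integer.
    return [0.5 if extract_answer(c).lstrip("-").isdigit() else 0.0 for c in completions]

def strict_format_reward(completions, **kwargs):
    # 0.5 only when the whole response is <reasoning>...</reasoning><answer>...</answer>.
    pattern = r"^<reasoning>.*?</reasoning>\s*<answer>.*?</answer>\s*$"
    return [0.5 if re.match(pattern, c, re.DOTALL) else 0.0 for c in completions]

def soft_format_reward(completions, **kwargs):
    # 0.5 when both tag pairs appear somewhere, in order, with any surrounding text.
    pattern = r"<reasoning>.*?</reasoning>.*?<answer>.*?</answer>"
    return [0.5 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]
```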

Performance Metrics (at step 400)

  • Correctness Reward: ~0.65-0.70 (improving)
  • Format Rewards:
    • Strict format: ~0.40
    • Soft format: ~0.45
    • Integer format: ~0.45
  • Total Reward: ~2.0 of a 2.5 maximum (stable)

Input Format

The model expects inputs in the following format:

[Question text]

Output Format

The model generates responses in the following structure:

<reasoning>
Step-by-step mathematical reasoning
</reasoning>
<answer>
Numerical answer
</answer>
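
A minimal usage sketch with transformers follows; the chat-template call and generation settings are assumptions rather than settings published with this checkpoint:

```python
# Minimal inference sketch; generation settings are illustrative defaults.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "manavg/granite-grpo-gsm8k-40pct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

question = ("Natalia sold clips to 48 of her friends in April, and then she sold "
            "half as many clips in May. How many clips did Natalia sell altogether "
            "in April and May?")
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512)
text = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)

# Pull the final numerical answer out of the <answer> tag.
match = re.search(r"<answer>\s*(.*?)\s*</answer>", text, re.DOTALL)
print(match.group(1) if match else text)
```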

Limitations

  • Training is incomplete (40% of planned steps)
  • Experimental research model
  • Performance may vary on complex mathematical problems
  • Limited to grade-school level mathematics
  • May occasionally produce incorrect reasoning despite correct answers

Intended Use

  • Mathematical problem-solving assistance
  • Educational support for grade-school math
  • Research in mathematical reasoning capabilities of language models

Training Infrastructure

  • Framework: DeepSpeed ZeRO-3 (an example config is sketched below)
  • Hardware: 7 GPUs
  • Mixed Precision: bfloat16
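
The ZeRO-3 configuration file is not published with this card; a minimal config consistent with the settings above might look like this sketch:

```python
# Illustrative DeepSpeed ZeRO-3 config matching the settings listed above.
import json

ds_config = {
    "bf16": {"enabled": True},              # bfloat16 mixed precision
    "zero_optimization": {
        "stage": 3,                         # shard params, gradients, optimizer state
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "train_micro_batch_size_per_gpu": 1,    # per-device batch size from the card
    "gradient_accumulation_steps": 8,
}

with open("ds_zero3_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```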

License

Apache 2.0

Citation

If you use this model, please cite:

@software{granite-grpo-gsm8k,
  author = {Your Name},
  title = {Granite GRPO Mathematical Reasoning Model},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/manavg/granite-grpo-gsm8k-40pct}
}

Acknowledgments

  • IBM for the base Granite model
  • OpenAI for the GSM8K dataset