Commit bbae435 · verified · krutrim-admin committed · 1 parent: a06fe4f

Extended the En benchmarks table

Files changed (1): README.md (+15 -14)
README.md CHANGED
@@ -73,20 +73,21 @@ After fine-tuning, the model underwent Direct Preference Optimization (DPO) with
 
 ### English/Code/Math Benchmarks
 
-| Dataset             | Mistral-NeMo-12B-Base | Krutrim-1 | Mistral-NeMo-12B-Instruct | Krutrim-2-Instruct-0131 |
-|---------------------|-----------------------|-----------|---------------------------|-------------------------|
-| HellaSwag           | 83%                   | 73%       | 82%                       | 83%                     |
-| Winogrande          | 73%                   | 67%       | 74%                       | 77%                     |
-| CommonSenseQA       | 62%                   | 39%       | 70%                       | 74%                     |
-| MMLU                | 69%                   | 44%       | 68%                       | 63%                     |
-| OpenBookQA          | 48%                   | 44%       | 46%                       | 49%                     |
-| TriviaQA            | 75%                   | 52%       | 72%                       | 62%                     |
-| NaturalQuestions    | 32%                   | 19%       | 28%                       | 26%                     |
-| TruthfulQA          | 48%                   | 38%       | 54%                       | 59%                     |
-| GSM8K               | 17%                   | 09%       | 74%                       | 71%                     |
-| ARC_Challenge       | 58%                   | 42%       | 59%                       | 60%                     |
-| ARC_Easy            | 82%                   | 70%       | 80%                       | 82%                     |
-| HumanEval (pass@10) | 32%                   | 00%       | 23%                       | 80%                     |
+| Benchmark                         | Krutrim-1 7B | MN-12B-Instruct | Krutrim-2 12B | llama-3.3-70B      | Gemini-1.5 Flash | GPT-4o         |
+|-----------------------------------|--------------|-----------------|---------------|--------------------|------------------|----------------|
+| Hellaswag (0-shot) - Accuracy     | 0.74         | 0.82            | 0.83          | 0.95               | 0.87 (10-shot)   | 0.95 (10-shot) |
+| Winogrande (0-shot) - Accuracy    | 0.67         | 0.74            | 0.77          | 0.85 (5-shot)      | -                | 0.88 (5-shot)  |
+| OpenBookQA (0-shot) - Accuracy    | 0.45         | 0.46            | 0.49          | -                  | -                | -              |
+| CommonSenseQA (0-shot) - Accuracy | 0.74         | 0.70            | 0.74          | -                  | -                | 0.85           |
+| TruthfulQA (0-shot) - Accuracy    | 0.49         | 0.54            | 0.59          | -                  | -                | 0.59           |
+| MMLU (5-shot) - Accuracy          | 0.47         | 0.68            | 0.63          | 0.82               | 0.79             | 0.86           |
+| TriviaQA (5-shot) - EM            | 0.44         | 0.72            | 0.62          | -                  | -                | -              |
+| NaturalQuestions (5-shot) - EM    | 0.15         | 0.28            | 0.26          | -                  | -                | -              |
+| GSM8K (0-shot) - EM               | 0.07         | 0.74            | 0.71          | 0.93 (8-shot, CoT) | 0.86 (11-shot)   | 0.89           |
+| ARC_Challenge (0-shot) - Accuracy | 0.48         | 0.59            | 0.60          | 0.93 (25-shot)     | -                | 0.50           |
+| ARC_Easy (0-shot) - Accuracy      | 0.73         | 0.80            | 0.82          | -                  | -                | -              |
+| HumanEval - Pass@10               | 0.00         | 0.23            | 0.80          | 0.88               | 0.74 (0-shot)    | 0.90           |
+| IF_Eval (0-shot) - Accuracy       | 0.16         | -               | 0.56          | 0.92               | -                | 0.84           |
 
-
 ### Indic Benchmarks
 
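A note for readers scanning the HumanEval row: Pass@10 is conventionally computed with the unbiased pass@k estimator from the Codex paper (Chen et al., 2021). The model card does not say which harness or sample counts produced the numbers above, so the Python sketch below only illustrates the standard metric under assumed values (200 samples for one task, 37 of them passing); it is not the evaluation code behind this table.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: completions sampled for one problem
    c: completions that pass the problem's unit tests
    k: evaluation budget (k=10 for the Pass@10 row above)
    """
    # comb(n - c, k) counts size-k subsets containing no passing
    # completion; it is 0 when n - c < k, making the estimate 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts for a single HumanEval problem.
print(f"pass@10 = {pass_at_k(n=200, c=37, k=10):.3f}")
```

The benchmark-level score is the mean of this per-problem estimate over all 164 HumanEval problems. The Accuracy and EM (exact match) rows are plain proportions of correct answers and need no such correction.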