Extended the En benchmarks table
README.md (changed)
@@ -73,20 +73,21 @@ After fine-tuning, the model underwent Direct Preference Optimization (DPO) with
 
 ### English/Code/Math Benchmarks
 
-[previous table removed: 14 lines; only the row labels Winogrande, GSM8K, ARC_Challenge, ARC_Easy, and HumanEval are recoverable from this capture]
+| Benchmark | Krutrim-1 7B | MN-12B-Instruct | Krutrim-2 12B | llama-3.3-70B | Gemini-1.5 Flash | GPT-4o |
+|-------------------------------------------|--------------|-----------------|---------------|----------------------|------------------------|-----------------------|
+| Hellaswag (0-shot) - Accuracy | 0.74 | 0.82 | 0.83 | 0.95 | 0.87 (10-shot) | 0.95 (10-shot) |
+| Winogrande (0-shot) - Accuracy | 0.67 | 0.74 | 0.77 | 0.85 (5-shot) | - | 0.88 (5-shot) |
+| OpenBookQA (0-shot) - Accuracy | 0.45 | 0.46 | 0.49 | - | - | - |
+| CommonSenseQA (0-shot) - Accuracy | 0.74 | 0.70 | 0.74 | - | - | 0.85 |
+| TruthfulQA (0-shot) - Accuracy | 0.49 | 0.54 | 0.59 | - | - | 0.59 |
+| MMLU (5-shot) - Accuracy | 0.47 | 0.68 | 0.63 | 0.82 | 0.79 | 0.86 |
+| TriviaQA (5-shot) - EM | 0.44 | 0.72 | 0.62 | - | - | - |
+| NaturalQuestions (5-shot) - EM | 0.15 | 0.28 | 0.26 | - | - | - |
+| GSM8K (0-shot) - EM | 0.07 | 0.74 | 0.71 | 0.93 (8-shot, CoT) | 0.86 (11-shot) | 0.89 |
+| ARC_Challenge (0-shot) - Accuracy | 0.48 | 0.59 | 0.60 | 0.93 (25-shot) | - | 0.50 |
+| ARC_Easy (0-shot) - Accuracy | 0.73 | 0.80 | 0.82 | - | - | - |
+| HumanEval - Pass@10 | 0.00 | 0.23 | 0.80 | 0.88 | 0.74 (0-shot) | 0.90 |
+| IF_Eval (0-shot) - Accuracy | 0.16 | - | 0.56 | 0.92 | - | 0.84 |
 
 ### Indic Benchmarks
 
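A note on the metrics in the new table: "Accuracy" is the fraction of questions answered correctly, "EM" is exact match against the reference answer, and HumanEval's Pass@10 is the probability that at least one of 10 sampled completions passes the problem's unit tests. The sketch below shows the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021); it is illustrative only and not necessarily the exact evaluation script behind the numbers above.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total completions sampled for one problem
    c: number of those completions that pass the unit tests
    k: sampling budget (k=10 for the Pass@10 column)
    """
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a pass.
        return 1.0
    # 1 - P(all k drawn completions fail) = 1 - C(n-c, k) / C(n, k)
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Per-problem estimates are averaged over HumanEval's 164 problems.
print(pass_at_k(n=20, c=5, k=10))  # ~0.9837: 5/20 correct already makes pass@10 high
```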