Commit bbae435 · verified · krutrim-admin committed · 1 parent: a06fe4f

Extended the En benchmarks table

Files changed (1): README.md (+15 -14)
README.md CHANGED
@@ -73,20 +73,21 @@ After fine-tuning, the model underwent Direct Preference Optimization (DPO) with
 
 ### English/Code/Math Benchmarks
 
-| Dataset             | Mistral-NeMo-12B-Base | Krutrim-1 | Mistral-NeMo-12B-Instruct | Krutrim-2-Instruct-0131 |
-|---------------------|-----------------------|-----------|---------------------------|-------------------------|
-| HellaSwag           | 83%                   | 73%       | 82%                       | 83%                     |
-| Winogrande          | 73%                   | 67%       | 74%                       | 77%                     |
-| CommonSenseQA       | 62%                   | 39%       | 70%                       | 74%                     |
-| MMLU                | 69%                   | 44%       | 68%                       | 63%                     |
-| OpenBookQA          | 48%                   | 44%       | 46%                       | 49%                     |
-| TriviaQA            | 75%                   | 52%       | 72%                       | 62%                     |
-| NaturalQuestions    | 32%                   | 19%       | 28%                       | 26%                     |
-| TruthfulQA          | 48%                   | 38%       | 54%                       | 59%                     |
-| GSM8K               | 17%                   | 09%       | 74%                       | 71%                     |
-| ARC_Challenge       | 58%                   | 42%       | 59%                       | 60%                     |
-| ARC_Easy            | 82%                   | 70%       | 80%                       | 82%                     |
-| HumanEval (pass@10) | 32%                   | 00%       | 23%                       | 80%                     |
+| Benchmark                         | Krutrim-1 7B | MN-12B-Instruct | Krutrim-2 12B | llama-3.3-70B      | Gemini-1.5 Flash | GPT-4o         |
+|-----------------------------------|--------------|-----------------|---------------|--------------------|------------------|----------------|
+| Hellaswag (0-shot) - Accuracy     | 0.74         | 0.82            | 0.83          | 0.95               | 0.87 (10-shot)   | 0.95 (10-shot) |
+| Winogrande (0-shot) - Accuracy    | 0.67         | 0.74            | 0.77          | 0.85 (5-shot)      | -                | 0.88 (5-shot)  |
+| OpenBookQA (0-shot) - Accuracy    | 0.45         | 0.46            | 0.49          | -                  | -                | -              |
+| CommonSenseQA (0-shot) - Accuracy | 0.74         | 0.70            | 0.74          | -                  | -                | 0.85           |
+| TruthfulQA (0-shot) - Accuracy    | 0.49         | 0.54            | 0.59          | -                  | -                | 0.59           |
+| MMLU (5-shot) - Accuracy          | 0.47         | 0.68            | 0.63          | 0.82               | 0.79             | 0.86           |
+| TriviaQA (5-shot) - EM            | 0.44         | 0.72            | 0.62          | -                  | -                | -              |
+| NaturalQuestions (5-shot) - EM    | 0.15         | 0.28            | 0.26          | -                  | -                | -              |
+| GSM8K (0-shot) - EM               | 0.07         | 0.74            | 0.71          | 0.93 (8-shot, CoT) | 0.86 (11-shot)   | 0.89           |
+| ARC_Challenge (0-shot) - Accuracy | 0.48         | 0.59            | 0.60          | 0.93 (25-shot)     | -                | 0.50           |
+| ARC_Easy (0-shot) - Accuracy      | 0.73         | 0.80            | 0.82          | -                  | -                | -              |
+| HumanEval - Pass@10               | 0.00         | 0.23            | 0.80          | 0.88               | 0.74 (0-shot)    | 0.90           |
+| IF_Eval (0-shot) - Accuracy       | 0.16         | -               | 0.56          | 0.92               | -                | 0.84           |
 
-
 ### Indic Benchmarks
 
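A note for readers scanning the HumanEval row: Pass@10 is conventionally computed with the unbiased pass@k estimator from the Codex paper (Chen et al., 2021). The model card does not say which harness or sample counts produced the numbers above, so the Python sketch below only illustrates the standard metric under assumed values (200 samples for one task, 37 of them passing); it is not the evaluation code behind this table.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: completions sampled for one problem
    c: completions that pass the problem's unit tests
    k: evaluation budget (k=10 for the Pass@10 row above)
    """
    # comb(n - c, k) counts size-k subsets containing no passing
    # completion; it is 0 when n - c < k, making the estimate 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts for a single HumanEval problem.
print(f"pass@10 = {pass_at_k(n=200, c=37, k=10):.3f}")
```

The benchmark-level score is the mean of this per-problem estimate over all 164 HumanEval problems. The Accuracy and EM (exact match) rows are plain proportions of correct answers and need no such correction.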