Updated inference code and eval result.
README.md CHANGED
Old version (lines removed in this commit are prefixed with -):

@@ -72,25 +72,25 @@ After fine-tuning, the model underwent Direct Preference Optimization (DPO) to e

### English/Code/Math Benchmarks

- | Benchmark | Krutrim-1
- |
- | Hellaswag (0-shot) - Accuracy | 0.74 | 0.82 |
- | Winogrande (0-shot) - Accuracy | 0.67 | 0.74 |
- | OpenBookQA (0-shot) - Accuracy | 0.45 | 0.46 |
- | CommonSenseQA (0-shot) - Accuracy | 0.74 | 0.70 |
- | TruthfulQA (0-shot) - Accuracy | 0.49 | 0.54 |
- | MMLU (5-shot) - Accuracy | 0.47 | 0.68 |
- | TriviaQA (5-shot) - EM | 0.44 | 0.72 |
- | NaturalQuestions (5-shot) - EM | 0.15 | 0.28 |
- | GSM8K (0-shot) - EM | 0.07 | 0.74 |
- | ARC_Challenge (0-shot) - Accuracy | 0.48 | 0.59 |
- | ARC_Easy (0-shot) - Accuracy | 0.73 | 0.80 |
- | HumanEval - Pass@10 | 0.00 | 0.23 |
- | IF_Eval (0-shot) - Accuracy | 0.16 | 0.46

### Indic Benchmarks

- | Benchmark | Metric | Krutrim-1
|--------------------------------------------|------------|--------------|----------------|--------------|--------------|--------------|----------------|--------|
| IndicSentiment (0-shot) | Accuracy | 0.65 | 0.70 | 0.95 | 0.05 | 0.96 | 0.99 | 0.98 |
| IndicCOPA (0-shot) | Accuracy | 0.51 | 0.58 | 0.80 | 0.48 | 0.83 | 0.88 | 0.91 |

@@ -107,7 +107,7 @@ After fine-tuning, the model underwent Direct Preference Optimization (DPO) to e

### BharatBench
The existing Indic benchmarks are not natively in Indian languages; rather, they are translations of existing English benchmarks. They do not sufficiently capture the linguistic nuances of Indian languages and aspects of Indian culture. To address this, Krutrim released BharatBench - a natively Indic benchmark that encompasses the linguistic and cultural diversity of the Indic region, ensuring that the evaluations are relevant and representative of real-world use cases in India.

- | Benchmark | Metric | Krutrim-1
|-------------------------------------|------------|--------------|-----------------|---------------|------------------------|------------------------|---------------------|---------------------|--------|
| Indian Cultural Context (0-shot) | Bert Score | 0.86 | 0.56 | 0.88 | 0.87 | 0.88 | 0.87 | 0.87 | 0.89 |
| Grammar Correction (5-shot) | Bert Score | 0.96 | 0.94 | 0.98 | 0.95 | 0.98 | 0.96 | 0.96 | 0.97 |

@@ -145,16 +145,10 @@ inputs.pop("token_type_ids", None)

```
outputs = model.generate(
    **inputs,
    max_length=4096,
-    temperature=0.3
-    top_k=50,
-    top_p=0.9,
-    repetition_penalty=1.2,
-    num_return_sequences=1,
-    do_sample=True,
-    eos_token_id=2,
)
```
Note: The provided chat template (the model's default) helps generate the best responses by structuring conversations optimally for the model.
We recommend using `temperature=0.3` for the best performance.

New version (lines added in this commit are prefixed with +):

### English/Code/Math Benchmarks

+ | Benchmark | Krutrim-1-7B | MN-12B-Instruct | Krutrim-2-12B | llama-3.3-70B | Gemini-1.5 Flash | GPT-4o |
+ |-------------------------------------------|--------------|----------------|--------------------|----------------------|------------------------|-----------------------|
+ | Hellaswag (0-shot) - Accuracy | 0.74 | 0.82 | 0.83 | 0.95 | 0.87 (10-shot) | 0.95 (10-shot) |
+ | Winogrande (0-shot) - Accuracy | 0.67 | 0.74 | 0.77 | 0.85 (5-shot) | - | 0.88 (5-shot) |
+ | OpenBookQA (0-shot) - Accuracy | 0.45 | 0.46 | 0.49 | - | - | - |
+ | CommonSenseQA (0-shot) - Accuracy | 0.74 | 0.70 | 0.74 | - | - | 0.85 |
+ | TruthfulQA (0-shot) - Accuracy | 0.49 | 0.54 | 0.59 | - | - | 0.59 |
+ | MMLU (5-shot) - Accuracy | 0.47 | 0.68 | 0.63 | 0.82 | 0.79 | 0.86 |
+ | TriviaQA (5-shot) - EM | 0.44 | 0.72 | 0.62 | - | - | - |
+ | NaturalQuestions (5-shot) - EM | 0.15 | 0.28 | 0.26 | - | - | - |
+ | GSM8K (0-shot) - EM | 0.07 | 0.74 | 0.71 | 0.93 (8-shot, CoT) | 0.86 (11-shot) | 0.89 |
+ | ARC_Challenge (0-shot) - Accuracy | 0.48 | 0.59 | 0.60 | 0.93 (25-shot) | - | 0.50 |
+ | ARC_Easy (0-shot) - Accuracy | 0.73 | 0.80 | 0.82 | - | - | - |
+ | HumanEval - Pass@10 | 0.00 | 0.23 | 0.80 | 0.88 | 0.74 (0-shot) | 0.90 |
+ | IF_Eval (0-shot) - Accuracy | 0.16 | 0.46 | 0.56 | 0.92 | - | 0.84 |
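
A quick note on the "HumanEval - Pass@10" row above: Pass@k is usually reported with the unbiased estimator of Chen et al. (2021); whether that exact estimator produced the numbers in this table is an assumption, but a minimal sketch of the computation looks like this:

```python
# Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).
# Shown only to clarify the "Pass@10" metric above; the evaluation harness
# actually used for this table is not specified in the README.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: completions sampled per problem, c: completions that pass, k: budget."""
    if n - c < k:
        return 1.0  # every k-subset is guaranteed to contain a passing completion
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 20 samples per problem, 5 of them pass the tests -> estimated pass@10
print(round(pass_at_k(n=20, c=5, k=10), 3))
```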

### Indic Benchmarks

+ | Benchmark | Metric | Krutrim-1-7B | MN-12B-Instruct | Krutrim-2-12B | llama-3.1-8B | llama-3.3-70B | Gemini-1.5 Flash | GPT-4o |
|--------------------------------------------|------------|--------------|----------------|--------------|--------------|--------------|----------------|--------|
| IndicSentiment (0-shot) | Accuracy | 0.65 | 0.70 | 0.95 | 0.05 | 0.96 | 0.99 | 0.98 |
| IndicCOPA (0-shot) | Accuracy | 0.51 | 0.58 | 0.80 | 0.48 | 0.83 | 0.88 | 0.91 |

### BharatBench
The existing Indic benchmarks are not natively in Indian languages; rather, they are translations of existing English benchmarks. They do not sufficiently capture the linguistic nuances of Indian languages and aspects of Indian culture. To address this, Krutrim released BharatBench - a natively Indic benchmark that encompasses the linguistic and cultural diversity of the Indic region, ensuring that the evaluations are relevant and representative of real-world use cases in India.

+ | Benchmark | Metric | Krutrim-1-7B | MN-12B-Instruct | Krutrim-2-12B | llama-3.1-8B-Instruct | llama-3.1-70B-Instruct | Gemma-2-9B-Instruct | Gemma-2-27B-Instruct | GPT-4o |
|-------------------------------------|------------|--------------|-----------------|---------------|------------------------|------------------------|---------------------|---------------------|--------|
| Indian Cultural Context (0-shot) | Bert Score | 0.86 | 0.56 | 0.88 | 0.87 | 0.88 | 0.87 | 0.87 | 0.89 |
| Grammar Correction (5-shot) | Bert Score | 0.96 | 0.94 | 0.98 | 0.95 | 0.98 | 0.96 | 0.96 | 0.97 |
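
The BharatBench rows above use Bert Score as the metric. Purely for orientation, here is a minimal sketch of scoring one prediction against one reference with the `evaluate` library's BERTScore metric; the sample strings and the `lang="hi"` choice are illustrative assumptions, not BharatBench's actual evaluation pipeline.

```python
# Illustrative only: BERTScore between a model answer and a gold reference,
# as in the "Bert Score" rows above. Texts and language code are made-up examples.
import evaluate

bertscore = evaluate.load("bertscore")

predictions = ["दिवाली रोशनी का त्योहार है।"]  # hypothetical model output
references = ["दिवाली को रोशनी का त्योहार कहा जाता है।"]  # hypothetical gold reference

results = bertscore.compute(predictions=predictions, references=references, lang="hi")
print(results["f1"])  # one F1 value per prediction/reference pair
```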

@@ -145,16 +145,10 @@ inputs.pop("token_type_ids", None)

```
outputs = model.generate(
    **inputs,
    max_length=4096,
+    temperature=0.3
)

+ response = tokenizer.decode(outputs[0])
```
Note: The provided chat template (the model's default) helps generate the best responses by structuring conversations optimally for the model.
We recommend using `temperature=0.3` for the best performance.
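
Putting the updated snippet together, below is a self-contained sketch of the recommended inference path. The repository id, the example prompt, and the explicit `do_sample=True` flag are assumptions added for illustration (in `transformers`, `temperature` only has an effect when sampling is enabled); the generation call otherwise mirrors the diff above.

```python
# A minimal end-to-end sketch of the updated inference code.
# Assumptions: the model id below is a placeholder for this repository's id,
# and do_sample=True is added so that temperature=0.3 actually takes effect.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "krutrim-ai-labs/Krutrim-2-instruct"  # placeholder / assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Structure the conversation with the default chat template, as recommended above.
messages = [{"role": "user", "content": "What is the capital of India?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
inputs.pop("token_type_ids", None)  # as in the snippet shown in this diff

outputs = model.generate(
    **inputs,
    max_length=4096,
    temperature=0.3,  # recommended setting
    do_sample=True,   # assumption: required for temperature to matter
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```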