Updated inference code and eval result.
README.md CHANGED
Old version (lines removed in this commit are prefixed with -):

@@ -72,25 +72,25 @@ After fine-tuning, the model underwent Direct Preference Optimization (DPO) to e

### English/Code/Math Benchmarks

- | Benchmark | Krutrim-1
- |
- | Hellaswag (0-shot) - Accuracy | 0.74 | 0.82 |
- | Winogrande (0-shot) - Accuracy | 0.67 | 0.74 |
- | OpenBookQA (0-shot) - Accuracy | 0.45 | 0.46 |
- | CommonSenseQA (0-shot) - Accuracy | 0.74 | 0.70 |
- | TruthfulQA (0-shot) - Accuracy | 0.49 | 0.54 |
- | MMLU (5-shot) - Accuracy | 0.47 | 0.68 |
- | TriviaQA (5-shot) - EM | 0.44 | 0.72 |
- | NaturalQuestions (5-shot) - EM | 0.15 | 0.28 |
- | GSM8K (0-shot) - EM | 0.07 | 0.74 |
- | ARC_Challenge (0-shot) - Accuracy | 0.48 | 0.59 |
- | ARC_Easy (0-shot) - Accuracy | 0.73 | 0.80 |
- | HumanEval - Pass@10 | 0.00 | 0.23 |
- | IF_Eval (0-shot) - Accuracy | 0.16 | 0.46

### Indic Benchmarks

- | Benchmark | Metric | Krutrim-1
|--------------------------------------------|------------|--------------|----------------|--------------|--------------|--------------|----------------|--------|
| IndicSentiment (0-shot) | Accuracy | 0.65 | 0.70 | 0.95 | 0.05 | 0.96 | 0.99 | 0.98 |
| IndicCOPA (0-shot) | Accuracy | 0.51 | 0.58 | 0.80 | 0.48 | 0.83 | 0.88 | 0.91 |

@@ -107,7 +107,7 @@ After fine-tuning, the model underwent Direct Preference Optimization (DPO) to e

### BharatBench
The existing Indic benchmarks are not natively in Indian languages; rather, they are translations of existing English benchmarks. They do not sufficiently capture the linguistic nuances of Indian languages and aspects of Indian culture. To address this, Krutrim released BharatBench - a natively Indic benchmark that encompasses the linguistic and cultural diversity of the Indic region, ensuring that the evaluations are relevant and representative of real-world use cases in India.

- | Benchmark | Metric | Krutrim-1
|-------------------------------------|------------|--------------|-----------------|---------------|------------------------|------------------------|---------------------|---------------------|--------|
| Indian Cultural Context (0-shot) | Bert Score | 0.86 | 0.56 | 0.88 | 0.87 | 0.88 | 0.87 | 0.87 | 0.89 |
| Grammar Correction (5-shot) | Bert Score | 0.96 | 0.94 | 0.98 | 0.95 | 0.98 | 0.96 | 0.96 | 0.97 |

@@ -145,16 +145,10 @@ inputs.pop("token_type_ids", None)

```
outputs = model.generate(
    **inputs,
    max_length=4096,
-    temperature=0.3
-    top_k=50,
-    top_p=0.9,
-    repetition_penalty=1.2,
-    num_return_sequences=1,
-    do_sample=True,
-    eos_token_id=2,
)
```
Note: The provided chat template (the model's default) helps generate the best responses by structuring conversations optimally for the model.
We recommend using `temperature=0.3` for the best performance.

New version (lines added in this commit are prefixed with +):

### English/Code/Math Benchmarks

+ | Benchmark | Krutrim-1-7B | MN-12B-Instruct | Krutrim-2-12B | llama-3.3-70B | Gemini-1.5 Flash | GPT-4o |
+ |-------------------------------------------|--------------|----------------|--------------------|----------------------|------------------------|-----------------------|
+ | Hellaswag (0-shot) - Accuracy | 0.74 | 0.82 | 0.83 | 0.95 | 0.87 (10-shot) | 0.95 (10-shot) |
+ | Winogrande (0-shot) - Accuracy | 0.67 | 0.74 | 0.77 | 0.85 (5-shot) | - | 0.88 (5-shot) |
+ | OpenBookQA (0-shot) - Accuracy | 0.45 | 0.46 | 0.49 | - | - | - |
+ | CommonSenseQA (0-shot) - Accuracy | 0.74 | 0.70 | 0.74 | - | - | 0.85 |
+ | TruthfulQA (0-shot) - Accuracy | 0.49 | 0.54 | 0.59 | - | - | 0.59 |
+ | MMLU (5-shot) - Accuracy | 0.47 | 0.68 | 0.63 | 0.82 | 0.79 | 0.86 |
+ | TriviaQA (5-shot) - EM | 0.44 | 0.72 | 0.62 | - | - | - |
+ | NaturalQuestions (5-shot) - EM | 0.15 | 0.28 | 0.26 | - | - | - |
+ | GSM8K (0-shot) - EM | 0.07 | 0.74 | 0.71 | 0.93 (8-shot, CoT) | 0.86 (11-shot) | 0.89 |
+ | ARC_Challenge (0-shot) - Accuracy | 0.48 | 0.59 | 0.60 | 0.93 (25-shot) | - | 0.50 |
+ | ARC_Easy (0-shot) - Accuracy | 0.73 | 0.80 | 0.82 | - | - | - |
+ | HumanEval - Pass@10 | 0.00 | 0.23 | 0.80 | 0.88 | 0.74 (0-shot) | 0.90 |
+ | IF_Eval (0-shot) - Accuracy | 0.16 | 0.46 | 0.56 | 0.92 | - | 0.84 |
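
A quick note on the "HumanEval - Pass@10" row above: Pass@k is usually reported with the unbiased estimator of Chen et al. (2021); whether that exact estimator produced the numbers in this table is an assumption, but a minimal sketch of the computation looks like this:

```python
# Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).
# Shown only to clarify the "Pass@10" metric above; the evaluation harness
# actually used for this table is not specified in the README.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: completions sampled per problem, c: completions that pass, k: budget."""
    if n - c < k:
        return 1.0  # every k-subset is guaranteed to contain a passing completion
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 20 samples per problem, 5 of them pass the tests -> estimated pass@10
print(round(pass_at_k(n=20, c=5, k=10), 3))
```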

### Indic Benchmarks

+ | Benchmark | Metric | Krutrim-1-7B | MN-12B-Instruct | Krutrim-2-12B | llama-3.1-8B | llama-3.3-70B | Gemini-1.5 Flash | GPT-4o |
|--------------------------------------------|------------|--------------|----------------|--------------|--------------|--------------|----------------|--------|
| IndicSentiment (0-shot) | Accuracy | 0.65 | 0.70 | 0.95 | 0.05 | 0.96 | 0.99 | 0.98 |
| IndicCOPA (0-shot) | Accuracy | 0.51 | 0.58 | 0.80 | 0.48 | 0.83 | 0.88 | 0.91 |

### BharatBench
The existing Indic benchmarks are not natively in Indian languages; rather, they are translations of existing English benchmarks. They do not sufficiently capture the linguistic nuances of Indian languages and aspects of Indian culture. To address this, Krutrim released BharatBench - a natively Indic benchmark that encompasses the linguistic and cultural diversity of the Indic region, ensuring that the evaluations are relevant and representative of real-world use cases in India.

+ | Benchmark | Metric | Krutrim-1-7B | MN-12B-Instruct | Krutrim-2-12B | llama-3.1-8B-Instruct | llama-3.1-70B-Instruct | Gemma-2-9B-Instruct | Gemma-2-27B-Instruct | GPT-4o |
|-------------------------------------|------------|--------------|-----------------|---------------|------------------------|------------------------|---------------------|---------------------|--------|
| Indian Cultural Context (0-shot) | Bert Score | 0.86 | 0.56 | 0.88 | 0.87 | 0.88 | 0.87 | 0.87 | 0.89 |
| Grammar Correction (5-shot) | Bert Score | 0.96 | 0.94 | 0.98 | 0.95 | 0.98 | 0.96 | 0.96 | 0.97 |
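
The BharatBench rows above use Bert Score as the metric. Purely for orientation, here is a minimal sketch of scoring one prediction against one reference with the `evaluate` library's BERTScore metric; the sample strings and the `lang="hi"` choice are illustrative assumptions, not BharatBench's actual evaluation pipeline.

```python
# Illustrative only: BERTScore between a model answer and a gold reference,
# as in the "Bert Score" rows above. Texts and language code are made-up examples.
import evaluate

bertscore = evaluate.load("bertscore")

predictions = ["दिवाली रोशनी का त्योहार है।"]  # hypothetical model output
references = ["दिवाली को रोशनी का त्योहार कहा जाता है।"]  # hypothetical gold reference

results = bertscore.compute(predictions=predictions, references=references, lang="hi")
print(results["f1"])  # one F1 value per prediction/reference pair
```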

@@ -145,16 +145,10 @@ inputs.pop("token_type_ids", None)

```
outputs = model.generate(
    **inputs,
    max_length=4096,
+    temperature=0.3
)

+ response = tokenizer.decode(outputs[0])
```
Note: The provided chat template (the model's default) helps generate the best responses by structuring conversations optimally for the model.
We recommend using `temperature=0.3` for the best performance.
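
Putting the updated snippet together, below is a self-contained sketch of the recommended inference path. The repository id, the example prompt, and the explicit `do_sample=True` flag are assumptions added for illustration (in `transformers`, `temperature` only has an effect when sampling is enabled); the generation call otherwise mirrors the diff above.

```python
# A minimal end-to-end sketch of the updated inference code.
# Assumptions: the model id below is a placeholder for this repository's id,
# and do_sample=True is added so that temperature=0.3 actually takes effect.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "krutrim-ai-labs/Krutrim-2-instruct"  # placeholder / assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Structure the conversation with the default chat template, as recommended above.
messages = [{"role": "user", "content": "What is the capital of India?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
inputs.pop("token_type_ids", None)  # as in the snippet shown in this diff

outputs = model.generate(
    **inputs,
    max_length=4096,
    temperature=0.3,  # recommended setting
    do_sample=True,   # assumption: required for temperature to matter
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```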