krutrim-admin committed (verified) · Commit 35bcb04 · 1 parent: fb2ce88

Updated inference code and eval results.

Files changed (1)
  1. README.md +19 -25
README.md CHANGED
@@ -72,25 +72,25 @@ After fine-tuning, the model underwent Direct Preference Optimization (DPO) to e
 
 ### English/Code/Math Benchmarks
 
-| Benchmark | Krutrim-1 7B | MN-12B-Instruct|Krutrim-2-base | Krutrim-2-instruct | llama-3.3-70B | Gemini-1.5 Flash | GPT-4o |
-|-------------------------------------------|--------------|----------------|----------------|--------------------|----------------------|------------------------|-----------------------|
-| Hellaswag (0-shot) - Accuracy | 0.74 | 0.82 | 0.80 | 0.83 | 0.95 | 0.87 (10-shot) | 0.95 (10-shot) |
-| Winogrande (0-shot) - Accuracy | 0.67 | 0.74 | 0.73 | 0.77 | 0.85 (5-shot) | - | 0.88 (5-shot) |
-| OpenBookQA (0-shot) - Accuracy | 0.45 | 0.46 | 0.47 | 0.49 | - | - | - |
-| CommonSenseQA (0-shot) - Accuracy | 0.74 | 0.70 | 0.66 | 0.74 | - | - | 0.85 |
-| TruthfulQA (0-shot) - Accuracy | 0.49 | 0.54 | 0.48 | 0.59 | - | - | 0.59 |
-| MMLU (5-shot) - Accuracy | 0.47 | 0.68 | 0.64 | 0.63 | 0.82 | 0.79 | 0.86 |
-| TriviaQA (5-shot) - EM | 0.44 | 0.72 | 0.66 | 0.62 | - | - | - |
-| NaturalQuestions (5-shot) - EM | 0.15 | 0.28 | 0.27 | 0.26 | - | - | - |
-| GSM8K (0-shot) - EM | 0.07 | 0.74 | 0.55 | 0.71 | 0.93 (8-shot, CoT) | 0.86 (11-shot) | 0.89 |
-| ARC_Challenge (0-shot) - Accuracy | 0.48 | 0.59 | 0.55 | 0.60 | 0.93 (25-shot) | - | 0.50 |
-| ARC_Easy (0-shot) - Accuracy | 0.73 | 0.80 | 0.79 | 0.82 | - | - | - |
-| HumanEval - Pass@10 | 0.00 | 0.23 | 0.59 | 0.80 | 0.88 | 0.74 (0-shot) | 0.90 |
-| IF_Eval (0-shot) - Accuracy | 0.16 | 0.46 | - | 0.56 | 0.92 | - | 0.84 |
+| Benchmark | Krutrim-1-7B | MN-12B-Instruct | Krutrim-2-12B | llama-3.3-70B | Gemini-1.5 Flash | GPT-4o |
+|-------------------------------------------|--------------|----------------|--------------------|----------------------|------------------------|-----------------------|
+| Hellaswag (0-shot) - Accuracy | 0.74 | 0.82 | 0.83 | 0.95 | 0.87 (10-shot) | 0.95 (10-shot) |
+| Winogrande (0-shot) - Accuracy | 0.67 | 0.74 | 0.77 | 0.85 (5-shot) | - | 0.88 (5-shot) |
+| OpenBookQA (0-shot) - Accuracy | 0.45 | 0.46 | 0.49 | - | - | - |
+| CommonSenseQA (0-shot) - Accuracy | 0.74 | 0.70 | 0.74 | - | - | 0.85 |
+| TruthfulQA (0-shot) - Accuracy | 0.49 | 0.54 | 0.59 | - | - | 0.59 |
+| MMLU (5-shot) - Accuracy | 0.47 | 0.68 | 0.63 | 0.82 | 0.79 | 0.86 |
+| TriviaQA (5-shot) - EM | 0.44 | 0.72 | 0.62 | - | - | - |
+| NaturalQuestions (5-shot) - EM | 0.15 | 0.28 | 0.26 | - | - | - |
+| GSM8K (0-shot) - EM | 0.07 | 0.74 | 0.71 | 0.93 (8-shot, CoT) | 0.86 (11-shot) | 0.89 |
+| ARC_Challenge (0-shot) - Accuracy | 0.48 | 0.59 | 0.60 | 0.93 (25-shot) | - | 0.50 |
+| ARC_Easy (0-shot) - Accuracy | 0.73 | 0.80 | 0.82 | - | - | - |
+| HumanEval - Pass@10 | 0.00 | 0.23 | 0.80 | 0.88 | 0.74 (0-shot) | 0.90 |
+| IF_Eval (0-shot) - Accuracy | 0.16 | 0.46 | 0.56 | 0.92 | - | 0.84 |
 
 ### Indic Benchmarks
 
-| Benchmark | Metric | Krutrim-1 7B | MN-12B-Instruct | Krutrim-2 12B | llama-3.1-8B | llama-3.3-70B | Gemini-1.5 Flash | GPT-4o |
+| Benchmark | Metric | Krutrim-1-7B | MN-12B-Instruct | Krutrim-2-12B | llama-3.1-8B | llama-3.3-70B | Gemini-1.5 Flash | GPT-4o |
 |--------------------------------------------|------------|--------------|----------------|--------------|--------------|--------------|----------------|--------|
 | IndicSentiment (0-shot) | Accuracy | 0.65 | 0.70 | 0.95 | 0.05 | 0.96 | 0.99 | 0.98 |
 | IndicCOPA (0-shot) | Accuracy | 0.51 | 0.58 | 0.80 | 0.48 | 0.83 | 0.88 | 0.91 |
@@ -107,7 +107,7 @@ After fine-tuning, the model underwent Direct Preference Optimization (DPO) to e
 ### BharatBench
 The existing Indic benchmarks are not natively in Indian languages, rather, they are translations of existing En benchmarks. They do not sufficiently capture the linguistic nuances of Indian languages and aspects of Indian culture. Towards that Krutrim released BharatBench - a natively Indic benchmark that encompasses the linguistic and cultural diversity of the Indic region, ensuring that the evaluations are relevant and representative of real-world use cases in India.
 
-| Benchmark | Metric | Krutrim-1 7B | MN-12B-Instruct | Krutrim-2 12B | llama-3.1-8B-Instruct | llama-3.1-70B-Instruct | Gemma-2-9B-Instruct | Gemma-2-27B-Instruct | GPT-4o |
+| Benchmark | Metric | Krutrim-1-7B | MN-12B-Instruct | Krutrim-2-12B | llama-3.1-8B-Instruct | llama-3.1-70B-Instruct | Gemma-2-9B-Instruct | Gemma-2-27B-Instruct | GPT-4o |
 |-------------------------------------|------------|--------------|-----------------|---------------|------------------------|------------------------|---------------------|---------------------|--------|
 | Indian Cultural Context (0-shot) | Bert Score | 0.86 | 0.56 | 0.88 | 0.87 | 0.88 | 0.87 | 0.87 | 0.89 |
 | Grammar Correction (5-shot) | Bert Score | 0.96 | 0.94 | 0.98 | 0.95 | 0.98 | 0.96 | 0.96 | 0.97 |
@@ -145,16 +145,10 @@ inputs.pop("token_type_ids", None)
 outputs = model.generate(
     **inputs,
     max_length=4096,
-    temperature=0.3,
-    top_k=50,
-    top_p=0.9,
-    repetition_penalty=1.2,
-    num_return_sequences=1,
-    do_sample=True,
-    eos_token_id=2,
+    temperature=0.3
 )
 
-response_list = [tokenizer.decode(output).split(prompt)[1] for output in outputs]
+response = tokenizer.decode(outputs[0])
 ```
 Note: The provided chat template, which is the default chat template, helps generate the best response by structuring conversations optimally for the model.
 We recommend using `temperature=0.3` for the best performance
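
The diff above only touches the tail of the README's inference snippet; the earlier setup (loading the model and tokenizer and building `inputs` with the chat template) sits outside the hunk. Below is a minimal, self-contained sketch of how the updated snippet is typically wired together with `transformers`. The model repo ID, the example prompt, and the `do_sample=True` flag are illustrative assumptions, not part of this commit.

```python
# Minimal end-to-end sketch of the updated inference flow.
# Assumption: the repo ID and prompt below are illustrative, not taken from this commit.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "krutrim-ai-labs/Krutrim-2-instruct"  # assumed model repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Use the tokenizer's default chat template, as the README recommends.
messages = [{"role": "user", "content": "Write one sentence about Diwali."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)  # mirrors the README's snippet

outputs = model.generate(
    **inputs,
    max_length=4096,
    temperature=0.3,  # recommended setting; only takes effect when sampling
    do_sample=True,   # assumption: enable sampling so the temperature is applied
)

response = tokenizer.decode(outputs[0])
print(response)
```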