krutrim-admin committed (verified)
Commit fb9bab0 · Parent: 0cc38b7

removed gemma-9b

Files changed (1)
  1. README.md +10 -10
README.md CHANGED
@@ -40,7 +40,7 @@ The model delivers best-in-class performance across Indic tasks and a promising
  - Matches or exceeds performance of models much larger (x6) on multilingual Indic generation tasks including creative writing, summarization, and translation;
  - Stronger Indian cultural context relevance - scored the highest in manual evaluation with multiple models in an anonymised setting;
  - Delivers top-3 performance on 5 (out of 7) tasks in BharatBench among much larger open source and commercial models.
- - Available in both pre-trained and instruction-tuned versions
+ - Available in instruction-tuned version
 
  ## Model Developer
  - OLA Krutrim Team
@@ -110,15 +110,15 @@ We use the LM Evaluation Harness to evaluate our model on the En benchmarks task
  ### BharatBench
  The existing Indic benchmarks are not natively in Indian languages; rather, they are translations of existing En benchmarks. They do not sufficiently capture the linguistic nuances of Indian languages and aspects of Indian culture. To address this, Krutrim released BharatBench - a natively Indic benchmark that encompasses the linguistic and cultural diversity of the Indic region, ensuring that the evaluations are relevant and representative of real-world use cases in India.
 
- | Benchmark | Metric | Krutrim-1-7B | MN-12B-Instruct | Krutrim-2-12B | llama-3.1-8B-Instruct | llama-3.1-70B-Instruct | Gemma-2-9B-Instruct | Gemma-2-27B-Instruct | GPT-4o |
- |-----------|--------|--------------|-----------------|---------------|-----------------------|------------------------|---------------------|----------------------|--------|
- | Indian Cultural Context (0-shot) | BERTScore | 0.86 | 0.56 | 0.88 | 0.87 | 0.88 | 0.87 | 0.87 | 0.89 |
- | Grammar Correction (5-shot) | BERTScore | 0.96 | 0.94 | 0.98 | 0.95 | 0.98 | 0.96 | 0.96 | 0.97 |
- | Multi Turn (0-shot) | BERTScore | 0.88 | 0.87 | 0.91 | 0.88 | 0.90 | 0.89 | 0.89 | 0.92 |
- | Multi Turn Comprehension (0-shot) | BERTScore | 0.90 | 0.89 | 0.92 | 0.92 | 0.93 | 0.91 | 0.91 | 0.94 |
- | Multi Turn Translation (0-shot) | BERTScore | 0.85 | 0.87 | 0.92 | 0.89 | 0.91 | 0.90 | 0.91 | 0.92 |
- | Text Classification (5-shot) | Accuracy | 0.61 | 0.71 | 0.76 | 0.72 | 0.88 | 0.82 | 0.86 | 0.89 |
- | Named Entity Recognition (5-shot) | Accuracy | 0.31 | 0.51 | 0.53 | 0.55 | 0.61 | 0.61 | 0.65 | 0.65 |
+ | Benchmark | Metric | Krutrim-1-7B | MN-12B-Instruct | Krutrim-2-12B | llama-3.1-8B-Instruct | llama-3.1-70B-Instruct | Gemma-2-27B-Instruct | GPT-4o |
+ |-----------|--------|--------------|-----------------|---------------|-----------------------|------------------------|----------------------|--------|
+ | Indian Cultural Context (0-shot) | BERTScore | 0.86 | 0.56 | 0.88 | 0.87 | 0.88 | 0.87 | 0.89 |
+ | Grammar Correction (5-shot) | BERTScore | 0.96 | 0.94 | 0.98 | 0.95 | 0.98 | 0.96 | 0.97 |
+ | Multi Turn (0-shot) | BERTScore | 0.88 | 0.87 | 0.91 | 0.88 | 0.90 | 0.89 | 0.92 |
+ | Multi Turn Comprehension (0-shot) | BERTScore | 0.90 | 0.89 | 0.92 | 0.92 | 0.93 | 0.91 | 0.94 |
+ | Multi Turn Translation (0-shot) | BERTScore | 0.85 | 0.87 | 0.92 | 0.89 | 0.91 | 0.91 | 0.92 |
+ | Text Classification (5-shot) | Accuracy | 0.61 | 0.71 | 0.76 | 0.72 | 0.88 | 0.86 | 0.89 |
+ | Named Entity Recognition (5-shot) | Accuracy | 0.31 | 0.51 | 0.53 | 0.55 | 0.61 | 0.65 | 0.65 |
 
  ### Qualitative Results
  Below are the results from manual evaluation of prompt-response pairs across languages and task categories. Scores are between 1 and 5 (higher is better). Model names were anonymised during the evaluation.
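The unchanged context above notes that the README's En benchmark numbers come from the LM Evaluation Harness. As a minimal sketch of what such a run can look like via EleutherAI's lm-evaluation-harness Python API, assuming a checkpoint id and task list that are purely illustrative (this commit does not show the actual evaluation config):

```python
# Sketch only: scoring a Hugging Face checkpoint with EleutherAI's
# lm-evaluation-harness (pip install lm-eval). The repo id and the
# task list are assumptions for illustration, not Krutrim's config.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # the harness's built-in transformers backend
    model_args="pretrained=krutrim-ai-labs/Krutrim-2-instruct",  # hypothetical id
    tasks=["hellaswag", "arc_easy"],  # illustrative En benchmark tasks
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task metric dict
```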
 
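The BharatBench rows above report BERTScore between a model's generation and a reference answer. A minimal sketch of computing such a score with the open-source `bert-score` package, assuming a hypothetical Hindi candidate/reference pair (the commit does not include Krutrim's actual scoring code):

```python
# Sketch only: BERTScore via the open-source `bert-score` package
# (pip install bert-score); not Krutrim's evaluation pipeline.
from bert_score import score

# Hypothetical candidate/reference pair for a Hindi generation task.
candidates = ["दिवाली रोशनी का त्योहार है।"]
references = ["दिवाली को रोशनी का त्योहार कहा जाता है।"]

# lang="hi" picks the package's default multilingual backbone model.
P, R, F1 = score(candidates, references, lang="hi")
print(f"BERTScore F1: {F1.mean().item():.2f}")
```

With `lang="hi"` the package falls back to its default multilingual backbone; a specific encoder can be chosen instead via the `model_type` argument.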