Update src/about.py
src/about.py CHANGED (+2 -1)
@@ -64,6 +64,7 @@ We use our own framework to evaluate the models on the following benchmarks (TO
- <a href="https://arxiv.org/abs/2012.06154" target="_blank"> ParsiNLU MCQA </a> (0-shot) - a series of multiple-choice questions in the domains of *literature*, *math & logic*, and *common knowledge*.
- <a href="https://arxiv.org/abs/2012.06154" target="_blank"> ParsiNLU NLI </a> (max[0,3,5,10]-shot) - a 3-way classification task to determine whether a hypothesis sentence entails, contradicts, or is neutral with respect to a given premise sentence.
- <a href="https://arxiv.org/abs/2012.06154" target="_blank"> ParsiNLU QQP </a> (max[0,2,5,10]-shot) - the task of deciding whether two given questions are paraphrases of each other.
+ - <a href="https://huggingface.co/datasets/MatinaLLM/persian_arc" target="_blank"> Persian ARC </a> (0-shot) - the <a href="https://huggingface.co/datasets/allenai/ai2_arc" target="_blank"> ARC </a> dataset translated to Persian using GPT-4o.

For all these evaluations, a higher score is a better score.

@@ -71,7 +72,7 @@ We use the given *test* subset (for those benchmarks that also have *train* and

These benchmarks are picked for now, but several other benchmarks will be added later to help us perform a more thorough examination of models.

- The
+ The benchmarks ParsiNLU NLI and ParsiNLU QQP are evaluated in several few-shot settings, and the maximum score is returned as the final evaluation.
We argue that this is indeed a fair evaluation scheme, since many lightweight models (around 7B parameters or fewer) can have poor in-context learning with long-context prompts and thus perform better
with fewer shots (or have a small knowledge capacity and perform poorly in zero-shot). We do not wish to hold this against the model, so we measure performance in different settings and take the maximum score achieved.
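To make the "max over shot settings" scheme described in the diff concrete, here is a minimal, illustrative Python sketch. It is not the leaderboard's actual code: `SHOT_SETTINGS`, `run_benchmark`, `max_shot_score`, and `evaluate_model` are hypothetical names standing in for whatever the evaluation framework really exposes; only the shot counts come from the text above.

```python
# Illustrative sketch (assumption, not the leaderboard's real implementation) of the
# "evaluate under each few-shot setting, report the maximum" scheme described above.
from typing import Callable, Iterable

# Shot settings as stated in the benchmark list: NLI uses max[0,3,5,10]-shot,
# QQP uses max[0,2,5,10]-shot, MCQA and Persian ARC are 0-shot only.
SHOT_SETTINGS = {
    "parsinlu_nli": [0, 3, 5, 10],
    "parsinlu_qqp": [0, 2, 5, 10],
    "parsinlu_mcqa": [0],
    "persian_arc": [0],
}


def max_shot_score(
    benchmark: str,
    shots: Iterable[int],
    run_benchmark: Callable[[str, int], float],
) -> float:
    """Evaluate one benchmark under each shot count and keep the best score."""
    return max(run_benchmark(benchmark, n_shot) for n_shot in shots)


def evaluate_model(run_benchmark: Callable[[str, int], float]) -> dict[str, float]:
    """Final per-benchmark scores: the maximum over all configured shot settings."""
    return {
        benchmark: max_shot_score(benchmark, shots, run_benchmark)
        for benchmark, shots in SHOT_SETTINGS.items()
    }
```

For example, if a model scored 0.52, 0.61, 0.58, and 0.55 on ParsiNLU NLI at 0, 3, 5, and 10 shots respectively, the figure reported under this scheme would be 0.61.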