gardarjuto committed on
Commit
67a665c
·
1 Parent(s): 7fcf611

add benchmark descriptions and links to About page

Browse files
Files changed (1) hide show
  1. src/about.py +36 -10
src/about.py CHANGED
@@ -12,13 +12,13 @@ class Task:
12
  # ---------------------------------------------------
13
  class Tasks(Enum):
14
  # task_key in the json file, metric_key in the json file, name to display in the leaderboard
15
- task0 = Task("icelandic_winogrande_stringmatch", "exact_match,get-answer", "Winogrande")
16
  task1 = Task("icelandic_sentences_ged_stringmatch", "exact_match,get-answer", "GED")
17
  task2 = Task("icelandic_inflection_easy", "json_metric,get-answer", "Inflection (common)")
18
  task3 = Task("icelandic_inflection_medium", "json_metric,get-answer", "Inflection (uncommon)")
19
  task4 = Task("icelandic_inflection_hard", "json_metric,get-answer", "Inflection (rare)")
20
- task5 = Task("icelandic_belebele", "exact_match,get-answer", "Belebele")
21
- task6 = Task("icelandic_arc_challenge", "exact_match,get-answer", "ARC Challenge")
22
 
23
  NUM_FEWSHOT = 0 # Change with your few shot
24
  # ---------------------------------------------------
@@ -35,10 +35,39 @@ Intro text
35
 
36
  # Which evaluations are you running? how can people reproduce what you have?
37
  LLM_BENCHMARKS_TEXT = f"""
38
- ## How it works
39
-
40
- ## Reproducibility
41
- To reproduce our results, here is the commands you can run:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
42
 
43
  """
44
 
@@ -72,6 +101,3 @@ Make sure you have followed the above steps first.
72
  If everything is done, check you can launch the EleutherAIHarness on your model locally, using the above command without modifications (you can add `--limit` to limit the number of examples per task).
73
  """
74
 
75
- CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
76
- CITATION_BUTTON_TEXT = r"""
77
- """
 
12
  # ---------------------------------------------------
13
  class Tasks(Enum):
14
  # task_key in the json file, metric_key in the json file, name to display in the leaderboard
15
+ task0 = Task("icelandic_winogrande_stringmatch", "exact_match,get-answer", "WinoGrande-IS")
16
  task1 = Task("icelandic_sentences_ged_stringmatch", "exact_match,get-answer", "GED")
17
  task2 = Task("icelandic_inflection_easy", "json_metric,get-answer", "Inflection (common)")
18
  task3 = Task("icelandic_inflection_medium", "json_metric,get-answer", "Inflection (uncommon)")
19
  task4 = Task("icelandic_inflection_hard", "json_metric,get-answer", "Inflection (rare)")
20
+ task5 = Task("icelandic_belebele", "exact_match,get-answer", "Belebele (IS)")
21
+ task6 = Task("icelandic_arc_challenge", "exact_match,get-answer", "ARC-Challenge-IS")
22
 
23
  NUM_FEWSHOT = 0 # Change with your few shot
24
  # ---------------------------------------------------
 
35
 
36
  # Which evaluations are you running? how can people reproduce what you have?
37
  LLM_BENCHMARKS_TEXT = f"""
38
+ ## Benchmark tasks
39
+ The Icelandic LLM leaderboard evaluates models on several tasks. All of them are set up as generation tasks, where the model's output is compared to the expected output.
40
+ This means that models that have not been instruction fine-tuned might perform poorly on these tasks.
41
+
42
+ The following tasks are evaluated:
43
+
44
+ ### WinoGrande-IS
45
+ The Icelandic WinoGrande task is a human-translated and localized version of the ~1000 test set examples in the WinoGrande task in English.
46
+ Each example consists of a sentence with a blank, and two answer choices for the blank. The task is to choose the correct answer choice using coreference resolution.
47
+ The benchmark is designed to test the model's ability to use knowledge and common sense reasoning in Icelandic.
48
+ The Icelandic WinoGrande dataset is described in more detail in the IceBERT paper (https://aclanthology.org/2022.lrec-1.464.pdf).
49
+ - Link to dataset: https://huggingface.co/datasets/mideind/icelandic-winogrande
50
+
51
+ ### GED
52
+ This is a benchmark for binary sentence-level Icelandic grammatical error detection, adapted from the Icelandic Error Corpus (IEC) and contains 200 examples.
53
+ Each example consists of a sentence that may contain one or more grammatical errors, and the task is to predict whether the sentence contains an error.
54
+ - Link to dataset: https://huggingface.co/datasets/mideind/icelandic-sentences-gec
55
+
56
+ ### Inflection benchmarks
57
+ The inflection benchmarks test the model's ability to generate inflected forms of Icelandic adjective-noun pairs. They are divided into three levels of difficulty by
58
+ commonness: common (100 examples), uncommon (100 examples), and rare (100 examples). The model gets a point for an example if it generates error-free JSON with the
59
+ correct inflected forms in all cases, singular and plural.
60
+ - Link to dataset (common): https://huggingface.co/datasets/mideind/icelandic-inflection-easy
61
+ - Link to dataset (uncommon): https://huggingface.co/datasets/mideind/icelandic-inflection-medium
62
+ - Link to dataset (rare): https://huggingface.co/datasets/mideind/icelandic-inflection-hard
63
+
64
+ ### Belebele (IS)
65
+ This is the Icelandic subset (900 examples) of the Belebele benchmark, a multiple-choice reading comprehension task. The task is to answer questions about a given passage.
66
+ - Link to dataset: https://huggingface.co/datasets/facebook/belebele
67
+
68
+ ### ARC-Challenge-IS
69
+ A machine-translated version of the ARC-Challenge multiple-choice question-answering dataset. For this benchmark, we use the test set which contains 1.23k examples.
70
+ - Link to dataset: https://huggingface.co/datasets/mideind/icelandic-arc-challenge
71
 
72
  """
73
 
 
101
  If everything is done, check you can launch the EleutherAIHarness on your model locally, using the above command without modifications (you can add `--limit` to limit the number of examples per task).
102
  """
103