Commit
·
80793c6
1
Parent(s):
117d89c
add submission instructions to about page
Browse files- src/about.py +5 -2
src/about.py
CHANGED
@@ -12,7 +12,7 @@ class Task:
|
|
12 |
# ---------------------------------------------------
|
13 |
class Tasks(Enum):
|
14 |
# task_key in the json file, metric_key in the json file, name to display in the leaderboard
|
15 |
-
task0 = Task("icelandic_winogrande_stringmatch", "exact_match,get-answer", "WinoGrande-IS")
|
16 |
task1 = Task("icelandic_sentences_ged_stringmatch", "exact_match,get-answer", "GED")
|
17 |
task2 = Task("icelandic_inflection_easy", "json_metric,get-answer", "Inflection (common)")
|
18 |
task3 = Task("icelandic_inflection_medium", "json_metric,get-answer", "Inflection (uncommon)")
|
@@ -33,6 +33,9 @@ INTRODUCTION_TEXT = """
|
|
33 |
|
34 |
# Which evaluations are you running? How can people reproduce what you have?
|
35 |
LLM_BENCHMARKS_TEXT = f"""
|
|
|
|
|
|
|
36 |
## Benchmark tasks
|
37 |
The Icelandic LLM leaderboard evaluates models on several tasks. All of them are set up as generation tasks, where the model's output is compared to the expected output.
|
38 |
This means that models that have not been instruction fine-tuned might perform poorly on these tasks.
|
@@ -42,7 +45,7 @@ The following tasks are evaluated:
|
|
42 |
### WinoGrande-IS
|
43 |
The Icelandic WinoGrande task is a human-translated and localized version of the ~1000 test set examples in the WinoGrande task in English.
|
44 |
Each example consists of a sentence with a blank, and two answer choices for the blank. The task is to choose the correct answer choice using coreference resolution.
|
45 |
-
The benchmark is designed to test the model's ability to use knowledge and common-sense reasoning in Icelandic.
|
46 |
The Icelandic WinoGrande dataset is described in more detail in the IceBERT paper (https://aclanthology.org/2022.lrec-1.464.pdf).
|
47 |
- Link to dataset: https://huggingface.co/datasets/mideind/icelandic-winogrande
|
48 |
|
|
|
12 |
# ---------------------------------------------------
|
13 |
class Tasks(Enum):
|
14 |
# task_key in the json file, metric_key in the json file, name to display in the leaderboard
|
15 |
+
task0 = Task("icelandic_winogrande_stringmatch", "exact_match,get-answer", "WinoGrande-IS (3-shot)")
|
16 |
task1 = Task("icelandic_sentences_ged_stringmatch", "exact_match,get-answer", "GED")
|
17 |
task2 = Task("icelandic_inflection_easy", "json_metric,get-answer", "Inflection (common)")
|
18 |
task3 = Task("icelandic_inflection_medium", "json_metric,get-answer", "Inflection (uncommon)")
|
|
|
33 |
|
34 |
# Which evaluations are you running? How can people reproduce what you have?
|
35 |
LLM_BENCHMARKS_TEXT = f"""
|
36 |
+
## New submissions
|
37 |
+
Do you want your model to be included on the leaderboard? Open a discussion on this repository with the details of your model and we will get back to you.
|
38 |
+
|
39 |
## Benchmark tasks
|
40 |
The Icelandic LLM leaderboard evaluates models on several tasks. All of them are set up as generation tasks, where the model's output is compared to the expected output.
|
41 |
This means that models that have not been instruction fine-tuned might perform poorly on these tasks.
|
|
|
45 |
### WinoGrande-IS
|
46 |
The Icelandic WinoGrande task is a human-translated and localized version of the ~1000 test set examples in the WinoGrande task in English.
|
47 |
Each example consists of a sentence with a blank, and two answer choices for the blank. The task is to choose the correct answer choice using coreference resolution.
|
48 |
+
The benchmark is designed to test the model's ability to use knowledge and common-sense reasoning in Icelandic. For this benchmark, we use 3-shot evaluation.
|
49 |
The Icelandic WinoGrande dataset is described in more detail in the IceBERT paper (https://aclanthology.org/2022.lrec-1.464.pdf).
|
50 |
- Link to dataset: https://huggingface.co/datasets/mideind/icelandic-winogrande
|
51 |
|