lukecq committed on
Commit 8ab1a84 · 1 Parent(s): d4c15d2

update about page

Files changed (1)
  1. src/display/about.py +15 -5
src/display/about.py CHANGED
@@ -31,19 +31,29 @@ Also check the [SeaBench leaderboard](https://huggingface.co/spaces/SeaLLMs/SeaB
  # Which evaluations are you running? how can people reproduce what you have?
  LLM_BENCHMARKS_TEXT = f"""
  # About
- Even though large language models (LLMs) have shown impressive performance on various benchmarks for English, their performance on Southeast Asian (SEA) languages is still underexplored. This leaderboard aims to evaluate LLMs on exam-type benchmarks for SEA languages, focusing on world knowledge and reasoning abilities.

  ## Datasets
- The leaderboard evaluates models on the following tasks:
- - **M3Exam**:
- - **MMLU**:

  ## Evaluation Criteria

  ## Results

  ## Reproducibility
- To reproduce our results, here are the commands you can run:

  """
 
  # Which evaluations are you running? how can people reproduce what you have?
  LLM_BENCHMARKS_TEXT = f"""
  # About
+ Even though large language models (LLMs) have shown impressive performance on various benchmarks for English, their performance on Southeast Asian (SEA) languages is still underexplored. This leaderboard aims to evaluate LLMs on exam-type benchmarks for English, Chinese, and SEA languages, focusing on world knowledge and reasoning abilities. The five languages for evaluation are English (en), Chinese (zh), Indonesian (id), Thai (th), and Vietnamese (vi).

  ## Datasets
+ The benchmark data can be found in the [SeaExam dataset](https://huggingface.co/datasets/SeaLLMs/SeaExam). The dataset consists of two tasks:
+ - [**M3Exam**](https://arxiv.org/abs/2306.05179): a benchmark sourced from real and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. We post-process the data for the 5 languages.
+ - [**MMLU**](https://arxiv.org/abs/2009.03300): a test that measures a text model's multitask accuracy in English across 57 tasks. We sample 50 questions from each task and translate them into the other 4 languages with Google Translate.

  ## Evaluation Criteria
+ We evaluate the models with accuracy. The leaderboard is sorted by the average score across the SEA languages (id, th, and vi).
+
+ We use the following settings for evaluation:
+ - **few-shot**: the default setting is few-shot (3-shot). All open-source models are evaluated with 3-shot.
+ - **zero-shot**: the zero-shot setting is also available. Because closed-source models have formatting issues with few-shot prompts, they are evaluated with zero-shot.
+

  ## Results
+ You can find the detailed numerical results in the [SeaExam-results dataset](https://huggingface.co/datasets/SeaLLMs/SeaExam-results) on Hugging Face.

  ## Reproducibility
+ To reproduce our results, use the script in [this repo](https://github.com/DAMO-NLP-SG/SeaExam/tree/main). The script downloads the model and tokenizer, and evaluates the model on the benchmark data:
+ ```bash
+ python scripts/main.py --model $model_name_or_path
+ ```

  """
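The ranking rule added under "Evaluation Criteria" boils down to sorting models by their mean accuracy over id, th, and vi. The snippet below is a minimal sketch of that computation; the model names and scores are placeholder values rather than leaderboard results, and the code is not taken from the SeaExam scripts (the real per-language numbers live in the SeaExam-results dataset linked above).

```python
# Minimal sketch of the leaderboard's sort key: mean accuracy over the SEA
# languages (id, th, vi). Model names and scores are placeholder values.
SEA_LANGS = ("id", "th", "vi")

scores = {
    "model-A": {"en": 0.70, "zh": 0.65, "id": 0.55, "th": 0.50, "vi": 0.52},
    "model-B": {"en": 0.72, "zh": 0.60, "id": 0.58, "th": 0.47, "vi": 0.55},
}

def sea_average(lang_scores: dict) -> float:
    """Average accuracy over id, th, and vi."""
    return sum(lang_scores[lang] for lang in SEA_LANGS) / len(SEA_LANGS)

# Highest SEA average first, matching the sort order described above.
for model in sorted(scores, key=lambda m: sea_average(scores[m]), reverse=True):
    print(f"{model}: SEA avg = {sea_average(scores[model]):.3f}")
```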