update about page
src/display/about.py (+15 -5)
@@ -31,19 +31,29 @@ Also check the [SeaBench leaderboard](https://huggingface.co/spaces/SeaLLMs/SeaB
 # Which evaluations are you running? how can people reproduce what you have?
 LLM_BENCHMARKS_TEXT = f"""
 # About
-Even though large language models (LLMs) have shown impressive performance on various benchmarks for English, their performance on Southeast Asian (SEA) languages is still underexplored. This leaderboard aims to evaluate LLMs on exam-type benchmarks for SEA languages, focusing on world knowledge and reasoning abilities.
+Even though large language models (LLMs) have shown impressive performance on various benchmarks for English, their performance on Southeast Asian (SEA) languages is still underexplored. This leaderboard aims to evaluate LLMs on exam-type benchmarks for English, Chinese, and SEA languages, focusing on world knowledge and reasoning abilities. The five languages for evaluation are English (en), Chinese (zh), Indonesian (id), Thai (th), and Vietnamese (vi).
 
 ## Datasets
-The
-- **M3Exam
-- **MMLU
+The benchmark data can be found in the [SeaExam dataset](https://huggingface.co/datasets/SeaLLMs/SeaExam). The dataset consists of two tasks:
+- [**M3Exam**](https://arxiv.org/abs/2306.05179): a benchmark sourced from real and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. We post-process the data for the 5 languages.
+- [**MMLU**](https://arxiv.org/abs/2009.03300): a test to measure a text model's multitask accuracy in English. The test covers 57 tasks. We sample 50 questions from each task and translate the data into the other 4 languages with Google Translate.
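As a quick sketch of how this benchmark data might be pulled with the Hugging Face `datasets` library (the config name "m3exam" and the "test" split are assumptions for illustration, not taken from the dataset card):

```python
from datasets import load_dataset

# Config and split names below are guesses for illustration only;
# check the SeaExam dataset card for the real ones.
data = load_dataset("SeaLLMs/SeaExam", "m3exam", split="test")

print(len(data))  # number of questions in this subset
print(data[0])    # one multiple-choice item
```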
 
 ## Evaluation Criteria
+We evaluate the models with accuracy scores. The leaderboard is sorted by the average score across the SEA languages (id, th, and vi).
+
+We have the following settings for evaluation:
+- **few-shot**: the default setting is few-shot (3-shot). All open-source models are evaluated with 3-shot.
+- **zero-shot**: the zero-shot setting is also available. Since closed-source models have formatting issues with few-shot prompts, they are evaluated with zero-shot.
+
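To make the sorting rule concrete, here is a minimal sketch; the scores are made up and serve only to show which languages enter the ranking key:

```python
# Made-up accuracy scores per language; only id/th/vi enter the ranking key.
scores = {"en": 0.71, "zh": 0.65, "id": 0.58, "th": 0.52, "vi": 0.55}

sea_langs = ("id", "th", "vi")
sea_avg = sum(scores[lang] for lang in sea_langs) / len(sea_langs)
print(f"SEA average: {sea_avg:.4f}")  # the leaderboard sorts by this value
```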
 
 ## Results
+You can find the detailed numerical results in the SeaExam-results Hugging Face dataset: https://huggingface.co/datasets/SeaLLMs/SeaExam-results
 
 ## Reproducibility
-To reproduce our results,
+To reproduce our results, use the script in [this repo](https://github.com/DAMO-NLP-SG/SeaExam/tree/main). The script will download the model and tokenizer, and evaluate the model on the benchmark data.
+```bash
+python scripts/main.py --model $model_name_or_path
+```
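For example, `python scripts/main.py --model meta-llama/Llama-2-7b-chat-hf` (the model id is purely illustrative; pass any local path or Hugging Face model id the repo supports).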
 
 """