lukecq committed on
Commit 8ab1a84 · 1 Parent(s): d4c15d2

update about page

Files changed (1)
  1. src/display/about.py +15 -5
src/display/about.py CHANGED
@@ -31,19 +31,29 @@ Also check the [SeaBench leaderboard](https://huggingface.co/spaces/SeaLLMs/SeaB
  # Which evaluations are you running? how can people reproduce what you have?
  LLM_BENCHMARKS_TEXT = f"""
  # About
- Even though large language models (LLMs) have shown impressive performance on various benchmarks for English, their performance on Southeast Asian (SEA) languages is still underexplored. This leaderboard aims to evaluate LLMs on exam-type benchmarks for SEA languages, focusing on world knowledge and reasoning abilities.

  ## Datasets
- The leaderboard evaluates models on the following tasks:
- - **M3Exam**:
- - **MMLU**:

  ## Evaluation Criteria

  ## Results

  ## Reproducibility
- To reproduce our results, here are the commands you can run:

  """
 
  # Which evaluations are you running? how can people reproduce what you have?
  LLM_BENCHMARKS_TEXT = f"""
  # About
+ Even though large language models (LLMs) have shown impressive performance on various benchmarks for English, their performance on Southeast Asian (SEA) languages is still underexplored. This leaderboard aims to evaluate LLMs on exam-type benchmarks for English, Chinese, and SEA languages, focusing on world knowledge and reasoning abilities. The five languages for evaluation are English (en), Chinese (zh), Indonesian (id), Thai (th), and Vietnamese (vi).

  ## Datasets
+ The benchmark data can be found in the [SeaExam dataset](https://huggingface.co/datasets/SeaLLMs/SeaExam). The dataset consists of two tasks:
+ - [**M3Exam**](https://arxiv.org/abs/2306.05179): a benchmark sourced from real and official human exam questions for evaluating LLMs in a multilingual, multimodal, and multilevel context. We post-process the data for the 5 languages.
+ - [**MMLU**](https://arxiv.org/abs/2009.03300): a test that measures a text model's multitask accuracy in English across 57 tasks. We sample 50 questions from each task and translate them into the other 4 languages with Google Translate.

  ## Evaluation Criteria
+ We evaluate the models with accuracy. The leaderboard is sorted by the average score across the SEA languages (id, th, and vi).
+
+ We use the following settings for evaluation:
+ - **few-shot**: the default setting is few-shot (3-shot). All open-source models are evaluated with 3-shot.
+ - **zero-shot**: the zero-shot setting is also available. Because closed-source models have formatting issues with few-shot prompts, they are evaluated with zero-shot.
+

  ## Results
+ You can find the detailed numerical results in the [SeaExam-results dataset](https://huggingface.co/datasets/SeaLLMs/SeaExam-results) on Hugging Face.

  ## Reproducibility
+ To reproduce our results, use the script in [this repo](https://github.com/DAMO-NLP-SG/SeaExam/tree/main). The script downloads the model and tokenizer, and evaluates the model on the benchmark data:
+ ```bash
+ python scripts/main.py --model $model_name_or_path
+ ```

  """
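The ranking rule added under "Evaluation Criteria" boils down to sorting models by their mean accuracy over id, th, and vi. The snippet below is a minimal sketch of that computation; the model names and scores are placeholder values rather than leaderboard results, and the code is not taken from the SeaExam scripts (the real per-language numbers live in the SeaExam-results dataset linked above).

```python
# Minimal sketch of the leaderboard's sort key: mean accuracy over the SEA
# languages (id, th, vi). Model names and scores are placeholder values.
SEA_LANGS = ("id", "th", "vi")

scores = {
    "model-A": {"en": 0.70, "zh": 0.65, "id": 0.55, "th": 0.50, "vi": 0.52},
    "model-B": {"en": 0.72, "zh": 0.60, "id": 0.58, "th": 0.47, "vi": 0.55},
}

def sea_average(lang_scores: dict) -> float:
    """Average accuracy over id, th, and vi."""
    return sum(lang_scores[lang] for lang in SEA_LANGS) / len(SEA_LANGS)

# Highest SEA average first, matching the sort order described above.
for model in sorted(scores, key=lambda m: sea_average(scores[m]), reverse=True):
    print(f"{model}: SEA avg = {sea_average(scores[model]):.3f}")
```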