Commit c949828 (verified) by Raincleared · Parent: 19542fc

Upload README.md with huggingface_hub

Files changed (1): README.md (+5 -3)
README.md CHANGED
```diff
@@ -70,11 +70,13 @@ The 13B model is trained on 32 A100 GPUs. The learning rate (LR) is controlled b
 
  - **Code Generation**: We compute the average pass@1 scores on HumanEval (0-shot) and MBPP (3-shot).
 
- - **Commonsense Reasoning**: We report the average 0-shot accuracies on PIQA, SIQA, HellaSwag, WinoGrande, and COPA, previously reported as average 0-shot perplexity (PPL).
+ - **Commonsense Reasoning**: We report the average 0-shot accuracies on PIQA, SIQA, HellaSwag, WinoGrande, and COPA.
 
- - **Reading Comprehension**: We compute the average 0-shot PPL on BoolQ, 0-shot accuracy on LAMBADA and TyDi QA.
+ - **Reading Comprehension**: We compute the average 0-shot accuracy on BoolQ, LAMBADA, and TyDi QA.
 
- - **Other Popular Benchmarks**: We report the average accuracies on GSM8K (8-shot), MMLU (5-shot), Big Bench Hard (BBH) (3-shot), and the average PPL on AGI-Eval (0-shot).
+ - **Other Popular Benchmarks**: We report the average accuracies on GSM8K (8-shot), MMLU (5-shot), Big Bench Hard (BBH) (3-shot), and AGI-Eval (0-shot). Refer to the appendix of the paper for more evaluation details.
+
+ Note: For PIQA, SIQA, HellaSwag, WinoGrande, COPA, BoolQ, LAMBADA, TyDi QA, and AGI-Eval, we obtain the predicted answers based on minimized perplexity. For GSM8K, MMLU, and BBH, the predicted answers are directly generated.
 
  ### Evaluation Results
```
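
For reference, the pass@1 metric named in the code-generation bullet is the standard execution-based metric from the HumanEval paper (Chen et al., 2021). The commit itself does not define it, so the formula below is background only: with n samples drawn per problem and c of them passing the unit tests, the unbiased pass@k estimator is

```latex
% Unbiased pass@k estimator (Chen et al., 2021); pass@1 is the k = 1 case.
% n = samples generated per problem, c = samples that pass the unit tests.
\[
  \text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right],
  \qquad
  \text{pass@}1 \;=\; \mathbb{E}_{\text{problems}}\!\left[\, \frac{c}{n} \,\right].
\]
```

With greedy decoding (n = 1), pass@1 reduces to the fraction of problems whose single generated solution passes.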
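
The note added in this commit says the multiple-choice suites are scored by perplexity: each candidate answer is scored by the model and the most likely (lowest-perplexity) candidate is taken as the prediction. The sketch below is a minimal illustration of that protocol under assumed names; the model, helper functions, and example are placeholders, not the evaluation harness behind the reported numbers.

```python
# Illustrative sketch of perplexity-based answer selection (NOT the
# repository's actual evaluation harness; model name and helpers are
# placeholder assumptions). Score each candidate continuation by the
# perplexity the model assigns to it and pick the lowest-perplexity one.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Token-level perplexity of `text` under the causal LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, transformers returns the mean
        # next-token cross-entropy; exp(loss) is perplexity.
        loss = model(input_ids=ids, labels=ids).loss
    return torch.exp(loss).item()

def predict(prompt: str, choices: list[str]) -> str:
    """Pick the candidate whose prompt+answer sequence has minimal
    perplexity (equivalently, maximal likelihood)."""
    return min(choices, key=lambda c: perplexity(f"{prompt} {c}"))

# PIQA-style binary choice (hypothetical example):
print(predict("To open a jar, you should",
              ["twist the lid counterclockwise.",
               "twist the jar counterclockwise."]))
```

Real harnesses typically score only the answer tokens (length-normalized log-likelihood conditioned on the prompt) rather than the whole sequence; the sketch keeps it simple. For GSM8K, MMLU, and BBH, as the note says, answers are instead generated directly and parsed from the model output.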