Upload README.md with huggingface_hub

| 4 | \\(2e-2\\) | 15,000 | 125.83 |
| 5 | \\(2e-2\\) | 16,000 | 134.22 |

### Evaluation Results

The evaluation results on the above benchmarks demonstrate the advantage of ProSparse, which is the only method achieving both high sparsity and performance comparable to the original Swish-activated LLaMA2. Note that models under all settings are trained with the same number of tokens on the same mixed dataset. Our evaluation is based on the framework [UltraEval](https://github.com/OpenBMB/UltraEval). The evaluation details are as follows:

- **Code Generation**: We compute the average pass@1 scores on HumanEval (0-shot) and MBPP (3-shot); a pass@1 estimation sketch follows this list.

- **Commonsense Reasoning**: We report the average 0-shot accuracies on PIQA, SIQA, HellaSwag, WinoGrande, and COPA.

- **Reading Comprehension**: We compute the average 0-shot accuracies on BoolQ, LAMBADA, and TyDi QA.

- **Other Popular Benchmarks**: We report the average accuracies on GSM8K (8-shot), MMLU (5-shot), Big Bench Hard (BBH) (3-shot), and AGI-Eval (0-shot).
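
Since pass@1 admits more than one estimation convention, here is a minimal sketch of the standard unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021), of which pass@1 is the k = 1 case. The function and the example numbers are illustrative only and are not taken from UltraEval.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: samples generated per problem
    c: samples that pass all unit tests
    k: budget; pass@1 uses k = 1
    """
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    # 1 - C(n - c, k) / C(n, k), computed stably as a running product
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Benchmark-level score: mean of per-problem estimates (illustrative data).
results = [(10, 3), (10, 0), (10, 10)]  # (n, c) per problem
avg_pass1 = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"average pass@1 = {avg_pass1:.4f}")
```

With a single sample per problem (n = 1), the estimator reduces to the plain fraction of problems whose generated program passes all tests.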

**Notes**: For PIQA, SIQA, HellaSwag, WinoGrande, COPA, BoolQ, LAMBADA, TyDi QA, and AGI-Eval, we obtain the predicted answers based on perplexity, choosing the candidate completion to which the model assigns the lowest perplexity (see the sketch below). For GSM8K, MMLU, and BBH, the predicted answers are directly generated.
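
For the perplexity-based benchmarks, the selection rule can be sketched as follows, assuming a Hugging Face `transformers` causal LM. The checkpoint name, prompt format, and per-token length normalization here are assumptions for illustration; UltraEval's exact implementation may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint; substitute the model under evaluation.
MODEL = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
).eval()

@torch.no_grad()
def answer_perplexity(prompt: str, answer: str) -> float:
    """Perplexity of `answer` given `prompt` (lower means more likely).

    Simplification: prompt and prompt+answer are tokenized separately,
    which can differ at the boundary for some tokenizers.
    """
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + answer, return_tensors="pt").input_ids.to(model.device)
    logits = model(ids).logits
    # Logits at position t predict token t+1, so score only the answer tokens.
    logprobs = torch.log_softmax(logits[0, n_prompt - 1 : -1].float(), dim=-1)
    answer_ids = ids[0, n_prompt:]
    token_logp = logprobs[torch.arange(answer_ids.numel()), answer_ids]
    return torch.exp(-token_logp.mean()).item()

def predict(prompt: str, candidates: list[str]) -> str:
    """Choose the candidate completion with the lowest perplexity."""
    return min(candidates, key=lambda a: answer_perplexity(prompt, a))
```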

| Setting | Average<br>Sparsity | Code<br>Generation | Commonsense<br>Reasoning | Reading<br>Comprehension | GSM8K | MMLU | BBH | AGI Eval | Average |
| :-------------------: | :-----------------: | :----------------: | :----------------------: | :----------------------: | :---: | :---: | :---: | :---------: | :-----: |
| Original-7B | - | 16.37 | 69.59 | 61.87 | 12.96 | 44.45 | 32.96 | 27.53 | 37.96 |
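
The Average Sparsity column reports the mean proportion of inactive (zero) entries in the FFN intermediate activations. Below is a minimal sketch of how such a ratio could be collected with forward hooks; the `mlp.act_fn` module suffix and the exact-zero criterion are assumptions tied to ReLU-style activations, not the authors' released measurement code.

```python
import torch

@torch.no_grad()
def average_sparsity(model, dataloader) -> float:
    """Fraction of zero entries in FFN intermediate activations,
    averaged over all hooked layers and all tokens."""
    zeros = total = 0

    def count_zeros(_module, _inputs, output):
        nonlocal zeros, total
        zeros += (output == 0).sum().item()
        total += output.numel()

    # Assumed hook point: the activation function inside each FFN block.
    handles = [
        m.register_forward_hook(count_zeros)
        for name, m in model.named_modules()
        if name.endswith("mlp.act_fn")
    ]
    model.eval()
    for batch in dataloader:  # batches assumed to carry "input_ids"
        model(batch["input_ids"].to(model.device))
    for h in handles:
        h.remove()
    return zeros / total
```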