Update README.md
AlphaMonarch-laser is a DPO fine-tune of [mlabonne/NeuralMonarch-7B](https://huggingface.co/mlabonne/NeuralMonarch-7B/), trained on the [argilla/OpenHermes2.5-dpo-binarized-alpha](https://huggingface.co/datasets/argilla/OpenHermes2.5-dpo-binarized-alpha) preference dataset using LaserQLoRA, and it achieves better performance than [mlabonne/AlphaMonarch-7B](https://huggingface.co/mlabonne/AlphaMonarch-7B/). We fine-tuned only half of the projections, yet obtained better results than the version released by Maxime Labonne. The model was trained for 1080 steps.
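
To make the "half of the projections" idea concrete, here is a minimal sketch of a QLoRA adapter restricted to a subset of projection modules with `peft`; the module list and all values are illustrative assumptions, not the configuration actually used for this model.

```python
# Hypothetical sketch: a LoRA adapter targeting only a subset of the
# projection layers (instead of all of q/k/v/o/gate/up/down), in the
# spirit of LaserQLoRA. Ranks and dropout are assumed values.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                 # assumed LoRA rank
    lora_alpha=32,        # assumed scaling factor
    lora_dropout=0.05,    # assumed dropout
    bias="none",
    task_type="CAUSAL_LM",
    # Roughly half of the usual Mistral projection set:
    target_modules=["q_proj", "v_proj", "up_proj"],
)
```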
AlphaMonarch-laser ranks first on YALL - [Yet Another LLM Leaderboard](https://huggingface.co/spaces/mlabonne/Yet_Another_LLM_Leaderboard).
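
As a usage illustration (not part of the original card), here is a minimal `transformers` inference sketch; the repo id is a placeholder for wherever the model is hosted.

```python
# Minimal inference sketch; replace the placeholder repo id with the
# actual Hugging Face path of AlphaMonarch-laser.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<org>/AlphaMonarch-laser"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "What is a large language model?"}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```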

## 🏆 Evaluation results

### Nous Benchmark

#### AGIEVAL

| Task | Version | Metric | Value | StdErr |
|--------------------------------|---------|----------|--------|--------|
| agieval_aqua_rat | 0 | acc | 28.35% | 2.83% |
| agieval_aqua_rat | 0 | acc_norm | 26.38% | 2.77% |
| agieval_logiqa_en | 0 | acc | 38.25% | 1.91% |
| agieval_logiqa_en | 0 | acc_norm | 38.10% | 1.90% |
| agieval_lsat_ar | 0 | acc | 23.91% | 2.82% |
| agieval_lsat_ar | 0 | acc_norm | 23.48% | 2.80% |
| agieval_lsat_lr | 0 | acc | 52.75% | 2.21% |
| agieval_lsat_lr | 0 | acc_norm | 53.92% | 2.21% |
| agieval_lsat_rc | 0 | acc | 66.91% | 2.87% |
| agieval_lsat_rc | 0 | acc_norm | 67.29% | 2.87% |
| agieval_sat_en | 0 | acc | 78.64% | 2.86% |
| agieval_sat_en | 0 | acc_norm | 78.64% | 2.86% |
| agieval_sat_en_without_passage | 0 | acc | 45.15% | 3.48% |
| agieval_sat_en_without_passage | 0 | acc_norm | 44.17% | 3.47% |
| agieval_sat_math | 0 | acc | 33.18% | 3.18% |
| agieval_sat_math | 0 | acc_norm | 31.36% | 3.14% |

Average: 28.41%

#### GPT4ALL

| Task | Version | Metric | Value | StdErr |
|---------------|---------|----------|--------|---------|
| arc_challenge | 0 | acc | 66.30% | ± 1.38% |
| | | acc_norm | 68.26% | ± 1.36% |
| arc_easy | 0 | acc | 86.57% | ± 0.70% |
| | | acc_norm | 80.81% | ± 0.81% |
| boolq | 1 | acc | 87.16% | ± 0.59% |
| hellaswag | 0 | acc | 69.60% | ± 0.46% |
| | | acc_norm | 87.45% | ± 0.33% |
| openbookqa | 0 | acc | 39.20% | ± 2.19% |
| | | acc_norm | 49.60% | ± 2.24% |
| piqa | 0 | acc | 83.03% | ± 0.88% |
| | | acc_norm | 84.87% | ± 0.84% |
| winogrande | 0 | acc | 81.06% | ± 1.10% |

Average: 76.98%

#### TRUTHFUL-QA

| Task | Version | Metric | Value | StdErr |
|---------------|---------|--------|--------|---------|
| truthfulqa_mc | 1 | mc1 | 63.04% | ± 1.69% |
| truthfulqa_mc | 1 | mc2 | 78.39% | ± 1.37% |

Average: 70.71%

#### BIGBENCH

| Task | Version | Metric | Value | StdErr |
|---------------------------------------------------|---------|-----------------------|--------|---------|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 60.00% | ± 3.56% |
| bigbench_date_understanding | 0 | multiple_choice_grade | 62.06% | ± 2.53% |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 54.26% | ± 3.11% |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 23.96% | ± 2.26% |
| | | exact_str_match | 0.00% | ± 0.00% |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 32.80% | ± 2.10% |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 23.86% | ± 1.61% |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 59.33% | ± 2.84% |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 58.00% | ± 2.21% |
| bigbench_navigate | 0 | multiple_choice_grade | 56.00% | ± 1.57% |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 69.20% | ± 1.03% |
| bigbench_ruin_names | 0 | multiple_choice_grade | 55.36% | ± 2.35% |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 41.48% | ± 1.56% |
| bigbench_snarks | 0 | multiple_choice_grade | 73.48% | ± 3.29% |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 76.06% | ± 1.36% |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 55.50% | ± 1.57% |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 23.28% | ± 1.20% |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 19.37% | ± 0.94% |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 59.33% | ± 2.84% |

Average: 55.37%
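
These results should be broadly reproducible with EleutherAI's lm-evaluation-harness, which uses the task names shown in the tables above. A hedged sketch of its Python API (v0.4-style; the repo id is a placeholder, and task names can differ across harness versions):

```python
# Hypothetical reproduction sketch with lm-evaluation-harness (v0.4+ API).
# The repo id is a placeholder, not a confirmed path.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=<org>/AlphaMonarch-laser,dtype=bfloat16",
    tasks=["arc_challenge", "winogrande"],  # swap in AGIEval/BigBench tasks as needed
    batch_size=8,
)
print(results["results"])
```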

### OpenLLM Benchmark

| Task | Version | Metric | Value | | Stderr |
|---------------|--------:|----------|------:|---|-------:|
| arc_challenge | 0 | acc | 70.12 | ± | 1.30 |
| | | acc_norm | 73.27 | ± | 1.29 |
| hellaswag | 0 | acc | 71.80 | ± | 0.44 |
| | | acc_norm | 89.20 | ± | 0.30 |
| gsm8k | 0 | acc | 66.77 | ± | 1.20 |
| winogrande | 0 | acc | 84.60 | ± | 1.00 |

Average: 73.5%

#### TruthfulQA

| Task | Version | Metric | Value | | Stderr |
|---------------|--------:|--------|------:|---|-------:|
| truthfulqa_mc | 1 | mc1 | 62.79 | ± | 1.69 |
| | | mc2 | 77.90 | ± | 1.37 |

### Training hyperparameters

The following hyperparameters were used during training:
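
For orientation only, here is a minimal sketch of a generic DPO fine-tuning setup with TRL's `DPOTrainer`; the actual run used the Axolotl configuration below, and apart from the base model, the dataset, and the 1080-step count named above, every value here is an assumption.

```python
# Illustrative only: a generic DPO setup with Hugging Face TRL, not the
# card's actual training recipe (which used Axolotl).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_id = "mlabonne/NeuralMonarch-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Preference data; DPOTrainer expects "prompt"/"chosen"/"rejected" string
# columns, so some preprocessing of this dataset is assumed.
dataset = load_dataset("argilla/OpenHermes2.5-dpo-binarized-alpha", split="train")

trainer = DPOTrainer(
    model,
    ref_model=None,  # TRL builds a frozen reference copy of the model
    args=TrainingArguments(
        output_dir="alphamonarch-laser-dpo",
        max_steps=1080,                   # step count stated in the card
        per_device_train_batch_size=1,    # assumed
        learning_rate=5e-7,               # assumed
    ),
    beta=0.1,  # assumed DPO temperature
    train_dataset=dataset,
    tokenizer=tokenizer,
)
trainer.train()
```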
### 📝 Axolotl Configuration
```yaml