Update README.md
Added interpretation
README.md
CHANGED
@@ -1,3 +1,9 @@
+---
+datasets:
+- DIBT/10k_prompts_ranked
+language:
+- en
+---
Model Card: Uplimit Project 1 part 1
Model Description:
This is a model used to test-run publishing models. It has no real model assessment value.
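The metadata block added above tags DIBT/10k_prompts_ranked as an associated dataset. As a minimal sketch (assuming the dataset ID resolves on the Hugging Face Hub and that it exposes a train split), it can be pulled down with the datasets library:

```python
from datasets import load_dataset

# Load the dataset declared in the model card metadata.
# The "train" split name is an assumption; check the dataset card if it differs.
prompts = load_dataset("DIBT/10k_prompts_ranked", split="train")

print(prompts)      # features and row count
print(prompts[0])   # first ranked-prompt record
```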
@@ -17,5 +23,23 @@ hf (pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float), gen_kwar
|hellaswag| 1|none | 0|acc |↑ |0.2872|± |0.0045|
| | |none | 0|acc_norm|↑ |0.3082|± |0.0046|

-
-
+Interpretation (courtesy of Perplexity.ai):
+
+Accuracy Metrics:
+Standard accuracy: 0.2872 (28.72%)
+Normalized accuracy: 0.3082 (30.82%)
+
+Context:
+The HellaSwag task is a challenging commonsense reasoning benchmark that tests a model's ability to complete sentences or scenarios in a sensible way.
+The task is considered difficult even for larger language models.
+
+Interpretation:
+Baseline Performance: The model achieves an accuracy of 28.72% on the standard HellaSwag task, which is significantly above random guessing (25% for a 4-way multiple-choice task).
+
+Normalized Performance: The normalized accuracy of 30.82% is slightly higher than the standard accuracy, suggesting that the model performs marginally better once the length of each answer choice is accounted for (acc_norm normalizes each candidate's log-likelihood by its length).
+
+Model Size Consideration: Given that Pythia 160M is a relatively small language model (160 million parameters), these results are not unexpected.
+
+Comparative Analysis: While not directly comparable without benchmarks from other models, this performance is likely lower than what larger models (e.g., GPT-3, PaLM) would achieve on the same task.
+
+Learning Progress: As this is an intermediate checkpoint (step 100000), the model's performance could improve with further training.