rshacter committed (verified) · Commit 248e07a · 1 Parent(s): c00127a

Update README.md

Added interpretation

Files changed (1): README.md (+26 -2)
README.md CHANGED
@@ -1,3 +1,9 @@
+ ---
+ datasets:
+ - DIBT/10k_prompts_ranked
+ language:
+ - en
+ ---
  Model Card: Uplimit Project 1 part 1
  Model Description:
  This is a model to test run publishing models. It has no real model assessment value.
@@ -17,5 +23,23 @@ hf (pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float), gen_kwar
  |hellaswag| 1|none | 0|acc |↑ |0.2872|± |0.0045|
  | | |none | 0|acc_norm|↑ |0.3082|± |0.0046|

- How to Use
- To use this model, simply download the checkpoint and load it into your preferred deep learning framework.
+ Interpretation (courtesy of Perplexity.ai):
+
+ Accuracy Metrics:
+ Standard accuracy: 0.2872 (28.72%)
+ Normalized accuracy: 0.3082 (30.82%)
+
+ Context:
+ The HellaSwag task is a challenging commonsense reasoning benchmark that tests a model's ability to complete sentences or scenarios in a sensible way.
+ The task is considered difficult even for larger language models.
+
+ Interpretation:
+ Baseline Performance: The model achieves an accuracy of 28.72% on the standard HellaSwag task, which is significantly above random guessing (25% for a 4-way multiple-choice task).
+
+ Normalized Performance: The normalized accuracy of 30.82% is slightly higher than the standard accuracy, suggesting that the model performs marginally better when accounting for potential biases in the task.
+
+ Model Size Consideration: Given that Pythia 160M is a relatively small language model (160 million parameters), these results are not unexpected.
+
+ Comparative Analysis: While not directly comparable without benchmarks from other models, this performance is likely lower than what larger models (e.g., GPT-3, PaLM) would achieve on the same task.
+
+ Learning Progress: As this is an intermediate checkpoint (step 100000), it is possible that the model's performance could improve with further training.
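
The results table in the diff above carries an lm-evaluation-harness style header (`hf (pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float), ...`). As a non-authoritative sketch of how such a row could be regenerated, assuming EleutherAI's lm-evaluation-harness (`pip install lm-eval`) and its `simple_evaluate` entry point; the keyword names here are assumptions and may differ by harness version:

```python
# Hedged sketch: regenerate a zero-shot HellaSwag row like the one in the diff.
# Assumes EleutherAI's lm-evaluation-harness; argument names may vary by version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m,revision=step100000,dtype=float",
    tasks=["hellaswag"],
    num_fewshot=0,
)

# The harness reports both raw and length-normalized accuracy for HellaSwag.
print(results["results"]["hellaswag"])
```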
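
The removed "How to Use" line told readers to download the checkpoint and load it into a preferred deep learning framework without naming one. A minimal sketch, assuming the Hugging Face transformers library and the public EleutherAI/pythia-160m repository at the step100000 revision referenced in the evaluation header:

```python
# Minimal sketch (assumption: Hugging Face transformers is the chosen framework).
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "EleutherAI/pythia-160m"   # base model evaluated above
revision = "step100000"           # intermediate training checkpoint

tokenizer = AutoTokenizer.from_pretrained(repo, revision=revision)
model = AutoModelForCausalLM.from_pretrained(repo, revision=revision)

# Quick smoke test: greedy-decode a short continuation.
inputs = tokenizer("HellaSwag tests commonsense completion, for example:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```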
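
The "Baseline Performance" point calls 28.72% significantly above random guessing. A quick back-of-the-envelope check using only the numbers reported in the table (accuracy 0.2872, standard error 0.0045, chance level 0.25 for a 4-way multiple-choice task):

```python
# Rough check of "significantly above random guessing" using the reported numbers.
acc = 0.2872     # HellaSwag accuracy from the table
stderr = 0.0045  # reported standard error
chance = 0.25    # 4-way multiple-choice chance level

z = (acc - chance) / stderr
print(f"{z:.1f} reported standard errors above chance")  # roughly 8.3
```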