Update README.md
README.md
CHANGED
---
language:
- en
tags:
- pytorch
- causal-lm
- pythia
license: apache-2.0
datasets:
- EleutherAI/pile
---

# Model Card for Pythia-160M

Pythia-160M is part of a collection of models developed to facilitate
interpretability research ([see repository](https://huggingface.co/EleutherAI/pythia-160m)) and trained on [the Pile](https://pile.eleuther.ai/).
We have evaluated it on HellaSwag using the EleutherAI evaluation harness.

## Model Details

- Developed by: [EleutherAI](http://eleuther.ai)
- Model type: Transformer-based Language Model
- Language: English
- Learn more: [Pythia's GitHub repository](https://github.com/EleutherAI/pythia) for training procedure, config files, and details on how to use. [See the paper](https://arxiv.org/pdf/2304.01373.pdf) for more evals and implementation details.
- Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
- License: Apache 2.0
- Contact: to ask questions about this model, join the [EleutherAI Discord](https://discord.gg/zBGx3azzUn) and post them in `#release-discussion`. Please read the existing *Pythia* documentation before asking about it in the EleutherAI Discord. For general correspondence: [[email protected]](mailto:[email protected]).

<figure>

| Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate | Equivalent Models |
| -----------: | -------------------: | :----: | :-------: | :---: | :--------: | :-------------------: | :--------------------: |
| 160M | 85,056,000 | 12 | 768 | 12 | 2M | 6.0 x 10<sup>-4</sup> | GPT-Neo 125M, OPT-125M |

<figcaption>Engineering details for the <i>Pythia Suite</i>. Deduped and
non-deduped models of a given size have the same hyperparameters. “Equivalent”
models have <b>exactly</b> the same architecture and the same number of
non-embedding parameters.</figcaption>
</figure>
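
The non-embedding parameter count in the table can be reproduced from the layer count and hidden dimension. The sketch below assumes a standard GPT-NeoX-style block (fused QKV projection, biased linear layers, two LayerNorms per block plus a final LayerNorm); it is a back-of-the-envelope check, not an official formula from the Pythia authors.

```python
# Rough parameter accounting for Pythia-160M (assumed parameterization).
layers, d = 12, 768

per_layer = (
    3 * d * d + 3 * d    # fused query/key/value projection (weight + bias)
    + d * d + d          # attention output projection
    + 4 * d * d + 4 * d  # MLP up-projection (d -> 4d)
    + 4 * d * d + d      # MLP down-projection (4d -> d)
    + 4 * d              # two LayerNorms (weight + bias each)
)
total_non_embedding = layers * per_layer + 2 * d  # plus the final LayerNorm

print(f"{total_non_embedding:,}")  # 85,056,000 -- matches the table above
```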

### Model Description

This is the model card of Pythia-160M, evaluated with the EleutherAI evaluation harness.

- **Developed by:** [EleutherAI](http://eleuther.ai)
- **Model type:** Pythia 160M
- **Language(s) (NLP):** EN
- **License:** Apache 2.0

### Model Sources

- **Repository:** https://huggingface.co/EleutherAI/pythia-160m

## Uses and Limitations

### Intended Use

The primary intended use of Pythia is research on the behavior, functionality,
and limitations of large language models. This suite is intended to provide
a controlled setting for performing scientific experiments. We also provide
154 checkpoints per model: initial `step0`, 10 log-spaced checkpoints
`step{1,2,4...512}`, and 143 evenly-spaced checkpoints from `step1000` to
`step143000`. These checkpoints are hosted on Hugging Face as branches. Note
that branch `143000` corresponds exactly to the model checkpoint on the `main`
branch of each model.
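
For example, a specific training checkpoint can be loaded by passing its branch name as `revision` to the Transformers API. A minimal sketch, using the `step…` branch scheme described above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# `step143000` is the final checkpoint and is identical to `main`;
# any other branch, e.g. `step1000`, selects an earlier checkpoint.
checkpoint = "step143000"

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m", revision=checkpoint)
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m", revision=checkpoint)

print(sum(p.numel() for p in model.parameters()))  # total parameters, embeddings included
```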

You may also further fine-tune and adapt Pythia-160M for deployment,
as long as your use is in accordance with the Apache 2.0 license. Pythia
models work with the Hugging Face [Transformers Library](https://huggingface.co/docs/transformers/index).
If you decide to use pre-trained Pythia-160M as a basis for your fine-tuned
model, please conduct your own risk and bias assessment.
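
A minimal generation sketch with the Transformers library follows; the prompt and sampling settings are illustrative only, and this small base model should not be expected to follow instructions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m")

inputs = tokenizer("The Pythia suite was designed to", return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.95)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```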

### Out-of-scope use

The Pythia Suite is **not** intended for deployment. It is not in itself
a product and cannot be used for human-facing interactions. For example,
the model may generate harmful or offensive text. Please evaluate the risks
associated with your particular use case.

Pythia models are English-language only, and are not suitable for translation
or generating text in other languages.

Pythia-160M has not been fine-tuned for downstream contexts in which
language models are commonly deployed, such as writing genre prose
or commercial chatbots. This means Pythia-160M will **not**
respond to a given prompt the way a product like ChatGPT does. This is because,
unlike this model, ChatGPT was fine-tuned using methods such as Reinforcement
Learning from Human Feedback (RLHF) to better “follow” human instructions.

### Limitations and biases

The core functionality of a large language model is to take a string of text
and predict the next token. The token deemed statistically most likely by the
model need not produce the most “accurate” text. Never rely on Pythia-160M to
produce factually accurate output.

This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
known to contain profanity and texts that are lewd or otherwise offensive.
See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
discussion of documented biases with regards to gender, religion, and race.
Pythia-160M may produce socially unacceptable or undesirable text, *even if*
the prompt itself does not include anything explicitly offensive.

If you plan on using text generated through, for example, the Hosted Inference
API, we recommend having a human curate the outputs of this language model
before presenting them to other people. Please inform your audience that the
text was generated by Pythia-160M.

## Training

### Training data

[The Pile](https://pile.eleuther.ai/) is an 825 GiB general-purpose dataset in
English. It was created by EleutherAI specifically for training large language
models. It contains texts from 22 diverse sources, roughly broken down into
five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl),
prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and
miscellaneous (e.g. GitHub, Enron Emails). See [the Pile
paper](https://arxiv.org/abs/2101.00027) for a breakdown of all data sources,
methodology, and a discussion of ethical implications. Consult [the
datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
about the Pile and its component datasets. The Pile can be downloaded from
the [official website](https://pile.eleuther.ai/), or from a [community
mirror](https://the-eye.eu/public/AI/pile/).<br>
The Pile was **not** deduplicated before being used to train Pythia-160M.

### Training procedure

All models were trained on the exact same data, in the exact same order. Each
model saw 299,892,736,000 tokens during training, and 143 checkpoints for each
model are saved every 2,097,152,000 tokens, spaced evenly throughout training,
from `step1000` to `step143000` (which is the same as `main`). In addition, we
also provide frequent early checkpoints: `step0` and `step{1,2,4...512}`.
This corresponds to training for just under 1 epoch on the Pile for
non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.

All *Pythia* models trained for 143,000 steps at a batch size
of 2M (2,097,152 tokens).<br>
See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
procedure, including [how to reproduce
it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).<br>
Pythia uses the same tokenizer as [GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
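
As a quick consistency check on the totals above (assuming exactly 143,000 optimizer steps at 2,097,152 tokens per step, as stated):

```python
tokens_per_step = 2_097_152   # batch size of "2M" tokens
total_steps = 143_000

print(f"{tokens_per_step * total_steps:,}")  # 299,892,736,000 tokens seen during training
print(f"{tokens_per_step * 1_000:,}")        # 2,097,152,000 tokens between saved checkpoints
```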

## Evaluation

This model has been evaluated zero-shot on HellaSwag using the EleutherAI
evaluation harness ([lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)).
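
A sketch of how such a run can be reproduced with the harness's Python API is shown below; the exact harness version and settings behind the numbers in the table are not documented here, so treat these flags as assumptions rather than the configuration actually used.

```python
# Roughly equivalent CLI (lm-evaluation-harness >= 0.4):
#   lm_eval --model hf --model_args pretrained=EleutherAI/pythia-160m --tasks hellaswag --num_fewshot 0
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-160m",
    tasks=["hellaswag"],
    num_fewshot=0,
)

print(results["results"]["hellaswag"])  # acc and acc_norm with stderr
```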

<figure>

| Tasks     | Version | Filter | n-shot | Metric   |   | Value  |   | Stderr |
|-----------|--------:|--------|-------:|----------|---|-------:|---|-------:|
| hellaswag |       1 | none   |      0 | acc      | ↑ | 0.2872 | ± | 0.0045 |
|           |         | none   |      0 | acc_norm | ↑ | 0.3082 | ± | 0.0046 |

<figcaption>Evaluation results.</figcaption>
</figure>