---
language:
- en
tags:
- pytorch
- causal-lm
- pythia
license: apache-2.0
datasets:
- EleutherAI/pile
---

# Model Card for Pythia-160M

Pythia-160M is part of the Pythia suite, a collection of models trained on the Pile and developed to facilitate interpretability research ([see repository](https://huggingface.co/EleutherAI/pythia-160m)). We evaluated it on HellaSwag using the EleutherAI evaluation harness.

<figure>

| Tasks     | Version | Filter | n-shot | Metric   |   |  Value |   | Stderr |
|-----------|--------:|--------|-------:|----------|---|-------:|---|-------:|
| hellaswag |       1 | none   |      0 | acc      | ↑ | 0.2872 | ± | 0.0045 |
|           |         | none   |      0 | acc_norm | ↑ | 0.3082 | ± | 0.0046 |
<figcaption>Evaluation results.</figcaption>
</figure>

## Model Details

- Developed by: [EleutherAI](http://eleuther.ai)
- Model type: Transformer-based Language Model
- Language: English
- Learn more: [Pythia's GitHub repository](https://github.com/EleutherAI/pythia)
  for the training procedure, config files, and usage details.
  [See paper](https://arxiv.org/pdf/2304.01373.pdf) for more evaluations and
  implementation details.
- Library: [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
- License: Apache 2.0
- Contact: to ask questions about this model, join the [EleutherAI
  Discord](https://discord.gg/zBGx3azzUn), and post them in `#release-discussion`.
  Please read the existing *Pythia* documentation before asking about it in the
  EleutherAI Discord. For general correspondence:
  [contact@eleuther.ai](mailto:contact@eleuther.ai).

<figure>

| Pythia model | Non-Embedding Params | Layers | Model Dim | Heads | Batch Size | Learning Rate         | Equivalent Models      |
| -----------: | -------------------: | :----: | :-------: | :---: | :--------: | :-------------------: | :--------------------: |
| 160M         | 85,056,000           | 12     | 768       | 12    | 2M         | 6.0 x 10<sup>-4</sup> | GPT-Neo 125M, OPT-125M |
<figcaption>Engineering details for the <i>Pythia Suite</i>. Deduped and
non-deduped models of a given size have the same hyperparameters. “Equivalent”
models have <b>exactly</b> the same architecture, and the same number of
non-embedding parameters.</figcaption>
</figure>

### Model Description

This model card describes Pythia-160M and reports its results on the EleutherAI evaluation harness.

- **Developed by:** [EleutherAI](http://eleuther.ai)
- **Model type:** Transformer-based language model (Pythia-160M)
- **Language(s) (NLP):** English
- **License:** Apache 2.0

### Model Sources

- **Repository:** https://huggingface.co/EleutherAI/pythia-160m

## Uses and Limitations

### Intended Use

The primary intended use of Pythia is research on the behavior, functionality,
and limitations of large language models. This suite is intended to provide
a controlled setting for performing scientific experiments. We also provide
154 checkpoints per model: initial `step0`, 10 log-spaced checkpoints
`step{1,2,4...512}`, and 143 evenly-spaced checkpoints from `step1000` to
`step143000`. These checkpoints are hosted on Hugging Face as branches. Note
that branch `step143000` corresponds exactly to the model checkpoint on the
`main` branch of each model.
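
The branch names described above follow a regular pattern, so they can be generated rather than typed out. The following is a minimal sketch: the helper name `pythia_checkpoint_branches` is hypothetical, and the only inputs are the step values stated in this card.

```python
def pythia_checkpoint_branches():
    """Return the checkpoint branch names described in this card, in training order."""
    log_spaced = [2**k for k in range(10)]         # 1, 2, 4, ..., 512
    evenly_spaced = range(1000, 143_000 + 1, 1000)  # 1000, 2000, ..., 143000
    steps = [0] + log_spaced + list(evenly_spaced)
    return [f"step{step}" for step in steps]


branches = pythia_checkpoint_branches()
# 1 initial + 10 log-spaced + 143 evenly-spaced checkpoints
print(len(branches), branches[0], branches[-1])
```

Each returned name can be passed as the `revision` argument when loading the model with the Transformers library.
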

You may also further fine-tune and adapt Pythia-160M for deployment,
as long as your use is in accordance with the Apache 2.0 license. Pythia
models work with the Hugging Face [Transformers
Library](https://huggingface.co/docs/transformers/index). If you decide to use
pre-trained Pythia-160M as a basis for your fine-tuned model, please
conduct your own risk and bias assessment.
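
As a starting point for experiments with the Transformers library, the sketch below loads one checkpoint branch and greedily continues a prompt. It assumes `transformers` and `torch` are installed and that the Hub repository ID is `EleutherAI/pythia-160m` (taken from this card's repository link). The helper `generate_text` and its default arguments are illustrative, not part of any library.

```python
MODEL_NAME = "EleutherAI/pythia-160m"  # assumed repo ID, from this card's repository link
DEFAULT_REVISION = "main"              # same weights as the step143000 branch


def generate_text(prompt, revision=DEFAULT_REVISION, max_new_tokens=20):
    """Load Pythia-160M at the given branch and greedily continue `prompt`."""
    # Imported lazily so that defining this helper does not itself
    # require transformers or torch to be installed.
    from transformers import AutoTokenizer, GPTNeoXForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, revision=revision)
    model = GPTNeoXForCausalLM.from_pretrained(MODEL_NAME, revision=revision)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling, for example, `generate_text("Hello, I am", revision="step3000")` downloads the weights for that branch on first use, so it needs network access and some disk space.
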

### Out-of-scope use

The Pythia Suite is **not** intended for deployment. It is not in itself
a product and cannot be used for human-facing interactions. For example,
the model may generate harmful or offensive text. Please evaluate the risks
associated with your particular use case.

Pythia models are English-language only, and are not suitable for translation
or generating text in other languages.

Pythia-160M has not been fine-tuned for downstream contexts in which
language models are commonly deployed, such as writing genre prose
or powering commercial chatbots. This means Pythia-160M will **not**
respond to a given prompt the way a product like ChatGPT does. This is because,
unlike this model, ChatGPT was fine-tuned using methods such as Reinforcement
Learning from Human Feedback (RLHF) to better “follow” human instructions.

### Limitations and biases

The core functionality of a large language model is to take a string of text
and predict the next token. The token the model selects need not produce the
most “accurate” text. Never rely on Pythia-160M to produce factually accurate
output.
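
To make the point about next-token prediction concrete, here is a toy sketch of how a next token is chosen from a score distribution. The prompt, candidate tokens, and logit values are invented for illustration and are not outputs of Pythia-160M. The function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert a {token: logit} mapping into a {token: probability} mapping."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def sample_next_token(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[tok] for tok in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical scores for continuing the prompt "The capital of France is".
toy_logits = {" Paris": 2.0, " Lyon": 1.5, " London": 1.2}
probs = softmax(toy_logits)

greedy_token = max(probs, key=probs.get)                    # most likely token
sampled_token = sample_next_token(probs, random.Random(0))  # may differ from greedy

print(probs)
print(greedy_token, sampled_token)
```

In this toy distribution the factually correct token is the most likely one, yet it holds less than half of the probability mass, so sampling will often select a different token. Likelihood under the model is not the same as factual accuracy.
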

This model was trained on [the Pile](https://pile.eleuther.ai/), a dataset
known to contain profanity and texts that are lewd or otherwise offensive.
See [Section 6 of the Pile paper](https://arxiv.org/abs/2101.00027) for a
discussion of documented biases with regard to gender, religion, and race.
Pythia-160M may produce socially unacceptable or undesirable text, *even if*
the prompt itself does not include anything explicitly offensive.

If you plan on using text generated through, for example, the Hosted Inference
API, we recommend having a human curate the outputs of this language model
before presenting them to other people. Please inform your audience that the
text was generated by Pythia-160M.

## Training

### Training data

[The Pile](https://pile.eleuther.ai/) is an 825 GiB general-purpose dataset in
English. It was created by EleutherAI specifically for training large language
models. It contains texts from 22 diverse sources, roughly broken down into
five categories: academic writing (e.g. arXiv), internet (e.g. CommonCrawl),
prose (e.g. Project Gutenberg), dialogue (e.g. YouTube subtitles), and
miscellaneous (e.g. GitHub, Enron Emails). See [the Pile
paper](https://arxiv.org/abs/2101.00027) for a breakdown of all data sources,
methodology, and a discussion of ethical implications. Consult [the
datasheet](https://arxiv.org/abs/2201.07311) for more detailed documentation
about the Pile and its component datasets. The Pile can be downloaded from
the [official website](https://pile.eleuther.ai/), or from a [community
mirror](https://the-eye.eu/public/AI/pile/).<br>
The Pile was **not** deduplicated before being used to train Pythia-160M.

### Training procedure

All models were trained on the exact same data, in the exact same order. Each
model saw 299,892,736,000 tokens during training. For each model, 143
checkpoints were saved, one every 2,097,152,000 tokens, spaced evenly
throughout training, from `step1000` to `step143000` (which is the same as
`main`). In addition, we provide frequent early checkpoints: `step0` and
`step{1,2,4...512}`.
This corresponds to training for just under 1 epoch on the Pile for
non-deduplicated models, and about 1.5 epochs on the deduplicated Pile.

All *Pythia* models were trained for 143,000 steps at a batch size
of 2M (2,097,152 tokens).<br>
See [GitHub](https://github.com/EleutherAI/pythia) for more details on training
procedure, including [how to reproduce
it](https://github.com/EleutherAI/pythia/blob/main/README.md#reproducing-training).<br>
Pythia uses the same tokenizer as
[GPT-NeoX-20B](https://huggingface.co/EleutherAI/gpt-neox-20b).
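
The token counts quoted in the training procedure follow directly from the step count and batch size. The short check below reproduces that arithmetic using only the figures stated above. The variable and function names are illustrative.

```python
TOKENS_PER_STEP = 2_097_152          # batch size of "2M" tokens, i.e. 2**21
TOTAL_STEPS = 143_000                # training steps per model
CHECKPOINT_INTERVAL_STEPS = 1_000    # step1000, step2000, ..., step143000

total_tokens = TOKENS_PER_STEP * TOTAL_STEPS
tokens_per_checkpoint = TOKENS_PER_STEP * CHECKPOINT_INTERVAL_STEPS
num_even_checkpoints = TOTAL_STEPS // CHECKPOINT_INTERVAL_STEPS


def tokens_seen(step):
    """Number of training tokens consumed by the checkpoint saved at `step`."""
    return step * TOKENS_PER_STEP


print(f"{total_tokens:,} tokens in total")
print(f"{tokens_per_checkpoint:,} tokens between evenly-spaced checkpoints")
print(f"{num_even_checkpoints} evenly-spaced checkpoints")
```

The same `tokens_seen` helper gives the amount of data behind any branch, which is useful when comparing checkpoints across training.
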

## Evaluation

This model has been evaluated zero-shot on HellaSwag using the EleutherAI evaluation harness.

<figure>

| Tasks     | Version | Filter | n-shot | Metric   |   |  Value |   | Stderr |
|-----------|--------:|--------|-------:|----------|---|-------:|---|-------:|
| hellaswag |       1 | none   |      0 | acc      | ↑ | 0.2872 | ± | 0.0045 |
|           |         | none   |      0 | acc_norm | ↑ | 0.3082 | ± | 0.0046 |
<figcaption>Evaluation results.</figcaption>
</figure>
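
One way to read the `Value` and `Stderr` columns together is as an approximate confidence interval. The sketch below assumes a normal approximation with z = 1.96 for a 95% interval. That choice is an assumption made here, not something the evaluation harness reports. The function name is illustrative.

```python
# (value, standard error) pairs copied from the evaluation table above
RESULTS = {
    "acc": (0.2872, 0.0045),
    "acc_norm": (0.3082, 0.0046),
}


def confidence_interval(value, stderr, z=1.96):
    """Return (low, high) for value +/- z * stderr (normal approximation)."""
    half_width = z * stderr
    return value - half_width, value + half_width


for metric, (value, stderr) in RESULTS.items():
    low, high = confidence_interval(value, stderr)
    print(f"{metric}: {value:.4f} (approx. 95% CI {low:.4f} to {high:.4f})")
```
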