---
license: mit
language:
- en
tags:
- gpt2
- exbert
inference: false
---

# GPT2-Linear-XL

A conversion of [gpt2-xl](https://hf.co/gpt2-xl) that uses standard linear layers in place of the original convolutional (`Conv1D`) layers. This is not an official OpenAI project.
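
For context, the `transformers` GPT-2 implementation stores its projection weights in `Conv1D` modules, which are effectively transposed `nn.Linear` layers. A minimal sketch of the general idea behind such a conversion (the helper below is illustrative, not the exact script used for this repo):

```python
import torch.nn as nn
from transformers.pytorch_utils import Conv1D

def conv1d_to_linear(conv: Conv1D) -> nn.Linear:
    # GPT-2's Conv1D stores weights as [in_features, out_features];
    # nn.Linear expects [out_features, in_features], so transpose.
    in_features, out_features = conv.weight.shape
    linear = nn.Linear(in_features, out_features, bias=True)
    linear.weight.data = conv.weight.data.t().contiguous()
    linear.bias.data = conv.bias.data.clone()
    return linear
```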

> Pretrained model on English language using a causal language modeling (CLM) objective. It was introduced in
> [this paper](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
> and first released at [this page](https://openai.com/blog/better-language-models/).
>
> GPT-2 is a transformers model pretrained on a very large corpus of English data in a self-supervised fashion. This
> means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots
> of publicly available data), with an automatic process to generate inputs and labels from those texts. More precisely,
> it was trained to guess the next word in sentences.
>
> Inputs are sequences of continuous text of a certain length and the targets are the same sequence,
> shifted one token (word or piece of word) to the right. The model internally uses a masking mechanism to make sure the
> predictions for token `i` only use the inputs from `1` to `i` but not the future tokens.
>
> This way, the model learns an inner representation of the English language that can then be used to extract features
> useful for downstream tasks. The model is best at what it was pretrained for, however, which is generating texts from a
> prompt.
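
As a toy illustration of that objective (using the stock `gpt2` tokenizer purely for demonstration), the targets are simply the input ids shifted by one position:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok("Once upon a time, there was a model", return_tensors="pt").input_ids

# Predict token i+1 from tokens 1..i: the labels are the inputs shifted by one.
inputs, labels = ids[:, :-1], ids[:, 1:]
```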

- Main model: [crumbly/gpt2-linear-xl](https://hf.co/crumbly/gpt2-linear-xl)
- Sharded model: [crumbly/gpt2-linear-xl-sharded](https://hf.co/crumbly/gpt2-linear-xl-sharded)
- Sharded + bfloat16 (bf16) model: [crumbly/gpt2-linear-xl-sharded-bf16](https://hf.co/crumbly/gpt2-linear-xl-sharded-bf16)

Config:

```json
{
  "n_embd": 1600,
  "n_head": 25,
  "n_layer": 48,
  "n_positions": 1024
}
```
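
These are the standard GPT-2 XL dimensions. As a rough sanity check (assuming the usual GPT-2 vocabulary of 50,257 tokens and ignoring biases and layer norms), they land near the familiar ~1.5B-parameter figure:

```python
n_embd, n_layer, n_positions, vocab_size = 1600, 48, 1024, 50257

per_block = 4 * n_embd**2 + 8 * n_embd**2               # attention (QKV + output proj) + 4x-wide MLP
embeddings = vocab_size * n_embd + n_positions * n_embd  # token + position embeddings
total = n_layer * per_block + embeddings

print(f"~{total / 1e9:.2f}B parameters")  # ≈ 1.56B
```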

### Usage

Inference on GPU with 4-bit quantization:

```
%pip install -qq transformers accelerate bitsandbytes
```

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "crumbly/gpt2-linear-xl-sharded-bf16"

# 4-bit NF4 quantization with nested (double) quantization; matmuls run in bfloat16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,        # the linear-layer GPT-2 ships custom modeling code
    device_map={"": 0},            # place the whole model on GPU 0
    quantization_config=bnb_config,
)
```

```python
# Tokenize a prompt, move it to the GPU, and sample a short continuation
inputs = tokenizer("Once upon a time,", return_tensors="pt")
inputs = {k: v.cuda() for k, v in inputs.items()}

outputs = model.generate(
    **inputs,
    max_new_tokens=32,
    temperature=0.7,
    do_sample=True,
)
tokenizer.decode(outputs[0])
```
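
If `bitsandbytes` is not available, the sharded bf16 checkpoint can also be loaded without quantization; at roughly 1.5B parameters the weights take about 3 GB in bf16. A minimal sketch, assuming the same custom code path and a GPU with bf16 support:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "crumbly/gpt2-linear-xl-sharded-bf16",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # ~3 GB of weights instead of ~6 GB in fp32
    device_map={"": 0},
)
```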

TODO:

- ~~test to see if the model works with `.from_pretrained`~~
- ~~test fp32, fp16, 8-bit, and 4-bit~~
- ~~shard the model to a max of 1 GB per shard for use in even lower-VRAM settings~~
- safetensors
- ~~upload a bf16 version of the model~~
- upload 8-bit and 4-bit models
- ~~convert the other base gpt2 models~~
- Open Orca QLoRA on XL
- ReLoRA continued pretraining on RefinedWeb or RedPajama to reach 1T tokens