---
title: StoryLlama
emoji: π
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.21.0
app_file: app.py
pinned: false
---
# Introducing StoryLlama - A Smaller Language Model for Bedtime Stories!
- So, I trained an 88M-parameter Llama architecture that I coded from the ground up to build a small instruct model, going through the below-mentioned stages from scratch.
- Trained on the TinyStories dataset from HuggingFace, consisting of 4B tokens, for a total of 5000 steps.
### Pretraining | |
#### Dataset | |
- I used the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset from HuggingFace (a loading sketch follows this list).
1) Train dataset - ~2M records
2) Val dataset - ~26K records
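For reference, this is roughly how the dataset can be pulled from the HuggingFace Hub. The split and field names below are the ones published with TinyStories; any tokenization or batching is left to the repo's own `trainer.py`.

```python
from datasets import load_dataset

# Pull TinyStories from the HuggingFace Hub (~2M train / ~26K validation records).
dataset = load_dataset("roneneldan/TinyStories")
train_split = dataset["train"]
val_split = dataset["validation"]

print(train_split[0]["text"][:200])  # peek at the first story
```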
---
#### ModelArgs (Hyperparameters)
Below is a table summarizing the configuration parameters for the model (a sketch of the learning-rate schedule implied by these values follows the table):
| Parameter | Description | Default Value | Type |
|--------------------------------|-----------------------------------------------------------------------------|-----------------------------------|-----------|
| `epochs` | Number of training epochs | `4` | `int` |
| `block_size` | Size of each block (context length) | `512` | `int` |
| `batch_size` | Batch size for training | `64` | `int` |
| `inference` | Inference mode (not specified) | `None` | `None` |
| `embeddings_dims` | Dimensionality of embeddings | `512` | `int` |
| `attn_dropout` | Dropout rate for attention layers | `0.1` | `float` |
| `no_of_heads` | Number of attention heads | `8` | `int` |
| `dropout` | Dropout rate for the model | `0.1` | `float` |
| `val_epochs` | Number of validation epochs | `2` | `int` |
| `max_lr` | Maximum learning rate | `6e-4` | `float` |
| `no_of_decoder_layers` | Number of decoder layers | `8` | `int` |
| `weight_decay_optim` | Weight decay for the optimizer | `0.1` | `float` |
| `beta_1` | Beta 1 for the Adam optimizer | `0.9` | `float` |
| `beta_2` | Beta 2 for the Adam optimizer | `0.95` | `float` |
| `clip` | Gradient clipping value | `1.0` | `float` |
| `device` | Device to run the model (`cuda` or `cpu`) | `'cuda'` | `str` |
| `no_kv_heads` | Number of key-value heads | `2` | `int` |
| `vocab_size` | Size of the vocabulary | `50304` | `int` |
| `eps` | Epsilon value for numerical stability | `1e-5` | `float` |
| `dtype` | Data type for tensors (`bfloat16` if supported, else `float16`) | `'bfloat16'` or `'float16'` | `str` |
| `save_checkpoint_dir` | Directory to save model checkpoints | `"checkpoints"` | `str` |
| `prompt` | Default prompt for inference | `"Once upon a time"` | `str` |
| `save_checkpoint_iter` | Save a checkpoint every N iterations | `50` | `int` |
| `total_iters` | Total number of training iterations | `10000` | `int` |
| `eval_iters` | Evaluate the model every N iterations | `50` | `int` |
| `eval_check` | Check evaluation metrics every N iterations | `100` | `int` |
| `warmup_iters` | Number of warmup iterations for learning rate scheduling | `700` | `int` |
| `min_lr` | Minimum learning rate (10% of `max_lr`) | `0.1 * max_lr` | `float` |
| `lr_decay_iters` | Number of iterations for learning rate decay | `10000` | `int` |
| `total_batch_size` | Total batch size across all devices | `524288` | `int` |
| `micro_batch_size` | Micro batch size per device | `batch_size` | `int` |
| `gradient_accumulation_steps` | Gradient accumulation steps | `524288` | `int` |
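The table lists `max_lr`, `min_lr`, `warmup_iters` and `lr_decay_iters` but not how they are combined. A linear-warmup plus cosine-decay schedule is the usual reading for this kind of setup; the sketch below is an assumption along those lines, not a transcription of `trainer.py`.

```python
import math

# Values taken from the table above; the schedule shape itself is an assumption.
max_lr, min_lr = 6e-4, 0.1 * 6e-4
warmup_iters, lr_decay_iters = 700, 10000

def get_lr(it: int) -> float:
    """Assumed schedule: linear warmup, cosine decay down to min_lr, then flat."""
    if it < warmup_iters:                       # 1) linear warmup
        return max_lr * (it + 1) / warmup_iters
    if it > lr_decay_iters:                     # 3) after decay, hold at the floor
        return min_lr
    # 2) cosine decay from max_lr to min_lr
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))
    return min_lr + coeff * (max_lr - min_lr)

print(get_lr(0), get_lr(warmup_iters), get_lr(lr_decay_iters))
```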
---
#### Hardware Setup
- Used DDP with PyTorch's torchrun on 2x A100 SXM GPUs (80 GB VRAM each) rented on runpod.io (a minimal DDP setup sketch follows this list).
- The model is 0.768 GB in size but needs around 4 GB of VRAM when loaded in fp32 precision.
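As a rough illustration of the multi-GPU setup, here is a minimal DDP skeleton of the kind `trainer.py` presumably uses under torchrun. The model here is a stand-in placeholder; the real training loop, data sharding, and checkpointing live in the repo.

```python
import os
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it launches.
torch.distributed.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Placeholder module; the actual repo builds the 88M-parameter Llama model here.
model = nn.Linear(512, 512).to(local_rank)
model = DDP(model, device_ids=[local_rank])

# ... training loop: forward pass, loss, backward, optimizer.step() ...

torch.distributed.destroy_process_group()
```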
---
#### Frameworks:
**PyTorch**
---
#### Epochs/Steps
- Iterations (train) = 5k (see the sketch below for how each iteration breaks down into micro-batches)
- Val iterations = every 50 steps
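The relationship between `total_batch_size`, `micro_batch_size`, `block_size` and gradient accumulation is not spelled out above (the table's `gradient_accumulation_steps` entry mirrors `total_batch_size`). The arithmetic below assumes the common convention in which `total_batch_size` counts tokens per optimizer step; it is not taken from the repo.

```python
# Assumed decomposition of one optimizer step into micro-batches (not taken from trainer.py).
total_batch_size = 524288   # tokens per optimizer step (2**19)
micro_batch_size = 64       # sequences per GPU per forward pass
block_size = 512            # tokens per sequence
world_size = 2              # number of GPUs (2x A100)

tokens_per_micro_step = micro_batch_size * block_size * world_size  # 65,536
grad_accum_steps = total_batch_size // tokens_per_micro_step

print(grad_accum_steps)  # -> 8 forward/backward passes accumulated per optimizer step
```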
---
#### Losses
- Train loss - 1.43
- Val loss - 1.45 (see the perplexity conversion below)
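Assuming these are mean token-level cross-entropy losses (the standard objective for this kind of model), they correspond to perplexities of roughly 4.2 and 4.3:

```python
import math

train_loss, val_loss = 1.43, 1.45
print(math.exp(train_loss))  # ~4.18 - train perplexity, assuming cross-entropy in nats
print(math.exp(val_loss))    # ~4.26 - validation perplexity
```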
---
#### Screenshots of the loss curves
- Loss Curves (Train and Val)

---
#### Output
- Prompt: Once upon a time

--- | |
### Local setup
#### Requirements
```bash
git clone https://github.com/YuvrajSingh-mist/StoryLlama.git
cd StoryLlama
bash ./install.sh
```
- A wandb.ai account for plotting your loss curves
- On your terminal run
```bash
wandb login
```
- Enter the API key and follow the instructions; once you are successfully logged in, follow the steps below. (A minimal logging sketch is shown after this step.)
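Weights & Biases then only needs an `init` call and periodic `log` calls from the training script. The project name and metric keys below are illustrative assumptions, not necessarily the ones `trainer.py` uses.

```python
import wandb

# Illustrative only: project and metric names are assumptions, not taken from trainer.py.
run = wandb.init(project="StoryLlama", config={"max_lr": 6e-4, "block_size": 512})

for step, train_loss in enumerate([2.1, 1.8, 1.6]):  # stand-in values for real losses
    wandb.log({"train/loss": train_loss}, step=step)

run.finish()
```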
- Download the model
```bash
python download_model_weight.py
```
--- | |
### Running
#### Training a model
- Kindly change `device` to any of your available CUDA GPUs.
To run:
```bash
bash ./install.sh
```
```bash
torchrun --standalone --nproc_per_node=gpu trainer.py \
    --epochs 10 \
    --block_size 256 \
    --batch_size 128 \
    --embeddings_dims 768 \
    --attn_dropout 0.2 \
    --no_of_heads 12 \
    --dropout 0.2 \
    --val_epochs 3 \
    --max_lr 5e-4 \
    --no_of_decoder_layers 6 \
    --weight_decay_optim 0.01 \
    --beta_1 0.85 \
    --beta_2 0.99 \
    --clip 0.5 \
    --device "cuda" \
    --no_kv_heads 4 \
    --vocab_size 50257 \
    --eps 1e-6 \
    --dtype "float16" \
    --save_checkpoint_dir "model_checkpoints" \
    --prompt "Once upon a time" \
    --save_checkpoint_iter 100 \
    --total_iters 5000 \
    --eval_iters 200 \
    --eval_check 500 \
    --warmup_iters 1000 \
    --min_lr 1e-5 \
    --lr_decay_iters 2000 \
    --total_batch_size 262144 \
    --micro_batch_size 128 \
    --gradient_accumulation_steps 4
```
- `--standalone` - use this if all the GPUs are on one server
- `--nproc_per_node` - number of GPUs to use; pass the keyword `gpu` to use all of them
#### Inference on a model
```bash
python inference.py --prompt "Once upon a time" --max_length 100 --temperature 0.8 --topk 50
```
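For intuition on what `--temperature` and `--topk` control, here is a generic sketch of temperature-scaled top-k sampling. It is not the repo's `inference.py`, just the standard technique those flags usually select.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_next_token(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 50) -> torch.Tensor:
    """Pick the next token id from a (batch, vocab_size) logits tensor."""
    logits = logits / temperature                          # <1.0 sharpens, >1.0 flattens the distribution
    topk_vals, _ = torch.topk(logits, top_k, dim=-1)
    logits[logits < topk_vals[..., -1:]] = float("-inf")   # mask everything outside the top-k
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)         # sample one id per batch row

# Example with random logits over the 50304-token vocabulary from the config table.
next_id = sample_next_token(torch.randn(1, 50304))
print(next_id.shape)  # torch.Size([1, 1])
```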