---
title: StoryLlama
emoji: πŸ“–
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 5.21.0
app_file: app.py
pinned: false
---
# Introducing StoryLlama - A Smaller Language Model for Bedtime Stories!
- So, I trained an 88M-parameter Llama-style architecture that I coded from the ground up to build a small instruct model, going through the below-mentioned stages from scratch.
- Trained on the TinyStories dataset from HuggingFace, consisting of around 4B tokens, for a total of 5000 steps
### Pretraining
#### Dataset
- I used the [TinyStories](https://huggingface.co/datasets/roneneldan/TinyStories) dataset from HuggingFace.
1) Train dataset - ~2M records
2) Val dataset - ~26K records
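
For reference, the raw dataset can be pulled directly with the HuggingFace `datasets` library. A minimal sketch (this is not necessarily how the repo's own data pipeline loads it):

```python
# Minimal sketch: load TinyStories from the HuggingFace Hub.
# Split/field names follow the dataset card; the repo's loader may differ.
from datasets import load_dataset

ds = load_dataset("roneneldan/TinyStories")
print(ds)                      # shows the train/validation splits and record counts
print(ds["train"][0]["text"])  # each record is one short story in the "text" field
```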
---
#### ModelArgs (Hyperparameters)
Below is a table summarizing the configuration parameters for the model:
| Parameter | Description | Default Value | Type |
|--------------------------------|-----------------------------------------------------------------------------|-----------------------------------|-----------|
| `epochs` | Number of training epochs | `4` | `int` |
| `block_size` | Size of each block (context length) | `512` | `int` |
| `batch_size` | Batch size for training | `64` | `int` |
| `inference` | Inference mode (not specified) | `None` | `None` |
| `embeddings_dims` | Dimensionality of embeddings | `512` | `int` |
| `attn_dropout` | Dropout rate for attention layers | `0.1` | `float` |
| `no_of_heads` | Number of attention heads | `8` | `int` |
| `dropout` | Dropout rate for the model | `0.1` | `float` |
| `val_epochs` | Number of validation epochs | `2` | `int` |
| `max_lr` | Maximum learning rate | `6e-4` | `float` |
| `no_of_decoder_layers` | Number of decoder layers | `8` | `int` |
| `weight_decay_optim` | Weight decay for the optimizer | `0.1` | `float` |
| `beta_1` | Beta 1 for Adam optimizer | `0.9` | `float` |
| `beta_2` | Beta 2 for Adam optimizer | `0.95` | `float` |
| `clip` | Gradient clipping value | `1.0` | `float` |
| `device` | Device to run the model (`cuda` or `cpu`) | `'cuda'` | `str` |
| `no_kv_heads` | Number of key-value heads | `2` | `int` |
| `vocab_size` | Size of the vocabulary | `50304` | `int` |
| `eps` | Epsilon value for numerical stability | `1e-5` | `float` |
| `dtype` | Data type for tensors (`bfloat16` if supported, else `float16`) | `'bfloat16'` or `'float16'` | `str` |
| `save_checkpoint_dir` | Directory to save model checkpoints | `"checkpoints"` | `str` |
| `prompt` | Default prompt for inference | `"Once upon a time"` | `str` |
| `save_checkpoint_iter` | Save checkpoint every N iterations | `50` | `int` |
| `total_iters` | Total number of training iterations | `10000` | `int` |
| `eval_iters` | Evaluate model every N iterations | `50` | `int` |
| `eval_check` | Check evaluation metrics every N iterations | `100` | `int` |
| `warmup_iters` | Number of warmup iterations for learning rate scheduling | `700` | `int` |
| `min_lr` | Minimum learning rate (10% of `max_lr`) | `0.1 * max_lr` | `float` |
| `lr_decay_iters` | Number of iterations for learning rate decay | `10000` | `int` |
| `total_batch_size` | Total batch size across all devices | `524288` | `int` |
| `micro_batch_size` | Micro batch size per device | `batch_size` | `int` |
| `gradient_accumulation_steps`  | Gradient accumulation steps                                                  | `524288`                           | `int`     |
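
For readability, the configuration above maps onto a plain dataclass roughly like the following. This is a sketch mirroring the table's defaults, not the exact `ModelArgs` class from the repo:

```python
from dataclasses import dataclass

import torch


@dataclass
class ModelArgs:
    # Sketch only: names and defaults mirror the table above; the real class may differ.
    epochs: int = 4
    block_size: int = 512            # context length
    batch_size: int = 64
    embeddings_dims: int = 512
    attn_dropout: float = 0.1
    no_of_heads: int = 8
    dropout: float = 0.1
    val_epochs: int = 2
    max_lr: float = 6e-4
    no_of_decoder_layers: int = 8
    weight_decay_optim: float = 0.1
    beta_1: float = 0.9
    beta_2: float = 0.95
    clip: float = 1.0                # gradient clipping value
    device: str = "cuda"
    no_kv_heads: int = 2             # grouped-query attention: fewer KV heads than query heads
    vocab_size: int = 50304
    eps: float = 1e-5
    dtype: str = "bfloat16" if torch.cuda.is_available() and torch.cuda.is_bf16_supported() else "float16"
    save_checkpoint_dir: str = "checkpoints"
    prompt: str = "Once upon a time"
    save_checkpoint_iter: int = 50
    total_iters: int = 10000
    eval_iters: int = 50
    eval_check: int = 100
    warmup_iters: int = 700
    min_lr: float = 0.1 * max_lr     # 10% of max_lr = 6e-5
    lr_decay_iters: int = 10000
    total_batch_size: int = 524288   # counted in tokens per optimizer step (assumption; see the note in the Running section)
    micro_batch_size: int = batch_size
    # gradient_accumulation_steps is listed in the table above; in practice it is
    # typically derived from the three values above (see the note in the Running section).
```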
---
#### Hardware Setup
- Used DDP with PyTorch `torchrun` on 2x NVIDIA A100 SXM GPUs (80 GB VRAM each), rented on runpod.io
- The model is about 0.768 GB in size but needs around 4 GB of VRAM when loaded in fp32 precision
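
Training is launched with `torchrun` (see the Running section below); inside the training script, the standard PyTorch DDP setup looks roughly like this. This is a sketch of the usual pattern, not a copy of `trainer.py`:

```python
# Sketch of the standard torchrun + DDP setup (not taken from trainer.py).
import os

import torch
from torch.distributed import destroy_process_group, init_process_group
from torch.nn.parallel import DistributedDataParallel as DDP

init_process_group(backend="nccl")                  # torchrun sets RANK/LOCAL_RANK/WORLD_SIZE
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(512, 512).to(local_rank)    # stand-in for the repo's Llama model
model = DDP(model, device_ids=[local_rank])

# ...training loop: each rank processes its own shard of the data and
# gradients are all-reduced automatically during backward().

destroy_process_group()
```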
---
#### Frameworks:
**PyTorch**
---
#### Epochs/Steps
- Train iterations = 5k
- Validation run every 50 steps
---
#### Losses
- Train loss - 1.43
- Val loss - 1.45
---
#### Screenshots of the loss curves
- Loss Curves (Train and Val)
![Loss Curves (Train and Val)](images/loss_curves.jpg)
---
#### Output
- Prompt: Once upon a time
![Prompt: Once upon a time](images/sample.jpg)
---
### Local setup
### Requirements
```bash
git clone https://github.com/YuvrajSingh-mist/StoryLlama.git
cd StoryLlama
bash ./install.sh
```
- A wandb.ai account for plotting graphs for your loss curves
- On your terminal run
```bash
wandb login
```
- Enter the API key and follow the instructions; once you are successfully logged in, follow the steps below
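
Once logged in, metrics can be streamed to Weights & Biases during training. A minimal logging sketch (the project name and metric key here are assumptions, not necessarily what `trainer.py` logs):

```python
import math

import wandb

run = wandb.init(project="StoryLlama")        # project name is an assumption

for step in range(100):
    dummy_loss = 2.0 * math.exp(-step / 50)   # stand-in for the real training loss
    wandb.log({"train/loss": dummy_loss}, step=step)

run.finish()
```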
- Download the model
```bash
python download_model_weight.py
```
---
### Running
#### Training a model
- Kindly change `device` to any of your available CUDA GPUs.
To run:
```bash
bash ./install.sh
```
```bash
torchrun --standalone --nproc_per_node=gpu trainer.py \
--epochs 10 \
--block_size 256 \
--batch_size 128 \
--embeddings_dims 768 \
--attn_dropout 0.2 \
--no_of_heads 12 \
--dropout 0.2 \
--val_epochs 3 \
--max_lr 5e-4 \
--no_of_decoder_layers 6 \
--weight_decay_optim 0.01 \
--beta_1 0.85 \
--beta_2 0.99 \
--clip 0.5 \
--device "cuda" \
--no_kv_heads 4 \
--vocab_size 50257 \
--eps 1e-6 \
--dtype "float16" \
--save_checkpoint_dir "model_checkpoints" \
--prompt "Once upon a time" \
--save_checkpoint_iter 100 \
--total_iters 5000 \
--eval_iters 200 \
--eval_check 500 \
--warmup_iters 1000 \
--min_lr 1e-5 \
--lr_decay_iters 2000 \
--total_batch_size 262144 \
--micro_batch_size 128 \
--gradient_accumulation_steps 4
```
- `--standalone` - if all the GPUs are on one server
- `--nproc_per_node` - number of GPUs available; use the keyword `gpu` to use all of them
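
A note on the batch-size flags: `total_batch_size` appears to be counted in tokens per optimizer step, and the usual convention (an assumption about this repo, though it is consistent with the example command above) relates it to the other flags like this:

```python
# Assumed relationship between the batch-size flags (consistent with the example
# command above, but not verified against trainer.py):
total_batch_size = 262144    # target tokens per optimizer step
micro_batch_size = 128       # sequences per GPU per forward/backward pass
block_size = 256             # tokens per sequence
world_size = 2               # number of GPUs launched by torchrun

grad_accum_steps = total_batch_size // (micro_batch_size * block_size * world_size)
print(grad_accum_steps)      # -> 4, matching --gradient_accumulation_steps above
```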
#### Inference on a model
```bash
python inference.py --prompt "Once upon a time" --max_length 100 --temperature 0.8 --topk 50
```
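
The `--temperature` and `--topk` flags correspond to standard temperature-scaled top-k sampling. A minimal sketch of that decoding step (the actual loop in `inference.py` may differ):

```python
import torch
import torch.nn.functional as F


def sample_next_token(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 50) -> torch.Tensor:
    """Sample one next-token id from (batch, vocab_size) logits with temperature + top-k."""
    logits = logits / max(temperature, 1e-8)                 # temperature scaling
    v, _ = torch.topk(logits, min(top_k, logits.size(-1)))   # keep only the k largest logits
    logits = logits.masked_fill(logits < v[:, [-1]], float("-inf"))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)           # (batch, 1) sampled token ids


# Example with random logits standing in for the model's output:
print(sample_next_token(torch.randn(1, 50304)))
```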