---
title: SmoLLMv2
emoji: 🐢
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.13.1
app_file: app.py
pinned: false
license: mit
short_description: Text generation using the smollmv2-135M model
---

# SmoLLMv2: A Small but Efficient Language Model

[Training Repo Link](https://github.com/Shilpaj1994/SmoLLMv2)

[Gradio App Link](https://huggingface.co/spaces/Shilpaj/SmoLLMv2)

SmoLLMv2 is a 135M-parameter language model designed for efficient text generation. It incorporates several modern architectural improvements while maintaining a small footprint.

## Features

- **Efficient Architecture**:
  - 30 transformer layers
  - 9 attention heads
  - 576 embedding dimension
  - Memory-efficient attention with reduced KV dimensions
  - Rotary Position Embeddings (RoPE)
  - SwiGLU activation function
- **Training Optimizations**:
  - Mixed precision training (16-bit)
  - Gradient accumulation
  - OneCycleLR scheduler
  - Streaming dataset support
  - Automatic model compilation (with PyTorch 2.0+)
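
As a rough sketch of how these training optimizations map onto PyTorch Lightning `Trainer` flags (the values below are illustrative assumptions, not the repo's actual settings; those live in `config.py` and the training repo):

```python
import lightning as L

# Illustrative only: flag values are assumptions, not the repo's settings.
trainer = L.Trainer(
    precision="16-mixed",        # 16-bit mixed precision training
    accumulate_grad_batches=4,   # gradient accumulation
    max_steps=5000,
)

# OneCycleLR and torch.compile are typically wired up inside the LightningModule:
#   scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=3e-4, total_steps=5000)
#   model = torch.compile(model)  # automatic compilation on PyTorch 2.0+
```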

## Model Architecture

SmoLLMv2 incorporates several efficiency improvements:

1. **Reduced KV Dimensions**: Uses 189-dimensional key/value projections (instead of the full 576) to save memory and computation.
2. **RoPE Attention**: Implements Rotary Position Embeddings for better handling of sequential information.
3. **SwiGLU Activation**: Uses the SwiGLU activation function in the MLP layers for better performance.
4. **Weight Sharing**: Ties the weights of the input embeddings and the output projection.
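
For intuition, here is a minimal sketch of a SwiGLU MLP block as described in point 3; layer names and the hidden size are placeholders, not the repo's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    """SiLU-gated MLP: down(SiLU(gate(x)) * up(x)). Dimensions are illustrative."""
    def __init__(self, dim: int = 576, hidden_dim: int = 1536):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

x = torch.randn(1, 8, 576)      # (batch, sequence, embedding)
print(SwiGLUMLP()(x).shape)     # torch.Size([1, 8, 576])
```

Weight sharing (point 4) is then typically a one-line tie such as `lm_head.weight = token_embedding.weight`.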

## Configuration

The model's behavior can be customized through various configuration classes in `config.py`:

- `SmollmConfig`: Core model architecture and training parameters
- `RoPEConfig`: Rotary Position Embedding settings
- `OptimizerConfig`: Optimization and learning rate settings
- `DataConfig`: Dataset and tokenizer configuration
- `TrainerConfig`: Training infrastructure settings
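
For a sense of what `SmollmConfig` might hold, an abbreviated hypothetical sketch (field names, context length, and vocabulary size are assumptions; the authoritative definitions are in `config.py`):

```python
from dataclasses import dataclass

@dataclass
class SmollmConfig:
    # Hypothetical field names; only the architecture numbers come from this README.
    n_layers: int = 30        # transformer layers
    n_heads: int = 9          # attention heads
    n_embd: int = 576         # embedding dimension
    kv_dim: int = 189         # reduced key/value projection dimension
    block_size: int = 2048    # assumed context length
    vocab_size: int = 49152   # assumed tokenizer vocabulary size
```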

## Dataset

The model is trained on the Cosmopedia dataset, which is streamed during training to handle large-scale data efficiently.
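
A minimal sketch of streaming Cosmopedia with 🤗 `datasets`; the exact dataset id, subset, and preprocessing used for training are assumptions here:

```python
from datasets import load_dataset

# Stream instead of downloading the full corpus; the subset name is illustrative.
ds = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train", streaming=True)

for example in ds.take(2):
    print(example["text"][:200])
```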

## Requirements

See `requirements.txt` for the full list of dependencies. Key requirements:

- PyTorch ≥ 2.0.0
- Transformers ≥ 4.30.0
- Lightning ≥ 2.0.0
- Gradio ≥ 5.13.1