# IJK Technology – ByteGPT-small

**ByteGPT-small** is a small GPT-style language model trained with byte-level tokenization, inspired by the ByT5 paper. It is designed for compute- and memory-constrained devices such as mobile phones and embedded systems.

## 🚀 Overview

- **Model Type:** GPT-style causal language model
- **Tokenizer:** Byte-level tokenization (from ByT5)
- **Intended Use:** Edge devices, mobile phones, embedded systems
- **Size:** Small (initial prototype)
- **Training:** Custom-trained from scratch

## 🧠 Why Byte Tokenization?

Byte tokenization offers several advantages for small-scale, efficient models:

1. **Reduced Memory Footprint:** Byte-level tokenization drastically reduces the size of the embedding layer, making the model suitable for devices with limited RAM.
2. **No External Dependencies:** Unlike subword tokenizers (e.g., SentencePiece, BPE), byte tokenization requires no external libraries; a simple Python script can handle it.
3. **Robustness to Noise:** Byte-level models are more robust to misspellings, typos, and out-of-vocabulary words.

## 💡 Future Plans

This is the **first** in a series of models. While this model is not yet highly useful due to its small size, it is the foundation for future versions. Upcoming releases will include:

- **Larger Models:** Scaled-up versions with better performance
- **Distilled Models:** Using GRPO distillation to create highly efficient small models
- **Benchmark Results:** Comparative performance on mobile devices

## 💻 Usage

### Quick Start (with `transformers`)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ijktech/ByteGPT-small", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ijktech/ByteGPT-small")

input_text = "What is the capital of France?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Tokenizer

The tokenizer is byte-level and compatible with `AutoTokenizer` from Hugging Face:

```python
tokenizer = AutoTokenizer.from_pretrained("ijktech/ByteGPT-small")
```

## 📜 License

📍 **Non-Commercial License:** Free for hobbyists and personal projects.

💼 **Commercial Use:** Contact IJK Technology Ltd for licensing.

## 🛠️ About IJK Technology Ltd

IJK Technology Ltd (IJKTech) develops innovative machine learning models optimized for on-device inference. Our focus is on efficiency, privacy, and usability across mobile and embedded platforms.
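
### Appendix: How Byte Tokenization Works

To make the "no external dependencies" point above concrete, here is a minimal sketch of ByT5-style byte tokenization in plain Python. This is an illustration only, not the exact ByteGPT-small tokenizer: the `OFFSET` value and special-token layout follow ByT5's convention (ids 0–2 reserved for `<pad>`, `</s>`, `<unk>`) and may differ from this model's actual vocabulary.

```python
# Illustrative ByT5-style byte tokenizer (assumed offset; not the exact
# ByteGPT-small implementation).

# ByT5 reserves ids 0-2 for <pad>, </s>, <unk>, so raw bytes are shifted by 3.
OFFSET = 3

def encode(text: str) -> list[int]:
    # One token id per UTF-8 byte -- the "vocabulary" is just 256 bytes
    # plus a handful of special tokens, hence the tiny embedding table.
    return [b + OFFSET for b in text.encode("utf-8")]

def decode(ids: list[int]) -> str:
    # Shift ids back to raw bytes, skipping special-token ids below OFFSET.
    data = bytes(i - OFFSET for i in ids if i >= OFFSET)
    return data.decode("utf-8", errors="ignore")

ids = encode("héllo")
print(ids)          # one id per byte; "é" occupies two ids (two UTF-8 bytes)
print(decode(ids))  # round-trips back to "héllo"
```

Because the mapping is a fixed byte-to-id shift, the embedding table stays small (a few hundred rows instead of tens of thousands) and any text, including typos and unseen words, encodes without an `<unk>` fallback for the bytes themselves.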