<h1 align="center">
<img src="logo.png" alt="IJK Technology" width="100" style="vertical-align: middle; margin-right: 10px;">
IJK Technology - ByteGPT-small
</h1>
**ByteGPT-small** is a small GPT-style language model trained using byte tokenization inspired by the ByT5 paper. It is designed for use on compute- and memory-constrained devices, such as mobile phones and embedded systems.
## Overview
- **Model Type:** GPT-style causal language model
- **Tokenizer:** Byte-level tokenization (from ByT5)
- **Intended Use:** Edge devices, mobile phones, embedded systems
- **Size:** Small (initial prototype)
- **Training:** Custom-trained from scratch
## Why Byte Tokenization?
Byte tokenization offers several advantages for small-scale, efficient models:
1. **Reduced Memory Footprint:**
Byte-level tokenization drastically reduces the size of the embedding layer, making the model suitable for devices with limited RAM.
2. **No External Dependencies:**
Unlike subword tokenizers (e.g., SentencePiece, BPE), byte tokenization requires no external libraries; a few lines of plain Python can handle it (see the sketch after this list).
3. **Robustness to Noise:**
Byte-level models are more robust to misspellings, typos, and out-of-vocabulary tokens.
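To illustrate the first two points, here is a minimal sketch of a byte-level tokenizer in plain Python. It is not the exact tokenizer shipped with this model; the number and placement of special tokens (`SPECIAL_TOKENS = 3`) are assumptions for illustration. The key property is that the vocabulary is only 256 byte values plus a handful of special tokens, which is why the embedding table stays tiny compared with a 30k-50k subword vocabulary.

```python
# Minimal byte-level tokenizer sketch (illustrative only; the model's actual
# tokenizer may reserve different special-token IDs).
SPECIAL_TOKENS = 3  # assumed reserved IDs, e.g. <pad>, <bos>, <eos>

def encode(text: str) -> list[int]:
    # One token ID per UTF-8 byte, shifted past the reserved special-token IDs.
    return [b + SPECIAL_TOKENS for b in text.encode("utf-8")]

def decode(ids: list[int]) -> str:
    # Drop special tokens, undo the shift, and decode the raw bytes.
    data = bytes(i - SPECIAL_TOKENS for i in ids if i >= SPECIAL_TOKENS)
    return data.decode("utf-8", errors="ignore")

print(encode("hi"))              # [107, 108] with the assumed offset of 3
print(decode(encode("héllo")))   # round-trips arbitrary UTF-8 text
```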
## Future Plans
This is the **first** in a series of models. While this model is not yet highly useful due to its small size, it represents the foundation for future versions. Upcoming releases will include:
- **Larger Models:** Scaled-up versions with better performance
- **Distilled Models:** Using GRPO distillation to create highly efficient small models
- **Benchmark Results:** Comparative performance on mobile devices
## Usage
### **Quick Start (with `transformers`):**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code is required because the model class is defined in the
# model repository rather than in the transformers library itself.
model = AutoModelForCausalLM.from_pretrained("ijktech/ByteGPT-small", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ijktech/ByteGPT-small")

# Tokenize a prompt and generate up to 100 new tokens (bytes).
input_text = "What is the capital of France?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
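The quick start above uses greedy decoding. Continuing from that snippet, the standard `generate` sampling arguments also apply; the values below are illustrative defaults, not recommendations tuned for this model:

```python
# Sampled generation (continues from the quick-start snippet above).
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,    # sample instead of greedy decoding
    temperature=0.8,   # soften the next-token distribution
    top_k=64,          # restrict sampling to the 64 most likely tokens
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```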
### Tokenizer
The tokenizer is byte-level and compatible with `AutoTokenizer` from Hugging Face:
```python
tokenizer = AutoTokenizer.from_pretrained("ijktech/ByteGPT-small")
```
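Because the vocabulary is byte-level, a tokenized string yields roughly one ID per UTF-8 byte, plus any special tokens. A quick round-trip check, continuing from the snippet above (the exact IDs depend on the tokenizer's special-token layout, so treat the printed values as illustrative):

```python
ids = tokenizer("héllo")["input_ids"]
print(len("héllo".encode("utf-8")))                      # 6 UTF-8 bytes ("é" takes two)
print(ids)                                               # ~one ID per byte, plus specials
print(tokenizer.decode(ids, skip_special_tokens=True))   # "héllo"
```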
## License
- **Non-Commercial License**: Free for hobbyists and personal projects.
- **Commercial Use**: Contact IJK Technology Ltd for licensing.
## About IJK Technology Ltd
IJK Technology Ltd (IJKTech) develops innovative machine learning models optimized for on-device inference. Our focus is on efficiency, privacy, and usability across mobile and embedded platforms. |