IJK Technology – ByteGPT-small

ByteGPT-small is a small GPT-style language model trained using byte tokenization inspired by the ByT5 paper. It is designed for use on compute- and memory-constrained devices, such as mobile phones and embedded systems.

🚀 Overview

  • Model Type: GPT-style causal language model
  • Tokenizer: Byte-level tokenization (from ByT5)
  • Intended Use: Edge devices, mobile phones, embedded systems
  • Size: Small (initial prototype)
  • Training: Custom-trained from scratch

🧠 Why Byte Tokenization?

Byte tokenization offers several advantages for small-scale, efficient models:

  1. Reduced Memory Footprint:
    Byte-level tokenization drastically reduces the size of the embedding layer, making the model suitable for devices with limited RAM.

  2. No External Dependencies:
    Unlike subword tokenizers (e.g., SentencePiece, BPE), byte-level tokenization requires no external libraries; a few lines of plain Python can handle encoding and decoding (see the sketch after this list).

  3. Robustness to Noise:
    Byte-level models are more robust to misspellings, typos, and out-of-vocabulary tokens.
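
A vocabulary of 256 byte values plus a handful of special tokens keeps the embedding table tiny: at an embedding width of 512, that is roughly 0.13M parameters versus about 25M for a typical ~50k-entry subword vocabulary. The sketch below shows the whole idea in plain Python, assuming a ByT5-style scheme that shifts byte values past a few special-token ids; the exact offsets used by ByteGPT-small are an assumption here, not taken from the repository.

def encode(text: str, offset: int = 3) -> list[int]:
    # UTF-8 encode the text and shift each byte past the special-token ids
    # (ByT5 reserves 0=pad, 1=eos, 2=unk; the offset here is illustrative).
    return [b + offset for b in text.encode("utf-8")]

def decode(ids: list[int], offset: int = 3) -> str:
    # Undo the shift and decode the raw bytes back to a string.
    return bytes(i - offset for i in ids).decode("utf-8", errors="ignore")

ids = encode("héllo")   # non-ASCII characters simply become multiple bytes
print(ids)              # [107, 198, 172, 111, 111, 114]
print(decode(ids))      # héllo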

💡 Future Plans

This is the first in a series of models. While this model is not yet highly useful due to its small size, it represents the foundation for future versions. Upcoming releases will include:

  • Larger Models: Scaled-up versions with better performance
  • Distilled Models: Using GRPO distillation to create highly efficient small models
  • Benchmark Results: Comparative performance on mobile devices

💻 Usage

Quick Start (with transformers):

from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True is required because the model class is defined in the model repository.
model = AutoModelForCausalLM.from_pretrained("ijktech/ByteGPT-small", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ijktech/ByteGPT-small")

input_text = "What is the capital of France?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Tokenizer

The tokenizer is byte-level and compatible with Hugging Face's AutoTokenizer:

tokenizer = AutoTokenizer.from_pretrained("ijktech/ByteGPT-small")
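
Because the vocabulary is byte-level, each UTF-8 byte of the input maps to one id. A quick way to see this is shown below; the exact id values depend on the tokenizer's byte-to-id mapping and any special tokens it adds, so treat the printed output as illustrative.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ijktech/ByteGPT-small")

# "café" is 5 bytes in UTF-8 (the "é" takes two), so expect roughly 5 ids,
# plus any special tokens the tokenizer appends.
enc = tokenizer("café")
print(enc["input_ids"])
print(tokenizer.decode(enc["input_ids"]))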

📜 License

πŸ“ Non-Commercial License: Free for hobbyists and personal projects.

💼 Commercial Use: Contact IJK Technology Ltd for licensing.

πŸ› οΈ About IJK Technology Ltd

IJK Technology Ltd (IJKTech) develops innovative machine learning models optimized for on-device inference. Our focus is on efficiency, privacy, and usability across mobile and embedded platforms.