# IJK Technology – ByteGPT-small

**ByteGPT-small** is a small GPT-style language model trained with byte-level tokenization, inspired by the ByT5 paper. It is designed for compute- and memory-constrained devices such as mobile phones and embedded systems.

## 🚀 Overview

- **Model Type:** GPT-style causal language model
- **Tokenizer:** Byte-level tokenization (from ByT5)
- **Intended Use:** Edge devices, mobile phones, embedded systems
- **Size:** Small (initial prototype)
- **Training:** Custom-trained from scratch

## 🧠 Why Byte Tokenization?

Byte tokenization offers several advantages for small-scale, efficient models:

1. **Reduced Memory Footprint:** Byte-level tokenization drastically reduces the size of the embedding layer, making the model suitable for devices with limited RAM.
2. **No External Dependencies:** Unlike subword tokenizers (e.g., SentencePiece, BPE), byte tokenization requires no external libraries; a simple Python script can handle it.
3. **Robustness to Noise:** Byte-level models are more robust to misspellings, typos, and out-of-vocabulary words.

## 💡 Future Plans

This is the **first** in a series of models. While this model is not yet highly useful due to its small size, it is the foundation for future versions. Upcoming releases will include:

- **Larger Models:** Scaled-up versions with better performance
- **Distilled Models:** Using GRPO distillation to create highly efficient small models
- **Benchmark Results:** Comparative performance on mobile devices

## 💻 Usage

### Quick Start (with `transformers`)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ijktech/ByteGPT-small", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ijktech/ByteGPT-small")

input_text = "What is the capital of France?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Tokenizer

The tokenizer is byte-level and compatible with `AutoTokenizer` from Hugging Face:

```python
tokenizer = AutoTokenizer.from_pretrained("ijktech/ByteGPT-small")
```

## 📜 License

📍 **Non-Commercial License:** Free for hobbyists and personal projects.

💼 **Commercial Use:** Contact IJK Technology Ltd for licensing.

## 🛠️ About IJK Technology Ltd

IJK Technology Ltd (IJKTech) develops innovative machine learning models optimized for on-device inference. Our focus is on efficiency, privacy, and usability across mobile and embedded platforms.
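
### Appendix: How Byte Tokenization Works

To make the "no external dependencies" point above concrete, here is a minimal sketch of ByT5-style byte tokenization in plain Python. This is an illustration only, not the exact ByteGPT-small tokenizer: the `OFFSET` value and special-token layout follow ByT5's convention (ids 0–2 reserved for `<pad>`, `</s>`, `<unk>`) and may differ from this model's actual vocabulary.

```python
# Illustrative ByT5-style byte tokenizer (assumed offset; not the exact
# ByteGPT-small implementation).

# ByT5 reserves ids 0-2 for <pad>, </s>, <unk>, so raw bytes are shifted by 3.
OFFSET = 3

def encode(text: str) -> list[int]:
    # One token id per UTF-8 byte -- the "vocabulary" is just 256 bytes
    # plus a handful of special tokens, hence the tiny embedding table.
    return [b + OFFSET for b in text.encode("utf-8")]

def decode(ids: list[int]) -> str:
    # Shift ids back to raw bytes, skipping special-token ids below OFFSET.
    data = bytes(i - OFFSET for i in ids if i >= OFFSET)
    return data.decode("utf-8", errors="ignore")

ids = encode("héllo")
print(ids)          # one id per byte; "é" occupies two ids (two UTF-8 bytes)
print(decode(ids))  # round-trips back to "héllo"
```

Because the mapping is a fixed byte-to-id shift, the embedding table stays small (a few hundred rows instead of tens of thousands) and any text, including typos and unseen words, encodes without an `<unk>` fallback for the bytes themselves.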