IJK Technology – ByteGPT-small

ByteGPT-small is a small GPT-style language model trained using byte tokenization inspired by the ByT5 paper. It is designed for use on compute- and memory-constrained devices, such as mobile phones and embedded systems.

🚀 Overview

  • Model Type: GPT-style causal language model
  • Tokenizer: Byte-level tokenization (from ByT5)
  • Intended Use: Edge devices, mobile phones, embedded systems
  • Size: Small (initial prototype)
  • Training: Custom-trained from scratch

🧠 Why Byte Tokenization?

Byte tokenization offers several advantages for small-scale, efficient models:

  1. Reduced Memory Footprint:
    Byte-level tokenization drastically reduces the size of the embedding layer, making the model suitable for devices with limited RAM.

  2. No External Dependencies:
    Unlike subword tokenizers (e.g., SentencePiece, BPE), byte-level tokenization requires no external libraries; a few lines of plain Python can handle encoding and decoding (see the sketch after this list).

  3. Robustness to Noise:
    Byte-level models are more robust to misspellings, typos, and out-of-vocabulary tokens.
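
A vocabulary of 256 byte values plus a handful of special tokens keeps the embedding table tiny: at an embedding width of 512, that is roughly 0.13M parameters versus about 25M for a typical ~50k-entry subword vocabulary. The sketch below shows the whole idea in plain Python, assuming a ByT5-style scheme that shifts byte values past a few special-token ids; the exact offsets used by ByteGPT-small are an assumption here, not taken from the repository.

def encode(text: str, offset: int = 3) -> list[int]:
    # UTF-8 encode the text and shift each byte past the special-token ids
    # (ByT5 reserves 0=pad, 1=eos, 2=unk; the offset here is illustrative).
    return [b + offset for b in text.encode("utf-8")]

def decode(ids: list[int], offset: int = 3) -> str:
    # Undo the shift and decode the raw bytes back to a string.
    return bytes(i - offset for i in ids).decode("utf-8", errors="ignore")

ids = encode("héllo")   # non-ASCII characters simply become multiple bytes
print(ids)              # [107, 198, 172, 111, 111, 114]
print(decode(ids))      # héllo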

💡 Future Plans

This is the first in a series of models. While this model is not yet highly useful due to its small size, it represents the foundation for future versions. Upcoming releases will include:

  • Larger Models: Scaled-up versions with better performance
  • Distilled Models: Using GRPO distillation to create highly efficient small models
  • Benchmark Results: Comparative performance on mobile devices

💻 Usage

Quick Start (with transformers):

from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True is required because the model class is defined in the model repository.
model = AutoModelForCausalLM.from_pretrained("ijktech/ByteGPT-small", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ijktech/ByteGPT-small")

input_text = "What is the capital of France?"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Tokenizer

The tokenizer is byte-level and compatible with Hugging Face's AutoTokenizer:

tokenizer = AutoTokenizer.from_pretrained("ijktech/ByteGPT-small")
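
Because the vocabulary is byte-level, each UTF-8 byte of the input maps to one id. A quick way to see this is shown below; the exact id values depend on the tokenizer's byte-to-id mapping and any special tokens it adds, so treat the printed output as illustrative.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ijktech/ByteGPT-small")

# "café" is 5 bytes in UTF-8 (the "é" takes two), so expect roughly 5 ids,
# plus any special tokens the tokenizer appends.
enc = tokenizer("café")
print(enc["input_ids"])
print(tokenizer.decode(enc["input_ids"]))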

📜 License

πŸ“ Non-Commercial License: Free for hobbyists and personal projects.

💼 Commercial Use: Contact IJK Technology Ltd for licensing.

πŸ› οΈ About IJK Technology Ltd

IJK Technology Ltd (IJKTech) develops innovative machine learning models optimized for on-device inference. Our focus is on efficiency, privacy, and usability across mobile and embedded platforms.