Shoonya Model v0.2 - DeepSeek CPU-Optimized

This model is a CPU-optimized version of the Shoonya language model, incorporating techniques from the DeepSeek team for efficient inference on CPU hardware.

Model Description

Shoonya Model v0.2 is a lightweight transformer-based language model designed for efficient CPU inference. It incorporates architectural optimizations inspired by DeepSeek's research to achieve better performance on CPU hardware while maintaining good generation quality.

Model Details

  • Developed by: VaidhyaMegha
  • Model type: Transformer-based language model
  • Language(s): English
  • Training Data: TinyStories dataset
  • Parameters: 16.41M
  • Context Length: 512 tokens
  • Hidden Size: 256
  • Attention Heads: 8
  • Key-Value Heads: 4
  • Hidden Layers: 6
  • License: MIT
  • Repository: GitHub - VaidhyaMegha/Shoonya
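
For reference, the architecture fields listed above map onto a configuration roughly like the sketch below. The key names are illustrative and may differ from the model's actual config.json.

# Illustrative configuration mirroring the model details above; key names are
# hypothetical, and the vocabulary size is not stated on this card.
shoonya_config = {
    "hidden_size": 256,
    "num_hidden_layers": 6,
    "num_attention_heads": 8,
    "num_key_value_heads": 4,
    "max_position_embeddings": 512,
    "sliding_window": 256,
}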

DeepSeek CPU Optimizations

This model incorporates the following optimizations from the DeepSeek team:

  1. Grouped-Query Attention (GQA) with a 2:1 query-to-key/value head ratio - Reduces memory usage and computational cost by sharing each key/value projection across multiple query heads (see the sketch after this list)
  2. Rotary Position Embeddings (RoPE) - Provides better positional encoding with improved extrapolation to longer sequences
  3. RMSNorm - Offers improved training stability compared to LayerNorm
  4. SwiGLU activation - Provides better performance in feed-forward networks compared to standard GELU
  5. Sliding Window Attention with window size 256 - Reduces memory usage for longer sequences by limiting attention to a local window
  6. ONNX export - Enables optimized runtime on various hardware platforms
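
To make the first optimization concrete, here is a minimal sketch of grouped-query attention with 8 query heads sharing 4 key/value heads, matching the 2:1 ratio and head counts listed above. The class and layer names are illustrative rather than the model's actual implementation, and the sketch omits RoPE, the sliding-window mask, and the KV cache.

# Minimal GQA sketch: 8 query heads share 4 key/value heads (2:1 ratio).
# hidden_size=256 / 8 heads gives a head dimension of 32, as in the model details.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, hidden_size=256, num_heads=8, num_kv_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.num_kv_heads = num_kv_heads
        self.head_dim = hidden_size // num_heads
        self.q_proj = nn.Linear(hidden_size, num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, num_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, num_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(num_heads * self.head_dim, hidden_size, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.num_kv_heads, self.head_dim).transpose(1, 2)
        # Each key/value head serves num_heads // num_kv_heads query heads,
        # so K and V are stored at half the size and expanded only at attention time.
        repeat = self.num_heads // self.num_kv_heads
        k = k.repeat_interleave(repeat, dim=1)
        v = v.repeat_interleave(repeat, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out)

# Quick shape check: (batch=1, seq_len=16, hidden=256) in and out.
attn = GroupedQueryAttention()
y = attn(torch.randn(1, 16, 256))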

Intended Uses & Limitations

Intended Uses:

  • Educational purposes to understand transformer architecture and optimizations
  • Research on efficient language model deployment
  • Text generation for simple creative writing tasks
  • Baseline for further fine-tuning on specific tasks

Limitations:

  • The model is trained on a limited dataset (TinyStories) and has a relatively small parameter count
  • It may not perform well on complex reasoning tasks or specialized domains
  • The model has not been extensively evaluated for biases or harmful outputs

Training Procedure

Training Data

The model was trained on the TinyStories dataset, which contains simple stories suitable for young children, generated by GPT-3.5/4.
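
The public TinyStories release can be loaded from the Hugging Face Hub as sketched below; that this exact copy and split were used for training is an assumption, not something stated on this card.

# Hedged example: load the authors' public TinyStories release.
# The dataset ID "roneneldan/TinyStories" and the "text" field refer to that
# public release, not necessarily the copy used to train Shoonya.
from datasets import load_dataset

tinystories = load_dataset("roneneldan/TinyStories")
print(tinystories["train"][0]["text"][:200])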

Training Hyperparameters

  • Optimizer: AdamW
  • Learning Rate: 5e-5
  • Batch Size: 4
  • Weight Decay: 0.01
  • Warmup Steps: 100
  • Gradient Accumulation Steps: 4
  • Training Device: CPU (Mac Mini M4)
  • Training Epochs: 5
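
A minimal sketch of how the hyperparameters above wire together (effective batch size 4 × 4 = 16 with gradient accumulation). This is not the project's training script: `model` and `dataloader` are assumed to be defined already, batches are assumed to contain input_ids and labels, and the linear-decay schedule and total step count are assumptions.

# Sketch only: optimizer, warmup schedule, and gradient accumulation per the list above.
import torch
from transformers import get_linear_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=10_000)  # step count assumed

accumulation_steps = 4
for step, batch in enumerate(dataloader):  # dataloader with batch_size=4 assumed
    loss = model(**batch).loss / accumulation_steps
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()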

Note on Quantization

The quantized version of this model is not included due to PyTorch quantization limitations on Mac M-series chips. See quantization_note.md for instructions on how to quantize the model on a compatible system.
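
On a compatible (e.g. x86) system, dynamic INT8 quantization could be applied roughly as sketched below; quantization_note.md remains the authoritative reference, and the output path here is illustrative.

# Hedged sketch: dynamic INT8 quantization of the Linear layers on a compatible system.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("VaidhyaMegha/Shoonya")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)  # quantize Linear layers only
torch.save(quantized.state_dict(), "shoonya_int8.pt")  # illustrative output path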

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("VaidhyaMegha/Shoonya")
tokenizer = AutoTokenizer.from_pretrained("VaidhyaMegha/Shoonya")

# Generate text
input_text = "Once upon a time"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids
output = model.generate(
    input_ids,
    max_length=100,
    do_sample=True,  # sampling must be enabled for temperature/top_p to take effect
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,
)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

Evaluation Results

The model achieved the following metrics during training:

  • Final Loss: 7.21
  • Final Perplexity: 1358.28
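
Perplexity is the exponential of the cross-entropy loss, so the two numbers above are consistent once rounding of the loss is taken into account.

import math

print(math.exp(7.21))  # ≈ 1352.6; the reported 1358.28 matches the unrounded loss (ln(1358.28) ≈ 7.214)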

Ethical Considerations

This model is trained on the TinyStories dataset, which was designed to be suitable for children and contains simple, non-harmful content. However, as with any language model, it may still produce unexpected or potentially problematic outputs. Users should exercise caution and implement appropriate content filtering if deploying this model in production environments.

Citations

@article{eldan2023tinystories,
  title={{TinyStories: How Small Can Language Models Be and Still Speak Coherent English?}},
  author={Eldan, Ronen and Li, Yuanzhi},
  journal={arXiv preprint arXiv:2305.07759},
  year={2023}
}

License

This model is released under the MIT License.
