EXL2 4.5bpw Quantization of calme-3.2-instruct-78b


This repository hosts a 4.5 bits per weight (bpw) quantization of the calme-3.2-instruct-78b model, a Qwen 2.5 finetune, in the ExLlamaV2 (EXL2) format for efficient long-context inference.

Quantization Details

  • Format: ExLlamaV2 4.5bpw
  • Version: ExLlamaV2 0.2.6
  • Model Size: 78 billion parameters
  • VRAM Usage: Approx. 44 GB (32,000-token context)
  • Calibration:
    • Rows: 115
    • Length: 2048
    • Dataset: (default)

The quantization process reduces memory usage and inference latency while maintaining high performance for generative text tasks.
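
For reference, EXL2 quantizations of this kind are produced with ExLlamaV2's convert.py script. The invocation below is a sketch matching the settings listed above; the paths are placeholders, and the -r/-l calibration flags should be verified against convert.py --help for your version:

python convert.py -i /path/to/calme-3.2-instruct-78b -o ./work \
    -cf ./calme-3.2-instruct-78b-exl2-4.5bpw -b 4.5 -r 115 -l 2048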

Prompt Template

This model uses the ChatML prompt template for interaction:

<|im_start|>system
{System}
<|im_end|>
<|im_start|>user
{User}
<|im_end|>
<|im_start|>assistant
{Assistant}
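
For scripted use, a prompt in this format can be assembled from a message list. The helper below is illustrative (not part of any library) and leaves the assistant turn open for generation:

def build_chatml_prompt(messages):
    # Render a list of {role, content} dicts as a ChatML prompt.
    prompt = ""
    for m in messages:
        prompt += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    return prompt + "<|im_start|>assistant\n"

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is EXL2 quantization?"},
]
prompt = build_chatml_prompt(messages)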

Model Usage

Example: Inference with ExLlamaV2

To use this quantized model, install the ExLlamaV2 library:

pip install exllamav2

ExLlamaV2 loads models from a local directory rather than directly from the Hub (see Download Instructions below). The snippet targets the 0.2.x dynamic generator API:

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Load model and tokenizer from a local download of this repository
config = ExLlamaV2Config("./calme-3.2-instruct-78b-exl2-4.5bpw")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len=32768, lazy=True)  # 32k context, approx. 44 GB VRAM
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

# Create generator
generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)

# Generate text from a ChatML prompt (see Prompt Template above)
prompt = (
    "<|im_start|>user\n"
    "What is EXL2 quantization?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
response = generator.generate(prompt=prompt, max_new_tokens=256)
print(response)

Features

  • EXL2 is a GPU-only format: it requires NVIDIA hardware, but it runs faster and in less memory than comparable GGUF quantizations.
  • Approx. 44 GB VRAM at a 32,000-token context window; approx. 40 GB minimum at a 1,024-token context (see the cache sketch after this list).
  • Highly optimized for inference, making it well suited to VRAM-constrained setups.
  • Compatible with ChatML-based prompting systems.
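
Context length is the main VRAM lever beyond the weights themselves. As a minimal sketch (reusing the model object from the usage example above; ExLlamaV2 also ships a quantized Q4 KV cache, which reduces cache VRAM further):

from exllamav2 import ExLlamaV2Cache_Q4

# A shorter context and a 4-bit KV cache both cut VRAM use;
# max_seq_len=1024 corresponds to the ~40 GB minimum noted above.
cache = ExLlamaV2Cache_Q4(model, max_seq_len=1024, lazy=True)
model.load_autosplit(cache)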


Download Instructions

To download the model files:

pip install -U "huggingface_hub[cli]"
huggingface-cli login
huggingface-cli download DavidCatalano/calme-3.2-instruct-78b-exl2-4.5bpw --include "*" --local-dir ./local-folder
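
Alternatively, the same download can be scripted with the huggingface_hub Python API:

from huggingface_hub import snapshot_download

# Fetch all repository files into ./local-folder
snapshot_download(
    repo_id="DavidCatalano/calme-3.2-instruct-78b-exl2-4.5bpw",
    local_dir="./local-folder",
)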
