This repository contains alternative quantized models of Mixtral-8x7B-Instruct-v0.1 (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) in GGUF format for use with llama.cpp. The models are fully compatible with the official llama.cpp release and work out of the box.
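
As a quick illustration, here is a minimal sketch of loading one of these files through the third-party llama-cpp-python bindings (an assumption on my part; any GGUF-compatible llama.cpp build or binding should work the same way, and the plain llama.cpp CLI works just as well). The file name is taken from the table further down:

```python
# Minimal sketch, assuming the llama-cpp-python bindings are installed
# (e.g. `pip install llama-cpp-python`). Not the only way to run these models.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-instruct-8x7b-q4k-medium.gguf",  # one of the files from the table below
    n_ctx=512,  # context length matching the perplexity measurements below
)

# Mixtral-instruct expects the [INST] ... [/INST] prompt format.
out = llm("[INST] Explain GGUF quantization in one sentence. [/INST]", max_tokens=64)
print(out["choices"][0]["text"])
```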

I'm careful to say "alternative" rather than "better" or "improved" because I have not put any effort into evaluating performance differences in actual usage. Perplexity is lower than with the "official" llama.cpp quantization, but perplexity is not necessarily a good measure of real-world performance. Nevertheless, perplexity does measure quantization error, so the table below compares the perplexities of these quantized models with the current llama.cpp quantization approach on Wikitext at a context length of 512 tokens. The "Quantization Error" columns are defined as (PPL(quantized model) - PPL(fp16)) / PPL(fp16).

| Quantization | Model file | PPL (llama.cpp) | Quantization error (llama.cpp) | PPL (new quants) | Quantization error (new quants) |
|---|---|---|---|---|---|
| Q2_K | mixtral-instruct-8x7b-q2k.gguf | 6.8953 | 56.4% | 5.2679 | 19.5% |
| Q3_K_S | mixtral-instruct-8x7b-q3k-small.gguf | 4.7038 | 6.68% | 4.6401 | 5.24% |
| Q3_K_M | mixtral-instruct-8x7b-q3k-medium.gguf | 4.6663 | 5.83% | 4.5608 | 3.44% |
| Q4_K_S | mixtral-instruct-8x7b-q4k-small.gguf | 4.5105 | 2.30% | 4.4630 | 1.22% |
| Q4_K_M | mixtral-instruct-8x7b-q4k-medium.gguf | 4.5105 | 2.30% | 4.4568 | 1.08% |
| Q5_K_S | mixtral-instruct-8x7b-q5k-small.gguf | 4.4402 | 0.71% | 4.4277 | 0.42% |
| Q4_0 | mixtral-instruct-8x7b-q40.gguf | 4.5102 | 2.29% | 4.4908 | 1.85% |
| Q4_1 | mixtral-instruct-8x7b-q41.gguf | 4.5415 | 3.00% | 4.4612 | 1.18% |
| Q5_0 | mixtral-instruct-8x7b-q50.gguf | 4.4361 | 0.61% | 4.4297 | 0.47% |
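
To make the "Quantization Error" definition concrete, here is a small sketch of the calculation. The fp16 perplexity used below is a placeholder, not a measured number; the quantized perplexity is the Q4_K_M (new quants) value from the table above:

```python
def quantization_error(ppl_quant: float, ppl_fp16: float) -> float:
    """Relative PPL increase: (PPL(quantized) - PPL(fp16)) / PPL(fp16)."""
    return (ppl_quant - ppl_fp16) / ppl_fp16

ppl_fp16 = 4.41       # placeholder value, NOT a measured fp16 perplexity
ppl_q4_k_m = 4.4568   # Q4_K_M (new quants) from the table above
print(f"{quantization_error(ppl_q4_k_m, ppl_fp16):.2%}")
```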