This repository contains alternative quantized models of Mixtral-8x7B-Instruct-v0.1 (https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1) in GGUF format for use with llama.cpp. The models are fully compatible with the official llama.cpp release and work out of the box.
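
As a quick illustration, here is a minimal sketch of loading one of these files through the third-party llama-cpp-python bindings (an assumption on my part; any GGUF-compatible llama.cpp build or binding should work the same way, and the plain llama.cpp CLI works just as well). The file name is taken from the table further down:

```python
# Minimal sketch, assuming the llama-cpp-python bindings are installed
# (e.g. `pip install llama-cpp-python`). Not the only way to run these models.
from llama_cpp import Llama

llm = Llama(
    model_path="mixtral-instruct-8x7b-q4k-medium.gguf",  # one of the files from the table below
    n_ctx=512,  # context length matching the perplexity measurements below
)

# Mixtral-instruct expects the [INST] ... [/INST] prompt format.
out = llm("[INST] Explain GGUF quantization in one sentence. [/INST]", max_tokens=64)
print(out["choices"][0]["text"])
```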

I'm careful to say "alternative" rather than "better" or "improved" because I have not put any effort into evaluating performance differences in actual usage. Perplexity is lower than with the "official" llama.cpp quantization, but perplexity is not necessarily a good measure of real-world performance. Nevertheless, perplexity does measure quantization error, so the table below compares the perplexities of these quantized models with the current llama.cpp quantization approach on Wikitext at a context length of 512 tokens. The "Quantization Error" columns are defined as (PPL(quantized model) - PPL(fp16)) / PPL(fp16).

| Quantization | Model file | PPL (llama.cpp) | Quantization error (llama.cpp) | PPL (new quants) | Quantization error (new quants) |
|---|---|---|---|---|---|
| Q2_K | mixtral-instruct-8x7b-q2k.gguf | 6.8953 | 56.4% | 5.2679 | 19.5% |
| Q3_K_S | mixtral-instruct-8x7b-q3k-small.gguf | 4.7038 | 6.68% | 4.6401 | 5.24% |
| Q3_K_M | mixtral-instruct-8x7b-q3k-medium.gguf | 4.6663 | 5.83% | 4.5608 | 3.44% |
| Q4_K_S | mixtral-instruct-8x7b-q4k-small.gguf | 4.5105 | 2.30% | 4.4630 | 1.22% |
| Q4_K_M | mixtral-instruct-8x7b-q4k-medium.gguf | 4.5105 | 2.30% | 4.4568 | 1.08% |
| Q5_K_S | mixtral-instruct-8x7b-q5k-small.gguf | 4.4402 | 0.71% | 4.4277 | 0.42% |
| Q4_0 | mixtral-instruct-8x7b-q40.gguf | 4.5102 | 2.29% | 4.4908 | 1.85% |
| Q4_1 | mixtral-instruct-8x7b-q41.gguf | 4.5415 | 3.00% | 4.4612 | 1.18% |
| Q5_0 | mixtral-instruct-8x7b-q50.gguf | 4.4361 | 0.61% | 4.4297 | 0.47% |
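
To make the "Quantization Error" definition concrete, here is a small sketch of the calculation. The fp16 perplexity used below is a placeholder, not a measured number; the quantized perplexity is the Q4_K_M (new quants) value from the table above:

```python
def quantization_error(ppl_quant: float, ppl_fp16: float) -> float:
    """Relative PPL increase: (PPL(quantized) - PPL(fp16)) / PPL(fp16)."""
    return (ppl_quant - ppl_fp16) / ppl_fp16

ppl_fp16 = 4.41       # placeholder value, NOT a measured fp16 perplexity
ppl_q4_k_m = 4.4568   # Q4_K_M (new quants) from the table above
print(f"{quantization_error(ppl_q4_k_m, ppl_fp16):.2%}")
```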