ikawrakow/mistral-instruct-7b-quantized-gguf

This repository contains alternative quantizations of Mistral-7B-Instruct-v0.2 (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) in GGUF format for use with llama.cpp. The models are fully compatible with the official llama.cpp release and can be used out of the box.

I'm careful to say "alternative" rather than "better" or "improved" because I have not put any effort into evaluating performance differences in actual usage. Perplexity is lower than with the "official" llama.cpp quantization, but perplexity is not necessarily a good measure of real-world performance. Nevertheless, perplexity does measure quantization error, so the table below compares the perplexities of these quantized models with those of the current llama.cpp quantization approach on Wikitext at a context length of 512 tokens. Each "Quantization Error" column in the table is defined as (PPL(quantized model) - PPL(fp16))/PPL(fp16), computed for the PPL column immediately to its left.

Quantization	Model file	PPL (llama.cpp)	Quantization Error (llama.cpp)	PPL (new quants)	Quantization Error (new quants)
Q3_K_S	mistral-instruct-7b-q3k-small.gguf	6.9959	4.27%	6.8920	2.72%
Q3_K_M	mistral-instruct-7b-q3k-medium.gguf	6.8892	2.68%	6.8089	1.48%
Q4_K_S	mistral-instruct-7b-q4k-small.gguf	6.7649	0.82%	6.7351	0.38%
Q5_K_S	mistral-instruct-7b-q5k-small.gguf	6.7197	0.15%	6.7186	0.13%
Q4_0	mistral-instruct-7b-q40.gguf	6.7728	0.94%	6.7191	0.14%
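
For readers who want to reproduce the "Quantization Error" columns, here is a minimal Python sketch of the formula above. The fp16 perplexity is not listed in this README, so the `ASSUMED_FP16_PPL` value below is an assumption: it was back-solved from the table rows and should be replaced with a measured fp16 perplexity if you have one. The function name `quantization_error` is also just illustrative.

```python
def quantization_error(ppl_quantized: float, ppl_fp16: float) -> float:
    """Relative perplexity increase of a quantized model over the fp16 baseline."""
    if ppl_fp16 <= 0:
        raise ValueError("fp16 perplexity must be positive")
    return (ppl_quantized - ppl_fp16) / ppl_fp16


# Assumption: fp16 baseline inferred from the table, NOT stated in this README.
ASSUMED_FP16_PPL = 6.7096

# Q3_K_S row from the table: PPL with llama.cpp and PPL with the new quants.
ppl_llama_cpp, ppl_new = 6.9959, 6.8920

for label, ppl in (("llama.cpp", ppl_llama_cpp), ("new quants", ppl_new)):
    err = quantization_error(ppl, ASSUMED_FP16_PPL)
    # Format as a percentage with two decimals, matching the table's style.
    print(f"Q3_K_S {label}: {err:.2%}")
```

Because the error is relative to the fp16 perplexity, small changes in the assumed baseline shift all rows slightly, so treat the printed values as approximate.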