This repository contains improved Mistral-7B quantized models in GGUF format for use with llama.cpp. The models are fully compatible with the official llama.cpp release and can be used out of the box.

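As a quick illustration, here is a minimal sketch of loading one of these files through the llama-cpp-python bindings (the plain llama.cpp CLI works just as well); the chosen file name, context size, and prompt are only examples:

```python
# Minimal sketch using the llama-cpp-python bindings (pip install llama-cpp-python).
# The model path refers to one of the GGUF files in this repository; adjust as needed.
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-q4km.gguf",  # any of the quantized files listed below
    n_ctx=512,                          # context length used for the perplexity numbers below
)

out = llm("The capital of France is", max_tokens=16)
print(out["choices"][0]["text"])
```
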
The table below compares these models with the current llama.cpp quantization approach, using Wikitext perplexities at a context length of 512 tokens. The "Quantization Error" columns are defined as (PPL(quantized model) - PPL(fp16)) / PPL(fp16).

| Quantization | Model file | PPL (llama.cpp) | Quantization Error (llama.cpp) | PPL (new quants) | Quantization Error (new quants) |
|---|---|---|---|---|---|
| Q3_K_S | mistral-7b-q3ks.gguf | 6.0692 | 6.62% | 6.0021 | 5.44% |
| Q3_K_M | mistral-7b-q3km.gguf | 5.8894 | 3.46% | 5.8489 | 2.75% |
| Q4_K_S | mistral-7b-q4ks.gguf | 5.7764 | 1.48% | 5.7349 | 0.75% |
| Q4_K_M | mistral-7b-q4km.gguf | 5.7539 | 1.08% | 5.7259 | 0.59% |
| Q5_K_S | mistral-7b-q5ks.gguf | 5.7258 | 0.59% | 5.7100 | 0.31% |
| Q4_0 | mistral-7b-q40.gguf | 5.8189 | 2.23% | 5.7924 | 1.76% |
| Q4_1 | mistral-7b-q41.gguf | 5.8244 | 2.32% | 5.7455 | 0.94% |
| Q5_0 | mistral-7b-q50.gguf | 5.7180 | 0.45% | 5.7070 | 0.26% |
| Q5_1 | mistral-7b-q51.gguf | 5.7128 | 0.36% | 5.7057 | 0.24% |
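
For reference, a small sketch of how the "Quantization Error" columns follow from the definition above. The fp16 baseline perplexity of roughly 5.692 is an assumption derived from the listed values (it is not stated explicitly in the table):

```python
def quantization_error(ppl_quantized: float, ppl_fp16: float) -> float:
    """(PPL(quantized model) - PPL(fp16)) / PPL(fp16), returned as a fraction."""
    return (ppl_quantized - ppl_fp16) / ppl_fp16

# fp16 baseline implied by the table values (assumption, not listed explicitly)
PPL_FP16 = 5.6924

# Q4_K_M with the new quants: PPL 5.7259 -> about 0.59%
print(f"{quantization_error(5.7259, PPL_FP16) * 100:.2f}%")
```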

In addition, a 2-bit model is provided (mistral-7b-q2k-extra-small.gguf). Its perplexity is 6.7099 at a context length of 512 and 5.5744 at a context length of 4096 (perplexity naturally decreases as the context grows, so the two values are not directly comparable to each other).
