Comparison to older-type AWQ quants?
Hi, forgive me for bothering you again, but I'd like to understand more about the `bfloat16`-weight AWQ since it is new.

For this Mistral there is your quant with `bfloat16` weights and a `float16` version (stelterlab/Mistral-Small-24B-Instruct-2501-AWQ) that resembles older, standard AWQ quants. I have run tests with both and am not sure what to make of their performance: the `float16` version answered some maths questions better, but your `bfloat16` version was better at long-context questions, and its responses seemed closer to those of the full model.

All this is probably normal, but is there something systematic that can be said about choosing `bfloat16` vs `float16` in the context of your quantisation method? Or are the considerations exactly the same as the usual overflow/underflow issues when recasting non-quantised `bfloat16` models? Many thanks!
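To be concrete about the overflow/underflow trade-off I have in mind, here is a small sketch (using NumPy, which has no native `bfloat16`, so I emulate it by zeroing the low 16 bits of a `float32` — plain truncation rather than the round-to-nearest a real conversion would use):

```python
import numpy as np

def to_bfloat16(x: float) -> float:
    """Emulate bfloat16: keep float32's 8 exponent bits and only the top
    7 mantissa bits by zeroing the low 16 bits (truncation, for brevity)."""
    a = np.array([x], dtype=np.float32)
    bits = a.view(np.uint32)     # shares memory with `a`
    bits &= 0xFFFF0000           # drop the low 16 mantissa bits in place
    return float(a[0])

# Range: float16's largest finite value is 65504; bfloat16 keeps the
# full float32 exponent range, so large weights survive the recast.
big = 70000.0
print(np.float16(big))     # overflows to inf
print(to_bfloat16(big))    # stays finite, just coarsely quantised

# Precision: float16 has 10 mantissa bits, bfloat16 only 7, so small
# increments near 1.0 vanish under bfloat16 but not under float16.
x = 1.001
print(float(np.float16(x)))  # resolved (spacing near 1.0 is 2**-10)
print(to_bfloat16(x))        # truncates back to 1.0 (spacing is 2**-7)
```

So my naive picture is: `float16` buys precision on well-scaled values at the risk of overflowing outliers, while `bfloat16` keeps the range but loses mantissa bits — I'm asking whether your quantisation changes that calculus.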