Comparison to older-type AWQ quants?

#1
by nfunctor - opened

Hi, forgive me for bothering you again, but I'd like to understand more about the bfloat16-weight AWQ since it is new.

For this Mistral model there are two AWQ quants: yours with bfloat16 weights, and a float16 version (stelterlab/Mistral-Small-24B-Instruct-2501-AWQ) that resembles the older, standard AWQ quants. I have tested both and am not sure what to make of their performance: the float16 version answered some maths questions better, but your bfloat16 version handled long-context questions better, and its responses seemed closer to the full-precision model's.

All this is probably normal, but is there anything systematic that can be said about choosing bfloat16 versus float16 in the context of your quantisation method? Or does it come down to the same overflow/underflow considerations as when recasting non-quantised bfloat16 models? Many thanks!
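For context on the overflow/underflow point: the two 16-bit formats trade exponent bits (dynamic range) for mantissa bits (precision), which is what makes a bfloat16 → float16 recast potentially lossy. The sketch below is not specific to AWQ or this model; it just derives the largest finite value and the machine epsilon of each format from their bit layouts using only the standard library.

```python
# Why casting bfloat16 values to float16 can overflow: the formats trade
# exponent bits (range) for mantissa bits (precision). The numbers below
# are derived from the IEEE-style format definitions, not measured.

def fmt_props(exp_bits: int, man_bits: int):
    """Largest finite value and machine epsilon for a binary float format."""
    max_exp = 2 ** (exp_bits - 1) - 1            # exponent after bias
    largest = (2 - 2 ** -man_bits) * 2 ** max_exp
    epsilon = 2 ** -man_bits                     # gap just above 1.0
    return largest, epsilon

fp16_max, fp16_eps = fmt_props(exp_bits=5, man_bits=10)  # IEEE half precision
bf16_max, bf16_eps = fmt_props(exp_bits=8, man_bits=7)   # bfloat16

print(f"float16  max = {fp16_max:.4g}, eps = {fp16_eps}")
print(f"bfloat16 max = {bf16_max:.4g}, eps = {bf16_eps}")
# float16 tops out at 65504, so any bfloat16 value above that becomes inf
# after a bf16 -> fp16 recast; bfloat16 keeps float32's exponent range but
# carries three fewer bits of precision per value.
```

So the usual rule of thumb is: float16 gives finer-grained values but can overflow on bfloat16-trained weights or activations, while bfloat16 is range-safe but coarser, which may explain the mixed results between the two quants.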
