What is the base precision type (FP32/FP16) used in Q2/Q1 quantization?

#23 opened by ArYuZzz1

Hi @Unsloth team,
Thanks for your contribution! I'm curious about the base precision type used in your Q2/Q1 quantization.
For reasons specific to my deployment, I have to rely on a large number of potato servers for CPU inference. These machines run either Xeon-2 or Zen-1 CPUs (more specifically, Hygon CPUs), which only support up to AVX2.
Here's the tricky part: AVX2 arithmetic only operates on FP32/FP64. As a result, anything stored at lower precision first has to be upcast to FP32 before computation, which significantly increases memory traffic and reduces inference speed.
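
To make the cost concrete, here is a minimal sketch of the kind of hot loop I believe these CPUs end up running (my own illustration, not your code; it assumes AVX2 + FMA + F16C, compiled with `-mavx2 -mfma -mf16c`, and `n` a multiple of 8):

```c
#include <immintrin.h>
#include <stdint.h>

/* Dot product of FP16 weights against FP32 activations. AVX2 has no FP16
 * arithmetic, so each group of 8 half floats must first be widened to
 * FP32 (via F16C) before the FMA can run. */
static float dot_fp16_fp32(const uint16_t *w_half, const float *x, int n)
{
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        /* Load 8 packed FP16 values and upcast them to 8 FP32 lanes. */
        __m128i h = _mm_loadu_si128((const __m128i *)(w_half + i));
        __m256  w = _mm256_cvtph_ps(h);
        /* All arithmetic happens in FP32; the conversion above is the
         * extra work that pure-FP32 weights would not pay. */
        acc = _mm256_fmadd_ps(w, _mm256_loadu_ps(x + i), acc);
    }
    /* Horizontal sum of the 8 FP32 accumulator lanes. */
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}
```
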
In this case, if the Q2/Q1 models use FP32 as the base (compute) precision, they should run well on my servers. Could you clarify this for me?
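
For reference, this is the kind of dequantization I am imagining, purely as an assumption on my side (the block layout, names, and sizes here are hypothetical and not your actual Q2 format): even with 2-bit storage, the scale and output could stay in FP32 end to end.

```c
#include <stdint.h>

/* Hypothetical 2-bit block layout, for illustration only -- not the real
 * Unsloth/llama.cpp Q2 format. The point: even if weights are stored in
 * 2 bits, the scale can be kept in FP32 so the math stays in FP32. */
typedef struct {
    float   scale;  /* per-block scale, assumed FP32 here */
    uint8_t qs[8];  /* 32 weights packed 4-per-byte, 2 bits each */
} block_q2_demo;

/* Dequantize one block of 32 weights directly into an FP32 buffer. */
static void dequant_q2_demo(const block_q2_demo *b, float *out)
{
    for (int i = 0; i < 32; i++) {
        int q = (b->qs[i / 4] >> (2 * (i % 4))) & 0x3; /* 2-bit code 0..3 */
        out[i] = b->scale * (float)(q - 2);            /* recenter, scale */
    }
}
```
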
