Quantizing with EXL3

#1
by CarouselAether - opened

Hey there! I saw this 2.0 quant, but when I try to quantize the same model to 4.0 bpw I get:
```
Traceback (most recent call last):
  File "E:\exllamav3\convert.py", line 11, in <module>
    main(_in_args, _job_state)
  File "C:\Users*****\miniconda3\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "E:\exllamav3\exllamav3\conversion\convert_model.py", line 354, in main
    proxy_err = linear.convert_exl3(
                ^^^^^^^^^^^^^^^^^^^^
  File "E:\exllamav3\exllamav3\modules\linear.py", line 226, in convert_exl3
    weight_q, proxy_err, out_tensors = quantize_exl3(
                                       ^^^^^^^^^^^^^^
  File "E:\exllamav3\exllamav3\modules\quant\exl3_lib\quantize.py", line 765, in quantize_exl3
    H, L, su, H_diag = finalize_capture_H(H_data, quant_args, verbose)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\exllamav3\exllamav3\modules\quant\exl3_lib\quantize.py", line 472, in finalize_capture_H
    L, H = block_ldl(H, 16, verbose)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\exllamav3\exllamav3\modules\quant\exl3_lib\quantize.py", line 279, in block_ldl
    raise e
  File "E:\exllamav3\exllamav3\modules\quant\exl3_lib\quantize.py", line 266, in block_ldl
    L = torch.linalg.cholesky(H)
        ^^^^^^^^^^^^^^^^^^^^^^^^
torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite (the leading minor of order 40962 is not positive-definite).
```

Do you know of any solutions to this? I have tested with different CR values and tried several ways to get it to work, but it always fails on layer 16 with that same exact error.
Also, thanks for all the hard work on EXL3.

It does this even on the dev branch, though the dev branch does run about 25% faster for me.

This model has individual layers of about 20 billion parameters each. It's very unwieldy, and I haven't tested quantizing it with every new update, so it's not impossible that something has broken along the way. You can try quantizing with the `--no_out_scales` argument, which might help.
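For background, the traceback shows `torch.linalg.cholesky` rejecting the captured Hessian because it isn't positive definite. A common generic mitigation for this class of failure is to add growing diagonal damping ("jitter") until the factorization succeeds. This is only a sketch of that idea (shown with NumPy for brevity), not exllamav3's actual `block_ldl` code:

```python
import numpy as np

def damped_cholesky(H, max_tries=8):
    # Try a Cholesky factorization; on failure, add increasing diagonal
    # damping until H becomes positive definite. The same idea applies
    # to torch.linalg.cholesky in the traceback above.
    damp = 1e-6 * float(np.mean(np.diag(H)))
    for _ in range(max_tries):
        try:
            return np.linalg.cholesky(H), H  # return L and the (possibly damped) H
        except np.linalg.LinAlgError:
            H = H + damp * np.eye(H.shape[0])
            damp *= 10.0
    raise RuntimeError("matrix could not be regularized to positive definite")

# A rank-deficient matrix fails plain Cholesky but succeeds with damping:
H = np.ones((4, 4))
L, H_reg = damped_cholesky(H.copy())
```

Note that damping trades a small amount of accuracy in the quantization objective for numerical stability, which is why libraries tend to keep the jitter as small as possible.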

What bitrate are you targeting and how far does it get before it fails?

I am shooting for 4.0 bpw.
It reaches layer 16 like clockwork and then gives me that error.
I'll test the `--no_out_scales` argument.
