This repository contains 2-bit quantized LLaMA-v2 models in GGUF format for use with llama.cpp. All tensors are quantized with Q2_K, except for output.weight, which uses Q6_K, and, in the case of LLaMA-v2-70B, the attn_v tensors, which use Q4_K. The quantized models differ from the standard llama.cpp 2-bit quantization in two ways:

  • They are genuinely 2-bit quantized throughout, whereas the standard llama.cpp Q2_K method produces a mostly 3-bit quantization (see the inspection sketch after this list)
  • The models were prepared with a refined (but not yet published) k-quants quantization approach
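
The per-tensor layout described above can be checked directly from a GGUF file. Below is a minimal sketch using the gguf Python package maintained in the llama.cpp repository; the filename is a hypothetical placeholder:

```python
# pip install gguf  -- the Python helper package from the llama.cpp repo
from gguf import GGUFReader

# Hypothetical filename; substitute the actual .gguf file from this repo.
reader = GGUFReader("llama-2-13b.Q2_K.gguf")

for tensor in reader.tensors:
    # tensor_type is a GGMLQuantizationType enum (Q2_K, Q4_K, Q6_K, ...)
    print(f"{tensor.name}: {tensor.tensor_type.name}")
```

For the models in this repository, the output should list Q2_K for nearly every tensor, with output.weight reported as Q6_K (and attn_v as Q4_K on the 70B variant).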
  • Format: GGUF
  • Model size: 13B params
  • Architecture: llama
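
Since the card targets llama.cpp, here is a minimal loading and generation sketch. It assumes the llama-cpp-python bindings rather than the llama.cpp CLI, and the filename is again a hypothetical placeholder:

```python
# pip install llama-cpp-python  -- Python bindings for llama.cpp
from llama_cpp import Llama

# Hypothetical filename; use the actual .gguf file downloaded from this repo.
llm = Llama(model_path="llama-2-13b.Q2_K.gguf", n_ctx=2048)

# Run a short completion to confirm the 2-bit model loads and generates.
out = llm("Q: What is 2-bit quantization? A:", max_tokens=64)
print(out["choices"][0]["text"])
```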