Llama 3 70B Instruct quantized for use with MLC-LLM

MLC-LLM is an Apache TVM-based inference framework with a neat trick: of all the frameworks that support tensor-parallel inference, MLC-LLM is by far the easiest to install for single-user inference. Tensor parallelism gives near-linear performance scaling on 2x 3090, 2x 4090, or 2x 7900 XTX, reaching about 30 tokens per second, which is notably faster than a single 48GB card that costs far more.

MLC-LLM requires Ubuntu 22.04 or above (the prebuilt wheels will not work on Ubuntu 20.04). For CUDA users:

python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu122 mlc-ai-nightly-cu122

git clone https://huggingface.co/bayley/Meta-Llama-3-70B-Instruct-q4f16_1-MLC

mlc_llm compile Meta-Llama-3-70B-Instruct-q4f16_1-MLC/mlc-chat-config.json --device cuda --overrides "tensor_parallel_shards=<number of gpus>" -o Meta-Llama-3-70B-Instruct-q4f16_1-cuda.so
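
For example, on a dual-GPU setup such as the 2x 3090 configuration mentioned above, the override would be "tensor_parallel_shards=2" (illustrative value; substitute your own GPU count):

mlc_llm compile Meta-Llama-3-70B-Instruct-q4f16_1-MLC/mlc-chat-config.json --device cuda --overrides "tensor_parallel_shards=2" -o Meta-Llama-3-70B-Instruct-q4f16_1-cuda.so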

mlc_llm serve Meta-Llama-3-70B-Instruct-q4f16_1-MLC --model-lib Meta-Llama-3-70B-Instruct-q4f16_1-cuda.so --host 0.0.0.0

This should start an OpenAI-compatible REST server serving a chat completions endpoint that you can connect to with your favorite frontend.
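
As a quick sanity check, you can query the chat completions endpoint directly. The sketch below assumes the server is listening on MLC-LLM's default port 8000 and that the model is addressed by its directory name; adjust either if your setup differs:

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "Meta-Llama-3-70B-Instruct-q4f16_1-MLC", "messages": [{"role": "user", "content": "Hello!"}]}'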
