Llama 3 70B Instruct quantized for use with MLC-LLM

MLC-LLM is an Apache TVM-based inference framework with a neat trick: of all the frameworks that support tensor-parallel inference, MLC-LLM is by far the easiest to install for single-user inference. Tensor parallelism gives near-linear performance scaling on 2x 3090, 2x 4090, or 2x 7900 XTX, reaching about 30 tokens per second, which is notably faster than a single 48GB card that costs far more.

MLC-LLM requires Ubuntu 22.04 or above (the prebuilt wheels will not work on Ubuntu 20.04). For CUDA users:

python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cu122 mlc-ai-nightly-cu122

git clone https://huggingface.co/bayley/Meta-Llama-3-70B-Instruct-q4f16_1-MLC

mlc_llm compile Meta-Llama-3-70B-Instruct-q4f16_1-MLC/mlc-chat-config.json --device cuda --overrides "tensor_parallel_shards=<number of gpus>" -o Meta-Llama-3-70B-Instruct-q4f16_1-cuda.so
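
For example, on a dual-GPU setup such as the 2x 3090 configuration mentioned above, the override would be "tensor_parallel_shards=2" (illustrative value; substitute your own GPU count):

mlc_llm compile Meta-Llama-3-70B-Instruct-q4f16_1-MLC/mlc-chat-config.json --device cuda --overrides "tensor_parallel_shards=2" -o Meta-Llama-3-70B-Instruct-q4f16_1-cuda.so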

mlc_llm serve Meta-Llama-3-70B-Instruct-q4f16_1-MLC --model-lib Meta-Llama-3-70B-Instruct-q4f16_1-cuda.so --host 0.0.0.0

This should start an OpenAI-compatible REST server serving a chat completions endpoint that you can connect to with your favorite frontend.
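
As a quick sanity check, you can query the chat completions endpoint directly. The sketch below assumes the server is listening on MLC-LLM's default port 8000 and that the model is addressed by its directory name; adjust either if your setup differs:

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "Meta-Llama-3-70B-Instruct-q4f16_1-MLC", "messages": [{"role": "user", "content": "Hello!"}]}'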
