Update README.md #4
opened by Motit

README.md CHANGED
@@ -52,7 +52,7 @@ You also have to have the model on a CUDA device.
 
 The recommended way to perform efficient inference with Jamba Large 1.6 is using [vLLM](https://docs.vllm.ai/en/latest/). First, make sure to install vLLM (version 0.5.5 or higher is required)
 ```bash
-pip install vllm>=0.5.5
+pip install "vllm>=0.5.5"
 ```
 
 Jamba Large 1.6 is too large to be loaded in full (FP32) or half (FP16/BF16) precision on a single node of 8 80GB GPUs. Therefore, quantization is required. We've developed an innovative and efficient quantization technique, [ExpertsInt8](https://www.ai21.com/blog/announcing-jamba-model-family#:~:text=Like%20all%20models%20in%20its%20size%20class%2C%20Jamba%201.6%20Large%20can%E2%80%99t%20be%20loaded%20in%20full%20(FP32)%20or%20half%20(FP16/BF16)%20precision%20on%20a%20single%20node%20of%208%20GPUs.%20Dissatisfied%20with%20currently%20available%20quantization%20techniques%2C%20we%20developed%20ExpertsInt8%2C%20a%20novel%20quantization%20technique%20tailored%20for%20MoE%20models.), designed for MoE models deployed in vLLM, including Jamba models. Using it, you'll be able to deploy Jamba Large 1.6 on a single node of 8 80GB GPUs.
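For context, a minimal sketch of what the ExpertsInt8 deployment described in the touched paragraph looks like with vLLM's offline `LLM` API; the repo ID (`ai21labs/AI21-Jamba-Large-1.6`), prompt, and sampling settings are illustrative assumptions and not part of this diff:

```python
# Sketch: Jamba Large 1.6 on a single node of 8 80GB GPUs with vLLM,
# using ExpertsInt8 quantization for the MoE layers.
from vllm import LLM, SamplingParams

llm = LLM(
    model="ai21labs/AI21-Jamba-Large-1.6",  # assumed Hugging Face repo ID
    tensor_parallel_size=8,                 # shard the model across the node's 8 GPUs
    quantization="experts_int8",            # ExpertsInt8 quantization in vLLM
)

sampling_params = SamplingParams(temperature=0.4, max_tokens=128)
outputs = llm.generate(["Explain ExpertsInt8 quantization in one sentence."], sampling_params)
print(outputs[0].outputs[0].text)
```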