---
license: llama3
---

# Meta-Llama-3-8B-Instruct-ct2-int8

This is a [CTranslate2](https://github.com/OpenNMT/CTranslate2) v4.5.0 int8 conversion of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/tree/main), created with:

```
ct2-transformers-converter --model meta-llama/Meta-Llama-3-8B-Instruct --output_dir Meta-Llama-3-8B-Instruct-ct2-int8 --quantization int8
```
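
If you run the conversion yourself, a quick sanity check is `ctranslate2.contains_model`, which reports whether a directory holds a valid converted model (this check is an addition here, not part of the original conversion step):

```python
import ctranslate2

# True if the directory contains a converted CTranslate2 model
print(ctranslate2.contains_model("Meta-Llama-3-8B-Instruct-ct2-int8"))
```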

## Downloading

CTranslate2 does not integrate with the Hugging Face Hub directly, so you'll need to download the model files manually:

```
huggingface-cli download mike-ravkine/Meta-Llama-3-8B-Instruct-ct2-int8 --local-dir Meta-Llama-3-8B-Instruct-ct2-int8/
```

## Using

Install the dependencies:

```
pip install "transformers[torch]" ctranslate2
```

Sample inference code:

```python
import sys
import ctranslate2
from transformers import AutoTokenizer

model_dir = sys.argv[1]  # local download directory from the previous step
tokenizer_dir = "meta-llama/Meta-Llama-3-8B-Instruct"  # tokenizer comes from the original repo

print("Loading the model...")
generator = ctranslate2.Generator(model_dir, device="cuda")  # use device="cpu" or "auto" if no GPU
tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)

dialog = [{"role": "user", "content": "What is the meaning of life, the universe and everything?"}]
max_generation_length = 512

prompt_string = tokenizer.apply_chat_template(dialog, add_generation_prompt=True, tokenize=False)
# tokenize=False followed by tokenize() looks redundant, but tokenize=True returns
# token ids, while ctranslate2's generate_tokens expects the token strings themselves
prompt_tokens = tokenizer.tokenize(prompt_string)

step_results = generator.generate_tokens(
    prompt_tokens,
    max_length=max_generation_length,
    sampling_temperature=0.6,
    sampling_topk=20,
    sampling_topp=1,
)
for step_result in step_results:
    # Decode and print each token as soon as it is generated
    word = tokenizer.decode([step_result.token_id])
    print(word, end="", flush=True)
```
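
Save the script as, say, `generate.py` (the name is arbitrary) and pass the download directory as its argument:

```
python generate.py Meta-Llama-3-8B-Instruct-ct2-int8/
```

As written, the loop streams up to `max_generation_length` tokens and may run past the end of the answer. Llama 3 marks the end of an assistant turn with the `<|eot_id|>` token, so a variant of the loop that stops there is probably what you want (a sketch, assuming `GenerationStepResult` exposes the token string via its `token` field):

```python
for step_result in step_results:
    # Stop once the model emits Llama 3's end-of-turn token
    if step_result.token == "<|eot_id|>":
        break
    print(tokenizer.decode([step_result.token_id]), end="", flush=True)
print()
```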