How to run in fp16?
#5 opened by grim3000
Hi, I'm trying to run the model using 4x T4 GPUs (16 GB each), but I'm encountering the error: "bf16 is only supported on A100+ GPUs".
While I wait for a quota increase to access A100s or 2x A10s, I'm curious how this model can be run with fp16 instead. I've seen some mentions of this in other discussions and on the GitHub repo, but no clear examples.
Separately, are there any plans to update this repo so that the model can be easily deployed on HF Inference Endpoints? At the moment, it seems to require setting up a custom handler, among other things.
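For context, my understanding is that a custom handler means shipping a handler.py that exposes an EndpointHandler class. Below is a minimal sketch of what I think that would look like for this model; the payload keys ("query", "image") and the return format are my own assumptions, not an official schema.

```python
# handler.py -- rough sketch of a custom handler for Inference Endpoints;
# payload keys ("query", "image") and the return shape are assumptions.
import base64
import io
from typing import Any, Dict

import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer


class EndpointHandler:
    def __init__(self, path: str = ""):
        # `path` is the local checkout of the model repository on the endpoint.
        self.tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
        self.model = AutoModelForCausalLM.from_pretrained(
            path or 'THUDM/cogvlm-chat-hf',
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True,
        ).to('cuda').eval()

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        payload = data.get("inputs", data)  # payload layout is an assumption
        query = payload.get("query", "Describe this image")
        image_bytes = base64.b64decode(payload["image"])  # base64-encoded image
        image = Image.open(io.BytesIO(image_bytes)).convert('RGB')

        features = self.model.build_conversation_input_ids(
            self.tokenizer, query=query, history=[], images=[image]
        )
        inputs = {
            'input_ids': features['input_ids'].unsqueeze(0).to('cuda'),
            'token_type_ids': features['token_type_ids'].unsqueeze(0).to('cuda'),
            'attention_mask': features['attention_mask'].unsqueeze(0).to('cuda'),
            'images': [[features['images'][0].to('cuda').to(torch.float16)]],
        }
        with torch.no_grad():
            outputs = self.model.generate(**inputs, max_length=2048, do_sample=False)
            outputs = outputs[:, inputs['input_ids'].shape[1]:]
        return {"generated_text": self.tokenizer.decode(outputs[0])}
```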
Any help is appreciated!
Just change all torch.bfloat16 to torch.float16 in the example:
```python
import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')
model = AutoModelForCausalLM.from_pretrained(
    'THUDM/cogvlm-chat-hf',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to('cuda').eval()

# chat example
query = 'Describe this image'
image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/1.png?raw=true', stream=True).raw).convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.float16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))
```
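One caveat for the 4x T4 setup: the fp16 weights of cogvlm-chat-hf are roughly 35 GB, so they won't fit on a single 16 GB T4 even after the dtype change. A possible workaround is to let accelerate shard the model across the four cards via device_map instead of calling .to('cuda'). This is only a sketch, assuming device_map='auto' splits this remote-code model cleanly; the max_memory values are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')

# Shard the fp16 weights across four 16 GB T4s instead of .to('cuda').
# max_memory leaves headroom on each card for activations (values are illustrative).
model = AutoModelForCausalLM.from_pretrained(
    'THUDM/cogvlm-chat-hf',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map='auto',
    max_memory={i: '13GiB' for i in range(4)},
).eval()

# Build inputs as in the example above, but send the tensors to 'cuda:0'
# (the first shard); generate() moves intermediate states between devices.
```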
chenkq changed discussion status to closed