eralFlare committed · Commit 6843232 (verified) · 1 parent: 50bcb93

Update README.md

Files changed (1):
  1. README.md (+4 -5)
README.md CHANGED
@@ -18,7 +18,8 @@ license_link: https://ai.google.dev/gemma/terms
 ---
 
 # Gemma Model Card
-This model card is copied from the original [google/gemma-2b-it](https://huggingface.co/google/gemma-2b-it) with edits to the code snippets on how to run this auto-gptq quantized version of the model. This auto-gptq quantized version of the model had only been tested to work on cuda GPU.
+This model card is copied from the original [google/gemma-2b-it](https://huggingface.co/google/gemma-2b-it) with edits to the code snippets on how to run this auto-gptq quantized version of the model.
+This auto-gptq quantized version of the model had only been tested to work on cuda GPU. This quantized model utilise approximately 2.6GB of VRAM.
 
 **Model Page**: [Gemma](https://ai.google.dev/gemma/docs)
 
@@ -67,7 +68,7 @@ model = AutoModelForCausalLM.from_pretrained("eralFlare/gemma-2b-it", device_map
 input_text = "Write me a poem about Machine Learning."
 input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
 
-outputs = model.generate(**input_ids)
+outputs = model.generate(**input_ids, max_new_tokens=1024)
 print(tokenizer.decode(outputs[0]))
 ```
 
@@ -84,14 +85,12 @@ from transformers import AutoTokenizer, AutoModelForCausalLM
 import transformers
 import torch
 
-model_id = "gg-hf/gemma-2b-it"
-dtype = torch.bfloat16
+model_id = "eralFlare/gemma-2b-it"
 
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 model = AutoModelForCausalLM.from_pretrained(
     model_id,
     device_map="cuda",
-    torch_dtype=dtype,
 )
 
 chat = [
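
For reference, a minimal sketch of how the updated README snippets read once this commit is applied, combined into one script. The chat-template portion (including the example user message) and the requirement for the optimum and auto-gptq packages are assumptions carried over from the original gemma-2b-it model card and transformers' GPTQ integration, not lines shown in this diff.

```python
# Sketch of the post-commit snippets combined; assumes a CUDA GPU and that the
# GPTQ checkpoint loads via transformers' GPTQ integration (optimum + auto-gptq installed).
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "eralFlare/gemma-2b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",  # quantized weights occupy roughly 2.6 GB of VRAM per the updated card
)

# Plain-prompt generation, as in the first updated snippet
input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids, max_new_tokens=1024)
print(tokenizer.decode(outputs[0]))

# Chat-template generation; the message below is illustrative, taken from the original card
chat = [
    {"role": "user", "content": "Write a hello world program"},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0]))
```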