GPT-Vision

A very small vision-language model, similar to LLaVA and Moondream. It combines three components into one model:

  • GPT-2
  • ViT-224
  • Multimodal projector
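
This card does not describe how the three parts are wired together. A common pattern in LLaVA-style models, assumed here rather than confirmed for GPT-Vision, is that the projector maps the vision encoder's patch features into the language model's embedding space and the projected image tokens are placed in front of the text token embeddings. The sketch below illustrates only that idea, using plain Python lists and toy dimensions; the function names and sizes are hypothetical and are not taken from the model's code.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Create (weight, bias) for a toy linear layer; weight has out_dim rows of in_dim values."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def apply_linear(tokens, weight, bias):
    """Apply y = W x + b to every token vector in the sequence."""
    return [
        [sum(w * x for w, x in zip(row, tok)) + b for row, b in zip(weight, bias)]
        for tok in tokens
    ]


def build_multimodal_sequence(image_feats, text_embeds, weight, bias):
    """Project image features into the text embedding space and prepend them to the text."""
    projected = apply_linear(image_feats, weight, bias)
    return projected + text_embeds


# Toy sizes for illustration only (assumed, far smaller than any real model).
rng = random.Random(0)
VISION_DIM, TEXT_DIM = 6, 4
NUM_PATCHES, NUM_TEXT_TOKENS = 5, 3

# Stand-ins for the vision encoder output and the language model's token embeddings.
image_feats = [[rng.random() for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeds = [[rng.random() for _ in range(TEXT_DIM)] for _ in range(NUM_TEXT_TOKENS)]
weight, bias = make_linear(VISION_DIM, TEXT_DIM, rng)

sequence = build_multimodal_sequence(image_feats, text_embeds, weight, bias)
print(len(sequence), len(sequence[0]))  # prints: 8 4
```

The language model would then attend over all eight positions, so the text tokens can condition on the image tokens that precede them.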

See the GitHub repository for more information: GPT-Vision-Github

Inference

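from transformers import AutoModelForCausalLM
from PIL import Image

# trust_remote_code=True is needed because the model class is defined in the model repository
model = AutoModelForCausalLM.from_pretrained("damerajee/GPT-Vision", trust_remote_code=True)

# Load the image and make sure it has three RGB channels
image_path = "Your_image_path"
image = Image.open(image_path)
image = image.convert("RGB")

# Ask a question about the image; responses are short, so keep max_new_tokens small
question = "Render a clear and concise summary of the photo."
answer = model.generate(image=image, question=question, max_new_tokens=40)
print("Answer:", answer)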

Limitations

A fair warning: this model can only generate very short responses. It sometimes repeats the same tokens, although it still understands what is in the image.

Further fine-tuning should improve the model.
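
Until the model is fine-tuned further, the repeated tokens can be trimmed after generation. The helper below is a minimal sketch of one way to do that: it removes word sequences that immediately repeat themselves. `collapse_repeats` and its `max_ngram` parameter are hypothetical names and not part of GPT-Vision or transformers, and the helper can also remove repetition that was intended, so treat it as a cosmetic cleanup only.

```python
def collapse_repeats(text, max_ngram=4):
    """Remove immediately repeated word n-grams, keeping the first occurrence."""
    words = text.split()
    # Try longer repeated phrases first, then shorter ones down to single words.
    for n in range(max_ngram, 0, -1):
        out = []
        i = 0
        while i < len(words):
            chunk = words[i:i + n]
            # Skip this chunk if it repeats the n words that were just emitted.
            if len(chunk) == n and out[-n:] == chunk:
                i += n
            else:
                out.append(words[i])
                i += 1
        words = out
    return " ".join(words)


print(collapse_repeats("a dog a dog a dog on grass"))  # prints: a dog on grass
```

It can be applied directly to the string returned by the model, for example `collapse_repeats(answer)`, assuming the answer is plain text.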

Model size: 216M params (Safetensors, tensor type F32)