sashakunitsyn/vlrm-blip2-opt-2.7b

visual-question-answering

image-captioning


sashakunitsyn committed on Apr 2, 2024

Commit 7db57fe · verified · 1 Parent(s): d4d23bb

Update README.md

Files changed (1)

README.md +66 -1

@@ -9,4 +9,69 @@ tags:
 - image-to-text
 - image-captioning
 base_model: Salesforce/blip2-opt-2.7b
----
+---
+# VLRM
+This repository contains the fine-tuned weights of BLIP-2. You can find the code in the [GitHub Repository](https://github.com/TODO).
+# Running the model
+## Option 1
+<details>
+<summary> Load the whole model from this repo </summary>
+```python
+import torch
+import requests
+from PIL import Image
+from transformers import Blip2Processor, Blip2ForConditionalGeneration
+processor = Blip2Processor.from_pretrained("sashakunitsyn/vlrm-blip2-opt-2.7b")
+model = Blip2ForConditionalGeneration.from_pretrained("sashakunitsyn/vlrm-blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto")
+img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
+raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
+inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
+out = model.generate(**inputs, max_new_tokens=60)
+processor.decode(out[0], skip_special_tokens=True).strip()
+>>> 'a woman in a plaid shirt shaking hands with a yellow labrador retriever sitting on the ground at sunset on a beach in florida'
+```
+</details>
+## Option 2
+Since the fine-tuned weights make up only a small part of the whole model, you can load just the necessary weights on top of the original model.
+<details>
+<summary> Step 1. Load the original model </summary>
+```python
+import torch
+import requests
+from PIL import Image
+from transformers import Blip2Processor, Blip2ForConditionalGeneration
+processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
+model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto")
+img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
+raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
+inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)
+out = model.generate(**inputs, max_new_tokens=60)
+processor.decode(out[0], skip_special_tokens=True).strip()
+>>> 'a woman sitting on the beach with a dog'
+```
+</details>
+<details>
+<summary> Step 2. Load the RL-tuned weights </summary>
+```python
+from huggingface_hub import hf_hub_download
+finetuned_weights_state_dict = torch.load(hf_hub_download(repo_id="sashakunitsyn/vlrm-blip2-opt-2.7b", filename="vlrm-blip2-opt-2.7b.pt"))
+model.load_state_dict(finetuned_weights_state_dict, strict=False)
+out = model.generate(**inputs, max_new_tokens=60)
+processor.decode(out[0], skip_special_tokens=True).strip()
+>>> 'a woman in a plaid shirt shaking hands with a yellow labrador retriever sitting on the ground at sunset on a beach in florida'
+```
+</details>
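
Editorial note (not part of the commit): Step 2 relies on `load_state_dict(..., strict=False)` accepting a state dict that covers only some of the model's parameters. The sketch below illustrates that behavior on a toy module. `TinyModel` and its `frozen` and `tuned` layers are hypothetical names chosen for illustration and say nothing about which BLIP-2 parameters the checkpoint actually contains.

```python
import torch
from torch import nn


class TinyModel(nn.Module):
    """Hypothetical stand-in for a large model with a small tuned part."""

    def __init__(self):
        super().__init__()
        self.frozen = nn.Linear(4, 4)  # plays the role of untouched weights
        self.tuned = nn.Linear(4, 2)   # plays the role of fine-tuned weights


model = TinyModel()
frozen_before = model.frozen.weight.detach().clone()

# A partial state dict: it holds only the parameters that were fine-tuned.
partial_state_dict = {
    "tuned.weight": torch.ones(2, 4),
    "tuned.bias": torch.zeros(2),
}

# strict=False tolerates keys that are absent from the partial state dict.
# The return value reports what was missing and what did not match.
result = model.load_state_dict(partial_state_dict, strict=False)

print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)
```

Parameters listed under `missing_keys` keep their previous values, which is what lets the original weights stay in place. A non-empty `unexpected_keys` list would mean that some checkpoint keys did not match any parameter name and were silently skipped, so it may be worth inspecting the value returned in Step 2 as well.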