sashakunitsyn committed (verified) · Commit 7db57fe · 1 Parent(s): d4d23bb

Update README.md

Files changed (1): README.md (+66 -1)
README.md CHANGED
@@ -9,4 +9,69 @@ tags:
- image-to-text
- image-captioning
base_model: Salesforce/blip2-opt-2.7b
---
# VLRM
This repository contains the fine-tuned weights of BLIP-2 OPT-2.7B. You can find the code in the [GitHub Repository](https://github.com/TODO).
# Running the model
## Option 1
<details>
<summary> Load the whole model from this repo </summary>

```python
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the processor and the fine-tuned checkpoint in fp16.
# device_map="auto" requires the accelerate package.
processor = Blip2Processor.from_pretrained("sashakunitsyn/vlrm-blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("sashakunitsyn/vlrm-blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto")

# Fetch a demo image and caption it.
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs, max_new_tokens=60)
processor.decode(out[0], skip_special_tokens=True).strip()
>>> 'a woman in a plaid shirt shaking hands with a yellow labrador retriever sitting on the ground at sunset on a beach in florida'
```
</details>

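Beyond plain captioning, BLIP-2 checkpoints in `transformers` also accept a text prompt next to the image, so the same model can answer simple questions about it. The snippet below is a minimal sketch that is not part of the original card; it reuses `processor`, `model`, and `raw_image` from the block above, and the question text is purely illustrative.

```python
# Illustrative prompt; BLIP-2 OPT checkpoints are commonly queried in the
# "Question: ... Answer:" format.
prompt = "Question: what is the woman doing? Answer:"
inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```
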
## Option 2
Since the fine-tuned weights make up only a small part of the whole model, you can load just the necessary weights on top of the original model (a sketch after Step 2 below shows how small this delta is).
<details>
<summary> Step 1. Load the original model </summary>

```python
import torch
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load the original BLIP-2 OPT-2.7B checkpoint in fp16.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto")

img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt").to("cuda", torch.float16)

out = model.generate(**inputs, max_new_tokens=60)
processor.decode(out[0], skip_special_tokens=True).strip()
>>> 'a woman sitting on the beach with a dog'
```
</details>

<details>
<summary> Step 2. Load the RL-tuned weights </summary>

```python
from huggingface_hub import hf_hub_download

# Download the RL-tuned weights and overlay them on the original model.
# strict=False because the file contains only the fine-tuned subset of parameters.
finetuned_weights_state_dict = torch.load(hf_hub_download(repo_id="sashakunitsyn/vlrm-blip2-opt-2.7b", filename="vlrm-blip2-opt-2.7b.pt"))
model.load_state_dict(finetuned_weights_state_dict, strict=False)

out = model.generate(**inputs, max_new_tokens=60)
processor.decode(out[0], skip_special_tokens=True).strip()
>>> 'a woman in a plaid shirt shaking hands with a yellow labrador retriever sitting on the ground at sunset on a beach in florida'
```
</details>
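
To check how small the RL-tuned delta actually is, you can inspect the state dict loaded in Step 2. This is a minimal sketch, not part of the original card, and it assumes the Step 2 block above has already run so that `model` and `finetuned_weights_state_dict` exist.

```python
# Compare the number of RL-tuned parameters against the full model.
n_tuned = sum(v.numel() for v in finetuned_weights_state_dict.values())
n_total = sum(p.numel() for p in model.parameters())
print(f"tuned parameters: {n_tuned:,} / {n_total:,} ({100 * n_tuned / n_total:.2f}%)")

# List the top-level modules touched by the fine-tuned weights.
print(sorted({k.split(".")[0] for k in finetuned_weights_state_dict}))
```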