Gengzigang committed 28b6081 (parent: c0e59c1): update README.md
---
license: apache-2.0
tags:
- CLIP
- LLM2CLIP
pipeline_tag: zero-shot-classification
---
8
+ <div align="center">
9
+
10
+ <h2><a href="">LLM2CLIP: Extending the Capability Boundaries of CLIP through Large Language Models</a></h2>
11
+ Weiquan Huang<sup>1*</sup>, Aoqi Wu<sup>1*</sup>, Yifan Yang<sup>2†</sup>, Xufang Luo<sup>2</sup>, Yuqing Yang<sup>2</sup>, Liang Hu<sup>1</sup>, Qi Dai<sup>2</sup>, Xiyang Dai<sup>2</sup>, Dongdong Chen<sup>2</sup>, Chong Luo<sup>2</sup>, Lili Qiu<sup>2</sup>
12
+
13
+ <sup>1</sup>Tongji Universiy, <sup>2</sup>Microsoft Corporation <br><sup>*</sup>Equal contribution <br><sup>†</sup> Corresponding to: [email protected]
14
+
15
+ <p><a rel="nofollow" href="https://github.com/microsoft/LLM2CLIP">[📂 GitHub]</a> <a rel="nofollow" href="https://microsoft.github.io/LLM2CLIP/">[🆕 Blog]</a> <a rel="nofollow" href="">[📜 LLM2CLIP]</a>
16
+ </div>

In this paper, we propose LLM2CLIP, a novel approach that harnesses the power of LLMs to unlock CLIP's potential. By fine-tuning the LLM in the caption space with contrastive learning, we distill its textual capabilities into its output embeddings, significantly improving the textual discriminability of the output layer. We then design an efficient training process in which the fine-tuned LLM acts as a powerful teacher for CLIP's visual encoder. Thanks to the LLM, we can now incorporate longer and more complex captions without being restricted by the context window and capability limitations of the vanilla CLIP text encoder. Our experiments demonstrate that this approach brings substantial improvements in cross-modal tasks. Our method directly boosted the performance of the previously SOTA EVA02 model by 16.5% on both long-text and short-text retrieval tasks, transforming a CLIP model trained solely on English data into a state-of-the-art cross-lingual model. Moreover, when integrated into multimodal training with models such as LLaVA 1.5, it consistently outperformed CLIP across nearly all benchmarks, demonstrating comprehensive performance improvements.
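Both the caption-space fine-tuning of the LLM and the subsequent teacher-guided CLIP training rely on a symmetric contrastive objective. As a rough, self-contained sketch of that style of loss (illustrative code, not the authors' implementation; function and argument names are ours):

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style symmetric InfoNCE: row i of emb_a and row i of emb_b
    form a positive pair; all other rows in the batch act as negatives."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.T / temperature   # (B, B) scaled cosine similarities
    targets = torch.arange(logits.size(0))   # positive pairs sit on the diagonal
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2
```

Minimizing this loss pulls matched pairs together on the unit sphere while pushing mismatched ones apart, which is what sharpens the discriminability of the output embeddings.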

## LLM2CLIP performance

<div align="center">
<img src="teaser.png" alt="summary_tab" width="85%">
</div>

**Note: all results presented in the paper were evaluated using the PyTorch weights. Performance may differ when using the Hugging Face (hf) models.**

## Model Details
- **Model Type:** vision foundation model, feature backbone
- **Pretraining Datasets:** CC3M, CC12M, YFCC15M, and Recap-DataComp-1B (30M subset)

## Usage

### Hugging Face Version

Image Embeddings
```python
from PIL import Image
import torch
from transformers import AutoModel, CLIPImageProcessor

image_path = "CLIP.png"
model_name_or_path = "microsoft/LLM2CLIP-Openai-L-14-336"  # or /path/to/local/LLM2CLIP-Openai-L-14-336

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
model = AutoModel.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.float16,
    trust_remote_code=True).to('cuda').eval()

image = Image.open(image_path)
input_pixels = processor(images=image, return_tensors="pt").pixel_values.to('cuda')

with torch.no_grad(), torch.cuda.amp.autocast():
    outputs = model.get_image_features(input_pixels)
```
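The features returned above are unnormalized. For similarity comparisons they are typically L2-normalized first, so that dot products become cosine similarities; a small, model-independent sketch with random stand-in tensors (the batch size and feature dimension here are arbitrary):

```python
import torch

# Stand-in for the output of model.get_image_features(...): 2 raw feature vectors.
features = torch.randn(2, 768)

# L2-normalize each row so dot products equal cosine similarities.
features = features / features.norm(dim=-1, keepdim=True)

# Pairwise cosine-similarity matrix; the diagonal is each vector with itself.
similarity = features @ features.T
print(similarity)
```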

Retrieval
```python
from PIL import Image
import torch
from transformers import AutoModel, AutoConfig, AutoTokenizer
from transformers import CLIPImageProcessor
from llm2vec import LLM2Vec

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
model_name_or_path = "microsoft/LLM2CLIP-Openai-L-14-336"  # or /path/to/local/LLM2CLIP-Openai-L-14-336
model = AutoModel.from_pretrained(
    model_name_or_path,
    torch_dtype=torch.float16,
    trust_remote_code=True).to('cuda').eval()

llm_model_name = 'microsoft/LLM2CLIP-Llama-3-8B-Instruct-CC-Finetuned'
config = AutoConfig.from_pretrained(llm_model_name, trust_remote_code=True)
llm_model = AutoModel.from_pretrained(llm_model_name, config=config, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(llm_model_name)
llm_model.config._name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct'  # Workaround for LLM2Vec
l2v = LLM2Vec(llm_model, tokenizer, pooling_mode="mean", max_length=512, doc_max_length=512)

captions = ["a diagram", "a dog", "a cat"]
image_path = "CLIP.png"

image = Image.open(image_path)
input_pixels = processor(images=image, return_tensors="pt").pixel_values.to('cuda')

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.get_image_features(input_pixels)
    text_features = l2v.encode(captions, convert_to_tensor=True).to('cuda')
    text_features = model.get_text_features(text_features)

    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```

## BibTeX & Citation