Click the Nomic Atlas map below to visualize a 100,000-sample subset of CC3M, comparing the vision and text embedding spaces!

[Nomic Atlas map of CC3M: vision vs. text embeddings](https://atlas.nomic.ai/data/nomic-multimodal-series/cc3m-100k-image-bytes-v15/map)

## Training Details

We align our vision embedder to the text embedding space using a technique similar to [LiT](https://arxiv.org/abs/2111.07991), but instead we lock the text embedder!
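The gist of this locked-tower setup can be sketched in a few lines: freeze the text encoder's parameters and train only the vision encoder with a symmetric contrastive loss, so image embeddings are pulled into the existing text space. This is a minimal sketch with toy linear towers, random features, and an assumed temperature of 0.07, not the actual Nomic training code (that lives in `contrastors`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the two towers; the real models are transformer encoders.
text_tower = nn.Linear(256, 128)    # pretrained text embedder (locked)
vision_tower = nn.Linear(512, 128)  # vision embedder (trained)

# Lock the text tower: no gradients flow into it.
for p in text_tower.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(vision_tower.parameters(), lr=1e-4)

# One contrastive step on a batch of 8 aligned (image, caption) pairs.
img_feats = torch.randn(8, 512)  # placeholder image features
txt_feats = torch.randn(8, 256)  # placeholder caption features

img_emb = F.normalize(vision_tower(img_feats), dim=-1)
txt_emb = F.normalize(text_tower(txt_feats), dim=-1)

# Symmetric InfoNCE: matched pairs sit on the diagonal of the logit matrix.
logits = img_emb @ txt_emb.T / 0.07  # 0.07 is an assumed temperature
labels = torch.arange(8)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

loss.backward()
optimizer.step()
```
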
For more details, see the Nomic Embed Vision Technical Report (soon to be released!) and the corresponding [blog post](https://blog.nomic.ai/posts/nomic-embed-vision).

Training code is released in the `contrastors` [repository](https://github.com/nomic-ai/contrastors).

## Usage

Note that `nomic-embed-text` *requires* prefixes! We support the prefixes `[search_query, search_document, classification, clustering]`. For retrieval applications, you should prepend `search_document` to all of your documents and `search_query` to your queries.

For example, say you are building a RAG application on top of Wikipedia. You would embed all Wikipedia articles with the prefix `search_document` and any questions you ask with `search_query`. For example:

```python
queries = ["search_query: who is the first president of the united states?", "search_query: when was babe ruth born?"]
documents = ["search_document: <article about US Presidents>", "search_document: <article about Babe Ruth>"]
```

You can use the model with the Transformers library as shown below.

### Transformers

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor
from PIL import Image
import requests

# Load the image processor and the vision encoder.
processor = AutoImageProcessor.from_pretrained("nomic-ai/nomic-embed-vision-v1.5")
vision_model = AutoModel.from_pretrained("nomic-ai/nomic-embed-vision-v1.5", trust_remote_code=True)

# Fetch a sample image (two cats, from the COCO validation set).
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(image, return_tensors="pt")

# Embed the image: take the CLS token embedding and L2-normalize it.
img_emb = vision_model(**inputs).last_hidden_state
img_embeddings = F.normalize(img_emb[:, 0], p=2, dim=1)
```

Additionally, you can perform multimodal retrieval!

```python
# Mean pooling over token embeddings, ignoring padding positions.
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

sentences = ['search_query: What are cute animals to cuddle with?', 'search_query: What do cats look like?']

tokenizer = AutoTokenizer.from_pretrained('nomic-ai/nomic-embed-text-v1.5')
text_model = AutoModel.from_pretrained('nomic-ai/nomic-embed-text-v1.5', trust_remote_code=True)
text_model.eval()

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    model_output = text_model(**encoded_input)

# Pool, layer-norm, and L2-normalize the text embeddings so they live in
# the same space as the image embeddings above.
text_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
text_embeddings = F.layer_norm(text_embeddings, normalized_shape=(text_embeddings.shape[1],))
text_embeddings = F.normalize(text_embeddings, p=2, dim=1)

# Image-to-query similarity matrix (rows: images, columns: queries).
print(torch.matmul(img_embeddings, text_embeddings.T))
```
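
Since both sets of embeddings are L2-normalized, the matrix printed above contains cosine similarities. To turn it into an actual retrieval step, a minimal follow-on sketch (reusing the `img_embeddings` and `text_embeddings` variables from the snippets above) could look like:

```python
# Cosine similarity of each query against each image; higher is more similar.
similarity = torch.matmul(text_embeddings, img_embeddings.T)  # [num_queries, num_images]

# Index of the best-matching image for each query.
best_image = similarity.argmax(dim=1)
print(best_image)
```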

# Join the Nomic Community

- Nomic: [https://nomic.ai](https://nomic.ai)
- Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)
- Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)