Spaces:

Muthukamalan
/

Coolo-Clip

Muthukamalan committed on Nov 14, 2024

Commit

3ff9a31

1 Parent(s): fece87d

application file

Files changed (14)

README.md +165 -7
app.py +124 -0
assets/duck.jpeg +0 -0
assets/horse.jpeg +0 -0
contents/bert-model.png +0 -0
contents/clip_model.png +0 -0
contents/cool-clip-nvitop.png +0 -0
contents/cool-clip.png +0 -0
contents/fit-report.png +0 -0
contents/resnet.png +0 -0
features.npy +3 -0
photo_ids.csv +0 -0
photos.tsv000 +0 -0
requirements.txt +5 -0

README.md CHANGED

@@ -1,14 +1,172 @@
 ---
-title: Coolo Clip
-emoji: 👀
-colorFrom: red
-colorTo: yellow
 sdk: gradio
-sdk_version: 5.5.0
 app_file: app.py
 pinned: false
 license: mit
-short_description: experiment to train clip based models
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: CoolCLIP
+emoji: 🦆
+colorFrom: green
+colorTo: midnight-blue
 sdk: gradio
+sdk_version: 4.44.1
 app_file: app.py
 pinned: false
 license: mit
 ---
+# CLIP
+**Contrastive Language–Image Pre-training** ([CLIP](https://github.com/openai/CLIP), 2021) arrived in the early days of transformers dominating vision (ViTs). It is a neural network model that learns to associate textual descriptions with images.
+# Dataset
+The experiments were performed on the [Flickr8k dataset from Kaggle](https://www.kaggle.com/datasets/adityajn105/flickr8k).
+## APPROACH
+![CLIP-Model Architecture](https://raw.githubusercontent.com/openai/CLIP/dcba3cb2e2827b402d2701e7e1c7d9fed8a20ef1/CLIP.png)
+*Image Encoder* processes the image and may or may not use a CNN backbone, for example:
+- resnet
+- densenet
+
+*Text Encoder* encodes the caption, for example:
+- bert
+- distilbert
+## Text Encoder
+Captions are tokenized with the `DistilBert` tokenizer.
+```python
+from transformers import DistilBertModel, DistilBertTokenizer
+
+tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
+# `captions` is an iterable of caption strings
+encoded = tokenizer(list(captions), padding=True, truncation=True, max_length=200)
+text_model = DistilBertModel.from_pretrained("distilbert-base-uncased")
+```
+<!-- <div align='center'><img src='./contents/bert-model.png' alt=""></div> -->
+<div align='center'><img src='https://raw.githubusercontent.com/Muthukamalan/CoolCLIP-/refs/heads/main/gradio/contents/bert-model.png' alt=""></div>
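The `padding=True, truncation=True, max_length=200` arguments in the tokenizer call pad every caption to the longest one in the batch and cut anything beyond the maximum length. A minimal pure-Python sketch of that behaviour on made-up token IDs follows. `pad_and_truncate` is a hypothetical helper, not part of `transformers`, and the pad ID of 0 is assumed to match the `[PAD]` token of `distilbert-base-uncased`. The real tokenizer also keeps the special `[CLS]`/`[SEP]` tokens when truncating, which this sketch ignores.

```python
def pad_and_truncate(batch, max_length=200, pad_id=0):
    """Hypothetical helper: truncate to max_length, then pad to the longest sequence."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Toy token IDs (made up for illustration, not real vocabulary lookups)
toy_batch = [[101, 7, 8, 9, 102], [101, 7, 102]]
encoded = pad_and_truncate(toy_batch, max_length=4)
```

The `attention_mask` is what the text encoder later receives next to `input_ids`, so padded positions do not influence the caption embedding.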
+## Image Encoder
+Transforms standardise each image (resize and normalise) before it is passed to the model.
+```python
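+import albumentations as A
+
+def get_transforms(mode="train"):
+    # Both branches currently apply the same resize + normalise pipeline
+    if mode == "train":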
+        return A.Compose(
+            [
+                A.Resize(224, 224, always_apply=True),
+                A.Normalize(max_pixel_value=255.0, always_apply=True),
+            ]
+        )
+    else:
+        return A.Compose(
+            [
+                A.Resize(224, 224, always_apply=True),
+                A.Normalize(max_pixel_value=255.0, always_apply=True),
+            ]
+        )
+```
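With only `max_pixel_value=255.0` given, `A.Normalize` scales pixels to the [0, 1] range and then standardises each channel with a mean and standard deviation. A NumPy sketch of that arithmetic is shown here. `normalize_like_albumentations` is a hypothetical name, and the ImageNet statistics below are assumed to match the library defaults rather than taken from this repository.

```python
import numpy as np

# ImageNet channel statistics (assumed to match the albumentations defaults)
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize_like_albumentations(image, max_pixel_value=255.0):
    """Hypothetical helper: (image / max_pixel_value - mean) / std, per channel."""
    scaled = image.astype(np.float32) / max_pixel_value
    return (scaled - MEAN) / STD


# A 224x224 RGB image filled with mid-grey pixels
image = np.full((224, 224, 3), 128, dtype=np.uint8)
normalized = normalize_like_albumentations(image)
```

After this step each channel is roughly zero-centred, which is the input distribution a pretrained ImageNet backbone expects.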
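+The image encoder is a pretrained `resnet` model from `timm`:
+```python
+import timm
+
+# num_classes=0 drops the classifier head, so the model returns pooled features
+image_model = timm.create_model("resnet18", pretrained=True, num_classes=0, global_pool="avg")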
+```
+<div align='center'><img src='https://raw.githubusercontent.com/Muthukamalan/CoolCLIP-/refs/heads/main/gradio/contents/resnet.png' alt=""></div>
+## Projection Head
+The `output_image_embedding` may not have the same dimension as the `output_text_embedding`. The projection heads act as adapters that map both into a shared space of the same dimension.
+Each head is a simple residual block with a non-linear activation.
+```python
+class ProjectionHead(nn.Module):
+    def __init__(
+        self,
+        embedding_dim,
+        projection_dim=256,
+        dropout=CFG.dropout
+    ):
+        super().__init__()
+        self.projection = nn.Linear(embedding_dim, projection_dim)
+        self.gelu = nn.GELU()
+        self.fc = nn.Linear(projection_dim, projection_dim)
+        self.dropout = nn.Dropout(dropout)
+        self.layer_norm = nn.LayerNorm(projection_dim)
+    def forward(self, x):
+        projected = self.projection(x)
+        x = self.gelu(projected)
+        x = self.fc(x)
+        x = self.dropout(x)
+        x = x + projected
+        x = self.layer_norm(x)
+        return x
+```
+## CLIP Model
+Combines the image and text models through the projection heads (adapters), so that both embeddings live in a shared space and can be compared.
+```python
+class CLIPModel(pl.LightningModule):
+    # The `temperature` default is an assumption; the original snippet never sets self.temperature
+    def __init__(self, image_embedding, text_embedding, temperature=1.0) -> None:
+        super().__init__()
+        self.image_encoder = ImageEncoder()
+        self.text_encoder = TextEncoder()
+        self.image_projection = ProjectionHead(embedding_dim=image_embedding)
+        self.text_projection = ProjectionHead(embedding_dim=text_embedding)
+        self.temperature = temperature
+
+    def forward(self, batch):
+        # Encode each modality, then project both into the shared embedding space
+        image_features = self.image_encoder(batch["image"])
+        text_features = self.text_encoder(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"])
+        image_embeddings = self.image_projection(image_features)
+        text_embeddings = self.text_projection(text_features)
+        # Calculating the Loss
+        logits = (text_embeddings @ image_embeddings.T) / self.temperature
+        images_similarity = image_embeddings @ image_embeddings.T
+        texts_similarity = text_embeddings @ text_embeddings.T
+        targets = F.softmax((images_similarity + texts_similarity) / 2 * self.temperature, dim=-1)
+        # `cross_entropy` is a helper (not shown) that accepts soft targets
+        texts_loss = cross_entropy(logits, targets, reduction='none')
+        images_loss = cross_entropy(logits.T, targets.T, reduction='none')
+        loss = (images_loss + texts_loss) / 2.0  # shape: (batch_size)
+        return loss.mean()
+```
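The `cross_entropy(logits, targets, reduction='none')` call in the snippet receives soft targets (a probability matrix) rather than class indices, and its definition is not shown in the README. The NumPy sketch below shows one way such a loss can be computed. `soft_cross_entropy` and `clip_loss` are hypothetical names, and the default temperature of 1.0 is an assumption, so this should not be read as the repository's actual implementation.

```python
import numpy as np


def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def soft_cross_entropy(logits, targets):
    """Per-row cross-entropy against soft (probability) targets."""
    return -(targets * log_softmax(logits)).sum(axis=-1)


def clip_loss(image_embeddings, text_embeddings, temperature=1.0):
    logits = (text_embeddings @ image_embeddings.T) / temperature
    images_similarity = image_embeddings @ image_embeddings.T
    texts_similarity = text_embeddings @ text_embeddings.T
    # Soft targets: softmax of the averaged intra-modal similarities
    targets = np.exp(log_softmax((images_similarity + texts_similarity) / 2 * temperature))
    texts_loss = soft_cross_entropy(logits, targets)
    images_loss = soft_cross_entropy(logits.T, targets.T)
    return ((images_loss + texts_loss) / 2.0).mean()


# Random embeddings standing in for a batch of 4 projected image/text pairs
rng = np.random.default_rng(0)
image_embeddings = rng.normal(size=(4, 256))
text_embeddings = rng.normal(size=(4, 256))
loss = clip_loss(image_embeddings, text_embeddings)
```

With one-hot targets this reduces to the ordinary cross-entropy. The soft targets let near-duplicate images or captions in a batch share probability mass instead of being treated as hard negatives.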
+## Model Summary
+```log
+  | Name             | Type           | Params | Mode
+------------------------------------------------------------
+0 | image_encoder    | ImageEncoder   | 11.2 M | train
+1 | text_encoder     | TextEncoder    | 66.4 M | train
+2 | image_projection | ProjectionHead | 197 K  | train
+3 | text_projection  | ProjectionHead | 263 K  | train
+------------------------------------------------------------
+78.0 M    Trainable params
+0         Non-trainable params
+78.0 M    Total params
+312.001   Total estimated model params size (MB)
+200       Modules in train mode
+0         Modules in eval mode
+```
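The two `ProjectionHead` rows in the summary can be checked by hand from the layer definitions. The sketch below counts weights and biases in plain Python. The encoder output sizes (512 for `resnet18`, 768 for DistilBERT) are assumptions based on the standard architectures, and both function names are hypothetical.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def projection_head_params(embedding_dim, projection_dim=256):
    projection = linear_params(embedding_dim, projection_dim)
    fc = linear_params(projection_dim, projection_dim)
    layer_norm = 2 * projection_dim  # scale and shift vectors
    # GELU and Dropout have no trainable parameters
    return projection + fc + layer_norm


# Assumed encoder output sizes: 512 for resnet18, 768 for DistilBERT
image_head = projection_head_params(512)
text_head = projection_head_params(768)
```

These come to 197,632 and 263,168 parameters, consistent with the 197 K and 263 K rows. Likewise, 78.0 M float32 parameters at 4 bytes each give roughly 312 MB, in line with the reported model size.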
+## Training
+- nvitop
+<!-- ![cool-clip-nvitop](./contents/cool-clip-nvitop.png) -->
+<div align='center'><img src='https://raw.githubusercontent.com/Muthukamalan/CoolCLIP-/refs/heads/main/gradio/contents/cool-clip-nvitop.png' alt=""></div>
+- htop
+<!-- ![cool-clip](./contents/cool-clip.png) -->
+<div align='center'><img src='https://raw.githubusercontent.com/Muthukamalan/CoolCLIP-/refs/heads/main/gradio/contents/cool-clip.png' alt=""></div>
+- training
+<!-- ![fit-report](./contents/fit-report.png) -->
+<div align='center'><img src='https://raw.githubusercontent.com/Muthukamalan/CoolCLIP-/refs/heads/main/gradio/contents/fit-report.png' alt=""></div>
+# Inference
+## GRADIO APP
+<div align='center'><img src='https://raw.githubusercontent.com/Muthukamalan/CoolCLIP-/refs/heads/main/gradio/contents/clip_model.png' alt=""></div>
+<!-- <div><img align='center' src="./contents/clip_model.png" ></img></div> -->

app.py ADDED

	@@ -0,0 +1,124 @@

+#Importing all the necessary libraries
+import torch
+import requests
+import numpy as np
+import pandas as pd
+import gradio as gr
+from io import BytesIO
+from PIL import Image as PILIMAGE
+from transformers import CLIPProcessor, CLIPModel, CLIPTokenizer
+from sentence_transformers import SentenceTransformer, util
+device = "cuda" if torch.cuda.is_available() else "cpu"
+# Define model
+model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
+processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
+tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
+# Load data
+photos = pd.read_csv("./photos.tsv000", sep='\t', header=0)
+photo_features = np.load("./features.npy")
+photo_ids = pd.read_csv("./photo_ids.csv")
+photo_ids = list(photo_ids['photo_id'])
+def encode_text(text):
+  with torch.no_grad():
+    # Encode and normalize the description using CLIP
+    inputs = processor(text=[text], images=None, return_tensors="pt", padding=True).to(device)
+    text_features = model.get_text_features(**inputs)
+    text_features /= text_features.norm(dim=-1, keepdim=True)
+  return text_features.cpu().numpy()
+def encode_image(image):
+  image = PILIMAGE.fromarray(image.astype('uint8'), 'RGB')
+  with torch.no_grad():
+    photo_preprocessed = processor(text=None, images=image, return_tensors="pt", padding=True)["pixel_values"]
+    search_photo_feature = model.get_image_features(photo_preprocessed.to(device))
+    search_photo_feature /= search_photo_feature.norm(dim=-1, keepdim=True)
+  image_encoded = search_photo_feature.cpu().numpy()
+  return image_encoded
+T2I = "Text2Image"
+I2I = "Image2Image"
+def similarity(feature, photo_features):
+  similarities = list((feature @ photo_features.T).squeeze(0))
+  return similarities
+def find_best_matches(image, mode, text):
+  # Compute the similarity between the description and each photo using the Cosine similarity
+  print ("Mode now ",mode)
+  if mode == "Text2Image":
+    # Encode the text input
+    text_features = encode_text(text)
+    feature = text_features
+    similarities = similarity(text_features, photo_features)
+  else:
+    #Encode the image input
+    image_features = encode_image(image)
+    feature = image_features
+    similarities = similarity(image_features, photo_features)
+  # Sort the photos by their similarity score
+  best_photos = sorted(zip(similarities, range(photo_features.shape[0])), key=lambda x: x[0], reverse=True)
+  matched_images = []
+  for i in range(3):
+    # Retrieve the photo ID
+    idx = best_photos[i][1]
+    photo_id = photo_ids[idx]
+    # Get all metadata for this photo
+    photo_data = photos[photos["photo_id"] == photo_id].iloc[0]
+    # Display the images
+    #display(Image(url=photo_data["photo_image_url"] + "?w=640"))
+    response = requests.get(photo_data["photo_image_url"] + "?w=640")
+    img = PILIMAGE.open(BytesIO(response.content))
+    matched_images.append(img)
+  return matched_images
+demo = gr.Interface(
+    fn=find_best_matches,
+    inputs=[
+        gr.Image(label="Image to search",),# optional=True
+        gr.Radio([T2I, I2I]),
+        gr.Textbox(lines=1, label="Text query", placeholder="Introduce the search text...",)
+      ],
+      theme="grass",
+      outputs=[
+        gr.Gallery(label="Generated images", show_label=False, elem_id="gallery")
+      ],
+      title="CLIP Search",
+      description="This application displays the top three images from the Unsplash dataset (25k images) that best match the user's search query. The query can be provided in two modes: text or image.",
+      examples=[
+        ["./assets/duck.jpeg","Image2Image", None] ,
+        [None, "Text2Image", "Planet Earth"],
+        ["./assets/horse.jpeg", "Text2Image", "Horse"]
+      ]
+    )
+with open("README.md", "r", encoding="utf-8") as file:
+  readme_content = file.read()
+# 🏐⚽🏀🎾🤸
+readme =gr.Interface( fn = None, inputs=None, outputs=gr.Markdown(readme_content[150:]),clear_btn=None, css="footer{display:none !important}",flagging_options=[],show_progress='hidden',title="") #gr.Interface(lambda name: "Bye " + name, "text", "text")#
+app = gr.TabbedInterface([demo, readme ],tab_names=["CoolCLIP 🦆","README"])
+app.launch(debug=False,)
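`find_best_matches` ranks every photo by sorting all (similarity, index) pairs and then keeps the first three. For a fixed top-k, a partial selection gives the same result without a full sort. The NumPy sketch below shows the idea on toy 2-D vectors. `top_k_matches` is a hypothetical helper, and normalising `photo_features` here is a precaution, since it is not known from the diff whether `features.npy` already stores unit-length vectors.

```python
import numpy as np


def top_k_matches(query_feature, photo_features, k=3):
    """Return indices and scores of the k most similar rows, best first."""
    # Normalise both sides so the dot product is a cosine similarity
    query = query_feature / np.linalg.norm(query_feature, axis=-1, keepdims=True)
    photos = photo_features / np.linalg.norm(photo_features, axis=-1, keepdims=True)
    scores = (query @ photos.T).squeeze(0)
    k = min(k, scores.shape[0])
    # argpartition selects the k largest scores without sorting everything
    candidates = np.argpartition(-scores, k - 1)[:k]
    order = candidates[np.argsort(-scores[candidates])]
    return order, scores[order]


# Toy 2-D features standing in for the CLIP embeddings in features.npy
photo_features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query_feature = np.array([[2.0, 0.1]])
indices, scores = top_k_matches(query_feature, photo_features, k=3)
```

The returned `indices` play the role of `idx` in the loop over `best_photos`, so they could be used to look up `photo_ids` in the same way.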