---
title: CoolCLIP
emoji: 🦆
colorFrom: green
colorTo: indigo
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: mit
---
# CLIP
In the early days of transformers dominating vision (ViTs) came **Contrastive Language–Image Pre-training** ([CLIP](https://github.com/openai/CLIP), 2021), a powerful neural network model that learns to associate textual descriptions with images.
# Dataset
The experiments are performed on the [Flickr8k Kaggle dataset](https://www.kaggle.com/datasets/adityajn105/flickr8k), which pairs each image with several human-written captions.
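A minimal loading sketch, assuming the Kaggle layout with an `Images/` folder and a `captions.txt` file of `image,caption` rows (paths and file names are assumptions):
```python
import pandas as pd

# each image appears once per caption (Flickr8k ships ~5 captions per image)
df = pd.read_csv("flickr8k/captions.txt")  # assumed columns: image, caption
print(f"{df['image'].nunique()} images, {len(df)} captions")
```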
## APPROACH
![CLIP-Model Architecture](https://raw.githubusercontent.com/openai/CLIP/dcba3cb2e2827b402d2701e7e1c7d9fed8a20ef1/CLIP.png)
*Image Encoder*: processes the image, typically with a CNN backbone such as
- ResNet
- DenseNet

*Text Encoder*: encodes the caption with a transformer such as
- BERT
- DistilBERT
## Text Encoder
Captions are tokenized and encoded with `DistilBert`:
```python
from transformers import DistilBertModel, DistilBertTokenizer

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
tokens = tokenizer(list(captions), padding=True, truncation=True, max_length=200)
text_model = DistilBertModel.from_pretrained("distilbert-base-uncased")
```
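A minimal `TextEncoder` wrapper in the spirit of common CLIP tutorials, assuming the hidden state of the CLS token (index 0) is taken as the sentence embedding:
```python
import torch.nn as nn
from transformers import DistilBertModel

class TextEncoder(nn.Module):
    def __init__(self, model_name="distilbert-base-uncased"):
        super().__init__()
        self.model = DistilBertModel.from_pretrained(model_name)
        self.target_token_idx = 0  # CLS token position

    def forward(self, input_ids, attention_mask):
        output = self.model(input_ids=input_ids, attention_mask=attention_mask)
        # (batch, seq_len, 768) -> (batch, 768): keep only the CLS embedding
        return output.last_hidden_state[:, self.target_token_idx, :]
```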
<div align='center'><img src='https://raw.githubusercontent.com/Muthukamalan/CoolCLIP-/refs/heads/main/gradio/contents/bert-model.png' alt=""></div>
## Image Encoder
Albumentations transforms standardise the image (resize and normalise) before it is passed to the model. Both branches are currently identical; the `train` branch is where augmentations would be added:
```python
import albumentations as A

def get_transforms(mode="train"):
    if mode == "train":
        # augmentations (flips, colour jitter, ...) would be added here
        return A.Compose(
            [
                A.Resize(224, 224, always_apply=True),
                A.Normalize(max_pixel_value=255.0, always_apply=True),
            ]
        )
    else:
        return A.Compose(
            [
                A.Resize(224, 224, always_apply=True),
                A.Normalize(max_pixel_value=255.0, always_apply=True),
            ]
        )
```
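A usage sketch, assuming images are loaded as RGB `numpy` arrays (here via OpenCV, with a hypothetical file name) and then moved channel-first for PyTorch:
```python
import cv2
import torch

image = cv2.imread("example.jpg")                       # BGR, HWC, uint8 (path is an assumption)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)          # convert to RGB
image = get_transforms("train")(image=image)["image"]   # resized + normalised
tensor = torch.tensor(image).permute(2, 0, 1).float()   # CHW tensor for the model
```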
A pretrained `resnet18` backbone from `timm`; `num_classes=0` drops the classifier head and `global_pool="avg"` returns one pooled feature vector per image:
```python
import timm

image_model = timm.create_model("resnet18", pretrained=True, num_classes=0, global_pool="avg")
```
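A minimal `ImageEncoder` wrapper around the timm backbone (referenced later by the CLIP model); resnet18 yields 512-dimensional features:
```python
import timm
import torch.nn as nn

class ImageEncoder(nn.Module):
    def __init__(self, model_name="resnet18", pretrained=True):
        super().__init__()
        # num_classes=0 + avg pooling -> (batch, 512) feature vectors
        self.model = timm.create_model(
            model_name, pretrained=pretrained, num_classes=0, global_pool="avg"
        )

    def forward(self, x):
        return self.model(x)
```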
<div align='center'><img src='https://raw.githubusercontent.com/Muthukamalan/CoolCLIP-/refs/heads/main/gradio/contents/resnet.png' alt=""></div>
## Projection Head
The image and text features generally do not share the same dimensionality (512 for resnet18 vs. 768 for DistilBERT), so each modality gets a projection head that acts as an adapter into a common embedding space.
It follows a simple residual block with a non-linear activation:
```python
import torch.nn as nn

# residual MLP adapter: Linear -> GELU -> Linear -> Dropout -> +skip -> LayerNorm
# (CFG below is the project-wide config object providing the default dropout)
class ProjectionHead(nn.Module):
def __init__(
self,
embedding_dim,
projection_dim=256,
dropout=CFG.dropout
):
super().__init__()
self.projection = nn.Linear(embedding_dim, projection_dim)
self.gelu = nn.GELU()
self.fc = nn.Linear(projection_dim, projection_dim)
self.dropout = nn.Dropout(dropout)
self.layer_norm = nn.LayerNorm(projection_dim)
def forward(self, x):
projected = self.projection(x)
x = self.gelu(projected)
x = self.fc(x)
x = self.dropout(x)
x = x + projected
x = self.layer_norm(x)
return x
```
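A quick shape sanity check, assuming the backbone dimensions above (dropout passed explicitly so the sketch does not depend on `CFG`):
```python
import torch

image_projection = ProjectionHead(embedding_dim=512, dropout=0.1)  # resnet18 features
text_projection = ProjectionHead(embedding_dim=768, dropout=0.1)   # distilbert features

print(image_projection(torch.randn(4, 512)).shape)  # torch.Size([4, 256])
print(text_projection(torch.randn(4, 768)).shape)   # torch.Size([4, 256])
```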
## CLIP Model
Combines the image and text encoders through the projection-head adapters and trains them contrastively: matching image–caption pairs should get high similarity, mismatched pairs low similarity.
```python
import pytorch_lightning as pl
import torch.nn.functional as F

class CLIPModel(pl.LightningModule):
    def __init__(self, image_embedding, text_embedding) -> None:
        super().__init__()
        self.image_encoder = ImageEncoder()
        self.text_encoder = TextEncoder()
        self.image_projection = ProjectionHead(embedding_dim=image_embedding)
        self.text_projection = ProjectionHead(embedding_dim=text_embedding)
        self.temperature = CFG.temperature  # softmax temperature from the project config

    def forward(self, batch):
        image_features = self.image_encoder(batch["image"])
        text_features = self.text_encoder(
            input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
        )
        image_embeddings = self.image_projection(image_features)
        text_embeddings = self.text_projection(text_features)
        # Calculating the loss: pairwise text-image similarities, scaled by temperature
        logits = (text_embeddings @ image_embeddings.T) / self.temperature
        # Soft targets: average of the within-modality similarity matrices
        images_similarity = image_embeddings @ image_embeddings.T
        texts_similarity = text_embeddings @ text_embeddings.T
        targets = F.softmax(
            (images_similarity + texts_similarity) / 2 * self.temperature, dim=-1
        )
        # Symmetric loss over rows (texts) and columns (images)
        texts_loss = cross_entropy(logits, targets, reduction='none')
        images_loss = cross_entropy(logits.T, targets.T, reduction='none')
        loss = (images_loss + texts_loss) / 2.0  # shape: (batch_size,)
        return loss.mean()
```
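The `cross_entropy` above is not the index-target `F.cross_entropy`: the targets here are soft probability rows. A minimal soft-target definition consistent with the calls above, as used in common CLIP tutorials:
```python
import torch.nn as nn

def cross_entropy(preds, targets, reduction='none'):
    # per-sample cross-entropy between soft target rows and predicted logits
    log_softmax = nn.LogSoftmax(dim=-1)
    loss = (-targets * log_softmax(preds)).sum(1)
    return loss if reduction == 'none' else loss.mean()
```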
## Model Summary
```log
| Name | Type | Params | Mode
------------------------------------------------------------
0 | image_encoder | ImageEncoder | 11.2 M | train
1 | text_encoder | TextEncoder | 66.4 M | train
2 | image_projection | ProjectionHead | 197 K | train
3 | text_projection | ProjectionHead | 263 K | train
------------------------------------------------------------
78.0 M Trainable params
0 Non-trainable params
78.0 M Total params
312.001 Total estimated model params size (MB)
200 Modules in train mode
0 Modules in eval mode
```
## Training
- nvitop
<div align='center'><img src='https://raw.githubusercontent.com/Muthukamalan/CoolCLIP-/refs/heads/main/gradio/contents/cool-clip-nvitop.png' alt=""></div>
- htop
<div align='center'><img src='https://raw.githubusercontent.com/Muthukamalan/CoolCLIP-/refs/heads/main/gradio/contents/cool-clip.png' alt=""></div>
- training
<div align='center'><img src='https://raw.githubusercontent.com/Muthukamalan/CoolCLIP-/refs/heads/main/gradio/contents/fit-report.png' alt=""></div>
# Inference
## GRADIO APP
<div align='center'><img src='https://raw.githubusercontent.com/Muthukamalan/CoolCLIP-/refs/heads/main/gradio/contents/clip_model.png' alt=""></div>
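A minimal sketch of the inference app, assuming precomputed gallery `image_embeddings` (with matching `image_paths`) and a hypothetical `encode_text(query)` helper that runs the text encoder plus projection head:
```python
import gradio as gr
import torch.nn.functional as F

def find_matches(query, n=9):
    # encode_text / image_embeddings / image_paths are assumed precomputed (hypothetical names)
    text_embedding = encode_text(query)                           # (1, 256)
    sims = F.cosine_similarity(text_embedding, image_embeddings)  # (num_images,)
    top = sims.topk(n).indices.tolist()
    return [image_paths[i] for i in top]                          # top-n matching images

demo = gr.Interface(fn=find_matches, inputs="text", outputs=gr.Gallery())
demo.launch()
```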