sebastiansarasti committed (verified)
Commit bc3e87a · Parent: 0a04509

Update README.md
Files changed (1): README.md (+85 -3)

README.md CHANGED

---
tags:
- pytorch_model_hub_mixin
---

# Model Card: CLIP for Chemistry (CLIPChemistryModel)

## Model Details
- **Model Name**: `CLIPModel`
- **Architecture**: CLIP-based multimodal model for fashion images and text
- **Dataset**: [E-commerce Products CLIP Dataset](hf://datasets/rajuptvs/ecommerce_products_clip/data/train-00000-of-00001-1f042f20fd269c32.parquet)
- **Batch Size**: 8
- **Loss Function**: Contrastive loss (see the sketch after this list)
- **Optimizer**: Adam (learning rate = 1e-3)
- **Transfer Learning**: Enabled (backbone layers frozen for both the image and text encoders)

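The card names a contrastive loss but does not include its implementation. Below is a minimal sketch of the CLIP-style symmetric contrastive (InfoNCE) loss typically used with paired 512-dimensional embeddings like the ones this model produces; the function name `contrastive_loss` and the temperature value are illustrative assumptions, not taken from the original training code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss (assumed formulation).

    image_emb, text_emb: tensors of shape (batch_size, 512) produced by
    ImageEncoderHead and TextEncoderHead for matching image-text pairs.
    """
    # Normalize embeddings so the dot product is a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j
    logits = image_emb @ text_emb.t() / temperature

    # Matching pairs lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image -> text and text -> image)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```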
## Model Architecture
This model is based on the **CLIP (Contrastive Language-Image Pretraining) framework** and is designed to learn **joint representations of text and image modalities** for e-commerce fashion products.

### **Components**
- **Image Encoder (`ImageEncoderHead`)**
  - Uses a **Vision Transformer (ViT) backbone** for image feature extraction (see the backbone sketch after this list)
  - Fully connected (FC) layers project the features to a **512-dimensional space**
- **Text Encoder (`TextEncoderHead`)**
  - Uses a **Transformer-based text encoder**
  - Extracts text features and projects them to the same **512-dimensional space**
- **`CLIPModel`**
  - Combines the image and text encoders
  - Computes one embedding per modality for contrastive learning

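The encoder heads take a pretrained backbone as a constructor argument (see the model definition below), but the card does not state which checkpoints were used. The sketch below shows one plausible pairing from the `transformers` library, chosen only because it matches the dimensions the heads expect (768-dimensional pooled image features and 768-dimensional token features over 128 tokens); the checkpoint names and the example caption are assumptions, not the documented training setup.

```python
from transformers import ViTModel, ViTImageProcessor, BertModel, BertTokenizerFast

# Assumed image backbone: a base ViT whose pooled output is 768-dimensional
image_backbone = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

# Assumed text backbone: a BERT-base encoder with 768-dimensional hidden states
text_backbone = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# The text head flattens 128 tokens x 768 dims, so captions must be
# padded/truncated to a fixed length of 128 tokens
encoded = tokenizer(
    ["red cotton summer dress"],  # example caption, not from the dataset
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
```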
## Implementation
### **Model Definition**
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from huggingface_hub import PyTorchModelHubMixin


class ImageEncoderHead(nn.Module, PyTorchModelHubMixin):
    def __init__(self, model):
        super(ImageEncoderHead, self).__init__()
        self.model = model
        # Freeze the pretrained image backbone (transfer learning)
        for param in self.model.parameters():
            param.requires_grad = False
        # Projection head: 768-dim pooled features -> 512-dim embedding
        self.seq1 = nn.Sequential(
            nn.Linear(768, 1000),
            nn.Dropout(0.3),
            nn.ReLU(),
            nn.Linear(1000, 512),
            nn.LayerNorm(512),
        )

    def forward(self, pixel_values):
        # Pooled representation from the vision backbone (768-dim)
        outputs = self.model(pixel_values).pooler_output
        outputs = self.seq1(outputs)
        return outputs.contiguous()


class TextEncoderHead(nn.Module, PyTorchModelHubMixin):
    def __init__(self, model):
        super(TextEncoderHead, self).__init__()
        self.model = model
        # Freeze the pretrained text backbone (transfer learning)
        for param in self.model.parameters():
            param.requires_grad = False
        # Flatten 128 tokens x 768 dims, then project to a 512-dim embedding
        self.seq1 = nn.Sequential(
            nn.Flatten(),
            nn.Linear(768 * 128, 2000),
            nn.Dropout(0.3),
            nn.ReLU(),
            nn.Linear(2000, 512),
            nn.LayerNorm(512),
        )

    def forward(self, input_ids, attention_mask):
        # Token-level features from the text backbone: (batch, 128, 768)
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        outputs = self.seq1(outputs)
        return outputs.contiguous()


class CLIPModel(nn.Module, PyTorchModelHubMixin):
    def __init__(self, text_encoder, image_encoder):
        super(CLIPModel, self).__init__()
        self.text_encoder = text_encoder
        self.image_encoder = image_encoder

    def forward(self, image, input_ids, attention_mask):
        # Return one 512-dim embedding per modality for contrastive training
        ie = self.image_encoder(image)
        te = self.text_encoder(input_ids, attention_mask)
        return ie, te
```
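To make the pieces concrete, here is a hedged end-to-end sketch that wraps the assumed backbones and preprocessors from the earlier backbone snippet in the heads defined above and runs one image-text pair through the model; the placeholder image and caption are illustrative only.

```python
import torch
from PIL import Image

# Wrap the assumed backbones (see the earlier backbone sketch) in the encoder heads
model = CLIPModel(
    text_encoder=TextEncoderHead(text_backbone),
    image_encoder=ImageEncoderHead(image_backbone),
)

# Preprocess one image-text pair
image = Image.new("RGB", (224, 224))  # placeholder image
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
encoded = tokenizer(
    ["red cotton summer dress"],  # placeholder caption
    padding="max_length", truncation=True, max_length=128, return_tensors="pt",
)

with torch.no_grad():
    image_emb, text_emb = model(pixel_values, encoded["input_ids"], encoded["attention_mask"])

print(image_emb.shape, text_emb.shape)  # torch.Size([1, 512]) torch.Size([1, 512])
```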

This model has been pushed to the Hub using the [PyTorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration.
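A sketch of that mixin workflow is shown below; the repository id is a placeholder, and because the heads take `nn.Module` backbones as constructor arguments, those arguments are passed again explicitly when reloading.

```python
# Push the trained model to the Hub via the PyTorchModelHubMixin methods
model.push_to_hub("your-username/clip-fashion-ecommerce")  # placeholder repo id

# Reload later, supplying the (non-serializable) backbone modules explicitly
reloaded = CLIPModel.from_pretrained(
    "your-username/clip-fashion-ecommerce",
    text_encoder=TextEncoderHead(text_backbone),
    image_encoder=ImageEncoderHead(image_backbone),
)
```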