sebastiansarasti committed (verified)
Commit bc3e87a · Parent: 0a04509

Update README.md
Files changed (1): README.md (+85 -3)

README.md CHANGED

---
tags:
- pytorch_model_hub_mixin
---

# Model Card: CLIP for Chemistry (CLIPChemistryModel)

## Model Details
- **Model Name**: `CLIPModel`
- **Architecture**: CLIP-based multimodal model for fashion images and text
- **Dataset**: [E-commerce Products CLIP Dataset](hf://datasets/rajuptvs/ecommerce_products_clip/data/train-00000-of-00001-1f042f20fd269c32.parquet)
- **Batch Size**: 8
- **Loss Function**: Contrastive loss (see the sketch after this list)
- **Optimizer**: Adam (learning rate = 1e-3)
- **Transfer Learning**: Enabled (backbone layers frozen for both the image and text encoders)

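The card names a contrastive loss but does not include its implementation. Below is a minimal sketch of the CLIP-style symmetric contrastive (InfoNCE) loss typically used with paired 512-dimensional embeddings like the ones this model produces; the function name `contrastive_loss` and the temperature value are illustrative assumptions, not taken from the original training code.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss (assumed formulation).

    image_emb, text_emb: tensors of shape (batch_size, 512) produced by
    ImageEncoderHead and TextEncoderHead for matching image-text pairs.
    """
    # Normalize embeddings so the dot product is a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j
    logits = image_emb @ text_emb.t() / temperature

    # Matching pairs lie on the diagonal
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image -> text and text -> image)
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```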
## Model Architecture
This model is based on the **CLIP (Contrastive Language-Image Pretraining) framework** and is designed to learn **joint representations of text and image modalities** for e-commerce fashion products.

### **Components**
- **Image Encoder (`ImageEncoderHead`)**
  - Uses a **Vision Transformer (ViT) backbone** for image feature extraction (see the backbone sketch after this list)
  - Fully connected (FC) layers project the features to a **512-dimensional space**
- **Text Encoder (`TextEncoderHead`)**
  - Uses a **Transformer-based text encoder**
  - Extracts text features and projects them to the same **512-dimensional space**
- **`CLIPModel`**
  - Combines the image and text encoders
  - Computes one embedding per modality for contrastive learning

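The encoder heads take a pretrained backbone as a constructor argument (see the model definition below), but the card does not state which checkpoints were used. The sketch below shows one plausible pairing from the `transformers` library, chosen only because it matches the dimensions the heads expect (768-dimensional pooled image features and 768-dimensional token features over 128 tokens); the checkpoint names and the example caption are assumptions, not the documented training setup.

```python
from transformers import ViTModel, ViTImageProcessor, BertModel, BertTokenizerFast

# Assumed image backbone: a base ViT whose pooled output is 768-dimensional
image_backbone = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
image_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")

# Assumed text backbone: a BERT-base encoder with 768-dimensional hidden states
text_backbone = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# The text head flattens 128 tokens x 768 dims, so captions must be
# padded/truncated to a fixed length of 128 tokens
encoded = tokenizer(
    ["red cotton summer dress"],  # example caption, not from the dataset
    padding="max_length",
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
```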
## Implementation
### **Model Definition**
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from huggingface_hub import PyTorchModelHubMixin


class ImageEncoderHead(nn.Module, PyTorchModelHubMixin):
    def __init__(self, model):
        super(ImageEncoderHead, self).__init__()
        self.model = model
        # Freeze the pretrained image backbone (transfer learning)
        for param in self.model.parameters():
            param.requires_grad = False
        # Projection head: 768-dim pooled features -> 512-dim embedding
        self.seq1 = nn.Sequential(
            nn.Linear(768, 1000),
            nn.Dropout(0.3),
            nn.ReLU(),
            nn.Linear(1000, 512),
            nn.LayerNorm(512),
        )

    def forward(self, pixel_values):
        # Pooled representation from the vision backbone (768-dim)
        outputs = self.model(pixel_values).pooler_output
        outputs = self.seq1(outputs)
        return outputs.contiguous()


class TextEncoderHead(nn.Module, PyTorchModelHubMixin):
    def __init__(self, model):
        super(TextEncoderHead, self).__init__()
        self.model = model
        # Freeze the pretrained text backbone (transfer learning)
        for param in self.model.parameters():
            param.requires_grad = False
        # Flatten 128 tokens x 768 dims, then project to a 512-dim embedding
        self.seq1 = nn.Sequential(
            nn.Flatten(),
            nn.Linear(768 * 128, 2000),
            nn.Dropout(0.3),
            nn.ReLU(),
            nn.Linear(2000, 512),
            nn.LayerNorm(512),
        )

    def forward(self, input_ids, attention_mask):
        # Token-level features from the text backbone: (batch, 128, 768)
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        outputs = self.seq1(outputs)
        return outputs.contiguous()


class CLIPModel(nn.Module, PyTorchModelHubMixin):
    def __init__(self, text_encoder, image_encoder):
        super(CLIPModel, self).__init__()
        self.text_encoder = text_encoder
        self.image_encoder = image_encoder

    def forward(self, image, input_ids, attention_mask):
        # Return one 512-dim embedding per modality for contrastive training
        ie = self.image_encoder(image)
        te = self.text_encoder(input_ids, attention_mask)
        return ie, te
```
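To make the pieces concrete, here is a hedged end-to-end sketch that wraps the assumed backbones and preprocessors from the earlier backbone snippet in the heads defined above and runs one image-text pair through the model; the placeholder image and caption are illustrative only.

```python
import torch
from PIL import Image

# Wrap the assumed backbones (see the earlier backbone sketch) in the encoder heads
model = CLIPModel(
    text_encoder=TextEncoderHead(text_backbone),
    image_encoder=ImageEncoderHead(image_backbone),
)

# Preprocess one image-text pair
image = Image.new("RGB", (224, 224))  # placeholder image
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
encoded = tokenizer(
    ["red cotton summer dress"],  # placeholder caption
    padding="max_length", truncation=True, max_length=128, return_tensors="pt",
)

with torch.no_grad():
    image_emb, text_emb = model(pixel_values, encoded["input_ids"], encoded["attention_mask"])

print(image_emb.shape, text_emb.shape)  # torch.Size([1, 512]) torch.Size([1, 512])
```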

This model has been pushed to the Hub using the [PyTorchModelHubMixin](https://huggingface.co/docs/huggingface_hub/package_reference/mixins#huggingface_hub.PyTorchModelHubMixin) integration.
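A sketch of that mixin workflow is shown below; the repository id is a placeholder, and because the heads take `nn.Module` backbones as constructor arguments, those arguments are passed again explicitly when reloading.

```python
# Push the trained model to the Hub via the PyTorchModelHubMixin methods
model.push_to_hub("your-username/clip-fashion-ecommerce")  # placeholder repo id

# Reload later, supplying the (non-serializable) backbone modules explicitly
reloaded = CLIPModel.from_pretrained(
    "your-username/clip-fashion-ecommerce",
    text_encoder=TextEncoderHead(text_backbone),
    image_encoder=ImageEncoderHead(image_backbone),
)
```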