---
license: mit
datasets:
- vector-institute/newsmediabias-plus
language:
- en
metrics:
- f1(0.698616087436676)
- precision(0.6369158625602722)
- recall(0.7735527753829956)
- accuracy(0.6247606873512268)
library_name: transformers
co2_eq_emissions:
  emissions: 8
  source: Code Carbon
  training_type: fine-tuning
  geographical_location: Albany, New York
  hardware_used: T4
base_model:
- google-bert/bert-base-uncased
- microsoft/resnet-34
pipeline_tag: custom
tags:
- Social Bias
- Multimodal
---

# Multimodal Bias Classifier

This model is a multimodal classifier that combines text and image inputs to detect potential bias in content. It uses a BERT-based text encoder and a ResNet-34 image encoder whose outputs are fused for classification. A contrastive learning approach was used during training, leveraging CLIP embeddings as guidance to align the text and image representations.

## Model Details

- **Text Encoder**: BERT (`bert-base-uncased`)
- **Image Encoder**: ResNet-34 (`microsoft/resnet-34`)
- **Projection Dimensionality**: 768
- **Fusion Method**: Concatenation (default), Alignment, or Cosine Similarity
- **Loss Functions**: Binary Cross-Entropy for classification, Cosine Embedding Loss for contrastive learning
- **Purpose**: Detecting bias in multimodal content (text + image)

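For orientation, the sketch below traces the tensor shapes through the architecture under the default concatenation fusion. It uses dummy tensors and freshly initialized linear layers in place of the real encoders; the dimensions follow the reference implementation shown later in this card.

```python
# Illustrative shape walk-through only; dummy tensors stand in for encoder outputs.
import torch
from torch import nn

B = 2                                        # arbitrary batch size
text_cls = torch.randn(B, 768)               # BERT [CLS] hidden state
image_map = torch.randn(B, 512, 7, 7)        # ResNet-34 final feature map (224x224 input)

image_vec = image_map.mean(dim=[-2, -1])     # global average pool -> (B, 512)
text_proj = nn.Linear(768, 768)(text_cls)    # text projection     -> (B, 768)
image_proj = nn.Linear(512, 768)(image_vec)  # image projection    -> (B, 768)

fused = torch.cat([text_proj, image_proj], dim=-1)  # concat fusion -> (B, 1536)
fused = nn.Linear(2 * 768, 768)(fused)              # fusion layer  -> (B, 768)
logit = nn.Linear(768, 1)(fused)                    # classifier    -> (B, 1)
print(image_vec.shape, text_proj.shape, fused.shape, logit.shape)
```
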
## Training

The model was trained on a multimodal dataset with labeled instances of biased and unbiased content. The training process incorporated both a classification loss and a contrastive loss, the latter helping to align the text and image representations in a shared latent space.

### Training Losses

- **Classification Loss**: Binary Cross-Entropy (`BCEWithLogitsLoss`) to classify content as biased or unbiased.
- **Contrastive Loss**: `CosineEmbeddingLoss`, which uses CLIP text and image embeddings as ground-truth guidance to align the text and image features (see the sketch below).

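This card does not spell out how the two objectives are weighted or exactly which feature pairs the contrastive term compares, so the following is only a plausible sketch: each projected feature is pulled toward its corresponding frozen CLIP embedding, and the terms are summed with an assumed weight of 0.5. Dummy tensors stand in for real batches.

```python
# Hedged sketch of the combined training objective. The pairing of model
# projections with CLIP embeddings and the 0.5 weighting are assumptions,
# not taken from the repository.
import torch
from torch import nn

bce = nn.BCEWithLogitsLoss()        # classification term
cosine = nn.CosineEmbeddingLoss()   # contrastive/alignment term
contrastive_weight = 0.5            # hypothetical weighting

B, D = 8, 768
logits = torch.randn(B, 1, requires_grad=True)       # classifier output
labels = torch.randint(0, 2, (B, 1)).float()         # biased / unbiased targets

text_proj = torch.randn(B, D, requires_grad=True)    # model text projection
image_proj = torch.randn(B, D, requires_grad=True)   # model image projection
clip_text = torch.randn(B, D)                        # frozen CLIP text embeddings (guidance)
clip_image = torch.randn(B, D)                       # frozen CLIP image embeddings (guidance)

target = torch.ones(B)  # +1: pull each projection toward its CLIP counterpart
loss = (
    bce(logits, labels)
    + contrastive_weight * cosine(text_proj, clip_text, target)
    + contrastive_weight * cosine(image_proj, clip_image, target)
)
loss.backward()  # a real loop would follow with optimizer.step()
```
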
### Excluding CLIP

While the CLIP model was used during training to guide the alignment of the image and text embeddings, the final model does **not** retain CLIP weights, as it is designed to function independently once training is complete.

## How to Load the Model

You can load this model for bias classification with the code below. The model accepts a text input and an image input, processes them through the BERT and ResNet-34 encoders respectively, and outputs a single logit indicating whether the content is likely biased or unbiased.

```python
import json
from typing import Literal

import torch
from torch import nn
from transformers import AutoModel
from huggingface_hub import hf_hub_download


class MultimodalClassifier(nn.Module):
    def __init__(
        self,
        text_encoder_id_or_path: str,
        image_encoder_id_or_path: str,
        projection_dim: int,
        fusion_method: Literal["concat", "align", "cosine_similarity"] = "concat",
        proj_dropout: float = 0.1,
        fusion_dropout: float = 0.1,
        num_classes: int = 1,
    ) -> None:
        super().__init__()

        self.fusion_method = fusion_method
        self.projection_dim = projection_dim
        self.num_classes = num_classes

        ##### Text Encoder #####
        self.text_encoder = AutoModel.from_pretrained(text_encoder_id_or_path)
        self.text_projection = nn.Sequential(
            nn.Linear(self.text_encoder.config.hidden_size, self.projection_dim),
            nn.Dropout(proj_dropout),
        )

        ##### Image Encoder (ResNet-34 via AutoModel) #####
        self.image_encoder = AutoModel.from_pretrained(image_encoder_id_or_path, trust_remote_code=True)
        self.image_encoder.classifier = nn.Identity()  # neutralize any classification head
        self.image_projection = nn.Sequential(
            nn.Linear(512, self.projection_dim),  # ResNet-34 final feature map has 512 channels
            nn.Dropout(proj_dropout),
        )

        ##### Fusion Layer #####
        fusion_input_dim = self.projection_dim * 2 if fusion_method == "concat" else self.projection_dim
        self.fusion_layer = nn.Sequential(
            nn.Dropout(fusion_dropout),
            nn.Linear(fusion_input_dim, self.projection_dim),
            nn.GELU(),
            nn.Dropout(fusion_dropout),
        )

        ##### Classification Layer #####
        self.classifier = nn.Linear(self.projection_dim, self.num_classes)

    def forward(self, pixel_values: torch.Tensor, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        ##### Text Encoder Projection #####
        full_text_features = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask, return_dict=True).last_hidden_state
        full_text_features = full_text_features[:, 0, :]  # use the [CLS] token representation
        full_text_features = self.text_projection(full_text_features)

        ##### Image Encoder Projection #####
        resnet_image_features = self.image_encoder(pixel_values=pixel_values).last_hidden_state

        # global average pooling: (batch, 512, H, W) -> (batch, 512)
        resnet_image_features = resnet_image_features.mean(dim=[-2, -1])
        resnet_image_features = self.image_projection(resnet_image_features)

        ##### Fusion and Classification #####
        if self.fusion_method == "concat":
            fused_features = torch.cat([full_text_features, resnet_image_features], dim=-1)
        else:
            # element-wise product; the align/cosine_similarity paths are not fully supported yet
            fused_features = full_text_features * resnet_image_features

        fused_features = self.fusion_layer(fused_features)
        classification_output = self.classifier(fused_features)

        return classification_output


def load_model():
    config_path = hf_hub_download(repo_id="maximuspowers/multimodal-bias-classifier", filename="config.json")
    with open(config_path, "r") as f:
        config = json.load(f)

    model = MultimodalClassifier(
        text_encoder_id_or_path=config["text_encoder_id_or_path"],
        image_encoder_id_or_path="microsoft/resnet-34",
        projection_dim=config["projection_dim"],
        fusion_method=config["fusion_method"],
        proj_dropout=config["proj_dropout"],
        fusion_dropout=config["fusion_dropout"],
        num_classes=config["num_classes"],
    )

    model_weights_path = hf_hub_download(repo_id="maximuspowers/multimodal-bias-classifier", filename="model_weights.pth")
    checkpoint = torch.load(model_weights_path, map_location=torch.device("cpu"))
    model.load_state_dict(checkpoint, strict=False)

    return model
```
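`load_model()` reads its constructor arguments from the repository's `config.json`. The keys below are the ones the function accesses; the values shown are illustrative assumptions based on the Model Details section, not a copy of the actual file.

```python
# Keys that load_model() expects in config.json. Values here are assumptions
# for illustration; the repository's config.json is the source of truth.
expected_config = {
    "text_encoder_id_or_path": "google-bert/bert-base-uncased",
    "projection_dim": 768,
    "fusion_method": "concat",
    "proj_dropout": 0.1,
    "fusion_dropout": 0.1,
    "num_classes": 1,
}
```
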
## Example Inference

The example below runs a single prediction: it tokenizes a sentence with the BERT tokenizer, preprocesses an image with standard ImageNet normalization, and thresholds the sigmoid of the model's output logit.

```python
import torch
from transformers import AutoTokenizer
from PIL import Image
from torchvision import transforms

model = load_model()
model.eval()

# text input
text_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sample_text = "This is a sample sentence for bias classification."
text_inputs = text_tokenizer(
    sample_text,
    return_tensors="pt",
    padding="max_length",
    truncation=True,
    max_length=512,
)

# image input: resize to 224x224 and apply ImageNet normalization
image = Image.open("./random_image.jpg").convert("RGB")
image_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
image_input = image_transform(image).unsqueeze(0)  # add batch dimension

# run a forward pass
with torch.no_grad():
    classification_output = model(
        pixel_values=image_input,
        input_ids=text_inputs["input_ids"],
        attention_mask=text_inputs["attention_mask"],
    )

predicted_class = torch.sigmoid(classification_output).round().item()
print("Predicted class:", "Biased" if predicted_class == 1 else "Unbiased")
```
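
If a confidence score is more useful than a hard label, the sigmoid of the logit can be reported directly; this is a small addition to the example above, not part of the original script.

```python
# Report the probability of the "Biased" class instead of a hard 0/1 label.
bias_probability = torch.sigmoid(classification_output).item()
print(f"P(biased) = {bias_probability:.3f}")
```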