---
license: gpl-3.0
language:
- en
base_model:
- facebook/esm2_t33_650M_UR50D
tags:
- Cancer
- Transcriptomics
- biology
---
# Fine-tuned ESM2 Protein Classifier (pdac_pred_llm)

This repository contains a fine-tuned ESM2 model for protein sequence classification, published at `shubhamc-iiitd/pdac_pred_llm`. The model predicts binary labels from protein amino acid sequences.

## Model Description

-   **Base Model:** ESM2 (`esm2_t33_650M_UR50D`), fine-tuned.
-   **Fine-tuning Task:** Binary protein classification.
-   **Architecture:** The ESM2 backbone with a linear classification head applied to mean-pooled final-layer embeddings.
-   **Input:** Protein amino acid sequences.
-   **Output:** Two-class logits; the predicted label (0 or 1) is the argmax.

## Repository Contents

-   `pytorch_model.bin`: The trained model weights.
-   `alphabet.bin`: The ESM2 alphabet (used as a tokenizer).
-   `config.json`: Configuration file for the model.
-   `README.md`: This file.
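
For reference, the loading code below reads the classifier dimensions from `config.json`. A minimal example of the expected shape (the exact values are assumptions, not verified against the repository; 1280 is the embedding dimension of `esm2_t33_650M_UR50D`):

```json
{
  "embedding_dim": 1280,
  "num_classes": 2
}
```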

## Usage

### Installation

1.  Install the required libraries (note that the `esm` module is distributed on PyPI as `fair-esm`):

    ```bash
    pip install torch fair-esm biopython huggingface_hub
    ```

### Loading the Model from Hugging Face

```python
import torch
import torch.nn as nn
import esm
from huggingface_hub import hf_hub_download
import json

# Define the model architecture (must match the one used during training)
class ProteinClassifier(nn.Module):
    def __init__(self, esm_model, embedding_dim, num_classes):
        super().__init__()
        self.esm_model = esm_model                       # ESM2 backbone
        self.fc = nn.Linear(embedding_dim, num_classes)  # linear classification head

    def forward(self, tokens):
        # Run the frozen ESM2 backbone and keep the final-layer (33) representations
        with torch.no_grad():
            results = self.esm_model(tokens, repr_layers=[33])
        # Mean-pool residue embeddings over the sequence dimension, then classify
        embeddings = results["representations"][33].mean(1)
        output = self.fc(embeddings)
        return output

# Download the model files from Hugging Face
repo_id = "shubhamc-iiitd/pdac_pred_llm"
model_weights_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin")
alphabet_path = hf_hub_download(repo_id=repo_id, filename="alphabet.bin")
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")

# Load the ESM2 model (used as backbone)
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()

# Load the configuration
with open(config_path, 'r') as f:
    config = json.load(f)

# Initialize the classifier
classifier = ProteinClassifier(model, embedding_dim=config['embedding_dim'], num_classes=config['num_classes'])

# Load the fine-tuned weights (onto CPU first; moved to the target device below)
classifier.load_state_dict(torch.load(model_weights_path, map_location="cpu"))
classifier.eval()

# Load the alphabet saved at training time (replaces the default alphabet created above)
alphabet = torch.load(alphabet_path)
batch_converter = alphabet.get_batch_converter()

# Move the full classifier (ESM2 backbone + classification head) to the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
classifier = classifier.to(device)
```
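
### Running Inference

Once loaded, the classifier can be applied to new sequences via the alphabet's batch converter. The sketch below is illustrative only: the sequence is a placeholder, and the biological meaning of labels 0 and 1 follows the training setup, which this card does not specify.

```python
# Minimal inference sketch (placeholder sequence)
data = [("example_protein", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
batch_labels, batch_strs, batch_tokens = batch_converter(data)
batch_tokens = batch_tokens.to(device)

with torch.no_grad():
    logits = classifier(batch_tokens)   # shape: (batch_size, num_classes)
    predicted = logits.argmax(dim=-1)   # predicted label: 0 or 1

print(predicted.tolist())
```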