---
license: gpl-3.0
language:
- en
base_model:
- facebook/esm2_t6_8M_UR50D
tags:
- Cancer
- Transcriptomics
- biology
---

# Fine-tuned ESM2 Protein Classifier (pdac_pred_llm)

This repository contains a fine-tuned ESM2 model for protein sequence classification, published on Hugging Face as `shubhamc-iiitd/pdac_pred_llm`. The model predicts a binary label from a protein sequence.

## Model Description

- **Base Model:** ESM2-t33-650M-UR50D (fine-tuned)
- **Fine-tuning Task:** Binary protein classification.
- **Architecture:** The ESM2 backbone followed by a linear classification head, applied to the mean of the layer-33 token representations.
- **Input:** Protein amino acid sequences.
- **Output:** Binary classification labels (0 or 1).

## Repository Contents

- `pytorch_model.bin`: The trained model weights.
- `alphabet.bin`: The ESM2 alphabet (used as a tokenizer).
- `config.json`: Configuration file for the model.
- `README.md`: This file.

## Usage

### Installation

Install the required libraries. The `esm` module imported in the loading code is provided by the `fair-esm` package on PyPI:

```bash
pip install torch fair-esm biopython huggingface_hub
```

### Loading the Model from Hugging Face

```python
import json

import esm
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download


# Define the model architecture (same as during training)
class ProteinClassifier(nn.Module):
    def __init__(self, esm_model, embedding_dim, num_classes):
        super().__init__()
        self.esm_model = esm_model
        self.fc = nn.Linear(embedding_dim, num_classes)

    def forward(self, tokens):
        # The backbone runs without gradient tracking
        with torch.no_grad():
            results = self.esm_model(tokens, repr_layers=[33])
        # Mean-pool the layer-33 token representations into one vector per sequence
        embeddings = results["representations"][33].mean(1)
        output = self.fc(embeddings)
        return output


# Download the model files from Hugging Face
repo_id = "shubhamc-iiitd/pdac_pred_llm"
model_weights_path = hf_hub_download(repo_id=repo_id, filename="pytorch_model.bin")
alphabet_path = hf_hub_download(repo_id=repo_id, filename="alphabet.bin")
config_path = hf_hub_download(repo_id=repo_id, filename="config.json")

# Load the ESM2 model (used as backbone)
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()

# Load the configuration
with open(config_path, "r") as f:
    config = json.load(f)

# Initialize the classifier
classifier = ProteinClassifier(
    model,
    embedding_dim=config["embedding_dim"],
    num_classes=config["num_classes"],
)

# Load the model weights (mapped to CPU first so this also works without a GPU)
classifier.load_state_dict(torch.load(model_weights_path, map_location="cpu"))
classifier.eval()

# Load the alphabet. alphabet.bin is assumed to hold a pickled Alphabet object
# rather than tensors, which recent PyTorch versions only unpickle with
# weights_only=False. Only do this for files you trust.
alphabet = torch.load(alphabet_path, weights_only=False)
batch_converter = alphabet.get_batch_converter()

# Move the whole classifier (backbone and linear head) to the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
classifier = classifier.to(device)
```
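
The loading code stops once `classifier` and `batch_converter` exist. As a minimal sketch of the next step, the helpers below turn raw sequences into predicted labels. `to_batch`, `softmax`, and `predict` are hypothetical names introduced here for illustration and are not part of this repository. The sketch assumes that `config["num_classes"]` is 2, that the predicted label is the index of the largest logit, and that inputs contain only the 20 standard amino acids.

```python
import math

# Assumption of this sketch: only the 20 standard amino acids are accepted.
# Relax this set if your sequences contain ambiguity codes such as X.
STANDARD_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")


def to_batch(sequences):
    """Build the (name, sequence) pairs that the ESM batch converter expects."""
    batch = []
    for i, seq in enumerate(sequences):
        seq = seq.strip().upper()
        unknown = set(seq) - STANDARD_RESIDUES
        if not seq or unknown:
            raise ValueError(
                f"sequence {i} is empty or has unexpected residues: {sorted(unknown)}"
            )
        batch.append((f"seq{i}", seq))
    return batch


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(classifier, batch_converter, sequences, device="cpu"):
    """Return one (label, probability of that label) pair per sequence."""
    _, _, tokens = batch_converter(to_batch(sequences))
    # forward() disables gradients only for the backbone, so detach the
    # logits before converting them to plain Python lists.
    logits = classifier(tokens.to(device)).detach().cpu().tolist()
    results = []
    for row in logits:
        probs = softmax(row)
        label = max(range(len(probs)), key=probs.__getitem__)
        results.append((label, probs[label]))
    return results


# Usage with the objects created by the loading code (placeholder sequence):
# classifier.to(device)  # make sure the linear head is on the same device
# print(predict(classifier, batch_converter, ["MKTAYIAKQR"], device))
```

Because `forward` averages over every token position, padding positions in a batch of unequal-length sequences also enter the mean. Scoring sequences one at a time, or in batches of equal length, may therefore give more reproducible results.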