---
library_name: transformers
license: apache-2.0
metrics:
- perplexity
base_model:
- facebook/esm1b_t33_650M_UR50S
---

## **Pretraining on Phosphosites and their MSAs with MLM Objective on ESM-1b Architecture**

This repository provides an ESM-1b model whose weights are initialized **from scratch** (not from the released ESM-1b checkpoint) and pretrained with the Masked Language Modeling (MLM) objective. The training data consists of labeled phosphosites derived from [DARKIN](https://openreview.net/forum?id=a4x5tbYRYV) and their Multiple Sequence Alignments (MSAs).

### **Developed by**

Zeynep Işık (MSc, Sabanci University)

### **Training Details**

- Architecture: ESM-1b (trained from scratch)
- Pretraining Objective: Masked Language Modeling (MLM)
- Dataset: labeled phosphosites from [DARKIN](https://openreview.net/forum?id=a4x5tbYRYV) and their MSAs
- Total Samples: 702,468 (10% held out for validation)
- Sequence Length: ≤ 128 residues
- Batch Size: 64
- Optimizer: AdamW
- Learning Rate: default
- Training Duration: 3.5 days

### **Pretraining Performance**

- Perplexity at Start: 12.32
- Perplexity at End: 1.44

This substantial decrease in perplexity indicates that the model learned meaningful representations of phosphosite-related sequences.

### **Potential Use Cases**

This pretrained model can be used for downstream tasks requiring phosphosite knowledge, such as:

✅ Binary classification of phosphosites

✅ Kinase-specific phosphorylation site prediction

✅ Protein-protein interaction prediction involving phosphosites
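### **Example: Masked-Residue Prediction**

Because the checkpoint follows the standard `transformers` masked-LM interface, it can be loaded with the Auto classes. The sketch below predicts a single masked residue; the repository ID and the peptide sequence are illustrative placeholders, not values taken from this card.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Placeholder Hub ID -- substitute the actual repository name of this checkpoint.
model_id = "username/esm1b-phosphosite-mlm"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

# Illustrative phosphosite-centered peptide with one residue masked out.
sequence = "MKTAYIAKQR" + tokenizer.mask_token + "DILNNAGQVK"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Most likely amino acid at the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_token))
```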
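### **Example: Computing Perplexity**

Perplexity for an MLM is the exponential of the average cross-entropy loss over masked positions, so the values reported above can be reproduced on held-out sequences along these lines. The sketch below assumes standard 15% random masking via `DataCollatorForLanguageModeling`; the repository ID, the toy sequences, and the masking rate are assumptions rather than details from this card.

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

model_id = "username/esm1b-phosphosite-mlm"  # placeholder Hub ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

# Standard MLM masking (15% is assumed here, not stated in the card).
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

# Toy sequences standing in for held-out phosphosite/MSA sequences.
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "GSHMRGSAAALQSRRSPSP"]
encodings = [tokenizer(s, truncation=True, max_length=128) for s in sequences]
batch = collator(encodings)  # pads, masks, and builds MLM labels

with torch.no_grad():
    loss = model(**batch).loss  # mean cross-entropy over masked positions

print(f"perplexity: {math.exp(loss.item()):.2f}")
```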
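### **Example: Fine-Tuning for Binary Phosphosite Classification**

For the downstream use cases listed above, the pretrained encoder can be wrapped with a sequence-classification head and fine-tuned on labeled data. The sketch below only shows the model setup and a forward pass; the repository ID is a placeholder, and the classification head is freshly initialized, so it must be trained before its outputs are meaningful.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "username/esm1b-phosphosite-mlm"  # placeholder Hub ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels=2 attaches a new binary classification head on top of the pretrained encoder.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("MKTAYIAKQRDILNNAGQVK", truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 2); untrained head, so scores are arbitrary
print(logits)
```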