---
library_name: transformers
license: apache-2.0
metrics:
- perplexity
base_model:
- facebook/esm1b_t33_650M_UR50S
---

## **Pretraining on Phosphosites and their MSAs with MLM Objective on ESM-1b Architecture**

This repository provides an ESM-1b model whose weights are initialized **from scratch** (not from the released ESM-1b checkpoint) and pretrained with the Masked Language Modeling (MLM) objective. The training data consists of labeled phosphosites derived from [DARKIN](https://openreview.net/forum?id=a4x5tbYRYV) and their Multiple Sequence Alignments (MSAs).

### **Developed by**

Zeynep Işık (MSc, Sabanci University)

### **Training Details**

- Architecture: ESM-1b (trained from scratch)
- Pretraining Objective: Masked Language Modeling (MLM)
- Dataset: labeled phosphosites from [DARKIN](https://openreview.net/forum?id=a4x5tbYRYV) and their MSAs
- Total Samples: 702,468 (10% held out for validation)
- Sequence Length: ≤ 128 residues
- Batch Size: 64
- Optimizer: AdamW
- Learning Rate: default
- Training Duration: 3.5 days

### **Pretraining Performance**

- Perplexity at Start: 12.32
- Perplexity at End: 1.44

This substantial decrease in perplexity indicates that the model learned meaningful representations of phosphosite-related sequences.

### **Potential Use Cases**

This pretrained model can be used for downstream tasks requiring phosphosite knowledge, such as:

✅ Binary classification of phosphosites

✅ Kinase-specific phosphorylation site prediction

✅ Protein-protein interaction prediction involving phosphosites
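### **Example: Masked-Residue Prediction**

Because the checkpoint follows the standard `transformers` masked-LM interface, it can be loaded with the Auto classes. The sketch below predicts a single masked residue; the repository ID and the peptide sequence are illustrative placeholders, not values taken from this card.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Placeholder Hub ID -- substitute the actual repository name of this checkpoint.
model_id = "username/esm1b-phosphosite-mlm"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

# Illustrative phosphosite-centered peptide with one residue masked out.
sequence = "MKTAYIAKQR" + tokenizer.mask_token + "DILNNAGQVK"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Most likely amino acid at the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_token))
```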
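### **Example: Computing Perplexity**

Perplexity for an MLM is the exponential of the average cross-entropy loss over masked positions, so the values reported above can be reproduced on held-out sequences along these lines. The sketch below assumes standard 15% random masking via `DataCollatorForLanguageModeling`; the repository ID, the toy sequences, and the masking rate are assumptions rather than details from this card.

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

model_id = "username/esm1b-phosphosite-mlm"  # placeholder Hub ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

# Standard MLM masking (15% is assumed here, not stated in the card).
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

# Toy sequences standing in for held-out phosphosite/MSA sequences.
sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "GSHMRGSAAALQSRRSPSP"]
encodings = [tokenizer(s, truncation=True, max_length=128) for s in sequences]
batch = collator(encodings)  # pads, masks, and builds MLM labels

with torch.no_grad():
    loss = model(**batch).loss  # mean cross-entropy over masked positions

print(f"perplexity: {math.exp(loss.item()):.2f}")
```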
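### **Example: Fine-Tuning for Binary Phosphosite Classification**

For the downstream use cases listed above, the pretrained encoder can be wrapped with a sequence-classification head and fine-tuned on labeled data. The sketch below only shows the model setup and a forward pass; the repository ID is a placeholder, and the classification head is freshly initialized, so it must be trained before its outputs are meaningful.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "username/esm1b-phosphosite-mlm"  # placeholder Hub ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels=2 attaches a new binary classification head on top of the pretrained encoder.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("MKTAYIAKQRDILNNAGQVK", truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 2); untrained head, so scores are arbitrary
print(logits)
```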