ElnaggarLab/ankh2-ext1

+---
+license: cc-by-nc-sa-4.0
+tags:
+- biology
+- protein
+- protein language model
+- protein embedding
+datasets:
+- agemagician/uniref50
+---
+# Important
+The model will be uploaded soon; please stay tuned.
+# ANKH2-Large model
+A model pretrained on protein sequences using a masked language modeling (MLM) objective. It was introduced in
+[this paper](https://arxiv.org/abs/2301.06568) and first released in
+[this repository](https://github.com/agemagician/Ankh). The model was trained on uppercase amino acids only, so input sequences must use capital-letter amino acid codes.
+## Model description
+ANKH2-Large is based on the `ANKH-Large` model and was pretrained on a large corpus of protein sequences in a self-supervised fashion.
+This means it was pretrained on raw protein sequences only, with no human labelling of any kind (which is why it can use large amounts of
+publicly available data); an automatic process generates the inputs and labels from those protein sequences.
+Two important differences between this ANKH2-Large model and the original ANKH-Large version are:
+1. The model was trained for more epochs.
+2. The activation function was changed to SiLU.
+It has been shown that the features extracted from this self-supervised model (LM-embeddings) capture important biophysical properties governing protein shape.
+This implies that the model learned some of the grammar of the language of life realized in protein sequences.
+## Intended uses & limitations
+The model can be used for protein feature extraction or fine-tuned on downstream tasks.
+We have noticed that in some tasks you can gain accuracy by fine-tuning the model with the LoRA method rather than using it as a feature extractor.
+We have also noticed that for feature extraction, it is better to use the features extracted from the encoder rather than from the decoder.
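To give a rough sense of why LoRA counts as parameter-efficient, the sketch below compares the size of one full weight matrix with the two low-rank factors that LoRA trains in its place. The rank of 8 and the square 1536 x 1536 projection are hypothetical values chosen for illustration (1536 is the embedding width reported in the usage example); they are not the settings used for the reported results.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the two low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


d_model = 1536  # embedding width reported in the usage example
rank = 8        # hypothetical LoRA rank, not taken from the model card

full = d_model * d_model  # a frozen, hypothetical square projection matrix
lora = lora_trainable_params(d_model, d_model, rank)
print(f"full matrix: {full:,} parameters")
print(f"LoRA factors: {lora:,} parameters ({lora / full:.2%} of the full matrix)")
```

Under these assumptions, only about one percent of the parameters of each adapted matrix are trained, while the pretrained weights stay frozen.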
+### How to use
+Here is how to use this model to extract the features of a given protein sequence in PyTorch:
+```python
+import torch
+from transformers import AutoTokenizer, T5EncoderModel
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+# Assumption: the checkpoint id and model class below are placeholders until the weights are uploaded;
+# an encoder-only class is used because encoder features are recommended for feature extraction.
+model_name = "ElnaggarLab/ankh2-ext1"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = T5EncoderModel.from_pretrained(model_name).to(device).eval()
+sequence_examples = ["PRTEINO", "SEQWENCE"]
+# tokenize sequences and pad up to the longest sequence in the batch
+ids = tokenizer.batch_encode_plus(sequence_examples, add_special_tokens=True, padding="longest")
+input_ids = torch.tensor(ids['input_ids']).to(device)
+attention_mask = torch.tensor(ids['attention_mask']).to(device)
+# generate embeddings
+with torch.no_grad():
+    embedding_repr = model(input_ids=input_ids, attention_mask=attention_mask)
+# extract embeddings for the first ([0,:]) sequence in the batch while removing padded & special tokens ([0,:7])
+emb_0 = embedding_repr.last_hidden_state[0, :7]  # shape (7 x 1536)
+print(f"Shape of per-residue embedding of first sequence: {emb_0.shape}")
+# do the same for the second ([1,:]) sequence in the batch while taking into account different sequence lengths ([1,:8])
+emb_1 = embedding_repr.last_hidden_state[1, :8]  # shape (8 x 1536)
+# if you want to derive a single representation (per-protein embedding) for the whole protein
+emb_0_per_protein = emb_0.mean(dim=0)  # shape (1536)
+print(f"Shape of per-protein embedding of first sequence: {emb_0_per_protein.shape}")
+```
+## Training data
+The ANKH2-Large model was pretrained on [UniRef50](https://www.uniprot.org/help/uniref), a dataset consisting of 60 million protein sequences.
+## Training procedure
+### Preprocessing
+The protein sequences are uppercased and tokenized using a single space and a vocabulary size of 25.
+The inputs of the model are then of the form:
+```
+Protein Sequence </s>
+```
+The preprocessing step was performed on the fly, by truncating or padding the protein sequences to 512 tokens.
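As a minimal sketch of this step, the helper below uppercases a sequence, truncates it, appends `</s>`, and pads to a fixed length. The function name `prepare_input`, the `<pad>` token name (the usual T5 convention), the choice to keep the start of the sequence when truncating, and counting `</s>` toward the 512-token limit are all assumptions made for illustration; the actual pipeline may differ.

```python
def prepare_input(sequence, max_length=512, eos="</s>", pad="<pad>"):
    """Uppercase, truncate, append the end-of-sequence token, then pad to max_length tokens."""
    residues = list(sequence.upper())[: max_length - 1]  # leave room for the </s> token
    tokens = residues + [eos]
    n_padding = max_length - len(tokens)
    attention_mask = [1] * len(tokens) + [0] * n_padding  # 1 = real token, 0 = padding
    return tokens + [pad] * n_padding, attention_mask


# a short toy length keeps the printed output readable
tokens, attention_mask = prepare_input("mktayiakqr", max_length=16)
print(tokens)
print(attention_mask)
```

The attention mask marks which positions hold real tokens, so padded positions can be ignored by the model.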
+The details of the masking procedure for each sequence are as follows:
+- 20% of the amino acids are masked.
+- Every masked amino acid is replaced by an `<extra_id_num>` token, where `num` is a number between 0 and 115.
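The masking rule above can be sketched as follows. This is an illustration, not the training code: it assumes each masked residue receives its own sentinel (adjacent masked residues are not merged into one span) and that the target follows the usual T5 layout of each sentinel followed by the residue it hides. The function name `mask_sequence` and the fixed seed are hypothetical.

```python
import random


def mask_sequence(sequence, mask_rate=0.2, max_sentinels=116, seed=0):
    """Mask single residues; return (encoder input tokens, decoder target tokens)."""
    residues = list(sequence.upper())
    n_mask = min(round(mask_rate * len(residues)), max_sentinels)
    rng = random.Random(seed)
    masked_positions = sorted(rng.sample(range(len(residues)), n_mask))

    inputs, targets = list(residues), []
    for sentinel_id, pos in enumerate(masked_positions):
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs[pos] = sentinel                # hide the residue in the encoder input
        targets += [sentinel, residues[pos]]  # pair the sentinel with the hidden residue
    return inputs + ["</s>"], targets + ["</s>"]


inputs, targets = mask_sequence("MKTAYIAKQR")
print(" ".join(inputs))
print(" ".join(targets))
```

At the maximum length of 512 tokens, a 20% masking rate gives about 102 masked positions, which fits within the 116 sentinel ids (0 to 115).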
+### Pretraining
+The model was trained on a single TPU Pod V4-256 for 45 epochs in total, using a sequence length of 512 (batch size 1k).
+It was initialized from the ANKH-Large checkpoint rather than trained from scratch.
+It has approximately 2B parameters in total and uses an encoder-decoder architecture.
+The optimizer is Adafactor with a linear warmup followed by a linear decay learning rate schedule for pre-training.
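A linear warmup followed by linear decay can be written as a small function of the training step. The peak learning rate, warmup length, and total step count below are hypothetical placeholders, since the model card does not state them, and the function name is made up for this sketch.

```python
def linear_warmup_linear_decay(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at `step`: ramp from 0 up to peak_lr, then decay linearly back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# hypothetical settings, for illustration only
for step in (0, 500, 1000, 5500, 10000):
    lr = linear_warmup_linear_decay(step, peak_lr=1e-3, warmup_steps=1000, total_steps=10000)
    print(step, lr)
```

The rate peaks exactly at the end of warmup and reaches zero at the final step.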
+## Evaluation results
+When the model is used for feature extraction ("FE") or parameter-efficient fine-tuning ("LoRA"), it achieves the following results:
+Test results:
+| Task/Dataset | Method | Secondary structure (3-states) | Secondary structure (8-states) | Localization | Membrane | Solubility | Fluorescence |
+|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|:-----:|
+|   CASP12  | FE | coming soon | coming soon |    |    |    |    |
+|   CASP12  | LoRA | coming soon | coming soon |    |    |    |    |
+|   TS115   | FE | coming soon | coming soon |    |    |    |    |
+|   TS115   | LoRA | coming soon | coming soon |    |    |    |    |
+|   CB513   | FE | coming soon | coming soon |    |    |    |    |
+|   CB513   | LoRA | coming soon | coming soon |    |    |    |    |
+|  DeepLoc  | FE |    |    | coming soon | coming soon |    |    |
+|  DeepLoc  | LoRA |    |    | coming soon | coming soon |    |    |
+|  Solubility  | FE |    |    |    |    | coming soon |    |
+|  Solubility  | LoRA |    |    |    |    | 74% |    |
+|  Fluorescence  | FE |    |    |    |    |    | coming soon |
+|  Fluorescence  | LoRA |    |    |    |    |    | 68% |
+### BibTeX entry and citation info
+```bibtex
+@article{elnaggar2023ankh,
+  title={Ankh☥: Optimized protein language model unlocks general-purpose modelling},
+  author={Elnaggar, Ahmed and Essam, Hazem and Salah-Eldin, Wafaa and Moustafa, Walid and Elkerdawy, Mohamed and Rochereau, Charlotte and Rost, Burkhard},
+  journal={bioRxiv},
+  pages={2023--01},
+  year={2023},
+  publisher={Cold Spring Harbor Laboratory}
+}
+```
+> Created by [Ahmed Elnaggar/@Elnaggar_AI](https://twitter.com/Elnaggar_AI) | [LinkedIn](https://www.linkedin.com/in/prof-ahmed-elnaggar/)