---
language: en
tags:
- protein
- protbert
- masked-language-modeling
- bioinformatics
- sequence-prediction
datasets:
- custom
license: mit
library_name: transformers
pipeline_tag: fill-mask
---

# ProtBERT-Unmasking

This model is a fine-tuned version of ProtBERT optimized for unmasking protein sequences: it predicts masked amino acids from the surrounding sequence context.

## Model Description

- **Base Model**: ProtBERT
- **Task**: Protein Sequence Unmasking
- **Training**: Fine-tuned on masked protein sequences
- **Use Case**: Predicting missing or masked amino acids in protein sequences
- **Optimal Use**: Best performance on E. coli sequences in which the amino acids K, C, Y, H, S, and M are known

For detailed information about the training methodology and approach, please refer to our paper:
[https://arxiv.org/abs/2408.00892](https://arxiv.org/abs/2408.00892)

## Usage

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("your-username/protbert-sequence-unmasking")
tokenizer = AutoTokenizer.from_pretrained("your-username/protbert-sequence-unmasking")

# Example usage for E. coli sequence with known amino acids (K,C,Y,H,S,M)
# ProtBERT's tokenizer expects residues separated by spaces
# (assumed unchanged in this fine-tuned checkpoint)
sequence = "M A L N [MASK] K F G P [MASK] L V R K"
inputs = tokenizer(sequence, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
predictions = outputs.logits  # shape: (batch, sequence_length, vocab_size)
```
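
The logits returned by the model have shape `(batch, sequence_length, vocab_size)`, so obtaining amino-acid predictions means taking a softmax over the vocabulary at each `[MASK]` position. The sketch below illustrates that step on plain Python lists so it can be tried without downloading a model. `top_predictions`, its arguments, and the toy vocabulary are illustrative names and data, not part of the transformers API or of this model.

```python
import math


def top_predictions(logits, input_ids, mask_token_id, id_to_token, top_k=3):
    """Return the top_k (token, probability) pairs for every masked position.

    logits: per-position vocabulary scores, shape [seq_len][vocab_size]
    input_ids: token ids of the input, length seq_len
    id_to_token: callable mapping a vocabulary id to its token string
    """
    results = {}
    for position, token_id in enumerate(input_ids):
        if token_id != mask_token_id:
            continue  # only masked positions are decoded
        scores = logits[position]
        # Numerically stable softmax over the vocabulary dimension
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        probs = [e / total for e in exps]
        # Rank vocabulary ids by probability, highest first
        ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
        results[position] = [(id_to_token(i), probs[i]) for i in ranked[:top_k]]
    return results


# Toy data with a hypothetical 4-token vocabulary (not the real ProtBERT vocabulary)
toy_vocab = ["[MASK]", "K", "C", "Y"]
toy_logits = [
    [0.0, 5.0, 0.0, 0.0],  # unmasked position, ignored
    [0.0, 1.0, 3.0, 2.0],  # masked position
]
toy_predictions = top_predictions(toy_logits, [1, 0], 0, toy_vocab.__getitem__, top_k=2)
print(toy_predictions)
```

With a loaded model, the same function could presumably be called with `outputs.logits[0].tolist()`, `inputs["input_ids"][0].tolist()`, `tokenizer.mask_token_id`, and `tokenizer.convert_ids_to_tokens`.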

## Inference API

The model is optimized for:

- **Organism**: E. coli
- **Known Amino Acids**: K, C, Y, H, S, M
- **Task**: Predicting unknown amino acids in a sequence
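
One way to prepare an input for this setting is to keep the known residues and replace everything else with the mask token. The helper below is a minimal sketch under that assumption; the exact masking scheme used for training is described in the paper, and `mask_unknown_residues` and `KNOWN_RESIDUES` are hypothetical names introduced here for illustration.

```python
KNOWN_RESIDUES = frozenset("KCYHSM")  # known amino acids listed in this card


def mask_unknown_residues(sequence, known=KNOWN_RESIDUES, mask_token="[MASK]"):
    """Keep residues in `known` and replace every other residue with the mask token.

    Returns a space-separated string, the input format ProtBERT-style
    tokenizers expect. Whitespace in the input is ignored.
    """
    tokens = [
        residue if residue in known else mask_token
        for residue in sequence.upper()
        if not residue.isspace()
    ]
    return " ".join(tokens)


masked = mask_unknown_residues("MKYAH")
print(masked)  # M K Y [MASK] H
```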

Example API usage:

```python
from transformers import pipeline

unmasker = pipeline('fill-mask', model='your-username/protbert-sequence-unmasking')
# Residues are space-separated, as ProtBERT's tokenizer expects
sequence = "K [MASK] Y H S [MASK]"  # Example with known amino acids K,Y,H,S
results = unmasker(sequence)

# With several [MASK] tokens, the pipeline returns one list of candidates per mask
for position, candidates in enumerate(results, start=1):
    for result in candidates:
        print(f"Mask {position}: predicted amino acid: {result['token_str']}, Score: {result['score']:.3f}")
```
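
When an input contains several `[MASK]` tokens, the fill-mask pipeline returns one candidate list per mask; with a single mask it returns a flat list of candidates. The sketch below shows one possible way to turn either shape into a completed sequence by taking the top-scoring candidate at each position. `fill_masks` is a hypothetical helper, and `mock_results` is made-up data in the pipeline's output shape, not actual model output.

```python
def fill_masks(masked_sequence, results, mask_token="[MASK]"):
    """Replace each mask token with the top-scoring prediction for that position.

    `results` follows the fill-mask pipeline's output for a single input:
    a flat list of dicts for one mask, or a list of such lists for several masks.
    """
    # Normalise the single-mask case to the nested shape
    if results and isinstance(results[0], dict):
        results = [results]
    best_tokens = [max(c, key=lambda r: r["score"])["token_str"] for c in results]
    remaining = iter(best_tokens)
    filled = []
    for token in masked_sequence.split():
        filled.append(next(remaining) if token == mask_token else token)
    return "".join(filled)


# Made-up candidates for two masked positions
mock_results = [
    [{"token_str": "A", "score": 0.6}, {"token_str": "G", "score": 0.3}],
    [{"token_str": "M", "score": 0.2}, {"token_str": "L", "score": 0.7}],
]
print(fill_masks("K [MASK] Y H S [MASK]", mock_results))  # KAYHSL
```

Greedy per-position selection ignores dependencies between masked positions, so it is best read as a baseline rather than a recommended decoding strategy.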

## Limitations and Biases

- This model is specifically designed for protein sequence unmasking in E. coli
- Optimal performance is achieved when working with sequences containing known amino acids K, C, Y, H, S, M
- The model may not perform optimally for:
  - Sequences from other organisms
  - Sequences without the specified known amino acids
  - Other protein-related tasks

## Training Details

The complete details of the training methodology, dataset preparation, and model evaluation can be found in our paper:
[https://arxiv.org/abs/2408.00892](https://arxiv.org/abs/2408.00892)