---
language: en
tags:
- protein
- protbert
- masked-language-modeling
- bioinformatics
- sequence-prediction
datasets:
- custom
license: mit
library_name: transformers
pipeline_tag: fill-mask
---
# ProtBERT-Unmasking
This model is a fine-tuned version of ProtBERT specifically optimized for unmasking protein sequences. It can predict masked amino acids in protein sequences based on the surrounding context.
## Model Description
- **Base Model**: ProtBERT
- **Task**: Protein Sequence Unmasking
- **Training**: Fine-tuned on masked protein sequences
- **Use Case**: Predicting missing or masked amino acids in protein sequences
- **Optimal Use**: Performs best on E. coli sequences in which the amino acids K, C, Y, H, S, and M are known

For detailed information about the training methodology and approach, please refer to our paper:
[https://arxiv.org/abs/2408.00892](https://arxiv.org/abs/2408.00892)
## Usage
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForMaskedLM.from_pretrained("your-username/protbert-sequence-unmasking")
tokenizer = AutoTokenizer.from_pretrained("your-username/protbert-sequence-unmasking")

# Example usage for an E. coli sequence with known amino acids (K, C, Y, H, S, M).
# ProtBERT tokenizers expect residues separated by spaces; unspaced input maps to [UNK].
sequence = "M A L N [MASK] K F G P [MASK] L V R K"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)
predictions = outputs.logits  # shape: (batch, sequence_length, vocab_size)
```
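The logits above still need to be turned into amino acid predictions. The following is a minimal sketch of one way to do that, not necessarily the procedure used in the paper: it applies a softmax at each masked position and keeps the top `k` tokens. The helper name `top_k_at_masks` and the toy vocabulary are hypothetical and exist only for illustration.

```python
import math

def top_k_at_masks(logits, token_ids, mask_id, id_to_token, k=3):
    """Return the k most probable tokens for every masked position.

    logits: one row of vocabulary scores per token position
    token_ids: the input ids for the same positions
    """
    results = {}
    for pos, (row, tok) in enumerate(zip(logits, token_ids)):
        if tok != mask_id:
            continue  # only masked positions are scored
        # Numerically stable softmax over the vocabulary
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        probs = [e / total for e in exps]
        ranked = sorted(range(len(row)), key=lambda i: probs[i], reverse=True)[:k]
        results[pos] = [(id_to_token[i], probs[i]) for i in ranked]
    return results

# Toy data (not real model output): a 4-token vocabulary and 3 positions
toy_vocab = {0: "[MASK]", 1: "K", 2: "A", 3: "L"}
toy_ids = [1, 0, 3]
toy_logits = [[0.0, 5.0, 0.0, 0.0], [0.0, 1.0, 3.0, 2.0], [0.0, 0.0, 0.0, 5.0]]
toy_predictions = top_k_at_masks(toy_logits, toy_ids, mask_id=0, id_to_token=toy_vocab, k=2)
```

With the real model, the same helper could be fed `predictions[0].tolist()`, `inputs["input_ids"][0].tolist()`, `tokenizer.mask_token_id`, and an id-to-token mapping built by inverting `tokenizer.get_vocab()`.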
## Inference API
The model is optimized for:
- **Organism**: E. coli
- **Known Amino Acids**: K, C, Y, H, S, M
- **Task**: Predicting unknown amino acids in a sequence

Example API usage:
```python
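from transformers import pipeline

unmasker = pipeline('fill-mask', model='your-username/protbert-sequence-unmasking')
# ProtBERT tokenizers expect residues separated by spaces
sequence = "K [MASK] Y H S [MASK]"  # Example with known amino acids K, Y, H, S
results = unmasker(sequence)
# With more than one [MASK], the pipeline returns one candidate list per mask
for position, candidates in enumerate(results):
    for result in candidates:
        print(f"Mask {position}: predicted amino acid: {result['token_str']}, Score: {result['score']:.3f}")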
```
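To prepare an input from a raw sequence, the residues outside the known set can be replaced with `[MASK]`. The sketch below assumes that every residue other than K, C, Y, H, S, M should be masked and that residues are space separated, as ProtBERT tokenizers expect. The helper `mask_unknown_residues` is a hypothetical name, and the paper may prepare inputs differently.

```python
# Known amino acids listed in this card
KNOWN_RESIDUES = frozenset("KCYHSM")

def mask_unknown_residues(sequence, known=KNOWN_RESIDUES, mask_token="[MASK]"):
    """Replace every residue outside `known` with the mask token, space separated."""
    residues = sequence.strip().upper()
    return " ".join(r if r in known else mask_token for r in residues)

# Illustrative sequence only, not taken from a real E. coli protein
masked = mask_unknown_residues("MALNKKFGPC")
print(masked)
```

The resulting string can be passed directly to the tokenizer or to the `fill-mask` pipeline.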
## Limitations and Biases
- This model is specifically designed for protein sequence unmasking in E. coli
- Optimal performance is achieved when working with sequences containing known amino acids K, C, Y, H, S, M
- The model may not perform optimally for:
  - Sequences from other organisms
  - Sequences without the specified known amino acids
  - Other protein-related tasks
## Training Details
The complete details of the training methodology, dataset preparation, and model evaluation can be found in our paper:
[https://arxiv.org/abs/2408.00892](https://arxiv.org/abs/2408.00892)