Upload README.md with huggingface_hub
README.md (changed)
print(attentions[-1].shape) # (2, 20, 11, 11)
```

### Contact prediction
Because attentions can be returned by the naive attention implementation, contact prediction is also supported:
```python
with torch.no_grad():
    contact_map = model.predict_contacts(**tokenized).squeeze().cpu().numpy() # (seq_len, seq_len)
```
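To inspect the result visually, the contact map can be plotted directly. The snippet below is a minimal sketch; it assumes `matplotlib` is installed in your environment and that `contact_map` was computed as above.

```python
import matplotlib.pyplot as plt  # assumption: matplotlib is available, it is not required by the model itself

# contact_map is the (seq_len, seq_len) numpy array returned by model.predict_contacts above
plt.imshow(contact_map, cmap='Greys')
plt.xlabel('Residue index')
plt.ylabel('Residue index')
plt.title('Predicted contact map')
plt.savefig('contact_map.png', dpi=150)
```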
## Embed entire datasets with no new code
To embed a list of protein sequences **fast**, just call `embed_dataset`. Sequences are sorted to reduce padding tokens, so the initial progress bar estimate is usually much longer than the actual time it will take.

Example:
```python
embedding_dict = model.embed_dataset(
    sequences=[
        'MALWMRLLPLLALLALWGPDPAAA', ... # list of protein sequences
    ],
    batch_size=2, # adjust for your GPU memory
    max_len=512, # adjust for your needs
    full_embeddings=False, # if True, no pooling is performed
    embed_dtype=torch.float32, # cast embeddings to the dtype you want
    pooling_type=['mean', 'cls'], # more than one pooling type will be concatenated together
    num_workers=0, # if you have many CPU cores, we find that num_workers=4 is fast for large datasets
    sql=False, # if True, embeddings will be stored in a SQLite database
    sql_db_path='embeddings.db',
    save=True, # if True, embeddings will be saved as a .pth file
    save_path='embeddings.pth',
)
# embedding_dict maps sequences to their embeddings: tensors when saved as .pth, numpy arrays when stored via sql
```
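Once `save=True` has written the dictionary to disk, it can be reloaded later without re-embedding. This is a minimal sketch that assumes the `embeddings.pth` file from the example above exists and that its keys are the raw sequence strings, as the comment above describes.

```python
import torch

# Reload the saved sequence -> embedding dictionary produced by the example above
embedding_dict = torch.load('embeddings.pth')

seq = 'MALWMRLLPLLALLALWGPDPAAA'
emb = embedding_dict[seq]
print(emb.dtype, emb.shape)  # a pooled vector here; the exact shape depends on the pooling settings used
```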

```
model.embed_dataset()
Args:
    sequences: List of protein sequences
    batch_size: Batch size for processing
    max_len: Maximum sequence length
    full_embeddings: Whether to return full residue-wise embeddings (True) or pooled embeddings (False)
    pooling_type: Type of pooling ('mean' or 'cls')
    num_workers: Number of workers for data loading, 0 for the main process
    sql: Whether to store embeddings in a SQLite database - will be stored in float32
    sql_db_path: Path to the SQLite database

Returns:
    Dictionary mapping sequences to embeddings, or None if sql=True

Note:
    - If sql=True, embeddings can only be stored in float32
    - sql is ideal if you need to stream a very large dataset for training in real time
    - save=True is ideal if you can store the entire embedding dictionary in RAM
    - If sql=True, the SQLite database is used regardless of the save setting
    - If your SQLite database or .pth file already exists, it will be scanned first for already-embedded sequences
    - Sequences will be truncated to max_len and sorted by length in descending order for faster processing
```
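For per-residue representations, the same call can be made with `full_embeddings=True`. The sketch below is illustrative only and simply follows the argument descriptions above; exact output shapes (for example, whether special tokens are included) depend on the model.

```python
# Residue-wise embeddings: no pooling, one embedding per token
residue_dict = model.embed_dataset(
    sequences=['MALWMRLLPLLALLALWGPDPAAA'],  # hypothetical single-sequence example
    batch_size=1,
    max_len=512,
    full_embeddings=True,  # return residue-wise embeddings instead of pooled vectors
    embed_dtype=torch.float32,
    num_workers=0,
    sql=False,
    save=False,
)
for seq, emb in residue_dict.items():
    print(len(seq), emb.shape)  # roughly (sequence length, hidden size)
```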