Upload README.md with huggingface_hub
README.md (changed)
print(attentions[-1].shape) # (2, 20, 11, 11)
```

### Contact prediction
Because attentions can be returned by the naive attention implementation, contact prediction is also supported:
```python
with torch.no_grad():
    contact_map = model.predict_contacts(**tokenized).squeeze().cpu().numpy() # (seq_len, seq_len)
```
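To inspect the result visually, the contact map can be plotted directly. The snippet below is a minimal sketch; it assumes `matplotlib` is installed in your environment and that `contact_map` was computed as above.

```python
import matplotlib.pyplot as plt  # assumption: matplotlib is available, it is not required by the model itself

# contact_map is the (seq_len, seq_len) numpy array returned by model.predict_contacts above
plt.imshow(contact_map, cmap='Greys')
plt.xlabel('Residue index')
plt.ylabel('Residue index')
plt.title('Predicted contact map')
plt.savefig('contact_map.png', dpi=150)
```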
## Embed entire datasets with no new code
To embed a list of protein sequences **fast**, just call `embed_dataset`. Sequences are sorted to reduce padding tokens, so the initial progress bar estimate is usually much longer than the actual time it will take.

Example:
```python
embedding_dict = model.embed_dataset(
    sequences=[
        'MALWMRLLPLLALLALWGPDPAAA', ... # list of protein sequences
    ],
    batch_size=2, # adjust for your GPU memory
    max_len=512, # adjust for your needs
    full_embeddings=False, # if True, no pooling is performed
    embed_dtype=torch.float32, # cast embeddings to the dtype you want
    pooling_type=['mean', 'cls'], # more than one pooling type will be concatenated together
    num_workers=0, # if you have many CPU cores, we find that num_workers=4 is fast for large datasets
    sql=False, # if True, embeddings will be stored in a SQLite database
    sql_db_path='embeddings.db',
    save=True, # if True, embeddings will be saved as a .pth file
    save_path='embeddings.pth',
)
# embedding_dict maps sequences to their embeddings: tensors when saved as .pth, numpy arrays when stored via sql
```
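Once `save=True` has written the dictionary to disk, it can be reloaded later without re-embedding. This is a minimal sketch that assumes the `embeddings.pth` file from the example above exists and that its keys are the raw sequence strings, as the comment above describes.

```python
import torch

# Reload the saved sequence -> embedding dictionary produced by the example above
embedding_dict = torch.load('embeddings.pth')

seq = 'MALWMRLLPLLALLALWGPDPAAA'
emb = embedding_dict[seq]
print(emb.dtype, emb.shape)  # a pooled vector here; the exact shape depends on the pooling settings used
```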

```
model.embed_dataset()
Args:
    sequences: List of protein sequences
    batch_size: Batch size for processing
    max_len: Maximum sequence length
    full_embeddings: Whether to return full residue-wise embeddings (True) or pooled embeddings (False)
    pooling_type: Type of pooling ('mean' or 'cls')
    num_workers: Number of workers for data loading, 0 for the main process
    sql: Whether to store embeddings in a SQLite database - will be stored in float32
    sql_db_path: Path to the SQLite database

Returns:
    Dictionary mapping sequences to embeddings, or None if sql=True

Note:
    - If sql=True, embeddings can only be stored in float32
    - sql is ideal if you need to stream a very large dataset for training in real time
    - save=True is ideal if you can store the entire embedding dictionary in RAM
    - If sql=True, the SQLite database is used regardless of the save setting
    - If your SQLite database or .pth file already exists, it will be scanned first for already-embedded sequences
    - Sequences will be truncated to max_len and sorted by length in descending order for faster processing
```
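For per-residue representations, the same call can be made with `full_embeddings=True`. The sketch below is illustrative only and simply follows the argument descriptions above; exact output shapes (for example, whether special tokens are included) depend on the model.

```python
# Residue-wise embeddings: no pooling, one embedding per token
residue_dict = model.embed_dataset(
    sequences=['MALWMRLLPLLALLALWGPDPAAA'],  # hypothetical single-sequence example
    batch_size=1,
    max_len=512,
    full_embeddings=True,  # return residue-wise embeddings instead of pooled vectors
    embed_dtype=torch.float32,
    num_workers=0,
    sql=False,
    save=False,
)
for seq, emb in residue_dict.items():
    print(len(seq), emb.shape)  # roughly (sequence length, hidden size)
```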