lhallee committed (verified)
Commit 0cf4e7f · 1 Parent(s): 7e7ee15

Update README.md

Files changed (1)
  1. README.md +28 -1
README.md CHANGED
@@ -4,7 +4,7 @@ tags: []
 ---
 
 # ESM++
-ESM++ is a faithful implementation of [ESMC](https://www.evolutionaryscale.ai/blog/esm-cambrian) ([license](https://www.evolutionaryscale.ai/policies/cambrian-open-license-agreement)) that allows for batching and standard Huggingface compatibility without requiring the ESM Python package.
+[ESM++](https://github.com/Synthyra) is a faithful implementation of [ESMC](https://www.evolutionaryscale.ai/blog/esm-cambrian) ([license](https://www.evolutionaryscale.ai/policies/cambrian-open-license-agreement)) that allows for batching and standard Huggingface compatibility without requiring the ESM Python package.
 The large version corresponds to the 600 million parameter version of ESMC.
 
 
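The added description keeps the claim of batching and standard Huggingface compatibility. Below is a minimal sketch of what that batched usage could look like through the plain `transformers` API; only the `AutoModelForMaskedLM.from_pretrained` call appears in this commit's context lines, while the `AutoTokenizer` call and the `logits` attribute are assumptions.

```python
# Minimal sketch (not from the diff): batched masked-LM inference via the
# standard transformers API. The AutoTokenizer call and the .logits attribute
# are assumptions; only AutoModelForMaskedLM.from_pretrained is shown in this README.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = 'Synthyra/ESMplusplus_large'
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)  # add torch_dtype=torch.float16 on GPU, as in the quick-start below
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)     # assumed to be bundled with the repo

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device).eval()

sequences = ['MPRTEIN', 'MSEQWENCE']  # toy protein strings
batch = tokenizer(sequences, padding=True, return_tensors='pt').to(device)

with torch.no_grad():
    outputs = model(**batch)

print(outputs.logits.shape)  # expected: (batch, padded sequence length, vocab size)
```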
@@ -42,6 +42,33 @@ import torch
 model = AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True, torch_dtype=torch.float16) # or torch.bfloat16
 ```
 
+## Embed entire datasets with no new code
+To embed a list of protein sequences **fast**, just call `embed_dataset`. Sequences are sorted to reduce padding tokens, so the time estimated by the progress bar is usually much longer than the actual runtime.
+```python
+embeddings = model.embed_dataset(
+    sequences=sequences,          # list of protein strings
+    batch_size=16,                # embedding batch size
+    max_len=2048,                 # truncate to max_len
+    full_embeddings=True,         # return residue-wise embeddings
+    full_precision=False,         # store as float32
+    pooling_type='mean',          # use mean pooling if protein-wise embeddings
+    num_workers=0,                # data loading num workers
+    sql=False,                    # return dictionary of sequences and embeddings
+)
+
+_ = model.embed_dataset(
+    sequences=sequences,          # list of protein strings
+    batch_size=16,                # embedding batch size
+    max_len=2048,                 # truncate to max_len
+    full_embeddings=True,         # return residue-wise embeddings
+    full_precision=False,         # store as float32
+    pooling_type='mean',          # use mean pooling if protein-wise embeddings
+    num_workers=0,                # data loading num workers
+    sql=True,                     # store sequences in local SQL database
+    sql_db_path='embeddings.db',  # path to .db file of choice
+)
+```
+
 ### Comparison across floating-point precision and implementations
 We measured the difference of the last hidden states of the fp32 weights vs. fp16 or bf16. We find that the fp16 outputs are closer to the fp32 outputs, so we recommend loading in fp16.
 Please note that the ESM package also loads ESMC in fp32 but casts to bf16 by default, which has its share of advantages and disadvantages in inference / training - so load whichever you like for half precision.
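One way the precision comparison described in those context lines could be reproduced is sketched below, assuming the usual `transformers` interfaces. The metric (mean absolute difference), the toy sequences, and the availability of `output_hidden_states` / `hidden_states` on this custom model are assumptions; the README only states that fp32 last hidden states were compared against fp16 and bf16.

```python
# Minimal sketch (not from the README): compare last hidden states across precisions.
# The metric (mean absolute difference), the toy sequences, and the
# output_hidden_states / hidden_states interface are assumptions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = 'Synthyra/ESMplusplus_large'
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # half precision is most reliable on a GPU

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)  # assumed tokenizer location
batch = tokenizer(['MPRTEIN', 'MSEQWENCE'], padding=True, return_tensors='pt').to(device)

def last_hidden(dtype):
    # Load the same checkpoint at the given precision and return its final hidden states.
    model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype=dtype).to(device).eval()
    with torch.no_grad():
        out = model(**batch, output_hidden_states=True)
    return out.hidden_states[-1].float().cpu()

ref = last_hidden(torch.float32)
for dtype in (torch.float16, torch.bfloat16):
    diff = (ref - last_hidden(dtype)).abs().mean().item()
    print(f'mean |fp32 - {dtype}| difference in last hidden states: {diff:.6f}')
```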