Update README.md

README.md
---

# ESM++

[ESM++](https://github.com/Synthyra) is a faithful implementation of [ESMC](https://www.evolutionaryscale.ai/blog/esm-cambrian) ([license](https://www.evolutionaryscale.ai/policies/cambrian-open-license-agreement)) that allows for batching and standard Huggingface compatibility without requiring the ESM Python package.
The large version corresponds to the 600 million parameter version of ESMC.

[...]

```python
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True, torch_dtype=torch.float16)  # or torch.bfloat16
```
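Once loaded, the checkpoint behaves like any other Hugging Face masked-LM. Below is a minimal forward-pass sketch; the tokenizer handle and the toy sequence are illustrative assumptions rather than part of the snippet above.

```python
import torch
from transformers import AutoTokenizer

# assumption: the matching tokenizer is bundled with the checkpoint
tokenizer = AutoTokenizer.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True)

sequence = 'MPRTEINSEQWENCE'  # toy protein sequence
inputs = tokenizer(sequence, return_tensors='pt')

with torch.no_grad():
    outputs = model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'])

print(outputs.logits.shape)  # (1, number of tokens, vocab size) from the masked-LM head
```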
## Embed entire datasets with no new code
To embed a list of protein sequences **fast**, just call `embed_dataset`. Sequences are sorted to reduce padding tokens, so the time estimate on the progress bar is usually much longer than the actual run time.
```python
embeddings = model.embed_dataset(
    sequences=sequences,          # list of protein strings
    batch_size=16,                # embedding batch size
    max_len=2048,                 # truncate sequences to max_len
    full_embeddings=True,         # return residue-wise embeddings
    full_precision=False,         # if True, store embeddings as float32 (full precision)
    pooling_type='mean',          # pooling used for protein-wise embeddings (full_embeddings=False)
    num_workers=0,                # data loading num workers
    sql=False,                    # return a dictionary of sequences and embeddings
)

_ = model.embed_dataset(
    sequences=sequences,          # list of protein strings
    batch_size=16,                # embedding batch size
    max_len=2048,                 # truncate sequences to max_len
    full_embeddings=True,         # return residue-wise embeddings
    full_precision=False,         # if True, store embeddings as float32 (full precision)
    pooling_type='mean',          # pooling used for protein-wise embeddings (full_embeddings=False)
    num_workers=0,                # data loading num workers
    sql=True,                     # store sequences in a local SQL database
    sql_db_path='embeddings.db',  # path to .db file of choice
)
```
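With `sql=False` the results stay in memory; a minimal sketch of consuming them, assuming the returned dictionary maps each input sequence to its embedding tensor:

```python
# assumption: embeddings is a dict of {sequence: embedding tensor}
for seq, emb in embeddings.items():
    # full_embeddings=True  -> residue-wise tensor (tokens x hidden size)
    # full_embeddings=False -> pooled tensor of shape (hidden size,)
    print(seq[:12], tuple(emb.shape))
```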
### Comparison across floating-point precision and implementations
We measured the difference between the last hidden states produced by the fp32 weights and those produced by fp16 or bf16. The fp16 outputs stay closer to the fp32 outputs, so we recommend loading in fp16.
Please note that the ESM package also loads ESMC in fp32 but casts to bf16 by default, which has its own advantages and disadvantages for inference / training - so load whichever half precision you prefer.
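The exact protocol behind this comparison is not spelled out here; the sketch below shows one way it could be approximated with the documented `embed_dataset` API. The sequence list and the mean-absolute-difference metric are illustrative assumptions, not the original benchmark.

```python
import torch
from transformers import AutoModelForMaskedLM

sequences = ['MSEQWENCE', 'MPEPTIDE']  # toy sequences, not the evaluation set

model_fp32 = AutoModelForMaskedLM.from_pretrained(
    'Synthyra/ESMplusplus_large', trust_remote_code=True, torch_dtype=torch.float32)
model_fp16 = AutoModelForMaskedLM.from_pretrained(
    'Synthyra/ESMplusplus_large', trust_remote_code=True, torch_dtype=torch.float16)

# residue-wise embeddings from both precisions, stored in float32 for a fair comparison
emb_fp32 = model_fp32.embed_dataset(sequences=sequences, full_embeddings=True, full_precision=True, sql=False)
emb_fp16 = model_fp16.embed_dataset(sequences=sequences, full_embeddings=True, full_precision=True, sql=False)

# mean absolute difference of the residue-wise embeddings, averaged over sequences
diffs = [(emb_fp32[s].float() - emb_fp16[s].float()).abs().mean().item() for s in sequences]
print(f'mean |fp32 - fp16| difference: {sum(diffs) / len(diffs):.2e}')
```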