lhallee committed (verified)
Commit 0cf4e7f · 1 Parent(s): 7e7ee15

Update README.md

Files changed (1)
  1. README.md +28 -1
README.md CHANGED
@@ -4,7 +4,7 @@ tags: []
 ---
 
 # ESM++
-ESM++ is a faithful implementation of [ESMC](https://www.evolutionaryscale.ai/blog/esm-cambrian) ([license](https://www.evolutionaryscale.ai/policies/cambrian-open-license-agreement)) that allows for batching and standard Huggingface compatibility without requiring the ESM Python package.
+[ESM++](https://github.com/Synthyra) is a faithful implementation of [ESMC](https://www.evolutionaryscale.ai/blog/esm-cambrian) ([license](https://www.evolutionaryscale.ai/policies/cambrian-open-license-agreement)) that allows for batching and standard Huggingface compatibility without requiring the ESM Python package.
 The large version corresponds to the 600 million parameter version of ESMC.
 
 
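The added description keeps the claim of batching and standard Huggingface compatibility. Below is a minimal sketch of what that batched usage could look like through the plain `transformers` API; only the `AutoModelForMaskedLM.from_pretrained` call appears in this commit's context lines, while the `AutoTokenizer` call and the `logits` attribute are assumptions.

```python
# Minimal sketch (not from the diff): batched masked-LM inference via the
# standard transformers API. The AutoTokenizer call and the .logits attribute
# are assumptions; only AutoModelForMaskedLM.from_pretrained is shown in this README.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = 'Synthyra/ESMplusplus_large'
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)  # add torch_dtype=torch.float16 on GPU, as in the quick-start below
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)     # assumed to be bundled with the repo

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device).eval()

sequences = ['MPRTEIN', 'MSEQWENCE']  # toy protein strings
batch = tokenizer(sequences, padding=True, return_tensors='pt').to(device)

with torch.no_grad():
    outputs = model(**batch)

print(outputs.logits.shape)  # expected: (batch, padded sequence length, vocab size)
```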
@@ -42,6 +42,33 @@ import torch
 model = AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True, torch_dtype=torch.float16) # or torch.bfloat16
 ```
 
+## Embed entire datasets with no new code
+To embed a list of protein sequences **fast**, just call `embed_dataset`. Sequences are sorted to reduce padding tokens, so the time estimated by the progress bar is usually much longer than the actual runtime.
+```python
+embeddings = model.embed_dataset(
+    sequences=sequences,          # list of protein strings
+    batch_size=16,                # embedding batch size
+    max_len=2048,                 # truncate to max_len
+    full_embeddings=True,         # return residue-wise embeddings
+    full_precision=False,         # store as float32
+    pooling_type='mean',          # use mean pooling if protein-wise embeddings
+    num_workers=0,                # data loading num workers
+    sql=False,                    # return dictionary of sequences and embeddings
+)
+
+_ = model.embed_dataset(
+    sequences=sequences,          # list of protein strings
+    batch_size=16,                # embedding batch size
+    max_len=2048,                 # truncate to max_len
+    full_embeddings=True,         # return residue-wise embeddings
+    full_precision=False,         # store as float32
+    pooling_type='mean',          # use mean pooling if protein-wise embeddings
+    num_workers=0,                # data loading num workers
+    sql=True,                     # store sequences in local SQL database
+    sql_db_path='embeddings.db',  # path to .db file of choice
+)
+```
+
 ### Comparison across floating-point precision and implementations
 We measured the difference of the last hidden states of the fp32 weights vs. fp16 or bf16. We find that the fp16 outputs are closer to the fp32 outputs, so we recommend loading in fp16.
 Please note that the ESM package also loads ESMC in fp32 but casts to bf16 by default, which has its share of advantages and disadvantages in inference / training - so load whichever you like for half precision.
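One way the precision comparison described in those context lines could be reproduced is sketched below, assuming the usual `transformers` interfaces. The metric (mean absolute difference), the toy sequences, and the availability of `output_hidden_states` / `hidden_states` on this custom model are assumptions; the README only states that fp32 last hidden states were compared against fp16 and bf16.

```python
# Minimal sketch (not from the README): compare last hidden states across precisions.
# The metric (mean absolute difference), the toy sequences, and the
# output_hidden_states / hidden_states interface are assumptions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = 'Synthyra/ESMplusplus_large'
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # half precision is most reliable on a GPU

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)  # assumed tokenizer location
batch = tokenizer(['MPRTEIN', 'MSEQWENCE'], padding=True, return_tensors='pt').to(device)

def last_hidden(dtype):
    # Load the same checkpoint at the given precision and return its final hidden states.
    model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True, torch_dtype=dtype).to(device).eval()
    with torch.no_grad():
        out = model(**batch, output_hidden_states=True)
    return out.hidden_states[-1].float().cpu()

ref = last_hidden(torch.float32)
for dtype in (torch.float16, torch.bfloat16):
    diff = (ref - last_hidden(dtype)).abs().mean().item()
    print(f'mean |fp32 - {dtype}| difference in last hidden states: {diff:.6f}')
```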