lhallee committed (verified)
Commit b45f5d5 · 1 Parent(s): 3826ba8

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +111 -95
README.md CHANGED
---
library_name: transformers
tags: []
---

# FastESM
FastESM is a Hugging Face-compatible, plug-in version of ESM2 rewritten with PyTorch's newer scaled dot-product attention (SDPA) implementation.

Load any ESM2 model into a FastEsm model to dramatically speed up training and inference without **ANY** cost in performance.

Outputting attention maps (or the contact prediction head) is not natively possible with SDPA. You can still pass `output_attentions` to have attention weights calculated manually and returned.
Various other optimizations also make the base implementation slightly different from the one in transformers.

## Use with 🤗 transformers

### Supported models
```python
model_dict = {
    # Synthyra/ESM2-8M
    'ESM2-8M': 'facebook/esm2_t6_8M_UR50D',
    # Synthyra/ESM2-35M
    'ESM2-35M': 'facebook/esm2_t12_35M_UR50D',
    # Synthyra/ESM2-150M
    'ESM2-150M': 'facebook/esm2_t30_150M_UR50D',
    # Synthyra/ESM2-650M
    'ESM2-650M': 'facebook/esm2_t33_650M_UR50D',
    # Synthyra/ESM2-3B
    'ESM2-3B': 'facebook/esm2_t36_3B_UR50D',
}
```
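Each entry maps a short name to the original facebook checkpoint it was converted from; the comments suggest each is mirrored under a `Synthyra/<name>` repository. A minimal loading sketch (the repo-id construction here is an assumption based on those comments):

```python
import torch
from transformers import AutoModel

model_name = 'ESM2-8M'                 # any key from model_dict above
model_path = f'Synthyra/{model_name}'  # assumed Synthyra mirror of the facebook checkpoint
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).eval()
```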

### For working with embeddings
```python
import torch
from transformers import AutoModel, AutoTokenizer

model_path = 'Synthyra/ESM2-8M'
model = AutoModel.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).eval()
tokenizer = model.tokenizer

sequences = ['MPRTEIN', 'MSEQWENCE']
tokenized = tokenizer(sequences, padding=True, return_tensors='pt')
with torch.no_grad():
    embeddings = model(**tokenized).last_hidden_state

print(embeddings.shape)  # (2, 11, 320) for ESM2-8M
```
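`last_hidden_state` is residue-wise. Not part of the original card, but as a quick sketch, one common way to get a single vector per protein is masked mean pooling over the sequence dimension:

```python
# Sketch: mean-pool residue embeddings into one vector per protein,
# ignoring padding positions via the attention mask.
mask = tokenized['attention_mask'].unsqueeze(-1).to(embeddings.dtype)  # (2, 11, 1)
pooled = (embeddings * mask).sum(dim=1) / mask.sum(dim=1)              # (2, hidden_size)
print(pooled.shape)  # (2, 320) for ESM2-8M
```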

### For working with sequence logits
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# model_path and tokenized are reused from the embedding example above
model = AutoModelForMaskedLM.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).eval()
with torch.no_grad():
    logits = model(**tokenized).logits

print(logits.shape)  # (2, 11, 33)
```
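The 33 logits per position are scores over the ESM2 vocabulary. As an illustrative sketch (not from the original card), the most likely token at each position can be recovered with the tokenizer:

```python
# Sketch: greedy per-residue predictions from the masked-LM head.
predicted_ids = logits.argmax(dim=-1)  # (2, 11) most likely vocabulary id per position
predicted_tokens = [tokenizer.convert_ids_to_tokens(ids) for ids in predicted_ids.tolist()]
print(predicted_tokens[0])  # amino-acid tokens (plus special tokens) for the first sequence
```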

### For working with attention maps
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(model_path, torch_dtype=torch.float16, trust_remote_code=True).eval()
with torch.no_grad():
    # attentions is a tuple of tensors, one per layer, each (batch_size, num_heads, seq_len, seq_len)
    attentions = model(**tokenized, output_attentions=True).attentions

print(attentions[-1].shape)  # (2, 20, 11, 11)
```
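Since the contact prediction head is not available under SDPA, a rough sketch of one way to collapse these weights into a single residue-residue map is to average over heads and symmetrize (this is only an attention map, not a contact prediction):

```python
# Sketch: head-averaged, symmetrized attention from the last layer.
last_layer = attentions[-1]                            # (2, 20, 11, 11)
avg_map = last_layer.float().mean(dim=1)               # (2, 11, 11), averaged over heads
sym_map = 0.5 * (avg_map + avg_map.transpose(-1, -2))  # symmetrized residue-residue map
print(sym_map.shape)  # (2, 11, 11)
```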

## Embed entire datasets with no new code
To embed a list of protein sequences **fast**, just call `embed_dataset`. Sequences are sorted to reduce padding tokens, so the initial progress bar estimate is usually much longer than the actual time taken.
```python
embeddings = model.embed_dataset(
    sequences=sequences,         # list of protein strings
    batch_size=16,               # embedding batch size
    max_len=2048,                # truncate to max_len
    full_embeddings=True,        # return residue-wise embeddings
    full_precision=False,        # if True, store embeddings at full precision (float32)
    pooling_type='mean',         # pooling strategy used for protein-wise embeddings
    num_workers=0,               # data loading num workers
    sql=False,                   # return a dictionary of sequences and embeddings
)

_ = model.embed_dataset(
    sequences=sequences,         # list of protein strings
    batch_size=16,               # embedding batch size
    max_len=2048,                # truncate to max_len
    full_embeddings=True,        # return residue-wise embeddings
    full_precision=False,        # if True, store embeddings at full precision (float32)
    pooling_type='mean',         # pooling strategy used for protein-wise embeddings
    num_workers=0,               # data loading num workers
    sql=True,                    # store sequences and embeddings in a local SQL database
    sql_db_path='embeddings.db', # path to .db file of choice
)
```
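The exact return types are not spelled out above, but assuming the `sql=False` call returns a dictionary keyed by the input sequence strings (as the comment says), individual embeddings can be looked up directly:

```python
# Sketch (assumes a dict keyed by the input sequence strings).
import torch

emb = torch.as_tensor(embeddings[sequences[0]])  # residue-wise embedding of the first protein
print(emb.shape)  # roughly (seq_len, hidden_size) with full_embeddings=True
```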

### Citation
If you use any of this implementation or work, please cite it (as well as the [ESM2](https://www.science.org/doi/10.1126/science.ade2574) paper).
```
@misc{FastESM2,
  author = { Hallee, L. and Bichara, D. and Gleghorn, J. P. },
  title = { FastESM2 },
  year = 2024,
  url = { https://huggingface.co/Synthyra/FastESM2_650 },
  doi = { 10.57967/hf/3729 },
  publisher = { Hugging Face }
}
```