Niksa Praljak committed on
Commit b18f47c · 1 Parent(s): c865888

Add information on ProteoScribe instructions

Files changed (1)
  1. README.md +70 -7
README.md CHANGED
@@ -269,19 +269,82 @@ The facilitated embeddings are saved to the specified output_data_path for furth
269
 
270
  ## Stage 3: ProteoScribe
271
 
272
- 🚧 **Coming Soon** 🚧
273
 
274
- This stage will contain scripts and models for the ProteoScribe process. Check back for:
275
- - Configuration files
276
- - Model weights
277
- - Running instructions
278
- - Output examples
279
 
280
  ## Support
281
 
282
  For questions or issues:
283
  - Open an issue in this repository
284
- - Contact: [Your contact information]
285
 
286
  ---
287
  Repository maintained by the BioM3 Team
 
269
 
270
  ## Stage 3: ProteoScribe
271
 
281
+ ### Overview
282
+
283
+ In this stage, the **ProteoScribe model** takes the facilitated embeddings (z_c) from Stage 2 and generates novel protein sequences that match the desired functional description. The model outputs multiple sequence variants (replicas) for each input embedding.
284
+
285
+ ### Model Weights
286
+
287
+ Before running the model, ensure you have:
288
+ - Configuration file: `stage3_config.json`
289
+ - Pre-trained weights: `BioM3_ProteoScribe_pfam_epoch20_v1.bin`
290
+
291
+ ### Running ProteoScribe
292
+
293
+ Run the sequence generation:
294
+ ```bash
295
+ python run_ProteoScribe_sample.py \
296
+ --json_path "./stage3_config.json" \
297
+ --model_path "./weights/ProteoScribe/BioM3_ProteoScribe_pfam_epoch20_v1.bin" \
298
+ --input_path "test_Facilitator_embeddings.pt" \
299
+ --output_path "test_ProteoScribe_samples.pt"
300
+ ```
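Editorial sketch (not part of the commit): the README asks for the configuration file and weights to be in place before running, so a small pre-flight check can catch a missing file before the model starts loading. The paths and flags below are copied from the command above; the helper names `missing_inputs`, `build_command`, and `main` are hypothetical and do not come from the BioM3 repository.

```python
import subprocess
import sys
from pathlib import Path

# Paths copied from the README command; adjust to your local layout.
CONFIG = Path("./stage3_config.json")
WEIGHTS = Path("./weights/ProteoScribe/BioM3_ProteoScribe_pfam_epoch20_v1.bin")
EMBEDDINGS = Path("test_Facilitator_embeddings.pt")
OUTPUT = Path("test_ProteoScribe_samples.pt")


def missing_inputs(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]


def build_command(config, weights, embeddings, output):
    """Assemble the argument list for run_ProteoScribe_sample.py."""
    return [
        sys.executable, "run_ProteoScribe_sample.py",
        "--json_path", str(config),
        "--model_path", str(weights),
        "--input_path", str(embeddings),
        "--output_path", str(output),
    ]


def main():
    """Check the inputs, then launch the sampling script."""
    missing = missing_inputs([CONFIG, WEIGHTS, EMBEDDINGS])
    if missing:
        raise SystemExit("Missing input file(s): " + ", ".join(map(str, missing)))
    subprocess.run(build_command(CONFIG, WEIGHTS, EMBEDDINGS, OUTPUT), check=True)
```

Calling `main()` from the repository root should behave like the shell command, except that it stops early with a readable message when an input file is absent.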
301
+
302
+ ### Expected Output
303
+
304
+ The script generates multiple sequence variants (replicas) for each input embedding. Sample output for two replicas, 0 and 4:
305
+
306
+ **Replica 0:**
307
+
308
+ Prompt 1 (Translation initiation factor):
309
+ - <span style="color:purple">TAKEDWLEMQNTVLETLPNTMFRVELENGHVITAAISGGMRKNYIRILTGDKAKVELTPYDLSKGRICFRAK</span>
310
+
311
+ Prompt 2 (Peptidyl-tRNA hydrolase):
312
+ - <span style="color:purple">MSLIIGLLGNEKKYEFTRHRGVVFISDIANPFYDEFKETIGSVKTGHGFVEDGNYVIKFLVLTIPNRFSIERSARAVQDFYPDLDKVIIYIDDLPFKGGVRLSLHGGDHGNDNLVNGIADKSIGMGIDRRVIRVPEPMVVEVLWHPVFYVFDRFALEIKEIPKLMDILVEKAKELLFDVNKAYFEVL</span>
313
+
314
+ Prompt 3 (tRNA-ribosyltransferase):
315
+ - <span style="color:purple">MSKGPVHFVNVQEEAHTGRLLGAIVETEHGTPPVMYNPSLYSYTNPEPAMQDRLQDASNILLYNTYLWHGPDRCVILQSRGHLNKMNDKPYLILDSGGFMQIMLLSRRIGEFYVHETFHPHKTLSFLSPERVANIQMDLDTTVFDIMDNCPEKPYKYIEESVRLSDRWTTALSDRPDYGRRDQALFGIVGEAQFEDLRERSIEFGLDWAFDGYAIGGLSVGQPPEEMENVINYTKQVPEKLPRTLYNVSGTQLSDDIIGIARVGDMFDCVLPTRIARNGTFLTGQRNVKFAKASRDFNPPIDCKTCDCYTCQNYIRHVLHSGERLGFDGTIIHTIYLFDNLMALMKEAIQKDRKPYFEQHFAEDLSR</span>
316
+
317
+ Prompt 4 (Chaperonin GroEL):
318
+ - <span style="color:purple">MAAKDVKFGNEARVRMLRGVNTLADAVKTTLGPKGRNVVLEKSFGSPTITKDGVSVAREIELDDKFETMGAQMVKEVASKANDKAGDGTTTATVLAQSIITEGLKAVASGMNPMDLKRGIDKAVAAAVENLKTMKVPASDSKAIAQVGTISANSDETIGKLEADAMDKVGKDGVITVEEGQGLKDELDVVEGMQFDRGYLSPYFINKPDSGAVELESPFILLVDKKISNIRELMPVLEAVAKSSKPLLIIAEDVEGEALATLVVNTMRGIIKIAAVKAPGFTHRRKEMLQDIATLTAGTVISEQIDIELEKATLNDLGQAKRIVINKDTATIVDGAGDVADISSRVHQIRANVEEATSDYDREKLQERLAKLSGGVAVIKVGAGTEVEMKEKKARVEDALHQTRAATEEGVVAGGGVALIRAASKLAAVRPNSANDALEGIERVLAKELLPQQIALDGVGVSPNKATAIIANGVGGYAAANYEYGLVDKLEQVGDAPTKVVRAIVSDAMGSAMGAETIVVDAMGEAQA</span>
319
+
320
+ Prompt 5 (Chaperone DnaK):
321
+ - <span style="color:purple">MSTLKTVPLGCFNFQYTWNELNKIDTTISACFEEATSREIKETATDKQVLYEMRKHLCCTTDAHIGPPSVKGIHSPNKVTFGQRYCAQSGVEAFAGKEDIGKLKLVDVAGEGKPHALQLGSYAVIRVINQQALDDWLPVQEFDRVGKKIAGETNIMFDDLDFALNDWKVTENTQLRGGREGRNEITSLPLGLQWNLIEDQFFKHECDADNTILDEARLSAGWTKIAVFGIGASGVAHIIRVMSGAGLEMKSARLVGPRLCARIRQIVEEAKKNGILNARNISCAYEFAVCPFLCSISLDSKTRLDVEDLQPPLLKKFEEEIVKILEGAGKTLDKLDSVELIGFGMRVPIIRELIKFIFEAPTAAPNLFGDETIAKPKIALTHILIIKHYLKPRSRHKVKLYDNVSFWAELDVQGEDDIIVVNHAKSTVKVVLDDVKGVSFLENAKGINPSILILKLRNGEPKYDTTSDIVFRGFADDDTVPEEGLPDDCAKLKCLGLESPTYRVAEKTIDEGLKPEENEAKELIIKENKGSSSGESGVTNSSDVTEPDQLALDPANPSMDKTGSEERQNGVDEQMKNALTSNTGVSSGNGKLQELVELTEAAYTKRQIIEEEDGRSLLIQCTVICLEAKKKDRTLYDDEYGEGPYGEWPAVLAQRKAMSYQDECEAEFLEWFPSKSIKIKVVDRKMGADKDLKALSVEDAVSAEQATGQPLIESVLRKDDEKESE</span>
322
+
323
+ **Replica 4:**
324
+
325
+ Prompt 1 (Translation initiation factor):
326
+ - <span style="color:purple">MAKEDCLEMQGTVLETLPNTMFRVELENGHVILAAISGKMRKNYIRILTGDKVKVDMTPYDLSKGRIVFRAK</span>
327
+
328
+ Prompt 2 (Peptidyl-tRNA hydrolase):
329
+ - <span style="color:purple">MKLIVGLGTNSDKNRPTRNTNVGFFYLDDLKSITPVQIKAKFNGLTRCGPKADEHVLIVDVKTPMNKNGNEQSMKFTDYFGPVDYISLVVIHDDVQIIDGKDKPFKVGKYRGPHLGIANILALIKSGRVRIVVSNLPKKGNHVINGVVGIDMDDWLNLVQDFKENNGLIFICGGSARHGVINRLKKKDGLFEAPDCFSEKLEEKMRKCDGDPAITLDPFEAVQF</span>
330
+
331
+ Prompt 3 (tRNA-ribosyltransferase):
332
+ - <span style="color:purple">MVNKPVRAVKIKTTKPVGKYIGSIVVPAGTFPMPPFMVPEITPTCKEKTPALIQLATSIGDIYTLHSWIRQAGNMIDHGELHMKKFMNWKALVTDSGGFFMVLSLRYHVDYGFHFQTNGSHFPLSSMFMSDSIIASIQAGMDNFGADIVFDWPYPAQTYEYMMNSLEWTDRCRRALGELIKATDKPHLKNFGYVQIGGIHVLRSEQSLRVLTLRDDSLFGVGVMGESKPYQNDFLWQVIPKTLPYNPLRYGRPMQAIERSIDAGIRMFDCIDPTLPPRLIATTGCHMTSREGRSVVSNRDYDRSFSPLDPKCDCYHCRGYIRCYVNHLFKAKEILGLPLWSDNTVYSLRDMIDRVQHFTVDGLKMEDLHNLFKGFVSEFRHHSAEKKGSE</span>
333
+
334
+ Prompt 4 (Chaperonin GroEL):
335
+ - <span style="color:purple">MAAKEVKFGNDARVKMLRGVNTLADAVKVTLGPKGRNVVLDKSFGAPTITKDGVSVAKEIELKDKFENMGAQMVKEIANKANDLAGDGTTTATVLAQSIINEGLKAVAAGMNPMDLKRGIDKAVIAAVANLKTLSVPCSDSKAIAQVATISANSVETVGKLKAEAMDKVGKEGVITVEEGSGLQDELDVVEGMQFDRGYLSPYFINKPDSGALELESPFILLVDKKISNIRELLAVLEAVAKSGKPLLIIAEDVEGEALATLVVNTMRGIVKVAAVKAPGFGDRRKAMLQDIATLTGGTVISEEIGMELEKATLSELGQAKRVVINKDTTTIIDGGGEEAQIRLRVAQIQAQIEDASSDYDKEKLQERVAKLSGGVAVIKVGAATEVEMKEKKARVEDALHATRACIEEGVVAGGGVALIRVAKKFADLQGSNEDQNVGVKVALRAMEAPLRQIVLNMGEEPSVVANTVKAGEGNYGYNAASGEYGDMIEYGILDPTKVTRSTLQYAASVAGLMITTEAMVAEMEPKD</span>
336
+
337
+ Prompt 5 (Chaperone DnaK):
338
+ - <span style="color:purple">MMNGTKKLNSWQIGAPGAFKDSGILPVVINRYQNTPTSAIVQAYRTERGIAAKSRNALKNPSSCFDIFRYDLKKVGRFNGEKNLVDYDTLPFVIAICYTKIKAEAEDYLGREIDEILVIPPMYFVSYKGRVVKKIKDKADVDVNRIIAEPSAAAIAYGLDSSNNAEMIVYDYGGGSIDVSIVEATENNDKYRAVEFDMGKSGLNNVLRKDARVRGKRDRDSSDPTYIALYNSGLALQEKVEEGVEIDEVNQDSLPLNNKNAIGMRKEKIELRRTTFSSLAKDLLEKTKEPMKKAFKEAGLTHEEVGEIVLVGGDMKIPAVVARVQETFQKTLLNLALDPEVVSLGSAIQGLVLYGNQIYINEDRLKPYVIPDGLNFNPDPDLSENLFIPRKSTILEGVFMGNLTAPIVHSFEPYSKEFPLGPNNGLLNLKLKSFIEFSTINENSVPPTTKDKFIGLCNDLSMSNARYKDAEPTDEKKHEENIVVEDEHSDSQAQLLQGREKIQKKCILNEEKKEKVKTELKKLESLVNPELRSKMTADEISGCLAKSKNALEKFQRKMTPKPEDGDEKRDFLKTKNSDNTEYFTFES</span>
339
+
340
+ Each replica is a different candidate sequence design intended to preserve the functional properties specified in the input text description. Sampling several replicas per prompt allows exploration of sequence diversity for the same intended function.
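Editorial sketch (not part of the commit): one simple way to put a number on the diversity between replicas is the fraction of positions that carry the same residue. The example below compares the two Prompt 1 sequences shown above, which happen to have the same length. The function `positional_identity` is a hypothetical helper, not something provided by BioM3, and it assumes ungapped, equal-length sequences; the other prompts produce replicas of different lengths and would need a proper alignment first.

```python
def positional_identity(seq_a, seq_b):
    """Fraction of positions with the same residue (ungapped, equal length)."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("expected two non-empty sequences of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


# Prompt 1 (Translation initiation factor), copied from the sample output.
replica_0 = "TAKEDWLEMQNTVLETLPNTMFRVELENGHVITAAISGGMRKNYIRILTGDKAKVELTPYDLSKGRICFRAK"
replica_4 = "MAKEDCLEMQGTVLETLPNTMFRVELENGHVILAAISGKMRKNYIRILTGDKVKVDMTPYDLSKGRIVFRAK"

identity = positional_identity(replica_0, replica_4)
print(f"Prompt 1, replica 0 vs replica 4: {identity:.3f} identity")
```

For these two sequences, 63 of 72 positions match, so the replicas are closely related but not identical.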
341
 
342
 
343
  ## Support
344
 
345
  For questions or issues:
346
  - Open an issue in this repository
347
+ - Contact: [email protected]
348
 
349
  ---
350
  Repository maintained by the BioM3 Team