RaphaelMourad committed on
Commit
632df63
·
verified ·
1 Parent(s): 1d55f8a

Update README.md

  1. README.md +52 -3
README.md CHANGED
@@ -1,3 +1,52 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ tags:
+ - pretrained
+ - modernbert
+ - DNA
+ - virus
+ ---
+
+ # Model Card for ModernBert-DNA-v1-37M-virus (ModernBERT for DNA)
+
+ The ModernBert-DNA-v1-37M-virus Large Language Model (LLM) is a pretrained DNA sequence model with 37M parameters.
+ It is derived from the ModernBERT model, simplified for DNA by reducing the number of layers and the hidden size.
+ The model was pretrained on around 15,071 virus genomes longer than 1 kb. Virus genomes were split into 1 kb sequences.
+
+ The virus genome database was downloaded from https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Genome&VirusLineage_ss=taxid:10239&SourceDB_s=RefSeq.
+ NB: the DNA sequence was used, not the RNA sequence.
+
+
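The preprocessing described above (keep genomes longer than 1 kb, split them into 1 kb sequences) could look roughly like the sketch below. The README does not say whether the windows overlap or what happens to a trailing fragment shorter than 1 kb, so the non-overlapping windows and the `keep_remainder` option are assumptions, and `split_genome` is a hypothetical helper name rather than the author's code.

```python
def split_genome(sequence, chunk_size=1000, keep_remainder=False):
    """Split a genome string into consecutive, non-overlapping chunks.

    Assumption: windows do not overlap, and a final fragment shorter than
    chunk_size is dropped unless keep_remainder is True.
    """
    sequence = sequence.upper()
    chunks = [
        sequence[start:start + chunk_size]
        for start in range(0, len(sequence), chunk_size)
    ]
    # Drop the trailing partial chunk unless the caller asked to keep it
    if chunks and not keep_remainder and len(chunks[-1]) < chunk_size:
        chunks.pop()
    return chunks


# Toy 2,400 bp "genome": two full 1 kb chunks plus a 400 bp remainder
genome = "ACGT" * 600
chunks = split_genome(genome)
print(len(chunks))  # 2
```

Passing `keep_remainder=True` keeps the final 400 bp fragment as a third chunk.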
+ ## Load the model from Hugging Face
+
+ ```
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+
+ # Download the tokenizer and the pretrained model from the Hugging Face Hub
+ tokenizer = AutoTokenizer.from_pretrained("RaphaelMourad/ModernBert-DNA-v1-37M-virus", trust_remote_code=True)
+ model = AutoModel.from_pretrained("RaphaelMourad/ModernBert-DNA-v1-37M-virus", trust_remote_code=True)
+ ```
+
+ ## Calculate the embedding of a DNA sequence
+
+ ```
+ DNAseq = "TGATGATTGGCGCGGCTAGGATCGGCT"
+ # Tokenize the sequence and keep only the token IDs
+ inputs = tokenizer(DNAseq, return_tensors="pt")["input_ids"]
+ hidden_states = model(inputs)[0]  # [1, sequence_length, 256]
+
+ # Embedding with max pooling over the sequence dimension
+ embedding_max = torch.max(hidden_states[0], dim=0)[0]
+ print(embedding_max.shape)  # expected: torch.Size([256])
+ ```
+
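+ ## Troubleshooting
+
+ Make sure you are using a stable release of Transformers, version 4.34.0 or newer. Native ModernBERT support was added in Transformers 4.48.0, so that version or later may be needed to load this model.
+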
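To check the installed version before loading the model, a small script like the following may help. It is a minimal sketch using only the standard library; `parse_release`, `meets_minimum`, and `installed_version` are hypothetical helper names, and the 4.34.0 threshold is simply the minimum stated above.

```python
from importlib.metadata import PackageNotFoundError, version

MIN_VERSION = (4, 34, 0)  # minimum Transformers version stated in this README


def parse_release(version_string):
    """Return the leading numeric components, e.g. '4.48.0.dev0' -> (4, 48, 0).

    Parsing stops at the first component that is not purely numeric.
    """
    parts = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum=MIN_VERSION):
    """True if the installed version string is at least the minimum."""
    return parse_release(installed) >= minimum


def installed_version(package="transformers"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("transformers is not installed")
elif meets_minimum(found):
    print(f"transformers {found} meets the stated minimum")
else:
    print(f"transformers {found} is older than 4.34.0; consider upgrading")
```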
+ ## Notice
+
+ ModernBert-DNA-v1-37M-virus is a pretrained base model for DNA.
+
+ ## Contact
+
+ Raphaël Mourad. [email protected]