Update README.md
README.md
CHANGED
@@ -8,13 +8,13 @@ tags:
 - genomics
 - segmentation
 ---
-# segment-nt-
+# segment-nt-multi-species
 
-Segment-NT-
-elements in a sequence at a single nucleotide resolution. It is the result of finetuning the [Segment-NT
+Segment-NT-multi-species is a segmentation model leveraging the [Nucleotide Transformer](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) (NT) DNA foundation model to predict the location of several types of genomics
+elements in a sequence at a single nucleotide resolution. It is the result of finetuning the [Segment-NT](https://huggingface.co/InstaDeepAI/segment_nt) model on a dataset encompassing the human genome
 but also the genomes of 5 selected species: mouse, chicken, fly, zebrafish and worm.
 
-For the finetuning on the multi-species genomes, we curated a dataset of a subset of the annotations used to train **Segment-NT
+For the finetuning on the multi-species genomes, we curated a dataset of a subset of the annotations used to train **Segment-NT**, mainly because only this subset of annotations is
 available for these species. The annotations therefore concern the 7 main gene elements available from Ensembl [REF], namely protein-coding gene, 5’UTR, 3’UTR, intron, exon,
 splice acceptor and donor sites.
 
@@ -59,8 +59,8 @@ features = [
 "promoter_Tissue_invariant",
 ]
 
-tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/
-model = AutoModel.from_pretrained("InstaDeepAI/
+tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/segment_nt_multi_species", trust_remote_code=True)
+model = AutoModel.from_pretrained("InstaDeepAI/segment_nt_multi_species", trust_remote_code=True)
 
 # Choose the length to which the input sequences are padded. By default, the
 # model max length is chosen, but feel free to decrease it as the time taken to
@@ -100,7 +100,7 @@ print(f"Intron probabilities shape: {probabilities_intron.shape}")
 
 ## Training data
 
-The **segment-nt-
+The **segment-nt-multi-species** model was finetuned on human, mouse, chicken, fly, zebrafish and worm genomes. For each species, a subset of chromosomes is kept as
 validation for training monitoring and test for final evaluation.
 
 ## Training procedure