Update README.md
README.md CHANGED
@@ -10,11 +10,11 @@ tags:
 ---
 # segment-nt-multi-species
 
-
-elements in a sequence at a single nucleotide resolution. It is the result of finetuning the [
+SegmentNT-multi-species is a segmentation model leveraging the [Nucleotide Transformer](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) (NT) DNA foundation model to predict the location of several types of genomic
+elements in a sequence at a single nucleotide resolution. It is the result of finetuning the [SegmentNT](https://huggingface.co/InstaDeepAI/segment_nt) model on a dataset encompassing not only the human genome
 but also the genomes of 5 selected species: mouse, chicken, fly, zebrafish and worm.
 
-For the finetuning on the multi-species genomes, we curated a dataset of a subset of the annotations used to train **
+For the finetuning on the multi-species genomes, we curated a dataset from a subset of the annotations used to train **SegmentNT**, mainly because only this subset of annotations is
 available for these species. The annotations therefore concern the 7 main gene elements available from [Ensembl](https://www.ensembl.org/index.html), namely protein-coding gene, 5’UTR, 3’UTR, intron, exon,
 splice acceptor and donor sites.
 
@@ -39,7 +39,7 @@ pip install --upgrade git+https://github.com/huggingface/transformers.git
 A small snippet of code is given here in order to retrieve both logits and embeddings from a dummy DNA sequence.
 
 
-⚠️ The maximum sequence length is set by default at the training length of 30,000 nucleotides, or 5001 tokens (accounting for the CLS token). However,
+⚠️ The maximum sequence length is set by default at the training length of 30,000 nucleotides, or 5001 tokens (accounting for the CLS token). However, SegmentNT has
 been shown to generalize up to sequences of 50,000 bp. In case you need to infer on sequences between 30kbp and 50kbp, make sure to change the `rescaling_factor`
 argument in the config to `num_dna_tokens_inference / max_num_tokens_nt`, where `num_dna_tokens_inference` is the number of tokens at inference
 (i.e. 6669 for a sequence of 40008 base pairs) and `max_num_tokens_nt` is the max number of tokens on which the backbone nucleotide-transformer was trained, i.e. `2048`.
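
The snippet referenced on line 39 sits outside this hunk. Below is a minimal sketch of its general shape, assuming the Hugging Face repo id `InstaDeepAI/segment_nt_multi_species` and that the repo's custom model class (loaded with `trust_remote_code=True`) exposes per-nucleotide logits and hidden states; none of these details appear in the diff itself:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed repo id; adjust if the model lives under a different name.
repo_id = "InstaDeepAI/segment_nt_multi_species"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)

# The number of DNA tokens (excluding the prepended CLS token) is assumed to
# need to be divisible by 2^(number of downsampling blocks), i.e. 4.
max_length = 12 + 1  # 12 DNA tokens + 1 CLS token

# Tokenize a dummy DNA sequence (6-mer tokenization, as in Nucleotide Transformer).
sequences = ["ATTCCGATTCCGATTCCGATTCCG"]
tokens = tokenizer.batch_encode_plus(
    sequences, return_tensors="pt", padding="max_length", max_length=max_length
)["input_ids"]

# Run inference and retrieve both logits and embeddings.
attention_mask = tokens != tokenizer.pad_token_id
outs = model(tokens, attention_mask=attention_mask, output_hidden_states=True)

logits = outs.logits.detach()        # per-nucleotide logits over the gene elements
embeddings = outs.hidden_states[-1]  # last-layer embeddings (assumed output field)
probabilities = torch.nn.functional.softmax(logits, dim=-1)
print(f"Probabilities shape: {probabilities.shape}")
```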
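
The `rescaling_factor` note above amounts to a one-line config change. A minimal sketch, again assuming the repo id `InstaDeepAI/segment_nt_multi_species` and that the config exposes `rescaling_factor` as a plain attribute (only the attribute name comes from the README; the rest is an assumption):

```python
from transformers import AutoConfig, AutoModel

repo_id = "InstaDeepAI/segment_nt_multi_species"  # assumed repo id

# Example from the note above: a 40,008 bp sequence gives
# 40008 / 6 = 6668 six-mer tokens, plus the CLS token = 6669 tokens.
# (The default 30,000 bp training length likewise gives 30000 / 6 + 1 = 5001 tokens.)
num_dna_tokens_inference = 6669
max_num_tokens_nt = 2048  # training length of the nucleotide-transformer backbone

config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
config.rescaling_factor = num_dna_tokens_inference / max_num_tokens_nt

model = AutoModel.from_pretrained(repo_id, config=config, trust_remote_code=True)
```

Per the note, this adjustment is only needed for sequences between 30 kbp and 50 kbp; at or below the training length the default value already applies.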
|