[ChatNT](https://www.biorxiv.org/content/10.1101/2024.04.30.591835v1) is the first multimodal conversational agent designed with a deep understanding of biological sequences (DNA, RNA, proteins).
It enables users, even those with no coding background, to interact with biological data through natural language, and it generalizes across multiple biological tasks and modalities.

**Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI)
### Model Sources
- **Paper:** [ChatNT: A Multimodal Conversational Agent for DNA, RNA and Protein Tasks](https://www.biorxiv.org/content/10.1101/2024.04.30.591835v1.full.pdf)
### Architecture and Parameters

ChatNT is built on a three-module design: a 500M-parameter [Nucleotide Transformer v2](https://www.nature.com/articles/s41592-024-02523-z) DNA encoder pre-trained on genomes from 850 species and handling up to 12 kb per sequence ([Dalla-Torre et al., 2024](https://www.nature.com/articles/s41592-024-02523-z)); an English-aware Perceiver Resampler that linearly projects and, through gated cross-attention, compresses the 2048 DNA-token embeddings into 64 task-conditioned vectors; and a frozen 7B-parameter [Vicuna-7B](https://lmsys.org/blog/2023-03-30-vicuna/) decoder.
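To make the shapes concrete, here is a minimal NumPy sketch of the resampling step. The embedding sizes (`d_dna`, `d_llm`) are hypothetical, and the gating is omitted; this illustrates the tensor shapes only, not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_dna, d_llm = 1024, 4096      # hypothetical embedding sizes
n_tokens, n_latents = 2048, 64  # 2048 DNA tokens compressed into 64 vectors

dna_embeddings = rng.standard_normal((n_tokens, d_dna))  # from the DNA encoder
latents = rng.standard_normal((n_latents, d_llm))        # learned query vectors
W_proj = rng.standard_normal((d_dna, d_llm)) / np.sqrt(d_dna)  # linear projection

# Project DNA embeddings into the decoder's embedding space
kv = dna_embeddings @ W_proj                             # (2048, d_llm)

# Single-head cross-attention: 64 latent queries attend over 2048 keys/values
scores = latents @ kv.T / np.sqrt(d_llm)                 # (64, 2048)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)           # row-wise softmax
resampled = weights @ kv                                 # (64, d_llm)

assert resampled.shape == (n_latents, d_llm)
```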
Users provide a natural-language prompt containing one or more `<DNA>` placeholders together with the corresponding DNA sequences (tokenized as 6-mers). The projection layer inserts the 64 resampled DNA embeddings at each placeholder, and the Vicuna decoder autoregressively generates free-form English responses, using low-temperature sampling to produce classification labels, multi-label statements, or numeric values.
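Because each `<DNA>` placeholder is filled with one resampled sequence, the number of placeholders in the prompt must equal the number of DNA sequences supplied. A small sanity check (plain Python; `check_placeholders` is a hypothetical helper, not part of the released code) can catch mismatches before calling the model:

```python
def check_placeholders(english_sequence: str, dna_sequences: list) -> None:
    """Raise if the number of <DNA> placeholders does not match len(dna_sequences)."""
    n_placeholders = english_sequence.count("<DNA>")
    if n_placeholders != len(dna_sequences):
        raise ValueError(
            f"Prompt has {n_placeholders} <DNA> placeholder(s) "
            f"but {len(dna_sequences)} DNA sequence(s) were provided."
        )

# Passes silently: one placeholder, one sequence
check_placeholders(
    "Is there any evidence of an acceptor splice site in this sequence <DNA> ?",
    ["ATCGGAAAAAGATCCAGAAAGT"],
)
```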
### Training Data

ChatNT was instruction-tuned on a unified corpus covering 27 diverse tasks across DNA, RNA and proteins, spanning multiple species, tissues and biological processes. This amounted to 605 million DNA tokens (≈ 3.6 billion bases) and 273 million English tokens, sampled uniformly over tasks for a total of 2 billion instruction tokens.
### Tokenization

DNA inputs are broken into non-overlapping 6-mer tokens and padded or truncated to 2048 tokens (≈ 12 kb). English prompts and outputs use the LLaMA tokenizer, augmented with `<DNA>` as a special token to mark sequence insertion points.
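The arithmetic behind the 12 kb limit: 2048 non-overlapping 6-mers cover 2048 × 6 = 12,288 bases. A minimal sketch of this chunking (`kmerize` is a hypothetical helper for illustration; the actual `bio_tokenizer` also handles padding, special tokens, and leftover bases):

```python
def kmerize(sequence: str, k: int = 6, max_tokens: int = 2048) -> list:
    """Split a DNA sequence into non-overlapping k-mers, truncated to max_tokens.

    Trailing bases that do not fill a complete k-mer are dropped.
    """
    usable = len(sequence) - len(sequence) % k
    tokens = [sequence[i:i + k] for i in range(0, usable, k)]
    return tokens[:max_tokens]

tokens = kmerize("ATCGGAAAAAGA")  # -> ["ATCGGA", "AAAAGA"]
```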
### Credit and License

The DNA encoder is the Nucleotide Transformer v2 ([Dalla-Torre et al., 2024](https://www.nature.com/articles/s41592-024-02523-z)), and the English decoder is Vicuna-7B ([Chiang et al., 2023](https://lmsys.org/blog/2023-03-30-vicuna/)). All code and model artifacts are released under ???.
### Limitations and Disclaimer

While ChatNT excels at conversational molecular-phenotype tasks, it is **not** a clinical or diagnostic tool. It can produce incorrect or "hallucinated" answers, particularly on out-of-distribution inputs, and its numeric predictions may suffer digit-level errors. Confidence estimates require post-hoc calibration. Users should always validate critical outputs against experiments or specialized bioinformatics pipelines.
## How to use

Until its next release, the `transformers` library needs to be installed from source with the following command in order to use the models. PyTorch should also be installed.

```
pip install --upgrade git+https://github.com/huggingface/transformers.git
pip install torch
```

A small snippet of code is given here to **generate ChatNT answers from a pipeline (high-level)**.

- The prompt used for training ChatNT is already incorporated inside the pipeline and is the following: "A chat between a curious user and an artificial intelligence assistant that can handle bio sequences. The assistant gives helpful, detailed, and polite answers to the user's questions."

```
# Load pipeline
from transformers import pipeline, AutoTokenizer

pipe = pipeline(model="InstaDeepAI/ChatNT", trust_remote_code=True)

# Load tokenizers
english_tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/ChatNT", subfolder="english_tokenizer")
bio_tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/ChatNT", subfolder="bio_tokenizer")

# Define custom inputs (note that the number of <DNA> tokens in the english sequence must be equal to len(dna_sequences))
# Here the english sequence should include the prompt
english_sequence = "A chat between a curious user and an artificial intelligence assistant that can handle bio sequences. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Is there any evidence of an acceptor splice site in this sequence <DNA> ?"
dna_sequences = ["ATCGGAAAAAGATCCAGAAAGTTATACCAGGCCAATGGGAATCACCTATTACGTGGATAATAGCGATAGTATGTTACCTATAAATTTAACTACGTGGATATCAGGCAGTTACGTTACCAGTCAAGGAGCACCCAAAACTGTCCAGCAACAAGTTAATTTACCCATGAAGATGTACTGCAAGCCTTGCCAACCAGTTAAAGTAGCTACTCATAAGGTAATAAACAGTAATATCGACTTTTTATCCATTTTGATAATTGATTTATAACAGTCTATAACTGATCGCTCTACATAATCTCTATCAGATTACTATTGACACAAACAGAAACCCCGTTAATTTGTATGATATATTTCCCGGTAAGCTTCGATTTTTAATCCTATCGTGACAATTTGGAATGTAACTTATTTCGTATAGGATAAACTAATTTACACGTTTGAATTCCTAGAATATGGAGAATCTAAAGGTCCTGGCAATGCCATCGGCTTTCAATATTATAATGGACCAAAAGTTACTCTATTAGCTTCCAAAACTTCGCGTGAGTACATTAGAACAGAAGAATAACCTTCAATATCGAGAGAGTTACTATCACTAACTATCCTATG"]
```