Update README.md
README.md CHANGED
@@ -11,8 +11,14 @@ arxiv: 2502.07272
 
 # GENERator-eukaryote-1.2b-base model
 
-## Important Notice
-
+## **Important Notice**
+If you are using **GENERator** for sequence generation, please ensure that the length of each input sequence is a multiple of **6**. This can be achieved by either:
+1. Padding the sequence on the left with `'A'` (**left padding**), or
+2. Simply truncating the sequence from the left (**left truncation**).
+
+This requirement arises because **GENERator** employs a 6-mer tokenizer. If the input sequence length is not a multiple of **6**, the tokenizer appends an `<oov>` (out-of-vocabulary) token to the end of the token sequence, which can derail subsequent generation into uninformative output such as repeated `'AAAAAA'`.
+
+We apologize for any inconvenience this may cause and recommend following the guidelines above to ensure accurate and meaningful generation results.
 
 ## About
 In this repository, we present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs and 1.2B parameters, trained on an expansive dataset comprising 386 billion base pairs of eukaryotic DNA. Our evaluations demonstrate that GENERator consistently achieves state-of-the-art performance across a wide spectrum of benchmarks, including [Genomic Benchmarks](https://huggingface.co/datasets/katielink/genomic-benchmarks/tree/main), [NT tasks](https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks_revised), and our newly proposed [Gener tasks](https://huggingface.co/GenerTeam).

@@ -37,6 +43,7 @@ config = model.config
 max_length = config.max_position_embeddings
 
 # Define input sequences.
+# The input sequence length should be a multiple of 6.
 sequences = [
     "ATGAGGTGGCAAGAAATGGGCTAC",
     "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
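The rule this commit introduces is easy to automate. Below is a minimal Python sketch of the two options the notice describes; the helper names and the third example sequence are invented for illustration and are not part of the GENERator repository or of the diff above.

```python
# Minimal sketch of the multiple-of-6 rule from the Important Notice above.
# Helper names are illustrative only; they are not part of the GENERator API.

def left_pad_to_multiple_of_6(seq: str) -> str:
    """Option 1: left-pad with 'A' until the length is a multiple of 6."""
    return "A" * (-len(seq) % 6) + seq

def left_truncate_to_multiple_of_6(seq: str) -> str:
    """Option 2: drop leading bases until the length is a multiple of 6."""
    return seq[len(seq) % 6:]

sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",              # 24 bp, already a multiple of 6
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT",  # 36 bp, already a multiple of 6
    "ATGAGGTGGCAAGAAATGGGCTACAG",            # 26 bp, hypothetical, needs adjusting
]

padded = [left_pad_to_multiple_of_6(s) for s in sequences]
assert all(len(s) % 6 == 0 for s in padded)  # whole 6-mer tokens only
```

Left padding keeps every original base at the cost of a few artificial leading `'A'`s, while left truncation discards up to five leading bases; per the notice, either way the 6-mer tokenizer sees whole tokens only, so no `<oov>` is appended.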