GenerTeam committed
Commit 72c1d4f · verified · Parent: 9688ffe

Update README.md

Files changed (1): README.md +9 -2
README.md CHANGED
@@ -11,8 +11,14 @@ arxiv: 2502.07272
# GENERator-eukaryote-1.2b-base model

- ## Important Notice !!!
- An issue was identified in the `model.safetensors` file of the initial release, likely caused by an unstable internet connection during upload. If you downloaded **GENERator-eukaryote-1.2b-base** before **February 26, 2025**, please re-download the model to ensure optimal and reliable performance.
+ ## **Important Notice**
+ If you are using **GENERator** for sequence generation, please ensure that the length of each input sequence is a multiple of **6**. This can be achieved by either:
+ 1. Padding the sequence on the left with `'A'` (**left padding**), or
+ 2. Simply truncating the sequence from the left (**left truncation**).
+
+ This requirement arises because **GENERator** employs a 6-mer tokenizer. If the input sequence length is not a multiple of **6**, the tokenizer will append an `<oov>` (out-of-vocabulary) token to the end of the token sequence. This can result in uninformative subsequent generations, such as repeated `'AAAAAA'`.
+
+ We apologize for any inconvenience this may cause and recommend adhering to the above guidelines to ensure accurate and meaningful generation results.

## Abouts
In this repository, we present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs and 1.2B parameters, trained on an expansive dataset comprising 386 billion base pairs of eukaryotic DNA. Our evaluations demonstrate that the GENERator consistently achieves state-of-the-art performance across a wide spectrum of benchmarks, including [Genomic Benchmarks](https://huggingface.co/datasets/katielink/genomic-benchmarks/tree/main), [NT tasks](https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks_revised), and our newly proposed [Gener tasks](https://huggingface.co/GenerTeam).
@@ -37,6 +43,7 @@ config = model.config
max_length = config.max_position_embeddings

# Define input sequences.
+ # The input sequence length should be a multiple of 6.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
 