GenerTeam committed
Commit 72c1d4f · verified · Parent: 9688ffe

Update README.md

Files changed (1): README.md +9 -2
README.md CHANGED
@@ -11,8 +11,14 @@ arxiv: 2502.07272
# GENERator-eukaryote-1.2b-base model

- ## Important Notice !!!
- An issue was identified in the `model.safetensors` file of the initial release, likely caused by an unstable internet connection during upload. If you downloaded **GENERator-eukaryote-1.2b-base** before **February 26, 2025**, please re-download the model to ensure optimal and reliable performance.
+ ## **Important Notice**
+ If you are using **GENERator** for sequence generation, please ensure that the length of each input sequence is a multiple of **6**. This can be achieved by either:
+ 1. Padding the sequence on the left with `'A'` (**left padding**), or
+ 2. Simply truncating the sequence from the left (**left truncation**).
+
+ This requirement arises because **GENERator** employs a 6-mer tokenizer. If the input sequence length is not a multiple of **6**, the tokenizer will append an `<oov>` (out-of-vocabulary) token to the end of the token sequence. This can result in uninformative subsequent generations, such as repeated `'AAAAAA'`.
+
+ We apologize for any inconvenience this may cause and recommend adhering to the above guidelines to ensure accurate and meaningful generation results.

## Abouts
In this repository, we present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs and 1.2B parameters, trained on an expansive dataset comprising 386 billion base pairs of eukaryotic DNA. Our evaluations demonstrate that the GENERator consistently achieves state-of-the-art performance across a wide spectrum of benchmarks, including [Genomic Benchmarks](https://huggingface.co/datasets/katielink/genomic-benchmarks/tree/main), [NT tasks](https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks_revised), and our newly proposed [Gener tasks](https://huggingface.co/GenerTeam).
@@ -37,6 +43,7 @@ config = model.config
max_length = config.max_position_embeddings

# Define input sequences.
+ # The input sequence length should be a multiple of 6.
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
 