GENERanno-eukaryote-0.5b-base model
About
In this repository, we present GENERanno, a genomic foundation model featuring a context length of 8k base pairs and 500M parameters, trained on an expansive dataset comprising 386 billion base pairs of eukaryotic DNA. Our evaluations demonstrate that GENERanno achieves performance comparable to GENERator across benchmark evaluations, including Genomic Benchmarks, NT tasks, and our newly proposed Gener tasks, making them the top genomic foundation models in the field as of February 2025.
Beyond benchmark performance, GENERanno is meticulously designed to specialize in gene annotation. The model efficiently and accurately identifies gene locations, predicts gene function, and annotates gene structure, highlighting its potential to revolutionize genomic research by significantly enhancing the precision and efficiency of gene annotation. (A hypothetical token-classification sketch of this use case follows the embedding example below.)
Please note that GENERanno is currently in the developmental phase. We are actively refining the model and will release more technical details soon. Stay tuned for updates!
How to use
Simple example: embedding
import torch
from transformers import AutoTokenizer, AutoModel
# Load the tokenizer and model using the pretrained model name
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERanno-eukaryote-0.5b-base")
model = AutoModel.from_pretrained("GenerTeam/GENERanno-eukaryote-0.5b-base", trust_remote_code=True)
# Get model configuration and maximum sequence length
config = model.config
max_length = config.max_position_embeddings
# Define input sequences
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]
# Tokenize the sequences
# add_special_tokens=True includes the tokenizer's special tokens (e.g., the BOS token) in the encoding
tokenizer.padding_side = "right"
inputs = tokenizer(
    sequences,
    add_special_tokens=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)
# Perform a forward pass through the model to obtain the outputs, including hidden states
with torch.inference_mode():
    outputs = model(**inputs, output_hidden_states=True)
# Retrieve the hidden states from the last layer
# hidden_states shape: (batch_size, sequence_length, hidden_size)
hidden_states = outputs.hidden_states[-1]
# Option 1: Use the first token (BOS) as the sentence embedding
cls_embeddings = hidden_states[:, 0, :]
# Option 2: Use mean pooling over the token embeddings
# Use the attention mask to take care of the padded tokens
attention_mask = inputs["attention_mask"] # Shape: (batch_size, sequence_length)
# Expand the attention mask dimensions so that it matches the hidden_states dimensions
expanded_mask = attention_mask.unsqueeze(-1).expand(hidden_states.size()).to(torch.float32)
# Sum the token embeddings, taking the mask into account
sum_embeddings = torch.sum(hidden_states * expanded_mask, dim=1)
# Compute the average by dividing with the sum of the attention mask
mean_embeddings = sum_embeddings / expanded_mask.sum(dim=1)
print("BOS Embeddings:", cls_embeddings)
print("Mean Embeddings:", mean_embeddings)
Citation
@misc{wu2025generator,
      title={GENERator: A Long-Context Generative Genomic Foundation Model},
      author={Wei Wu and Qiuyi Li and Mingyang Li and Kun Fu and Fuli Feng and Jieping Ye and Hui Xiong and Zheng Wang},
      year={2025},
      eprint={2502.07272},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.07272},
}