---
license: mit
pipeline_tag: fill-mask
tags:
  - biology
  - genomics
  - long-context
library_name: transformers
---

# GENERanno-prokaryote-0.5b-base model

## About

In this repository, we present GENERanno, a compact yet powerful genomic foundation model featuring a context length of 8k base pairs with single-nucleotide resolution and 500M parameters, trained on an expansive dataset comprising 715 billion base pairs of prokaryotic DNA. Our evaluations demonstrate that GENERanno consistently achieves state-of-the-art performance across a wide spectrum of biologically meaningful tasks, namely the [Prokaryotic Gener Tasks](https://huggingface.co/datasets/GenerTeam/prokaryotic-gener-tasks) (2025-5).

In addition, we present [GENERanno-prokaryote-0.5b-cds-annotator-preview](https://huggingface.co/GenerTeam/GENERanno-prokaryote-0.5b-cds-annotator-preview), a model meticulously finetuned for metagenomic annotation. Through comprehensive evaluations, GENERanno-cds-annotator achieves superior accuracy compared to traditional HMM-based methods (e.g., [GLIMMER3](https://ccb.jhu.edu/software/glimmer/index.shtml), [GeneMarkS2](https://genemark.bme.gatech.edu/genemarks2.cgi), [Prodigal](https://github.com/hyattpd/Prodigal?tab=readme-ov-file)) and recent LLM-based approaches (e.g., [GeneLM](https://www.biorxiv.org/content/10.1101/2025.03.20.644312v1)), while demonstrating exceptional generalization ability on archaeal genomes. The detailed annotation results are provided [here](https://huggingface.co/datasets/GenerTeam/cds-annotation).

The code and implementation details are available on GitHub: [https://github.com/GenerTeam/GENERanno](https://github.com/GenerTeam/GENERanno).

## How to use

### Simple example: embedding

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model using the pretrained model name
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERanno-prokaryote-0.5b-base")
model = AutoModel.from_pretrained("GenerTeam/GENERanno-prokaryote-0.5b-base", trust_remote_code=True)

# Get model configuration and maximum sequence length
config = model.config
max_length = config.max_position_embeddings

# Define input sequences
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

# Tokenize the sequences; add_special_tokens=True adds the special tokens
tokenizer.padding_side = "right"
inputs = tokenizer(
    sequences,
    add_special_tokens=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Perform a forward pass through the model to obtain the outputs, including hidden states
with torch.inference_mode():
    outputs = model(**inputs, output_hidden_states=True)

# Retrieve the hidden states from the last layer
# hidden_states shape: (batch_size, sequence_length, hidden_size)
hidden_states = outputs.hidden_states[-1]

# Option 1: Use the first token (BOS) as the sequence embedding
cls_embeddings = hidden_states[:, 0, :]

# Option 2: Use mean pooling over the token embeddings,
# using the attention mask to exclude the padded tokens
attention_mask = inputs["attention_mask"]  # Shape: (batch_size, sequence_length)
# Expand the attention mask so that it matches the hidden_states dimensions
expanded_mask = attention_mask.unsqueeze(-1).expand(hidden_states.size()).to(torch.float32)
# Sum the token embeddings, taking the mask into account
sum_embeddings = torch.sum(hidden_states * expanded_mask, dim=1)
# Compute the average by dividing by the sum of the attention mask
mean_embeddings = sum_embeddings / expanded_mask.sum(dim=1)

print("BOS Embeddings:", cls_embeddings)
print("Mean Embeddings:", mean_embeddings)
```
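### Simple example: masked nucleotide prediction

Since the model card lists `fill-mask` as the pipeline tag, the base model can in principle also be queried to recover a masked nucleotide. The sketch below is a minimal example under that assumption; it loads the checkpoint through `AutoModelForMaskedLM` and assumes the tokenizer defines a standard `mask_token`, neither of which is confirmed by this card.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERanno-prokaryote-0.5b-base")
# Assumption: the custom model code exposes a masked-LM head via AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained(
    "GenerTeam/GENERanno-prokaryote-0.5b-base", trust_remote_code=True
)

# Mask one nucleotide in a short sequence and ask the model to recover it.
# Assumption: the tokenizer defines a mask token for single-nucleotide positions.
sequence = f"ATGAGGTGGCAAGAA{tokenizer.mask_token}TGGGCTAC"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.inference_mode():
    logits = model(**inputs).logits

# Locate the masked position and take the top-scoring tokens for it
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
mask_logits = logits[0, mask_positions[0]]
top_ids = torch.topk(mask_logits, k=4).indices
print("Top predictions:", tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```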
## Citation

```
@article{li2025generanno,
  author = {Li, Qiuyi and Wu, Wei and Zhu, Yiheng and Feng, Fuli and Ye, Jieping and Wang, Zheng},
  title = {GENERanno: A Genomic Foundation Model for Metagenomic Annotation},
  elocation-id = {2025.06.04.656517},
  year = {2025},
  doi = {10.1101/2025.06.04.656517},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2025/06/05/2025.06.04.656517},
  journal = {bioRxiv}
}
```