---
license: mit
pipeline_tag: fill-mask
tags:
  - biology
  - genomics
  - long-context
library_name: transformers
---

# GENERanno-prokaryote-0.5b-base model

## About

In this repository, we present GENERanno, a compact yet powerful genomic foundation model featuring a context length of 8k base pairs with single-nucleotide resolution and 500M parameters, trained on an expansive dataset comprising 715 billion base pairs of prokaryotic DNA. Our evaluations demonstrate that GENERanno consistently achieves state-of-the-art performance across a wide spectrum of biologically meaningful tasks, namely the [Prokaryotic Gener Tasks](https://huggingface.co/datasets/GenerTeam/prokaryotic-gener-tasks) (2025-5).

In addition, we present [GENERanno-prokaryote-0.5b-cds-annotator-preview](https://huggingface.co/GenerTeam/GENERanno-prokaryote-0.5b-cds-annotator-preview), a model meticulously finetuned for metagenomic annotation. Through comprehensive evaluations, GENERanno-cds-annotator achieves superior accuracy compared to traditional HMM-based methods (e.g., [GLIMMER3](https://ccb.jhu.edu/software/glimmer/index.shtml), [GeneMarkS2](https://genemark.bme.gatech.edu/genemarks2.cgi), [Prodigal](https://github.com/hyattpd/Prodigal?tab=readme-ov-file)) and recent LLM-based approaches (e.g., [GeneLM](https://www.biorxiv.org/content/10.1101/2025.03.20.644312v1)), while demonstrating exceptional generalization ability on archaeal genomes. The detailed annotation results are provided [here](https://huggingface.co/datasets/GenerTeam/cds-annotation).

The code and implementation details are available on GitHub: [https://github.com/GenerTeam/GENERanno](https://github.com/GenerTeam/GENERanno).

## How to use

### Simple example: embedding

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and model using the pretrained model name
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERanno-prokaryote-0.5b-base")
model = AutoModel.from_pretrained("GenerTeam/GENERanno-prokaryote-0.5b-base", trust_remote_code=True)

# Get model configuration and maximum sequence length
config = model.config
max_length = config.max_position_embeddings

# Define input sequences
sequences = [
    "ATGAGGTGGCAAGAAATGGGCTAC",
    "GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT"
]

# Tokenize the sequences; add_special_tokens=True adds the special tokens
tokenizer.padding_side = "right"
inputs = tokenizer(
    sequences,
    add_special_tokens=True,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=max_length
)

# Perform a forward pass through the model to obtain the outputs, including hidden states
with torch.inference_mode():
    outputs = model(**inputs, output_hidden_states=True)

# Retrieve the hidden states from the last layer
# hidden_states shape: (batch_size, sequence_length, hidden_size)
hidden_states = outputs.hidden_states[-1]

# Option 1: Use the first token (BOS) as the sequence embedding
cls_embeddings = hidden_states[:, 0, :]

# Option 2: Use mean pooling over the token embeddings,
# using the attention mask to exclude the padded tokens
attention_mask = inputs["attention_mask"]  # Shape: (batch_size, sequence_length)
# Expand the attention mask so that it matches the hidden_states dimensions
expanded_mask = attention_mask.unsqueeze(-1).expand(hidden_states.size()).to(torch.float32)
# Sum the token embeddings, taking the mask into account
sum_embeddings = torch.sum(hidden_states * expanded_mask, dim=1)
# Compute the average by dividing by the sum of the attention mask
mean_embeddings = sum_embeddings / expanded_mask.sum(dim=1)

print("BOS Embeddings:", cls_embeddings)
print("Mean Embeddings:", mean_embeddings)
```
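### Simple example: masked nucleotide prediction

Since the model card lists `fill-mask` as the pipeline tag, the base model can in principle also be queried to recover a masked nucleotide. The sketch below is a minimal example under that assumption; it loads the checkpoint through `AutoModelForMaskedLM` and assumes the tokenizer defines a standard `mask_token`, neither of which is confirmed by this card.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERanno-prokaryote-0.5b-base")
# Assumption: the custom model code exposes a masked-LM head via AutoModelForMaskedLM
model = AutoModelForMaskedLM.from_pretrained(
    "GenerTeam/GENERanno-prokaryote-0.5b-base", trust_remote_code=True
)

# Mask one nucleotide in a short sequence and ask the model to recover it.
# Assumption: the tokenizer defines a mask token for single-nucleotide positions.
sequence = f"ATGAGGTGGCAAGAA{tokenizer.mask_token}TGGGCTAC"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.inference_mode():
    logits = model(**inputs).logits

# Locate the masked position and take the top-scoring tokens for it
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
mask_logits = logits[0, mask_positions[0]]
top_ids = torch.topk(mask_logits, k=4).indices
print("Top predictions:", tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```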
## Citation

```
@article{li2025generanno,
  author = {Li, Qiuyi and Wu, Wei and Zhu, Yiheng and Feng, Fuli and Ye, Jieping and Wang, Zheng},
  title = {GENERanno: A Genomic Foundation Model for Metagenomic Annotation},
  elocation-id = {2025.06.04.656517},
  year = {2025},
  doi = {10.1101/2025.06.04.656517},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2025/06/05/2025.06.04.656517},
  journal = {bioRxiv}
}
```