|
--- |
|
license: mit |
|
pipeline_tag: fill-mask |
|
tags: |
|
- biology |
|
- genomics |
|
- long-context |
|
library_name: transformers |
|
--- |
|
# GENERanno-eukaryote-0.5b-base model |
|
|
|
## About
|
In this repository, we present GENERanno, a genomic foundation model with a context length of 8k base pairs and 500M parameters, trained on an expansive dataset comprising 386 billion base pairs of eukaryotic DNA. Our evaluations demonstrate that GENERanno achieves performance comparable to [GENERator](https://huggingface.co/GenerTeam/GENERator-eukaryote-1.2b-base) on benchmarks including [Genomic Benchmarks](https://huggingface.co/datasets/katielink/genomic-benchmarks/tree/main), [NT tasks](https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks_revised), and our newly proposed [Gener tasks](https://huggingface.co/GenerTeam), making them the top genomic foundation models in the field (as of 2025-02).
|
|
|
Beyond benchmark performance, GENERanno is designed specifically for gene annotation. The model efficiently and accurately identifies gene locations, predicts gene function, and annotates gene structure, highlighting its potential to substantially improve the precision and efficiency of gene annotation in genomic research.
|
|
|
Please note that GENERanno is still under active development. We are refining the model and will release more technical details soon. Stay tuned for updates!
|
|
|
## How to use |
|
### Simple example: embedding |
|
|
|
```python |
|
|
|
import torch |
|
from transformers import AutoTokenizer, AutoModel |
|
|
|
# Load the tokenizer and model using the pretrained model name |
|
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERanno-eukaryote-0.5b-base") |
|
model = AutoModel.from_pretrained("GenerTeam/GENERanno-eukaryote-0.5b-base", trust_remote_code=True) |
|
|
|
# Get model configuration and maximum sequence length |
|
config = model.config |
|
max_length = config.max_position_embeddings |
|
|
|
# Define input sequences |
|
sequences = [ |
|
"ATGAGGTGGCAAGAAATGGGCTAC", |
|
"GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT" |
|
] |
|
|
|
# Tokenize the sequences |
|
# add_special_tokens=True inserts the tokenizer's special tokens (e.g., BOS) around each sequence
|
tokenizer.padding_side = "right" |
|
inputs = tokenizer( |
|
sequences, |
|
add_special_tokens=True, |
|
return_tensors="pt", |
|
padding=True, |
|
truncation=True, |
|
max_length=max_length |
|
) |
|
|
|
# Perform a forward pass through the model to obtain the outputs, including hidden states |
|
with torch.inference_mode(): |
|
outputs = model(**inputs, output_hidden_states=True) |
|
|
|
# Retrieve the hidden states from the last layer |
|
# hidden_states shape: (batch_size, sequence_length, hidden_size) |
|
hidden_states = outputs.hidden_states[-1] |
|
|
|
# Option 1: Use the first token (BOS) as the sequence embedding
|
cls_embeddings = hidden_states[:, 0, :] |
|
|
|
# Option 2: Use mean pooling over the token embeddings |
|
# Use the attention mask to take care of the padded tokens |
|
attention_mask = inputs["attention_mask"] # Shape: (batch_size, sequence_length) |
|
# Expand the attention mask dimensions so that it matches the hidden_states dimensions |
|
expanded_mask = attention_mask.unsqueeze(-1).expand(hidden_states.size()).to(torch.float32) |
|
# Sum the token embeddings, taking the mask into account |
|
sum_embeddings = torch.sum(hidden_states * expanded_mask, dim=1) |
|
# Compute the average by dividing by the number of non-padded tokens
|
mean_embeddings = sum_embeddings / expanded_mask.sum(dim=1) |
|
|
|
print("BOS Embeddings:", cls_embeddings) |
|
print("Mean Embeddings:", mean_embeddings) |
|
``` |
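
The resulting vectors can be compared directly. As a minimal illustration continuing the snippet above (not part of the original example), the mean-pooled embeddings of the two sequences can be scored with cosine similarity:

```python
import torch.nn.functional as F

# Cosine similarity between the mean embeddings of the two input sequences
similarity = F.cosine_similarity(mean_embeddings[0], mean_embeddings[1], dim=-1)
print("Cosine similarity:", similarity.item())
```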
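
### Simple example: masked token prediction

Since this model card lists `fill-mask` as the pipeline tag, the checkpoint can in principle be queried through a masked-LM head. The snippet below is a minimal sketch under two assumptions not confirmed by this README: that the repository's remote code exposes the model via `AutoModelForMaskedLM`, and that the tokenizer defines a mask token. Please check the repository files before relying on it.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERanno-eukaryote-0.5b-base")
# Assumption: the remote code registers a masked-LM class for this checkpoint
model = AutoModelForMaskedLM.from_pretrained(
    "GenerTeam/GENERanno-eukaryote-0.5b-base", trust_remote_code=True
)

sequence = "ATGAGGTGGCAAGAAATGGGCTAC"
inputs = tokenizer(sequence, add_special_tokens=True, return_tensors="pt")

# Assumption: the tokenizer defines a mask token; mask one position near the middle
input_ids = inputs["input_ids"].clone()
mask_position = input_ids.shape[1] // 2
input_ids[0, mask_position] = tokenizer.mask_token_id

with torch.inference_mode():
    logits = model(input_ids=input_ids, attention_mask=inputs["attention_mask"]).logits

# Report the top-5 candidate tokens for the masked position
top5 = torch.topk(logits[0, mask_position], k=5)
print(tokenizer.convert_ids_to_tokens(top5.indices.tolist()))
```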
|
|
|
## Citation |
|
``` |
|
@misc{wu2025generator, |
|
title={GENERator: A Long-Context Generative Genomic Foundation Model}, |
|
author={Wei Wu and Qiuyi Li and Mingyang Li and Kun Fu and Fuli Feng and Jieping Ye and Hui Xiong and Zheng Wang}, |
|
year={2025}, |
|
eprint={2502.07272}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2502.07272}, |
|
} |
|
``` |