|
--- |
|
license: mit |
|
pipeline_tag: fill-mask |
|
tags: |
|
- biology |
|
- genomics |
|
- long-context |
|
library_name: transformers |
|
--- |
|
# GENERanno-eukaryote-0.5b-base model |
|
|
|
## About
|
In this repository, we present GENERanno, a genomic foundation model with a context length of 8k base pairs and 500M parameters, trained on an expansive dataset comprising 386 billion base pairs of eukaryotic DNA. Our evaluations demonstrate that GENERanno achieves performance comparable to [GENERator](https://huggingface.co/GenerTeam/GENERator-eukaryote-1.2b-base) on benchmarks including [Genomic Benchmarks](https://huggingface.co/datasets/katielink/genomic-benchmarks/tree/main), [NT tasks](https://huggingface.co/datasets/InstaDeepAI/nucleotide_transformer_downstream_tasks_revised), and our newly proposed [Gener tasks](https://huggingface.co/GenerTeam), making them the top genomic foundation models in the field (as of 2025-02).
|
|
|
Beyond benchmark performance, GENERanno is designed specifically for gene annotation. The model efficiently and accurately identifies gene locations, predicts gene function, and annotates gene structure, highlighting its potential to substantially improve the precision and efficiency of gene annotation in genomic research.
|
|
|
Please note that GENERanno is still under active development. We are refining the model and will release more technical details soon. Stay tuned for updates!
|
|
|
## How to use |
|
### Simple example: embedding |
|
|
|
```python |
|
|
|
import torch |
|
from transformers import AutoTokenizer, AutoModel |
|
|
|
# Load the tokenizer and model using the pretrained model name |
|
tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERanno-eukaryote-0.5b-base") |
|
model = AutoModel.from_pretrained("GenerTeam/GENERanno-eukaryote-0.5b-base", trust_remote_code=True) |
|
|
|
# Get model configuration and maximum sequence length |
|
config = model.config |
|
max_length = config.max_position_embeddings |
|
|
|
# Define input sequences |
|
sequences = [ |
|
"ATGAGGTGGCAAGAAATGGGCTAC", |
|
"GAATTCCATGAGGCTATAGAATAATCTAAGAGAAAT" |
|
] |
|
|
|
# Tokenize the sequences |
|
# add_special_tokens=True inserts the tokenizer's special tokens (e.g., BOS) around each sequence
|
tokenizer.padding_side = "right" |
|
inputs = tokenizer( |
|
sequences, |
|
add_special_tokens=True, |
|
return_tensors="pt", |
|
padding=True, |
|
truncation=True, |
|
max_length=max_length |
|
) |
|
|
|
# Perform a forward pass through the model to obtain the outputs, including hidden states |
|
with torch.inference_mode(): |
|
outputs = model(**inputs, output_hidden_states=True) |
|
|
|
# Retrieve the hidden states from the last layer |
|
# hidden_states shape: (batch_size, sequence_length, hidden_size) |
|
hidden_states = outputs.hidden_states[-1] |
|
|
|
# Option 1: Use the first token (BOS) as the sequence embedding
|
cls_embeddings = hidden_states[:, 0, :] |
|
|
|
# Option 2: Use mean pooling over the token embeddings |
|
# Use the attention mask to take care of the padded tokens |
|
attention_mask = inputs["attention_mask"] # Shape: (batch_size, sequence_length) |
|
# Expand the attention mask dimensions so that it matches the hidden_states dimensions |
|
expanded_mask = attention_mask.unsqueeze(-1).expand(hidden_states.size()).to(torch.float32) |
|
# Sum the token embeddings, taking the mask into account |
|
sum_embeddings = torch.sum(hidden_states * expanded_mask, dim=1) |
|
# Compute the average by dividing by the number of non-padded tokens
|
mean_embeddings = sum_embeddings / expanded_mask.sum(dim=1) |
|
|
|
print("BOS Embeddings:", cls_embeddings) |
|
print("Mean Embeddings:", mean_embeddings) |
|
``` |
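
The resulting vectors can be compared directly. As a minimal illustration continuing the snippet above (not part of the original example), the mean-pooled embeddings of the two sequences can be scored with cosine similarity:

```python
import torch.nn.functional as F

# Cosine similarity between the mean embeddings of the two input sequences
similarity = F.cosine_similarity(mean_embeddings[0], mean_embeddings[1], dim=-1)
print("Cosine similarity:", similarity.item())
```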
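
### Simple example: masked token prediction

Since this model card lists `fill-mask` as the pipeline tag, the checkpoint can in principle be queried through a masked-LM head. The snippet below is a minimal sketch under two assumptions not confirmed by this README: that the repository's remote code exposes the model via `AutoModelForMaskedLM`, and that the tokenizer defines a mask token. Please check the repository files before relying on it.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("GenerTeam/GENERanno-eukaryote-0.5b-base")
# Assumption: the remote code registers a masked-LM class for this checkpoint
model = AutoModelForMaskedLM.from_pretrained(
    "GenerTeam/GENERanno-eukaryote-0.5b-base", trust_remote_code=True
)

sequence = "ATGAGGTGGCAAGAAATGGGCTAC"
inputs = tokenizer(sequence, add_special_tokens=True, return_tensors="pt")

# Assumption: the tokenizer defines a mask token; mask one position near the middle
input_ids = inputs["input_ids"].clone()
mask_position = input_ids.shape[1] // 2
input_ids[0, mask_position] = tokenizer.mask_token_id

with torch.inference_mode():
    logits = model(input_ids=input_ids, attention_mask=inputs["attention_mask"]).logits

# Report the top-5 candidate tokens for the masked position
top5 = torch.topk(logits[0, mask_position], k=5)
print(tokenizer.convert_ids_to_tokens(top5.indices.tolist()))
```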
|
|
|
## Citation |
|
``` |
|
@misc{wu2025generator, |
|
title={GENERator: A Long-Context Generative Genomic Foundation Model}, |
|
author={Wei Wu and Qiuyi Li and Mingyang Li and Kun Fu and Fuli Feng and Jieping Ye and Hui Xiong and Zheng Wang}, |
|
year={2025}, |
|
eprint={2502.07272}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2502.07272}, |
|
} |
|
``` |