	---
	license: cc-by-nc-sa-4.0
	widget:
	- text: AAAAGCGACATGACCAAACTGCCCCTCACCCGCCGCACTGATGACCGA
	tags:
	- DNA
	- biology
	- genomics
	datasets:
	- zhangtaolab/plant_reference_genomes
	---
	# Plant foundation DNA large language models

	The plant DNA large language models (LLMs) are a series of foundation models based on different model architectures and pre-trained on various plant reference genomes.
	All models are of comparable size (between 90 MB and 150 MB). They use a BPE tokenizer with a vocabulary of 8000 tokens.
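
To illustrate the idea behind BPE tokenization, here is a minimal, self-contained sketch that learns a few merges on toy DNA strings. It is not the tokenizer shipped with these models: the function names (`learn_bpe_merges`, `merge_pair`, `apply_merges`), the toy corpus, and the merge count are assumptions chosen for illustration only.

```python
from collections import Counter


def learn_bpe_merges(corpus, num_merges):
    """Learn up to `num_merges` BPE merges from a list of DNA strings (toy version)."""
    # Start from single-base symbols.
    words = [list(seq) for seq in corpus]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols in words:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken lexicographically for determinism.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        words = [merge_pair(symbols, best) for symbols in words]
    return merges


def merge_pair(symbols, pair):
    """Replace each non-overlapping occurrence of `pair` with the joined symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def apply_merges(sequence, merges):
    """Tokenize one sequence by replaying the learned merges in order."""
    symbols = list(sequence)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# Hypothetical toy corpus; a real tokenizer is trained on whole genomes.
toy_corpus = ["ATATATGC", "ATGCATGC", "GCGCAT"]
merges = learn_bpe_merges(toy_corpus, num_merges=3)
tokens = apply_merges("ATGCAT", merges)
print(merges)
print(tokens)
```

On this toy corpus the learned merges are `A+T`, `G+C`, then `AT+GC`, so `ATGCAT` is split into `ATGC` and `AT`. A real vocabulary is built the same way, only with many more merges.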


	Developed by: zhangtaolab

	### Model Sources

	- Repository: [Plant DNA LLMs](https://github.com/zhangtaolab/plant_DNA_LLMs)
	- Manuscript: [Versatile applications of foundation DNA language models in plant genomes]()

	### Architecture

	The model is based on the OpenAI GPT-2 architecture, with a modified tokenizer specific to DNA sequences.

	### How to use

	Install the runtime libraries first (the example below also imports `torch`):
	```bash
	pip install transformers torch
	```

	Here is a simple code example for inference:
	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	model_name = 'plant-dnagpt-singlebase'
	# load model and tokenizer
	model = AutoModelForCausalLM.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)
	tokenizer = AutoTokenizer.from_pretrained(f'zhangtaolab/{model_name}', trust_remote_code=True)

	# example sequence and tokenization
	sequences = ['ATATACGGCCGNC', 'GGGTATCGCTTCCGAC']
	tokens = tokenizer(sequences, padding="longest")['input_ids']
	print(f"Tokenized sequence: {tokenizer.batch_decode(tokens)}")

	# inference
	device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
	model.to(device)
	inputs = tokenizer(sequences, truncation=True, padding='max_length', max_length=512,
	                   return_tensors="pt")
	inputs = {k: v.to(device) for k, v in inputs.items()}
	outs = model(
	    **inputs,
	    output_hidden_states=True
	)

	# get the final layer embeddings and prediction logits
	# (move tensors back to the CPU before converting to NumPy; this is required when running on a GPU)
	embeddings = outs['hidden_states'][-1].detach().cpu().numpy()
	logits = outs['logits'].detach().cpu().numpy()
	```
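
Because the example pads every sequence to 512 tokens, most positions in `embeddings` belong to padding. If a single vector per sequence is needed, one common option (an assumption here, not something this model card prescribes) is to average the hidden states over non-padding positions using the attention mask. The sketch below uses plain Python lists and toy numbers so it runs without any dependencies; `masked_mean_pool` is a hypothetical helper name.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors over positions where the mask is 1.

    hidden_states: list of sequences, each a list of token vectors (lists of floats).
    attention_mask: list of sequences, each a list of 0/1 flags (1 = real token).
    """
    pooled = []
    for vectors, mask in zip(hidden_states, attention_mask):
        dim = len(vectors[0])
        total = [0.0] * dim
        count = 0
        for vector, keep in zip(vectors, mask):
            if keep:
                total = [t + v for t, v in zip(total, vector)]
                count += 1
        # Guard against an all-padding row to avoid dividing by zero.
        pooled.append([t / count for t in total] if count else total)
    return pooled


# Toy batch: 2 sequences, 3 positions, hidden size 2; trailing positions are padding.
toy_hidden = [
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],
    [[2.0, 0.0], [9.0, 9.0], [9.0, 9.0]],
]
toy_mask = [[1, 1, 0], [1, 0, 0]]
pooled = masked_mean_pool(toy_hidden, toy_mask)
print(pooled)
```

With the tensors from the inference example, the same idea could be applied to `outs['hidden_states'][-1]` and `inputs['attention_mask']` using tensor operations instead of loops.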


	### Training data
	We pre-train the model with the causal language modeling (CausalLM) objective; tokenized sequences have a maximum length of 512.
	The detailed training procedure can be found in our manuscript.
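
As a rough illustration of the causal language modeling objective, the sketch below computes the mean next-token cross-entropy for one toy sequence: the logits at position t are scored against the token at position t + 1. The 4-token vocabulary, the toy logits, and the helper names (`log_softmax`, `causal_lm_loss`) are made up for illustration; the actual training setup is the one described in the manuscript.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def causal_lm_loss(logits, token_ids):
    """Mean next-token cross-entropy for a single sequence.

    logits: one list of vocabulary scores per position.
    token_ids: the input token ids, same length as `logits`.
    """
    # Shift by one: the prediction at position t is compared with token t + 1,
    # so the last position has no target and the first token is never predicted.
    losses = []
    for position_logits, target in zip(logits[:-1], token_ids[1:]):
        losses.append(-log_softmax(position_logits)[target])
    if not losses:
        raise ValueError("need at least two tokens to form a next-token target")
    return sum(losses) / len(losses)


# Toy example with a hypothetical 4-token vocabulary (ids 0..3).
toy_ids = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
loss = causal_lm_loss(uniform_logits, toy_ids)
print(loss)
```

With uniform logits every token is equally likely, so the loss equals the natural log of the vocabulary size; a model that predicts the next token well drives this value toward zero.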


	#### Hardware
	The model was pre-trained on an NVIDIA RTX 4090 GPU (24 GB).