GenerRNA

GenerRNA is a generative RNA language model based on a Transformer decoder-only architecture. It was pre-trained on 30M sequences, encompassing 17B nucleotides.

Here you can find all the relevant scripts for running GenerRNA on your machine. GenerRNA enables you to generate RNA sequences in a zero-shot manner to explore the RNA space, or to fine-tune the model on a specific dataset to generate RNAs belonging to a particular family or possessing specific characteristics.

Requirements

A CUDA environment with a minimum of 8 GB of VRAM is required.
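
To quickly confirm that a CUDA device is visible to PyTorch before running the scripts, a minimal check:

import torch

# Reports whether CUDA is available and how much memory the first GPU has.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("VRAM (GB):", torch.cuda.get_device_properties(0).total_memory / 1e9)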

Dependencies

torch>=2.0
numpy
transformers==4.33.0.dev0
datasets==2.14.4
tqdm
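
As a rough installation sketch (transformers==4.33.0.dev0 is a development build, so it typically has to be installed from the transformers GitHub repository rather than from PyPI; pin the revision that matches your environment):

pip install "torch>=2.0" numpy datasets==2.14.4 tqdm
pip install git+https://github.com/huggingface/transformers.git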

Usage

First, recombine the split model checkpoint:

cat model.pt.part-* > model.pt.recombined
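
As a quick sanity check that the parts were joined correctly (a minimal sketch, assuming the recombined file is a standard PyTorch checkpoint):

import torch

# Load the recombined checkpoint on CPU and inspect its top-level structure.
ckpt = torch.load("model.pt.recombined", map_location="cpu")
print(type(ckpt), list(ckpt.keys()) if isinstance(ckpt, dict) else "")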

Directory tree

.
├── LICENSE
├── README.md
├── configs
│   ├── example_finetuning.py
│   └── example_pretraining.py
├── experiments_data
├── model.pt.part-aa # split binary data of the *HISTORICAL* model (shorter context window, lower VRAM consumption)
├── model.pt.part-ab
├── model.pt.part-ac
├── model.pt.part-ad
├── model_updated.pt # *NEWER* model, with a longer context window, trained on a deduplicated dataset
├── model.py         # defines the architecture
├── sampling.py      # script to generate sequences
├── tokenization.py  # prepares data
├── tokenizer_bpe_1024
│   ├── tokenizer.json
│   └── ...
└── train.py # script for training/fine-tuning

De novo Generation in a zero-shot fashion

Usage example:

python sampling.py \
    --out_path {output_file_path} \
    --max_new_tokens 256 \
    --ckpt_path {model.pt} \
    --tokenizer_path {path_to_tokenizer_directory, e.g. ./tokenizer_bpe_1024}
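
For a quick look at what was generated, a minimal post-processing sketch (this assumes the output file contains one sequence per line, which is an assumption about the output format, and generated.txt stands in for your --out_path):

# Read generated sequences and report simple statistics (length and GC content).
with open("generated.txt") as f:
    seqs = [line.strip() for line in f if line.strip()]

for i, seq in enumerate(seqs):
    gc = sum(base in "GC" for base in seq) / max(len(seq), 1)
    print(f"seq {i}: length={len(seq)} GC={gc:.2f}")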

Pre-training or Fine-tuning on your own sequences

First, tokenize your sequence data, ensuring each sequence is on a separate line and there is no header.

python tokenization.py \
    --data_dir {path_to_the_directory_containing_sequence_data} \
    --file_name {file_name_of_sequence_data} \
    --tokenizer_path {path_to_tokenizer_directory}  \
    --out_dir {directory_to_save_tokenized_data} \
    --block_size 256
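
The tokenization step expects plain text with one sequence per line; if your data is in FASTA format, it needs to be flattened first. A minimal sketch (seqs.fasta and seqs.txt are hypothetical file names):

# Convert a FASTA file into the one-sequence-per-line format expected by tokenization.py.
records, current = [], []
with open("seqs.fasta") as f:
    for line in f:
        line = line.strip()
        if line.startswith(">"):  # a header line starts a new record
            if current:
                records.append("".join(current))
                current = []
        elif line:
            current.append(line)
if current:
    records.append("".join(current))

with open("seqs.txt", "w") as out:
    out.write("\n".join(records) + "\n")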

Next, refer to ./configs/example_*.py to create a config file for the GPT model.

Lastly, execute the following command:

python train.py \
    --config {path_to_your_config_file}

Train your own tokenizer

Usage example:

python train_BPE.py \
    --txt_file_path {path_to_training_file (txt, one sequence per line)} \
    --vocab_size 50256 \
    --new_tokenizer_path {directory_to_save_trained_tokenizer}
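
Once trained, the tokenizer can be loaded back through the transformers library, for example (a minimal sketch; ./my_tokenizer stands in for whatever directory you passed to --new_tokenizer_path):

from transformers import PreTrainedTokenizerFast

# Load the trained BPE tokenizer directly from its tokenizer.json file and tokenize a test sequence.
tokenizer = PreTrainedTokenizerFast(tokenizer_file="./my_tokenizer/tokenizer.json")
print(tokenizer.tokenize("AUGGCUACGUAGCUAGC"))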

License

The source code is licensed under the MIT License. See LICENSE.
