Joint NT-ESM2 DNA-Protein Models
This repository contains jointly trained Nucleotide Transformer (NT) and ESM2 models for DNA-protein sequence analysis.
Model Components
DNA Model (dna/)
- Type: Nucleotide Transformer for DNA sequences
- Context: 4096 tokens
- Training: Transcript-specific coding sequences
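Sequences longer than the 4096-token context have to be shortened or split before they reach the DNA model. The sketch below shows one generic way to cut a long sequence into overlapping windows. The helper name `split_windows` and its `window` and `overlap` parameters are hypothetical, and the window is measured in nucleotides, not tokens: how many nucleotides fit into 4096 tokens depends on the tokenizer, which this README does not specify.

```python
def split_windows(seq, window, overlap=0):
    """Split seq into windows of at most `window` characters.

    Consecutive windows share `overlap` characters. This is a generic
    illustration, not part of the released models.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    step = window - overlap
    windows = []
    for start in range(0, len(seq), step):
        windows.append(seq[start:start + window])
        # Stop once a window has reached the end of the sequence.
        if start + window >= len(seq):
            break
    return windows


# Toy example with a 10-nucleotide sequence.
print(split_windows("ACGTACGTAC", window=4, overlap=2))
```

Each window can then be tokenized and embedded separately. Whether windows should respect codon boundaries is a design choice left open here.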
Protein Model (protein/)
- Type: ESM2 for protein sequences
- Variant: Large model
- Training: Corresponding protein sequences
Usage
from transformers import AutoModel, AutoTokenizer
# Load DNA model
dna_model = AutoModel.from_pretrained("vsubasri/joint-nt-esm2-transcript-coding", subfolder="dna")
dna_tokenizer = AutoTokenizer.from_pretrained("vsubasri/joint-nt-esm2-transcript-coding", subfolder="dna")
# Load protein model
protein_model = AutoModel.from_pretrained("vsubasri/joint-nt-esm2-transcript-coding", subfolder="protein")
protein_tokenizer = AutoTokenizer.from_pretrained("vsubasri/joint-nt-esm2-transcript-coding", subfolder="protein")
# Example joint usage
dna_seq = "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA"
# Protein encoded by dna_seq (standard genetic code, stop codon dropped)
protein_seq = "MKRISTTITTTITITTGNGAG"
dna_inputs = dna_tokenizer(dna_seq, return_tensors="pt")
protein_inputs = protein_tokenizer(protein_seq, return_tensors="pt")
dna_outputs = dna_model(**dna_inputs)
protein_outputs = protein_model(**protein_inputs)
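The model outputs above are per-token hidden states. A common way to get one vector per sequence is masked mean pooling, sketched below. This is an assumption about downstream use, not a documented part of these models: the helper name `mean_pool` is hypothetical, and the sketch assumes the outputs expose `last_hidden_state` as the standard transformers base models do.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over non-padding positions.

    last_hidden_state: (batch, seq_len, hidden) float tensor
    attention_mask:    (batch, seq_len) tensor of 0/1
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Guard against division by zero for fully masked rows.
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


# Dummy tensors standing in for real model outputs:
# one sequence, three positions, the last one padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
print(mean_pool(hidden, mask).shape)
```

With the variables from the usage example, this would be called as `mean_pool(dna_outputs.last_hidden_state, dna_inputs["attention_mask"])`, and likewise for the protein model. Special tokens added by the tokenizers are included in the average. The two models may have different hidden sizes, so the pooled vectors are not necessarily directly comparable.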
Training Details
- Joint Training: The DNA and protein models were trained together for cross-modal understanding
- Batch Size: 8
- Data: Transcript-specific coding sequences with corresponding proteins
- Architecture: The original NT and ESM2 architectures are retained
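When preparing paired inputs, it can help to check that a protein really is the translation of its coding sequence. The sketch below does this with the standard genetic code. It is an illustration only: the README does not describe how the training pairs were built, and the names `translate` and `is_matching_pair` are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    protein = []
    for i in range(0, len(cds), 3):
        aa = CODON_TABLE[cds[i:i + 3]]  # KeyError on ambiguous bases such as N
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def is_matching_pair(cds, protein):
    """Return True if `protein` equals the translation of `cds`."""
    return translate(cds) == protein.upper()


dna_seq = "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA"
print(translate(dna_seq))  # MKRISTTITTTITITTGNGAG
```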
Repository Structure
├── dna/                          # NT DNA model
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
├── protein/                      # ESM2 protein model
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── joint_config.json             # Joint model configuration
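After cloning or downloading the repository, the layout above can be checked with a few lines of standard-library Python. This is a minimal sketch: the helper name `missing_files` is hypothetical, and the file list is copied from the tree above.

```python
from pathlib import Path

# File names shared by the dna/ and protein/ subfolders.
MODEL_FILES = [
    "config.json",
    "model.safetensors",
    "tokenizer_config.json",
    "vocab.txt",
    "special_tokens_map.json",
]

EXPECTED_FILES = (
    [f"dna/{name}" for name in MODEL_FILES]
    + [f"protein/{name}" for name in MODEL_FILES]
    + ["joint_config.json"]
)


def missing_files(root):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


# An empty list means the local copy matches the documented structure.
print(missing_files("."))
```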
Citation
If you use these models, please cite the original Nucleotide Transformer (NT) and ESM2 papers and describe the joint training methodology.