# Joint NT-ESM2 DNA-Protein Models

This repository contains jointly trained Nucleotide Transformer (NT) and ESM2 models for DNA-protein sequence analysis.

## Model Components

### DNA Model (`dna/`)
- **Type**: Nucleotide Transformer for DNA sequences
- **Context**: 4096 tokens
- **Training**: Transcript-specific coding sequences

### Protein Model (`protein/`)
- **Type**: ESM2 for protein sequences  
- **Variant**: Large model
- **Training**: Corresponding protein sequences

## Usage

```python
from transformers import AutoModel, AutoTokenizer

# Load DNA model
dna_model = AutoModel.from_pretrained("vsubasri/joint-nt-esm2-transcript-coding", subfolder="dna")
dna_tokenizer = AutoTokenizer.from_pretrained("vsubasri/joint-nt-esm2-transcript-coding", subfolder="dna")

# Load protein model
protein_model = AutoModel.from_pretrained("vsubasri/joint-nt-esm2-transcript-coding", subfolder="protein")  
protein_tokenizer = AutoTokenizer.from_pretrained("vsubasri/joint-nt-esm2-transcript-coding", subfolder="protein")

# Example joint usage: a coding sequence and the protein it encodes
dna_seq = "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA"
protein_seq = "MKRISTTITTTITITTGNGAG"  # translation of dna_seq, stop codon excluded

dna_inputs = dna_tokenizer(dna_seq, return_tensors="pt")
protein_inputs = protein_tokenizer(protein_seq, return_tensors="pt")

dna_outputs = dna_model(**dna_inputs)
protein_outputs = protein_model(**protein_inputs)
```
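
The model outputs above are per-token hidden states. One common way to reduce them to a single vector per sequence is masked mean pooling. The sketch below is an assumption about how the outputs might be consumed, not a method documented by this repository: `mean_pool` is a hypothetical helper, and it assumes the outputs expose `last_hidden_state`, as base models loaded through `AutoModel` typically do. Stand-in random tensors are used so the snippet runs without downloading weights.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token embeddings over the sequence axis, ignoring padded positions.

    Hypothetical helper for illustration; not part of this repository.
    """
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden axis
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Guard against division by zero for a row that is entirely padding
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


# Stand-in tensors with the shape a model would return: (batch, seq_len, hidden)
hidden = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
pooled = mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 8])
```

With the objects from the usage example, the call would be `mean_pool(dna_outputs.last_hidden_state, dna_inputs["attention_mask"])`, and likewise for the protein model. The two embeddings may have different dimensions, so comparing them directly would likely need a projection that this README does not describe.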

## Training Details

- **Joint Training**: Models trained together for cross-modal understanding
- **Batch Size**: 8
- **Data**: Transcript-specific coding sequences with corresponding proteins
- **Architecture**: Maintained original NT and ESM2 architectures
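
Because the training data pairs each coding sequence with its corresponding protein, it may help to confirm that an input pair actually corresponds before running both models. The following is a minimal sketch assuming the standard genetic code (NCBI translation table 1). `translate_cds` is a hypothetical helper introduced here for illustration, not part of this repository or its training pipeline.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered over T, C, A, G
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate_cds(dna: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon.

    Hypothetical helper for illustration; not part of this repository.
    """
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    protein = []
    for i in range(0, len(dna), 3):
        aa = CODON_TABLE[dna[i:i + 3]]  # raises KeyError on non-ACGT characters
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


dna_seq = "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGA"
print(translate_cds(dna_seq))  # MKRISTTITTTITITTGNGAG
```

Comparing `translate_cds(dna_seq)` with the protein string passed to the protein tokenizer is a cheap way to catch mismatched pairs.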

## Repository Structure

```
├── dna/                    # NT DNA model
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
├── protein/                # ESM2 protein model
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── joint_config.json       # Joint model configuration
```
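
For a local copy of the repository (for example, one obtained with `git clone`), the layout above can be checked before loading anything. This is a minimal sketch using only the standard library. `EXPECTED_FILES` is transcribed from the tree above, and `missing_files` is a hypothetical helper, not something shipped with the repository.

```python
import tempfile
from pathlib import Path

# File layout transcribed from the repository tree above
EXPECTED_FILES = [
    f"{sub}/{name}"
    for sub in ("dna", "protein")
    for name in (
        "config.json",
        "model.safetensors",
        "tokenizer_config.json",
        "vocab.txt",
        "special_tokens_map.json",
    )
] + ["joint_config.json"]


def missing_files(root: str) -> list[str]:
    """Return the expected files that are absent under `root` (hypothetical helper)."""
    base = Path(root)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]


# Demonstration on an empty temporary directory: every expected file is reported
with tempfile.TemporaryDirectory() as tmp:
    print(len(missing_files(tmp)))  # 11
```

An empty list from `missing_files("path/to/local/copy")` indicates that both subfolders and `joint_config.json` are present. It does not validate the contents of the files.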

## Citation

If you use these models, please cite the original Nucleotide Transformer (NT) and ESM2 papers, along with the joint training methodology used for this repository.