---
|
library_name: transformers |
|
tags: |
|
- bulk RNA-seq |
|
- DNA methylation |
|
- biology |
|
- transcriptomics |
|
- epigenomics |
|
- multimodal |
|
--- |
|
|
|
# MOJO |
|
|
|
MOJO (MultiOmics JOint representation learning) learns joint representations of bulk RNA-seq and DNA methylation through bimodal masked language modeling, and is tailored to cancer-type classification and survival analysis on the TCGA dataset.
|
|
|
**Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI) |
|
|
|
### Model Sources |
|
|
|
<!-- Provide the basic links for the model. --> |
|
|
|
- **Repository:** [multiomics-open-research](https://github.com/instadeepai/multiomics-open-research)
- **Paper:** [Bimodal masked language modeling for bulk RNA-seq and DNA methylation representation learning](https://www.biorxiv.org/content/10.1101/2025.06.25.661237v1)
|
|
|
### How to use |
|
|
|
Until the next `transformers` release, the library must be installed from source with the command below in order to use the model. PyTorch is also required.
|
|
|
```bash
pip install --upgrade git+https://github.com/huggingface/transformers.git
pip install torch
```
|
|
|
## Other notes

We also provide the parameters for the JAX version of MOJO in `jax_params`.
|
|
|
A small snippet of code is provided below to run inference with the model using bulk RNA-seq and DNA methylation samples from the [TCGA](https://portal.gdc.cancer.gov/) dataset. |
|
|
|
```python
import numpy as np
import pandas as pd
from huggingface_hub import hf_hub_download
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/MOJO", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "InstaDeepAI/MOJO",
    trust_remote_code=True,
)

n_examples = 4
omic_dict = {}

for omic in ["rnaseq", "methylation"]:
    # Download the sample TCGA data shipped with the model repository.
    csv_path = hf_hub_download(
        repo_id="InstaDeepAI/MOJO",
        filename=f"data/tcga_{omic}_sample.csv",
        repo_type="model",
    )
    omic_array = (
        pd.read_csv(csv_path)
        .drop(["identifier", "cohort"], axis=1)
        .to_numpy()[:n_examples, :]
    )
    if omic == "rnaseq":
        # RNA-seq expression values are log-transformed before tokenization.
        omic_array = np.log10(1 + omic_array)
    assert omic_array.shape[1] == model.config.sequence_length
    omic_dict[omic] = omic_array

# Tokenize both modalities into fixed-length input ids.
omic_ids = {
    omic: tokens["input_ids"]
    for omic, tokens in tokenizer.batch_encode_plus(
        omic_dict, pad_to_fixed_length=True, return_tensors="pt"
    ).items()
}

# Mean-pool the transformer embeddings; these can be used for downstream tasks.
omic_mean_embeddings = model(omic_ids)["after_transformer_embedding"].mean(axis=1)
```
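The pooled embeddings can serve as features for downstream tasks such as cancer-type classification. The sketch below illustrates one simple option, a nearest-centroid classifier built with NumPy; the embeddings and labels here are random placeholders standing in for real MOJO outputs and TCGA cohort labels, and the shapes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: in practice, use omic_mean_embeddings from the snippet
# above as features and the "cohort" column (dropped during preprocessing)
# as labels.
embed_dim = 32
train_embeddings = rng.normal(size=(20, embed_dim))
train_labels = rng.integers(0, 3, size=20)  # e.g. 3 cancer types
test_embeddings = rng.normal(size=(4, embed_dim))

# Nearest-centroid classification: average the training embeddings per
# class, then assign each test sample to the closest centroid.
classes = np.unique(train_labels)
centroids = np.stack(
    [train_embeddings[train_labels == c].mean(axis=0) for c in classes]
)
distances = np.linalg.norm(
    test_embeddings[:, None, :] - centroids[None, :, :], axis=-1
)
predictions = classes[distances.argmin(axis=1)]
print(predictions.shape)  # one predicted class per test sample
```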
|
|
|
|
|
### Citing our work |
|
|
|
```bibtex
|
@article {G{\'e}lard2025.06.25.661237, |
|
author = {G{\'e}lard, Maxence and Benkirane, Hakim and Pierrot, Thomas and Richard, Guillaume and Courn{\`e}de, Paul-Henry}, |
|
title = {Bimodal masked language modeling for bulk RNA-seq and DNA methylation representation learning}, |
|
elocation-id = {2025.06.25.661237}, |
|
year = {2025}, |
|
doi = {10.1101/2025.06.25.661237}, |
|
publisher = {Cold Spring Harbor Laboratory}, |
|
URL = {https://www.biorxiv.org/content/early/2025/06/27/2025.06.25.661237}, |
|
journal = {bioRxiv} |
|
} |
|
``` |
|
|