---
|
library_name: transformers |
|
tags: |
|
- bulk RNA-seq |
|
- DNA methylation |
|
- biology |
|
- transcriptomics |
|
- epigenomics |
|
- multimodal |
|
--- |
|
|
|
# MOJO |
|
|
|
MOJO (MultiOmics JOint representation learning) learns joint representations of bulk RNA-seq and DNA methylation through bimodal masked language modeling, and is tailored to cancer-type classification and survival analysis on the TCGA dataset.
|
|
|
**Developed by:** [InstaDeep](https://huggingface.co/InstaDeepAI) |
|
|
|
### Model Sources |
|
|
|
<!-- Provide the basic links for the model. --> |
|
|
|
- **Repository:** [multiomics-open-research](https://github.com/instadeepai/multiomics-open-research)
- **Paper:** [Bimodal masked language modeling for bulk RNA-seq and DNA methylation representation learning](https://www.biorxiv.org/content/10.1101/2025.06.25.661237v1)
|
|
|
### How to use |
|
|
|
Until the next `transformers` release, the library must be installed from source with the command below in order to use the model. PyTorch is also required.
|
|
|
```bash
pip install --upgrade git+https://github.com/huggingface/transformers.git
pip install torch
```
|
|
|
## Other notes

We also provide the parameters for the JAX version of MOJO in `jax_params`.
|
|
|
A small snippet of code is provided below to run inference with the model using bulk RNA-seq and DNA methylation samples from the [TCGA](https://portal.gdc.cancer.gov/) dataset. |
|
|
|
```python
import numpy as np
import pandas as pd
from huggingface_hub import hf_hub_download
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/MOJO", trust_remote_code=True)
model = AutoModel.from_pretrained(
    "InstaDeepAI/MOJO",
    trust_remote_code=True,
)

n_examples = 4
omic_dict = {}

for omic in ["rnaseq", "methylation"]:
    # Download the sample TCGA data shipped with the model repository.
    csv_path = hf_hub_download(
        repo_id="InstaDeepAI/MOJO",
        filename=f"data/tcga_{omic}_sample.csv",
        repo_type="model",
    )
    omic_array = (
        pd.read_csv(csv_path)
        .drop(["identifier", "cohort"], axis=1)
        .to_numpy()[:n_examples, :]
    )
    if omic == "rnaseq":
        # RNA-seq expression values are log-transformed before tokenization.
        omic_array = np.log10(1 + omic_array)
    assert omic_array.shape[1] == model.config.sequence_length
    omic_dict[omic] = omic_array

# Tokenize both modalities into fixed-length input ids.
omic_ids = {
    omic: tokens["input_ids"]
    for omic, tokens in tokenizer.batch_encode_plus(
        omic_dict, pad_to_fixed_length=True, return_tensors="pt"
    ).items()
}

# Mean-pool the transformer embeddings; these can be used for downstream tasks.
omic_mean_embeddings = model(omic_ids)["after_transformer_embedding"].mean(axis=1)
```
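The pooled embeddings can serve as features for downstream tasks such as cancer-type classification. The sketch below illustrates one simple option, a nearest-centroid classifier built with NumPy; the embeddings and labels here are random placeholders standing in for real MOJO outputs and TCGA cohort labels, and the shapes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: in practice, use omic_mean_embeddings from the snippet
# above as features and the "cohort" column (dropped during preprocessing)
# as labels.
embed_dim = 32
train_embeddings = rng.normal(size=(20, embed_dim))
train_labels = rng.integers(0, 3, size=20)  # e.g. 3 cancer types
test_embeddings = rng.normal(size=(4, embed_dim))

# Nearest-centroid classification: average the training embeddings per
# class, then assign each test sample to the closest centroid.
classes = np.unique(train_labels)
centroids = np.stack(
    [train_embeddings[train_labels == c].mean(axis=0) for c in classes]
)
distances = np.linalg.norm(
    test_embeddings[:, None, :] - centroids[None, :, :], axis=-1
)
predictions = classes[distances.argmin(axis=1)]
print(predictions.shape)  # one predicted class per test sample
```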
|
|
|
|
|
### Citing our work |
|
|
|
```bibtex
|
@article {G{\'e}lard2025.06.25.661237, |
|
author = {G{\'e}lard, Maxence and Benkirane, Hakim and Pierrot, Thomas and Richard, Guillaume and Courn{\`e}de, Paul-Henry}, |
|
title = {Bimodal masked language modeling for bulk RNA-seq and DNA methylation representation learning}, |
|
elocation-id = {2025.06.25.661237}, |
|
year = {2025}, |
|
doi = {10.1101/2025.06.25.661237}, |
|
publisher = {Cold Spring Harbor Laboratory}, |
|
URL = {https://www.biorxiv.org/content/early/2025/06/27/2025.06.25.661237}, |
|
journal = {bioRxiv} |
|
} |
|
``` |
|
|