✨ ModalFormer: Multimodal Transformer for Low-Light Image Enhancement
Abstract
Low-light image enhancement (LLIE) is a fundamental yet challenging task due to the presence of noise, loss of detail, and poor contrast in images captured under insufficient lighting conditions. Recent methods often rely solely on pixel-level transformations of RGB images, neglecting the rich contextual information available from multiple visual modalities. In this paper, we present ModalFormer, the first large-scale multimodal framework for LLIE that fully exploits nine auxiliary modalities to achieve state-of-the-art performance. Our model comprises two main components: a Cross-modal Transformer (CM-T) designed to restore corrupted images while seamlessly integrating multimodal information, and multiple auxiliary subnetworks dedicated to multimodal feature reconstruction. Central to the CM-T is our novel Cross-modal Multi-headed Self-Attention mechanism (CM-MSA), which effectively fuses RGB data with modality-specific features, including deep feature embeddings, segmentation information, geometric cues, and color information, to generate information-rich hybrid attention maps. Extensive experiments on multiple benchmark datasets demonstrate ModalFormer's state-of-the-art performance in LLIE. Pre-trained models and results are made available at https://github.com/albrateanu/ModalFormer.
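To make the CM-MSA idea concrete, below is a minimal PyTorch sketch of one way such cross-modal attention can be structured: queries come from the RGB stream, while keys and values are derived from stacked auxiliary-modality features. All module and parameter names here are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of cross-modal multi-headed self-attention.
# Illustrative assumption, not the paper's implementation: queries come
# from RGB features; keys/values come from concatenated modality features.
import torch
import torch.nn as nn

class CrossModalMSA(nn.Module):
    def __init__(self, dim: int, num_modalities: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)  # queries from the RGB stream
        self.to_kv = nn.Linear(dim * num_modalities, dim * 2, bias=False)  # keys/values from fused modalities
        self.proj = nn.Linear(dim, dim)

    def forward(self, rgb_feats: torch.Tensor, modal_feats: torch.Tensor) -> torch.Tensor:
        # rgb_feats:   (B, N, C)     tokens from the RGB branch
        # modal_feats: (B, N, C * M) per-token features from M auxiliary modalities
        B, N, C = rgb_feats.shape
        q = self.to_q(rgb_feats)
        k, v = self.to_kv(modal_feats).chunk(2, dim=-1)
        # Split into heads: (B, H, N, C // H).
        q, k, v = (t.view(B, N, self.num_heads, -1).transpose(1, 2) for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * self.scale  # hybrid attention map
        out = attn.softmax(dim=-1) @ v
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Usage: fuse RGB tokens with nine auxiliary modalities of matching dimension.
block = CrossModalMSA(dim=64, num_modalities=9)
rgb = torch.randn(2, 256, 64)
aux = torch.randn(2, 256, 64 * 9)
print(block(rgb, aux).shape)  # torch.Size([2, 256, 64])
```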
📢 Updates
29.07.2025
📄 The ModalFormer paper is now available! Check it out and explore our results and methodology.
28.07.2025
📦 Pre-trained models and test data published! arXiv paper version and HuggingFace demo coming soon, stay tuned!
⚙️ Setup and Testing
For ease of setup, use a Linux machine with CUDA-capable GPUs.
To set up the environment, first run the provided setup script:
./environment_setup.sh
# or
bash environment_setup.sh
Note: if you run into permission issues, make sure environment_setup.sh is executable:
chmod +x environment_setup.sh
Give the setup a couple of minutes to run.
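Once the script finishes, a quick sanity check can confirm that the environment sees your GPUs. The snippet below assumes PyTorch is among the installed dependencies (not stated explicitly in this README):

```python
# Quick sanity check that the environment has a working CUDA setup.
# Assumes PyTorch is one of the dependencies installed by the script.
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available:  {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU device:      {torch.cuda.get_device_name(0)}")
```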
Please check out the GitHub repository for more implementation details.
📖 Citation
@misc{brateanu2025modalformer,
  title={ModalFormer: Multimodal Transformer for Low-Light Image Enhancement},
  author={Alexandru Brateanu and Raul Balmez and Ciprian Orhei and Codruta Ancuti and Cosmin Ancuti},
  year={2025},
  eprint={2507.20388},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2507.20388},
}
🙏 Acknowledgements
We use this codebase as the foundation for our implementation.
📊 Evaluation results
All results are self-reported.

| Dataset | PSNR (dB) ↑ | SSIM ↑ |
|---|---|---|
| LOL-v1 | 27.970 | 0.897 |
| LOL-v2-Real | 29.330 | 0.915 |
| LOL-v2-Synthetic | 30.150 | 0.951 |
| SDSD-indoor | 31.370 | 0.917 |
| SDSD-outdoor | 31.730 | 0.904 |
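For context, the PSNR and SSIM figures above follow their standard definitions. Below is a minimal sketch of how such scores can be computed with scikit-image; the file names are placeholders, and this is not the repository's actual evaluation script:

```python
# Compute PSNR and SSIM between an enhanced image and its ground truth.
# File paths are placeholders; scikit-image provides both metrics.
from skimage import io
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

enhanced = io.imread("enhanced.png")       # model output, uint8 RGB
reference = io.imread("ground_truth.png")  # paired ground truth, uint8 RGB

psnr = peak_signal_noise_ratio(reference, enhanced, data_range=255)
ssim = structural_similarity(reference, enhanced, channel_axis=-1, data_range=255)
print(f"PSNR: {psnr:.3f} dB, SSIM: {ssim:.3f}")
```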