---
license: mit
tags:
  - low-light
  - low-light-image-enhancement
  - image-enhancement
  - image-restoration
  - computer-vision
  - low-light-enhance
  - multimodal
  - multimodal-learning
  - transformer
  - transformers
  - vision-transformer
  - vision-transformers
model-index:
  - name: ModalFormer
    results:
      - task:
          type: low-light-image-enhancement
        dataset:
          name: LOL-v1
          type: LOL-v1
        metrics:
          - type: PSNR
            value: 27.97
            name: PSNR
          - type: SSIM
            value: 0.897
            name: SSIM
      - task:
          type: low-light-image-enhancement
        dataset:
          name: LOL-v2-Real
          type: LOL-v2-Real
        metrics:
          - type: PSNR
            value: 29.33
            name: PSNR
          - type: SSIM
            value: 0.915
            name: SSIM
      - task:
          type: low-light-image-enhancement
        dataset:
          name: LOL-v2-Synthetic
          type: LOL-v2-Synthetic
        metrics:
          - type: PSNR
            value: 30.15
            name: PSNR
          - type: SSIM
            value: 0.951
            name: SSIM
      - task:
          type: low-light-image-enhancement
        dataset:
          name: SDSD-indoor
          type: SDSD-indoor
        metrics:
          - type: PSNR
            value: 31.37
            name: PSNR
          - type: SSIM
            value: 0.917
            name: SSIM
      - task:
          type: low-light-image-enhancement
        dataset:
          name: SDSD-outdoor
          type: SDSD-outdoor
        metrics:
          - type: PSNR
            value: 31.73
            name: PSNR
          - type: SSIM
            value: 0.904
            name: SSIM
      - task:
          type: low-light-image-enhancement
        dataset:
          name: MEF
          type: MEF
        metrics:
          - type: NIQE
            value: 3.44
            name: NIQE
      - task:
          type: low-light-image-enhancement
        dataset:
          name: LIME
          type: LIME
        metrics:
          - type: NIQE
            value: 3.82
            name: NIQE
      - task:
          type: low-light-image-enhancement
        dataset:
          name: DICM
          type: DICM
        metrics:
          - type: NIQE
            value: 3.64
            name: NIQE
      - task:
          type: low-light-image-enhancement
        dataset:
          name: NPE
          type: NPE
        metrics:
          - type: NIQE
            value: 3.55
            name: NIQE
pipeline_tag: image-to-image
---

# ✨ ModalFormer: Multimodal Transformer for Low-Light Image Enhancement

## Abstract

Low-light image enhancement (LLIE) is a fundamental yet challenging task due to the presence of noise, loss of detail, and poor contrast in images captured under insufficient lighting conditions. Recent methods often rely solely on pixel-level transformations of RGB images, neglecting the rich contextual information available from multiple visual modalities. In this paper, we present ModalFormer, the first large-scale multimodal framework for LLIE that fully exploits nine auxiliary modalities to achieve state-of-the-art performance. Our model comprises two main components: a Cross-modal Transformer (CM-T) designed to restore corrupted images while seamlessly integrating multimodal information, and multiple auxiliary subnetworks dedicated to multimodal feature reconstruction. Central to the CM-T is our novel Cross-modal Multi-headed Self-Attention mechanism (CM-MSA), which effectively fuses RGB data with modality-specific features—including deep feature embeddings, segmentation information, geometric cues, and color information—to generate information-rich hybrid attention maps. Extensive experiments on multiple benchmark datasets demonstrate ModalFormer's state-of-the-art performance in LLIE. Pre-trained models and results are made available at https://github.com/albrateanu/ModalFormer.
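
For intuition, below is a minimal, hypothetical PyTorch sketch of a cross-modal multi-headed self-attention block in the spirit of CM-MSA: queries come from the RGB stream while keys and values are derived from fused auxiliary-modality features, yielding hybrid attention maps. The class name, shapes, and fusion details are illustrative assumptions, not the official implementation; see the GitHub repository for the real CM-MSA.

```python
# Hypothetical sketch of cross-modal multi-headed self-attention.
# NOT the official CM-MSA implementation; names and fusion details are assumptions.
import torch
import torch.nn as nn

class CrossModalMSA(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        # Queries from the RGB stream; keys/values from fused auxiliary modalities.
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_kv = nn.Linear(dim, 2 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, rgb_tokens: torch.Tensor, modal_tokens: torch.Tensor) -> torch.Tensor:
        # rgb_tokens:   (B, N, C) tokens from the corrupted RGB image
        # modal_tokens: (B, N, C) tokens fused from the auxiliary modalities
        b, n, c = rgb_tokens.shape
        q = self.to_q(rgb_tokens)
        k, v = self.to_kv(modal_tokens).chunk(2, dim=-1)
        # Split channels into heads: (B, heads, N, head_dim)
        q, k, v = (t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        # Hybrid attention maps: RGB queries attend over multimodal keys.
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ v
        out = out.transpose(1, 2).reshape(b, n, c)
        return self.proj(out)

# Toy usage: batch of 2, 64 tokens, 128 channels.
rgb = torch.randn(2, 64, 128)
modal = torch.randn(2, 64, 128)
print(CrossModalMSA(128)(rgb, modal).shape)  # torch.Size([2, 64, 128])
```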

## 🆕 Updates

- 29.07.2025 🎉 The ModalFormer paper is now available! Check it out and explore our results and methodology.
- 28.07.2025 📦 Pre-trained models and test data published! arXiv paper version and HuggingFace demo coming soon, stay tuned!

## ⚙️ Setup and Testing

For ease of use, we recommend a Linux machine with CUDA-capable GPUs.

To set up the environment, first run the provided setup script:

```bash
./environment_setup.sh
# or
bash environment_setup.sh
```

Note: if the script fails to run, ensure `environment_setup.sh` is executable:

```bash
chmod +x environment_setup.sh
```

The setup takes a couple of minutes to complete.

Please check out the GitHub repository for more implementation details.
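
For orientation, here is a hypothetical minimal inference sketch. The module name, constructor, and checkpoint filename below are illustrative assumptions, not the repository's actual API; the real test scripts and entry points live in the GitHub repository.

```python
# Hypothetical inference sketch; names marked "assumed" are NOT verified
# against the repository and are for illustration only.
import torch
import torchvision.transforms.functional as TF
from PIL import Image

# from modalformer import ModalFormer  # assumed module/class name

def enhance(model: torch.nn.Module, path: str, device: str = "cuda") -> Image.Image:
    """Run a low-light image through the enhancement model."""
    low = TF.to_tensor(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        restored = model(low).clamp(0, 1)
    return TF.to_pil_image(restored.squeeze(0).cpu())

# model = ModalFormer()                                        # assumed constructor
# model.load_state_dict(torch.load("modalformer_lolv1.pth"))   # assumed checkpoint name
# model.eval().to("cuda")
# enhance(model, "low_light.png").save("enhanced.png")
```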

## 📚 Citation

```bibtex
@misc{brateanu2025modalformer,
      title={ModalFormer: Multimodal Transformer for Low-Light Image Enhancement},
      author={Alexandru Brateanu and Raul Balmez and Ciprian Orhei and Codruta Ancuti and Cosmin Ancuti},
      year={2025},
      eprint={2507.20388},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.20388},
}
```

## 🙏 Acknowledgements

We use this codebase as the foundation for our implementation.

Paper: https://arxiv.org/pdf/2507.20388