SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation
Abstract
In the field of synthetic aperture radar (SAR) remote sensing image interpretation, although vision-language models (VLMs) have made remarkable progress in natural language processing and image understanding, their application to professional domains remains limited by insufficient domain expertise. This paper proposes the first large-scale multimodal dialogue dataset for SAR images, named SARChat-2M, which contains approximately 2 million high-quality image-text pairs and encompasses diverse scenarios with detailed target annotations. The dataset not only supports key tasks such as visual understanding and object detection, but also offers a distinctive contribution: this study develops a vision-language dataset and benchmark for the SAR domain, enabling and evaluating VLMs' capabilities in SAR image interpretation and providing a paradigmatic framework for constructing multimodal datasets across remote sensing vertical domains. Experiments on 16 mainstream VLMs fully verify the effectiveness of the dataset and establish the first multi-task dialogue benchmark in the SAR field. The project will be released at https://github.com/JimmyMa99/SARChat, aiming to promote the in-depth development and wide application of SAR vision-language models.
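For readers who want to experiment once the image-text pairs are published on the Hugging Face Hub, a minimal loading sketch is given below; the repository ID, split name, and field names are illustrative assumptions and may differ from the released schema described on the project page.

```python
# Minimal sketch: streaming SARChat-2M-style image-text pairs with the `datasets` library.
# NOTE: the repository ID "JimmyMa99/SARChat-2M" and the field names ("task", "caption")
# are assumptions for illustration; consult the project page for the actual schema.
from datasets import load_dataset

dataset = load_dataset("JimmyMa99/SARChat-2M", split="train", streaming=True)

# Inspect a few records without downloading the full ~2M-pair corpus.
for i, example in enumerate(dataset):
    print(example.get("task"), "->", example.get("caption"))
    if i >= 4:
        break
```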
Community
SARChat-Bench-2M is the first large-scale multimodal dialogue dataset focusing on Synthetic Aperture Radar (SAR) imagery. It contains approximately 2 million high-quality SAR image-text pairs, supporting multiple tasks including scene classification, image captioning, visual question answering, and object localization. We conducted comprehensive evaluations on 16 state-of-the-art vision-language models (including Qwen2VL, InternVL2.5, and LLaVA), establishing the first multi-task benchmark in the SAR domain.
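As an illustration of how one of the evaluated models could be queried on a single SAR image, here is a minimal captioning sketch using Qwen2-VL through Hugging Face Transformers; the checkpoint, prompt wording, and image path are placeholders, and this is not the benchmark's own evaluation protocol.

```python
# Sketch: captioning one SAR image with Qwen2-VL via Transformers.
# The checkpoint, prompt, and image path below are placeholders, not the benchmark setup.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("sar_scene.png")  # placeholder SAR image
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the ships and harbor structures in this SAR image."},
    ],
}]

# Build the chat prompt, run generation, and decode only the newly generated tokens.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
caption = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(caption)
```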
Great!!!
This is the first vision-text dataset focused on SAR, which will facilitate everyone's research on large language models. It's really powerful.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models (2024)
- GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing (2025)
- Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment (2024)
- SAR Strikes Back: A New Hope for RSVQA (2025)
- Measuring and Mitigating Hallucinations in Vision-Language Dataset Generation for Remote Sensing (2025)
- ComprehendEdit: A Comprehensive Dataset and Evaluation Framework for Multimodal Knowledge Editing (2024)
- FiVL: A Framework for Improved Vision-Language Alignment (2024)