RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing
Fengxiang Wang1, Hongzhen Wang2 †, Yulin Wang2, Di Wang3, Mingshuo Chen4, Haiyan Zhao2,
Yangang Sun2, Shuo Wang2, Long Lan1, Wenjing Yang1 †, Jing Zhang3 †
1 National University of Defense Technology, China, 2 Tsinghua University, China,
3 Wuhan University, China, 4 Beijing University of Posts and Telecommunications, China
📚 Contents
- News
- Abstract
- Overview
- Evaluation Results
- Scaling Behavior
- Pretraining
- Checkpoints
- Citation
- Acknowledgement
🔥News
- [2025.03.13] The paper is available on arXiv.
📄Abstract
Recent advances in self-supervised learning for Vision Transformers (ViTs) have fueled breakthroughs in remote sensing (RS) foundation models. However, the quadratic complexity of self-attention poses a significant barrier to scalability, particularly for large models and high-resolution images. While the linear-complexity Mamba architecture offers a promising alternative, existing RS applications of Mamba remain limited to supervised tasks on small, domain-specific datasets. To address these challenges, we propose RoMA, a framework that enables scalable self-supervised pretraining of Mamba-based RS foundation models using large-scale, diverse, unlabeled data.
🔍Overview

The input image is first divided into patches, and high-value patches are selected for random rotation by the Adaptive Rotation Encoding Strategy. The patches are then tokenized and fed to the Mamba encoder. The encoded features are trained with autoregressive next-token prediction, and a multi-scale strategy computes losses at different scales for the gradient update. RoMA thereby adapts the Mamba architecture to remote sensing, turning its encoder into a robust feature extractor for diverse downstream tasks.
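To make the flow above concrete, here is a minimal, self-contained PyTorch sketch of one pretraining step. The variance-based "high-value" selection heuristic, the GRU stand-in for the Mamba encoder, and all shapes and sizes below are illustrative assumptions, not the released RoMA implementation.

```python
# Sketch of the pipeline described above: patchify -> rotate "high-value"
# patches -> tokenize -> causal encoder -> multi-scale next-token loss.
# The variance-based selection, the GRU stand-in for the Mamba encoder, and
# all sizes are illustrative assumptions, not the released RoMA code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def patchify(imgs, p=16):
    """(B, C, H, W) -> (B, N, C*p*p) non-overlapping patches."""
    B, C, H, W = imgs.shape
    x = imgs.unfold(2, p, p).unfold(3, p, p)            # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)


def adaptive_rotation(patches, p=16, ratio=0.25):
    """Toy stand-in for adaptive rotation encoding: rotate the top-`ratio`
    patches ranked by pixel variance (a proxy for 'high value')."""
    B, N, D = patches.shape
    C = D // (p * p)
    n_rot = max(1, int(ratio * N))
    idx = patches.var(dim=-1).topk(n_rot, dim=1).indices  # (B, n_rot)
    out = patches.clone()
    for b in range(B):
        sel = out[b, idx[b]].reshape(n_rot, C, p, p)
        out[b, idx[b]] = torch.rot90(sel, 1, dims=(-2, -1)).reshape(n_rot, -1)
    return out


class CausalEncoder(nn.Module):
    """Stand-in for the Mamba encoder; any causal sequence model fits here."""
    def __init__(self, in_dim, dim=256):
        super().__init__()
        self.embed = nn.Linear(in_dim, dim)               # patch tokenizer
        self.core = nn.GRU(dim, dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(dim, in_dim)                # predicts the next patch

    def forward(self, tokens):
        h, _ = self.core(self.embed(tokens))
        return self.head(h)


def multiscale_ar_loss(pred, target, scales=(1, 2, 4)):
    """Next-token prediction loss, with predictions and targets average-pooled
    along the token sequence at several granularities before the MSE."""
    loss = 0.0
    for s in scales:
        p = F.avg_pool1d(pred[:, :-1].transpose(1, 2), s).transpose(1, 2)
        t = F.avg_pool1d(target[:, 1:].transpose(1, 2), s).transpose(1, 2)
        loss = loss + F.mse_loss(p, t)
    return loss / len(scales)


imgs = torch.randn(2, 3, 224, 224)                        # dummy RS images
tokens = adaptive_rotation(patchify(imgs))                # (2, 196, 768)
model = CausalEncoder(in_dim=tokens.shape[-1])
loss = multiscale_ar_loss(model(tokens), tokens)
loss.backward()
```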
✅Evaluation Results
Scene classification is evaluated on AID and UCM (OA, TR=50%), change detection on OSCD (F1), and semantic segmentation on SpaceNetv1 (mF1).

| Methods | Publication | Backbone | Params | AID OA (TR=50%) | UCM OA (TR=50%) | OSCD F1 | SpaceNetv1 mF1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| **Natural Image pretraining** | | | | | | | |
| MoCo v3★ | ICCV'21 | ViT-B | 86M | 78.72 | 38.34 | - | - |
| DINO★ | ICCV'21 | ViT-B | 86M | 78.51 | 40.04 | - | - |
| MAE★ | CVPR'22 | ViT-B | 86M | 84.21 | 52.75 | - | - |
| SimMIM★ | CVPR'22 | ViT-B | 86M | 83.19 | 51.48 | - | - |
| LoMaR★ | arXiv'22 | ViT-B | 86M | 82.26 | 51.89 | - | - |
| MixMAE★ | CVPR'23 | Swin-B/W14 | 88M | 81.53 | 50.63 | - | - |
| ARM† | ICLR'25 | Mamba-B | 85M | 81.14 | 50.41 | 47.28 | 77.89 |
| **RS Image pretraining** | | | | | | | |
| SeCo★ | ICCV'21 | ResNet-50 | 25.6M | 78.26 | 47.45 | 47.67 | 77.09 |
| CACo★ | CVPR'23 | ResNet-50 | 25.6M | 77.81 | 40.53 | 52.11 | 77.94 |
| SatMAE★ | NIPS'22 | ViT-L | 307M | 55.10 | 34.28 | 52.76 | 78.07 |
| ScaleMAE★ | ICCV'23 | ViT-L | 307M | 48.46 | 28.19 | - | - |
| GFM★ | ICCV'23 | Swin-B | 88M | 80.58 | 49.73 | - | - |
| RVSA★ | TGRS'23 | ViT-B+RVSA | 86M | 84.06 | 50.86 | 50.28 | 79.56 |
| SatMAE++† | CVPR'24 | ViT-L | 307M | 85.98 | 55.72 | 53.10 | 79.21 |
| MA3E★ | ECCV'24 | ViT-B | 86M | 85.86 | 55.69 | - | - |
| RoMA | Baidu & Hugging Face | Mamba-B | 85M | 87.36 | 59.45 | 55.63 | 79.50 |
For the implementation of each downstream task, please check the corresponding folder for more details.
📈Scaling Behavior

Mamba shows a clear performance boost on downstream tasks as the pretraining data volume grows. We pretrain the Mamba-Base model with RoMA across a range of data scales and evaluate it on the downstream tasks. As illustrated in Figure 2, larger pretraining datasets lead to significant improvements. Mamba-based RSFMs exhibit no significant performance bottleneck as the pretraining data scales from 62.5K to 4M images, achieving data-learning capability on par with ViT-based RSFMs.

Mamba’s performance also improves with increasing model size. We pretrain four model variants (Tiny, Small, Base, and Large) following the configurations in our code. As shown in Figure 3, larger models consistently achieve superior results on downstream tasks. Although Mamba-Large surpasses Mamba-Base on the AID dataset, its gain remains limited, likely due to insufficient pretraining: 300 epochs on 4 million samples may not be adequate for a 297M-parameter model, and experimental constraints prevented us from extending pretraining to 800 epochs as in MAE. The OSCD and SpaceNet experiments are still running, and we will update the results once they finish. These results do not alter our key finding: Mamba-based RSFMs pretrained with RoMA continue to improve as model parameters scale.
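For readers who want to reproduce the model-size axis, the snippet below shows one way to organize such variants and sanity-check their parameter counts. The embedding dimensions and depths are generic Mamba-style sizes chosen purely for illustration; the actual Tiny/Small/Base/Large configurations are defined in the code of this repository.

```python
# Illustrative only: the dims/depths below are generic Mamba-style sizes,
# NOT the exact RoMA configurations (see the configs in this repository).
ROMA_VARIANTS = {
    "tiny":  dict(embed_dim=192,  depth=24),
    "small": dict(embed_dim=384,  depth=24),
    "base":  dict(embed_dim=768,  depth=24),   # ~85M parameters reported above
    "large": dict(embed_dim=1024, depth=48),   # ~297M parameters reported above
}


def params_in_millions(model):
    """Sanity-check a built variant against the parameter counts quoted above."""
    return sum(p.numel() for p in model.parameters()) / 1e6
```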
🚀Pretraining
For environment setup and pretraining instructions, please refer to RoMA/requirements.txt and RoMA/train.sh.
🎯Checkpoints
We provide our pretrained weights on Baidu & Hugging Face.
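A minimal sketch of pulling the Hugging Face checkpoint into a backbone before fine-tuning is shown below. The repository id, file name, and `build_mamba_backbone` constructor are placeholders, so substitute the real names from the Hugging Face page and this codebase.

```python
# Sketch only: the repo id, file name, and backbone constructor below are
# placeholders — substitute the real names from the Hugging Face page / this repo.
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(repo_id="<org>/RoMA", filename="<roma_mamba_base>.pth")
state = torch.load(ckpt_path, map_location="cpu")
state = state.get("model", state)            # MAE-style checkpoints often nest weights under "model"

backbone = build_mamba_backbone()            # hypothetical: the Mamba-B encoder from this repo
missing, unexpected = backbone.load_state_dict(state, strict=False)
print(f"loaded with {len(missing)} missing / {len(unexpected)} unexpected keys")
```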
🔗Citation
If you find RoMA helpful, please consider citing:
@article{wang2025roma,
title={RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing},
author={Fengxiang Wang and Hongzhen Wang and Yulin Wang and Di Wang and Mingshuo Chen and Haiyan Zhao and Yangang Sun and Shuo Wang and Long Lan and Wenjing Yang and Jing Zhang},
journal={arXiv preprint arXiv:2503.10392},
year={2025}
}
🤝Acknowledgement
- ARM: Autoregressive Pretraining with Mamba in Vision.
- MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining.
- RSP: An Empirical Study of Remote Sensing Pretraining.
- open-cd: An open source change detection toolbox based on a series of open source general vision task tools.
- mmcv, mmsegmentation: OpenMMLab Toolbox.