Zero-Shot Image Classification · TerraTorch · multispectral · earth-observation

blumenstiel committed 997a3f5 (parent: ccad809): Update README

Files changed (1): README.md (+74 −1)

README.md CHANGED
base_model:
  - laion/CLIP-ViT-B-16-laion2B-s34B-b88K
---
 
[![arXiv](https://img.shields.io/badge/arXiv-2503.15969-b31b1b?logo=arxiv)](https://arxiv.org/abs/2503.15969)
[![Code](https://img.shields.io/badge/GitHub-MS_CLIP-0F62FE?logo=github)](https://github.com/IBM/MS-CLIP)
 
# Llama3-MS-CLIP ViT-B/16
 
Llama3-MS-CLIP is the first vision-language model in the CLIP family that understands multispectral imagery.
It was trained on one million image-text pairs from the SSL4EO-S12-v1.1 dataset with automatically generated captions.
 
## Architecture

The CLIP model consists of two encoders, one for text and one for images. We extended the RGB patch embedding to multispectral input and initialized the weights of the additional input channels with zeros. During continual pre-training, the images and texts of each batch are encoded and compared: the contrastive loss increases the similarity of matching image-text pairs while decreasing the similarity of all other combinations in the batch.

![Architecture](assets/clip_architecture.png)
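 
To make the channel extension and the training objective concrete, here is a minimal PyTorch sketch. The number of multispectral bands, the position of the RGB bands, and the layer shapes are illustrative assumptions and are not taken from the MS-CLIP code base.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative assumptions: ViT-B/16 patch embedding (16x16 patches, 768-dim tokens),
# 10 multispectral bands with the RGB bands in the first three channel positions.
rgb_patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)    # pretrained RGB stem
ms_patch_embed = nn.Conv2d(10, 768, kernel_size=16, stride=16)    # extended multispectral stem

with torch.no_grad():
    ms_patch_embed.weight.zero_()                                  # additional channels start at zero
    ms_patch_embed.weight[:, :3] = rgb_patch_embed.weight          # copy the pretrained RGB filters
    ms_patch_embed.bias.copy_(rgb_patch_embed.bias)                # reuse the pretrained bias

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric loss: matching image-text pairs are pulled together,
    all other combinations in the batch are pushed apart."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature                # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)   # diagonal entries are the matches
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```

With this zero initialization, the extended model initially behaves like the pretrained RGB model (assuming the RGB bands are mapped to the copied channel positions) and gradually learns to use the additional bands during continual pre-training.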
 
## Evaluation

We evaluated Llama3-MS-CLIP on zero-shot classification and text-to-image retrieval, measured in accuracy (%) ↑ and mAP@100 (%) ↑, respectively.
The following figure compares our model with the OpenCLIP baselines and other EO vision-language models.
We applied a smoothed min-max scaling and annotated the lowest and highest scores.
Our multispectral CLIP model outperforms the RGB-based models on most benchmarks.

![Benchmarking](assets/benchmarking.png)
 
## Usage

You can use the model out of the box for zero-shot classification and text-to-image retrieval with Sentinel-2 L2A images.
 
### Setup

Clone the repository and set up a new virtual environment:

```shell
git clone https://github.com/IBM/MS-CLIP
cd MS-CLIP
python -m venv venv
source venv/bin/activate
pip install -e .
```
 
Run zero-shot classification with:

```shell
python inference.py --run-classification \
  --model-name "Llama3-MS-CLIP-Base" \
  --images "path/to/sentinel2_files.tif" \
  --class-names "class 1" "class 2" "class 3"
```
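 
Under the hood, zero-shot classification scores each image against text embeddings of the class names. The following sketch shows the general CLIP-style logic with hypothetical `encode_image`/`encode_text` callables and an assumed prompt template; it is not the actual MS-CLIP API.

```python
import torch.nn.functional as F

def zero_shot_classify(image, class_names, encode_image, encode_text):
    """Return class probabilities for one image given free-form class names.
    `encode_image` and `encode_text` are placeholders for the model's encoders."""
    prompts = [f"a satellite image of {name}" for name in class_names]  # assumed prompt template
    image_emb = F.normalize(encode_image(image.unsqueeze(0)), dim=-1)   # (1, d)
    text_emb = F.normalize(encode_text(prompts), dim=-1)                # (num_classes, d)
    logits = 100.0 * image_emb @ text_emb.t()                           # scaled cosine similarities
    return logits.softmax(dim=-1).squeeze(0)                            # one probability per class
```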
 
Run text-to-image retrieval with:

```shell
python inference.py --run-retrieval \
  --model-name "Llama3-MS-CLIP-Base" \
  --images "path/to/sentinel2_files.tif" \
  --query "Your query text" \
  --top-k 5  # Number of retrieved images
```
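 
Text-to-image retrieval works the other way around: one text query is scored against the embeddings of all candidate images and the top-k matches are returned. Again a hypothetical sketch with a placeholder `encode_text`, not the repository's API:

```python
import torch.nn.functional as F

def retrieve_top_k(query, image_embs, encode_text, k=5):
    """Return the indices of the k images most similar to the text query.
    `image_embs` holds precomputed image embeddings with shape (num_images, d)."""
    query_emb = F.normalize(encode_text([query]), dim=-1)   # (1, d)
    image_embs = F.normalize(image_embs, dim=-1)            # (num_images, d)
    scores = (query_emb @ image_embs.t()).squeeze(0)        # cosine similarity per image
    return scores.topk(k).indices                           # indices of the best matches
```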
 
More information is provided in the GitHub repository [MS-CLIP](https://github.com/IBM/MS-CLIP).
 
## Citation

Please cite the following paper if you use Llama3-MS-CLIP in your research:

```bibtex
@article{marimo2025beyond,
  title={Beyond the Visible: Multispectral Vision-Language Learning for Earth Observation},
  author={Marimo, Clive Tinashe and Blumenstiel, Benedikt and Nitsche, Maximilian and Jakubik, Johannes and Brunschwiler, Thomas},
  journal={arXiv preprint arXiv:2503.15969},
  year={2025}
}
```
 
## License

Built with Meta Llama 3.

While the model itself is not based on Llama 3 but on OpenCLIP ViT-B/16, it was trained on captions generated by a Llama 3-derived model (license: https://github.com/meta-llama/llama3/blob/main/LICENSE).