Update README.md
README.md

StreetCLIP is a robust foundation model for open-domain image geolocalization and other
geographic and climate-related tasks.

Trained on an original dataset of 1.1 million street-level urban and rural geo-tagged images, it achieves
state-of-the-art performance on multiple open-domain image geolocalization benchmarks in zero-shot,
outperforming supervised models trained on millions of images.

# Model Description

StreetCLIP is a model pretrained by deriving image captions synthetically from image class labels using
a domain-specific caption template. This allows StreetCLIP to transfer its generalized zero-shot learning
capabilities to a specific domain (i.e. the domain of image geolocalization).
StreetCLIP builds on OpenAI's pretrained large version of CLIP ViT, using 14x14 pixel
patches and images with a 336 pixel side length.
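
For illustration, deriving a caption from an image's class labels might look like the sketch below; the exact template wording used for StreetCLIP is defined in the forthcoming paper, so this phrasing is an assumption.

```python
# Hypothetical sketch of synthetic caption derivation from geographic class labels.
# The exact template wording used by StreetCLIP is specified in the paper; this is an assumption.
def synthesize_caption(city: str, region: str, country: str) -> str:
    """Turn an image's geographic class labels into a natural-language caption."""
    return f"A street-level photo taken in {city}, {region}, {country}."

print(synthesize_caption("Lisbon", "Lisbon District", "Portugal"))
# A street-level photo taken in Lisbon, Lisbon District, Portugal.
```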

## Model Details

- **Model type:** [CLIP](https://openai.com/blog/clip/)
- **Language:** English
- **License:** Creative Commons Attribution Non Commercial 4.0
- **Trained from model:** [openai/clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336)

## Model Sources

- **Paper:** Pre-print available soon ...

# Uses

StreetCLIP has a deep understanding of the visual features found in street-level urban and rural scenes
and knows how to relate these concepts to specific countries, regions, and cities. Given its training setup,
the following use cases are recommended for StreetCLIP.

## Direct Use

StreetCLIP can be used out of the box with zero-shot learning to infer the geolocation of images at the country, region,
or city level. Given that StreetCLIP was pretrained on a dataset of street-level urban and rural images,
the best performance can be expected on images from a similar distribution.

Broader direct use cases are outlined in the Recommendations section below.

## Downstream Use

StreetCLIP can be finetuned for any downstream application that requires geographic or street-level urban or rural
scene understanding.

## Out-of-Scope Use

Any use cases attempting to geolocate users' private images are out-of-scope and discouraged.

# Bias, Risks, and Limitations

StreetCLIP was deliberately not trained on social media images or images of identifiable people. As such, any use case
attempting to geolocalize users' private images is out-of-scope and strongly discouraged.

## Recommendations

We encourage the community to apply StreetCLIP to applications with significant social impact, of which there are many.
Examples include analyzing the built environment (e.g. building quality, type, or energy efficiency classification),
infrastructure (e.g. road quality, utility pole maintenance, identifying damage from natural disasters), and the natural
environment (e.g. image segmentation, vegetation mapping and classification, tracking deforestation).

## How to Get Started with the Model
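
A minimal zero-shot sketch is shown below; the checkpoint id `geolocal/StreetCLIP`, the input image path, and the candidate labels are illustrative assumptions.

```python
# Minimal zero-shot geolocalization sketch.
# The checkpoint id "geolocal/StreetCLIP" is an assumption about where the model is published.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("geolocal/StreetCLIP")
processor = CLIPProcessor.from_pretrained("geolocal/StreetCLIP")

image = Image.open("street_scene.jpg")  # any street-level photo
choices = ["France", "Germany", "Japan", "United States", "Brazil"]
captions = [f"A street-level photo taken in {c}." for c in choices]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)  # softmax over the candidate countries
print(choices[probs.argmax().item()])
```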

## Training Data

StreetCLIP was trained on an original, unreleased street-level dataset of 1.1 million real-world
urban and rural images. The data used to train the model comes from 101 countries, biased towards
western countries and excluding India and China.

## Preprocessing

Same preprocessing as [openai/clip-vit-large-patch14-336](https://huggingface.co/openai/clip-vit-large-patch14-336).

## Training Procedure

StreetCLIP is initialized with OpenAI's pretrained large version of CLIP ViT and then pretrained using the synthetic
caption domain-specific pretraining method described in the paper corresponding to this work. StreetCLIP was trained
for 3 epochs using an AdamW optimizer with a learning rate of 1e-6 on 3 NVIDIA A100 80GB GPUs, a batch size of 32,
and gradient accumulation of 12 steps.

StreetCLIP was trained with the goal of matching images in the batch
with the captions corresponding to the correct city, region, and country of the images' origins.
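
Schematically, this batch-matching objective is a standard CLIP-style contrastive loss; the sketch below is illustrative rather than the authors' exact training code.

```python
# Illustrative CLIP-style contrastive objective: image i should match caption i,
# where caption i encodes the correct city, region, and country of image i.
import torch
import torch.nn.functional as F

def contrastive_loss(image_embeds: torch.Tensor,
                     text_embeds: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # Cosine similarities between every image and every caption in the batch.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.T / temperature

    # Correct pairings lie on the diagonal of the similarity matrix.
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```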

# Evaluation

### Testing Data

StreetCLIP was evaluated on the following two open-domain image geolocalization benchmarks.

* [IM2GPS](http://graphics.cs.cmu.edu/projects/im2gps/)
* [IM2GPS3K](https://github.com/lugiavn/revisiting-im2gps)

### Metrics

The objective of the listed benchmark datasets is to predict the images' coordinates of origin with as
little deviation as possible. A common metric set forth in prior literature is Percentage at Kilometer (% @ KM).
The Percentage at Kilometer metric first calculates the distance in kilometers between the predicted coordinates
and the ground truth coordinates, and then reports the percentage of error distances that fall below a given kilometer threshold.
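
As a concrete reference, a minimal implementation of % @ KM might look as follows, assuming great-circle (haversine) distance between coordinate pairs.

```python
# Minimal sketch of the Percentage at Kilometer (% @ KM) metric.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius of 6371 km

def pct_at_km(preds, targets, threshold_km):
    """Percentage of predictions whose error distance falls below the threshold."""
    errors = [haversine_km(*p, *t) for p, t in zip(preds, targets)]
    return 100 * sum(e <= threshold_km for e in errors) / len(errors)

# Commonly reported thresholds in prior work: 1 km (street), 25 km (city),
# 200 km (region), 750 km (country), 2500 km (continent).
```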

## Results

- **Hardware Type:** 4 NVIDIA A100 GPUs
- **Hours used:** 12

# Citation

Preprint available soon ...