FunAILab
/

NeCo

English

computer_vision

valentinospariza committed on Mar 29

Commit

9fbca3a

verified ·

1 Parent(s): 2478b11

Update README.md

README.md +133 -3

@@ -1,3 +1,133 @@
----
-license: mit
----

+---
+license: mit
+language:
+- en
+base_model:
+- facebook/dinov2-base
+- facebook/dinov2-small
+tags:
+- computer_vision
+---
+# Near, far: Patch-ordering enhances vision foundation models' scene understanding
+Welcome to the Hugging Face repository for **NeCo**, an adapted vision encoder that captures fine-grained details and structural information essential for keypoint matching, semantic segmentation, and more. This repository hosts pretrained checkpoints for NeCo, enabling easy integration into your projects.
+The paper describing this work:
+**"Near, far: Patch-ordering enhances vision foundation models' scene understanding"**
+*[Valentinos Pariza](https://vpariza.github.io), [Mohammadreza Salehi](https://smsd75.github.io), [Gertjan J. Burghouts](https://gertjanburghouts.github.io), [Francesco Locatello](https://www.francescolocatello.com/), [Yuki M. Asano](https://yukimasano.github.io)*
+🌐 **[Project Page](https://vpariza.github.io/NeCo/)**
+⌨️ **[GitHub Repository](https://github.com/vpariza/NeCo)**
+📄 **[Read the Paper on arXiv](https://arxiv.org/abs/2408.11054)**
+## Model Details
+### Model Description
+NeCo introduces a new self-supervised learning technique for enhancing spatial representations in vision transformers. By leveraging Patch Neighbor Consistency, NeCo captures fine-grained details and structural information that are crucial for various downstream tasks, such as semantic segmentation.
+- **Model type:** Vision Encoder (Dino, Dinov2, ...)
+- **Language(s) (NLP):** Python
+- **License:** MIT
+- **Finetuned from model [optional]:** Dinov2, Dinov2R, Dino, ...
+## How to Get Started with the Model
+The models can be downloaded from our [NeCo Hugging Face repo](https://huggingface.co/FunAILab/NeCo/tree/main).
+To use NeCo models on downstream dense prediction tasks, install `timm` and `torch`, then load the checkpoint that matches your backbone as follows:
+#### Models after post-training dinov2 (following dinov2 architecture)
+##### NeCo on Dinov2
+```python
+import torch
+# change to dinov2_vitb14 for base as described in:
+#    https://github.com/facebookresearch/dinov2
+model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
+path_to_checkpoint = "<your path to downloaded ckpt>"
+# map_location='cpu' lets the checkpoint load on machines without a GPU
+state_dict = torch.load(path_to_checkpoint, map_location='cpu')
+model.load_state_dict(state_dict, strict=False)
+```
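Because the snippet above passes `strict=False`, keys that do not match are skipped silently. The following is a minimal sketch of how the loading result could be inspected. `TinyEncoder` and the fake `state_dict` are hypothetical stand-ins so the example runs without downloading anything; they are not the NeCo architecture or checkpoint.

```python
import torch
from torch import nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for the backbone; not the NeCo architecture."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 8)
        self.head = nn.Linear(8, 2)


model = TinyEncoder()

# Made-up checkpoint: it carries the `proj` weights, lacks the `head`
# weights, and contains one key the model does not know about.
state_dict = {
    "proj.weight": torch.zeros(8, 4),
    "proj.bias": torch.zeros(8),
    "extra.weight": torch.zeros(1),
}

# With strict=False, load_state_dict reports mismatches instead of raising.
result = model.load_state_dict(state_dict, strict=False)
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)
```

With a real checkpoint, a long list of missing keys would suggest that the checkpoint and the chosen architecture do not match.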
+##### NeCo on Dinov2 with Registers
+```python
+import torch
+# change to dinov2_vitb14_reg for base as described in:
+#    https://github.com/facebookresearch/dinov2
+model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg')
+path_to_checkpoint = "<your path to downloaded ckpt>"
+# map_location='cpu' lets the checkpoint load on machines without a GPU
+state_dict = torch.load(path_to_checkpoint, map_location='cpu')
+model.load_state_dict(state_dict, strict=False)
+```
+#### Models after post-training dino or similar (following dino architecture)
+##### NeCo on Dino
+```python
+import torch
+from timm.models.vision_transformer import vit_small_patch16_224, vit_base_patch16_224
+# Change to vit_base_patch8_224() if you want to use our larger model
+model = vit_small_patch16_224()
+path_to_checkpoint = "<your path to downloaded ckpt>"
+state_dict = torch.load(path_to_checkpoint, map_location='cpu')
+model.load_state_dict(state_dict, strict=False)
+```
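Dense prediction heads usually consume a spatial feature map rather than a flat token sequence. The sketch below shows one way to reshape ViT tokens into such a map. It assumes the encoder returns tokens of shape `(batch, prefix + patches, channels)` with the prefix tokens (for example a class token) first; that layout and the helper name `tokens_to_feature_map` are assumptions for illustration, not part of the NeCo code. Dummy tensors are used so nothing needs to be downloaded.

```python
import torch


def tokens_to_feature_map(tokens, grid_hw, num_prefix_tokens=1):
    """Drop prefix tokens and reshape (B, P + N, C) tokens to a (B, C, H, W) map."""
    h, w = grid_hw
    patch_tokens = tokens[:, num_prefix_tokens:, :]
    b, n, c = patch_tokens.shape
    if n != h * w:
        raise ValueError(f"expected {h * w} patch tokens, got {n}")
    # (B, N, C) -> (B, C, N) -> (B, C, H, W); patches are in row-major order.
    return patch_tokens.transpose(1, 2).reshape(b, c, h, w)


# Dummy tokens shaped like a ViT-S/16 output for a 224x224 input:
# 1 class token + 14 * 14 patch tokens, embedding dimension 384.
tokens = torch.randn(2, 1 + 14 * 14, 384)
feature_map = tokens_to_feature_map(tokens, grid_hw=(14, 14))
print(feature_map.shape)
```

For a backbone with register tokens, `num_prefix_tokens` would have to account for them as well, provided the model returns them in the same sequence.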
+**Note:** To load the model weights directly from a Hugging Face URL, run:
+```python
+import torch
+state_dict = torch.hub.load_state_dict_from_url("<url to the hugging face checkpoint>")
+```
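As a rough sketch of how such a URL could be assembled: files in a Hugging Face repository are served under a `resolve/<revision>/<filename>` path. The helper names below are hypothetical, and the file name is a placeholder that must be replaced with an actual checkpoint from the repository's file list.

```python
import torch


def hf_checkpoint_url(repo_id, filename, revision="main"):
    """Build a direct-download URL for a file in a Hugging Face model repo."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def load_remote_state_dict(url):
    """Download (and cache) a checkpoint, mapping its tensors to the CPU."""
    return torch.hub.load_state_dict_from_url(url, map_location="cpu")


# "<checkpoint file>.ckpt" is a placeholder, not a real file name.
url = hf_checkpoint_url("FunAILab/NeCo", "<checkpoint file>.ckpt")
print(url)
```

Calling `load_remote_state_dict(url)` with a real file name would return a state dict that can be passed to `model.load_state_dict(...)` as in the examples above.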
+## Training Details
+### Training Data
+* We have post-trained our models on the **COCO Dataset**.
+### Training Procedure
+Please see our repository and paper for more details.
+## Environmental Impact
+- **Hardware Type:** NVIDIA A100 GPU
+- **Hours used:** 18 (per model)
+- **Cloud Provider:** Helma, NHR FAU (Germany); Snellius (The Netherlands)
+- **Compute Region:** Europe/Germany & Netherlands
+## Citation
+**BibTeX:**
+```
+@inproceedings{
+   pariza2025near,
+   title={Near, far: Patch-ordering enhances vision foundation models' scene understanding},
+   author={Valentinos Pariza and Mohammadreza Salehi and Gertjan J. Burghouts and Francesco Locatello and Yuki M Asano},
+   booktitle={The Thirteenth International Conference on Learning Representations},
+   year={2025},
+   url={https://openreview.net/forum?id=Qro97zWC29}
+}
+```
+<!-- **APA:** -->
+<!-- [More Information Needed] -->
+<!-- ## Glossary [optional] -->
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+<!-- [More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed] -->