# Continuous 3D Perception Model with Persistent State
<div align="center">
<img src="./assets/factory-ezgif.com-video-speed.gif" alt="CUT3R" />
</div>
<hr>
<br>
Official implementation of <strong>Continuous 3D Perception Model with Persistent State</strong>, CVPR 2025 (Oral)
[Qianqian Wang\*](https://qianqianwang68.github.io/),
[Yifei Zhang\*](https://forrest-110.github.io/),
[Aleksander Holynski](https://holynski.org/),
[Alexei A. Efros](https://people.eecs.berkeley.edu/~efros/),
[Angjoo Kanazawa](https://people.eecs.berkeley.edu/~kanazawa/)
(\*: equal contribution)
<div style="line-height: 1;">
<a href="https://cut3r.github.io/" target="_blank" style="margin: 2px;">
<img alt="Website" src="https://img.shields.io/badge/Website-CUT3R-536af5?color=536af5&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
</a>
<a href="https://arxiv.org/pdf/2501.12387" target="_blank" style="margin: 2px;">
<img alt="Arxiv" src="https://img.shields.io/badge/Arxiv-CUT3R-red?logo=%23B31B1B" style="display: inline-block; vertical-align: middle;"/>
</a>
</div>

## Table of Contents
- [TODO](#todo)
- [Getting Started](#getting-started)
  - [Installation](#installation)
  - [Download Checkpoints](#download-checkpoints)
  - [Inference](#inference)
- [Datasets](#datasets)
- [Evaluation](#evaluation)
  - [Datasets](#datasets-1)
  - [Evaluation Scripts](#evaluation-scripts)
- [Training and Fine-tuning](#training-and-fine-tuning)
- [Acknowledgements](#acknowledgements)
- [Citation](#citation)
## TODO
- [x] Release multi-view stereo results on the DL3DV dataset.
- [ ] Online demo with webcam input
## Getting Started
### Installation
1. Clone CUT3R.
```bash
git clone https://github.com/CUT3R/CUT3R.git
cd CUT3R
```
2. Create the environment.
```bash
conda create -n cut3r python=3.11 cmake=3.14.0
conda activate cut3r
conda install pytorch torchvision pytorch-cuda=12.1 -c pytorch -c nvidia # use the CUDA version that matches your system
pip install -r requirements.txt
# work around a PyTorch DataLoader issue, see https://github.com/pytorch/pytorch/issues/99625
conda install 'llvm-openmp<16'
# for training logging
pip install git+https://github.com/nerfstudio-project/gsplat.git
# for evaluation
pip install evo
pip install open3d
```
3. Compile the CUDA kernels for RoPE (as in CroCo v2). An optional sanity check is shown after this step.
```bash
cd src/croco/models/curope/
python setup.py build_ext --inplace
cd ../../../../
```
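As an optional sanity check (a minimal sketch, not part of the official setup), you can confirm that PyTorch sees your GPU and that the RoPE extension was actually built. The exact name of the compiled file varies with your Python version and platform.
```bash
# Optional sanity check, run from the repository root.
python -c "import torch; print(torch.__version__, '| CUDA available:', torch.cuda.is_available())"
# The compiled extension should appear next to the curope sources
# (filename varies, e.g. curope.cpython-311-x86_64-linux-gnu.so on Linux).
ls src/croco/models/curope/*.so
```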
### Download Checkpoints
We currently provide checkpoints on Google Drive:
| Model name | Training resolutions | # Views | Head |
|-------------|----------------------|-------|------|
| [`cut3r_224_linear_4.pth`](https://drive.google.com/file/d/11dAgFkWHpaOHsR6iuitlB_v4NFFBrWjy/view?usp=drive_link) | 224x224 | 16 | Linear |
| [`cut3r_512_dpt_4_64.pth`](https://drive.google.com/file/d/1Asz-ZB3FfpzZYwunhQvNPZEUA8XUNAYD/view?usp=drive_link) | 512x384, 512x336, 512x288, 512x256, 512x160, 384x512, 336x512, 288x512, 256x512, 160x512 | 4-64 | DPT |
> `cut3r_224_linear_4.pth` is our intermediate checkpoint and `cut3r_512_dpt_4_64.pth` is our final checkpoint.
To download the weights, run the following commands:
```bash
cd src
# for 224 linear ckpt
gdown --fuzzy https://drive.google.com/file/d/11dAgFkWHpaOHsR6iuitlB_v4NFFBrWjy/view?usp=drive_link
# for 512 dpt ckpt
gdown --fuzzy https://drive.google.com/file/d/1Asz-ZB3FfpzZYwunhQvNPZEUA8XUNAYD/view?usp=drive_link
cd ..
```
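As an optional, hedged check (assuming the filenames above and a recent PyTorch), you can verify that the downloads completed and that each file is a readable checkpoint archive:
```bash
# List the downloaded checkpoints and their sizes.
ls -lh src/cut3r_224_linear_4.pth src/cut3r_512_dpt_4_64.pth
# Try loading one on CPU; the top-level key layout is checkpoint-specific,
# so this only confirms the file can be read by torch.load.
python - <<'EOF'
import torch
ckpt = torch.load("src/cut3r_512_dpt_4_64.pth", map_location="cpu", weights_only=False)
print(type(ckpt), list(ckpt.keys())[:5] if isinstance(ckpt, dict) else "")
EOF
```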
### Inference
To run inference, use one of the following commands:
```bash
# the following script will run inference offline and visualize the output with viser on port 8080
python demo.py --model_path MODEL_PATH --seq_path SEQ_PATH --size SIZE --vis_threshold VIS_THRESHOLD --output_dir OUT_DIR # input can be a folder or a video
# Example:
# python demo.py --model_path src/cut3r_512_dpt_4_64.pth --size 512 \
# --seq_path examples/001 --vis_threshold 1.5 --output_dir tmp
#
# python demo.py --model_path src/cut3r_224_linear_4.pth --size 224 \
# --seq_path examples/001 --vis_threshold 1.5 --output_dir tmp
# the following script will run inference with global alignment and visualize the output with viser on port 8080
python demo_ga.py --model_path MODEL_PATH --seq_path SEQ_PATH --size SIZE --vis_threshold VIS_THRESHOLD --output_dir OUT_DIR
```
Output results will be saved to `output_dir`.
> Currently, we accelerate the feed-forward pass by processing inputs in parallel within the encoder, so memory consumption grows linearly with the number of frames.
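To process several sequences in one run, a minimal bash sketch (the `examples/` layout with one folder per sequence and the `tmp/` output root are assumptions; adjust to your data) could look like this:
```bash
# Run the 512 DPT model over every sequence folder under examples/,
# writing each result to its own subdirectory of tmp/.
# Note: demo.py also launches the viser visualization, so you may need to
# close it between sequences.
for seq in examples/*/; do
  name=$(basename "$seq")
  python demo.py --model_path src/cut3r_512_dpt_4_64.pth --size 512 \
    --seq_path "$seq" --vis_threshold 1.5 --output_dir "tmp/$name"
done
```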
## Datasets
Our training data comprises the 32 datasets listed below. Please download them from their official sources and refer to [preprocess.md](docs/preprocess.md) for processing scripts and more information about each dataset.
- [ARKitScenes](https://github.com/apple/ARKitScenes)
- [BlendedMVS](https://github.com/YoYo000/BlendedMVS)
- [CO3Dv2](https://github.com/facebookresearch/co3d)
- [MegaDepth](https://www.cs.cornell.edu/projects/megadepth/)
- [ScanNet++](https://kaldir.vc.in.tum.de/scannetpp/)
- [ScanNet](http://www.scan-net.org/ScanNet/)
- [Waymo Open Dataset](https://github.com/waymo-research/waymo-open-dataset)
- [WildRGB-D](https://github.com/wildrgbd/wildrgbd/)
- [Map-free](https://research.nianticlabs.com/mapfree-reloc-benchmark/dataset)
- [TartanAir](https://theairlab.org/tartanair-dataset/)
- [UnrealStereo4K](https://github.com/fabiotosi92/SMD-Nets)
- [Virtual KITTI 2](https://europe.naverlabs.com/research/computer-vision/proxy-virtual-worlds-vkitti-2/)
- [3D Ken Burns](https://github.com/sniklaus/3d-ken-burns.git)
- [BEDLAM](https://bedlam.is.tue.mpg.de/)
- [COP3D](https://github.com/facebookresearch/cop3d)
- [DL3DV](https://github.com/DL3DV-10K/Dataset)
- [Dynamic Replica](https://github.com/facebookresearch/dynamic_stereo)
- [EDEN](https://lhoangan.github.io/eden/)
- [Hypersim](https://github.com/apple/ml-hypersim)
- [IRS](https://github.com/HKBU-HPML/IRS)
- [Matterport3D](https://niessner.github.io/Matterport/)
- [MVImgNet](https://github.com/GAP-LAB-CUHK-SZ/MVImgNet)
- [MVS-Synth](https://phuang17.github.io/DeepMVS/mvs-synth.html)
- [OmniObject3D](https://omniobject3d.github.io/)
- [PointOdyssey](https://pointodyssey.com/)
- [RealEstate10K](https://google.github.io/realestate10k/)
- [SmartPortraits](https://mobileroboticsskoltech.github.io/SmartPortraits/)
- [Spring](https://spring-benchmark.org/)
- [Synscapes](https://synscapes.on.liu.se/)
- [UASOL](https://osf.io/64532/)
- [UrbanSyn](https://www.urbansyn.org/)
- [HOI4D](https://hoi4d.github.io/)
## Evaluation
### Datasets
Please follow [MonST3R](https://github.com/Junyi42/monst3r/blob/main/data/evaluation_script.md) and [Spann3R](https://github.com/HengyiWang/spann3r/blob/main/docs/data_preprocess.md) to prepare **Sintel**, **Bonn**, **KITTI**, **NYU-v2**, **TUM-dynamics**, **ScanNet**, **7scenes** and **Neural-RGBD** datasets.
The datasets should be organized as follows:
```
data/
├── 7scenes
├── bonn
├── kitti
├── neural_rgbd
├── nyu-v2
├── scannetv2
├── sintel
└── tum
```
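A small optional check (directory names taken from the layout above) that everything is where the evaluation scripts expect it:
```bash
# Report any missing evaluation dataset directory.
for d in 7scenes bonn kitti neural_rgbd nyu-v2 scannetv2 sintel tum; do
  [ -d "data/$d" ] && echo "found   data/$d" || echo "MISSING data/$d"
done
```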
### Evaluation Scripts
Please refer to [eval.md](docs/eval.md) for more details.
## Training and Fine-tuning
Please refer to [train.md](docs/train.md) for more details.
## Acknowledgements
Our code is based on the following awesome repositories:
- [DUSt3R](https://github.com/naver/dust3r)
- [MonST3R](https://github.com/Junyi42/monst3r.git)
- [Spann3R](https://github.com/HengyiWang/spann3r.git)
- [Viser](https://github.com/nerfstudio-project/viser)
We thank the authors for releasing their code!
## Citation
If you find our work useful, please cite:
```bibtex
@article{wang2025continuous,
title={Continuous 3D Perception Model with Persistent State},
author={Wang, Qianqian and Zhang, Yifei and Holynski, Aleksander and Efros, Alexei A and Kanazawa, Angjoo},
journal={arXiv preprint arXiv:2501.12387},
year={2025}
}
```