roll-ai committed on
Commit b33ce09 · verified · 1 Parent(s): ded78ff

Update README.md

Files changed (1)
1. README.md +13 -142
README.md CHANGED
@@ -1,142 +1,13 @@
- # FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis (CogVideoX-based FloVD)<br>
-
- ![Teaser image 1](./assets/pages/teaser.png)
-
- [\[Project Page\]](https://jinwonjoon.github.io/flovd_site/)
- [\[arXiv\]](https://arxiv.org/abs/2502.08244/)
-
- **FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis**<br>
- Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, Sunghyun Cho<br>
- POSTECH, Microsoft Research Asia
- <br>
-
- ## Gallery
-
- ### FloVD-CogVideoX-5B
-
- <table border="0" style="width: 100%; text-align: left; margin-top: 20px;">
- <tr>
- <td>
- <video src="https://github.com/user-attachments/assets/a55d1c29-6682-417d-886c-695b1d1b61fd" width="100%" controls autoplay loop></video>
- </td>
- <td>
- <video src="https://github.com/user-attachments/assets/4def8617-063f-4e61-969a-fd0507dbdeec" width="100%" controls autoplay loop></video>
- </td>
- <td>
- <video src="https://github.com/user-attachments/assets/55745611-fea3-4f3f-bdd1-48b5f6c24f98" width="100%" controls autoplay loop></video>
- </td>
- <td>
- <video src="https://github.com/user-attachments/assets/97be3121-ae38-45f9-822a-e387cf262824" width="100%" controls autoplay loop></video>
- </td>
- </tr>
- </table>
-
- ## Project Updates
-
- - **News**: `2025/05/02`: We have updated the code for `FloVD-CogVideoX`. We will release the dataset preprocessing and training code soon.
-
- - **News**: `2025/02/26`: Our paper has been accepted to CVPR 2025.
-
-
- ## Quick Start
-
- ### Prompt Optimization
-
- As mentioned in [CogVideoX](https://github.com/THUDM/CogVideo), we recommend using long, detailed text prompts for better results. Our FloVD-CogVideoX model is trained on text captions extracted with [CogVLM2](https://github.com/THUDM/CogVLM2).
-
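For illustration, a caption in the long, detailed style that CogVLM2 tends to produce might look like the following (an invented example, not taken from the training data):

```
A slow, steady dolly shot of a sunlit kitchen in the early morning. A ceramic
kettle on the stove releases a thin column of steam, a cat stretches on the
windowsill, and warm light casts long shadows across the wooden floor as the
camera drifts gently forward.
```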
-
- ### Environment
-
- **Please make sure your Python version is between 3.10 and 3.12 (inclusive).**
-
- ```shell
- pip install -r requirements.txt
- ```
-
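If you prefer an isolated environment, a conventional setup might look like this (standard venv usage; nothing here is specific to FloVD beyond `requirements.txt`):

```shell
# Create and activate a virtual environment with a supported Python (3.10-3.12),
# then install the pinned dependencies.
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```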
- ### Optical flow normalization
- As described in the FloVD paper, we normalize optical flow following [Generative Image Dynamics](https://generative-dynamics.github.io/). For this, we use scale factors (s_x, s_y) of (60, 36) for both FVSM and OMSM.
-
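As a rough illustration of what per-axis flow scaling looks like, here is a minimal sketch. The exact scheme is defined by the Generative Image Dynamics normalization cited above, so treat the divide-and-clip below as an assumption rather than the official code:

```python
import torch

def normalize_flow(flow: torch.Tensor, s_x: float = 60.0, s_y: float = 36.0) -> torch.Tensor:
    """flow: (..., 2) tensor of (dx, dy) displacements in pixels.
    Assumed mapping: divide each axis by its scale factor, clip to [-1, 1]."""
    return (flow / flow.new_tensor([s_x, s_y])).clamp(-1.0, 1.0)

def denormalize_flow(flow_n: torch.Tensor, s_x: float = 60.0, s_y: float = 36.0) -> torch.Tensor:
    """Approximate inverse, ignoring whatever was lost to clipping."""
    return flow_n * flow_n.new_tensor([s_x, s_y])
```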
- ### Pre-trained checkpoints
- Download the FloVD-CogVideoX checkpoints, FVSM and OMSM (curated): <br>
- [\[Google Drive\]](https://drive.google.com/drive/folders/1Y7Fha8QKX6bg_0YEOxQf0M6uaPJ9SfgB?usp=sharing)
- In addition, we use an off-the-shelf depth estimation model (Depth Anything V2, metric depth).
- For this model, please refer to the link below. <br>
- [\[Depth_anything_v2_metric\]](https://github.com/DepthAnything/Depth-Anything-V2/tree/main/metric_depth)
- <br>
- Then, place these checkpoints in the `./ckpt` directory:
- ```shell
- # File tree
- ./ckpt/
- ├── FVSM
- │   └── FloVD_FVSM_Controlnet.pt
- ├── OMSM
- │   ├── selected_blocks.safetensors
- │   └── pytorch_lora_weights.safetensors
- └── others
-     └── depth_anything_v2_metric_hypersim_vitb.pth
- ```
-
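One possible way to fetch the Drive folder from the command line (this assumes the third-party `gdown` package; the folder ID is taken from the link above, and the Depth Anything V2 checkpoint must still be downloaded from its own repository):

```shell
pip install gdown
# Download the shared folder (FVSM/, OMSM/) into ./ckpt
gdown --folder "https://drive.google.com/drive/folders/1Y7Fha8QKX6bg_0YEOxQf0M6uaPJ9SfgB" -O ckpt
# Place the metric-depth checkpoint from the Depth Anything V2 repository:
mkdir -p ckpt/others
# mv /path/to/depth_anything_v2_metric_hypersim_vitb.pth ckpt/others/
```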
- ### Pre-defined camera trajectory
- We provide several example camera trajectories for quick inference.
- Refer to `./assets/cam_trajectory/` for a visualization of each camera trajectory.
- ```shell
- # File tree
- ./assets/
- ├── manual_poses
- │   └── ...
- ├── re10k_poses
- │   └── ...
- └── manual_poses_PanTiltSpin
-     └── ...
- ```
-
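The `re10k_poses` name suggests RealEstate10K-style pose files (one line per frame: timestamp, four intrinsics, two distortion terms, then a flattened 3x4 world-to-camera matrix). If that assumption holds, a minimal loader might look like:

```python
import numpy as np

def load_re10k_poses(path: str):
    """Parse an assumed RealEstate10K-style pose file.
    Per line: timestamp fx fy cx cy k1 k2 followed by 12 values of [R|t]."""
    intrinsics, extrinsics = [], []
    with open(path) as f:
        for line in f:
            vals = [float(v) for v in line.split()]
            intrinsics.append(vals[1:5])                            # fx, fy, cx, cy
            extrinsics.append(np.array(vals[7:19]).reshape(3, 4))   # world-to-camera [R|t]
    return np.array(intrinsics), np.array(extrinsics)
```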
- ### Inference Settings
- At inference time, we recommend using the same settings as at training time.
- + Number of frames: 49
-
- + FPS: 16
-
- + Flow scale factor: (s_x, s_y) = (60, 36)
-
- + `CONTROLNET_GUIDANCE_END`: 0.4 for better camera controllability, 0.1 for more natural object motion. This argument sets the fraction of denoising timesteps during which ControlNet features are injected into the pre-trained model; see the sketch after this list.
-
-
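A minimal sketch of how such a ratio typically gates ControlNet feature injection during sampling (hypothetical names and loop; this is not FloVD's actual sampler):

```python
num_inference_steps = 50
controlnet_guidance_end = 0.4  # inject ControlNet features for the first 40% of steps

for step in range(num_inference_steps):
    progress = step / num_inference_steps
    inject_controlnet = progress < controlnet_guidance_end
    # While inject_controlnet is True, the flow-conditioned ControlNet branch
    # adds residual features to the base model, enforcing the camera trajectory;
    # afterwards the base video diffusion model denoises on its own, which
    # trades camera adherence for more natural object motion.
```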
- ### Inference
-
- + [flovd_demo](inference/flovd_demo.py): Use this to synthesize videos with a desired camera trajectory and natural object motion. For a more detailed explanation of the inference code, including the meaning of common parameters, refer to [flovd_demo_script](inference/inference_scripts/flovd_demo.sh); an illustrative invocation is sketched after this list.
-
- + [flovd_fvsm_demo](inference/flovd_fvsm_demo.py): Use the FVSM model alone for more accurate camera control with little object motion. This code omits OMSM and only uses FVSM. (The script will be released soon.)
-
- + [flovd_ddp_demo](inference/flovd_ddp_demo.py): Use this to sample a large number of videos. Note that you need to prepare the dataset in advance following our dataset preprocessing pipeline. (The preprocessing pipeline will be released.)
-
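For orientation, a hypothetical invocation might look like the following. The flag names and file paths are illustrative assumptions, not the script's confirmed interface; the authoritative arguments are in [flovd_demo_script](inference/inference_scripts/flovd_demo.sh):

```shell
# Hypothetical flags; consult inference/inference_scripts/flovd_demo.sh
# for the real interface.
python inference/flovd_demo.py \
  --image_path ./assets/example_image.png \
  --prompt "A long, detailed CogVLM2-style caption of the scene." \
  --pose_path ./assets/manual_poses/zoom_in.txt \
  --controlnet_guidance_end 0.4
```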
- ### Tools
-
- This folder contains tools for camera trajectory generation, visualization, etc.
-
- + [generate_camparam](tools/generate_camparam.py): Generates manual camera parameters such as zoom-in, zoom-out, etc.; the basic idea is sketched after this list.
-
- + [visualize_trajectory](tools/visualize_trajectory.py): Visualizes a given camera trajectory.
-
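For intuition, generating a simple zoom-in trajectory mostly amounts to translating the camera along its forward axis over time. A toy sketch, not the tool's actual code or output format:

```python
import numpy as np

def zoom_in_poses(num_frames: int = 49, total_dist: float = 0.5) -> np.ndarray:
    """Return a (num_frames, 3, 4) stack of [R|t] poses that dolly the
    camera forward along +z. Illustrative only."""
    poses = []
    for i in range(num_frames):
        R = np.eye(3)                                    # pure dolly-in: no rotation
        t = np.array([0.0, 0.0, total_dist * i / (num_frames - 1)])
        poses.append(np.hstack([R, t[:, None]]))
    return np.stack(poses)
```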
- ## Citation
-
- 🌟 If you find our work helpful, please leave us a star and cite our paper.
-
- ```
- @article{jin2025flovd,
-     title={FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis},
-     author={Jin, Wonjoon and Dai, Qi and Luo, Chong and Baek, Seung-Hwan and Cho, Sunghyun},
-     journal={arXiv preprint arXiv:2502.08244},
-     year={2025}
- }
- ```
-
- ## Reference
- We thank [CogVideoX](https://github.com/THUDM/CogVideo) for open-sourcing their work.
-
- ## Model License
-
- The CogVideoX-5B model (Transformers module, including I2V and T2V) is released under
- the [CogVideoX LICENSE](https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE).
 
+ ---
+ title: FloVD
+ emoji: 👀
+ colorFrom: red
+ colorTo: blue
+ sdk: gradio
+ sdk_version: 5.35.0
+ app_file: app.py
+ pinned: false
+ license: mit
+ ---
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference