---
license: other
license_name: tencent-hunyuan-community
license_link: LICENSE
---
<!-- ## **HunyuanVideo** -->
<p align="center">
<img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo-I2V/refs/heads/main/assets/logo.png" height=100>
</p>
# **HunyuanVideo-I2V** 🌅
Following the successful open-sourcing of [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), we proudly present [HunyuanVideo-I2V](https://github.com/Tencent/HunyuanVideo-I2V), a new image-to-video generation framework to accelerate open-source community exploration!
This repo contains official PyTorch model definitions, pre-trained weights, and inference/sampling code. You can find more visualizations on our [project page](https://aivideo.hunyuan.tencent.com). We have also released the LoRA training code for customizable special effects, which can be used to create more interesting video effects.
> [**HunyuanVideo: A Systematic Framework For Large Video Generative Models**](https://arxiv.org/abs/2412.03603) <br>
## 🔥🔥🔥 News!!
* Mar 06, 2025: We release the inference code and model weights of HunyuanVideo-I2V. [Download](https://github.com/Tencent/HunyuanVideo-I2V/blob/main/ckpts/README.md).
## 📑 Open-source Plan
- HunyuanVideo-I2V (Image-to-Video Model)
- [x] Lora training scripts
- [x] Inference
- [x] Checkpoints
- [x] ComfyUI
- [ ] Multi-GPU sequence parallel inference (faster inference on more GPUs)
- [ ] Diffusers
- [ ] FP8 quantized weights
## Contents
- [**HunyuanVideo-I2V** 🌅](#hunyuanvideo-i2v-)
  - [🔥🔥🔥 News!!](#-news)
  - [📑 Open-source Plan](#-open-source-plan)
  - [Contents](#contents)
  - [**HunyuanVideo-I2V Overall Architecture**](#hunyuanvideo-i2v-overall-architecture)
  - [📜 Requirements](#-requirements)
  - [🛠️ Dependencies and Installation](#️-dependencies-and-installation)
    - [Installation Guide for Linux](#installation-guide-for-linux)
  - [🧱 Download Pretrained Models](#-download-pretrained-models)
  - [🔑 Single-gpu Inference](#-single-gpu-inference)
    - [Using Command Line](#using-command-line)
    - [More Configurations](#more-configurations)
  - [🎉 Customizable I2V LoRA effects training](#-customizable-i2v-lora-effects-training)
    - [Requirements](#requirements)
    - [Environment](#environment)
    - [Training data construction](#training-data-construction)
    - [Training](#training)
    - [Inference](#inference)
  - [🔗 BibTeX](#-bibtex)
  - [Acknowledgements](#acknowledgements)
---
## **HunyuanVideo-I2V Overall Architecture**
Leveraging the advanced video generation capabilities of [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), we have extended its application to image-to-video generation tasks. To achieve this, we employ an image latent concatenation technique to effectively reconstruct and incorporate reference image information into the video generation process.
Since we utilize a pre-trained Multimodal Large Language Model (MLLM) with a decoder-only architecture as the text encoder, we can significantly enhance the model's ability to comprehend the semantic content of the input image and to seamlessly integrate information from both the image and its associated caption. Specifically, the input image is processed by the MLLM to generate semantic image tokens. These tokens are then concatenated with the video latent tokens, enabling comprehensive full-attention computation across the combined data.
The overall architecture of our system is designed to maximize the synergy between image and text modalities, ensuring a robust and coherent generation of video content from static images. This integration not only improves the fidelity of the generated videos but also enhances the model's ability to interpret and utilize complex multimodal inputs. The overall architecture is as follows.
<p align="center">
<img src="https://raw.githubusercontent.com/Tencent/HunyuanVideo-I2V/refs/heads/main/assets/backbone.png" style="max-width: 45%; height: auto;">
</p>
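To make the two conditioning mechanisms above concrete, here is a minimal PyTorch sketch with toy dimensions. It is not the actual HunyuanVideo-I2V implementation; all shapes, token counts, and module choices below are illustrative assumptions.
```python
# Illustrative sketch only -- toy shapes, not the real HunyuanVideo-I2V code.
import torch
import torch.nn as nn

B, C, T, H, W = 1, 16, 8, 8, 8   # batch, latent channels, frames, height, width (toy sizes)
D = 64                           # hypothetical transformer hidden size

video_latents = torch.randn(B, C, T, H, W)   # noised video latents
image_latent = torch.randn(B, C, 1, H, W)    # clean latent of the reference image

# 1) Image latent concatenation: prepend the reference-image latent along the
#    temporal axis so the backbone is conditioned on the first frame.
conditioned = torch.cat([image_latent, video_latents], dim=2)   # (B, C, T+1, H, W)

# 2) Semantic token conditioning: MLLM-derived image tokens are concatenated
#    with the caption tokens, and the combined sequence attends jointly with
#    the video tokens (full attention).
img_tokens = torch.randn(B, 32, D)    # hypothetical MLLM semantic image tokens
text_tokens = torch.randn(B, 16, D)   # hypothetical caption tokens

vid_tokens = conditioned.flatten(2).transpose(1, 2)   # (B, (T+1)*H*W, C)
vid_tokens = nn.Linear(C, D)(vid_tokens)              # project to the hidden size

full_seq = torch.cat([vid_tokens, img_tokens, text_tokens], dim=1)

# Full self-attention over the combined video/image/text token sequence.
attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True)
out, _ = attn(full_seq, full_seq, full_seq)
print(out.shape)   # torch.Size([1, 624, 64]) -> 9*8*8 video tokens + 32 + 16 context tokens
```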
## 📜 Requirements
The following table shows the requirements for running the HunyuanVideo-I2V model (batch size = 1) to generate videos:
| Model | Resolution | GPU Peak Memory |
|:----------------:|:-----------:|:----------------:|
| HunyuanVideo-I2V | 720p | 60GB |
* An NVIDIA GPU with CUDA support is required.
* The model has been tested on a single 80GB GPU.
* **Minimum**: The minimum GPU memory required is 60GB for 720p.
* **Recommended**: We recommend using a GPU with 80GB of memory for better generation quality.
* Tested operating system: Linux
## 🛠️ Dependencies and Installation
Begin by cloning the repository:
```shell
git clone https://github.com/tencent/HunyuanVideo-I2V
cd HunyuanVideo-I2V
```
### Installation Guide for Linux
We recommend CUDA 12.4 or 11.8 for the manual installation.
Conda installation instructions are available [here](https://docs.anaconda.com/free/miniconda/index.html).
```shell
# 1. Create conda environment
conda create -n HunyuanVideo-I2V python==3.11.9
# 2. Activate the environment
conda activate HunyuanVideo-I2V
# 3. Install PyTorch and other dependencies using conda
# For CUDA 12.4
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia
# 4. Install pip dependencies
python -m pip install -r requirements.txt
# 5. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)
python -m pip install ninja
python -m pip install git+https://github.com/Dao-AILab/[email protected]
```
If you run into a floating point exception (core dump) on a specific GPU type, try the following solutions:
```shell
# Make sure you have installed CUDA 12.4, cuBLAS >= 12.4.5.8, and cuDNN >= 9.0 (or simply use our CUDA 12 Docker image).
pip install nvidia-cublas-cu12==12.4.5.8
export LD_LIBRARY_PATH=/opt/conda/lib/python3.8/site-packages/nvidia/cublas/lib/
```
Additionally, HunyuanVideo-I2V provides a pre-built Docker image. Use the following commands to pull and run it.
```shell
# For CUDA 12.4 (updated to avoid float point exception)
docker pull hunyuanvideo/hunyuanvideo-i2v:cuda_12
docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo-i2v --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo-i2v:cuda_12
```
## 🧱 Download Pretrained Models
The details of downloading pretrained models are described [here](https://github.com/Tencent/HunyuanVideo-I2V/blob/main/ckpts/README.md).
## 🔑 Single-gpu Inference
Similar to [HunyuanVideo](https://github.com/Tencent/HunyuanVideo), HunyuanVideo-I2V supports high-resolution video generation, with resolution up to 720p and video length up to 129 frames (5 seconds).
### Using Command Line
<!-- ### Run a Gradio Server
```bash
python3 gradio_server.py --flow-reverse
# set SERVER_NAME and SERVER_PORT manually
# SERVER_NAME=0.0.0.0 SERVER_PORT=8081 python3 gradio_server.py --flow-reverse
``` -->
```bash
cd HunyuanVideo-I2V
python3 sample_image2video.py \
--model HYVideo-T/2 \
--prompt "A man with short gray hair plays a red electric guitar." \
--i2v-mode \
--i2v-image-path ./assets/demo/i2v/imgs/0.png \
--i2v-resolution 720p \
--video-length 129 \
--infer-steps 50 \
--flow-reverse \
--flow-shift 17.0 \
--seed 0 \
--use-cpu-offload \
--save-path ./results
```
### More Configurations
We list some more useful configurations for easy usage:
| Argument | Default | Description |
|:----------------------:|:-----------------------------:|:------------------------------------------------------------:|
| `--prompt` | None | The text prompt for video generation. |
| `--model` | HYVideo-T/2-cfgdistill | Use HYVideo-T/2 for I2V; HYVideo-T/2-cfgdistill is used for T2V mode. |
| `--i2v-mode` | False | Whether to enable I2V mode. |
| `--i2v-image-path` | ./assets/demo/i2v/imgs/0.png | The reference image for video generation. |
| `--i2v-resolution` | 720p | The resolution of the generated video. |
| `--video-length` | 129 | The length of the generated video in frames. |
| `--infer-steps` | 50 | The number of sampling steps. |
| `--flow-shift` | 7.0 | Shift factor for the flow matching scheduler. |
| `--flow-reverse` | False | If set, learning/sampling proceeds from t=1 to t=0. |
| `--seed` | None | The random seed for video generation; if None, a random seed is initialized. |
| `--use-cpu-offload` | False | Use CPU offloading for model loading to save memory; necessary for high-resolution video generation. |
| `--save-path` | ./results | Path to save the generated video. |
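To run the command above over several reference images, a small wrapper can loop over (image, prompt) pairs and invoke `sample_image2video.py` with the flags listed in the table. This is a hypothetical convenience sketch, not part of the repository.
```python
# Hypothetical batch wrapper around sample_image2video.py -- not part of the repo.
import subprocess

jobs = [
    # (reference image, prompt) pairs -- replace with your own data.
    ("./assets/demo/i2v/imgs/0.png",
     "A man with short gray hair plays a red electric guitar."),
]

for image_path, prompt in jobs:
    subprocess.run(
        [
            "python3", "sample_image2video.py",
            "--model", "HYVideo-T/2",
            "--prompt", prompt,
            "--i2v-mode",
            "--i2v-image-path", image_path,
            "--i2v-resolution", "720p",
            "--video-length", "129",
            "--infer-steps", "50",
            "--flow-reverse",
            "--flow-shift", "17.0",
            "--seed", "0",
            "--use-cpu-offload",
            "--save-path", "./results",
        ],
        check=True,  # stop the batch if one generation fails
    )
```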
## 🎉 Customizable I2V LoRA effects training
### Requirements
The following table shows the requirements for training the HunyuanVideo-I2V LoRA model (batch size = 1):
| Model | Resolution | GPU Peak Memory |
|:----------------:|:----------:|:---------------:|
| HunyuanVideo-I2V | 360p | 79GB |
* An NVIDIA GPU with CUDA support is required.
* The model has been tested on a single 80GB GPU.
* **Minimum**: The minimum GPU memory required is 79GB for 360p.
* **Recommended**: We recommend using a GPU with 80GB of memory for better generation quality.
* Tested operating system: Linux
* Note: You can train with 360p data and directly run inference at 720p.
### Environment
```
pip install -r requirements.txt
```
### Training data construction
Prompt description: the trigger word is written directly into the video caption. It is recommended to use a phrase or a short sentence.
For example, for an AI hair-growth effect, the trigger could be `rapid_hair_growth, The hair of the characters in the video is growing rapidly.` followed by the original prompt.
Once you have the training video and prompt pairs, refer to [here](https://github.com/Tencent/HunyuanVideo-I2V/blob/main/hyvideo/hyvae_extract/README.md) for training data construction.
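As a concrete illustration of the trigger-word convention, the sketch below prepends a trigger phrase to a list of original captions. The `captions.txt` file layout is a hypothetical assumption used only for illustration; the linked README defines the actual data-construction format.
```python
# Hypothetical helper: prepend a LoRA trigger phrase to existing captions.
# The captions.txt layout (one caption per line) is an assumption for illustration;
# see hyvideo/hyvae_extract/README.md for the actual data-construction format.
from pathlib import Path

trigger = "rapid_hair_growth, The hair of the characters in the video is growing rapidly."

src = Path("captions.txt")                   # one original caption per line (assumed)
dst = Path("captions_with_trigger.txt")

captions = src.read_text(encoding="utf-8").splitlines()
dst.write_text(
    "\n".join(f"{trigger} {caption}" for caption in captions) + "\n",
    encoding="utf-8",
)
print(f"Wrote {len(captions)} trigger-prefixed captions to {dst}")
```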
### Training
```
sh scripts/run_train_image2video_lora.sh
```
We list some training-specific configurations for easy usage:
| Argument | Default | Description |
|:----------------:|:-------------------------------------------------------------:|:-----------------------------------------------------------:|
| `SAVE_BASE` | . | Root path for saving experimental results. |
| `EXP_NAME` | i2v_lora | Path suffix for saving experimental results. |
| `DATA_JSONS_DIR` | ./assets/demo/i2v_lora/train_dataset/processed_data/json_path | Data jsons dir generated by hyvideo/hyvae_extract/start.sh. |
| `CHIEF_IP` | 127.0.0.1 | Master node IP of the machine. |
After training, you can find `pytorch_lora_kohaya_weights.safetensors` under `{SAVE_BASE}/log_EXP/*_{EXP_NAME}/checkpoints/global_step{*}/` and pass it to `--lora-path` for inference.
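Since the experiment folder name and the global step are generated at training time, a small helper (a sketch that assumes the path pattern above) can locate the newest LoRA weights to pass to `--lora-path`:
```python
# Sketch: find the newest LoRA checkpoint matching the path pattern above.
from pathlib import Path

SAVE_BASE = "."          # same value as SAVE_BASE in run_train_image2video_lora.sh
EXP_NAME = "i2v_lora"    # same value as EXP_NAME

pattern = f"log_EXP/*_{EXP_NAME}/checkpoints/global_step*/pytorch_lora_kohaya_weights.safetensors"
checkpoints = sorted(Path(SAVE_BASE).glob(pattern), key=lambda p: p.stat().st_mtime)

if checkpoints:
    print(f"--lora-path {checkpoints[-1]}")   # newest checkpoint by modification time
else:
    print("No LoRA checkpoint found yet.")
```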
### Inference
```bash
python3 sample_image2video.py \
--model HYVideo-T/2 \
--prompt "Two people hugged tightly, In the video, two people are standing apart from each other. They then move closer to each other and begin to hug tightly. The hug is very affectionate, with the two people holding each other tightly and looking into each other's eyes. The interaction is very emotional and heartwarming, with the two people expressing their love and affection for each other." \
--i2v-mode \
--i2v-image-path ./assets/demo/i2v_lora/imgs/embrace.png \
--i2v-resolution 720p \
--infer-steps 50 \
--video-length 129 \
--flow-reverse \
--flow-shift 5.0 \
--seed 0 \
--use-cpu-offload \
--save-path ./results \
--use-lora \
--lora-scale 1.0 \
--lora-path ./ckpts/hunyuan-video-i2v-720p/lora/embrace_kohaya_weights.safetensors
```
We list some LoRA-specific configurations for easy usage:
| Argument | Default | Description |
|:-------------------:|:-------:|:----------------------------:|
| `--use-lora` | False | Whether to enable LoRA mode. |
| `--lora-scale` | 1.0 | Fusion scale for the LoRA model. |
| `--lora-path` | "" | Weight path for the LoRA model. |
## 🔗 BibTeX
If you find [HunyuanVideo](https://arxiv.org/abs/2412.03603) useful for your research and applications, please cite using this BibTeX:
```BibTeX
@misc{kong2024hunyuanvideo,
title={HunyuanVideo: A Systematic Framework For Large Video Generative Models},
author={Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Junkun Yuan, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, Weiyan Wang, Wenqing Yu, Xinchi Deng, Yang Li, Yanxin Long, Yi Chen, Yutao Cui, Yuanbo Peng, Zhentao Yu, Zhiyu He, Zhiyong Xu, Zixiang Zhou, Zunnan Xu, Yangyu Tao, Qinglin Lu, Songtao Liu, Dax Zhou, Hongfa Wang, Yong Yang, Di Wang, Yuhong Liu, and Jie Jiang, along with Caesar Zhong},
year={2024},
eprint={2412.03603},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.03603},
}
```
## Acknowledgements
We would like to thank the contributors to the [SD3](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [FLUX](https://github.com/black-forest-labs/flux), [Llama](https://github.com/meta-llama/llama), [LLaVA](https://github.com/haotian-liu/LLaVA), [Xtuner](https://github.com/InternLM/xtuner), [diffusers](https://github.com/huggingface/diffusers) and [HuggingFace](https://huggingface.co) repositories, for their open research and exploration.
We also thank the Tencent Hunyuan Multimodal team for their help with the text encoder.
<!-- ## Github Star History
<a href="https://star-history.com/#Tencent/HunyuanVideo&Date">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=Tencent/HunyuanVideo&type=Date&theme=dark" />
<source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=Tencent/HunyuanVideo&type=Date" />
<img alt="Star History Chart" src="https://api.star-history.com/svg?repos=Tencent/HunyuanVideo&type=Date" />
</picture>
</a> -->