Image-to-Video
noaltian committed · verified
Commit 3914f20 · Parent(s): 5092807

Update README.md

Files changed (1)
  1. README.md +105 -37

README.md CHANGED
@@ -25,51 +25,19 @@ This repo contains official PyTorch model definitions, pre-trained weights and in
 
 
 ## 🔥🔥🔥 News!!
 * Mar 07, 2025: 🔥 We have fixed the bug in our open-source version that caused ID changes. Please try the new model weights of [HunyuanVideo-I2V](https://huggingface.co/tencent/HunyuanVideo-I2V) to ensure full visual consistency in the first frame and produce higher quality videos.
 * Mar 06, 2025: 👋 We release the inference code and model weights of HunyuanVideo-I2V. [Download](https://github.com/Tencent/HunyuanVideo-I2V/blob/main/ckpts/README.md).
 
 
- <!-- ### First Frame Consistency Demo
- | Reference Image | Generated Video |
- |:----------------:|:----------------:|
- | <img src="https://github.com/user-attachments/assets/83e7a097-ffca-40db-9c72-be01d866aa7d" width="80%"> | <video src="https://github.com/user-attachments/assets/f81d2c88-bb1a-43f8-b40f-1ccc20774563" width="100%"> </video> |
- | <img src="https://github.com/user-attachments/assets/c385a11f-60c7-4919-b0f1-bc5e715f673c" width="80%"> | <video src="https://github.com/user-attachments/assets/0c29ede9-0481-4d40-9c67-a4b6267fdc2d" width="100%"> </video> |
- | <img src="https://github.com/user-attachments/assets/5763f5eb-0be5-4b36-866a-5199e31c5802" width="95%"> | <video src="https://github.com/user-attachments/assets/a8da0a1b-ba7d-45a4-a901-5d213ceaf50e" width="100%"> </video> |
- -->
- <!-- ### Customizable I2V LoRA Demo
-
- | I2V LoRA Effect | Reference Image | Generated Video |
- |:---------------:|:--------------------------------:|:----------------:|
- | Hair growth | <img src="./assets/demo/i2v_lora/imgs/hair_growth.png" width="40%"> | <video src="https://github.com/user-attachments/assets/06b998ae-bbde-4c1f-96cb-a25a9197d5cb" width="100%"> </video> |
- | Embrace | <img src="./assets/demo/i2v_lora/imgs/embrace.png" width="40%"> | <video src="https://github.com/user-attachments/assets/f8c99eb1-2a43-489a-ba02-6bd50a6dd260" width="100%"> </video> |
- <!-- | Hair growth | <img src="./assets/demo/i2v_lora/imgs/hair_growth.png" width="40%"> | <video src="https://github.com/user-attachments/assets/06b998ae-bbde-4c1f-96cb-a25a9197d5cb" width="100%" poster="./assets/demo/i2v_lora/imgs/hair_growth.png"> </video> |
- | Embrace | <img src="./assets/demo/i2v_lora/imgs/embrace.png" width="40%"> | <video src="https://github.com/user-attachments/assets/f8c99eb1-2a43-489a-ba02-6bd50a6dd260" width="100%" poster="./assets/demo/i2v_lora/imgs/hair_growth.png"> </video> | -->
-
- <!-- ## 🧩 Community Contributions -->
-
- <!-- If you develop/use HunyuanVideo-I2V in your projects, welcome to let us know. -->
-
- <!-- - ComfyUI-Kijai (FP8 Inference, V2V and IP2V Generation): [ComfyUI-HunyuanVideoWrapper](https://github.com/kijai/ComfyUI-HunyuanVideoWrapper) by [Kijai](https://github.com/kijai) -->
- <!-- - ComfyUI-Native (Native Support): [ComfyUI-HunyuanVideo](https://comfyanonymous.github.io/ComfyUI_examples/hunyuan_video/) by [ComfyUI Official](https://github.com/comfyanonymous/ComfyUI) -->
-
- <!-- - FastVideo (Consistency Distilled Model and Sliding Tile Attention): [FastVideo](https://github.com/hao-ai-lab/FastVideo) and [Sliding Tile Attention](https://hao-ai-lab.github.io/blogs/sta/) by [Hao AI Lab](https://hao-ai-lab.github.io/)
- - HunyuanVideo-gguf (GGUF Version and Quantization): [HunyuanVideo-gguf](https://huggingface.co/city96/HunyuanVideo-gguf) by [city96](https://huggingface.co/city96)
- - Enhance-A-Video (Better Generated Video for Free): [Enhance-A-Video](https://github.com/NUS-HPC-AI-Lab/Enhance-A-Video) by [NUS-HPC-AI-Lab](https://ai.comp.nus.edu.sg/)
- - TeaCache (Cache-based Accelerate): [TeaCache](https://github.com/LiewFeng/TeaCache) by [Feng Liu](https://github.com/LiewFeng)
- - HunyuanVideoGP (GPU Poor version): [HunyuanVideoGP](https://github.com/deepbeepmeep/HunyuanVideoGP) by [DeepBeepMeep](https://github.com/deepbeepmeep)
- -->
-
-
-
 ## 📑 Open-source Plan
 - HunyuanVideo-I2V (Image-to-Video Model)
 - [x] Inference
 - [x] Checkpoints
 - [x] ComfyUI
 - [x] LoRA training scripts
- - [ ] Multi-GPU Sequence Parallel inference (faster inference speed on more GPUs)
- - [ ] Diffusers
- - [ ] FP8 Quantized weights
 
 ## Contents
 - [**HunyuanVideo-I2V** 🌅](#hunyuanvideo-i2v-)
@@ -91,6 +59,8 @@ This repo contains official PyTorch model definitions, pre-trained weights and in
 - [Training data construction](#training-data-construction)
 - [Training](#training)
 - [Inference](#inference)
 - [🔗 BibTeX](#-bibtex)
 - [Acknowledgements](#acknowledgements)
 ---
@@ -107,6 +77,7 @@ The overall architecture of our system is designed to maximize the synergy betwe
 
 
 
 ## 📜 Requirements
 
 The following table shows the requirements for running the HunyuanVideo-I2V model (batch size = 1) to generate videos:
@@ -153,6 +124,9 @@ python -m pip install -r requirements.txt
 # 5. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)
 python -m pip install ninja
 python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3
 ```
 
 If you run into a floating point exception (core dump) on a specific GPU type, you may try the following solutions:
@@ -167,8 +141,8 @@ Additionally, HunyuanVideo-I2V also provides a pre-built Docker image. Use the f
 
 ```shell
 # For CUDA 12.4 (updated to avoid floating point exception)
- docker pull hunyuanvideo/hunyuanvideo-i2v:cuda_12
- docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo-i2v --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo-i2v:cuda_12
 ```
 
 
@@ -341,6 +315,98 @@ We list some LoRA-specific configurations for easy usage:
 | `--lora-scale` | 1.0 | Fusion scale for the LoRA model. |
 | `--lora-path` | "" | Weight path for the LoRA model. |
 
 
 ## 🔗 BibTeX
 
@@ -365,6 +431,8 @@ We would like to thank the contributors to the [SD3](https://huggingface.co/stab
 Additionally, we thank the Tencent Hunyuan Multimodal team for their help with the text encoder.
 
 
 <!-- ## Github Star History
 <a href="https://star-history.com/#Tencent/HunyuanVideo&Date">
 <picture>
 
 
 
 ## 🔥🔥🔥 News!!
+ * Mar 13, 2025: 🚀 We release the parallel inference code for HunyuanVideo-I2V powered by [xDiT](https://github.com/xdit-project/xDiT).
 * Mar 07, 2025: 🔥 We have fixed the bug in our open-source version that caused ID changes. Please try the new model weights of [HunyuanVideo-I2V](https://huggingface.co/tencent/HunyuanVideo-I2V) to ensure full visual consistency in the first frame and produce higher quality videos.
 * Mar 06, 2025: 👋 We release the inference code and model weights of HunyuanVideo-I2V. [Download](https://github.com/Tencent/HunyuanVideo-I2V/blob/main/ckpts/README.md).
 
 
 ## 📑 Open-source Plan
 - HunyuanVideo-I2V (Image-to-Video Model)
 - [x] Inference
 - [x] Checkpoints
 - [x] ComfyUI
 - [x] LoRA training scripts
+ - [x] Multi-GPU Sequence Parallel inference (faster inference speed on more GPUs)
+ - [ ] Diffusers
 
 ## Contents
 - [**HunyuanVideo-I2V** 🌅](#hunyuanvideo-i2v-)
 
 - [Training data construction](#training-data-construction)
 - [Training](#training)
 - [Inference](#inference)
+ - [🚀 Parallel Inference on Multiple GPUs by xDiT](#-parallel-inference-on-multiple-gpus-by-xdit)
+ - [Using Command Line](#using-command-line-1)
 - [🔗 BibTeX](#-bibtex)
 - [Acknowledgements](#acknowledgements)
 ---
 
 
 
 
+
 ## 📜 Requirements
 
 The following table shows the requirements for running the HunyuanVideo-I2V model (batch size = 1) to generate videos:
 
 # 5. Install flash attention v2 for acceleration (requires CUDA 11.8 or above)
 python -m pip install ninja
 python -m pip install git+https://github.com/Dao-AILab/flash-attention.git@v2.6.3
+
+ # 6. Install xDiT for parallel inference (torch 2.4.0 and flash-attn 2.6.3 are recommended)
+ python -m pip install xfuser==0.4.0
 ```
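After installation, a quick sanity check can confirm the environment (a minimal sketch, not part of the original instructions; it only assumes the packages installed above and a CUDA build of PyTorch):

```shell
# Optional: verify the installed versions and that CUDA is visible to PyTorch.
python -m pip show xfuser flash-attn | grep -E "^(Name|Version)"
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```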
 
 If you run into a floating point exception (core dump) on a specific GPU type, you may try the following solutions:
 
 
 ```shell
 # For CUDA 12.4 (updated to avoid floating point exception)
+ docker pull hunyuanvideo/hunyuanvideo-i2v:cuda12
+ docker run -itd --gpus all --init --net=host --uts=host --ipc=host --name hunyuanvideo-i2v --security-opt=seccomp=unconfined --ulimit=stack=67108864 --ulimit=memlock=-1 --privileged hunyuanvideo/hunyuanvideo-i2v:cuda12
 ```
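Once the container is up, you can attach an interactive shell to it (a standard Docker step shown here only for convenience; the container name matches the `--name` flag above):

```shell
docker exec -it hunyuanvideo-i2v bash
```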
 
 
 | `--lora-scale` | 1.0 | Fusion scale for the LoRA model. |
 | `--lora-path` | "" | Weight path for the LoRA model. |
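As a rough illustration of how these LoRA options combine with the sampling flags used elsewhere in this README (a hedged sketch, not a command taken verbatim from the repo; the prompt and LoRA weight path are placeholders, and depending on the script version an extra switch such as `--use-lora` may also be required):

```bash
# Hypothetical single-GPU LoRA inference call; the base flags mirror the
# sampling command shown in the parallel-inference section below.
python3 sample_image2video.py \
    --model HYVideo-T/2 \
    --prompt "Hair grows longer over time." \
    --i2v-mode \
    --i2v-image-path ./assets/demo/i2v_lora/imgs/hair_growth.png \
    --i2v-resolution 720p \
    --infer-steps 50 \
    --video-length 129 \
    --flow-reverse \
    --flow-shift 7.0 \
    --seed 0 \
    --embedded-cfg-scale 6.0 \
    --save-path ./results \
    --lora-scale 1.0 \
    --lora-path ./ckpts/your_lora_weights.safetensors
```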
 
+ ## 🚀 Parallel Inference on Multiple GPUs by xDiT
+
+ [xDiT](https://github.com/xdit-project/xDiT) is a scalable inference engine for Diffusion Transformers (DiTs) on multi-GPU clusters.
+ It has provided low-latency parallel inference solutions for a variety of DiT models, including mochi-1, CogVideoX, Flux.1, and SD3. This repo adopts the [Unified Sequence Parallelism (USP)](https://arxiv.org/abs/2405.07719) APIs for parallel inference of the HunyuanVideo-I2V model.
+
+ ### Using Command Line
+
+ For example, to generate a video with 8 GPUs, you can use the following command:
+
+ ```bash
+ cd HunyuanVideo-I2V
+
+ torchrun --nproc_per_node=8 sample_image2video.py \
+ --model HYVideo-T/2 \
+ --prompt "An Asian man with short hair in black tactical uniform and white clothes waves a firework stick." \
+ --i2v-mode \
+ --i2v-image-path ./assets/demo/i2v/imgs/0.jpg \
+ --i2v-resolution 720p \
+ --i2v-stability \
+ --infer-steps 50 \
+ --video-length 129 \
+ --flow-reverse \
+ --flow-shift 7.0 \
+ --seed 0 \
+ --embedded-cfg-scale 6.0 \
+ --save-path ./results \
+ --ulysses-degree 8 \
+ --ring-degree 1 \
+ --video-size 1280 720 \
+ --xdit-adaptive-size
+ ```
+
+ You can adjust `--ulysses-degree` and `--ring-degree` to find the parallel configuration with the best performance.
+ Note that you must set `--video-size` explicitly, since xDiT's acceleration mechanism constrains the size of the video to be generated.
+ To prevent black padding after converting the original image height/width to the target height/width, you can use `--xdit-adaptive-size`.
+ The valid parallel configurations are shown in the following table; a smaller-scale usage sketch follows it.
+
+ <details>
+ <summary>Supported Parallel Configurations (Click to expand)</summary>
+
+ | --video-size | --video-length | --ulysses-degree x --ring-degree | --nproc_per_node |
+ |----------------------|----------------|----------------------------------|------------------|
+ | 1280 720 or 720 1280 | 129 | 8x1,4x2,2x4,1x8 | 8 |
+ | 1280 720 or 720 1280 | 129 | 1x5 | 5 |
+ | 1280 720 or 720 1280 | 129 | 4x1,2x2,1x4 | 4 |
+ | 1280 720 or 720 1280 | 129 | 3x1,1x3 | 3 |
+ | 1280 720 or 720 1280 | 129 | 2x1,1x2 | 2 |
+ | 1104 832 or 832 1104 | 129 | 4x1,2x2,1x4 | 4 |
+ | 1104 832 or 832 1104 | 129 | 3x1,1x3 | 3 |
+ | 1104 832 or 832 1104 | 129 | 2x1,1x2 | 2 |
+ | 960 960 | 129 | 6x1,3x2,2x3,1x6 | 6 |
+ | 960 960 | 129 | 4x1,2x2,1x4 | 4 |
+ | 960 960 | 129 | 3x1,1x3 | 3 |
+ | 960 960 | 129 | 1x2,2x1 | 2 |
+ | 960 544 or 544 960 | 129 | 6x1,3x2,2x3,1x6 | 6 |
+ | 960 544 or 544 960 | 129 | 4x1,2x2,1x4 | 4 |
+ | 960 544 or 544 960 | 129 | 3x1,1x3 | 3 |
+ | 960 544 or 544 960 | 129 | 1x2,2x1 | 2 |
+ | 832 624 or 624 832 | 129 | 4x1,2x2,1x4 | 4 |
+ | 832 624 or 624 832 | 129 | 3x1,1x3 | 3 |
+ | 832 624 or 624 832 | 129 | 2x1,1x2 | 2 |
+ | 720 720 | 129 | 1x5 | 5 |
+ | 720 720 | 129 | 3x1,1x3 | 3 |
+
+ </details>
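Every supported configuration satisfies `--ulysses-degree` x `--ring-degree` = `--nproc_per_node`. As a second example (a minimal sketch derived from the 8-GPU command above and the `960 960 / 4x1,2x2,1x4` row in the table, not a command taken verbatim from the repo), a 4-GPU run at 960x960 could look like this:

```bash
# Hypothetical 4-GPU run at 960x960 using a 2x2 Ulysses/Ring split;
# all other flags mirror the 8-GPU example above.
cd HunyuanVideo-I2V

torchrun --nproc_per_node=4 sample_image2video.py \
    --model HYVideo-T/2 \
    --prompt "An Asian man with short hair in black tactical uniform and white clothes waves a firework stick." \
    --i2v-mode \
    --i2v-image-path ./assets/demo/i2v/imgs/0.jpg \
    --i2v-resolution 720p \
    --i2v-stability \
    --infer-steps 50 \
    --video-length 129 \
    --flow-reverse \
    --flow-shift 7.0 \
    --seed 0 \
    --embedded-cfg-scale 6.0 \
    --save-path ./results \
    --ulysses-degree 2 \
    --ring-degree 2 \
    --video-size 960 960 \
    --xdit-adaptive-size
```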
+
+
+ <p align="center">
+ <table align="center">
+ <thead>
+ <tr>
+ <th colspan="4">Latency (sec) for 1280x720 (129 frames, 50 steps) vs. number of GPUs</th>
+ </tr>
+ <tr>
+ <th>1</th>
+ <th>2</th>
+ <th>4</th>
+ <th>8</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <th>1904.08</th>
+ <th>934.09 (2.04x)</th>
+ <th>514.08 (3.70x)</th>
+ <th>337.58 (5.64x)</th>
+ </tr>
+ </tbody>
+ </table>
+ </p>
+
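The parenthesized figures are speedups relative to the single-GPU latency: for example, 1904.08 / 337.58 ≈ 5.64 at 8 GPUs, i.e. roughly 70% parallel scaling efficiency (5.64 / 8 ≈ 0.71).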
 
 ## 🔗 BibTeX
 
 Additionally, we thank the Tencent Hunyuan Multimodal team for their help with the text encoder.
 
 
+
+
 <!-- ## Github Star History
 <a href="https://star-history.com/#Tencent/HunyuanVideo&Date">
 <picture>