<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Stable Video Diffusion
[[open-in-colab]]
[Stable Video Diffusion (SVD)](https://huggingface.co/papers/2311.15127) is a powerful image-to-video generation model that can generate 2-4 second high-resolution (576x1024) videos conditioned on an input image.
This guide shows how to use SVD to generate short videos from images.
Before you begin, make sure the following libraries are installed:
```py
!pip install -q -U diffusers transformers accelerate
```
이 λͺ¨λΈμ—λŠ” [SVD](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid)와 [SVD-XT](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt) 두 가지 μ’…λ₯˜κ°€ μžˆμŠ΅λ‹ˆλ‹€. SVD μ²΄ν¬ν¬μΈνŠΈλŠ” 14개의 ν”„λ ˆμž„μ„ μƒμ„±ν•˜λ„λ‘ ν•™μŠ΅λ˜μ—ˆκ³ , SVD-XT μ²΄ν¬ν¬μΈνŠΈλŠ” 25개의 ν”„λ ˆμž„μ„ μƒμ„±ν•˜λ„λ‘ νŒŒμΈνŠœλ‹λ˜μ—ˆμŠ΅λ‹ˆλ‹€.
이 κ°€μ΄λ“œμ—μ„œλŠ” SVD-XT 체크포인트λ₯Ό μ‚¬μš©ν•©λ‹ˆλ‹€.
```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video
pipe = StableVideoDiffusionPipeline.from_pretrained(
"stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()
# Load the conditioning image
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))
generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```
<div class="flex gap-4">
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">"source image of a rocket"</figcaption>
</div>
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/output_rocket.gif"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">"generated video from source image"</figcaption>
</div>
</div>
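The 2-4 second range follows from the frame counts of the two checkpoints and the playback rate passed to `export_to_video`. As a rough sketch (the helper name `clip_duration_seconds` is made up for illustration and is not part of Diffusers), the clip length is the number of frames divided by the fps:

```python
def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Length of a clip in seconds when `num_frames` frames play back at `fps`."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# Frame counts of the two checkpoints, exported at fps=7 as in the example above.
svd_seconds = clip_duration_seconds(14, 7)
svd_xt_seconds = clip_duration_seconds(25, 7)
print(f"SVD: {svd_seconds:.2f}s, SVD-XT: {svd_xt_seconds:.2f}s")
# SVD: 2.00s, SVD-XT: 3.57s
```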
## torch.compile
[Compiling](../optimization/torch2.0#torchcompile) the UNet gives a 20-25% speedup at the cost of slightly higher memory usage.
```diff
- pipe.enable_model_cpu_offload()
+ pipe.to("cuda")
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```
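To check the speedup on your own hardware, you can time the pipeline call before and after compiling. The sketch below is a generic timing helper, not part of Diffusers: `time_call` and `speedup_percent` are hypothetical names, and measuring the speedup as a percentage reduction in run time is an assumption about how the figure is defined. The warmup call is there because the first call to a compiled module typically includes compilation time.

```python
import time
from typing import Callable


def time_call(fn: Callable[[], object], warmup: int = 1, repeats: int = 3) -> float:
    """Return the average wall-clock seconds of `fn()` over `repeats` runs."""
    for _ in range(warmup):
        fn()  # warmup runs are not timed
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


def speedup_percent(baseline_s: float, compiled_s: float) -> float:
    """Percentage reduction in run time relative to the baseline."""
    return 100.0 * (baseline_s - compiled_s) / baseline_s


# Stand-in workload so the sketch runs anywhere; swap in a pipeline call to measure SVD.
elapsed = time_call(lambda: sum(range(100_000)))
print(f"average: {elapsed:.6f}s")
print(speedup_percent(10.0, 8.0))  # 20.0
```

To measure SVD itself, you could pass something like `lambda: pipe(image, decode_chunk_size=8, generator=generator)` as `fn`, once with the uncompiled UNet and once after compiling it.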
## Reduce memory usage
Video generation is very memory intensive because all `num_frames` frames are generated at once, much like text-to-image generation with a large batch size. To reduce memory usage, there are several options that trade inference speed for lower memory usage:
- Enable model offloading: each component of the pipeline is offloaded to the CPU once it is no longer needed.
- Enable feed-forward chunking: the feed-forward layer runs in a loop instead of running a single feed-forward pass with a large batch size.
- Reduce `decode_chunk_size`: the VAE decodes frames in chunks instead of decoding them all at once. Setting `decode_chunk_size=1` decodes one frame at a time and uses the least memory (adjust this value to fit your GPU memory), but the video may flicker slightly.
```diff
- pipe.enable_model_cpu_offload()
- frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
+ pipe.enable_model_cpu_offload()
+ pipe.unet.enable_forward_chunking()
+ frames = pipe(image, decode_chunk_size=2, generator=generator, num_frames=25).frames[0]
```
Using all of these techniques together should bring memory usage below 8GB of VRAM.
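To get a feel for what `decode_chunk_size` changes, the sketch below splits a range of frame indices into chunks. It only illustrates chunked decoding in general (`chunk_ranges` is a hypothetical helper, and the pipeline's actual decoding code may differ): a smaller chunk size means more, smaller VAE calls, each holding fewer frames in memory.

```python
def chunk_ranges(num_frames: int, decode_chunk_size: int) -> list[tuple[int, int]]:
    """Return (start, end) index pairs covering `num_frames` in chunks."""
    if decode_chunk_size < 1:
        raise ValueError("decode_chunk_size must be at least 1")
    return [
        (start, min(start + decode_chunk_size, num_frames))
        for start in range(0, num_frames, decode_chunk_size)
    ]


# 25 frames, as generated by SVD-XT.
print(chunk_ranges(25, 8))       # [(0, 8), (8, 16), (16, 24), (24, 25)]
print(len(chunk_ranges(25, 2)))  # 13
print(len(chunk_ranges(25, 1)))  # 25
```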
## Micro-conditioning
In addition to the conditioning image, Stable Video Diffusion also accepts micro-conditioning, which gives you more control over the generated video:
- `fps`: the frames per second of the generated video.
- `motion_bucket_id`: the motion bucket id to use for the generated video. It controls the amount of motion in the generated video; increasing the motion bucket id increases the motion.
- `noise_aug_strength`: the amount of noise added to the conditioning image. The higher the value, the less the video resembles the conditioning image. Increasing this value also increases the motion of the generated video.
For example, to generate a video with more motion, use the `motion_bucket_id` and `noise_aug_strength` micro-conditioning parameters:
```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video
pipe = StableVideoDiffusionPipeline.from_pretrained(
"stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()
# Load the conditioning image
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))
generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator, motion_bucket_id=180, noise_aug_strength=0.1).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```
![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/output_rocket_with_conditions.gif)
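To compare several micro-conditioning settings side by side, it can help to build the keyword arguments up front. The sketch below only constructs the settings and does not run the model; `conditioning_grid` is a hypothetical helper, and the specific values are arbitrary examples rather than recommendations.

```python
from itertools import product


def conditioning_grid(motion_bucket_ids, noise_aug_strengths):
    """Build one keyword-argument dict per combination of the two parameters."""
    return [
        {"motion_bucket_id": motion, "noise_aug_strength": noise}
        for motion, noise in product(motion_bucket_ids, noise_aug_strengths)
    ]


# Illustrative values only: three motion levels crossed with two noise levels.
settings = conditioning_grid([60, 127, 180], [0.02, 0.1])
for kwargs in settings:
    print(kwargs)
```

Each dictionary can then be unpacked into the pipeline call, for example `pipe(image, decode_chunk_size=8, generator=generator, **kwargs)`, saving each result under a different filename.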