<!--Copyright 2023 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Stable Video Diffusion

[[open-in-colab]]

[Stable Video Diffusion (SVD)](https://huggingface.co/papers/2311.15127) is a powerful image-to-video generation model that can generate 2-4 second high-resolution (576x1024) videos conditioned on an input image.

This guide shows how to use SVD to generate short videos from images.

Before you begin, make sure the following libraries are installed:

```py
!pip install -q -U diffusers transformers accelerate
```

This model comes in two variants, [SVD](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid) and [SVD-XT](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt). The SVD checkpoint is trained to generate 14 frames, and the SVD-XT checkpoint is further finetuned to generate 25 frames.

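To see how the frame counts relate to the 2-4 second clip length, the following minimal sketch computes playback duration from frame count and frame rate. The helper `clip_duration_seconds` is hypothetical (not part of diffusers), and the 7 fps playback rate is an assumption taken from the `export_to_video(..., fps=7)` calls used in this guide.

```python
# Illustrative arithmetic only: clip length follows from frame count and playback rate.
# Assumption: playback at 7 fps, as in this guide's export_to_video(..., fps=7) calls.


def clip_duration_seconds(num_frames: int, fps: int) -> float:
    """Return the playback length in seconds of a clip with `num_frames` frames at `fps`."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


svd_seconds = clip_duration_seconds(14, 7)     # SVD checkpoint: 14 frames
svd_xt_seconds = clip_duration_seconds(25, 7)  # SVD-XT checkpoint: 25 frames
print(f"SVD: {svd_seconds:.2f} s, SVD-XT: {svd_xt_seconds:.2f} s")
```

Exporting at a different frame rate changes the clip length proportionally, since the number of generated frames stays the same.
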
This guide uses the SVD-XT checkpoint.

```python
import torch

from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

# Load the conditioning image
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

# Fix the seed so the result is reproducible
generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]

export_to_video(frames, "generated.mp4", fps=7)
```

<div class="flex gap-4">
  <div>
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">"source image of a rocket"</figcaption>
  </div>
  <div>
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/output_rocket.gif"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">"generated video from source image"</figcaption>
  </div>
</div>

## torch.compile

[Compiling](../optimization/torch2.0#torchcompile) the UNet gives a 20-25% speedup at the cost of slightly higher memory usage.

```diff
- pipe.enable_model_cpu_offload()
+ pipe.to("cuda")
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```

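The actual speedup depends on your hardware and library versions, so it can be worth measuring on your own setup. The sketch below is one way to do that: `average_seconds` is a hypothetical helper (not part of diffusers or PyTorch) that averages wall-clock time over a few calls and runs untimed warmup calls first, since the first calls to a compiled module include compilation time.

```python
import time
from statistics import mean
from typing import Callable


def average_seconds(fn: Callable[[], object], warmup: int = 1, runs: int = 3) -> float:
    """Average wall-clock seconds per call of `fn`, after `warmup` untimed calls."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    for _ in range(warmup):
        fn()  # untimed: with torch.compile, early calls include compilation
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Hypothetical usage with the pipeline and image from this guide:
# seconds = average_seconds(lambda: pipe(image, decode_chunk_size=8).frames[0])
```

Running the same measurement before and after compiling the UNet gives a rough comparison. Keep the inputs and settings identical between the two runs.
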
## Reduce memory usage

Video generation is very memory intensive because it essentially generates `num_frames` frames all at once, similar to text-to-image generation with a large batch size. To reduce memory usage, there are several options that trade inference speed for lower memory usage:

- Enable model offloading: each component of the pipeline is offloaded to the CPU once it is no longer needed.
- Enable feed-forward chunking: the feed-forward layer runs in a loop instead of running a single feed-forward pass with a large batch size.
- Reduce `decode_chunk_size`: the VAE decodes frames in chunks instead of decoding them all at once. Setting `decode_chunk_size=1` decodes one frame at a time and uses the least memory (adjust this value according to your GPU memory), but the video may show some flickering.

```diff
- pipe.enable_model_cpu_offload()
- frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
+ pipe.enable_model_cpu_offload()
+ pipe.unet.enable_forward_chunking()
+ frames = pipe(image, decode_chunk_size=2, generator=generator, num_frames=25).frames[0]
```

Using all of these techniques together should bring memory usage below 8GB of VRAM.

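To build intuition for what `decode_chunk_size` trades off, here is a pure-Python illustration of chunked processing. It is not the pipeline's actual decoding code: `iter_chunks` is a hypothetical helper, and integer indices stand in for the latent frames. A smaller chunk size means fewer frames are held in each decode call (lower peak memory) but more calls overall.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_chunks(items: Sequence[T], chunk_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `chunk_size` elements each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


# Stand-ins for the 25 frames generated by the SVD-XT checkpoint
frame_indices = list(range(25))

for size in (8, 2, 1):
    chunk_lengths = [len(chunk) for chunk in iter_chunks(frame_indices, size)]
    print(f"decode_chunk_size={size}: {len(chunk_lengths)} decode calls, largest chunk {max(chunk_lengths)}")
```

With 25 frames, a chunk size of 8 results in four chunks (8, 8, 8, and 1 frames), while a chunk size of 1 results in 25 single-frame chunks.
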
## Micro-conditioning

In addition to the conditioning image, Stable Video Diffusion also accepts micro-conditioning, which gives you more control over the generated video:

- `fps`: the frames per second of the generated video.
- `motion_bucket_id`: the motion bucket id to use for the generated video. It controls the amount of motion in the generated video. Increasing the motion bucket id increases the motion of the generated video.
- `noise_aug_strength`: the amount of noise added to the conditioning image. The higher the value, the less the video resembles the conditioning image. Increasing this value also increases the motion of the generated video.

For example, to generate a video with more motion, use the `motion_bucket_id` and `noise_aug_strength` micro-conditioning parameters:

```python
import torch

from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

# Load the conditioning image
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
# Pass the micro-conditioning parameters to increase the motion
frames = pipe(image, decode_chunk_size=8, generator=generator, motion_bucket_id=180, noise_aug_strength=0.1).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```

 |
|
|
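To compare how different `motion_bucket_id` values affect the result, it can help to generate several videos that differ only in that parameter. The sketch below is one possible way to organize this. `sweep_motion_bucket` and the `generate` callable are hypothetical names introduced for illustration, and the ids in the commented usage are arbitrary example values.

```python
from typing import Any, Callable, Dict, Iterable


def sweep_motion_bucket(
    generate: Callable[..., Any],
    motion_bucket_ids: Iterable[int],
    **fixed_kwargs: Any,
) -> Dict[int, Any]:
    """Call `generate` once per motion bucket id, keeping all other arguments fixed."""
    results = {}
    for motion_bucket_id in motion_bucket_ids:
        results[motion_bucket_id] = generate(motion_bucket_id=motion_bucket_id, **fixed_kwargs)
    return results


# Hypothetical usage with the pipeline, image, and imports from the example above:
#
# def generate(**kwargs):
#     generator = torch.manual_seed(42)  # same seed for every run
#     return pipe(image, decode_chunk_size=8, generator=generator, **kwargs).frames[0]
#
# videos = sweep_motion_bucket(generate, [60, 127, 180], noise_aug_strength=0.1)
# for motion_bucket_id, frames in videos.items():
#     export_to_video(frames, f"generated_{motion_bucket_id}.mp4", fps=7)
```

Resetting the seed inside `generate` keeps the initial noise identical across runs, so differences between the videos come from the micro-conditioning value rather than from random variation.
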