<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Stable Video Diffusion
[[open-in-colab]]
[Stable Video Diffusion (SVD)](https://huggingface.co/papers/2311.15127) is a powerful image-to-video generation model that can generate 2-4 second high-resolution (576x1024) videos conditioned on an input image.
This guide shows how to use SVD to generate short videos from images.
Before you begin, make sure the following libraries are installed:
```py
!pip install -q -U diffusers transformers accelerate
```
This model comes in two variants: [SVD](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid) and [SVD-XT](https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt). The SVD checkpoint is trained to generate 14 frames, and the SVD-XT checkpoint is further finetuned to generate 25 frames.
This guide uses the SVD-XT checkpoint.
```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the SVD-XT checkpoint in half precision
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

# Load the conditioning image and resize it to 1024x576 (width x height)
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

# Fix the seed for reproducibility, then generate and save the frames
generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```
<div class="flex gap-4">
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">"source image of a rocket"</figcaption>
</div>
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/output_rocket.gif"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">"generated video from source image"</figcaption>
</div>
</div>
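As a rough check on the "2-4 second" figure, clip length is simply the frame count divided by the playback rate. The sketch below assumes the 14- and 25-frame counts of the two checkpoints and the `fps=7` passed to `export_to_video` in the example; `clip_seconds` is a hypothetical helper, not part of diffusers.

```python
def clip_seconds(num_frames: int, fps: int) -> float:
    """Playback duration in seconds of a clip with `num_frames` frames at `fps`."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return num_frames / fps


# Frame counts come from the checkpoint descriptions; fps=7 matches the export call.
for name, num_frames in [("SVD", 14), ("SVD-XT", 25)]:
    print(f"{name}: {clip_seconds(num_frames, fps=7):.2f} s")
```

At 7 frames per second this gives 2.00 s for SVD and 3.57 s for SVD-XT. Exporting with a different `fps` changes the playback length without changing the number of generated frames.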
## torch.compile
You can gain a 20-25% speedup, at the cost of slightly higher memory usage, by [compiling](../optimization/torch2.0#torchcompile) the UNet.
```diff
- pipe.enable_model_cpu_offload()
+ pipe.to("cuda")
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```
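To check the speedup on your own hardware, keep in mind that the first call to a compiled module includes compilation time, so warm-up calls should be left out of the measurement. A minimal sketch of such a timing helper follows. `time_call` is a hypothetical name, and the code is plain Python that works with any callable; the dummy workload stands in for a pipeline call.

```python
import time


def time_call(fn, warmup: int = 1, runs: int = 3) -> float:
    """Return the mean wall-clock seconds per call of `fn`, excluding warm-up calls."""
    for _ in range(warmup):
        fn()  # discarded: for a compiled model, compilation happens here
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


# Dummy workload so the sketch runs anywhere; replace it with a pipeline call.
print(f"{time_call(lambda: sum(range(100_000))):.6f} s per call")
```

With the pipeline loaded, the callable could be something like `lambda: pipe(image, decode_chunk_size=8, generator=generator)`, measured once with and once without `torch.compile`.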
## Reduce memory usage
Video generation is very memory intensive because it essentially generates `num_frames` frames all at once, similar to text-to-image generation with a large batch size. To reduce memory usage, there are several options that trade inference speed for a lower memory footprint:
- Enable model offloading: each component of the pipeline is offloaded to the CPU once it is no longer needed.
- Enable feed-forward chunking: the feed-forward layer runs in a loop instead of running a single feed-forward pass with a large batch size.
- Reduce `decode_chunk_size`: the VAE decodes frames in chunks instead of decoding them all at once. Setting `decode_chunk_size=1` decodes one frame at a time and uses the least memory (adjust this value to fit your GPU memory), but the video might show some flickering.
```diff
- pipe.enable_model_cpu_offload()
- frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]
+ pipe.enable_model_cpu_offload()
+ pipe.unet.enable_forward_chunking()
+ frames = pipe(image, decode_chunk_size=2, generator=generator, num_frames=25).frames[0]
```
Using all of these techniques together should bring memory usage below 8GB of VRAM.
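To make the `decode_chunk_size` trade-off concrete, the sketch below lists how many decode passes a given chunk size implies for 25 frames. It illustrates chunked decoding in general and is not the pipeline's actual implementation; `decode_chunks` is a hypothetical helper.

```python
def decode_chunks(num_frames: int, decode_chunk_size: int) -> list:
    """Split frame indices 0..num_frames-1 into consecutive chunks of at most decode_chunk_size."""
    if decode_chunk_size < 1:
        raise ValueError("decode_chunk_size must be at least 1")
    return [
        range(start, min(start + decode_chunk_size, num_frames))
        for start in range(0, num_frames, decode_chunk_size)
    ]


for size in (8, 2, 1):
    chunks = decode_chunks(25, size)
    largest = max(len(chunk) for chunk in chunks)
    print(f"decode_chunk_size={size}: {len(chunks)} passes, up to {largest} frames per pass")
```

For 25 frames this gives 4 passes at a chunk size of 8, 13 passes at 2, and 25 passes at 1. Fewer frames per pass lowers the peak memory needed for decoding, at the cost of more passes.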
## Micro-conditioning
In addition to the conditioning image, Stable Video Diffusion also accepts micro-conditioning, which gives you finer control over the generated video:
- `fps`: the frames per second of the generated video.
- `motion_bucket_id`: the motion bucket id to use for the generated video. It controls the amount of motion: increasing the motion bucket id increases the motion in the generated video.
- `noise_aug_strength`: the amount of noise added to the conditioning image. The higher the value, the less the video resembles the conditioning image. Increasing this value also increases the motion in the generated video.
For example, to generate a video with more motion, use the `motion_bucket_id` and `noise_aug_strength` micro-conditioning parameters:
```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

# Load the conditioning image
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png")
image = image.resize((1024, 576))

# Raise motion_bucket_id and noise_aug_strength to get more motion
generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator, motion_bucket_id=180, noise_aug_strength=0.1).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```
 | |