<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Kandinsky
[[open-in-colab]]
The Kandinsky models are a series of multilingual text-to-image generation models. The Kandinsky 2.0 model uses two multilingual text encoders and concatenates their outputs for the UNet.
[Kandinsky 2.1](../api/pipelines/kandinsky) changes the architecture to include an image prior model ([`CLIP`](https://huggingface.co/docs/transformers/model_doc/clip)) that generates a mapping between text and image embeddings. The mapping provides better text-image alignment and is used together with the text embeddings during training, which leads to higher-quality results. Finally, Kandinsky 2.1 decodes the latents into images with a [Modulating Quantized Vectors (MoVQ)](https://huggingface.co/papers/2209.09002) decoder, which adds a spatial conditional normalization layer to increase photorealism.
[Kandinsky 2.2](../api/pipelines/kandinsky_v22) improves on the previous model by replacing the image encoder of the image prior model with a larger CLIP-ViT-G model. The image prior model was also retrained on images with different resolutions and aspect ratios, so it can generate higher-resolution images and a wider range of image sizes.
[Kandinsky 3](../api/pipelines/kandinsky3) simplifies the architecture and moves away from the two-stage generation process that involves a prior model and a diffusion model. Instead, Kandinsky 3 uses [Flan-UL2](https://huggingface.co/google/flan-ul2) to encode text, a UNet with [BigGan-deep](https://hf.co/papers/1809.11096) blocks, and [Sber-MoVQGAN](https://github.com/ai-forever/MoVQGAN) to decode the latents into images. Text understanding and generated image quality are primarily achieved by using a larger text encoder and UNet.
This guide shows you how to use the Kandinsky models for text-to-image, image-to-image, inpainting, interpolation, and more.
Before you begin, make sure you have the following libraries installed:
```py
# uncomment to install the necessary libraries in Colab
#!pip install -q diffusers transformers accelerate
```
<Tip warning={true}>
Kandinsky 2.1 and 2.2 usage is very similar! The only difference is that Kandinsky 2.2 doesn't accept `prompt` as an input when decoding the latents. Instead, Kandinsky 2.2 only accepts `image_embeds` during decoding.
<br>
Kandinsky 3 has a more concise architecture and doesn't require a prior model. This means its usage is identical to other diffusion models like [Stable Diffusion XL](sdxl).
</Tip>
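The difference between the two decoders can be sketched as a small helper that assembles the decoder arguments. This is only an illustration: `build_decoder_kwargs` is a hypothetical function (not part of Diffusers), and the placeholder strings stand in for the embedding tensors a prior pipeline would return.

```python
def build_decoder_kwargs(version, prompt, image_embeds, negative_image_embeds):
    """Hypothetical helper: collect the arguments each Kandinsky decoder expects."""
    kwargs = {
        "image_embeds": image_embeds,
        "negative_image_embeds": negative_image_embeds,
    }
    if version == "2.1":
        # The Kandinsky 2.1 decoder also takes the text prompt.
        kwargs["prompt"] = prompt
    elif version != "2.2":
        # Kandinsky 2.2 decodes from the image embeddings only; anything else is unsupported here.
        raise ValueError(f"unsupported version: {version!r}")
    return kwargs


# Placeholder strings stand in for the tensors returned by a prior pipeline.
kwargs_21 = build_decoder_kwargs("2.1", "a hat", "emb", "neg_emb")
kwargs_22 = build_decoder_kwargs("2.2", "a hat", "emb", "neg_emb")
print(sorted(kwargs_21))
print(sorted(kwargs_22))
```

With a helper like this, the decoder call would look the same for both versions, for example `pipeline(**kwargs_22, height=768, width=768)`.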
## Text-to-image
To use the Kandinsky models for any task, you always start by setting up the prior pipeline to encode the prompt and generate the image embeddings. The prior pipeline also generates `negative_image_embeds` that correspond to the negative prompt `""`. For better results, you can pass an actual `negative_prompt` to the prior pipeline, but this doubles the effective batch size of the prior pipeline.
<hfoptions id="text-to-image">
<hfoption id="Kandinsky 2.1">
```py
from diffusers import KandinskyPriorPipeline, KandinskyPipeline
import torch
prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16).to("cuda")
pipeline = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16).to("cuda")
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality" # a negative prompt is optional, but it usually gives better results
image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt, guidance_scale=1.0).to_tuple()
```
Now pass all the prompts and embeddings to the [`KandinskyPipeline`] to generate an image:
```py
image = pipeline(prompt, image_embeds=image_embeds, negative_prompt=negative_prompt, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0]
image
```
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/cheeseburger.png"/>
</div>
</hfoption>
<hfoption id="Kandinsky 2.2">
```py
from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
import torch
prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16).to("cuda")
pipeline = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16).to("cuda")
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality" # a negative prompt is optional, but it usually gives better results
image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt=negative_prompt, guidance_scale=1.0).to_tuple()
```
Pass the `image_embeds` and `negative_image_embeds` to the [`KandinskyV22Pipeline`] to generate an image:
```py
image = pipeline(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0]
image
```
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-text-to-image.png"/>
</div>
</hfoption>
<hfoption id="Kandinsky 3">
Kandinsky 3 doesn't require a prior model, so you can load the [`Kandinsky3Pipeline`] directly and pass it a prompt to generate an image:
```py
from diffusers import Kandinsky3Pipeline
import torch
pipeline = Kandinsky3Pipeline.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
image = pipeline(prompt).images[0]
image
```
</hfoption>
</hfoptions>
🤗 Diffusers also provides an end-to-end API with the [`KandinskyCombinedPipeline`] and [`KandinskyV22CombinedPipeline`], so you don't have to load the prior and text-to-image pipelines separately. The combined pipeline automatically loads both the prior model and the decoder. If you want, you can still set different values for the prior pipeline with the `prior_guidance_scale` and `prior_num_inference_steps` parameters.
Use the [`AutoPipelineForText2Image`] to automatically call the combined pipelines under the hood:
<hfoptions id="text-to-image">
<hfoption id="Kandinsky 2.1">
```py
from diffusers import AutoPipelineForText2Image
import torch
pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"
image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768).images[0]
image
```
</hfoption>
<hfoption id="Kandinsky 2.2">
```py
from diffusers import AutoPipelineForText2Image
import torch
pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"
image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768).images[0]
image
```
</hfoption>
</hfoptions>
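The `guidance_scale` and `prior_guidance_scale` arguments control classifier-free guidance, which is also why a negative prompt doubles the effective batch size: each step needs an unconditional and a conditional prediction. The sketch below assumes the standard classifier-free guidance formula and uses toy Python lists in place of tensors; `apply_guidance` is an illustrative name, not a Diffusers function.

```python
def apply_guidance(uncond, cond, guidance_scale):
    """Classifier-free guidance: start at the unconditional prediction and
    move toward the conditional one, scaled by guidance_scale."""
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


# Toy predictions; in a real pipeline these are tensors from one batched forward pass.
uncond = [0.0, 1.0, 2.0]
cond = [1.0, 1.0, 4.0]

neutral = apply_guidance(uncond, cond, 1.0)  # a scale of 1.0 returns the conditional prediction
strong = apply_guidance(uncond, cond, 4.0)   # larger scales push further away from the unconditional one
print(neutral)
print(strong)
```

Under this formula a scale of `1.0`, as used for the prior in the examples above, leaves the conditional prediction unchanged, while the decoder's `guidance_scale=4.0` amplifies the difference between the two predictions.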
## Image-to-image
For image-to-image, pass the initial image and a text prompt to condition the generated image. Start by loading the prior pipeline:
<hfoptions id="image-to-image">
<hfoption id="Kandinsky 2.1">
```py
import torch
from diffusers import KandinskyImg2ImgPipeline, KandinskyPriorPipeline
prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipeline = KandinskyImg2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
```
</hfoption>
<hfoption id="Kandinsky 2.2">
```py
import torch
from diffusers import KandinskyV22Img2ImgPipeline, KandinskyV22PriorPipeline
prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipeline = KandinskyV22Img2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
```
</hfoption>
<hfoption id="Kandinsky 3">
Kandinsky 3 doesn't require a prior model, so you can load the image-to-image pipeline directly:
```py
from diffusers import Kandinsky3Img2ImgPipeline
from diffusers.utils import load_image
import torch
pipeline = Kandinsky3Img2ImgPipeline.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()
```
</hfoption>
</hfoptions>
Download an image to condition on:
```py
from diffusers.utils import load_image
# download the image
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
original_image = load_image(url)
original_image = original_image.resize((768, 512))
```
<div class="flex justify-center">
<img class="rounded-xl" src="https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"/>
</div>
Generate the `image_embeds` and `negative_image_embeds` with the prior pipeline (Kandinsky 3 has no prior pipeline, so it only needs the `prompt` and `negative_prompt` defined here):
```py
prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"
image_embeds, negative_image_embeds = prior_pipeline(prompt, negative_prompt).to_tuple()
```
Now pass the original image, along with all the prompts and embeddings, to the pipeline to generate an image:
<hfoptions id="image-to-image">
<hfoption id="Kandinsky 2.1">
```py
from diffusers.utils import make_image_grid
image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0]
make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
```
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/img2img_fantasyland.png"/>
</div>
</hfoption>
<hfoption id="Kandinsky 2.2">
```py
from diffusers.utils import make_image_grid
image = pipeline(image=original_image, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0]
make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
```
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-image-to-image.png"/>
</div>
</hfoption>
<hfoption id="Kandinsky 3">
```py
image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image, strength=0.75, num_inference_steps=25).images[0]
image
```
</hfoption>
</hfoptions>
🤗 Diffusers also provides an end-to-end API with the [`KandinskyImg2ImgCombinedPipeline`] and [`KandinskyV22Img2ImgCombinedPipeline`], so you don't have to load the prior and image-to-image pipelines separately. The combined pipeline automatically loads both the prior model and the decoder. If you want, you can still set different values for the prior pipeline with the `prior_guidance_scale` and `prior_num_inference_steps` parameters.
Use the [`AutoPipelineForImage2Image`] to automatically call the combined pipelines under the hood:
<hfoptions id="image-to-image">
<hfoption id="Kandinsky 2.1">
```py
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image
import torch
pipeline = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True)
pipeline.enable_model_cpu_offload()
prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
original_image = load_image(url)
original_image.thumbnail((768, 768))
image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=original_image, strength=0.3).images[0]
make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
```
</hfoption>
<hfoption id="Kandinsky 2.2">
```py
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import make_image_grid, load_image
import torch
pipeline = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()
prompt = "A fantasy landscape, Cinematic lighting"
negative_prompt = "low quality, bad quality"
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
original_image = load_image(url)
original_image.thumbnail((768, 768))
image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=original_image, strength=0.3).images[0]
make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
```
</hfoption>
</hfoptions>
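The `strength` argument used in the image-to-image examples decides how much of the denoising schedule is actually run on the noised input image. The sketch below reproduces that arithmetic in plain Python, assuming the step-count rule commonly used by Diffusers image-to-image pipelines; `denoising_steps` is an illustrative helper, not a library function.

```python
def denoising_steps(num_inference_steps, strength):
    """Return (skipped, executed) step counts for an image-to-image run.

    A strength of 1.0 runs the full schedule (the input image is mostly ignored),
    while a small strength runs only the last few steps and stays close to the input.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    executed = min(int(num_inference_steps * strength), num_inference_steps)
    skipped = num_inference_steps - executed
    return skipped, executed


print(denoising_steps(100, 0.3))   # (70, 30)
print(denoising_steps(25, 0.75))   # (7, 18)
print(denoising_steps(50, 1.0))    # (0, 50)
```

This is why `strength=0.3` keeps the output close to the original sketch, whereas `strength=0.75` gives the model much more freedom to change it.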
## Inpainting
<Tip warning={true}>
⚠️ The Kandinsky models now use ⬜️ **white pixels** instead of black pixels to represent the masked area. If you are using [`KandinskyInpaintPipeline`] in production, you need to change the mask to use white pixels:
```py
# For PIL input
import PIL.ImageOps
mask = PIL.ImageOps.invert(mask)
# For PyTorch and NumPy input
mask = 1 - mask
```
</Tip>
For inpainting, you need the original image, a mask of the area to replace in the original image, and a text prompt describing what to inpaint. Load the prior pipeline:
<hfoptions id="inpaint">
<hfoption id="Kandinsky 2.1">
```py
from diffusers import KandinskyInpaintPipeline, KandinskyPriorPipeline
from diffusers.utils import load_image, make_image_grid
import torch
import numpy as np
from PIL import Image
prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipeline = KandinskyInpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
```
</hfoption>
<hfoption id="Kandinsky 2.2">
```py
from diffusers import KandinskyV22InpaintPipeline, KandinskyV22PriorPipeline
from diffusers.utils import load_image, make_image_grid
import torch
import numpy as np
from PIL import Image
prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
pipeline = KandinskyV22InpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
```
</hfoption>
</hfoptions>
Load an initial image and create a mask:
```py
init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
mask = np.zeros((768, 768), dtype=np.float32)
# mask area above cat's head
mask[:250, 250:-250] = 1
```
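To check which region a slice such as `mask[:250, 250:-250]` marks as white (and will therefore be repainted), the slice bounds can be resolved against the image size. The sketch below uses only the Python standard library; `masked_region` is a hypothetical helper introduced for illustration.

```python
def masked_region(height, width, rows, cols):
    """Resolve NumPy-style slices against an image size.

    Returns the covered box as (row_start, row_stop, col_start, col_stop)
    and the fraction of the image that the box covers.
    """
    r0, r1, _ = rows.indices(height)
    c0, c1, _ = cols.indices(width)
    area = max(r1 - r0, 0) * max(c1 - c0, 0)
    return (r0, r1, c0, c1), area / (height * width)


# Same slices as mask[:250, 250:-250] on a 768x768 mask.
box, fraction = masked_region(768, 768, slice(None, 250), slice(250, -250))
print(box)                 # (0, 250, 250, 518)
print(round(fraction, 3))  # 0.114
```

So the mask covers a 250 x 268 pixel band at the top center of the image, roughly 11% of its area, which is where the hat is painted.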
Generate the embeddings with the prior pipeline:
```py
prompt = "a hat"
prior_output = prior_pipeline(prompt)
```
Now pass the initial image, mask, prompt, and embeddings to the pipeline to generate an image:
<hfoptions id="inpaint">
<hfoption id="Kandinsky 2.1">
```py
output_image = pipeline(prompt, image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0]
mask = Image.fromarray((mask*255).astype('uint8'), 'L')
make_image_grid([init_image, mask, output_image], rows=1, cols=3)
```
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/inpaint_cat_hat.png"/>
</div>
</hfoption>
<hfoption id="Kandinsky 2.2">
```py
output_image = pipeline(image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0]
mask = Image.fromarray((mask*255).astype('uint8'), 'L')
make_image_grid([init_image, mask, output_image], rows=1, cols=3)
```
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinskyv22-inpaint.png"/>
</div>
</hfoption>
</hfoptions>
You can also use the end-to-end [`KandinskyInpaintCombinedPipeline`] and [`KandinskyV22InpaintCombinedPipeline`] to call the prior and decoder pipelines together under the hood. Use the [`AutoPipelineForInpainting`] for this:
<hfoptions id="inpaint">
<hfoption id="Kandinsky 2.1">
```py
import torch
import numpy as np
from PIL import Image
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image, make_image_grid
pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
mask = np.zeros((768, 768), dtype=np.float32)
# mask area above cat's head
mask[:250, 250:-250] = 1
prompt = "a hat"
output_image = pipe(prompt=prompt, image=init_image, mask_image=mask).images[0]
mask = Image.fromarray((mask*255).astype('uint8'), 'L')
make_image_grid([init_image, mask, output_image], rows=1, cols=3)
```
</hfoption>
<hfoption id="Kandinsky 2.2">
```py
import torch
import numpy as np
from PIL import Image
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image, make_image_grid
pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
mask = np.zeros((768, 768), dtype=np.float32)
# mask area above cat's head
mask[:250, 250:-250] = 1
prompt = "a hat"
output_image = pipe(prompt=prompt, image=init_image, mask_image=mask).images[0]
mask = Image.fromarray((mask*255).astype('uint8'), 'L')
make_image_grid([init_image, mask, output_image], rows=1, cols=3)
```
</hfoption>
</hfoptions>
## Interpolation
Interpolation lets you explore the latent space between the image and text embeddings, which is a cool way to see some of the prior model's intermediate outputs. Load the prior pipeline and the two images you'd like to interpolate:
<hfoptions id="interpolate">
<hfoption id="Kandinsky 2.1">
```py
from diffusers import KandinskyPriorPipeline, KandinskyPipeline
from diffusers.utils import load_image, make_image_grid
import torch
prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg")
make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2)
```
</hfoption>
<hfoption id="Kandinsky 2.2">
```py
from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline
from diffusers.utils import load_image, make_image_grid
import torch
prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png")
img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg")
make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2)
```
</hfoption>
</hfoptions>
<div class="flex gap-4">
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">a cat</figcaption>
</div>
<div>
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg"/>
<figcaption class="mt-2 text-center text-sm text-gray-500">Van Gogh's Starry Night painting</figcaption>
</div>
</div>
Specify the texts or images to interpolate, and set a weight for each text or image. Experiment with the weights to see how they affect the interpolation!
```py
images_texts = ["a cat", img_1, img_2]
weights = [0.3, 0.3, 0.4]
```
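Conceptually, the weights blend the embeddings of the listed texts and images into a single embedding. The sketch below illustrates that weighted sum with toy 2-dimensional vectors, on the assumption that the prior combines the per-item embeddings linearly; the vectors and the `weighted_sum` helper are made up for illustration and are not real CLIP embeddings.

```python
def weighted_sum(embeddings, weights):
    """Blend equally sized embedding vectors with one weight per vector."""
    if len(embeddings) != len(weights):
        raise ValueError("need exactly one weight per embedding")
    dim = len(embeddings[0])
    return [sum(w * emb[i] for emb, w in zip(embeddings, weights)) for i in range(dim)]


# Toy stand-ins for the embeddings of "a cat", img_1 and img_2.
text_emb = [1.0, 0.0]
cat_emb = [0.0, 1.0]
painting_emb = [1.0, 1.0]

mixed = weighted_sum([text_emb, cat_emb, painting_emb], [0.3, 0.3, 0.4])
print(mixed)  # approximately [0.7, 0.7]
```

Raising one weight pulls the blended embedding, and therefore the generated image, toward that text or image.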
Call the `interpolate` function to generate the embeddings, and then pass them to the pipeline to generate the image:
<hfoptions id="interpolate">
<hfoption id="Kandinsky 2.1">
```py
# the prompt can be left empty
prompt = ""
prior_out = prior_pipeline.interpolate(images_texts, weights)
pipeline = KandinskyPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
image = pipeline(prompt, **prior_out, height=768, width=768).images[0]
image
```
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinsky-docs/starry_cat.png"/>
</div>
</hfoption>
<hfoption id="Kandinsky 2.2">
```py
# Kandinsky 2.2 decodes from the image embeddings only, so no prompt is passed to the decoder
prior_out = prior_pipeline.interpolate(images_texts, weights)
pipeline = KandinskyV22Pipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
image = pipeline(**prior_out, height=768, width=768).images[0]
image
```
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/kandinskyv22-interpolate.png"/>
</div>
</hfoption>
</hfoptions>
## ControlNet
<Tip warning={true}>
⚠️ ControlNet is only supported for Kandinsky 2.2!
</Tip>
ControlNet lets you condition large pretrained diffusion models on additional inputs such as a depth map or edge detection. For example, you can condition Kandinsky 2.2 on a depth map so the model understands and preserves the structure of the depth image.
Let's load an image and extract its depth map:
```py
from diffusers.utils import load_image
img = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"
).resize((768, 768))
img
```
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"/>
</div>
Then you can use the `depth-estimation` [`~transformers.Pipeline`] from 🤗 Transformers to process the image and retrieve the depth map:
```py
import torch
import numpy as np
from transformers import pipeline
def make_hint(image, depth_estimator):
    image = depth_estimator(image)["depth"]
    image = np.array(image)
    image = image[:, :, None]
    image = np.concatenate([image, image, image], axis=2)
    detected_map = torch.from_numpy(image).float() / 255.0
    hint = detected_map.permute(2, 0, 1)
    return hint
depth_estimator = pipeline("depth-estimation")
hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")
```
### Text-to-image [[controlnet-text-to-image]]
Load the prior pipeline and the [`KandinskyV22ControlnetPipeline`]:
```py
from diffusers import KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline
prior_pipeline = KandinskyV22PriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
pipeline = KandinskyV22ControlnetPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
).to("cuda")
```
Generate the image embeddings from a prompt and a negative prompt:
```py
prompt = "A robot, 4k photo"
negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"
generator = torch.Generator(device="cuda").manual_seed(43)
image_emb, zero_image_emb = prior_pipeline(
    prompt=prompt, negative_prompt=negative_prior_prompt, generator=generator
).to_tuple()
```
Finally, pass the image embeddings and the depth image to the [`KandinskyV22ControlnetPipeline`] to generate an image:
```py
image = pipeline(image_embeds=image_emb, negative_image_embeds=zero_image_emb, hint=hint, num_inference_steps=50, generator=generator, height=768, width=768).images[0]
image
```
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/robot_cat_text2img.png"/>
</div>
### Image-to-image [[controlnet-image-to-image]]
For image-to-image with ControlNet, you need:
- [`KandinskyV22PriorEmb2EmbPipeline`] to generate the image embeddings from a text prompt and an image
- [`KandinskyV22ControlnetImg2ImgPipeline`] to generate an image from the initial image and the image embeddings
Process the initial image of a cat with the `depth-estimation` [`~transformers.Pipeline`] from 🤗 Transformers to extract its depth map:
```py
import torch
import numpy as np
from diffusers import KandinskyV22PriorEmb2EmbPipeline, KandinskyV22ControlnetImg2ImgPipeline
from diffusers.utils import load_image, make_image_grid
from transformers import pipeline
img = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png"
).resize((768, 768))
def make_hint(image, depth_estimator):
    image = depth_estimator(image)["depth"]
    image = np.array(image)
    image = image[:, :, None]
    image = np.concatenate([image, image, image], axis=2)
    detected_map = torch.from_numpy(image).float() / 255.0
    hint = detected_map.permute(2, 0, 1)
    return hint
depth_estimator = pipeline("depth-estimation")
hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda")
```
Load the prior pipeline and the [`KandinskyV22ControlnetImg2ImgPipeline`]:
```py
prior_pipeline = KandinskyV22PriorEmb2EmbPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
pipeline = KandinskyV22ControlnetImg2ImgPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16
).to("cuda")
```
Pass a text prompt and the initial image to the prior pipeline to generate the image embeddings:
```py
prompt = "A robot, 4k photo"
negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature"
generator = torch.Generator(device="cuda").manual_seed(43)
img_emb = prior_pipeline(prompt=prompt, image=img, strength=0.85, generator=generator)
negative_emb = prior_pipeline(prompt=negative_prior_prompt, image=img, strength=1, generator=generator)
```
Now you can run the [`KandinskyV22ControlnetImg2ImgPipeline`] to generate an image from the initial image and the image embeddings:
```py
image = pipeline(image=img, strength=0.5, image_embeds=img_emb.image_embeds, negative_image_embeds=negative_emb.image_embeds, hint=hint, num_inference_steps=50, generator=generator, height=768, width=768).images[0]
make_image_grid([img.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2)
```
<div class="flex justify-center">
<img class="rounded-xl" src="https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/robot_cat.png"/>
</div>
## Optimizations
Kandinsky is unique because it requires a prior pipeline to generate the mappings and a second pipeline to decode the latents into an image. Optimization efforts should focus on the second pipeline because that is where the bulk of the computation happens. Here are some tips to improve Kandinsky during inference.
1. Enable [xFormers](../optimization/xformers) if you're using PyTorch < 2.0:
```diff
from diffusers import DiffusionPipeline
import torch
pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
+ pipe.enable_xformers_memory_efficient_attention()
```
2. Enable `torch.compile` if you're using PyTorch >= 2.0 to automatically use scaled dot-product attention (SDPA):
```diff
pipe.unet.to(memory_format=torch.channels_last)
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```
This is the same as explicitly setting the attention processor to use [`~models.attention_processor.AttnAddedKVProcessor2_0`]:
```py
from diffusers.models.attention_processor import AttnAddedKVProcessor2_0
pipe.unet.set_attn_processor(AttnAddedKVProcessor2_0())
```
3. Offload the model to the CPU with [`~KandinskyPriorPipeline.enable_model_cpu_offload`] to avoid out-of-memory errors:
```diff
from diffusers import DiffusionPipeline
import torch
pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
+ pipe.enable_model_cpu_offload()
```
4. By default, the text-to-image pipeline uses the [`DDIMScheduler`], but you can replace it with another scheduler like [`DDPMScheduler`] to see how that affects the tradeoff between inference speed and image quality:
```py
import torch
from diffusers import DDPMScheduler
from diffusers import DiffusionPipeline
scheduler = DDPMScheduler.from_pretrained("kandinsky-community/kandinsky-2-1", subfolder="ddpm_scheduler")
pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", scheduler=scheduler, torch_dtype=torch.float16, use_safetensors=True).to("cuda")
```