<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# InstructPix2Pix

[InstructPix2Pix](https://arxiv.org/abs/2211.09800) is a method for fine-tuning text-conditioned diffusion models so that they can follow an edit instruction for an input image. Models fine-tuned with this method take the following as inputs:

<p align="center">
    <img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/evaluation_diffusion_models/edit-instruction.png" alt="instructpix2pix-inputs" width=600/>
</p>

The output is an "edited" image in which the edit instruction has been applied to the input image:

<p align="center">
    <img src="https://huggingface.co/datasets/diffusers/docs-images/resolve/main/output-gs%407-igs%401-steps%4050.png" alt="instructpix2pix-output" width=600/>
</p>

The `train_instruct_pix2pix.py` script (available [here](https://github.com/huggingface/diffusers/blob/main/examples/instruct_pix2pix/train_instruct_pix2pix.py)) demonstrates the training procedure and shows how to adapt it for Stable Diffusion.

***Although `train_instruct_pix2pix.py` implements the InstructPix2Pix training procedure while staying faithful to the [original implementation](https://github.com/timothybrooks/instruct-pix2pix), it has only been tested on a [small-scale dataset](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples). This can affect the final results. For better results, we recommend training for longer on a larger dataset. A large dataset for InstructPix2Pix training is available [here](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered).***

## Running locally with PyTorch

### Installing the dependencies

Before running the script, install the library's training dependencies:

**Important**

The example scripts are updated frequently and have some example-specific requirements. To run the latest versions successfully, we recommend **installing from source** and keeping your installation up to date. To do this, run the following steps in a new virtual environment:

```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install -e .
```

Then `cd` into the example folder:

```bash
cd examples/instruct_pix2pix
```

Now run:

```bash
pip install -r requirements.txt
```

Then initialize an [🤗 Accelerate](https://github.com/huggingface/accelerate/) environment:

```bash
accelerate config
```

To use a default Accelerate configuration without answering questions about your environment, run:

```bash
accelerate config default
```

If your environment does not support an interactive shell, such as a notebook, use the following instead:

```python
from accelerate.utils import write_basic_config

write_basic_config()
```

### Example

As mentioned earlier, we will use a [small toy dataset](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples) for training. It is a smaller version of the [original dataset](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered) used in the InstructPix2Pix paper. To use your own dataset, see the [Create a dataset for training](create_dataset) guide.

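As a rough illustration of the data layout, each training example pairs an original image with an edit instruction and the edited result. The sketch below checks that a record has those three fields. The column names `input_image`, `edit_prompt`, and `edited_image` are assumptions about the toy dataset and the script's defaults, so verify them against your own data; `check_record` is a hypothetical helper, not part of diffusers.

```python
# Column names are assumptions; adjust them to match your own dataset.
REQUIRED_COLUMNS = ("input_image", "edit_prompt", "edited_image")


def check_record(record: dict) -> list:
    """Return the required columns that are missing or empty in `record`."""
    return [name for name in REQUIRED_COLUMNS if not record.get(name)]


# Placeholder strings stand in for real PIL images in this sketch.
example = {
    "input_image": "<original image>",
    "edit_prompt": "make the mountains snowy",
    "edited_image": "<edited image>",
}
print(check_record(example))                           # []
print(check_record({"edit_prompt": "add fireworks"}))  # ['input_image', 'edited_image']
```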
Set the `MODEL_NAME` environment variable to a Hub model repository ID or to the path of a folder containing the model weights; it is passed to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument. Set `DATASET_ID` to the name of the dataset:

```bash
export MODEL_NAME="runwayml/stable-diffusion-v1-5"
export DATASET_ID="fusing/instructpix2pix-1000-samples"
```
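In environments without an interactive shell, such as a notebook, the same two values can be set from Python instead. The following is a minimal sketch under that assumption; `build_launch_command` is a hypothetical helper that only assembles the argument list and does not start training.

```python
import os
import shlex

# Same values as the `export` lines; setdefault keeps any value that is already set.
os.environ.setdefault("MODEL_NAME", "runwayml/stable-diffusion-v1-5")
os.environ.setdefault("DATASET_ID", "fusing/instructpix2pix-1000-samples")


def build_launch_command(extra_args=()):
    """Hypothetical helper: assemble the launch command as a list of arguments."""
    return [
        "accelerate", "launch", "train_instruct_pix2pix.py",
        f"--pretrained_model_name_or_path={os.environ['MODEL_NAME']}",
        f"--dataset_name={os.environ['DATASET_ID']}",
        *extra_args,
    ]


command = build_launch_command(["--resolution=256", "--seed=42"])
print(shlex.join(command))
# To start training, pass the list to subprocess.run(command, check=True).
```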

Now you can launch training. The script saves all the components (`feature_extractor`, `scheduler`, `text_encoder`, `unet`, and so on) in a subfolder of your repository.

```bash
accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --dataset_name=$DATASET_ID \
    --enable_xformers_memory_efficient_attention \
    --resolution=256 --random_flip \
    --train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
    --max_train_steps=15000 \
    --checkpointing_steps=5000 --checkpoints_total_limit=1 \
    --learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
    --conditioning_dropout_prob=0.05 \
    --mixed_precision=fp16 \
    --seed=42 \
    --push_to_hub
```
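To get a feel for how much data the command above processes, the following sketch works through the arithmetic. It assumes that `--max_train_steps` counts optimizer updates, that training runs in a single process, and that the toy dataset holds roughly 1,000 examples, as its name suggests. `effective_batch_size` is a hypothetical helper, not part of the script.

```python
def effective_batch_size(train_batch_size, gradient_accumulation_steps, num_processes=1):
    """Samples contributing to each optimizer update across all processes."""
    return train_batch_size * gradient_accumulation_steps * num_processes


batch = effective_batch_size(train_batch_size=4, gradient_accumulation_steps=4)
samples_seen = batch * 15000   # --max_train_steps=15000
passes = samples_seen / 1000   # assumes about 1,000 examples in the toy dataset

print(batch)         # 16
print(samples_seen)  # 240000
print(passes)        # 240.0
```

Under these assumptions the run makes on the order of a few hundred passes over the toy dataset, which is one reason to prefer a larger dataset for serious training.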

Additionally, the script supports running validation inference during training and monitoring progress with Weights & Biases. Enable this feature with `--report_to=wandb`:

```bash
accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \
    --pretrained_model_name_or_path=$MODEL_NAME \
    --dataset_name=$DATASET_ID \
    --enable_xformers_memory_efficient_attention \
    --resolution=256 --random_flip \
    --train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
    --max_train_steps=15000 \
    --checkpointing_steps=5000 --checkpoints_total_limit=1 \
    --learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \
    --conditioning_dropout_prob=0.05 \
    --mixed_precision=fp16 \
    --val_image_url="https://hf.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png" \
    --validation_prompt="make the mountains snowy" \
    --seed=42 \
    --report_to=wandb \
    --push_to_hub
```

We recommend this kind of validation because it is useful for debugging the model. Note that `wandb` must be installed to use it; you can install it with `pip install wandb`.

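A quick way to confirm that `wandb` is available before starting a long run is to check whether Python can locate the package. This is a small sketch using only the standard library; `is_installed` is a hypothetical helper name.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if `package_name` can be found in this environment."""
    return importlib.util.find_spec(package_name) is not None


if not is_installed("wandb"):
    print("wandb is missing; run `pip install wandb` before passing --report_to=wandb")
```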
[Here](https://wandb.ai/sayakpaul/instruct-pix2pix/runs/ctr3kovq) is an example run that includes some validation samples and the training hyperparameters.

***Note: In the original paper, the authors observed that a model trained at an image resolution of 256x256 generalizes well to larger resolutions such as 512x512. This is likely due to the larger dataset they used for training.***

## Training with multiple GPUs

`accelerate` makes multi-GPU training seamless. Follow the instructions [here](https://huggingface.co/docs/accelerate/basic_tutorials/launch) to run distributed training with `accelerate`. Here is an example command:

```bash
accelerate launch --mixed_precision="fp16" --multi_gpu train_instruct_pix2pix.py \
    --pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5 \
    --dataset_name=sayakpaul/instructpix2pix-1000-samples \
    --use_ema \
    --enable_xformers_memory_efficient_attention \
    --resolution=512 --random_flip \
    --train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \
    --max_train_steps=15000 \
    --checkpointing_steps=5000 --checkpoints_total_limit=1 \
    --learning_rate=5e-05 --lr_warmup_steps=0 \
    --conditioning_dropout_prob=0.05 \
    --mixed_precision=fp16 \
    --seed=42 \
    --push_to_hub
```

## Inference

Once training is complete, you can run inference:

```python
import requests
import torch
from PIL import Image, ImageOps

from diffusers import StableDiffusionInstructPix2PixPipeline

model_id = "your_model_id"  # <- replace this with your model ID
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
generator = torch.Generator("cuda").manual_seed(0)  # fixed seed for reproducible results

url = "https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/test_pix2pix_4.png"


def download_image(url):
    # Fetch the image, apply its EXIF orientation, and convert it to RGB.
    image = Image.open(requests.get(url, stream=True).raw)
    image = ImageOps.exif_transpose(image)
    image = image.convert("RGB")
    return image


image = download_image(url)
prompt = "wipe out the lake"
num_inference_steps = 20
image_guidance_scale = 1.5
guidance_scale = 10

edited_image = pipe(
    prompt,
    image=image,
    num_inference_steps=num_inference_steps,
    image_guidance_scale=image_guidance_scale,
    guidance_scale=guidance_scale,
    generator=generator,
).images[0]
edited_image.save("edited_image.png")
```

An example model repository obtained with this training script is available at [sayakpaul/instruct-pix2pix](https://huggingface.co/sayakpaul/instruct-pix2pix).

We encourage you to experiment with the following three parameters to control the trade-off between speed and quality:

* `num_inference_steps`
* `image_guidance_scale`
* `guidance_scale`

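One way to explore these three parameters is to build a small grid of settings and run the pipeline once per combination. The sketch below only constructs the grid; the candidate values are illustrative assumptions rather than recommendations, and `sweep` is a hypothetical variable name.

```python
from itertools import product

# Candidate values are illustrative assumptions, not recommendations.
steps_options = [20, 50]
image_guidance_options = [1.0, 1.5, 2.0]
guidance_options = [7.5, 10]

sweep = [
    {
        "num_inference_steps": steps,
        "image_guidance_scale": image_scale,
        "guidance_scale": text_scale,
    }
    for steps, image_scale, text_scale in product(
        steps_options, image_guidance_options, guidance_options
    )
]

print(len(sweep))  # 12
print(sweep[0])
# With a loaded pipeline, each entry can be unpacked into the call, for example:
# edited = pipe(prompt, image=image, generator=generator, **settings).images[0]
```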
In particular, `image_guidance_scale` and `guidance_scale` can have a significant impact on the generated ("edited") image (see [here](https://twitter.com/RisingSayak/status/1628392199196151808?s=20) for an example).

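To see how the two scales interact, here is a minimal sketch of the two-scale classifier-free guidance combination described in the InstructPix2Pix paper, written for scalar toy values. The function name and arguments are hypothetical and chosen for illustration; the actual pipeline operates on tensors and may differ in detail.

```python
def combine_noise_predictions(uncond, image_only, full, image_guidance_scale, guidance_scale):
    """Two-scale classifier-free guidance on scalar noise predictions.

    uncond:      prediction with neither image nor text conditioning
    image_only:  prediction conditioned on the input image only
    full:        prediction conditioned on both the image and the instruction
    """
    return (
        uncond
        + image_guidance_scale * (image_only - uncond)
        + guidance_scale * (full - image_only)
    )


# Toy scalar values; real predictions are tensors with the same arithmetic applied elementwise.
print(combine_noise_predictions(0.0, 1.0, 2.0, image_guidance_scale=1.5, guidance_scale=10))  # 11.5
```

With both scales set to 1, the result reduces to the fully conditioned prediction. Raising `image_guidance_scale` weights the image-conditioned direction more heavily, while raising `guidance_scale` weights the instruction more heavily.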

If you are looking for interesting ways to use the InstructPix2Pix training method, check out the blog post [Instruction-tuning Stable Diffusion with InstructPix2Pix](https://huggingface.co/blog/instruction-tuning-sd).