|
<!--
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
|
|
|
|
|
# Stable unCLIP |
|
|
|
Stable unCLIP checkpoints are finetuned from [Stable Diffusion 2.1](./stable_diffusion_2) checkpoints to condition on CLIP image embeddings.
Stable unCLIP also still conditions on text embeddings. Given the two separate conditionings, stable unCLIP can be used
for text-guided image variation. When combined with an unCLIP prior, it can also be used for full text-to-image generation.
|
|
|
To learn more about the unCLIP process, check out the following paper:
|
|
|
[Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125) by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. |
|
|
|
## Tips |
|
|
|
Stable unCLIP takes a `noise_level` as input during inference. `noise_level` determines how much noise is added
to the image embeddings. A higher `noise_level` increases variation in the final un-noised images. By default,
no additional noise is added to the image embeddings, i.e. `noise_level = 0`.
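
For example, to get more varied outputs from the image-variation pipeline, you can pass a larger `noise_level` at call time. The snippet below is a minimal sketch; the `noise_level` value and output filename are illustrative:

```python
import torch
from diffusers import StableUnCLIPImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variant="fp16"
)
pipe = pipe.to("cuda")

init_image = load_image(
    "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png"
)

# noise_level=500 is illustrative; larger values yield more variation.
images = pipe(init_image, noise_level=500).images
images[0].save("high_variation.png")
```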
|
|
|
### Available checkpoints
|
|
|
* Image variation
  * [stabilityai/stable-diffusion-2-1-unclip](https://hf.co/stabilityai/stable-diffusion-2-1-unclip)
  * [stabilityai/stable-diffusion-2-1-unclip-small](https://hf.co/stabilityai/stable-diffusion-2-1-unclip-small)
* Text-to-image
  * [stabilityai/stable-diffusion-2-1-unclip-small](https://hf.co/stabilityai/stable-diffusion-2-1-unclip-small)
|
|
|
### Text-to-Image Generation |
|
Stable unCLIP can be leveraged for text-to-image generation by pipelining it with the prior model of KakaoBrain's open-source DALL-E 2 replication, [Karlo](https://huggingface.co/kakaobrain/karlo-v1-alpha).
|
|
|
```python
import torch
from diffusers import UnCLIPScheduler, DDPMScheduler, StableUnCLIPPipeline
from diffusers.models import PriorTransformer
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

# Load the Karlo prior, which maps text embeddings to CLIP image embeddings.
prior_model_id = "kakaobrain/karlo-v1-alpha"
data_type = torch.float16
prior = PriorTransformer.from_pretrained(prior_model_id, subfolder="prior", torch_dtype=data_type)

# The Karlo prior works with CLIP ViT-L/14 text embeddings.
prior_text_model_id = "openai/clip-vit-large-patch14"
prior_tokenizer = CLIPTokenizer.from_pretrained(prior_text_model_id)
prior_text_model = CLIPTextModelWithProjection.from_pretrained(prior_text_model_id, torch_dtype=data_type)
prior_scheduler = UnCLIPScheduler.from_pretrained(prior_model_id, subfolder="prior_scheduler")
prior_scheduler = DDPMScheduler.from_config(prior_scheduler.config)

stable_unclip_model_id = "stabilityai/stable-diffusion-2-1-unclip-small"

pipe = StableUnCLIPPipeline.from_pretrained(
    stable_unclip_model_id,
    torch_dtype=data_type,
    variant="fp16",
    prior_tokenizer=prior_tokenizer,
    prior_text_encoder=prior_text_model,
    prior=prior,
    prior_scheduler=prior_scheduler,
)

pipe = pipe.to("cuda")
wave_prompt = "dramatic wave, the Oceans roar, Strong wave spiral across the oceans as the waves unfurl into roaring crests; perfect wave form; perfect wave shape; dramatic wave shape; wave shape unbelievable; wave; wave shape spectacular"

images = pipe(prompt=wave_prompt).images
images[0].save("waves.png")
```
|
<Tip warning={true}> |
|
|
|
For text-to-image we use `stabilityai/stable-diffusion-2-1-unclip-small` as it was trained on CLIP ViT-L/14 embeddings, the same as the Karlo model prior. [stabilityai/stable-diffusion-2-1-unclip](https://hf.co/stabilityai/stable-diffusion-2-1-unclip) was trained on OpenCLIP ViT-H, so we don't recommend using it with the Karlo prior.
|
|
|
</Tip> |
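
If you're unsure which CLIP image encoder a checkpoint was trained with, one way to check is to inspect the loaded pipeline. This is a minimal sketch; it assumes the image-variation checkpoint exposes its CLIP image encoder as `pipe.image_encoder`:

```python
from diffusers import StableUnCLIPImg2ImgPipeline

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-unclip")

# CLIP ViT-L/14 projects image embeddings to 768 dimensions;
# OpenCLIP ViT-H/14 projects to 1024.
print(pipe.image_encoder.config.projection_dim)
```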
|
|
|
### Text-Guided Image-to-Image Variation
|
|
|
```python
from diffusers import StableUnCLIPImg2ImgPipeline
from diffusers.utils import load_image
import torch

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variant="fp16"
)
pipe = pipe.to("cuda")

url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png"
init_image = load_image(url)

images = pipe(init_image).images
images[0].save("variation_image.png")
```
|
|
|
Optionally, you can also pass a prompt to `pipe`, such as:
|
|
|
```python
prompt = "A fantasy landscape, trending on artstation"

images = pipe(init_image, prompt=prompt).images
images[0].save("variation_image_two.png")
```
|
|
|
### Memory optimization |
|
|
|
If you are short on GPU memory, you can enable smart CPU offloading so that models that are not needed
immediately for a computation can be offloaded to the CPU:
|
|
|
```python
from diffusers import StableUnCLIPImg2ImgPipeline
from diffusers.utils import load_image
import torch

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variant="fp16"
)
# Offload submodels to the CPU when they are not in use.
pipe.enable_model_cpu_offload()

url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png"
init_image = load_image(url)

images = pipe(init_image).images
images[0].save("variation_image.png")
```
|
|
|
Further memory optimizations are possible by enabling VAE slicing on the pipeline: |
|
|
|
```python
from diffusers import StableUnCLIPImg2ImgPipeline
from diffusers.utils import load_image
import torch

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()
# Decode latents one image at a time to reduce peak VAE memory usage.
pipe.enable_vae_slicing()

url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png"
init_image = load_image(url)

images = pipe(init_image).images
images[0].save("variation_image.png")
```
|
|
|
### StableUnCLIPPipeline |
|
|
|
[[autodoc]] StableUnCLIPPipeline
	- all
	- __call__
	- enable_attention_slicing
	- disable_attention_slicing
	- enable_vae_slicing
	- disable_vae_slicing
	- enable_xformers_memory_efficient_attention
	- disable_xformers_memory_efficient_attention
|
|
|
|
|
### StableUnCLIPImg2ImgPipeline |
|
|
|
[[autodoc]] StableUnCLIPImg2ImgPipeline
	- all
	- __call__
	- enable_attention_slicing
	- disable_attention_slicing
	- enable_vae_slicing
	- disable_vae_slicing
	- enable_xformers_memory_efficient_attention
	- disable_xformers_memory_efficient_attention
|
|