LEDITS++
LEDITS++ was proposed in LEDITS++: Limitless Image Editing using Text-to-Image Models by Manuel Brack, Felix Friedrich, Katharina Kornmeier, Linoy Tsaban, Patrick Schramowski, Kristian Kersting, Apolinário Passos.
The abstract from the paper is:
Text-to-image diffusion models have recently received increasing interest for their astonishing ability to produce high-fidelity images from solely text inputs. Subsequent research efforts aim to exploit and apply their capabilities to real image editing. However, existing image-to-image methods are often inefficient, imprecise, and of limited versatility. They either require time-consuming fine-tuning, deviate unnecessarily strongly from the input image, and/or lack support for multiple, simultaneous edits. To address these issues, we introduce LEDITS++, an efficient yet versatile and precise textual image manipulation technique. LEDITS++'s novel inversion approach requires no tuning nor optimization and produces high-fidelity results with a few diffusion steps. Second, our methodology supports multiple simultaneous edits and is architecture-agnostic. Third, we use a novel implicit masking technique that limits changes to relevant image regions. We propose the novel TEdBench++ benchmark as part of our exhaustive evaluation. Our results demonstrate the capabilities of LEDITS++ and its improvements over previous methods. The project page is available at https://leditsplusplus-project.static.hf.space.
You can find additional information about LEDITS++ on the project page and try it out in a demo.
Due to backward-compatibility issues with the current diffusers implementation of DPMSolverMultistepScheduler, this implementation of LEDITS++ can no longer guarantee perfect inversion. This is unlikely to have any noticeable effect on applied use cases. However, we provide an alternative implementation that guarantees perfect inversion in a dedicated GitHub repo.
We provide two distinct pipelines based on different pre-trained models.
LEditsPPPipelineStableDiffusion
class diffusers.LEditsPPPipelineStableDiffusion
< source >( vae: AutoencoderKL text_encoder: CLIPTextModel tokenizer: CLIPTokenizer unet: UNet2DConditionModel scheduler: typing.Union[diffusers.schedulers.scheduling_ddim.DDIMScheduler, diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler] safety_checker: StableDiffusionSafetyChecker feature_extractor: CLIPImageProcessor requires_safety_checker: bool = True )
Parameters
- vae (AutoencoderKL) — Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
- text_encoder (CLIPTextModel) — Frozen text-encoder. Stable Diffusion uses the text portion of CLIP, specifically the clip-vit-large-patch14 variant.
- tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
- unet (UNet2DConditionModel) — Conditional U-Net architecture to denoise the encoded image latents.
- scheduler (DPMSolverMultistepScheduler or DDIMScheduler) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DPMSolverMultistepScheduler or DDIMScheduler. If any other scheduler is passed, it will automatically be set to DPMSolverMultistepScheduler.
- safety_checker (StableDiffusionSafetyChecker) — Classification module that estimates whether generated images could be considered offensive or harmful. Please refer to the model card for details.
- feature_extractor (CLIPImageProcessor) — Model that extracts features from generated images to be used as inputs for the safety_checker.
Pipeline for textual image editing using LEDITS++ with Stable Diffusion.
This model inherits from DiffusionPipeline and builds on the StableDiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
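The scheduler fallback described in the parameter list can be pictured with a small, self-contained sketch. The classes below are hypothetical stand-ins rather than the diffusers classes, and the helper name resolve_scheduler is invented for illustration; the actual pipeline may convert an unsupported scheduler differently (for example, by reusing its config).

```python
# Hypothetical stand-ins for scheduler classes; these are NOT the diffusers implementations.
class DDIMScheduler:
    pass


class DPMSolverMultistepScheduler:
    pass


class EulerDiscreteScheduler:
    pass


# The two scheduler types the pipeline documentation lists as supported.
SUPPORTED = (DDIMScheduler, DPMSolverMultistepScheduler)


def resolve_scheduler(scheduler):
    """Return a supported scheduler unchanged, otherwise fall back (assumed behavior)."""
    if isinstance(scheduler, SUPPORTED):
        return scheduler
    # Any other scheduler is replaced by the default one.
    return DPMSolverMultistepScheduler()


print(type(resolve_scheduler(EulerDiscreteScheduler())).__name__)
```

In practice this means that passing, say, an Euler scheduler does not raise an error; the pipeline would simply run with DPMSolverMultistepScheduler instead.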
__call__
< source >( negative_prompt: typing.Union[str, typing.List[str], NoneType] = None generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True editing_prompt: typing.Union[str, typing.List[str], NoneType] = None editing_prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None reverse_editing_direction: typing.Union[bool, typing.List[bool], NoneType] = False edit_guidance_scale: typing.Union[float, typing.List[float], NoneType] = 5 edit_warmup_steps: typing.Union[int, typing.List[int], NoneType] = 0 edit_cooldown_steps: typing.Union[int, typing.List[int], NoneType] = None edit_threshold: typing.Union[float, typing.List[float], NoneType] = 0.9 user_mask: typing.Optional[torch.Tensor] = None sem_guidance: typing.Optional[typing.List[torch.Tensor]] = None use_cross_attn_mask: bool = False use_intersect_mask: bool = True attn_store_steps: typing.Optional[typing.List[int]] = [] store_averaged_over_steps: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → LEditsPPDiffusionPipelineOutput or tuple
Parameters
- negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
- generator (torch.Generator, optional) — One or a list of torch generator(s) to make generation deterministic.
- output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image.Image and np.array.
- return_dict (bool, optional, defaults to True) — Whether or not to return a LEditsPPDiffusionPipelineOutput instead of a plain tuple.
- editing_prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. The image is reconstructed by setting editing_prompt = None. Guidance direction of prompt should be specified via reverse_editing_direction.
- editing_prompt_embeds (torch.Tensor, optional) — Pre-computed embeddings to use for guiding the image generation. Guidance direction of embedding should be specified via reverse_editing_direction.
- negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, negative_prompt_embeds are generated from the negative_prompt input argument.
- reverse_editing_direction (bool or List[bool], optional, defaults to False) — Whether the corresponding prompt in editing_prompt should be increased or decreased.
- edit_guidance_scale (float or List[float], optional, defaults to 5) — Guidance scale for guiding the image generation. If provided as a list, values should correspond to editing_prompt. edit_guidance_scale is defined as s_e of equation 12 of the LEDITS++ paper.
- edit_warmup_steps (int or List[int], optional, defaults to 0) — Number of diffusion steps (for each prompt) for which guidance will not be applied.
- edit_cooldown_steps (int or List[int], optional, defaults to None) — Number of diffusion steps (for each prompt) after which guidance will no longer be applied.
- edit_threshold (float or List[float], optional, defaults to 0.9) — Masking threshold of guidance. Threshold should be proportional to the image region that is modified. ‘edit_threshold’ is defined as ‘λ’ of equation 12 of the LEDITS++ paper.
- user_mask (torch.Tensor, optional) — User-provided mask for even better control over the editing process. This is helpful when LEDITS++'s implicit masks do not meet user preferences.
- sem_guidance (List[torch.Tensor], optional) — List of pre-generated guidance vectors to be applied at generation. Length of the list has to correspond to num_inference_steps.
- use_cross_attn_mask (bool, defaults to False) — Whether cross-attention masks are used. Cross-attention masks are always used when use_intersect_mask is set to true. Cross-attention masks are defined as ‘M^1’ of equation 12 of the LEDITS++ paper.
- use_intersect_mask (bool, defaults to True) — Whether the masking term is calculated as the intersection of cross-attention masks and masks derived from the noise estimate. Cross-attention masks are defined as ‘M^1’ and masks derived from the noise estimate are defined as ‘M^2’ of equation 12 of the LEDITS++ paper.
- attn_store_steps (List[int], optional) — Steps for which the attention maps are stored in the AttentionStore. Just for visualization purposes.
- store_averaged_over_steps (bool, defaults to True) — Whether the attention maps for the ‘attn_store_steps’ are stored averaged over the diffusion steps. If False, attention maps for each step are stored separately. Just for visualization purposes.
- cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined in self.processor.
- guidance_rescale (float, optional, defaults to 0.0) — Guidance rescale factor from Common Diffusion Noise Schedules and Sample Steps are Flawed. Guidance rescale factor should fix overexposure when using zero terminal SNR.
- clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
- callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
- callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
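As an illustration of the callback_on_step_end signature listed above, the following is a minimal sketch of a callback that records the latent shape at each step. The function name log_latents and its history attribute are hypothetical, and returning callback_kwargs follows the convention used by other diffusers pipelines, which is assumed (not confirmed by this page) to apply here as well.

```python
def log_latents(pipe, step, timestep, callback_kwargs):
    """Hypothetical callback: record the shape of the latents at every step."""
    latents = callback_kwargs["latents"]
    # Store (step index, timestep, shape) so progress can be inspected afterwards.
    log_latents.history.append((step, timestep, getattr(latents, "shape", None)))
    # Hand the kwargs back so the pipeline can read any (possibly modified) tensors.
    return callback_kwargs


# Simple function attribute used as storage for this sketch.
log_latents.history = []
```

Such a function could then be passed as callback_on_step_end=log_latents together with callback_on_step_end_tensor_inputs=["latents"] when calling the pipeline.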
Returns
LEditsPPDiffusionPipelineOutput or tuple
LEditsPPDiffusionPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images, and the second element is a list of bools denoting whether the corresponding generated image likely represents "not-safe-for-work" (nsfw) content, according to the safety_checker.
The call function to the pipeline for editing. The invert() method has to be called beforehand. Edits will always be performed for the last inverted image(s).
Examples:
>>> import torch
>>> from diffusers import LEditsPPPipelineStableDiffusion
>>> from diffusers.utils import load_image
>>> pipe = LEditsPPPipelineStableDiffusion.from_pretrained(
... "stable-diffusion-v1-5/stable-diffusion-v1-5", variant="fp16", torch_dtype=torch.float16
... )
>>> pipe.enable_vae_tiling()
>>> pipe = pipe.to("cuda")
>>> img_url = "https://www.aiml.informatik.tu-darmstadt.de/people/mbrack/cherry_blossom.png"
>>> image = load_image(img_url).resize((512, 512))
>>> _ = pipe.invert(image=image, num_inversion_steps=50, skip=0.1)
>>> edited_image = pipe(
... editing_prompt=["cherry blossom"], edit_guidance_scale=10.0, edit_threshold=0.75
... ).images[0]
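To make the roles of edit_warmup_steps and edit_cooldown_steps more concrete, here is a minimal sketch of the step window in which edit guidance would be active. The helper guidance_active is hypothetical, and the exact boundary handling (inclusive versus exclusive) is an assumption rather than a statement about the pipeline internals.

```python
def guidance_active(step, warmup_steps=0, cooldown_steps=None):
    """Return True if edit guidance would be applied at 0-based `step` (illustrative rule)."""
    if step < warmup_steps:
        # Still inside the warm-up phase: no guidance yet.
        return False
    if cooldown_steps is not None and step >= cooldown_steps:
        # Past the cool-down point: guidance is no longer applied.
        return False
    return True


# Steps of a 10-step run in which guidance is active for one editing prompt.
active = [s for s in range(10) if guidance_active(s, warmup_steps=2, cooldown_steps=8)]
print(active)
```

When several editing prompts are used, each prompt can have its own window by passing lists for both parameters.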
invert
< source >( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] source_prompt: str = '' source_guidance_scale: float = 3.5 num_inversion_steps: int = 30 skip: float = 0.15 generator: typing.Optional[torch._C.Generator] = None cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None clip_skip: typing.Optional[int] = None height: typing.Optional[int] = None width: typing.Optional[int] = None resize_mode: typing.Optional[str] = 'default' crops_coords: typing.Optional[typing.Tuple[int, int, int, int]] = None ) → LEditsPPInversionPipelineOutput
Parameters
- image (PipelineImageInput) — Input for the image(s) that are to be edited. Multiple input images must have the same aspect ratio.
- source_prompt (str, defaults to "") — Prompt describing the input image that will be used for guidance during inversion. Guidance is disabled if the source_prompt is "".
- source_guidance_scale (float, defaults to 3.5) — Strength of guidance during inversion.
- num_inversion_steps (int, defaults to 30) — Number of total performed inversion steps after discarding the initial skip steps.
- skip (float, defaults to 0.15) — Portion of initial steps that will be ignored for inversion and subsequent generation. Lower values will lead to stronger changes to the input image. skip has to be between 0 and 1.
- generator (torch.Generator, optional) — A torch.Generator to make inversion deterministic.
- cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined in self.processor.
- clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
- height (int, optional, defaults to None) — The height of the preprocessed image. If None, get_default_height_width() is used to get the default height.
- width (int, optional, defaults to None) — The width of the preprocessed image. If None, get_default_height_width() is used to get the default width.
- resize_mode (str, optional, defaults to default) — The resize mode, can be one of default, fill, or crop. If default, will resize the image to fit within the specified width and height, and it may not maintain the original aspect ratio. If fill, will resize the image to fit within the specified width and height, maintaining the aspect ratio, and then center the image within the dimensions, filling empty space with data from the image. If crop, will resize the image to fit within the specified width and height, maintaining the aspect ratio, and then center the image within the dimensions, cropping the excess. Note that resize_mode fill and crop are only supported for PIL image input.
- crops_coords (List[Tuple[int, int, int, int]], optional, defaults to None) — The crop coordinates for each image in the batch. If None, will not crop the image.
Returns
Output will contain the resized input image(s) and respective VAE reconstruction(s).
The pipeline function for image inversion, as described in the LEDITS++ paper. If the scheduler is set to DDIMScheduler, the inversion proposed by edit-friendly DDPM will be performed instead.
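The interplay of num_inversion_steps and skip can be sketched numerically. The helper plan_inversion below is hypothetical, and it assumes that the schedule is built with int(num_inversion_steps * (1 + skip)) steps of which the last num_inversion_steps are actually inverted; this matches the wording "after discarding the initial skip steps" but is not confirmed by this page.

```python
def plan_inversion(num_inversion_steps=30, skip=0.15):
    """Return (schedule_length, skipped_steps, performed_steps) under the stated assumption."""
    if not 0 <= skip <= 1:
        raise ValueError("skip has to be between 0 and 1")
    # Assumed schedule length: the requested steps plus the skipped portion.
    total = int(num_inversion_steps * (1 + skip))
    # The requested number of steps is what actually gets inverted.
    performed = num_inversion_steps
    return total, total - performed, performed


total, skipped, performed = plan_inversion(num_inversion_steps=50, skip=0.1)
print(f"schedule={total}, skipped={skipped}, performed={performed}")
```

Under this reading, a larger skip leaves more of the noisiest steps untouched, which is consistent with the note that lower values lead to stronger changes to the input image.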
disable_vae_slicing
Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method will go back to computing decoding in one step.
disable_vae_tiling
Disable tiled VAE decoding. If enable_vae_tiling was previously enabled, this method will go back to computing decoding in one step.
enable_vae_slicing
Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
enable_vae_tiling
Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow processing larger images.
encode_prompt
< source >( device num_images_per_prompt enable_edit_guidance negative_prompt = None editing_prompt = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None editing_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None )
Parameters
- device (torch.device) — torch device
- num_images_per_prompt (int) — number of images that should be generated per prompt
- enable_edit_guidance (bool) — whether to perform any editing or reconstruct the input image instead
- negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
- editing_prompt (str or List[str], optional) — Editing prompt(s) to be encoded. If not defined, one has to pass editing_prompt_embeds instead.
- editing_prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the editing_prompt input argument.
- negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
- lora_scale (float, optional) — A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
- clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
Encodes the prompt into text encoder hidden states.
LEditsPPPipelineStableDiffusionXL
class diffusers.LEditsPPPipelineStableDiffusionXL
< source >( vae: AutoencoderKL text_encoder: CLIPTextModel text_encoder_2: CLIPTextModelWithProjection tokenizer: CLIPTokenizer tokenizer_2: CLIPTokenizer unet: UNet2DConditionModel scheduler: typing.Union[diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler, diffusers.schedulers.scheduling_ddim.DDIMScheduler] image_encoder: CLIPVisionModelWithProjection = None feature_extractor: CLIPImageProcessor = None force_zeros_for_empty_prompt: bool = True add_watermarker: typing.Optional[bool] = None )
Parameters
- vae (AutoencoderKL) — Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
- text_encoder (CLIPTextModel) — Frozen text-encoder. Stable Diffusion XL uses the text portion of CLIP, specifically the clip-vit-large-patch14 variant.
- text_encoder_2 (CLIPTextModelWithProjection) — Second frozen text-encoder. Stable Diffusion XL uses the text and pool portion of CLIP, specifically the laion/CLIP-ViT-bigG-14-laion2B-39B-b160k variant.
- tokenizer (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
- tokenizer_2 (CLIPTokenizer) — Second Tokenizer of class CLIPTokenizer.
- unet (UNet2DConditionModel) — Conditional U-Net architecture to denoise the encoded image latents.
- scheduler (DPMSolverMultistepScheduler or DDIMScheduler) — A scheduler to be used in combination with unet to denoise the encoded image latents. Can be one of DPMSolverMultistepScheduler or DDIMScheduler. If any other scheduler is passed, it will automatically be set to DPMSolverMultistepScheduler.
- force_zeros_for_empty_prompt (bool, optional, defaults to True) — Whether the negative prompt embeddings shall be forced to always be set to 0. Also see the config of stabilityai/stable-diffusion-xl-base-1.0.
- add_watermarker (bool, optional) — Whether to use the invisible_watermark library to watermark output images. If not defined, it will default to True if the package is installed, otherwise no watermarker will be used.
Pipeline for textual image editing using LEDITS++ with Stable Diffusion XL.
This model inherits from DiffusionPipeline and builds on the StableDiffusionXLPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
In addition the pipeline inherits the following loading methods:
- LoRA: LEditsPPPipelineStableDiffusionXL.load_lora_weights()
- Ckpt: loaders.FromSingleFileMixin.from_single_file()
as well as the following saving methods:
- LoRA: loaders.StableDiffusionXLPipeline.save_lora_weights
__call__
< source >( denoising_end: typing.Optional[float] = None negative_prompt: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_2: typing.Union[str, typing.List[str], NoneType] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None ip_adapter_image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor], NoneType] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None guidance_rescale: float = 0.0 crops_coords_top_left: typing.Tuple[int, int] = (0, 0) target_size: typing.Optional[typing.Tuple[int, int]] = None editing_prompt: typing.Union[str, typing.List[str], NoneType] = None editing_prompt_embeddings: typing.Optional[torch.Tensor] = None editing_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None reverse_editing_direction: typing.Union[bool, typing.List[bool], NoneType] = False edit_guidance_scale: typing.Union[float, typing.List[float], NoneType] = 5 edit_warmup_steps: typing.Union[int, typing.List[int], NoneType] = 0 edit_cooldown_steps: typing.Union[int, typing.List[int], NoneType] = None edit_threshold: typing.Union[float, typing.List[float], NoneType] = 0.9 sem_guidance: typing.Optional[typing.List[torch.Tensor]] = None use_cross_attn_mask: bool = False use_intersect_mask: bool = False user_mask: typing.Optional[torch.Tensor] = None attn_store_steps: typing.Optional[typing.List[int]] = [] store_averaged_over_steps: bool = True clip_skip: typing.Optional[int] = None callback_on_step_end: typing.Optional[typing.Callable[[int, int, typing.Dict], NoneType]] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] **kwargs ) → LEditsPPDiffusionPipelineOutput or tuple
Parameters
- denoising_end (float, optional) — When specified, determines the fraction (between 0.0 and 1.0) of the total denoising process to be completed before it is intentionally prematurely terminated. As a result, the returned sample will still retain a substantial amount of noise as determined by the discrete timesteps selected by the scheduler. The denoising_end parameter should ideally be utilized when this pipeline forms a part of a “Mixture of Denoisers” multi-pipeline setup, as elaborated in [**Refining the Image
- negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
- negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text encoders.
- negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
- negative_pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated from the negative_prompt input argument.
- ip_adapter_image (PipelineImageInput, optional) — Optional image input to work with IP Adapters.
- output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image.Image and np.array.
- return_dict (bool, optional, defaults to True) — Whether or not to return a LEditsPPDiffusionPipelineOutput instead of a plain tuple.
- callback (Callable, optional) — A function that will be called every callback_steps steps during inference. The function will be called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
- callback_steps (int, optional, defaults to 1) — The frequency at which the callback function will be called. If not specified, the callback will be called at every step.
- cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
- guidance_rescale (float, optional, defaults to 0.0) — Guidance rescale factor proposed by Common Diffusion Noise Schedules and Sample Steps are Flawed. guidance_rescale is defined as φ in equation 16 of Common Diffusion Noise Schedules and Sample Steps are Flawed. Guidance rescale factor should fix overexposure when using zero terminal SNR.
- crops_coords_top_left (Tuple[int], optional, defaults to (0, 0)) — crops_coords_top_left can be used to generate an image that appears to be “cropped” from the position crops_coords_top_left downwards. Favorable, well-centered images are usually achieved by setting crops_coords_top_left to (0, 0). Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952.
- target_size (Tuple[int], optional, defaults to (1024, 1024)) — For most cases, target_size should be set to the desired height and width of the generated image. If not specified it will default to (width, height). Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952.
- editing_prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. The image is reconstructed by setting editing_prompt = None. Guidance direction of prompt should be specified via reverse_editing_direction.
- editing_prompt_embeddings (torch.Tensor, optional) — Pre-generated edit text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, editing_prompt_embeddings will be generated from the editing_prompt input argument.
- editing_pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated pooled edit text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, the pooled edit embeddings will be generated from the editing_prompt input argument.
- reverse_editing_direction (bool or List[bool], optional, defaults to False) — Whether the corresponding prompt in editing_prompt should be increased or decreased.
- edit_guidance_scale (float or List[float], optional, defaults to 5) — Guidance scale for guiding the image generation. If provided as a list, values should correspond to editing_prompt. edit_guidance_scale is defined as s_e of equation 12 of the LEDITS++ paper.
- edit_warmup_steps (int or List[int], optional, defaults to 0) — Number of diffusion steps (for each prompt) for which guidance is not applied.
- edit_cooldown_steps (int or List[int], optional, defaults to None) — Number of diffusion steps (for each prompt) after which guidance is no longer applied.
- edit_threshold (float or List[float], optional, defaults to 0.9) — Masking threshold of guidance. Threshold should be proportional to the image region that is modified. ‘edit_threshold’ is defined as ‘λ’ of equation 12 of the LEDITS++ paper.
- sem_guidance (List[torch.Tensor], optional) — List of pre-generated guidance vectors to be applied at generation. Length of the list has to correspond to num_inference_steps.
- use_cross_attn_mask (bool, defaults to False) — Whether cross-attention masks are used. Cross-attention masks are always used when use_intersect_mask is set to true. Cross-attention masks are defined as ‘M^1’ of equation 12 of the LEDITS++ paper.
- use_intersect_mask (bool, defaults to False) — Whether the masking term is calculated as the intersection of cross-attention masks and masks derived from the noise estimate. Cross-attention masks are defined as ‘M^1’ and masks derived from the noise estimate are defined as ‘M^2’ of equation 12 of the LEDITS++ paper.
- user_mask (torch.Tensor, optional) — User-provided mask for even better control over the editing process. This is helpful when LEDITS++'s implicit masks do not meet user preferences.
- attn_store_steps (List[int], optional) — Steps for which the attention maps are stored in the AttentionStore. Just for visualization purposes.
- store_averaged_over_steps (bool, defaults to True) — Whether the attention maps for the ‘attn_store_steps’ are stored averaged over the diffusion steps. If False, attention maps for each step are stored separately. Just for visualization purposes.
- clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
- callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
- callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
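The masking parameters above (edit_threshold, use_cross_attn_mask, use_intersect_mask) can be illustrated with a toy example on flattened lists. This is only a sketch: it assumes that the threshold acts as a quantile over magnitudes and that both masks are thresholded the same way, neither of which is confirmed by this page. The helper names and all numeric values are invented for illustration.

```python
def quantile_mask(values, threshold):
    """Binary mask keeping entries at or above the `threshold`-quantile of magnitudes."""
    mags = sorted(abs(v) for v in values)
    # Nearest-rank cut-off; a real implementation may interpolate instead.
    cut = mags[min(int(threshold * len(mags)), len(mags) - 1)]
    return [1 if abs(v) >= cut else 0 for v in values]


def intersect(m1, m2):
    """Element-wise intersection of two binary masks."""
    return [a & b for a, b in zip(m1, m2)]


# Invented toy data: 8 flattened "pixels".
cross_attention = [0.9, 0.8, 0.1, 0.1, 0.2, 0.7, 0.3, 0.0]  # stands in for the M^1 source
noise_estimate = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6]  # stands in for the M^2 source

m1 = quantile_mask(cross_attention, threshold=0.75)
m2 = quantile_mask(noise_estimate, threshold=0.75)
print(m1, m2, intersect(m1, m2))
```

In this toy setting, the intersection keeps only positions that both masks select, so the edit would be confined to a smaller region than either mask alone.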
Returns
LEditsPPDiffusionPipelineOutput or tuple
LEditsPPDiffusionPipelineOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is a list with the generated images.
The call function to the pipeline for editing. The invert() method has to be called beforehand. Edits will always be performed for the last inverted image(s).
Examples:
>>> import torch
>>> from diffusers import LEditsPPPipelineStableDiffusionXL
>>> from diffusers.utils import load_image
>>> pipe = LEditsPPPipelineStableDiffusionXL.from_pretrained(
... "stabilityai/stable-diffusion-xl-base-1.0", variant="fp16", torch_dtype=torch.float16
... )
>>> pipe.enable_vae_tiling()
>>> pipe = pipe.to("cuda")
>>> img_url = "https://www.aiml.informatik.tu-darmstadt.de/people/mbrack/tennis.jpg"
>>> image = load_image(img_url).resize((1024, 1024))
>>> _ = pipe.invert(image=image, num_inversion_steps=50, skip=0.2)
>>> edited_image = pipe(
... editing_prompt=["tennis ball", "tomato"],
... reverse_editing_direction=[True, False],
... edit_guidance_scale=[5.0, 10.0],
... edit_threshold=[0.9, 0.85],
... ).images[0]
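The example above passes one value per editing prompt. The following sketch shows how such per-prompt settings line up, and how a scalar could be broadcast to all prompts. The helpers as_list and signed_scales are hypothetical, and representing a reversed direction as a negative scale is only an illustration, not a description of how the pipeline computes guidance internally.

```python
def as_list(value, n):
    """Broadcast a scalar setting to one entry per editing prompt; pass lists through."""
    if isinstance(value, (list, tuple)):
        if len(value) != n:
            raise ValueError(f"expected {n} values, got {len(value)}")
        return list(value)
    return [value] * n


def signed_scales(prompts, scales, reverse):
    """Pair each prompt with a signed scale; negative means guiding away from the concept."""
    n = len(prompts)
    scales, reverse = as_list(scales, n), as_list(reverse, n)
    return {p: (-s if r else s) for p, s, r in zip(prompts, scales, reverse)}


plan = signed_scales(["tennis ball", "tomato"], [5.0, 10.0], [True, False])
print(plan)
```

Read this way, the example asks the pipeline to move away from "tennis ball" while moving toward "tomato", each with its own strength and threshold.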
invert
< source >( image: typing.Union[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List[PIL.Image.Image], typing.List[numpy.ndarray], typing.List[torch.Tensor]] source_prompt: str = '' source_guidance_scale = 3.5 negative_prompt: str = None negative_prompt_2: str = None num_inversion_steps: int = 50 skip: float = 0.15 generator: typing.Optional[torch._C.Generator] = None crops_coords_top_left: typing.Tuple[int, int] = (0, 0) num_zero_noise_steps: int = 3 cross_attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None height: typing.Optional[int] = None width: typing.Optional[int] = None resize_mode: typing.Optional[str] = 'default' crops_coords: typing.Optional[typing.Tuple[int, int, int, int]] = None ) → LEditsPPInversionPipelineOutput
Parameters
- image (PipelineImageInput) — Input for the image(s) that are to be edited. Multiple input images must have the same aspect ratio.
- source_prompt (str, defaults to "") — Prompt describing the input image that will be used for guidance during inversion. Guidance is disabled if the source_prompt is "".
- source_guidance_scale (float, defaults to 3.5) — Strength of guidance during inversion.
- negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
- negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text encoders.
- num_inversion_steps (int, defaults to 50) — Number of total performed inversion steps after discarding the initial skip steps.
- skip (float, defaults to 0.15) — Portion of initial steps that will be ignored for inversion and subsequent generation. Lower values will lead to stronger changes to the input image. skip has to be between 0 and 1.
- generator (torch.Generator, optional) — A torch.Generator to make inversion deterministic.
- crops_coords_top_left (Tuple[int], optional, defaults to (0, 0)) — crops_coords_top_left can be used to generate an image that appears to be “cropped” from the position crops_coords_top_left downwards. Favorable, well-centered images are usually achieved by setting crops_coords_top_left to (0, 0). Part of SDXL’s micro-conditioning as explained in section 2.2 of https://huggingface.co/papers/2307.01952.
- num_zero_noise_steps (int, defaults to 3) — Number of final diffusion steps that will not renoise the current image. If no steps are set to zero, SD-XL in combination with DPMSolverMultistepScheduler will produce noise artifacts.
- cross_attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
Returns
Output will contain the resized input image(s) and respective VAE reconstruction(s).
The pipeline function for image inversion as described in the LEDITS++ paper. If the scheduler is set to DDIMScheduler, the inversion proposed by edit-friendly DDPM will be performed instead.
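To make the relationship between num_inversion_steps and skip more concrete, here is a minimal sketch of the step arithmetic. It assumes that the scheduler is set up with int(num_inversion_steps * (1 + skip)) timesteps and that only the last num_inversion_steps of them are inverted. That schedule is an assumption made for illustration and is not stated on this page; inversion_step_budget is a hypothetical helper, not part of diffusers.

```python
def inversion_step_budget(num_inversion_steps: int = 50, skip: float = 0.15) -> dict:
    """Hypothetical helper (not a diffusers API): count scheduled, skipped and inverted steps.

    Assumption: the schedule holds int(num_inversion_steps * (1 + skip)) timesteps
    and the initial (noisiest) ones are discarded, leaving num_inversion_steps.
    """
    if not 0.0 <= skip <= 1.0:
        raise ValueError("skip has to be between 0 and 1")
    scheduled = int(num_inversion_steps * (1 + skip))  # assumed schedule length
    return {
        "scheduled": scheduled,
        "skipped": scheduled - num_inversion_steps,  # initial steps that are ignored
        "inverted": num_inversion_steps,  # steps that are actually performed
    }


# Same arguments as the invert call in the example above.
budget = inversion_step_budget(num_inversion_steps=50, skip=0.2)
print(budget)
```

Under this assumption, the number of performed inversion steps stays fixed at num_inversion_steps, and skip only controls how many of the initial steps are left out.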
disable_vae_slicing
Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method will go back to computing decoding in one step.
disable_vae_tiling
Disable tiled VAE decoding. If enable_vae_tiling was previously enabled, this method will go back to computing decoding in one step.
enable_vae_slicing
Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
enable_vae_tiling
Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow processing larger images.
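The idea behind tiled processing can be illustrated with a small, library-independent sketch that only computes the tile boundaries for an image. The tile size of 512 and overlap of 64 are illustrative assumptions, and tile_boxes is a hypothetical helper; the actual VAE uses its own configuration and additionally blends the overlapping regions, which this sketch does not show.

```python
def tile_boxes(height: int, width: int, tile: int = 512, overlap: int = 64) -> list:
    """Hypothetical helper: (top, left, bottom, right) boxes that cover the image.

    Neighbouring tiles share `overlap` pixels; tiles at the border are clipped
    to the image size. Values are illustrative, not the VAE defaults.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than tile")
    stride = tile - overlap  # distance between the origins of neighbouring tiles
    boxes = []
    for top in range(0, height, stride):
        for left in range(0, width, stride):
            boxes.append((top, left, min(top + tile, height), min(left + tile, width)))
    return boxes


boxes = tile_boxes(1024, 1024)
print(len(boxes), boxes[0], boxes[-1])
```

Each box can then be processed on its own, so peak memory depends on the tile size rather than on the full image size.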
encode_prompt
( device: typing.Optional[torch.device] = None num_images_per_prompt: int = 1 negative_prompt: typing.Optional[str] = None negative_prompt_2: typing.Optional[str] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None lora_scale: typing.Optional[float] = None clip_skip: typing.Optional[int] = None enable_edit_guidance: bool = True editing_prompt: typing.Optional[str] = None editing_prompt_embeds: typing.Optional[torch.Tensor] = None editing_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None )
Parameters
- device (torch.device) — Torch device.
- num_images_per_prompt (int) — Number of images that should be generated per prompt.
- negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead.
- negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the image generation to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used in both text-encoders.
- negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
- negative_pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated from the negative_prompt input argument.
- lora_scale (float, optional) — A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
- clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
- enable_edit_guidance (bool) — Whether to guide towards an editing prompt or not.
- editing_prompt (str or List[str], optional) — Editing prompt(s) to be encoded. If not defined and enable_edit_guidance is True, one has to pass editing_prompt_embeds instead.
- editing_prompt_embeds (torch.Tensor, optional) — Pre-generated edit text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided and enable_edit_guidance is True, editing_prompt_embeds will be generated from the editing_prompt input argument.
- editing_pooled_prompt_embeds (torch.Tensor, optional) — Pre-generated edit pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, editing_pooled_prompt_embeds will be generated from the editing_prompt input argument.
Encodes the prompt into text encoder hidden states.
get_guidance_scale_embedding
( w: Tensor embedding_dim: int = 512 dtype: dtype = torch.float32 ) → torch.Tensor
Parameters
- w (torch.Tensor) — Guidance scale(s) for which embedding vectors are generated, to subsequently enrich timestep embeddings.
- embedding_dim (int, optional, defaults to 512) — Dimension of the embeddings to generate.
- dtype (torch.dtype, optional, defaults to torch.float32) — Data type of the generated embeddings.
Returns
torch.Tensor — Embedding vectors with shape (len(w), embedding_dim).
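As an illustration of the documented output shape (len(w), embedding_dim), the following pure-Python sketch builds a sinusoidal embedding for a list of guidance scales. The frequency base of 10000 and the scaling of w by 1000 are assumptions that follow the common sinusoidal timestep-embedding recipe; they are not confirmed by this page. The function name guidance_scale_embedding is hypothetical and works on plain lists rather than tensors.

```python
import math


def guidance_scale_embedding(w: list, embedding_dim: int = 512) -> list:
    """Hypothetical sketch: one sinusoidal embedding row per guidance scale in w."""
    if embedding_dim < 4:
        raise ValueError("embedding_dim must be at least 4 for this sketch")
    half_dim = embedding_dim // 2
    # Geometrically spaced frequencies from 1 down to 1/10000 (assumed base).
    step = math.log(10000.0) / (half_dim - 1)
    freqs = [math.exp(-step * i) for i in range(half_dim)]
    rows = []
    for scale in w:
        angles = [1000.0 * scale * f for f in freqs]  # assumed scaling of w
        row = [math.sin(a) for a in angles] + [math.cos(a) for a in angles]
        if embedding_dim % 2 == 1:
            row.append(0.0)  # zero-pad odd dimensions to the requested width
        rows.append(row)
    return rows


emb = guidance_scale_embedding([5.0, 10.0], embedding_dim=8)
print(len(emb), len(emb[0]))
```

The first half of each row holds the sine terms and the second half the matching cosine terms, so every value lies in the range [-1, 1].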
LEditsPPDiffusionPipelineOutput
class diffusers.pipelines.LEditsPPDiffusionPipelineOutput
( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] nsfw_content_detected: typing.Optional[typing.List[bool]] )
Parameters
- images (List[PIL.Image.Image] or np.ndarray) — List of denoised PIL images of length batch_size or NumPy array of shape (batch_size, height, width, num_channels).
- nsfw_content_detected (List[bool]) — List indicating whether the corresponding generated image contains “not-safe-for-work” (nsfw) content or None if safety checking could not be performed.
Output class for LEdits++ Diffusion pipelines.
LEditsPPInversionPipelineOutput
class diffusers.pipelines.LEditsPPInversionPipelineOutput
( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] vae_reconstruction_images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] )
Parameters
- images (List[PIL.Image.Image] or np.ndarray) — List of the cropped and resized input images as PIL images of length batch_size or NumPy array of shape (batch_size, height, width, num_channels).
- vae_reconstruction_images (List[PIL.Image.Image] or np.ndarray) — List of VAE reconstructions of all input images as PIL images of length batch_size or NumPy array of shape (batch_size, height, width, num_channels).
Output class for the inversion step of LEdits++ Diffusion pipelines.