Kandinsky 3
Kandinsky 3 was created by Vladimir Arkhipkin, Anastasia Maltseva, Igor Pavlov, Andrei Filatov, Arseniy Shakhmatov, Andrey Kuznetsov, Denis Dimitrov, and Zein Shaheen.
The description from its GitHub page:
Kandinsky 3.0 is an open-source text-to-image diffusion model built upon the Kandinsky2-x model family. In comparison to its predecessors, enhancements have been made to the text understanding and visual quality of the model, achieved by increasing the size of the text encoder and Diffusion U-Net models, respectively.
Its architecture includes 3 main components:
- FLAN-UL2, which is an encoder decoder model based on the T5 architecture.
- New U-Net architecture featuring BigGAN-deep blocks doubles depth while maintaining the same number of parameters.
- Sber-MoVQGAN is a decoder proven to have superior results in image restoration.
The original codebase can be found at ai-forever/Kandinsky-3.
Check out the Kandinsky Community organization on the Hub for the official model checkpoints for tasks like text-to-image, image-to-image, and inpainting.
Make sure to check out the schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
Kandinsky3Pipeline
class diffusers.Kandinsky3Pipeline
( tokenizer: T5Tokenizer, text_encoder: T5EncoderModel, unet: Kandinsky3UNet, scheduler: DDPMScheduler, movq: VQModel )
__call__
( prompt: Union = None, num_inference_steps: int = 25, guidance_scale: float = 3.0, negative_prompt: Union = None, num_images_per_prompt: Optional = 1, height: Optional = 1024, width: Optional = 1024, generator: Union = None, prompt_embeds: Optional = None, negative_prompt_embeds: Optional = None, attention_mask: Optional = None, negative_attention_mask: Optional = None, output_type: Optional = 'pil', return_dict: bool = True, latents = None, callback_on_step_end: Optional = None, callback_on_step_end_tensor_inputs: List = ['latents'], **kwargs ) → ImagePipelineOutput or tuple
Parameters
-  prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds instead.
-  num_inference_steps (int, optional, defaults to 25) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
-  timesteps (List[int], optional) — Custom timesteps to use for the denoising process. If not defined, equally spaced num_inference_steps timesteps are used. Must be in descending order.
-  guidance_scale (float, optional, defaults to 3.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
-  negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
-  num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
-  height (int, optional, defaults to 1024) — The height in pixels of the generated image.
-  width (int, optional, defaults to 1024) — The width in pixels of the generated image.
-  eta (float, optional, defaults to 0.0) — Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to schedulers.DDIMScheduler, will be ignored for others.
-  generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
-  prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
-  negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
-  attention_mask (torch.Tensor, optional) — Pre-generated attention mask. Must be provided if passing prompt_embeds directly.
-  negative_attention_mask (torch.Tensor, optional) — Pre-generated negative attention mask. Must be provided if passing negative_prompt_embeds directly.
-  output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image.Image and np.array.
-  return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.
-  callback (Callable, optional) — A function that will be called every callback_steps steps during inference. The function will be called with the following arguments: callback(step: int, timestep: int, latents: torch.Tensor).
-  callback_steps (int, optional, defaults to 1) — The frequency at which the callback function will be called. If not specified, the callback will be called at every step.
-  clean_caption (bool, optional, defaults to True) — Whether or not to clean the caption before creating embeddings. Requires beautifulsoup4 and ftfy to be installed. If the dependencies are not installed, the embeddings will be created from the raw prompt.
-  cross_attention_kwargs (dict, optional) — A kwargs dictionary that if specified is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
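The effect of guidance_scale described above can be illustrated with a small sketch. It uses the generic classifier-free guidance rule, eps = eps_uncond + w * (eps_text - eps_uncond), on plain Python lists. The helper name apply_guidance and the toy values are assumptions for illustration, and the exact expression used inside the pipeline may differ.

```python
from typing import List


def apply_guidance(noise_uncond: List[float], noise_text: List[float], guidance_scale: float) -> List[float]:
    """Combine unconditional and text-conditioned noise predictions.

    Hypothetical helper: it mirrors the generic classifier-free guidance rule,
    not necessarily the exact code path of the pipeline.
    """
    if len(noise_uncond) != len(noise_text):
        raise ValueError("both predictions must have the same length")
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


# Toy predictions standing in for the two U-Net outputs.
uncond = [0.0, 1.0, -1.0]
text = [1.0, 1.0, 0.0]

# A scale of 1 returns the text-conditioned prediction unchanged.
no_guidance = apply_guidance(uncond, text, 1.0)

# A scale of 3 moves the result further away from the unconditional prediction.
guided = apply_guidance(uncond, text, 3.0)
```

Under this rule, values above 1 amplify the difference between the two predictions, which is why guidance only has an effect when guidance_scale > 1.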
Returns
ImagePipelineOutput or tuple
Function invoked when calling the pipeline for generation.
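As a rough illustration of the timesteps parameter, the sketch below builds an evenly spaced schedule in descending order. The helper name evenly_spaced_timesteps, the 1000 training timesteps, and the spacing rule are assumptions; real schedulers may space their timesteps differently.

```python
from typing import List


def evenly_spaced_timesteps(num_inference_steps: int, num_train_timesteps: int = 1000) -> List[int]:
    """Return num_inference_steps evenly spaced timesteps, largest first.

    Hypothetical helper; 1000 training timesteps is an assumed value.
    """
    if not 0 < num_inference_steps <= num_train_timesteps:
        raise ValueError("num_inference_steps must be in (0, num_train_timesteps]")
    step = num_train_timesteps // num_inference_steps
    ascending = [i * step for i in range(num_inference_steps)]
    # Denoising starts from the noisiest timestep, so reverse the list.
    return ascending[::-1]


timesteps = evenly_spaced_timesteps(25)
```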
Examples:
>>> from diffusers import AutoPipelineForText2Image
>>> import torch
>>> pipe = AutoPipelineForText2Image.from_pretrained(
...     "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
... )
>>> pipe.enable_model_cpu_offload()
>>> prompt = "A photograph of the inside of a subway train. There are raccoons sitting on the seats. One of them is reading a newspaper. The window shows the city in the background."
>>> generator = torch.Generator(device="cpu").manual_seed(0)
>>> image = pipe(prompt, num_inference_steps=25, generator=generator).images[0]
encode_prompt
( prompt, do_classifier_free_guidance = True, num_images_per_prompt = 1, device = None, negative_prompt = None, prompt_embeds: Optional = None, negative_prompt_embeds: Optional = None, _cut_context = False, attention_mask: Optional = None, negative_attention_mask: Optional = None )
Parameters
-  prompt (str or List[str], optional) — prompt to be encoded
-  device (torch.device, optional) — torch device to place the resulting embeddings on
-  num_images_per_prompt (int, optional, defaults to 1) — number of images that should be generated per prompt
-  do_classifier_free_guidance (bool, optional, defaults to True) — whether to use classifier free guidance or not
-  negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
-  prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
-  negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
-  attention_mask (torch.Tensor, optional) — Pre-generated attention mask. Must be provided if passing prompt_embeds directly.
-  negative_attention_mask (torch.Tensor, optional) — Pre-generated negative attention mask. Must be provided if passing negative_prompt_embeds directly.
Encodes the prompt into text encoder hidden states.
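The pairing of embeddings and attention masks described above can be sketched as follows. This assumes that encode_prompt returns a 4-tuple in the order (prompt_embeds, negative_prompt_embeds, attention_mask, negative_attention_mask), which this page does not state. precompute_text_inputs is a hypothetical helper, and pipe stands for an already loaded Kandinsky 3 pipeline.

```python
from typing import Any, Dict, Optional


def precompute_text_inputs(pipe: Any, prompt: str, negative_prompt: Optional[str] = None) -> Dict[str, Any]:
    """Encode a prompt once so the result can be reused across several calls.

    Hypothetical helper. Assumption: encode_prompt returns the embeddings and
    masks as a 4-tuple in the order unpacked below.
    """
    prompt_embeds, negative_prompt_embeds, attention_mask, negative_attention_mask = pipe.encode_prompt(
        prompt,
        do_classifier_free_guidance=True,
        negative_prompt=negative_prompt,
    )
    # Embeddings and masks travel together: a mask must be provided
    # whenever the matching embeddings are passed directly.
    return {
        "prompt_embeds": prompt_embeds,
        "negative_prompt_embeds": negative_prompt_embeds,
        "attention_mask": attention_mask,
        "negative_attention_mask": negative_attention_mask,
    }
```

The returned dictionary might then be passed to the pipeline in place of a text prompt, for example pipe(**precompute_text_inputs(pipe, "a red fox"), num_inference_steps=25).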
Kandinsky3Img2ImgPipeline
class diffusers.Kandinsky3Img2ImgPipeline
( tokenizer: T5Tokenizer, text_encoder: T5EncoderModel, unet: Kandinsky3UNet, scheduler: DDPMScheduler, movq: VQModel )
__call__
( prompt: Union = None, image: Union = None, strength: float = 0.3, num_inference_steps: int = 25, guidance_scale: float = 3.0, negative_prompt: Union = None, num_images_per_prompt: Optional = 1, generator: Union = None, prompt_embeds: Optional = None, negative_prompt_embeds: Optional = None, attention_mask: Optional = None, negative_attention_mask: Optional = None, output_type: Optional = 'pil', return_dict: bool = True, callback_on_step_end: Optional = None, callback_on_step_end_tensor_inputs: List = ['latents'], **kwargs ) → ImagePipelineOutput or tuple
Parameters
-  prompt (str or List[str], optional) — The prompt or prompts to guide the image generation. If not defined, one has to pass prompt_embeds instead.
-  image (torch.Tensor, PIL.Image.Image, np.ndarray, List[torch.Tensor], List[PIL.Image.Image], or List[np.ndarray]) — Image, or tensor representing an image batch, that will be used as the starting point for the process.
-  strength (float, optional, defaults to 0.3) — Indicates the extent to which the reference image is transformed. Must be between 0 and 1. image is used as a starting point, and more noise is added the higher the strength. The number of denoising steps depends on the amount of noise initially added. When strength is 1, added noise is maximum and the denoising process runs for the full number of iterations specified in num_inference_steps. A value of 1 essentially ignores image.
-  num_inference_steps (int, optional, defaults to 25) — The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
-  guidance_scale (float, optional, defaults to 3.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate images that are closely linked to the text prompt, usually at the expense of lower image quality.
-  negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
-  num_images_per_prompt (int, optional, defaults to 1) — The number of images to generate per prompt.
-  generator (torch.Generator or List[torch.Generator], optional) — One or a list of torch generator(s) to make generation deterministic.
-  prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
-  negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
-  attention_mask (torch.Tensor, optional) — Pre-generated attention mask. Must be provided if passing prompt_embeds directly.
-  negative_attention_mask (torch.Tensor, optional) — Pre-generated negative attention mask. Must be provided if passing negative_prompt_embeds directly.
-  output_type (str, optional, defaults to "pil") — The output format of the generated image. Choose between PIL.Image.Image and np.array.
-  return_dict (bool, optional, defaults to True) — Whether or not to return an ImagePipelineOutput instead of a plain tuple.
-  callback_on_step_end (Callable, optional) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
-  callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
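To make the strength description concrete, the sketch below estimates how many denoising steps actually run for a given strength. The helper effective_steps and its truncation rule are assumptions modeled on the usual image-to-image convention of skipping the first part of the schedule; the pipeline's internal bookkeeping may differ.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Estimate how many denoising steps run for a given strength.

    Hypothetical helper; the truncation rule is an assumption.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


steps_signature_default = effective_steps(25, 0.3)  # strength default from the signature
steps_doc_example = effective_steps(25, 0.75)  # strength used in the usage example
steps_full = effective_steps(25, 1.0)  # strength of 1 runs every step
```

Under this assumption, a low strength keeps most of the input image because only the last few denoising steps are applied.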
Returns
ImagePipelineOutput or tuple
Function invoked when calling the pipeline for generation.
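A callback matching the callback_on_step_end description might look like the sketch below. The names log_latents_callback and seen_steps are illustrative, returning callback_kwargs is an assumption about what the pipeline expects back, and the loop at the end is only a stand-in for the denoising loop.

```python
from typing import Any, Dict, List

# Steps observed by the callback, kept here only for demonstration.
seen_steps: List[int] = []


def log_latents_callback(pipe: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Record the step index and hand the tensors back unchanged.

    callback_kwargs holds the tensors named in callback_on_step_end_tensor_inputs.
    """
    seen_steps.append(step)
    # With the default tensor inputs, "latents" is the only expected key.
    _ = callback_kwargs.get("latents")
    return callback_kwargs


# Stand-in for the denoising loop, only to show the calling convention.
for step, timestep in enumerate([999, 500, 0]):
    returned = log_latents_callback(None, step, timestep, {"latents": [0.0]})
```

With a real pipeline, the function would be passed as callback_on_step_end=log_latents_callback together with callback_on_step_end_tensor_inputs=["latents"].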
Examples:
>>> from diffusers import AutoPipelineForImage2Image
>>> from diffusers.utils import load_image
>>> import torch
>>> pipe = AutoPipelineForImage2Image.from_pretrained(
...     "kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16
... )
>>> pipe.enable_model_cpu_offload()
>>> prompt = "A painting of the inside of a subway train with tiny raccoons."
>>> image = load_image(
...     "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky3/t2i.png"
... )
>>> generator = torch.Generator(device="cpu").manual_seed(0)
>>> image = pipe(prompt, image=image, strength=0.75, num_inference_steps=25, generator=generator).images[0]
encode_prompt
( prompt, do_classifier_free_guidance = True, num_images_per_prompt = 1, device = None, negative_prompt = None, prompt_embeds: Optional = None, negative_prompt_embeds: Optional = None, _cut_context = False, attention_mask: Optional = None, negative_attention_mask: Optional = None )
Parameters
-  device (torch.device, optional) — torch device to place the resulting embeddings on
-  num_images_per_prompt (int, optional, defaults to 1) — number of images that should be generated per prompt
-  do_classifier_free_guidance (bool, optional, defaults to True) — whether to use classifier free guidance or not
-  negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
-  prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from the prompt input argument.
-  negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
-  attention_mask (torch.Tensor, optional) — Pre-generated attention mask. Must be provided if passing prompt_embeds directly.
-  negative_attention_mask (torch.Tensor, optional) — Pre-generated negative attention mask. Must be provided if passing negative_prompt_embeds directly.
Encodes the prompt into text encoder hidden states.