Diffusers

You are viewing v0.33.1 version. A newer version v0.36.0 is available.

WanTransformer3DModel

A Diffusion Transformer model for 3D video-like data was introduced in Wan 2.1 by the Alibaba Wan Team.

The model can be loaded with the following code snippet.

import torch
from diffusers import WanTransformer3DModel

# Load only the transformer component of the pipeline checkpoint in bfloat16.
transformer = WanTransformer3DModel.from_pretrained("Wan-AI/Wan2.1-T2V-1.3B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)

WanTransformer3DModel

class diffusers.WanTransformer3DModel

( patch_size: typing.Tuple[int] = (1, 2, 2), num_attention_heads: int = 40, attention_head_dim: int = 128, in_channels: int = 16, out_channels: int = 16, text_dim: int = 4096, freq_dim: int = 256, ffn_dim: int = 13824, num_layers: int = 40, cross_attn_norm: bool = True, qk_norm: typing.Optional[str] = 'rms_norm_across_heads', eps: float = 1e-06, image_dim: typing.Optional[int] = None, added_kv_proj_dim: typing.Optional[int] = None, rope_max_seq_len: int = 1024 )

Parameters

patch_size (Tuple[int], defaults to (1, 2, 2)) — 3D patch dimensions for video embedding (t_patch, h_patch, w_patch).
num_attention_heads (int, defaults to 40) — The number of attention heads.
attention_head_dim (int, defaults to 128) — The number of channels in each head.
in_channels (int, defaults to 16) — The number of channels in the input.
out_channels (int, defaults to 16) — The number of channels in the output.
text_dim (int, defaults to 4096) — Input dimension for text embeddings.
freq_dim (int, defaults to 256) — Dimension for sinusoidal time embeddings.
ffn_dim (int, defaults to 13824) — Intermediate dimension in feed-forward network.
num_layers (int, defaults to 40) — The number of layers of transformer blocks to use.
window_size (Tuple[int], defaults to (-1, -1)) — Window size for local attention (-1 indicates global attention). This parameter does not appear in the constructor signature above.
cross_attn_norm (bool, defaults to True) — Enable cross-attention normalization.
qk_norm (str, optional, defaults to "rms_norm_across_heads") — The type of query/key normalization to apply.
eps (float, defaults to 1e-6) — Epsilon value for normalization layers.
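add_img_emb (bool, defaults to False) — Whether to use img_emb. This parameter does not appear in the constructor signature above, which lists image_dim (int, optional, defaults to None) instead.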
added_kv_proj_dim (int, optional, defaults to None) — The number of channels to use for the added key and value projections. If None, no projection is used.

A Transformer model for video-like data used in the Wan model.
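
As a rough illustration of how the defaults above fit together, the sketch below computes two derived quantities in plain Python. It assumes the common convention that the hidden width of each block equals num_attention_heads × attention_head_dim, and that the latent video is cut into non-overlapping patches of size patch_size. The helper num_patch_tokens and the latent grid of 21 × 60 × 104 are hypothetical and chosen only for the example; they are not part of the diffusers API.

```python
from math import prod

# Defaults copied from the constructor signature above.
patch_size = (1, 2, 2)
num_attention_heads = 40
attention_head_dim = 128

# Assumed convention: hidden width = heads x channels per head.
inner_dim = num_attention_heads * attention_head_dim


def num_patch_tokens(latent_frames, latent_height, latent_width, patch=patch_size):
    """Count tokens when a latent video is cut into non-overlapping 3D patches.

    Hypothetical helper for illustration; not a diffusers function.
    """
    dims = (latent_frames, latent_height, latent_width)
    if any(d % p for d, p in zip(dims, patch)):
        raise ValueError("each latent dimension must be divisible by its patch size")
    return prod(d // p for d, p in zip(dims, patch))


# Hypothetical latent grid: 21 frames of 60 x 104 latent pixels.
tokens = num_patch_tokens(21, 60, 104)
print(inner_dim)  # 5120
print(tokens)     # 32760
```

With the default patch_size of (1, 2, 2), the temporal axis is left intact and each spatial axis is halved, so the sequence length grows linearly with the number of latent frames.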

Transformer2DModelOutput

class diffusers.models.modeling_outputs.Transformer2DModelOutput

( sample: torch.Tensor )

Parameters

sample (torch.Tensor of shape (batch_size, num_channels, height, width) or (batch_size, num_vector_embeds - 1, num_latent_pixels) if Transformer2DModel is discrete) — The hidden states output conditioned on the encoder_hidden_states input. If discrete, returns probability distributions for the unnoised latent pixels.

The output of Transformer2DModel.
