Diffusers documentation
WanTransformer3DModel
A Diffusion Transformer model for 3D video-like data was introduced in Wan 2.1 by the Alibaba Wan Team.
The model can be loaded with the following code snippet.
import torch
from diffusers import WanTransformer3DModel

transformer = WanTransformer3DModel.from_pretrained("Wan-AI/Wan2.1-T2V-1.3B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)

WanTransformer3DModel
class diffusers.WanTransformer3DModel
< source >( patch_size: typing.Tuple[int] = (1, 2, 2) num_attention_heads: int = 40 attention_head_dim: int = 128 in_channels: int = 16 out_channels: int = 16 text_dim: int = 4096 freq_dim: int = 256 ffn_dim: int = 13824 num_layers: int = 40 cross_attn_norm: bool = True qk_norm: typing.Optional[str] = 'rms_norm_across_heads' eps: float = 1e-06 image_dim: typing.Optional[int] = None added_kv_proj_dim: typing.Optional[int] = None rope_max_seq_len: int = 1024 pos_embed_seq_len: typing.Optional[int] = None )
Parameters
-  patch_size (Tuple[int], defaults to (1, 2, 2)) — 3D patch dimensions for video embedding (t_patch, h_patch, w_patch).
-  num_attention_heads (int, defaults to 40) — The number of heads to use for multi-head attention.
-  attention_head_dim (int, defaults to 128) — The number of channels in each attention head.
-  in_channels (int, defaults to 16) — The number of channels in the input.
-  out_channels (int, defaults to 16) — The number of channels in the output.
-  text_dim (int, defaults to 4096) — Input dimension for text embeddings.
-  freq_dim (int, defaults to 256) — Dimension for sinusoidal time embeddings.
-  ffn_dim (int, defaults to 13824) — Intermediate dimension in the feed-forward network.
-  num_layers (int, defaults to 40) — The number of transformer blocks to use.
-  cross_attn_norm (bool, defaults to True) — Enable cross-attention normalization.
-  qk_norm (str, optional, defaults to "rms_norm_across_heads") — The type of query/key normalization to apply in the attention layers.
-  eps (float, defaults to 1e-6) — Epsilon value for normalization layers.
-  image_dim (int, optional, defaults to None) — Dimension of the image embeddings used by image-conditioned variants. If None, no image conditioning is used.
-  added_kv_proj_dim (int, optional, defaults to None) — The number of channels to use for the added key and value projections. If None, no projection is used.
-  rope_max_seq_len (int, defaults to 1024) — Maximum sequence length for the rotary position embeddings.
A Transformer model for video-like data used in the Wan model.
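To see how these parameters fit together, the sketch below instantiates a tiny, randomly initialized configuration and runs a single forward pass. The reduced num_attention_heads, ffn_dim, and num_layers values are chosen only to keep the example light and do not correspond to any released checkpoint; the call assumes the usual Diffusers forward signature of hidden_states, timestep, and encoder_hidden_states. Note that the latent height and width must be divisible by the spatial patch size.

import torch
from diffusers import WanTransformer3DModel

# Tiny, randomly initialized config for shape checking only (not a real checkpoint)
transformer = WanTransformer3DModel(
    patch_size=(1, 2, 2),
    num_attention_heads=2,
    attention_head_dim=32,
    in_channels=16,
    out_channels=16,
    text_dim=4096,
    freq_dim=256,
    ffn_dim=256,
    num_layers=2,
)

batch_size, num_frames, height, width = 1, 5, 16, 16
hidden_states = torch.randn(batch_size, 16, num_frames, height, width)  # video latents
encoder_hidden_states = torch.randn(batch_size, 512, 4096)              # text embeddings (text_dim = 4096)
timestep = torch.tensor([500])

with torch.no_grad():
    output = transformer(
        hidden_states=hidden_states,
        timestep=timestep,
        encoder_hidden_states=encoder_hidden_states,
    )

print(output.sample.shape)  # torch.Size([1, 16, 5, 16, 16]), same shape as the input latents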
Transformer2DModelOutput
class diffusers.models.modeling_outputs.Transformer2DModelOutput
< source >( sample: torch.Tensor )
Parameters
-  sample (torch.Tensor of shape (batch_size, num_channels, height, width) or (batch_size, num_vector_embeds - 1, num_latent_pixels) if Transformer2DModel is discrete) — The hidden states output conditioned on the encoder_hidden_states input. If discrete, returns probability distributions for the unnoised latent pixels.
The output of Transformer2DModel.
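Continuing the forward-pass sketch above, the sample attribute holds the predicted latents. For WanTransformer3DModel the sample is 5D, (batch_size, out_channels, num_frames, height, width), even though the generic Transformer2DModelOutput docstring describes the 2D case. Passing return_dict=False instead returns a plain tuple, the standard Diffusers convention for model outputs.

# Continuing the sketch above: access the output via the dataclass attribute
output = transformer(
    hidden_states=hidden_states,
    timestep=timestep,
    encoder_hidden_states=encoder_hidden_states,
)
sample = output.sample  # torch.Tensor of shape (batch_size, out_channels, num_frames, height, width)

# With return_dict=False the model returns a plain tuple instead of a Transformer2DModelOutput
(sample,) = transformer(
    hidden_states=hidden_states,
    timestep=timestep,
    encoder_hidden_states=encoder_hidden_states,
    return_dict=False,
)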