Alibaba-Research-Intelligence-Computing/Tora

[🔥CVPR'25] Tora: Trajectory-oriented Diffusion Transformer for Video Generation

Zhenghao Zhang*, Junchao Liao*, Menghao Li, Zuozhuo Dai, Bingxue Qiu, Siyu Zhu, Long Qin, Weizhi Wang

* equal contribution

Please visit our GitHub repo for more details.

💡 Abstract

Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable proficiency in producing high-quality video content. Nonetheless, the potential of transformer-based diffusion models for effectively generating videos with controllable motion remains an area of limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework that integrates textual, visual, and trajectory conditions concurrently for video generation. Specifically, Tora consists of a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D video compression network. The MGF integrates the motion patches into the DiT blocks to generate consistent videos following trajectories. Our design aligns seamlessly with DiT’s scalability, allowing precise control of video content’s dynamics with diverse durations, aspect ratios, and resolutions. Extensive experiments demonstrate Tora’s excellence in achieving high motion fidelity, while also meticulously simulating the movement of the physical world.

📣 Updates

2025/01/06 🔥🔥 We released Tora Image-to-Video, including inference code and model weights.
2024/12/13 SageAttention2 and model compilation are supported in the diffusers version. Tested on the A10, these approaches speed up every inference step by approximately 52%, except for the first step.
2024/12/09 🔥🔥 The diffusers version of Tora and the corresponding model weights are released. Inference VRAM requirements are reduced to around 5 GiB. Please refer to this for details.
2024/11/25 🔥 Text-to-Video training code released.
2024/10/31 Model weights uploaded to HuggingFace. We also provided an English demo on ModelScope.
2024/10/23 🔥🔥 Our ModelScope Demo is launched. Welcome to try it out! We also uploaded the model weights to ModelScope.
2024/10/21 Thanks to @kijai for supporting Tora in ComfyUI! Link
2024/10/15 🔥🔥 We released our inference code and model weights. Please note that this is a CogVideoX version of Tora, built on the CogVideoX-5B model. This version of Tora is meant for academic research purposes only. Due to our commercial plans, we will not be open-sourcing the complete version of Tora at this time.
2024/08/27 We released our v2 paper including appendix.
2024/07/31 We submitted our paper on arXiv and released our project page.

🎞️ Showcases

https://github.com/user-attachments/assets/949d5e99-18c9-49d6-b669-9003ccd44bf1

https://github.com/user-attachments/assets/7e7dbe87-a8ba-4710-afd0-9ef528ec329b

https://github.com/user-attachments/assets/4026c23d-229d-45d7-b5be-6f3eb9e4fd50

All videos are available at this Link.

✅ TODO List

Release our inference code and model weights
Provide a ModelScope Demo
Release our training code
Release diffusers version and optimize the GPU memory usage
Release complete version of Tora

📦 Model Weights

Folder Structure

Tora
└── sat
    └── ckpts
        ├── t5-v1_1-xxl
        │   ├── model-00001-of-00002.safetensors
        │   └── ...
        ├── vae
        │   └── 3d-vae.pt
        ├── tora
        │   ├── i2v
        │   │   └── mp_rank_00_model_states.pt
        │   └── t2v
        │       └── mp_rank_00_model_states.pt
        └── CogVideoX-5b-sat # for training stage 1
            └── mp_rank_00_model_states.pt

Download Links

Note: Downloading the Tora weights requires following the CogVideoX License. You can choose one of the following options: HuggingFace, ModelScope, or native links.
After downloading the model weights, put them in the Tora/sat/ckpts folder.

HuggingFace

# This can be faster
pip install "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download Alibaba-Research-Intelligence-Computing/Tora --local-dir ckpts

# use git
git lfs install
git clone https://huggingface.co/Alibaba-Research-Intelligence-Computing/Tora

ModelScope

from modelscope import snapshot_download
model_dir = snapshot_download('xiaoche/Tora')

git clone https://www.modelscope.cn/xiaoche/Tora.git

Native

Download the VAE and T5 model following CogVideo:
- VAE: https://cloud.tsinghua.edu.cn/f/fdba7608a49c463ba754/?dl=1
- T5: text_encoder, tokenizer
Tora t2v model weights: Link. Downloading this weight requires following the CogVideoX License.
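
Whichever download option you use, it can help to confirm that the files ended up where the folder structure above expects them. The following is a minimal sketch, not part of the Tora codebase: the function name `missing_checkpoints` and the constant names are hypothetical, and the path lists are copied from the tree in this README. Because the tree elides most of the T5 files, only the presence of the `t5-v1_1-xxl` directory is checked.

```python
from pathlib import Path

# Relative paths copied from the folder structure in this README.
REQUIRED_FILES = [
    "vae/3d-vae.pt",
    "tora/i2v/mp_rank_00_model_states.pt",
    "tora/t2v/mp_rank_00_model_states.pt",
]
# Marked "for training stage 1" in the tree, so it is optional here.
TRAINING_FILES = ["CogVideoX-5b-sat/mp_rank_00_model_states.pt"]
# The tree elides most T5 shards, so only the directory itself is checked.
REQUIRED_DIRS = ["t5-v1_1-xxl"]


def missing_checkpoints(ckpt_root, include_training=False):
    """Return the expected relative paths that are absent under ckpt_root."""
    root = Path(ckpt_root)
    files = REQUIRED_FILES + (TRAINING_FILES if include_training else [])
    missing = [rel for rel in files if not (root / rel).is_file()]
    missing += [rel for rel in REQUIRED_DIRS if not (root / rel).is_dir()]
    return missing


if __name__ == "__main__":
    # Assumes the script is run from the directory that contains Tora/.
    gaps = missing_checkpoints("Tora/sat/ckpts")
    print("all expected checkpoints found" if not gaps else f"missing: {gaps}")
```

If you only need one of the i2v or t2v models, trim `REQUIRED_FILES` accordingly; pass `include_training=True` before running training stage 1.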

🤝 Acknowledgements

We would like to express our gratitude to the following open-source projects that have been instrumental in the development of our project:

CogVideo: An open-source video generation framework by THUKEG.
Open-Sora: An open-source video generation framework by HPC-AI Tech.
MotionCtrl: A video generation model supporting motion control by ARC Lab, Tencent PCG.
ComfyUI-DragNUWA: An implementation of DragNUWA for ComfyUI.

Special thanks to the contributors of these libraries for their hard work and dedication!

📄 Our previous work

AnimateAnything: Fine Grained Open Domain Image Animation with Motion Guidance

📚 Citation

@misc{zhang2024toratrajectoryorienteddiffusiontransformer,
      title={Tora: Trajectory-oriented Diffusion Transformer for Video Generation},
      author={Zhenghao Zhang and Junchao Liao and Menghao Li and Zuozhuo Dai and Bingxue Qiu and Siyu Zhu and Long Qin and Weizhi Wang},
      year={2024},
      eprint={2407.21705},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.21705},
}
