
Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait



1 University of Liverpool   2 Ant Group   3 Xi’an Jiaotong-Liverpool University  
4 Duke Kunshan University   5 Ricoh Software Research Center  

Comparative videos

https://github.com/user-attachments/assets/08ebc6e0-41c5-4bf4-8ee8-2f7d317d92cd

Demo

An online Gradio demo of KDTalker is available. The model was trained using only 4,282 video clips from VoxCeleb.


To Do List

  • Train a community version using more datasets
  • Release training code

Environment

KDTalker can run on a single RTX 4090 or RTX 3090.

1. Clone the code and prepare the environment

Note: Make sure your system has git, conda, and FFmpeg installed.

git clone https://github.com/chaolongy/KDTalker
cd KDTalker

# create env using conda
conda create -n KDTalker python=3.9
conda activate KDTalker

conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=11.8 -c pytorch -c nvidia

pip install -r requirements.txt
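
Optionally, you can verify the installation with a quick Python check (a minimal snippet; it only confirms that PyTorch was built with CUDA support and can see a GPU):

# optional sanity check: confirm the CUDA build of PyTorch sees the GPU
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))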

2. Download pretrained weights

First, download all LivePortrait pretrained weights from Google Drive. Unzip and place them in ./pretrained_weights, ensuring the directory structure is as follows:

pretrained_weights
├── insightface
│   └── models
│       └── buffalo_l
│           ├── 2d106det.onnx
│           └── det_10g.onnx
└── liveportrait
    ├── base_models
    │   ├── appearance_feature_extractor.pth
    │   ├── motion_extractor.pth
    │   ├── spade_generator.pth
    │   └── warping_module.pth
    ├── landmark.onnx
    └── retargeting_models
        └── stitching_retargeting_module.pth

You can download the weights for the face detector, audio extractor and KDTalker from Google Drive. Put them in ./ckpts.

Alternatively, all of the above weights can be downloaded from Hugging Face.
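
Before running inference, you may want to confirm that the weights are in the expected locations. Below is a minimal check; the file list simply mirrors the directory tree above:

# optional check: confirm the LivePortrait/InsightFace weights match the tree above
from pathlib import Path

expected = [
    "insightface/models/buffalo_l/2d106det.onnx",
    "insightface/models/buffalo_l/det_10g.onnx",
    "liveportrait/base_models/appearance_feature_extractor.pth",
    "liveportrait/base_models/motion_extractor.pth",
    "liveportrait/base_models/spade_generator.pth",
    "liveportrait/base_models/warping_module.pth",
    "liveportrait/landmark.onnx",
    "liveportrait/retargeting_models/stitching_retargeting_module.pth",
]

root = Path("./pretrained_weights")
missing = [p for p in expected if not (root / p).is_file()]
print("All weights found." if not missing else f"Missing: {missing}")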

Inference

python inference.py -source_image ./example/source_image/WDA_BenCardin1_000.png -driven_audio ./example/driven_audio/WDA_BenCardin1_000.wav -output ./results/output.mp4
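
To process several image/audio pairs, you can invoke the same command once per pair. A small sketch (the example paths are illustrative and reuse the sample files above):

# optional sketch: run the CLI above for several image/audio pairs (paths illustrative)
import subprocess
from pathlib import Path

pairs = [
    ("./example/source_image/WDA_BenCardin1_000.png",
     "./example/driven_audio/WDA_BenCardin1_000.wav"),
    # add more (image, audio) pairs here
]

Path("./results").mkdir(parents=True, exist_ok=True)
for image, audio in pairs:
    output = f"./results/{Path(image).stem}.mp4"
    subprocess.run(
        ["python", "inference.py",
         "-source_image", image,
         "-driven_audio", audio,
         "-output", output],
        check=True,
    )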

Contact

Our code is under the CC-BY-NC 4.0 license and intended solely for research purposes. If you have any questions or wish to use it for commercial purposes, please contact us at [email protected].

Citation

If you find this code helpful for your research, please cite:

@misc{yang2025kdtalker,
      title={Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait}, 
      author={Chaolong Yang and Kai Yao and Yuyao Yan and Chenru Jiang and Weiguang Zhao and Jie Sun and Guangliang Cheng and Yifei Zhang and Bin Dong and Kaizhu Huang},
      year={2025},
      eprint={2503.12963},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.12963}, 
}

Acknowledgements

We acknowledge these works for their public code and selfless help: SadTalker, LivePortrait, Wav2Lip, Face-vid2vid, etc.