Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait
Jie Sun 3 · Guangliang Cheng 1 · Yifei Zhang 5 · Bin Dong 4 · Kaizhu Huang 4
4 Duke Kunshan University · 5 Ricoh Software Research Center
Comparative videos
https://github.com/user-attachments/assets/08ebc6e0-41c5-4bf4-8ee8-2f7d317d92cd
Demo
Try the Gradio demo of KDTalker. The model was trained using only 4,282 video clips from VoxCeleb.
To Do List
- Train a community version using more datasets
- Release training code
Environment
KDTalker can run on a single RTX 4090 or RTX 3090.
1. Clone the code and prepare the environment
Note: Make sure your system has git, conda, and FFmpeg installed.
```bash
git clone https://github.com/chaolongy/KDTalker
cd KDTalker

# create env using conda
conda create -n KDTalker python=3.9
conda activate KDTalker

conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```
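After installing, you can sanity-check the environment before moving on. The sketch below (an illustrative helper, not part of the repository) verifies that the packages installed above can actually be imported:

```python
# Minimal environment sanity check. Package names match the
# conda/pip installation steps above; the helper itself is not
# part of the KDTalker repo.
import importlib.util


def check_packages(names):
    """Return the subset of `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = check_packages(["torch", "torchvision", "torchaudio"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        import torch
        print("PyTorch:", torch.__version__,
              "| CUDA available:", torch.cuda.is_available())
```

If CUDA is reported as unavailable, double-check that the `pytorch-cuda=11.8` build matches your driver version.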
2. Download pretrained weights
First, download all LivePortrait pretrained weights from Google Drive. Unzip and place them in `./pretrained_weights`.
Ensure the directory structure is as follows:
```
pretrained_weights
├── insightface
│   └── models
│       └── buffalo_l
│           ├── 2d106det.onnx
│           └── det_10g.onnx
└── liveportrait
    ├── base_models
    │   ├── appearance_feature_extractor.pth
    │   ├── motion_extractor.pth
    │   ├── spade_generator.pth
    │   └── warping_module.pth
    ├── landmark.onnx
    └── retargeting_models
        └── stitching_retargeting_module.pth
```
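To confirm the weights landed in the right places, a small check script can walk the expected layout. This is a hedged sketch: the relative paths mirror the tree above, and the helper name is ours, not the repo's:

```python
# Verify the pretrained_weights layout described above.
# The path list mirrors the README tree; this helper is
# illustrative and not part of the KDTalker codebase.
from pathlib import Path

EXPECTED_FILES = [
    "insightface/models/buffalo_l/2d106det.onnx",
    "insightface/models/buffalo_l/det_10g.onnx",
    "liveportrait/base_models/appearance_feature_extractor.pth",
    "liveportrait/base_models/motion_extractor.pth",
    "liveportrait/base_models/spade_generator.pth",
    "liveportrait/base_models/warping_module.pth",
    "liveportrait/landmark.onnx",
    "liveportrait/retargeting_models/stitching_retargeting_module.pth",
]


def missing_weights(root="pretrained_weights"):
    """Return relative paths of expected weight files that are absent."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_weights()
    print("All weights present." if not missing else f"Missing: {missing}")
```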
Then, download the weights for the face detector, audio extractor, and KDTalker from Google Drive and put them in `./ckpts`.
Alternatively, you can download all of the above weights from Hugging Face.
Inference
```bash
python inference.py -source_image ./example/source_image/WDA_BenCardin1_000.png -driven_audio ./example/driven_audio/WDA_BenCardin1_000.wav -output ./results/output.mp4
```
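To process several image/audio pairs in one go, the command above can be wrapped in a small batch driver. The flag names (`-source_image`, `-driven_audio`, `-output`) are taken from the example command; the wrapper functions themselves are a sketch, not part of the repo:

```python
# Batch wrapper around the inference command shown above.
# build_cmd/run_batch are illustrative helpers; only the flag
# names come from the README's example command.
import subprocess
from pathlib import Path


def build_cmd(source_image, driven_audio, output):
    """Assemble the inference.py command line as an argument list."""
    return [
        "python", "inference.py",
        "-source_image", str(source_image),
        "-driven_audio", str(driven_audio),
        "-output", str(output),
    ]


def run_batch(pairs, out_dir="./results"):
    """Run inference for each (image, audio) pair, naming outputs by image stem."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for image, audio in pairs:
        output = Path(out_dir) / (Path(image).stem + ".mp4")
        subprocess.run(build_cmd(image, audio, output), check=True)
```

Passing the command as a list (rather than a shell string) avoids quoting problems with paths that contain spaces.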
Contact
Our code is released under the CC-BY-NC 4.0 license and is intended solely for research purposes. If you have any questions or wish to use it for commercial purposes, please contact us at [email protected]
Citation
If you find this code helpful for your research, please cite:
```bibtex
@misc{yang2025kdtalker,
      title={Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait},
      author={Chaolong Yang and Kai Yao and Yuyao Yan and Chenru Jiang and Weiguang Zhao and Jie Sun and Guangliang Cheng and Yifei Zhang and Bin Dong and Kaizhu Huang},
      year={2025},
      eprint={2503.12963},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.12963},
}
```
Acknowledgements
We thank these works for their public code and generous help: SadTalker, LivePortrait, Wav2Lip, Face-vid2vid, etc.