
Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait



1 University of Liverpool   2 Ant Group   3 Xi’an Jiaotong-Liverpool University  
4 Duke Kunshan University   5 Ricoh Software Research Center  

Comparative videos

https://github.com/user-attachments/assets/08ebc6e0-41c5-4bf4-8ee8-2f7d317d92cd

Demo

An online Gradio demo of KDTalker is available. The model was trained using only 4,282 video clips from VoxCeleb.


To Do List

  • Train a community version using more datasets
  • Release training code

Environment

KDTalker can run on a single RTX 4090 or RTX 3090.

1. Clone the code and prepare the environment

Note: Make sure your system has git, conda, and FFmpeg installed.

git clone https://github.com/chaolongy/KDTalker
cd KDTalker

# create env using conda
conda create -n KDTalker python=3.9
conda activate KDTalker

conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=11.8 -c pytorch -c nvidia

pip install -r requirements.txt
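
Optionally, you can verify the installation with a quick Python check (a minimal snippet; it only confirms that PyTorch was built with CUDA support and can see a GPU):

# optional sanity check: confirm the CUDA build of PyTorch sees the GPU
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))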

2. Download pretrained weights

First, download all LivePortrait pretrained weights from Google Drive. Unzip and place them in ./pretrained_weights, ensuring the directory structure is as follows:

pretrained_weights
├── insightface
│   └── models
│       └── buffalo_l
│           ├── 2d106det.onnx
│           └── det_10g.onnx
└── liveportrait
    ├── base_models
    │   ├── appearance_feature_extractor.pth
    │   ├── motion_extractor.pth
    │   ├── spade_generator.pth
    │   └── warping_module.pth
    ├── landmark.onnx
    └── retargeting_models
        └── stitching_retargeting_module.pth

You can download the weights for the face detector, audio extractor and KDTalker from Google Drive. Put them in ./ckpts.

Alternatively, all of the above weights can be downloaded from Hugging Face.
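
Before running inference, you may want to confirm that the weights are in the expected locations. Below is a minimal check; the file list simply mirrors the directory tree above:

# optional check: confirm the LivePortrait/InsightFace weights match the tree above
from pathlib import Path

expected = [
    "insightface/models/buffalo_l/2d106det.onnx",
    "insightface/models/buffalo_l/det_10g.onnx",
    "liveportrait/base_models/appearance_feature_extractor.pth",
    "liveportrait/base_models/motion_extractor.pth",
    "liveportrait/base_models/spade_generator.pth",
    "liveportrait/base_models/warping_module.pth",
    "liveportrait/landmark.onnx",
    "liveportrait/retargeting_models/stitching_retargeting_module.pth",
]

root = Path("./pretrained_weights")
missing = [p for p in expected if not (root / p).is_file()]
print("All weights found." if not missing else f"Missing: {missing}")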

Inference

python inference.py -source_image ./example/source_image/WDA_BenCardin1_000.png -driven_audio ./example/driven_audio/WDA_BenCardin1_000.wav -output ./results/output.mp4
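
To process several image/audio pairs, you can invoke the same command once per pair. A small sketch (the example paths are illustrative and reuse the sample files above):

# optional sketch: run the CLI above for several image/audio pairs (paths illustrative)
import subprocess
from pathlib import Path

pairs = [
    ("./example/source_image/WDA_BenCardin1_000.png",
     "./example/driven_audio/WDA_BenCardin1_000.wav"),
    # add more (image, audio) pairs here
]

Path("./results").mkdir(parents=True, exist_ok=True)
for image, audio in pairs:
    output = f"./results/{Path(image).stem}.mp4"
    subprocess.run(
        ["python", "inference.py",
         "-source_image", image,
         "-driven_audio", audio,
         "-output", output],
        check=True,
    )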

Contact

Our code is under the CC-BY-NC 4.0 license and intended solely for research purposes. If you have any questions or wish to use it for commercial purposes, please contact us at [email protected].

Citation

If you find this code helpful for your research, please cite:

@misc{yang2025kdtalker,
      title={Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait}, 
      author={Chaolong Yang and Kai Yao and Yuyao Yan and Chenru Jiang and Weiguang Zhao and Jie Sun and Guangliang Cheng and Yifei Zhang and Bin Dong and Kaizhu Huang},
      year={2025},
      eprint={2503.12963},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.12963}, 
}

Acknowledgements

We acknowledge these works for their public code and selfless help: SadTalker, LivePortrait, Wav2Lip, Face-vid2vid, etc.