Generate audio labels from speech
Generate speech timestamp labels from audio
Generate audio for video segments
Extract voice from audio file