---
tags:
- pyannote
- audio
- voice
- speech
- speaker
- speaker-segmentation
- voice-activity-detection
- overlapped-speech-detection
- resegmentation
datasets:
- ami
- dihard
- voxconverse
license: mit
inference: false
---
# pyannote.audio // speaker segmentation
Model from *End-to-end speaker segmentation for overlap-aware resegmentation*, by Hervé Bredin and Antoine Laurent.

It relies on pyannote.audio 2.0, which is currently in development: see the installation instructions in the pyannote.audio repository.
## Support

For commercial enquiries and scientific consulting, please contact me.
For technical questions and bug reports, please check the pyannote.audio GitHub repository.
## Basic usage
```python
from pyannote.audio import Inference

inference = Inference("pyannote/segmentation")
segmentation = inference("audio.wav")
# `segmentation` is a pyannote.core.SlidingWindowFeature
# instance containing raw segmentation scores like the
# one pictured above (output)
```
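To poke at these raw scores directly, here is a minimal sketch assuming pyannote.core's documented `data` and `sliding_window` attributes:

```python
# raw scores: numpy array of shape (num_frames, num_speakers)
scores = segmentation.data

# frame-to-time mapping: start, duration and step of each frame
frames = segmentation.sliding_window
print(scores.shape, frames.step)
```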
```python
from pyannote.audio.pipelines import Segmentation

pipeline = Segmentation(segmentation="pyannote/segmentation")

HYPER_PARAMETERS = {
    # onset/offset activation thresholds
    "onset": 0.5, "offset": 0.5,
    # remove speaker turns shorter than that many seconds.
    "min_duration_on": 0.0,
    # fill intra-speaker pauses shorter than that many seconds.
    "min_duration_off": 0.0,
}

pipeline.instantiate(HYPER_PARAMETERS)
segmentation = pipeline("audio.wav")
# `segmentation` now is a pyannote.core.Annotation
# instance containing a hard binary segmentation
# like the one pictured above (reference)
```
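Speaker turns in the resulting annotation can then be iterated with pyannote.core's `Annotation.itertracks`:

```python
# print each speaker turn with its (local) speaker label
for turn, _, speaker in segmentation.itertracks(yield_label=True):
    print(f"{speaker} speaks from {turn.start:.1f}s to {turn.end:.1f}s")
```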
## Advanced usage
### Voice activity detection
```python
from pyannote.audio.pipelines import VoiceActivityDetection

pipeline = VoiceActivityDetection(segmentation="pyannote/segmentation")
pipeline.instantiate(HYPER_PARAMETERS)
vad = pipeline("audio.wav")
```
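The output is again a pyannote.core.Annotation; a minimal sketch for reducing it to plain (start, end) speech regions:

```python
# merge adjacent/overlapping segments into clean speech regions
for region in vad.get_timeline().support():
    print(f"speech from {region.start:.1f}s to {region.end:.1f}s")
```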
### Overlapped speech detection
```python
from pyannote.audio.pipelines import OverlappedSpeechDetection

pipeline = OverlappedSpeechDetection(segmentation="pyannote/segmentation")
pipeline.instantiate(HYPER_PARAMETERS)
osd = pipeline("audio.wav")
```
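For instance, to measure how much overlapped speech the file contains (a sketch, relying only on pyannote.core segments):

```python
# total duration of overlapped speech, in seconds
overlap = sum(segment.duration for segment in osd.get_timeline().support())
print(f"{overlap:.1f}s of overlapped speech")
```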
### Resegmentation
```python
from pyannote.audio.pipelines import Resegmentation

pipeline = Resegmentation(segmentation="pyannote/segmentation",
                          diarization="baseline")
pipeline.instantiate(HYPER_PARAMETERS)
resegmented_baseline = pipeline({"audio": "audio.wav", "baseline": baseline})
# where `baseline` should be provided as a pyannote.core.Annotation instance
```
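For a quick test, such a baseline can be built by hand; a minimal sketch with hypothetical speaker turns:

```python
from pyannote.core import Annotation, Segment

# hypothetical baseline diarization with two (partially overlapping) speakers
baseline = Annotation(uri="audio")
baseline[Segment(0.0, 10.0)] = "speaker_1"
baseline[Segment(8.5, 15.0)] = "speaker_2"
```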
## Reproducible research
In order to reproduce the results of the paper *End-to-end speaker segmentation for overlap-aware resegmentation*, use the following hyper-parameters:
| Voice activity detection | onset | offset | min_duration_on | min_duration_off |
|---|---|---|---|---|
| AMI Mix-Headset | 0.851 | 0.430 | 0.115 | 0.146 |
| DIHARD3 | 0.855 | 0.292 | 0.036 | 0.001 |
| VoxConverse | 0.883 | 0.688 | 0.106 | 0.526 |
| Overlapped speech detection | onset | offset | min_duration_on | min_duration_off |
|---|---|---|---|---|
| AMI Mix-Headset | 0.552 | 0.311 | 0.131 | 0.180 |
| DIHARD3 | 0.564 | 0.264 | 0.158 | 0.080 |
| VoxConverse | 0.617 | 0.387 | 0.367 | 0.334 |
| Resegmentation of VBx | onset | offset | min_duration_on | min_duration_off |
|---|---|---|---|---|
| AMI Mix-Headset | 0.542 | 0.527 | 0.044 | 0.705 |
| DIHARD3 | 0.592 | 0.489 | 0.163 | 0.182 |
| VoxConverse | 0.537 | 0.724 | 0.410 | 0.563 |
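For example, applying the voice activity detection pipeline with the DIHARD3 hyper-parameters from the first table looks like this:

```python
from pyannote.audio.pipelines import VoiceActivityDetection

pipeline = VoiceActivityDetection(segmentation="pyannote/segmentation")
pipeline.instantiate({"onset": 0.855, "offset": 0.292,
                      "min_duration_on": 0.036, "min_duration_off": 0.001})
vad = pipeline("audio.wav")
```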
Expected outputs (and the VBx baseline) are also provided in the /reproducible_research sub-directories.
## Citation
```bibtex
@inproceedings{Bredin2020,
  Title = {{pyannote.audio: neural building blocks for speaker diarization}},
  Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
  Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
  Address = {Barcelona, Spain},
  Month = {May},
  Year = {2020},
}
```