---
tags:
- pyannote
- pyannote-audio
- pyannote-audio-model
- audio
- voice
- speech
- speaker
- speaker-diarization
- speaker-separation
- speech-separation
license: mit
inference: false
extra_gated_prompt: "The collected information will help acquire a better knowledge of the pyannote.audio user base and help its maintainers improve it further. Though this model uses the MIT license and will always remain open-source, we will occasionally email you about premium models and paid services around pyannote."
extra_gated_fields:
  Company/university: text
  Website: text
---

Using this open-source model in production?
Consider switching to [pyannoteAI](https://www.pyannote.ai) for better and faster options.

# đŸŽč ToTaToNet / joint speaker diarization and speech separation

This model ingests 5 seconds of mono audio sampled at 16 kHz and outputs both speaker diarization and speech separation for up to 3 speakers.

![Example](model.png)

It has been trained by [Joonas Kalda](https://www.linkedin.com/in/joonas-kalda-996499133) with [pyannote.audio](https://github.com/pyannote/pyannote-audio) `3.3.0` using the [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) dataset (single distant microphone, SDM).

The [paper](https://www.isca-archive.org/odyssey_2024/kalda24_odyssey.html) and its [companion repository](https://github.com/joonaskalda/PixIT) describe the approach in more detail.

## Requirements

1. Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) `3.3.0` with `pip install pyannote.audio[separation]==3.3.0`
2. Accept [`pyannote/separation-ami-1.0`](https://hf.co/pyannote/separation-ami-1.0) user conditions
3. Create an access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens).

```python
from pyannote.audio import Model

model = Model.from_pretrained(
    "pyannote/separation-ami-1.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
```

## Usage

```python
import torch

# the model ingests 5s of mono audio sampled at 16kHz...
duration = 5.0
num_channels = 1
sample_rate = 16000
batch_size = 2  # any batch size works

waveforms = torch.randn(batch_size, num_channels, int(duration * sample_rate))
waveforms.shape    # (batch_size, num_channels = 1, num_samples = 80000)

# ... and outputs both speaker diarization and speech separation
with torch.inference_mode():
    diarization, sources = model(waveforms)

diarization.shape  # (batch_size, num_frames = 624, max_num_speakers = 3)
# with values between 0 (speaker inactive) and 1 (speaker active)

sources.shape      # (batch_size, num_samples = 80000, max_num_speakers = 3)
```

A sketch showing how to save the separated `sources` to disk is provided at the end of this card.

## Limitations

This model cannot perform speaker diarization and speech separation of full recordings on its own (it only processes 5s chunks): see the [pyannote/speech-separation-ami-1.0](https://hf.co/pyannote/speech-separation-ami-1.0) pipeline, which combines this model with an additional speaker embedding model to do exactly that. A minimal usage sketch of that pipeline is also provided at the end of this card.

## Citations

```bibtex
@inproceedings{Kalda24,
  author={Joonas Kalda and ClĂ©ment PagĂ©s and Ricard Marxer and Tanel AlumĂ€e and HervĂ© Bredin},
  title={{PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings}},
  year=2024,
  booktitle={Proc. Odyssey 2024},
}
```

```bibtex
@inproceedings{Bredin23,
  author={HervĂ© Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```
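
## Saving separated sources (sketch)

The snippet below is a minimal sketch of how one might write the separated sources from the Usage section to disk. It assumes the `diarization` and `sources` tensors produced above, the model's 16 kHz sample rate, and an arbitrary 0.5 activity threshold; the output file names and the threshold are illustrative choices, not part of pyannote.audio.

```python
import scipy.io.wavfile

# binarize per-frame diarization scores (0.5 is an arbitrary, untuned threshold)
speaker_is_active = diarization > 0.5

# write each separated source of the first batch item as a 16 kHz mono wav file
for s in range(sources.shape[-1]):
    if not speaker_is_active[0, :, s].any():
        continue  # this speaker slot is never active in this chunk
    scipy.io.wavfile.write(f"speaker_{s}.wav", 16000, sources[0, :, s].numpy())
```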
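
## Processing full recordings (sketch)

The [pyannote/speech-separation-ami-1.0](https://hf.co/pyannote/speech-separation-ami-1.0) pipeline mentioned in the Limitations section is the intended way to handle full recordings. The snippet below is a minimal sketch of how loading and running it might look, assuming it follows the usual `Pipeline.from_pretrained` API and returns a diarization annotation together with the separated sources; `"audio.wav"` is a placeholder path, and the pipeline's own model card is the authoritative reference.

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speech-separation-ami-1.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# run the pipeline on a full recording ("audio.wav" is a placeholder path)
diarization, sources = pipeline("audio.wav")

# dump the diarization output to disk using the RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
```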