---
tags:
- pyannote
- pyannote-audio
- pyannote-audio-model
- audio
- voice
- speech
- speaker
- speaker-diarization
- speaker-separation
- speech-separation
license: mit
inference: false
extra_gated_prompt: "The collected information will help acquire a better knowledge of the pyannote.audio user base and help its maintainers improve it further. Though this model uses the MIT license and will always remain open-source, we will occasionally email you about premium models and paid services around pyannote."
extra_gated_fields:
  Company/university: text
  Website: text
---

Using this open-source model in production?
Consider switching to [pyannoteAI](https://www.pyannote.ai) for better and faster options.

# đŸŽč ToTaToNet / joint speaker diarization and speech separation

This model ingests 5 seconds of mono audio sampled at 16 kHz and outputs both speaker diarization and speech separation for up to 3 speakers.

![Example](model.png)

It has been trained by [Joonas Kalda](https://www.linkedin.com/in/joonas-kalda-996499133) with [pyannote.audio](https://github.com/pyannote/pyannote-audio) `3.3.0` using the [AMI](https://groups.inf.ed.ac.uk/ami/corpus/) dataset (single distant microphone, SDM).

The [paper](https://www.isca-archive.org/odyssey_2024/kalda24_odyssey.html) and its [companion repository](https://github.com/joonaskalda/PixIT) describe the approach in more detail.

## Requirements

1. Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) `3.3.0` with `pip install pyannote.audio[separation]==3.3.0`
2. Accept [`pyannote/separation-ami-1.0`](https://hf.co/pyannote/separation-ami-1.0) user conditions
3. Create an access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens).

```python
from pyannote.audio import Model

model = Model.from_pretrained(
    "pyannote/separation-ami-1.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
```

## Usage

```python
import torch

# the model ingests 5s of mono audio sampled at 16kHz...
duration = 5.0
num_channels = 1
sample_rate = 16000
batch_size = 2  # any batch size works

waveforms = torch.randn(batch_size, num_channels, int(duration * sample_rate))
waveforms.shape    # (batch_size, num_channels = 1, num_samples = 80000)

# ... and outputs both speaker diarization and speech separation
with torch.inference_mode():
    diarization, sources = model(waveforms)

diarization.shape  # (batch_size, num_frames = 624, max_num_speakers = 3)
# with values between 0 (speaker inactive) and 1 (speaker active)

sources.shape      # (batch_size, num_samples = 80000, max_num_speakers = 3)
```

A sketch showing how to save the separated `sources` to disk is provided at the end of this card.

## Limitations

This model cannot perform speaker diarization and speech separation of full recordings on its own (it only processes 5s chunks): see the [pyannote/speech-separation-ami-1.0](https://hf.co/pyannote/speech-separation-ami-1.0) pipeline, which combines this model with an additional speaker embedding model to do exactly that. A minimal usage sketch of that pipeline is also provided at the end of this card.

## Citations

```bibtex
@inproceedings{Kalda24,
  author={Joonas Kalda and ClĂ©ment PagĂ©s and Ricard Marxer and Tanel AlumĂ€e and HervĂ© Bredin},
  title={{PixIT: Joint Training of Speaker Diarization and Speech Separation from Real-world Multi-speaker Recordings}},
  year=2024,
  booktitle={Proc. Odyssey 2024},
}
```

```bibtex
@inproceedings{Bredin23,
  author={HervĂ© Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```
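
## Saving separated sources (sketch)

The snippet below is a minimal sketch of how one might write the separated sources from the Usage section to disk. It assumes the `diarization` and `sources` tensors produced above, the model's 16 kHz sample rate, and an arbitrary 0.5 activity threshold; the output file names and the threshold are illustrative choices, not part of pyannote.audio.

```python
import scipy.io.wavfile

# binarize per-frame diarization scores (0.5 is an arbitrary, untuned threshold)
speaker_is_active = diarization > 0.5

# write each separated source of the first batch item as a 16 kHz mono wav file
for s in range(sources.shape[-1]):
    if not speaker_is_active[0, :, s].any():
        continue  # this speaker slot is never active in this chunk
    scipy.io.wavfile.write(f"speaker_{s}.wav", 16000, sources[0, :, s].numpy())
```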
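
## Processing full recordings (sketch)

The [pyannote/speech-separation-ami-1.0](https://hf.co/pyannote/speech-separation-ami-1.0) pipeline mentioned in the Limitations section is the intended way to handle full recordings. The snippet below is a minimal sketch of how loading and running it might look, assuming it follows the usual `Pipeline.from_pretrained` API and returns a diarization annotation together with the separated sources; `"audio.wav"` is a placeholder path, and the pipeline's own model card is the authoritative reference.

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speech-separation-ami-1.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# run the pipeline on a full recording ("audio.wav" is a placeholder path)
diarization, sources = pipeline("audio.wav")

# dump the diarization output to disk using the RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
```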