|
<!DOCTYPE html> |
|
<html lang="en"> |
|
<head> |
|
<meta charset="utf-8" /> |
|
<meta name="viewport" content="width=device-width" /> |
|
<title>ARCH: Audio Representation benCHmark</title> |
|
<link href='https://fonts.googleapis.com/css?family=Roboto' rel='stylesheet' type='text/css'> |
|
<link rel="stylesheet" href="style.css" /> |
|
</head> |
|
<body> |
|
|
|
<img src="arch_logo.png" alt="ARCH logo" class="center_img width500"> |
|
|
|
<br> |
|
|
|
<p style="text-align: center;"> |
|
ARCH is a framework for benchmarking audio representations. It gives researchers a unified way to evaluate and compare their models across sound, music, and speech datasets. |
|
The project is currently in its first release. Details on the datasets and models are available in the <a href="https://github.com/MorenoLaQuatra/ARCH/" target="_blank" rel="noopener noreferrer">GitHub repository</a>. |
|
</p> |
|
|
|
<br><br> |
|
|
|
<h2 style="text-align: center;">Results on the ARCH benchmark - Version 1.0</h2> |
|
<style type="text/css"> |
|
.tg {border-collapse:collapse;border-color:#ccc;border-spacing:0;border-style:solid;border-width:1px;} |
|
.tg td{background-color:#fff;border-color:#ccc;border-style:solid;border-width:0px;color:#333; |
|
font-family:Arial, sans-serif;font-size:14px;overflow:hidden;padding:10px 5px;word-break:normal;} |
|
.tg th{background-color:#f0f0f0;border-color:#ccc;border-style:solid;border-width:0px;color:#333; |
|
font-family:Arial, sans-serif;font-size:14px;font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;} |
|
.tg .tg-u5vp{background-color:#fd6864;border-color:inherit;color:#ffffff;text-align:center;vertical-align:top} |
|
.tg .tg-baqh{text-align:center;vertical-align:top} |
|
.tg .tg-c3ow{border-color:inherit;text-align:center;vertical-align:top} |
|
.tg .tg-wp8o{border-color:#000000;text-align:center;vertical-align:top} |
|
.tg .tg-0vih{background-color:#f9f9f9;font-weight:bold;text-align:center;vertical-align:top} |
|
.tg .tg-q860{background-color:#fd6864;border-color:inherit;color:#ffffff;text-align:center;vertical-align:top} |
|
.tg .tg-abip{background-color:#f9f9f9;border-color:inherit;text-align:center;vertical-align:top} |
|
.tg .tg-zwlc{background-color:#f9f9f9;border-color:inherit;font-weight:bold;text-align:center;vertical-align:top} |
|
.tg .tg-7btt{border-color:inherit;font-weight:bold;text-align:center;vertical-align:top} |
|
.tg .tg-mqa1{border-color:#000000;font-weight:bold;text-align:center;vertical-align:top} |
|
.tg .tg-amwm{font-weight:bold;text-align:center;vertical-align:top} |
|
.tg .tg-dzk6{background-color:#f9f9f9;text-align:center;vertical-align:top} |
|
</style> |
|
<table class="tg"> |
|
<thead> |
|
<tr> |
|
<th class="tg-u5vp" rowspan="2">Model</th> |
|
<th class="tg-u5vp" rowspan="2">Size</th> |
|
<th class="tg-u5vp" colspan="4">Sound</th> |
|
<th class="tg-u5vp" colspan="4">Music</th> |
|
<th class="tg-u5vp" colspan="4">Speech</th> |
|
</tr> |
|
<tr> |
|
<th class="tg-q860">ESC-50</th> |
|
<th class="tg-q860">US8K</th> |
|
<th class="tg-q860">FSD50K</th> |
|
<th class="tg-q860">VIVAE</th> |
|
<th class="tg-q860">FMA</th> |
|
<th class="tg-q860">MTT</th> |
|
<th class="tg-q860">IRMAS</th> |
|
<th class="tg-q860">MS-DB</th> |
|
<th class="tg-q860">RAVDESS</th> |
|
<th class="tg-q860">A-MNIST</th> |
|
<th class="tg-q860">SLURP</th> |
|
<th class="tg-q860">EMOVO</th> |
|
</tr> |
|
</thead> |
|
<tbody> |
|
<tr> |
|
<td class="tg-c3ow"><a href="https://huggingface.co/facebook/wav2vec2-base">facebook/wav2vec2-base</a></td> |
|
<td class="tg-c3ow">S</td> |
|
<td class="tg-c3ow">45.73</td> |
|
<td class="tg-c3ow">55.48</td> |
|
<td class="tg-c3ow">19.39</td> |
|
<td class="tg-c3ow">31.47</td> |
|
<td class="tg-c3ow">50.54</td> |
|
<td class="tg-c3ow">37.56</td> |
|
<td class="tg-c3ow">35.14</td> |
|
<td class="tg-c3ow">66.06</td> |
|
<td class="tg-c3ow">55.32</td> |
|
<td class="tg-c3ow">86.38</td> |
|
<td class="tg-c3ow">14.37</td> |
|
<td class="tg-c3ow">31.80</td> |
|
</tr> |
|
<tr> |
|
<td class="tg-abip"><a href="https://huggingface.co/microsoft/wavlm-base">microsoft/wavlm-base</a></td> |
|
<td class="tg-abip">S</td> |
|
<td class="tg-abip">49.88</td> |
|
<td class="tg-abip">61.84</td> |
|
<td class="tg-abip">17.63</td> |
|
<td class="tg-abip">36.31</td> |
|
<td class="tg-abip">48.71</td> |
|
<td class="tg-abip">34.93</td> |
|
<td class="tg-abip">32.62</td> |
|
<td class="tg-abip">54.18</td> |
|
<td class="tg-zwlc"><span style="font-style:normal">67.94</span></td> |
|
<td class="tg-abip">99.50</td> |
|
<td class="tg-abip">30.98</td> |
|
<td class="tg-zwlc">43.08</td> |
|
</tr> |
|
<tr> |
|
<td class="tg-c3ow"><a href="https://huggingface.co/microsoft/wavlm-base-plus">microsoft/wavlm-base-plus</a></td> |
|
<td class="tg-c3ow">S</td> |
|
<td class="tg-c3ow">58.73</td> |
|
<td class="tg-c3ow">64.07</td> |
|
<td class="tg-c3ow">21.57</td> |
|
<td class="tg-c3ow">36.17</td> |
|
<td class="tg-c3ow">56.17</td> |
|
<td class="tg-c3ow">38.24</td> |
|
<td class="tg-c3ow">35.76</td> |
|
<td class="tg-c3ow">57.51</td> |
|
<td class="tg-c3ow">52.20</td> |
|
<td class="tg-7btt">99.63</td> |
|
<td class="tg-c3ow">28.06</td> |
|
<td class="tg-c3ow">36.73</td> |
|
</tr> |
|
<tr> |
|
<td class="tg-abip"><a href="https://huggingface.co/facebook/hubert-base-ls960">facebook/hubert-base-ls960</a></td> |
|
<td class="tg-abip">S</td> |
|
<td class="tg-abip">58.90</td> |
|
<td class="tg-abip">67.28</td> |
|
<td class="tg-abip">24.53</td> |
|
<td class="tg-zwlc">40.48</td> |
|
<td class="tg-abip">54.63</td> |
|
<td class="tg-abip">38.78</td> |
|
<td class="tg-abip">36.65</td> |
|
<td class="tg-abip">58.46</td> |
|
<td class="tg-abip">65.28</td> |
|
<td class="tg-abip">99.58</td> |
|
<td class="tg-abip">33.75</td> |
|
<td class="tg-abip">40.48</td> |
|
</tr> |
|
<tr> |
|
<td class="tg-c3ow"><a href="https://huggingface.co/facebook/data2vec-audio-base">facebook/data2vec-audio-base</a></td> |
|
<td class="tg-c3ow">S</td> |
|
<td class="tg-c3ow">23.63</td> |
|
<td class="tg-c3ow">45.63</td> |
|
<td class="tg-c3ow">10.06</td> |
|
<td class="tg-c3ow">30.19</td> |
|
<td class="tg-c3ow">40.58</td> |
|
<td class="tg-c3ow">27.60</td> |
|
<td class="tg-c3ow">25.87</td> |
|
<td class="tg-c3ow">50.74</td> |
|
<td class="tg-c3ow">48.03</td> |
|
<td class="tg-c3ow">99.06</td> |
|
<td class="tg-7btt">43.57</td> |
|
<td class="tg-c3ow">27.27</td> |
|
</tr> |
|
<tr> |
|
<td class="tg-abip"><a href="https://huggingface.co/ALM/wav2vec2-base-audioset" target="_blank" rel="noopener noreferrer">ALM/wav2vec2-base-audioset</a></td> |
|
<td class="tg-abip">S</td> |
|
<td class="tg-abip">52.61</td> |
|
<td class="tg-abip">70.48</td> |
|
<td class="tg-abip"><span style="font-weight:400;font-style:normal">21.29</span></td> |
|
<td class="tg-abip">31.26</td> |
|
<td class="tg-abip">59.50</td> |
|
<td class="tg-abip"><span style="font-weight:400;font-style:normal">37.92</span></td> |
|
<td class="tg-abip">35.85</td> |
|
<td class="tg-abip">64.61</td> |
|
<td class="tg-abip">45.94</td> |
|
<td class="tg-abip">88.09</td> |
|
<td class="tg-abip">11.00</td> |
|
<td class="tg-abip">30.83</td> |
|
</tr> |
|
<tr> |
|
<td class="tg-wp8o"><a href="https://huggingface.co/ALM/hubert-base-audioset" target="_blank" rel="noopener noreferrer">ALM/hubert-base-audioset</a></td> |
|
<td class="tg-wp8o">S</td> |
|
<td class="tg-mqa1">68.80</td> |
|
<td class="tg-mqa1"><span style="font-style:normal">79.09</span></td> |
|
<td class="tg-mqa1"><span style="font-style:normal">31.05</span></td> |
|
<td class="tg-wp8o">40.06</td> |
|
<td class="tg-mqa1"><span style="font-style:normal">65.87</span></td> |
|
<td class="tg-mqa1"><span style="font-style:normal">43.44</span></td> |
|
<td class="tg-mqa1"><span style="font-style:normal">47.67</span></td> |
|
<td class="tg-mqa1">67.81</td> |
|
<td class="tg-wp8o">63.54</td> |
|
<td class="tg-wp8o"><span style="font-weight:400;font-style:normal">98.84</span></td> |
|
<td class="tg-wp8o"><span style="font-weight:400;font-style:normal">20.53</span></td> |
|
<td class="tg-wp8o">33.39</td> |
|
</tr> |
|
<tr> |
|
<td class="tg-abip"><a href="https://huggingface.co/facebook/wav2vec2-large-robust">facebook/wav2vec2-large-robust</a></td> |
|
<td class="tg-abip">M</td> |
|
<td class="tg-abip">13.13</td> |
|
<td class="tg-abip">42.70</td> |
|
<td class="tg-abip">5.80</td> |
|
<td class="tg-abip">22.01</td> |
|
<td class="tg-abip">41.71</td> |
|
<td class="tg-abip">20.95</td> |
|
<td class="tg-abip">19.91</td> |
|
<td class="tg-abip">50.23</td> |
|
<td class="tg-abip">11.57</td> |
|
<td class="tg-abip">45.74</td> |
|
<td class="tg-abip">7.33</td> |
|
<td class="tg-abip">19.27</td> |
|
</tr> |
|
<tr> |
|
<td class="tg-c3ow"><a href="https://huggingface.co/facebook/wav2vec2-xls-r-300m">facebook/wav2vec2-xls-r-300m</a></td> |
|
<td class="tg-c3ow">M</td> |
|
<td class="tg-c3ow">51.28</td> |
|
<td class="tg-c3ow">69.96</td> |
|
<td class="tg-c3ow">23.71</td> |
|
<td class="tg-c3ow">36.28</td> |
|
<td class="tg-c3ow">56.96</td> |
|
<td class="tg-c3ow">38.28</td> |
|
<td class="tg-c3ow">38.42</td> |
|
<td class="tg-c3ow">66.71</td> |
|
<td class="tg-c3ow">31.48</td> |
|
<td class="tg-c3ow">98.88</td> |
|
<td class="tg-c3ow">12.74</td> |
|
<td class="tg-c3ow">20.35</td> |
|
</tr> |
|
<tr> |
|
<td class="tg-abip"><a href="https://huggingface.co/microsoft/wavlm-large">microsoft/wavlm-large</a></td> |
|
<td class="tg-abip">M</td> |
|
<td class="tg-abip">67.20</td> |
|
<td class="tg-abip">70.92</td> |
|
<td class="tg-abip">32.21</td> |
|
<td class="tg-abip">42.51</td> |
|
<td class="tg-abip">61.13</td> |
|
<td class="tg-abip">41.29</td> |
|
<td class="tg-abip">42.53</td> |
|
<td class="tg-abip">68.00</td> |
|
<td class="tg-abip">71.76</td> |
|
<td class="tg-abip">99.75</td> |
|
<td class="tg-abip">42.34</td> |
|
<td class="tg-zwlc">45.29</td> |
|
</tr> |
|
<tr> |
|
<td class="tg-c3ow"><a href="https://huggingface.co/facebook/hubert-large-ll60k">facebook/hubert-large-ll60k</a></td> |
|
<td class="tg-c3ow">M</td> |
|
<td class="tg-c3ow">63.98</td> |
|
<td class="tg-c3ow">70.00</td> |
|
<td class="tg-c3ow">29.51</td> |
|
<td class="tg-c3ow">40.95</td> |
|
<td class="tg-c3ow">54.79</td> |
|
<td class="tg-c3ow">38.36</td> |
|
<td class="tg-c3ow">36.81</td> |
|
<td class="tg-c3ow">64.08</td> |
|
<td class="tg-c3ow">72.57</td> |
|
<td class="tg-7btt">99.95</td> |
|
<td class="tg-7btt">45.26</td> |
|
<td class="tg-c3ow">43.76</td> |
|
</tr> |
|
<tr> |
|
<td class="tg-abip"><a href="https://huggingface.co/facebook/data2vec-audio-large">facebook/data2vec-audio-large</a></td> |
|
<td class="tg-abip">M</td> |
|
<td class="tg-abip">25.35</td> |
|
<td class="tg-abip">49.15</td> |
|
<td class="tg-abip">10.82</td> |
|
<td class="tg-abip">30.57</td> |
|
<td class="tg-abip">43.46</td> |
|
<td class="tg-abip">28.52</td> |
|
<td class="tg-abip">27.08</td> |
|
<td class="tg-abip">44.20</td> |
|
<td class="tg-abip">45.14</td> |
|
<td class="tg-abip">99.15</td> |
|
<td class="tg-abip">28.60</td> |
|
<td class="tg-abip">23.07</td> |
|
</tr> |
|
<tr> |
|
<td class="tg-baqh"><a href="https://huggingface.co/ALM/wav2vec2-large-audioset" target="_blank" rel="noopener noreferrer">ALM/wav2vec2-large-audioset</a></td> |
|
<td class="tg-baqh">M</td> |
|
<td class="tg-amwm">74.39</td> |
|
<td class="tg-amwm"><span style="font-style:normal">79.00</span></td> |
|
<td class="tg-amwm">37.58</td> |
|
<td class="tg-baqh">39.65</td> |
|
<td class="tg-baqh"><span style="font-weight:400;font-style:normal">66.58</span></td> |
|
<td class="tg-amwm"><span style="font-style:normal">44.51</span></td> |
|
<td class="tg-baqh">49.87</td> |
|
<td class="tg-baqh"><span style="font-style:normal">76.90</span></td> |
|
<td class="tg-baqh"><span style="font-weight:400;font-style:normal">59.49</span></td> |
|
<td class="tg-baqh">99.42</td> |
|
<td class="tg-baqh"><span style="font-weight:400;font-style:normal">17.74</span></td> |
|
<td class="tg-baqh">38.20</td> |
|
</tr> |
|
<tr> |
|
<td class="tg-dzk6"><a href="https://huggingface.co/ALM/hubert-large-audioset" target="_blank" rel="noopener noreferrer">ALM/hubert-large-audioset</a></td> |
|
<td class="tg-dzk6">M</td> |
|
<td class="tg-dzk6"><span style="font-weight:400;font-style:normal">71.52</span></td> |
|
<td class="tg-dzk6">75.63</td> |
|
<td class="tg-dzk6">37.41</td> |
|
<td class="tg-0vih">44.28</td> |
|
<td class="tg-0vih"><span style="font-style:normal">67.54</span></td> |
|
<td class="tg-dzk6"><span style="font-weight:400;font-style:normal">43.35</span></td> |
|
<td class="tg-0vih"><span style="font-style:normal">50.46</span></td> |
|
<td class="tg-0vih"><span style="font-style:normal">77.82</span></td> |
|
<td class="tg-0vih"><span style="font-style:normal">73.26</span></td> |
|
<td class="tg-dzk6">99.59</td> |
|
<td class="tg-dzk6">20.46</td> |
|
<td class="tg-dzk6">38.61</td> |
|
</tr> |
|
<tr> |
|
<td class="tg-c3ow"><a href="https://huggingface.co/facebook/wav2vec2-xls-r-1b">facebook/wav2vec2-xls-r-1b</a></td> |
|
<td class="tg-c3ow">L</td> |
|
<td class="tg-c3ow">66.95</td> |
|
<td class="tg-c3ow">75.90</td> |
|
<td class="tg-c3ow">31.61</td> |
|
<td class="tg-c3ow">40.41</td> |
|
<td class="tg-c3ow">62.79</td> |
|
<td class="tg-c3ow">41.99</td> |
|
<td class="tg-c3ow">43.57</td> |
|
<td class="tg-c3ow">69.79</td> |
|
<td class="tg-c3ow">55.44</td> |
|
<td class="tg-c3ow">99.86</td> |
|
<td class="tg-c3ow">25.14</td> |
|
<td class="tg-c3ow">34.58</td> |
|
</tr> |
|
<tr> |
|
<td class="tg-abip"><a href="https://huggingface.co/facebook/hubert-xlarge-ll60k">facebook/hubert-xlarge-ll60k</a></td> |
|
<td class="tg-abip">L</td> |
|
<td class="tg-abip">63.40</td> |
|
<td class="tg-abip">69.66</td> |
|
<td class="tg-abip">29.32</td> |
|
<td class="tg-abip">42.72</td> |
|
<td class="tg-abip">56.25</td> |
|
<td class="tg-abip">37.76</td> |
|
<td class="tg-abip">37.30</td> |
|
<td class="tg-abip">64.71</td> |
|
<td class="tg-zwlc">75.69</td> |
|
<td class="tg-zwlc">99.95</td> |
|
<td class="tg-zwlc">47.81</td> |
|
<td class="tg-zwlc">47.17</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
</body> |
|
</html> |
|
|