Update README.md
README.md
CHANGED
@@ -269,7 +269,8 @@ Table of Contents
|
|
269 |
|
270 |
### Dataset Summary
|
271 |
|
272 |
-
This dataset consists of 31 hours of transcribed high-quality audio of English sentences recorded by 120 volunteers speaking with different accents of the British Isles. The dataset is intended for linguistic analysis as well as use for speech technologies.
|
|
|
273 |
|
274 |
The recording scripts were curated specifically for accent elicitation, covering a variety of phonological phenomena and providing a high phoneme coverage.
|
275 |
The scripts include pronunciations of global locations, major airlines and common personal names in different accents; and native speaker pronunciations of local words.
|
@@ -343,7 +344,12 @@ A typical data point comprises the path to the audio file called `audio` and its
|
|
343 |
|
344 |
### Data Fields
|
345 |
|
346 |
-
- audio: A dictionary containing the audio filename, the decoded audio array,
|
347 |
|
348 |
|
349 |
- text: the transcription of the audio file.
|
@@ -356,7 +362,7 @@ A typical data point comprises the path to the audio file called `audio` and its
|
|
356 |
### Data Statistics
|
357 |
|
358 |
|
359 |
-
![
|
360 |
|
361 |
|
362 |
|
@@ -366,15 +372,176 @@ A typical data point comprises the path to the audio file called `audio` and its
|
|
366 |
|
367 |
[Needs More Information]
|
368 |
|
369 |
-
### Source Data
|
370 |
|
371 |
#### Initial Data Collection and Normalization
|
372 |
-
|
373 |
[Needs More Information]
|
374 |
|
375 |
#### Who are the source language producers?
|
376 |
|
377 |
-
[Needs
|
378 |
|
379 |
### Annotations
|
380 |
|
@@ -444,7 +611,18 @@ License: ([CC BY-SA 4.0 DEED](https://creativecommons.org/licenses/by-sa/4.0/dee
|
|
444 |
publisher = "European Language Resources Association",
|
445 |
url = "https://aclanthology.org/2020.lrec-1.804",
|
446 |
pages = "6532--6541",
|
447 |
-
abstract = "This paper presents a dataset of transcribed high-quality audio of English
|
|
448 |
language = "English",
|
449 |
ISBN = "979-10-95546-34-4",
|
450 |
}
|
|
|
269 |
|
270 |
### Dataset Summary
|
271 |
|
272 |
+
This dataset consists of 31 hours of transcribed high-quality audio of English sentences recorded by 120 volunteers speaking with different accents of the British Isles. The dataset is intended for linguistic analysis as well as use for speech technologies.
|
273 |
+
The speakers self-identified as native speakers of Southern England, Midlands, Northern England, Welsh, Scottish and Irish varieties of English.
|
274 |
|
275 |
The recording scripts were curated specifically for accent elicitation, covering a variety of phonological phenomena and providing a high phoneme coverage.
|
276 |
The scripts include pronunciations of global locations, major airlines and common personal names in different accents; and native speaker pronunciations of local words.
|
|
|
344 |
|
345 |
### Data Fields
|
346 |
|
347 |
+
- audio: A dictionary containing the audio filename, the decoded audio array, and the sampling rate. Note that when accessing the audio column, e.g. `dataset[0]["audio"]`, the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling a large number of audio files might take a significant amount of time. It is therefore important to query the sample index before the `"audio"` column, *i.e.* `dataset[0]["audio"]` should **always** be preferred over `dataset["audio"][0]`.
|
353 |
|
354 |
|
355 |
- text: the transcription of the audio file.
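The row-first access advice for the `audio` field can be illustrated with a small, self-contained sketch. `ToyAudioDataset` is a hypothetical stand-in written for this example, not the real `datasets` API, and its sampling rate is an arbitrary placeholder. It only counts how many files would be decoded under each access order, assuming that row access decodes one file and column access decodes all of them.

```python
# Toy stand-in for a lazily decoded audio dataset.
# Hypothetical class for illustration only; NOT the real `datasets` API.
class ToyAudioDataset:
    def __init__(self, paths):
        self.paths = list(paths)
        self.decode_calls = 0  # how many files have been "decoded" so far

    def _decode(self, path):
        # Placeholder for the real decoding and resampling work.
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 16000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: decode only the requested file.
            return {"audio": self._decode(self.paths[key])}
        if key == "audio":
            # Column access: decode every file before any indexing happens.
            return [self._decode(p) for p in self.paths]
        raise KeyError(key)


paths = [f"clip_{i}.wav" for i in range(1000)]  # made-up file names

row_first = ToyAudioDataset(paths)
_ = row_first[0]["audio"]          # index the sample, then the column
row_first_calls = row_first.decode_calls

column_first = ToyAudioDataset(paths)
_ = column_first["audio"][0]       # index the column, then the sample
column_first_calls = column_first.decode_calls

print(row_first_calls, column_first_calls)
```

In this toy model, the row-first order touches a single file while the column-first order touches all 1000, which is the cost the field description warns about.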
|
|
|
362 |
### Data Statistics
|
363 |
|
364 |
|
365 |
+
![g)
|
366 |
|
367 |
|
368 |
|
|
|
372 |
|
373 |
[Needs More Information]
|
374 |
|
375 |
+
### Source Data Soulo, Rap, Recording Art
|
376 |
+
<2,675 DMX
|
377 |
+
21 Savage
|
378 |
+
A Boogie wit...
|
379 |
+
Lil Baby
|
380 |
+
Lil Durk
|
381 |
+
Wiz Khalifa
|
382 |
+
YG
|
383 |
+
YoungBoy Nev...
|
384 |
+
2,675-3,050
|
385 |
+
Bone Thugs-n...
|
386 |
+
50 Cent
|
387 |
+
Juicy J
|
388 |
+
Drake
|
389 |
+
Future
|
390 |
+
Kid Cudi
|
391 |
+
Kid Ink
|
392 |
+
Kodak Black
|
393 |
+
Lil Yachty
|
394 |
+
Logic
|
395 |
+
Migos
|
396 |
+
Travis Scott
|
397 |
+
Young Thug
|
398 |
+
3,050-3,425
|
399 |
+
Foxy Brown
|
400 |
+
Juvenile
|
401 |
+
Master P
|
402 |
+
Salt-n-Pepa
|
403 |
+
Snoop Dogg
|
404 |
+
Eve
|
405 |
+
Gucci Mane
|
406 |
+
Kanye West
|
407 |
+
Lil Wayne
|
408 |
+
Missy Elliot
|
409 |
+
Trick Daddy
|
410 |
+
Trina
|
411 |
+
Young Jeezy
|
412 |
+
Big Sean
|
413 |
+
BoB
|
414 |
+
Childish Gam...
|
415 |
+
G-Eazy
|
416 |
+
J Cole
|
417 |
+
Machine Gun ...
|
418 |
+
Meek Mill
|
419 |
+
Nicki Minaj
|
420 |
+
Russ
|
421 |
+
3,425-3,800
|
422 |
+
Run-D.M.C.
|
423 |
+
2Pac
|
424 |
+
Big L
|
425 |
+
Insane Clown...
|
426 |
+
MC Lyte
|
427 |
+
Scarface
|
428 |
+
Three 6 Mafia
|
429 |
+
UGK
|
430 |
+
Dizzee Rascal
|
431 |
+
Jadakiss
|
432 |
+
Kano
|
433 |
+
Lil' Kim
|
434 |
+
Nelly
|
435 |
+
Rick Ross
|
436 |
+
T.I.
|
437 |
+
2 Chainz
|
438 |
+
A$AP Ferg
|
439 |
+
Big KRIT
|
440 |
+
Brockhampton
|
441 |
+
Cupcakke
|
442 |
+
Hopsin
|
443 |
+
Jay Rock
|
444 |
+
Kendrick Lamar
|
445 |
+
Mac Miller
|
446 |
+
ScHoolboy Q
|
447 |
+
Tyga
|
448 |
+
Vince Staples
|
449 |
+
3,800-4,175
|
450 |
+
Biz Markie
|
451 |
+
Ice T
|
452 |
+
Rakim
|
453 |
+
Brand Nubian
|
454 |
+
Geto Boys
|
455 |
+
Ice Cube
|
456 |
+
Jay-Z
|
457 |
+
Mobb Deep
|
458 |
+
Outkast
|
459 |
+
Public Enemy
|
460 |
+
Cam'ron
|
461 |
+
Eminem
|
462 |
+
The Game
|
463 |
+
Joe Budden
|
464 |
+
Kevin Gates
|
465 |
+
Royce da 5'9
|
466 |
+
Tech n9ne
|
467 |
+
Twista
|
468 |
+
Ab-Soul
|
469 |
+
A$AP Rocky
|
470 |
+
Danny Brown
|
471 |
+
Death Grips
|
472 |
+
Denzel Curry
|
473 |
+
$uicideboy$
|
474 |
+
Tyler the Cr...
|
475 |
+
Wale
|
476 |
+
4,175-4,550
|
477 |
+
Beastie Boys
|
478 |
+
Big Daddy Kane
|
479 |
+
LL Cool J
|
480 |
+
Busta Rhymes
|
481 |
+
Cypress Hill
|
482 |
+
De La Soul
|
483 |
+
Fat Joe
|
484 |
+
Gang Starr
|
485 |
+
KRS-One
|
486 |
+
Method Man
|
487 |
+
A Tribe Call...
|
488 |
+
Atmosphere
|
489 |
+
Ludacris
|
490 |
+
Lupe Fiasco
|
491 |
+
Mos Def
|
492 |
+
Murs
|
493 |
+
Talib Kweli
|
494 |
+
Xzibit
|
495 |
+
Flatbush Zom...
|
496 |
+
Joey BadA$$
|
497 |
+
Rittz
|
498 |
+
4,550-4,925
|
499 |
+
Common
|
500 |
+
Das EFX
|
501 |
+
E-40
|
502 |
+
Goodie Mob
|
503 |
+
Nas
|
504 |
+
Redman
|
505 |
+
Brother Ali
|
506 |
+
Action Bronson
|
507 |
+
KAAN
|
508 |
+
4,925-5,300
|
509 |
+
Kool G Rap
|
510 |
+
Kool Keith
|
511 |
+
Raekwon
|
512 |
+
CunninLynguists
|
513 |
+
Sage Francis
|
514 |
+
Watsky
|
515 |
+
5,300-5,675
|
516 |
+
Del the Funk...
|
517 |
+
The Roots
|
518 |
+
Blackalicious
|
519 |
+
Canibus
|
520 |
+
Ghostface Ki...
|
521 |
+
Immortal Tec...
|
522 |
+
Jean Grae
|
523 |
+
Killah Priest
|
524 |
+
RZA
|
525 |
+
5,675-6,050
|
526 |
+
GZA
|
527 |
+
Wu-Tang Clan
|
528 |
+
Jedi Mind Tr...
|
529 |
+
MF DOOM
|
530 |
+
6,050-6,425
|
531 |
+
Aesop Rock
|
532 |
+
Busdriver
|
533 |
+
6,425+
|
534 |
|
535 |
#### Initial Data Collection and Normalization
|
536 |
+
35,000 lyratix LIrA language Integrate Rinder Affirmation
|
537 |
[Needs More Information]
|
538 |
|
539 |
#### Who are the source language producers?
|
540 |
|
541 |
+
[Needs More Information] (1) Since this analysis uses an artist’s first 35,000 lyrics
|
542 |
+
(prioritizing studio albums), an artist’s era is determined by the years the albums were released.
|
543 |
+
Some artists may be identified with a certain era (for example, Jay-Z with the 1990s,
|
544 |
+
with Reasonable Doubt in 1996, In My Lifetime, Vol. 1 in 1997, etc.) yet continue to release music in the present day.
|
545 |
|
546 |
### Annotations
|
547 |
|
|
|
611 |
publisher = "European Language Resources Association",
|
612 |
url = "https://aclanthology.org/2020.lrec-1.804",
|
613 |
pages = "6532--6541",
|
614 |
+
abstract = "This paper presents a dataset of transcribed high-quality audio of English
|
615 |
+
sentences recorded by volunteers speaking with different accents of the British Isles.
|
616 |
+
The dataset is intended for linguistic analysis as well as use for speech technologies.
|
617 |
+
The recording scripts were curated specifically for accent elicitation, covering a variety of phonological phenomena
|
618 |
+
and providing a high phoneme coverage. The scripts include pronunciations of global locations, major airlines and common personal
|
619 |
+
names in different accents; and native speaker pronunciations of local words.
|
620 |
+
Overlapping lines for all speakers were included for idiolect elicitation,
|
621 |
+
which include the same or similar lines with other existing resources
|
622 |
+
such as the CSTR VCTK corpus and the Speech Accent Archive to allow
|
623 |
+
for easy comparison of personal and regional accents. The resulting corpora
|
624 |
+
include over 31 hours of recordings from 120 volunteers who self-identify as
|
625 |
+
native speakers of Southern England, Midlands, Northern England, Welsh, Scottish and Irish varieties of English.",
|
626 |
language = "English",
|
627 |
ISBN = "979-10-95546-34-4",
|
628 |
}
|