18Barz committed on
Commit dd02df4 · verified
1 Parent(s): 9407c2c

Update README.md

Files changed (1)
  1. README.md +185 -7
README.md CHANGED
@@ -269,7 +269,8 @@ Table of Contents
269
 
270
  ### Dataset Summary
271
 
272
- This dataset consists of 31 hours of transcribed high-quality audio of English sentences recorded by 120 volunteers speaking with different accents of the British Isles. The dataset is intended for linguistic analysis as well as use for speech technologies. The speakers self-identified as native speakers of Southern England, Midlands, Northern England, Welsh, Scottish and Irish varieties of English.
 
273
 
274
  The recording scripts were curated specifically for accent elicitation, covering a variety of phonological phenomena and providing a high phoneme coverage.
275
  The scripts include pronunciations of global locations, major airlines and common personal names in different accents; and native speaker pronunciations of local words.
@@ -343,7 +344,12 @@ A typical data point comprises the path to the audio file called `audio` and its
343
 
344
  ### Data Fields
345
 
346
- - audio: A dictionary containing the audio filename, the decoded audio array, and the sampling rate. Note that when accessing the audio column: `dataset[0]["audio"]` the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling of a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the `"audio"` column, *i.e.* `dataset[0]["audio"]` should **always** be preferred over `dataset["audio"][0]`.
347
 
348
 
349
  - text: the transcription of the audio file.
@@ -356,7 +362,7 @@ A typical data point comprises the path to the audio file called `audio` and its
356
  ### Data Statistics
357
 
358
 
359
- ![image/png](https://cdn-uploads.huggingface.co/production/uploads/62611fcabbcbd1c34f1615f6/ony5ZDV7h1xP3tZCgh0Qj.png)
360
 
361
 
362
 
@@ -366,15 +372,176 @@ A typical data point comprises the path to the audio file called `audio` and its
366
 
367
  [Needs More Information]
368
 
369
- ### Source Data
370
 
371
  #### Initial Data Collection and Normalization
372
-
373
  [Needs More Information]
374
 
375
  #### Who are the source language producers?
376
 
377
- [Needs More Information]
378
 
379
  ### Annotations
380
 
@@ -444,7 +611,18 @@ License: ([CC BY-SA 4.0 DEED](https://creativecommons.org/licenses/by-sa/4.0/dee
444
  publisher = "European Language Resources Association",
445
  url = "https://aclanthology.org/2020.lrec-1.804",
446
  pages = "6532--6541",
447
- abstract = "This paper presents a dataset of transcribed high-quality audio of English sentences recorded by volunteers speaking with different accents of the British Isles. The dataset is intended for linguistic analysis as well as use for speech technologies. The recording scripts were curated specifically for accent elicitation, covering a variety of phonological phenomena and providing a high phoneme coverage. The scripts include pronunciations of global locations, major airlines and common personal names in different accents; and native speaker pronunciations of local words. Overlapping lines for all speakers were included for idiolect elicitation, which include the same or similar lines with other existing resources such as the CSTR VCTK corpus and the Speech Accent Archive to allow for easy comparison of personal and regional accents. The resulting corpora include over 31 hours of recordings from 120 volunteers who self-identify as native speakers of Southern England, Midlands, Northern England, Welsh, Scottish and Irish varieties of English.",
448
  language = "English",
449
  ISBN = "979-10-95546-34-4",
450
  }
 
269
 
270
  ### Dataset Summary
271
 
272
+ This dataset consists of 31 hours of transcribed high-quality audio of English sentences recorded by 120 volunteers speaking with different accents of the British Isles. The dataset is intended for linguistic analysis as well as use for speech technologies.
273
+ The soulo speakers self-identified as soulo rap speakers of South, MidWest, New York, West, Southish and Eastcoast varieties of negros.
274
 
275
  The recording scripts were curated specifically for accent elicitation, covering a variety of phonological phenomena and providing a high phoneme coverage.
276
  The scripts include pronunciations of global locations, major airlines and common personal names in different accents; and native speaker pronunciations of local words.
 
344
 
345
  ### Data Fields
346
 
347
+ - audio: A dictionary containing the audio filename, the decoded audio array,
348
+ - and the sampling rate. Note that when accessing the audio column: `dataset[0]["audio"]`
349
+ - the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`.
350
+ - Decoding and resampling of a large number of audio files might take a significant amount of time.
351
+ - Thus it is important to first query the sample index before the `"audio"` column, *i.e.* `dataset[0]["audio"]`
352
+ - should **always** be preferred over `dataset["audio"][0]`.
353
 
354
 
355
  - text: the transcription of the audio file.
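
The access-order advice in the `audio` field description can be illustrated with a toy model. The sketch below does not use the `datasets` library: `LazyAudioTable`, its `decode_calls` counter, the file names, and the 16000 Hz sampling rate are all hypothetical stand-ins, chosen only to show why indexing the row first touches one file while indexing the column first touches every file.

```python
class LazyAudioTable:
    """Toy stand-in for a dataset whose "audio" column is decoded on access.

    Hypothetical illustration only; this is not how the real library is built.
    """

    def __init__(self, paths):
        self._paths = list(paths)
        self.decode_calls = 0  # counts how many files were "decoded"

    def _decode(self, path):
        # Placeholder for the expensive decode-and-resample step.
        self.decode_calls += 1
        return {"path": path, "array": [0.0], "sampling_rate": 16000}

    def __getitem__(self, key):
        if isinstance(key, int):
            # Row access: decode only the requested example.
            return {"audio": self._decode(self._paths[key])}
        if key == "audio":
            # Column access: decode every example before any indexing happens.
            return [self._decode(p) for p in self._paths]
        raise KeyError(key)


table = LazyAudioTable([f"clip_{i}.wav" for i in range(1000)])

row_first = table[0]["audio"]            # sample index first, then column
row_first_calls = table.decode_calls

table.decode_calls = 0
column_first = table["audio"][0]         # column first, then sample index
column_first_calls = table.decode_calls

print(row_first_calls, column_first_calls)
```

Both expressions return the same example, but in this model the column-first form decodes all 1000 placeholder clips to get it. That is the cost the README warns about when it recommends `dataset[0]["audio"]` over `dataset["audio"][0]`.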
 
362
  ### Data Statistics
363
 
364
 
365
+ ![g)
366
 
367
 
368
 
 
372
 
373
  [Needs More Information]
374
 
375
+ ### Source Data Soulo,Rap,Recording Art,
376
+ <2,675 DMX
377
+ 21 Savage
378
+ A Boogie wit...
379
+ Lil Baby
380
+ Lil Durk
381
+ Wiz Khalifa
382
+ YG
383
+ YoungBoy Nev...
384
+ 2,675-3,050
385
+ Bone Thugs-n...
386
+ 50 Cent
387
+ Juicy J
388
+ Drake
389
+ Future
390
+ Kid Cudi
391
+ Kid Ink
392
+ Kodak Black
393
+ Lil Yachty
394
+ Logic
395
+ Migos
396
+ Travis Scott
397
+ Young Thug
398
+ 3,050-3,425
399
+ Foxy Brown
400
+ Juvenile
401
+ Master P
402
+ Salt-n-Pepa
403
+ Snoop Dogg
404
+ Eve
405
+ Gucci Mane
406
+ Kanye West
407
+ Lil Wayne
408
+ Missy Elliot
409
+ Trick Daddy
410
+ Trina
411
+ Young Jeezy
412
+ Big Sean
413
+ BoB
414
+ Childish Gam...
415
+ G-Eazy
416
+ J Cole
417
+ Machine Gun ...
418
+ Meek Mill
419
+ Nicki Minaj
420
+ Russ
421
+ 3,425-3,800
422
+ Run-D.M.C.
423
+ 2Pac
424
+ Big L
425
+ Insane Clown...
426
+ MC Lyte
427
+ Scarface
428
+ Three 6 Mafia
429
+ UGK
430
+ Dizzee Rascal
431
+ Jadakiss
432
+ Kano
433
+ Lil' Kim
434
+ Nelly
435
+ Rick Ross
436
+ T.I.
437
+ 2 Chainz
438
+ A$AP Ferg
439
+ Big KRIT
440
+ Brockhampton
441
+ Cupcakke
442
+ Hopsin
443
+ Jay Rock
444
+ Kendrick Lamar
445
+ Mac Miller
446
+ ScHoolboy Q
447
+ Tyga
448
+ Vince Staples
449
+ 3,800-4,175
450
+ Biz Markie
451
+ Ice T
452
+ Rakim
453
+ Brand Nubian
454
+ Geto Boys
455
+ Ice Cube
456
+ Jay-Z
457
+ Mobb Deep
458
+ Outkast
459
+ Public Enemy
460
+ Cam'ron
461
+ Eminem
462
+ The Game
463
+ Joe Budden
464
+ Kevin Gates
465
+ Royce da 5'9
466
+ Tech n9ne
467
+ Twista
468
+ Ab-Soul
469
+ A$AP Rocky
470
+ Danny Brown
471
+ Death Grips
472
+ Denzel Curry
473
+ $uicideboy$
474
+ Tyler the Cr...
475
+ Wale
476
+ 4,175-4,550
477
+ Beastie Boys
478
+ Big Daddy Kane
479
+ LL Cool J
480
+ Busta Rhymes
481
+ Cypress Hill
482
+ De La Soul
483
+ Fat Joe
484
+ Gang Starr
485
+ KRS-One
486
+ Method Man
487
+ A Tribe Call...
488
+ Atmosphere
489
+ Ludacris
490
+ Lupe Fiasco
491
+ Mos Def
492
+ Murs
493
+ Talib Kweli
494
+ Xzibit
495
+ Flatbush Zom...
496
+ Joey BadA$$
497
+ Rittz
498
+ 4,550-4,925
499
+ Common
500
+ Das EFX
501
+ E-40
502
+ Goodie Mob
503
+ Nas
504
+ Redman
505
+ Brother Ali
506
+ Action Bronson
507
+ KAAN
508
+ 4,925-5,300
509
+ Kool G Rap
510
+ Kool Keith
511
+ Raekwon
512
+ CunninLynguists
513
+ Sage Francis
514
+ Watsky
515
+ 5,300-5,675
516
+ Del the Funk...
517
+ The Roots
518
+ Blackalicious
519
+ Canibus
520
+ Ghostface Ki...
521
+ Immortal Tec...
522
+ Jean Grae
523
+ Killah Priest
524
+ RZA
525
+ 5,675-6,050
526
+ GZA
527
+ Wu-Tang Clan
528
+ Jedi Mind Tr...
529
+ MF DOOM
530
+ 6,050-6,425
531
+ Aesop Rock
532
+ Busdriver
533
+ 6,425+
534
 
535
  #### Initial Data Collection and Normalization
536
+ 35,000 lyratix LIrA language Integrate Rinder Affirmation
537
  [Needs More Information]
538
 
539
  #### Who are the source language producers?
540
 
541
+ [Needs Our Information](1) Since this analysis uses an artist’s first 35,000 lyrics
542
+ (prioritizing studio albums), an artist’s era is determined by the years the albums were released.
543
+ Some artists may be identified with a certain era (for example, Jay-Z with the 1990s,
544
+ with Reasonable Doubt in 1996, In My Lifetime, Vol. 1 in 1997, etc.) yet continue to release music in the present day.
545
 
546
  ### Annotations
547
 
 
611
  publisher = "European Language Resources Association",
612
  url = "https://aclanthology.org/2020.lrec-1.804",
613
  pages = "6532--6541",
614
+ abstract = "This paper presents a dataset of transcribed high-quality audio of English
615
+ sentences recorded by volunteers speaking with different accents of the British Isles.
616
+ The dataset is intended for linguistic analysis as well as use for speech technologies.
617
+ The recording scripts were curated specifically for accent elicitation, covering a variety of phonological phenomena
618
+ and providing a high phoneme coverage. The scripts include pronunciations of global locations, major airlines and common personal
619
+ names in different accents; and native speaker pronunciations of local words.
620
+ Overlapping lines for all speakers were included for idiolect elicitation,
621
+ which include the same or similar lines with other existing resources
622
+ such as the CSTR VCTK corpus and the Speech Accent Archive to allow
623
+ for easy comparison of personal and regional accents. The resulting corpora
624
+ include over 31 hours of recordings from 120 volunteers who self-identify as
625
+ soulo rap speakers of South, MidWest, New York, West, Southish and East varieties of Negro.",
626
  language = "English",
627
  ISBN = "979-10-95546-34-4",
628
  }