# Training a Tokenizer with Byte-Pair Encoding (BPE) for the Hindi Language

Hugging Face Spaces: https://huggingface.co/spaces/nishantb06/hindi-tokenizer-bpe-v2

This repository contains the code for training a Byte-Pair Encoding (BPE) tokenizer for the Hindi language. The tokenizer is trained on a dataset of Hindi text and converts text into a sequence of tokens.

### Final compression ratio: 10.18X
### Vocab size: 5000
## Regex pattern used:
`HINDI_SPLIT_PATTERN_V2 = r'\s*(?:[\u0900-\u097F\u0981-\u0983]+|\d+|[^\s\w\u0900-\u097F\u0981-\u0983])'`

Why is the regex pattern used?

When working with languages other than English, it is important to pre-split the text with a regex pattern. Without it, bytes belonging to the same token can be split apart, which creates a lot of unknown tokens. The pattern therefore first splits the text on spaces and keeps Hindi words, verbs included, in one piece. It also handles numbers and other special characters as chunks of their own.

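To make the effect of the pattern concrete, here is a minimal sketch that applies `HINDI_SPLIT_PATTERN_V2` with Python's standard `re` module. The sample sentence and the variable names `text` and `chunks` are illustrative assumptions; the repository's own training code may apply the pattern differently.

```python
import re

# Pattern copied from this README.
HINDI_SPLIT_PATTERN_V2 = r'\s*(?:[\u0900-\u097F\u0981-\u0983]+|\d+|[^\s\w\u0900-\u097F\u0981-\u0983])'

# Illustrative sample: Hindi words, an ASCII number, and a punctuation mark.
text = "\u0930\u093e\u092e \u0928\u0947 12 \u0906\u092e \u0916\u093e\u090f!"

# Each match is one chunk: optional leading whitespace followed by either
# a run of Devanagari characters, a run of digits, or one other symbol.
chunks = re.findall(HINDI_SPLIT_PATTERN_V2, text)

for chunk in chunks:
    print(repr(chunk), chunk.encode("utf-8"))
```

In this sketch every Devanagari word stays in a single chunk together with its leading space, while the number and the exclamation mark become chunks of their own. Note that none of the three alternatives matches Latin letters, so `re.findall` skips them; that is worth keeping in mind for mixed Hindi and English text.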
### logs
```
compression ratio: 10.18X
merge 4691/4744: (4945, 260) -> 4946 (b' \xe0\xa4\x97\xe0\xa4\xa1\xe0\xa5\x8d\xe0\xa4\xa2\xe0\xa5\x87') had 4 occurrences
compression ratio: 10.18X
merge 4696/4744: (320, 610) -> 4951 (b'\xe0\xa4\xaa\xe0\xa5\x81\xe0\xa4\xa4\xe0\xa5\x8d\xe0\xa4\xb0') had 4 occurrences
compression ratio: 10.18X
merge 4701/4744: (1351, 291) -> 4956 (b'\n\xe0\xa4\xb9\xe0\xa4\xae\xe0\xa4\xa8\xe0\xa5\x87') had 4 occurrences
compression ratio: 10.18X
merge 4706/4744: (3077, 445) -> 4961 (b' \xe0\xa4\xa8\xe0\xa4\xbf\xe0\xa4\xb0\xe0\xa5\x8d\xe0\xa4\xac\xe0\xa5\x81\xe0\xa4\xa6\xe0\xa5\x8d\xe0\xa4\xa7\xe0\xa4\xbf\xe0\xa4\xaf\xe0\xa5\x8b\xe0\xa4\x82') had 4 occurrences
compression ratio: 10.18X
merge 4711/4744: (4965, 2081) -> 4966 (b' \xe0\xa4\xb8\xe0\xa5\x83\xe0\xa4\x9c\xe0\xa4\xa8\xe0\xa4\xb9\xe0\xa4\xbe\xe0\xa4\xb0') had 4 occurrences
compression ratio: 10.18X
merge 4716/4744: (278, 298) -> 4971 (b' \xe0\xa4\xb8\xe0\xa4\xbe\xe0\xa4\xb0') had 4 occurrences
compression ratio: 10.19X
merge 4721/4744: (4975, 2672) -> 4976 (b' \xe0\xa4\xaa\xe0\xa5\x8d\xe0\xa4\xb0\xe0\xa4\xa4\xe0\xa4\xbf\xe0\xa4\xb5\xe0\xa4\xb0\xe0\xa5\x8d\xe0\xa4\xb7') had 4 occurrences
compression ratio: 10.19X
merge 4726/4744: (10, 822) -> 4981 (b'\n\xe0\xa4\xb9\xe0\xa4\xbe\xe0\xa4\x81') had 4 occurrences
compression ratio: 10.19X
merge 4731/4744: (1639, 260) -> 4986 (b' \xe0\xa4\x85\xe0\xa4\x9a\xe0\xa4\xae\xe0\xa5\x8d\xe0\xa4\xad\xe0\xa5\x87') had 4 occurrences
compression ratio: 10.19X
merge 4736/4744: (364, 1150) -> 4991 (b'\xe0\xa4\xbe\xe0\xa4\xb2\xe0\xa5\x80\xe0\xa4\xb8') had 4 occurrences
compression ratio: 10.19X
merge 4741/4744: (4995, 645) -> 4996 (b' \xe0\xa4\x96\xe0\xa4\xbf\xe0\xa4\xa1\xe0\xa4\xbc\xe0\xa4\x95\xe0\xa5\x80') had 4 occurrences
compression ratio: 10.19X
Training took 6005.98 seconds
```

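The log lines above can be read with a small sketch of one BPE merge step. It assumes the tokenizer starts from the 256 raw byte values (inferred from 5000 - 4744 = 256) and that the compression ratio is the number of UTF-8 bytes divided by the number of tokens; both are assumptions, not confirmed by the repository. The helper names `count_pairs` and `merge` are hypothetical.

```python
from collections import Counter

def count_pairs(ids):
    """Count how often each adjacent pair of token ids occurs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Assumption: 256 base byte tokens, so a vocab of 5000 needs 4744 merges,
# which matches the "merge N/4744" counter in the logs.
vocab_size = 5000
num_merges = vocab_size - 256

# Byte string taken from the "merge 4696/4744" log line; it decodes to a Hindi word.
logged = b'\xe0\xa4\xaa\xe0\xa5\x81\xe0\xa4\xa4\xe0\xa5\x8d\xe0\xa4\xb0'
word = logged.decode("utf-8")

# One merge step on a tiny illustrative corpus: the same word twice.
raw = logged + b" " + logged
ids = list(raw)
pair, count = count_pairs(ids).most_common(1)[0]
ids_after = merge(ids, pair, 256)  # 256 is the first id after the base bytes

# Assumed definition of the ratio: input bytes per output token.
ratio = len(raw) / len(ids_after)
print(word, pair, count, f"compression ratio: {ratio:.2f}X")
```

Repeating the count-and-merge step `num_merges` times, each time with the next free id, is the usual shape of a BPE training loop; the log format `(a, b) -> new_id` is consistent with that, although the actual training code may differ in detail.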
## Dataset
[![Kaggle](https://img.shields.io/badge/Kaggle-20BEFF?style=for-the-badge&logo=Kaggle&logoColor=white)](https://www.kaggle.com/datasets/nishantbhansali/new-testament-readings-in-hindi-260-chapters)

This dataset contains:

- Chapter-wise audio recordings of the New Testament (260 chapters) in Hindi, as .mp3 files.
- The corresponding transcripts in Hindi.

The data was scraped from the website www.faithcomesbyhearing.com and uploaded to Kaggle for easy viewing and for the community to use.

I downloaded the audio files manually and used a script to extract the text for each audio recording. I used [this file](https://github.com/nishantb06/sarvam/blob/main/part2/scraping_final.ipynb) to scrape the text from the website and clean it up (removing trailing whitespace, unnecessary line breaks, numbers, etc.). The final cleaned text is included in the Kaggle dataset as well.

## Dataset Structure
After downloading, the data will be organized as follows:

    data/
    ├── Hindi_hin_BCS_NT_Non-Drama/ # Audio files directory
    │   ├── B01_01_MatthewHINBCSN1DA.mp3
    │   ├── B01_02_MatthewHINBCSN1DA.mp3
    │   ├── B01_03_MatthewHINBCSN1DA.mp3
    │   │   ...
    │   ├── B260_01_RevelationHINBCSN260DA.mp3
    │   ├── B260_02_RevelationHINBCSN260DA.mp3
    │   └── B260_03_RevelationHINBCSN260DA.mp3
    │
    └── Hindi_hin_BCS_NT_Non-Drama_transcripts/ # Transcript files directory
        ├── B01_01_MatthewHINBCSN1DA.txt
        ├── B01_02_MatthewHINBCSN1DA.txt
        ├── B01_03_MatthewHINBCSN1DA.txt
        │   ...
        ├── B260_01_RevelationHINBCSN260DA.txt
        ├── B260_02_RevelationHINBCSN260DA.txt
        └── B260_03_RevelationHINBCSN260DA.txt

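Given the layout above, each audio file can be matched to its transcript by file name. The following is a minimal sketch that assumes the two directory names shown in the tree and that a transcript shares the stem of its `.mp3` file; the function name `pair_audio_with_transcripts` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Directory names as shown in the dataset structure above.
AUDIO_DIR = "Hindi_hin_BCS_NT_Non-Drama"
TRANSCRIPT_DIR = "Hindi_hin_BCS_NT_Non-Drama_transcripts"

def pair_audio_with_transcripts(data_root):
    """Return (audio_path, transcript_path) pairs that share a file stem."""
    root = Path(data_root)
    audio_dir = root / AUDIO_DIR
    if not audio_dir.is_dir():
        return []
    pairs = []
    for audio in sorted(audio_dir.glob("*.mp3")):
        transcript = root / TRANSCRIPT_DIR / (audio.stem + ".txt")
        # Keep only recordings that actually have a transcript on disk.
        if transcript.is_file():
            pairs.append((audio, transcript))
    return pairs
```

Calling `pair_audio_with_transcripts("data")` after downloading would return the matched pairs in sorted order; recordings without a transcript are skipped rather than raising an error.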
# resources
- [Beyond the ABCs: Exploring the nuances of tokenization in diverse languages](https://www.icodeformybhasa.com/p/beyond-the-abcs-exploring-the-nuances)