datasets:
  - hac541309/open-lid-dataset
pipeline_tag: text-classification
---

# Language Detection Model

A **BERT-based** language detection model trained on [hac541309/open-lid-dataset](https://huggingface.co/datasets/hac541309/open-lid-dataset), which includes **121 million sentences across 200 languages**. This model is optimized for **fast and accurate** language identification in text classification tasks.
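
This excerpt of the card does not include a usage snippet; the sketch below shows one minimal way to query the model with the Transformers `pipeline` API. The repository id is a placeholder and should be replaced with this model's actual Hub id.

```python
from transformers import pipeline

# Placeholder: substitute the actual Hub repository id of this model.
MODEL_ID = "<namespace>/<language-detection-model>"

# The card's pipeline_tag is text-classification, so the standard pipeline applies;
# each prediction is a language label plus a confidence score.
detector = pipeline("text-classification", model=MODEL_ID)

print(detector(["Bonjour, comment allez-vous ?", "Das ist ein Beispielsatz."]))
```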
 
## Training Process

- **Dataset**:
  - Used the [open-lid-dataset](https://huggingface.co/datasets/hac541309/open-lid-dataset)
  - Split into train (90%) and test (10%)
- **Tokenizer**: A custom `BertTokenizerFast` with special tokens for `[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`
  - Scheduler: Cosine
- **Trainer**: Leveraged the Hugging Face [Trainer API](https://huggingface.co/docs/transformers/main_classes/trainer) with Weights & Biases for logging
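
The exact hyperparameters are not listed in this excerpt; the sketch below only illustrates how the two stated choices (cosine scheduler, Weights & Biases logging) map onto `TrainingArguments`, with placeholder values everywhere else.

```python
from transformers import TrainingArguments

# Illustrative configuration: only lr_scheduler_type and report_to reflect the card;
# batch size, learning rate, and epoch count are placeholders, not the real values.
training_args = TrainingArguments(
    output_dir="language-detection",
    per_device_train_batch_size=256,  # placeholder
    learning_rate=2e-5,               # placeholder
    num_train_epochs=1,               # placeholder
    lr_scheduler_type="cosine",       # cosine scheduler, as stated in the card
    report_to="wandb",                # Weights & Biases logging, as stated in the card
    logging_steps=500,
)
```

These arguments would then be passed to the Hugging Face `Trainer` together with the tokenized 90%/10% train and test splits.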

## Data Augmentation

To improve model generalization and robustness, a **new text augmentation strategy** was introduced. This includes:

- **Removing digits** (with random probability)
- **Shuffling words** to introduce variation
- **Removing words** selectively
- **Adding random digits** to simulate noise
- **Modifying punctuation** to handle different text formats
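
The card describes these operations only at a high level; the snippet below is an illustrative sketch of such perturbations (the 10% probability and all implementation details are assumptions, not the actual training code).

```python
import random
import string

def augment(text: str, p: float = 0.1) -> str:
    """Apply random, lightweight perturbations to one training sentence."""
    words = text.split()
    if random.random() < p:  # remove digits
        words = ["".join(c for c in w if not c.isdigit()) or w for w in words]
    if random.random() < p:  # shuffle words to introduce variation
        random.shuffle(words)
    if len(words) > 3 and random.random() < p:  # drop a word selectively
        words.pop(random.randrange(len(words)))
    if random.random() < p:  # add a random digit to simulate noise
        words.insert(random.randrange(len(words) + 1), str(random.randint(0, 9)))
    if random.random() < p:  # modify punctuation
        words = [w.strip(string.punctuation) for w in words]
    return " ".join(w for w in words if w)

print(augment("Ceci est un exemple avec 3 chiffres et de la ponctuation !"))
```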

### Impact of Augmentation

Adding these augmentations **improved overall model performance**, as shown in the evaluation results below.

## Evaluation

### Updated Performance Metrics

- **Accuracy**: 0.9733
- **Precision**: 0.9735
- **Recall**: 0.9733
- **F1 Score**: 0.9733
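
The averaging scheme is not stated in this excerpt; the toy example below shows how such overall scores are commonly computed with scikit-learn, assuming weighted averaging over the per-language support (the labels are illustrative only).

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy labels standing in for predictions on the held-out test split.
y_true = ["fra_Latn", "deu_Latn", "eng_Latn", "fra_Latn"]
y_pred = ["fra_Latn", "deu_Latn", "fra_Latn", "fra_Latn"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```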

### Detailed Evaluation (~12 million texts)

Size is the number of languages supported within each script.

| Script | Support | Precision | Recall | F1 Score | Size |
|:-------|--------:|----------:|-------:|---------:|-----:|
| Arab | 502886 | 0.908169 | 0.91335 | 0.909868 | 21 |
| Latn | 4865320 | 0.973172 | 0.972221 | 0.972646 | 125 |
| Ethi | 88564 | 0.996634 | 0.996459 | 0.996546 | 2 |
| Beng | 100502 | 0.995 | 0.992859 | 0.993915 | 3 |
| Deva | 260227 | 0.950405 | 0.942772 | 0.946355 | 10 |
| Cyrl | 510229 | 0.991342 | 0.989693 | 0.990513 | 12 |
| Tibt | 21863 | 0.992792 | 0.993665 | 0.993222 | 2 |
| Grek | 80445 | 0.998758 | 0.999391 | 0.999074 | 1 |
| Gujr | 53237 | 0.999981 | 0.999925 | 0.999953 | 1 |
| Hebr | 61576 | 0.996375 | 0.998904 | 0.997635 | 2 |
| Armn | 41146 | 0.999927 | 0.999927 | 0.999927 | 1 |
| Jpan | 53963 | 0.999147 | 0.998721 | 0.998934 | 1 |
| Knda | 40989 | 0.999976 | 0.999902 | 0.999939 | 1 |
| Geor | 43399 | 0.999977 | 0.999908 | 0.999942 | 1 |
| Khmr | 24348 | 1 | 0.999959 | 0.999979 | 1 |
| Hang | 66447 | 0.999759 | 0.999955 | 0.999857 | 1 |
| Laoo | 18353 | 1 | 0.999837 | 0.999918 | 1 |
| Mlym | 41899 | 0.999976 | 0.999976 | 0.999976 | 1 |
| Mymr | 62067 | 0.999898 | 0.999207 | 0.999552 | 2 |
| Orya | 27626 | 1 | 0.999855 | 0.999928 | 1 |
| Guru | 40856 | 1 | 0.999902 | 0.999951 | 1 |
| Olck | 13646 | 0.999853 | 1 | 0.999927 | 1 |
| Sinh | 41437 | 1 | 0.999952 | 0.999976 | 1 |
| Taml | 46832 | 0.999979 | 1 | 0.999989 | 1 |
| Tfng | 25238 | 0.849058 | 0.823968 | 0.823808 | 2 |
| Telu | 38251 | 1 | 0.999922 | 0.999961 | 1 |
| Thai | 51428 | 0.999922 | 0.999961 | 0.999942 | 1 |
| Hant | 94042 | 0.993966 | 0.995907 | 0.994935 | 2 |
| Hans | 57006 | 0.99007 | 0.986405 | 0.988234 | 1 |

### Comparison with Previous Performance

After introducing text augmentations, the model's performance improved on the same evaluation dataset: accuracy increased from 0.9695 to 0.9733, precision from 0.9696 to 0.9735, recall from 0.9695 to 0.9733, and F1 score from 0.9694 to 0.9733.

## Conclusion

The integration of **new text augmentation techniques** has led to a measurable improvement in model accuracy and robustness. These enhancements allow for better generalization across diverse language scripts, improving the model’s usability in real-world applications.

A detailed per-script classification report is also provided in the repository for further analysis.