Update README.md
README.md
CHANGED
@@ -8,13 +8,39 @@ This repo contains models for the identification of language in text. It is base
 
 | Model | Size | Precision | Recall | F1-Score | Support |
 |:----------------------------|:------------------|------------:|---------:|-----------:|----------:|
-| `nordic-lid.bin` (large) | 274 MB | 0.9901 | 0.9900 | 0.9900 | 5500 |
-| `nordic-lid.ftz` (small) | 1.87 MB | 0.9889 | 0.9890 | 0.9890 | 5500 |
-| `nordic-lid.159.bin` (large)| 9.63 GB | 0.9434 | 0.9528 | 0.9476 | 44049 |
-| `nordic-lid.159.ftz` (small)| 11.2 MB | 0.9275 | 0.9399 | 0.9327 | 44049 |
 
 
-##
 
 Trained on sentences from [GiellaLT's Translation Memories](https://giellalt.github.io/tm/TranslationMemories.html) and [Wortschatz's corpora](https://wortschatz.uni-leipzig.de/en/download).
 
@@ -37,7 +63,7 @@ Trained on sentences from [GiellaLT's Translation Memories](https://giellalt.g
 | Weighted avg | | 0.9906 | 0.9905 | 0.9905 | 5500 |
 | Macro avg | | 0.9901 | 0.9900 | 0.9900 | 5500 |
 
-
 
 <details>
 <summary>Scores for the 159 languages</summary>
@@ -211,7 +237,7 @@ Additionally trained on sentences from [Tatoeba](https://tatoeba.org/en/).
 
 </details>
 
-
 
 The small models are quantized versions of the large versions using a cutoff of 50,000 words and ngrams and quantizing the norm separately.
 
@@ -235,7 +261,7 @@ The small models are quantized versions of the large versions using a cutoff of
 | Macro avg | | 0.9889 | 0.9890 | 0.9890 | 5500 |
 
 
-
 
 <details>
 <summary>Scores for the 159 languages (compressed model)</summary>
@@ -8,13 +8,39 @@
 
 | Model | Size | Precision | Recall | F1-Score | Support |
 |:----------------------------|:------------------|------------:|---------:|-----------:|----------:|
+| [`nordic-lid.bin`](https://huggingface.co/NbAiLab/nordic-lid/resolve/main/nordic-lid.bin) (large) | 274 MB | 0.9901 | 0.9900 | 0.9900 | 5500 |
+| [`nordic-lid.ftz`](https://huggingface.co/NbAiLab/nordic-lid/resolve/main/nordic-lid.ftz) (small) | 1.87 MB | 0.9889 | 0.9890 | 0.9890 | 5500 |
+| [`nordic-lid.159.bin`](https://huggingface.co/NbAiLab/nordic-lid/resolve/main/nordic-lid.159.bin) (large)| 9.63 GB | 0.9434 | 0.9528 | 0.9476 | 44049 |
+| [`nordic-lid.159.ftz`](https://huggingface.co/NbAiLab/nordic-lid/resolve/main/nordic-lid.159.ftz) (small)| 11.2 MB | 0.9275 | 0.9399 | 0.9327 | 44049 |
 
 
+## Usage
+
+After download, the models can be used through the fastText library:
+
+```python
+import fasttext
+from datasets.utils.download_manager import DownloadManager
+
+
+NORDIC_LID_URL = "https://huggingface.co/NbAiLab/nordic-lid/resolve/main/"
+model_name = "nordic-lid.ftz"
+
+model = fasttext.load_model(DownloadManager().download(NORDIC_LID_URL + model_name))
+model.predict("Debatt er bra og sunt for demokratier, og en forutsetning for politikkutvikling.", threshold=0.25)
+# (('__label__nob',), array([0.95482141]))
+```
+
+Alternatively, these models are integrated into the experimental `nb` CLI application:
+
+```bash
+$ echo "Debatt er bra og sunt for demokratier, og en forutsetning for politikkutvikling." | nb langid --threshold 0.25
+nob,0.95482141
+```
+
+## Languages
+
+### `nordic-lid.bin`
 
 Trained on sentences from [GiellaLT's Translation Memories](https://giellalt.github.io/tm/TranslationMemories.html) and [Wortschatz's corpora](https://wortschatz.uni-leipzig.de/en/download).
 
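The `model.predict` call in the usage snippet added by this hunk returns two parallel sequences: labels carrying fastText's `__label__` prefix, and their probabilities. A minimal sketch of turning that pair into `(language, probability)` tuples might look like the following. The helper name `parse_prediction` is hypothetical (it is not part of fastText or of this repository), and it only covers the single-string form of `predict` shown in the README.

```python
LABEL_PREFIX = "__label__"  # fastText's default label prefix


def parse_prediction(prediction):
    """Convert a fastText-style (labels, probabilities) pair into
    a list of (language_code, probability) tuples."""
    labels, probabilities = prediction
    results = []
    for label, probability in zip(labels, probabilities):
        # Strip the prefix when present; leave other labels untouched.
        code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        results.append((code, float(probability)))
    return results


# Shape taken from the README's example output; a plain list stands in
# for the NumPy array that fastText returns.
example = (("__label__nob",), [0.95482141])
print(parse_prediction(example))
```

An empty pair of sequences, which can occur when no label clears the `threshold`, simply yields an empty list.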
@@ -37,7 +63,7 @@
 | Weighted avg | | 0.9906 | 0.9905 | 0.9905 | 5500 |
 | Macro avg | | 0.9901 | 0.9900 | 0.9900 | 5500 |
 
+### `nordic-lid.159.bin`
 
 <details>
 <summary>Scores for the 159 languages</summary>
@@ -211,7 +237,7 @@
 
 </details>
 
+### `nordic-lid.ftz`
 
 The small models are quantized versions of the large versions using a cutoff of 50,000 words and ngrams and quantizing the norm separately.
 
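The README's description of the small models (a cutoff of 50,000 words and n-grams, with the norm quantized separately) appears to correspond to the `cutoff` and `qnorm` arguments of `quantize` in fastText's Python bindings. The following is a sketch under that assumption, not the authors' actual command. The helper name `quantize_model` and the file paths are hypothetical, and the README does not say whether retraining or other options were used.

```python
# Settings read off the README sentence; everything else is left at
# fastText's defaults, which is an assumption.
QUANTIZE_KWARGS = {"cutoff": 50000, "qnorm": True}


def quantize_model(large_path, small_path, **overrides):
    """Load a full fastText model, quantize it, and save the result."""
    import fasttext  # imported lazily so the settings can be inspected without it

    model = fasttext.load_model(large_path)
    # Keyword overrides take precedence over the assumed defaults.
    model.quantize(**{**QUANTIZE_KWARGS, **overrides})
    model.save_model(small_path)
    return small_path
```

Under these assumptions, `quantize_model("nordic-lid.bin", "nordic-lid.ftz")` would write a compressed model next to the large one. The resulting size and scores are not guaranteed to match the table in the README.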
@@ -235,7 +261,7 @@
 | Macro avg | | 0.9889 | 0.9890 | 0.9890 | 5500 |
 
 
+### `nordic-lid.159.ftz`
 
 <details>
 <summary>Scores for the 159 languages (compressed model)</summary>