Update README.md
README.md
CHANGED
@@ -1,17 +1,39 @@
---
license: openrail
---

# Nordic language identification

-This repo contains models for the identification of language in text. It is based on fastText and designed with the Nordic languages in mind, including several Sámi languages. It comes in two flavours: `nordic-lid`, a model that distinguishes between the 12 most common languages in the Nordic countries (plus English), and `nordic-lid.159`, a model that extends that list to 159 languages of the world. Each of them comes in large and small (quantized) versions.

| Model | Size | Precision | Recall | F1-Score | Support |
|:----------------------------|:------------------|------------:|---------:|-----------:|----------:|
-| [`nordic-lid.bin`](https://huggingface.co/NbAiLab/nordic-lid/resolve/main/nordic-lid.bin) (large) | 274 MB | 0.9901 | 0.9900 | 0.9900 | 5500 |
-| [`nordic-lid.ftz`](https://huggingface.co/NbAiLab/nordic-lid/resolve/main/nordic-lid.ftz) (small) | 1.87 MB | 0.9889 | 0.9890 | 0.9890 | 5500 |
-| [`nordic-lid.159.bin`](https://huggingface.co/NbAiLab/nordic-lid/resolve/main/nordic-lid.159.bin) (large)| 9.63 GB | 0.9434 | 0.9528 | 0.9476 | 44049 |
-| [`nordic-lid.159.ftz`](https://huggingface.co/NbAiLab/nordic-lid/resolve/main/nordic-lid.159.ftz) (small)| 11.2 MB | 0.9275 | 0.9399 | 0.9327 | 44049 |


## Usage
@@ -23,8 +45,8 @@ import fasttext
from datasets.utils.download_manager import DownloadManager


-NORDIC_LID_URL = "https://huggingface.co/NbAiLab/nordic-lid/resolve/main/"
-model_name = "nordic-lid.ftz"

model = fasttext.load_model(DownloadManager().download(NORDIC_LID_URL + model_name))
model.predict("Debatt er bra og sunt for demokratier, og en forutsetning for politikkutvikling.", threshold=0.25)
@@ -34,14 +56,14 @@ model.predict("Debatt er bra og sunt for demokratier, og en forutsetning for pol
Alternatively, these models are also integrated into the experimental `nbailab` CLI application:

```bash
-$ echo "Jeg leser en bok" | nbailab langid --model-name nordic-lid.ftz
nob,0.9999788999557495
```


## Languages

-### `nordic-lid.bin`

Trained on sentences from [GiellaLT's Translation Memories](https://giellalt.github.io/tm/TranslationMemories.html) and [Wortschatz's corpora](https://wortschatz.uni-leipzig.de/en/download).

@@ -64,7 +86,7 @@ Trained on sentences from the [GiellaLT's Translation Memories](https://giellalt.g
| Weighted avg | | 0.9906 | 0.9905 | 0.9905 | 5500 |
| Macro avg | | 0.9901 | 0.9900 | 0.9900 | 5500 |

-### `nordic-lid.159.bin`

<details>
<summary>Scores for the 159 languages</summary>
@@ -238,7 +260,7 @@ Additionally trained on sentences from [Tatoeba](https://tatoeba.org/en/).

</details>

-### `nordic-lid.ftz`

The small models are quantized versions of the large versions using a cutoff of 50,000 words and ngrams and quantizing the norm separately.

@@ -262,7 +284,7 @@ The small models are quantized versions of the large versions using a cutoff of
| Macro avg | | 0.9889 | 0.9890 | 0.9890 | 5500 |


-### `nordic-lid.159.ftz`

<details>
<summary>Scores for the 159 languages (compressed model)</summary>
@@ -1,17 +1,39 @@
---
license: openrail
+language:
+- dan
+- eng
+- fao
+- fin
+- isl
+- nno
+- nob
+- sma
+- sme
+- smj
+- smn
+- sms
+- swe
+tasks:
+- text-classification
+tags:
+- fasttext
+datasets:
+- tatoeba
+library_name: fasttext
+inference: false
---

# Nordic language identification

+This repo contains models for the identification of language in text. It is based on fastText and designed with the Nordic languages in mind, including several Sámi languages. It comes in two flavours: `nb-nordic-lid`, a model that distinguishes between the 12 most common languages in the Nordic countries (plus English), and `nb-nordic-lid.159`, a model that extends that list to 159 languages of the world. Each of them comes in large and small (quantized) versions.

| Model | Size | Precision | Recall | F1-Score | Support |
|:----------------------------|:------------------|------------:|---------:|-----------:|----------:|
+| [`nb-nordic-lid.bin`](https://huggingface.co/NbAiLab/nb-nordic-lid/resolve/main/nb-nordic-lid.bin) (large) | 274 MB | 0.9901 | 0.9900 | 0.9900 | 5500 |
+| [`nb-nordic-lid.ftz`](https://huggingface.co/NbAiLab/nb-nordic-lid/resolve/main/nb-nordic-lid.ftz) (small) | 1.87 MB | 0.9889 | 0.9890 | 0.9890 | 5500 |
+| [`nb-nordic-lid.159.bin`](https://huggingface.co/NbAiLab/nb-nordic-lid/resolve/main/nb-nordic-lid.159.bin) (large)| 9.63 GB | 0.9434 | 0.9528 | 0.9476 | 44049 |
+| [`nb-nordic-lid.159.ftz`](https://huggingface.co/NbAiLab/nb-nordic-lid/resolve/main/nb-nordic-lid.159.ftz) (small)| 11.2 MB | 0.9275 | 0.9399 | 0.9327 | 44049 |


## Usage
@@ -23,8 +45,8 @@
from datasets.utils.download_manager import DownloadManager


+NORDIC_LID_URL = "https://huggingface.co/NbAiLab/nb-nordic-lid/resolve/main/"
+model_name = "nb-nordic-lid.ftz"

model = fasttext.load_model(DownloadManager().download(NORDIC_LID_URL + model_name))
model.predict("Debatt er bra og sunt for demokratier, og en forutsetning for politikkutvikling.", threshold=0.25)
@@ -34,14 +56,14 @@
Alternatively, these models are also integrated into the experimental `nbailab` CLI application:

```bash
+$ echo "Jeg leser en bok" | nbailab langid --model-name nb-nordic-lid.ftz
nob,0.9999788999557495
```
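
The Python call and the CLI above report the same prediction in different shapes. As a minimal sketch, assuming `model.predict` returns a tuple of labels that carry fastText's default `__label__` prefix plus a matching sequence of probabilities (an assumption about these models, not something the README states), the hypothetical helper `to_language_codes` below turns that output into `(code, probability)` pairs resembling the CLI's `label,probability` line:

```python
from typing import Iterable, List, Tuple

# fastText's default label prefix; assumed (not confirmed) to be unchanged for these models.
LABEL_PREFIX = "__label__"


def to_language_codes(labels: Iterable[str], probs: Iterable[float]) -> List[Tuple[str, float]]:
    """Pair each predicted label with its probability, dropping the label prefix if present."""
    return [
        (label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label, float(prob))
        for label, prob in zip(labels, probs)
    ]


# Hypothetical stand-in for the result of model.predict(...) on a single string.
labels, probs = ("__label__nob",), [0.9999788999557495]
for code, prob in to_language_codes(labels, probs):
    print(f"{code},{prob}")
```

Labels without the prefix are passed through unchanged, so the helper stays usable if the assumption about the prefix turns out to be wrong.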


## Languages

+### `nb-nordic-lid.bin`

Trained on sentences from [GiellaLT's Translation Memories](https://giellalt.github.io/tm/TranslationMemories.html) and [Wortschatz's corpora](https://wortschatz.uni-leipzig.de/en/download).

@@ -64,7 +86,7 @@
| Weighted avg | | 0.9906 | 0.9905 | 0.9905 | 5500 |
| Macro avg | | 0.9901 | 0.9900 | 0.9900 | 5500 |

+### `nb-nordic-lid.159.bin`

<details>
<summary>Scores for the 159 languages</summary>
@@ -238,7 +260,7 @@

</details>

+### `nb-nordic-lid.ftz`

The small models are quantized versions of the large versions using a cutoff of 50,000 words and ngrams and quantizing the norm separately.
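
The quantization described above could be reproduced roughly as follows with the fastText Python bindings. This is a sketch under stated assumptions: the function name and file paths are hypothetical, `cutoff=50000` and `qnorm=True` are one reading of "a cutoff of 50,000 words and ngrams" and "quantizing the norm separately", and the README does not say whether the models were retrained after pruning, so `retrain` is left at its default.

```python
def quantize_large_model(large_path: str, small_path: str) -> None:
    """Quantize a large .bin model into a small .ftz model (illustrative sketch)."""
    # Imported lazily so this module can be loaded without fastText installed.
    import fasttext

    model = fasttext.load_model(large_path)
    # cutoff=50000 retains 50,000 words and n-grams;
    # qnorm=True quantizes the vector norms separately.
    # retrain keeps its default (False): an assumption, the source does not say.
    model.quantize(cutoff=50000, qnorm=True)
    model.save_model(small_path)


# Example call (paths are placeholders):
# quantize_large_model("nb-nordic-lid.bin", "nb-nordic-lid.ftz")
```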

@@ -262,7 +284,7 @@
| Macro avg | | 0.9889 | 0.9890 | 0.9890 | 5500 |


+### `nb-nordic-lid.159.ftz`

<details>
<summary>Scores for the 159 languages (compressed model)</summary>