versae committed on
Commit 5037034 · 1 Parent(s): 1df2555

Update README.md

Files changed (1)
  1. README.md +34 -12
README.md CHANGED
@@ -1,17 +1,39 @@
1
  ---
2
  license: openrail
3
  ---
4
 
5
  # Nordic language identification
6
 
7
- This repo contains models for identifying the language of a text. They are based on fastText and designed with the Nordic languages in mind, including several Sámi languages. They come in two flavours: `nordic-lid`, which distinguishes between the 12 most common languages in the Nordic countries (plus English), and `nordic-lid.159`, which extends that list to 159 languages of the world. Each comes in a large and a small (quantized) version.
8
 
9
  | Model | Size | Precision | Recall | F1-Score | Support |
10
  |:----------------------------|:------------------|------------:|---------:|-----------:|----------:|
11
- | [`nordic-lid.bin`](https://huggingface.co/NbAiLab/nordic-lid/resolve/main/nordic-lid.bin) (large) | 274 MB | 0.9901 | 0.9900 | 0.9900 | 5500 |
12
- | [`nordic-lid.ftz`](https://huggingface.co/NbAiLab/nordic-lid/resolve/main/nordic-lid.ftz) (small) | 1.87 MB | 0.9889 | 0.9890 | 0.9890 | 5500 |
13
- | [`nordic-lid.159.bin`](https://huggingface.co/NbAiLab/nordic-lid/resolve/main/nordic-lid.159.bin) (large)| 9.63 GB | 0.9434 | 0.9528 | 0.9476 | 44049 |
14
- | [`nordic-lid.159.ftz`](https://huggingface.co/NbAiLab/nordic-lid/resolve/main/nordic-lid.159.ftz) (small)| 11.2 MB | 0.9275 | 0.9399 | 0.9327 | 44049 |
15
 
16
 
17
  ## Usage
@@ -23,8 +45,8 @@ import fasttext
23
  from datasets.utils.download_manager import DownloadManager
24
 
25
 
26
- NORDIC_LID_URL = "https://huggingface.co/NbAiLab/nordic-lid/resolve/main/"
27
- model_name = "nordic-lid.ftz"
28
 
29
  model = fasttext.load_model(DownloadManager().download(NORDIC_LID_URL + model_name))
30
  model.predict("Debatt er bra og sunt for demokratier, og en forutsetning for politikkutvikling.", threshold=0.25)
@@ -34,14 +56,14 @@ model.predict("Debatt er bra og sunt for demokratier, og en forutsetning for pol
34
  Alternatively, these models are also integrated into the experimental `nbailab` CLI application:
35
 
36
  ```bash
37
- $ echo "Jeg leser en bok" | nbailab langid --model-name nordic-lid.ftz
38
  nob,0.9999788999557495
39
  ```
40
 
41
 
42
  ## Languages
43
 
44
- ### `nordic-lid.bin`
45
 
46
  Trained on sentences from [GiellaLT's Translation Memories](https://giellalt.github.io/tm/TranslationMemories.html) and [Wortschatz's corpora](https://wortschatz.uni-leipzig.de/en/download).
47
 
@@ -64,7 +86,7 @@ Trained on sentences from the [GiellaT's Tranlation Memories](https://giellalt.g
64
  | Weighted avg | | 0.9906 | 0.9905 | 0.9905 | 5500 |
65
  | Macro avg | | 0.9901 | 0.9900 | 0.9900 | 5500 |
66
 
67
- ### `nordic-lid.159.bin`
68
 
69
  <details>
70
  <summary>Scores for the 159 languages</summary>
@@ -238,7 +260,7 @@ Additionally trained on sentences from [Taoteba](https://tatoeba.org/en/).
238
 
239
  </details>
240
 
241
- ### `nordic-lid.ftz`
242
 
243
  The small models are quantized versions of the large ones, using a cutoff of 50,000 words and n-grams and quantizing the norm separately.
244
 
@@ -262,7 +284,7 @@ The small models are quantized versions of the large versions using a cutoff of
262
  | Macro avg | | 0.9889 | 0.9890 | 0.9890 | 5500 |
263
 
264
 
265
- ### `nordic-lid.159.ftz`
266
 
267
  <details>
268
  <summary>Scores for the 159 languages (compressed model)</summary>
 
1
  ---
2
  license: openrail
3
+ language:
4
+ - dan
5
+ - eng
6
+ - fao
7
+ - fin
8
+ - isl
9
+ - nno
10
+ - nob
11
+ - sma
12
+ - sme
13
+ - smj
14
+ - smn
15
+ - sms
16
+ - swe
17
+ tasks:
18
+ - text-classification
19
+ tags:
20
+ - fasttext
21
+ datasets:
22
+ - tatoeba
23
+ library_name: fasttext
24
+ inference: false
25
  ---
26
 
27
  # Nordic language identification
28
 
29
+ This repo contains models for identifying the language of a text. They are based on fastText and designed with the Nordic languages in mind, including several Sámi languages. They come in two flavours: `nb-nordic-lid`, which distinguishes between the 12 most common languages in the Nordic countries (plus English), and `nb-nordic-lid.159`, which extends that list to 159 languages of the world. Each comes in a large and a small (quantized) version.
30
 
31
  | Model | Size | Precision | Recall | F1-Score | Support |
32
  |:----------------------------|:------------------|------------:|---------:|-----------:|----------:|
33
+ | [`nb-nordic-lid.bin`](https://huggingface.co/NbAiLab/nb-nordic-lid/resolve/main/nb-nordic-lid.bin) (large) | 274 MB | 0.9901 | 0.9900 | 0.9900 | 5500 |
34
+ | [`nb-nordic-lid.ftz`](https://huggingface.co/NbAiLab/nb-nordic-lid/resolve/main/nb-nordic-lid.ftz) (small) | 1.87 MB | 0.9889 | 0.9890 | 0.9890 | 5500 |
35
+ | [`nb-nordic-lid.159.bin`](https://huggingface.co/NbAiLab/nb-nordic-lid/resolve/main/nb-nordic-lid.159.bin) (large)| 9.63 GB | 0.9434 | 0.9528 | 0.9476 | 44049 |
36
+ | [`nb-nordic-lid.159.ftz`](https://huggingface.co/NbAiLab/nb-nordic-lid/resolve/main/nb-nordic-lid.159.ftz) (small)| 11.2 MB | 0.9275 | 0.9399 | 0.9327 | 44049 |
37
 
38
 
39
  ## Usage
 
45
  from datasets.utils.download_manager import DownloadManager
46
 
47
 
48
+ NORDIC_LID_URL = "https://huggingface.co/NbAiLab/nb-nordic-lid/resolve/main/"
49
+ model_name = "nb-nordic-lid.ftz"
50
 
51
  model = fasttext.load_model(DownloadManager().download(NORDIC_LID_URL + model_name))
52
  model.predict("Debatt er bra og sunt for demokratier, og en forutsetning for politikkutvikling.", threshold=0.25)
 
56
  Alternatively, these models are also integrated into the experimental `nbailab` CLI application:
57
 
58
  ```bash
59
+ $ echo "Jeg leser en bok" | nbailab langid --model-name nb-nordic-lid.ftz
60
  nob,0.9999788999557495
61
  ```
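
The CLI above prints a bare language code and a score, whereas fastText's Python `predict` returns a tuple of label strings alongside an array of probabilities. The following is a minimal sketch of how the two forms might be reconciled. It assumes the models keep fastText's default `__label__` prefix, which this README does not confirm; `parse_prediction` is a hypothetical helper, not part of `fasttext` or `nbailab`, and the sample prediction is hard-coded so the snippet runs without downloading a model.

```python
from typing import List, Sequence, Tuple

# fastText's default label prefix; assumed (not confirmed) to be unchanged for these models.
LABEL_PREFIX = "__label__"


def parse_prediction(labels: Sequence[str], probs: Sequence[float]) -> List[Tuple[str, float]]:
    """Pair each predicted label (with the prefix stripped) with its probability."""
    results = []
    for label, prob in zip(labels, probs):
        code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        results.append((code, float(prob)))
    return results


# Stand-in for the (labels, probabilities) pair that model.predict(...) returns.
labels, probs = ("__label__nob",), [0.9999788999557495]

# Print one "code,score" line per prediction, mirroring the CLI output format.
for code, prob in parse_prediction(labels, probs):
    print(f"{code},{prob}")
```

With a loaded model, the hard-coded pair would be replaced by `labels, probs = model.predict(text, threshold=0.25)`; predictions whose probability falls below the threshold are simply absent from the returned tuple.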
62
 
63
 
64
  ## Languages
65
 
66
+ ### `nb-nordic-lid.bin`
67
 
68
  Trained on sentences from [GiellaLT's Translation Memories](https://giellalt.github.io/tm/TranslationMemories.html) and [Wortschatz's corpora](https://wortschatz.uni-leipzig.de/en/download).
69
 
 
86
  | Weighted avg | | 0.9906 | 0.9905 | 0.9905 | 5500 |
87
  | Macro avg | | 0.9901 | 0.9900 | 0.9900 | 5500 |
88
 
89
+ ### `nb-nordic-lid.159.bin`
90
 
91
  <details>
92
  <summary>Scores for the 159 languages</summary>
 
260
 
261
  </details>
262
 
263
+ ### `nb-nordic-lid.ftz`
264
 
265
  The small models are quantized versions of the large ones, using a cutoff of 50,000 words and n-grams and quantizing the norm separately.
266
 
 
284
  | Macro avg | | 0.9889 | 0.9890 | 0.9890 | 5500 |
285
 
286
 
287
+ ### `nb-nordic-lid.159.ftz`
288
 
289
  <details>
290
  <summary>Scores for the 159 languages (compressed model)</summary>