versae committed on
Commit 70be1e2 · 1 Parent(s): a251e6b

Update README.md

Files changed (1)
  1. README.md +34 -8
README.md CHANGED
@@ -8,13 +8,39 @@ This repo contains models for the identification of language in text. It is base
 
  | Model | Size | Precision | Recall | F1-Score | Support |
  |:----------------------------|:------------------|------------:|---------:|-----------:|----------:|
- | `nordic-lid.bin` (large) | 274 MB | 0.9901 | 0.9900 | 0.9900 | 5500 |
- | `nordic-lid.ftz` (small) | 1.87 MB | 0.9889 | 0.9890 | 0.9890 | 5500 |
- | `nordic-lid.159.bin` (large)| 9.63 GB | 0.9434 | 0.9528 | 0.9476 | 44049 |
- | `nordic-lid.159.ftz` (small)| 11.2 MB | 0.9275 | 0.9399 | 0.9327 | 44049 |
 
 
- ## `nordic-lid.bin`
 
  Trained on sentences from the [GiellaLT's Translation Memories](https://giellalt.github.io/tm/TranslationMemories.html) and [Wortschatz's corpora](https://wortschatz.uni-leipzig.de/en/download).
 
@@ -37,7 +63,7 @@ Trained on sentences from the [GiellaLT's Translation Memories](https://giellalt.g
  | Weighted avg | | 0.9906 | 0.9905 | 0.9905 | 5500 |
  | Macro avg | | 0.9901 | 0.9900 | 0.9900 | 5500 |
 
- ## `nordic-lid.159.bin`
 
  <details>
  <summary>Scores for the 159 languages</summary>
@@ -211,7 +237,7 @@ Additionally trained on sentences from [Tatoeba](https://tatoeba.org/en/).
 
  </details>
 
- ## `nordic-lid.ftz`
 
  The small models are quantized versions of the large versions using a cutoff of 50,000 words and ngrams and quantizing the norm separately.
 
@@ -235,7 +261,7 @@ The small models are quantized versions of the large versions using a cutoff of
  | Macro avg | | 0.9889 | 0.9890 | 0.9890 | 5500 |
 
 
- ## `nordic-lid.159.ftz`
 
  <details>
  <summary>Scores for the 159 languages (compressed model)</summary>
 
@@ -8,13 +8,39 @@
 
  | Model | Size | Precision | Recall | F1-Score | Support |
  |:----------------------------|:------------------|------------:|---------:|-----------:|----------:|
+ | [`nordic-lid.bin`](https://huggingface.co/NbAiLab/nordic-lid/resolve/main/nordic-lid.bin) (large) | 274 MB | 0.9901 | 0.9900 | 0.9900 | 5500 |
+ | [`nordic-lid.ftz`](https://huggingface.co/NbAiLab/nordic-lid/resolve/main/nordic-lid.ftz) (small) | 1.87 MB | 0.9889 | 0.9890 | 0.9890 | 5500 |
+ | [`nordic-lid.159.bin`](https://huggingface.co/NbAiLab/nordic-lid/resolve/main/nordic-lid.159.bin) (large)| 9.63 GB | 0.9434 | 0.9528 | 0.9476 | 44049 |
+ | [`nordic-lid.159.ftz`](https://huggingface.co/NbAiLab/nordic-lid/resolve/main/nordic-lid.159.ftz) (small)| 11.2 MB | 0.9275 | 0.9399 | 0.9327 | 44049 |
 
 
+ ## Usage
+
+ After download, the models can be used through the fastText library:
+
+ ```python
+ import fasttext
+ from datasets.utils.download_manager import DownloadManager
+
+
+ NORDIC_LID_URL = "https://huggingface.co/NbAiLab/nordic-lid/resolve/main/"
+ model_name = "nordic-lid.ftz"
+
+ model = fasttext.load_model(DownloadManager().download(NORDIC_LID_URL + model_name))
+ model.predict("Debatt er bra og sunt for demokratier, og en forutsetning for politikkutvikling.", threshold=0.25)
+ # (('__label__nob',), array([0.95482141]))
+ ```
+
+ Alternatively, these models are also integrated into the experimental `nb` CLI application:
+
+ ```bash
+ $ echo "Debatt er bra og sunt for demokratier, og en forutsetning for politikkutvikling." | nb langid --threshold 0.25
+ nob,0.95482141
+ ```
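The raw fastText prediction in the added Usage section carries a `__label__` prefix, while the `nb` CLI output shows the bare language code. The following is a minimal sketch of that conversion, assuming the prediction has the `(labels, scores)` shape shown in the README example. `format_predictions` is a hypothetical helper introduced here for illustration; it is not part of fastText or of `nb`.

```python
LABEL_PREFIX = "__label__"  # default label prefix used by fastText


def format_predictions(labels, scores):
    """Turn fastText `predict` output into (language_code, score) pairs."""
    pairs = []
    for label, score in zip(labels, scores):
        # Strip the prefix only when it is actually present.
        code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        pairs.append((code, float(score)))
    return pairs


# Values copied from the README example; with a loaded model this would be
#   labels, scores = model.predict(text, threshold=0.25)
labels, scores = ("__label__nob",), [0.95482141]
for code, score in format_predictions(labels, scores):
    print(f"{code},{score:.8f}")  # prints "nob,0.95482141"
```

Keeping the prefix handling in one place avoids scattering string slicing across callers, and the helper leaves labels without the prefix untouched.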
+
+ ## Languages
+
+ ### `nordic-lid.bin`
 
  Trained on sentences from the [GiellaLT's Translation Memories](https://giellalt.github.io/tm/TranslationMemories.html) and [Wortschatz's corpora](https://wortschatz.uni-leipzig.de/en/download).
 
 
@@ -37,7 +63,7 @@
  | Weighted avg | | 0.9906 | 0.9905 | 0.9905 | 5500 |
  | Macro avg | | 0.9901 | 0.9900 | 0.9900 | 5500 |
 
+ ### `nordic-lid.159.bin`
 
  <details>
  <summary>Scores for the 159 languages</summary>
 
@@ -211,7 +237,7 @@
 
  </details>
 
+ ### `nordic-lid.ftz`
 
  The small models are quantized versions of the large versions using a cutoff of 50,000 words and ngrams and quantizing the norm separately.
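The quantization described in this README line could be reproduced along the following lines with the fastText Python bindings. This is a sketch under assumptions, not the authors' actual command: the README only states the 50,000 cutoff and the separately quantized norm, so every other `quantize` option is left at its default, and `quantize_model` is a hypothetical helper name.

```python
def quantize_model(model, output_path, cutoff=50_000):
    """Quantize a loaded fastText supervised model in place and save it."""
    # cutoff: number of words and n-grams to retain.
    # qnorm=True: quantize the norm separately.
    model.quantize(cutoff=cutoff, qnorm=True)
    model.save_model(output_path)
    return output_path


# Usage, assuming the large model has already been downloaded locally:
#   import fasttext
#   quantize_model(fasttext.load_model("nordic-lid.bin"), "nordic-lid.ftz")
```

Whether the published `.ftz` files were produced with retraining or other non-default settings is not stated in the README, so scores from a model quantized this way may differ from the table.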
 
@@ -235,7 +261,7 @@
  | Macro avg | | 0.9889 | 0.9890 | 0.9890 | 5500 |
 
 
+ ### `nordic-lid.159.ftz`
 
  <details>
  <summary>Scores for the 159 languages (compressed model)</summary>