NbAiLab/nb-nordic-lid

Text Classification

fastText

language-identification

language-detection


versae committed on Nov 21, 2022

Commit

70be1e2

1 Parent(s): a251e6b

Update README.md


Files changed (1)

README.md +34 -8


@@ -8,13 +8,39 @@ This repo contains models for the identification of language in text. It is base
 | Model                       | Size              |   Precision |   Recall |   F1-Score |   Support |
 |:----------------------------|:------------------|------------:|---------:|-----------:|----------:|
-| `nordic-lid.bin` (large)    | 274 MB            |      0.9901 |   0.9900 |     0.9900 |      5500 |
-| `nordic-lid.ftz` (small)    | 1.87 MB           |      0.9889 |   0.9890 |     0.9890 |      5500 |
-| `nordic-lid.159.bin` (large)| 9.63 GB           |      0.9434 |   0.9528 |     0.9476 |     44049 |
-| `nordic-lid.159.ftz` (small)| 11.2 MB           |      0.9275 |   0.9399 |     0.9327 |     44049 |
-## `nordic-lid.bin`
 Trained on sentences from the [GiellaLT's Translation Memories](https://giellalt.github.io/tm/TranslationMemories.html) and [Wortschatz's corpora](https://wortschatz.uni-leipzig.de/en/download).
@@ -37,7 +63,7 @@ Trained on sentences from the [GiellaLT's Translation Memories](https://giellalt.g
 | Weighted avg |                   |      0.9906 |   0.9905 |     0.9905 |      5500 |
 | Macro avg    |                   |      0.9901 |   0.9900 |     0.9900 |      5500 |
-## `nordic-lid.159.bin`
 <details>
   <summary>Scores for the 159 languages</summary>
@@ -211,7 +237,7 @@ Additionally trained on sentences from [Tatoeba](https://tatoeba.org/en/).
 </details>
-## `nordic-lid.ftz`
 The small models are quantized versions of the large models, using a cutoff of 50,000 words and n-grams and quantizing the norm separately.
@@ -235,7 +261,7 @@ The small models are quantized versions of the large models, using a cutoff of
 | Macro avg    |                   |      0.9889 |   0.9890 |     0.9890 |      5500 |
-## `nordic-lid.159.ftz`
 <details>
   <summary>Scores for the 159 languages (compressed model)</summary>

 | Model                       | Size              |   Precision |   Recall |   F1-Score |   Support |
 |:----------------------------|:------------------|------------:|---------:|-----------:|----------:|
+| [`nordic-lid.bin`](https://huggingface.co/NbAiLab/nordic-lid/resolve/main/nordic-lid.bin) (large)    | 274 MB            |      0.9901 |   0.9900 |     0.9900 |      5500 |
+| [`nordic-lid.ftz`](https://huggingface.co/NbAiLab/nordic-lid/resolve/main/nordic-lid.ftz) (small)    | 1.87 MB           |      0.9889 |   0.9890 |     0.9890 |      5500 |
+| [`nordic-lid.159.bin`](https://huggingface.co/NbAiLab/nordic-lid/resolve/main/nordic-lid.159.bin) (large)| 9.63 GB           |      0.9434 |   0.9528 |     0.9476 |     44049 |
+| [`nordic-lid.159.ftz`](https://huggingface.co/NbAiLab/nordic-lid/resolve/main/nordic-lid.159.ftz) (small)| 11.2 MB           |      0.9275 |   0.9399 |     0.9327 |     44049 |
+## Usage
+After downloading, the models can be used through the fastText library:
+```python
+import fasttext
+from datasets.utils.download_manager import DownloadManager
+NORDIC_LID_URL = "https://huggingface.co/NbAiLab/nordic-lid/resolve/main/"
+model_name = "nordic-lid.ftz"
+model = fasttext.load_model(DownloadManager().download(NORDIC_LID_URL + model_name))
+model.predict("Debatt er bra og sunt for demokratier, og en forutsetning for politikkutvikling.", threshold=0.25)
+# (('__label__nob',), array([0.95482141]))
+```
+Alternatively, these models are integrated into the experimental `nb` CLI application:
+```bash
+$ echo "Debatt er bra og sunt for demokratier, og en forutsetning for politikkutvikling." | nb langid --threshold 0.25
+nob,0.95482141
+```
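
The Python snippet added in this commit returns a tuple of `__label__`-prefixed labels and an array of probabilities, while the `nb` CLI example prints `nob,0.95482141`. A minimal sketch of how one might turn the former into the latter is shown here. The helper name `format_predictions` and the eight-decimal formatting are assumptions made for illustration; this is not how the `nb` CLI is known to be implemented. The input values are copied from the README example, so no model is loaded.

```python
# Hypothetical helper: converts fastText-style predictions into
# "code,probability" lines similar to the CLI output shown in the README.
LABEL_PREFIX = "__label__"  # fastText's default label prefix


def format_predictions(labels, probabilities):
    """Return one 'code,probability' string per predicted label."""
    lines = []
    for label, prob in zip(labels, probabilities):
        # Strip the prefix only if it is present; leave other labels untouched.
        code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        lines.append(f"{code},{float(prob):.8f}")
    return lines


# Values copied from the README example (no model is loaded here).
labels, probabilities = ("__label__nob",), [0.95482141]
print("\n".join(format_predictions(labels, probabilities)))
```

If no label clears the `threshold` passed to `predict`, both sequences would be empty and the helper returns an empty list, so callers may want to handle that case explicitly.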
+## Languages
+### `nordic-lid.bin`
 Trained on sentences from the [GiellaLT's Translation Memories](https://giellalt.github.io/tm/TranslationMemories.html) and [Wortschatz's corpora](https://wortschatz.uni-leipzig.de/en/download).
 | Weighted avg |                   |      0.9906 |   0.9905 |     0.9905 |      5500 |
 | Macro avg    |                   |      0.9901 |   0.9900 |     0.9900 |      5500 |
+### `nordic-lid.159.bin`
 <details>
   <summary>Scores for the 159 languages</summary>
 </details>
+### `nordic-lid.ftz`
 The small models are quantized versions of the large models, using a cutoff of 50,000 words and n-grams and quantizing the norm separately.
 | Macro avg    |                   |      0.9889 |   0.9890 |     0.9890 |      5500 |
+### `nordic-lid.159.ftz`
 <details>
   <summary>Scores for the 159 languages (compressed model)</summary>