---
library_name: transformers
language:
- fr
- de
- en
- it
- lb
license: agpl-3.0
tags:
- language-identification
- multilingual
- historical
- impresso
---

# Model Card for `impresso-project/language-identifier`

## Overview

`impresso-project/language-identifier` is a multilingual language identification model fine-tuned for historical newspaper content. It supports **German (de), French (fr), Italian (it), English (en), and Luxembourgish (lb)**, the core languages of the [Impresso Project](https://impresso-project.ch), which analyzes historical media across national and linguistic borders.

The model has been adapted to the short, OCR-noisy, and fragmentary inputs typical of digitized historical texts.

## Model Details

### Model Description

- **Developed by:** University of Zurich (UZH), as part of the [Impresso team](https://impresso-project.ch). Impresso is an interdisciplinary research project on historical media analysis across languages, time, and modalities. It is funded by the Swiss National Science Foundation ([CRSII5_173719](http://p3.snf.ch/project-173719), [CRSII5_213585](https://data.snf.ch/grants/grant/213585)) and the Luxembourg National Research Fund (grant No. 17498891).
- **Model type:** Language identification using a transformer-based classification architecture
- **Languages:** French, German, English, Italian, Luxembourgish
- **License:** AGPL-3.0
- **Finetuned from:** Custom model trained on historical newspaper data from the Impresso corpus

## How to Use

```python
from transformers import pipeline

MODEL_NAME = "impresso-project/language-identifier"

# "langident" is not a built-in transformers task; it is loaded from the
# model repository, which is why trust_remote_code=True is needed.
lang_pipeline = pipeline(
    "langident",
    model=MODEL_NAME,
    trust_remote_code=True,
    device="cpu",
)

# Sample French input spanning several lines.
text = """En l'an 1348, au plus fort des ravages de la peste noire à travers
l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et
face à une opportunité."""

# Run language identification and print the prediction.
langs = lang_pipeline(text)
print(langs)
```

## Output Format

The pipeline returns a single dictionary containing the predicted language code and its confidence score:

```python
{
    "language": "fr",
    "score": 1.0
}
```

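Downstream code may want to act on a prediction only when its score is high enough. The following is a minimal sketch that assumes the output shape shown above (a dictionary with `language` and `score` keys). The helper name `accept_prediction`, the `0.5` threshold, and the `"und"` fallback label are illustrative assumptions, not part of the model's API.

```python
def accept_prediction(prediction, threshold=0.5, fallback="und"):
    """Return the predicted language code, or `fallback` if the score is too low.

    `threshold` and `fallback` are hypothetical defaults chosen for illustration.
    """
    if prediction["score"] >= threshold:
        return prediction["language"]
    return fallback


# Hand-written dictionaries in the documented output shape.
print(accept_prediction({"language": "fr", "score": 1.0}))
print(accept_prediction({"language": "lb", "score": 0.31}))
```

A suitable threshold depends on the corpus and is best chosen by inspecting scores on a labeled sample.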
## Use Cases

- Preprocessing for OCR and NLP tasks on historical corpora
- Document- and segment-level language tagging
- Filtering and sorting multilingual newspaper archives

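As a rough illustration of the filtering and sorting use case, the sketch below groups text segments by predicted language. It assumes only that the identifier is a callable returning a dictionary with a `language` key, as in the output format above. `group_by_language` and `toy_identify` are hypothetical names, and `toy_identify` is a keyword-based stand-in so that the example runs without downloading the model.

```python
from collections import defaultdict


def group_by_language(texts, identify):
    """Group texts by the language code that `identify` predicts for each."""
    groups = defaultdict(list)
    for text in texts:
        prediction = identify(text)
        groups[prediction["language"]].append(text)
    return dict(groups)


def toy_identify(text):
    # Hypothetical stand-in for the real pipeline: labels a text as German
    # if it contains the word "und", otherwise as French.
    is_german = " und " in f" {text.lower()} "
    return {"language": "de" if is_german else "fr", "score": 1.0}


segments = [
    "Der Rat und die Stadt",
    "Le conseil de la ville",
    "Handel und Gewerbe",
]
grouped = group_by_language(segments, toy_identify)
print(grouped)
```

In practice, the pipeline object from the usage section could be passed as `identify`, provided it returns one dictionary per input as documented.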
## Limitations

- Works best on **sentence- or paragraph-length** texts
- May struggle with code-switching or OCR-degraded text that mixes languages
- Primarily optimized for **Impresso-like sources** (19th–20th century newspapers)

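Because the model works best on sentence- or paragraph-length input, it can help to flag very short segments before classifying them. The sketch below counts alphabetic characters so that digits and punctuation, which are common in OCR noise, do not inflate the length. The helper name `has_enough_text` and the cutoff of 20 letters are assumptions made for illustration, not values recommended by the model authors.

```python
def has_enough_text(text, min_letters=20):
    """Return True if `text` contains at least `min_letters` alphabetic characters.

    The default cutoff is a hypothetical value and should be tuned per corpus.
    """
    letters = sum(1 for ch in text if ch.isalpha())
    return letters >= min_letters


print(has_enough_text("Zürich, 3."))
print(has_enough_text("Die Sitzung des Stadtrates wurde gestern eröffnet."))
```

Segments that fail the check could be skipped, or merged with neighboring segments before identification.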
## Installation

```bash
pip install transformers floret
```

## Contact

- Website: [https://impresso-project.ch](https://impresso-project.ch)

<p align="center">
  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
</p>