---
library_name: transformers
language:
- fr
- de
- en
- it
- lb
license: agpl-3.0
tags:
- language-identification
- multilingual
- historical
- impresso
---
# Model Card for `impresso-project/language-identifier`
## Overview
`impresso-project/language-identifier` is a multilingual language identification model fine-tuned for use on historical newspaper content. It supports **German (de), French (fr), Italian (it), English (en), and Luxembourgish (lb)** — the core languages of the [Impresso Project](https://impresso-project.ch), which focuses on analyzing historical media across national and linguistic borders.
This model has been adapted for short, OCR-noisy and fragmentary inputs typical of historical digitized texts.
## Model Details
### Model Description
- **Developed by:** University of Zurich (UZH), as part of the [Impresso team](https://impresso-project.ch). Impresso is an interdisciplinary research project focused on historical media analysis across languages, time, and modalities. It is funded by the Swiss National Science Foundation ([CRSII5_173719](http://p3.snf.ch/project-173719), [CRSII5_213585](https://data.snf.ch/grants/grant/213585)) and the Luxembourg National Research Fund (grant No. 17498891).
- **Model type:** Language identification using a transformer-based classification architecture
- **Languages:** French, German, English, Italian, Luxembourgish
- **License:** AGPL-3.0
- **Finetuned from:** Custom model trained on historical newspaper data from the Impresso corpus
## How to Use
```python
from transformers import pipeline

MODEL_NAME = "impresso-project/language-identifier"

# "langident" is the custom pipeline task shipped with the model repository,
# which is why trust_remote_code=True is required.
lang_pipeline = pipeline(
    "langident",
    model=MODEL_NAME,
    trust_remote_code=True,
    device="cpu",
)

text = """En l'an 1348, au plus fort des ravages de la peste noire à travers
l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et
face à une opportunité."""

# Run language identification on a single text
langs = lang_pipeline(text)
print(langs)
```
## Output Format
The output is a single dictionary with the predicted language and confidence score:
```python
{
    "language": "fr",
    "score": 1.0
}
```
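Downstream code will often want to act only on confident predictions. The following is a minimal sketch of that idea, assuming the output dictionary has exactly the `language` and `score` keys shown above. The helper name `accept_prediction` and the `MIN_SCORE` threshold are hypothetical and not part of the model's API; an appropriate threshold should be chosen on your own data.

```python
from typing import Optional

# Hypothetical confidence threshold (an assumption, not a recommended value).
MIN_SCORE = 0.9


def accept_prediction(prediction: dict, min_score: float = MIN_SCORE) -> Optional[str]:
    """Return the predicted language code, or None if the score is below min_score."""
    if prediction["score"] >= min_score:
        return prediction["language"]
    return None


# Example dictionary in the documented output shape
example = {"language": "fr", "score": 1.0}
print(accept_prediction(example))
```

Predictions that come back as `None` can be routed to manual review or a fallback rule rather than being tagged with a possibly wrong language.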
## Use Cases
- Preprocessing for OCR and NLP tasks on historical corpora
- Document and segment-level language tagging
- Filtering and sorting multilingual newspaper archives
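As an illustration of the filtering and sorting use case, the sketch below groups text segments by their predicted language. It assumes an `identify` callable that takes one string and returns a dictionary with a `language` key, as in the output format described in this card. The function `group_by_language` and the stand-in `fake_identify` are hypothetical helpers written for this example; in practice you would pass the loaded pipeline instead of the stand-in.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_language(
    segments: Iterable[str],
    identify: Callable[[str], dict],
) -> Dict[str, List[str]]:
    """Group segments under the language code predicted for each one."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for segment in segments:
        prediction = identify(segment)
        groups[prediction["language"]].append(segment)
    return dict(groups)


def fake_identify(text: str) -> dict:
    """Stand-in for the real pipeline so the sketch runs without a model download."""
    language = "de" if " und " in f" {text} " else "fr"
    return {"language": language, "score": 1.0}


segments = ["Tag und Nacht", "le chat dort", "Haus und Hof"]
for language, texts in group_by_language(segments, fake_identify).items():
    print(language, texts)
```

Because the classifier is passed in as a parameter, the same grouping logic can be tested with a stub and then run unchanged against the real model.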
## Limitations
- Works best on **sentence- or paragraph-length** texts
- May struggle with code-switching or OCR-degraded text that mixes languages
- Primarily optimized for **Impresso-like sources** (19th–20th century newspapers)
## Installation
```bash
pip install transformers floret
```
## Contact
- Website: [https://impresso-project.ch](https://impresso-project.ch)
<p align="center">
<img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
</p>