	---
	library_name: transformers
	language:
	- fr
	- de
	- en
	- it
	- lb
	license: agpl-3.0
	tags:
	- language-identification
	- multilingual
	- historical
	- impresso
	---

	# Model Card for `impresso-project/language-identifier`

	## Overview

	`impresso-project/language-identifier` is a multilingual language identification model fine-tuned for historical newspaper content. It supports German (de), French (fr), Italian (it), English (en), and Luxembourgish (lb), the core languages of the [Impresso Project](https://impresso-project.ch), which analyzes historical media across national and linguistic borders.

	The model is adapted to the short, OCR-noisy, and fragmentary inputs typical of digitized historical texts.

	## Model Details

	### Model Description

	- Developed by: Members of the [Impresso team](https://impresso-project.ch) at the University of Zurich (UZH). Impresso is an interdisciplinary research project on historical media analysis across languages, time, and modalities, funded by the Swiss National Science Foundation ([CRSII5_173719](http://p3.snf.ch/project-173719), [CRSII5_213585](https://data.snf.ch/grants/grant/213585)) and the Luxembourg National Research Fund (grant No. 17498891).
	- Model type: Language identification using a transformer-based classification architecture
	- Languages: French, German, English, Italian, Luxembourgish
	- License: AGPL-3.0
	- Finetuned from: Custom model trained on historical newspaper data from the Impresso corpus

	## How to Use

	```python
	from transformers import pipeline

	MODEL_NAME = "impresso-project/language-identifier"

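	# "langident" is not a built-in transformers task; it is presumably defined
	# by code in the model repository, hence trust_remote_code=True.
	lang_pipeline = pipeline(
	    "langident",
	    model=MODEL_NAME,
	    trust_remote_code=True,
	    device="cpu",
	)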

	text = """En l'an 1348, au plus fort des ravages de la peste noire à travers
	l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et
	face à une opportunité."""

	langs = lang_pipeline(text)
	print(langs)
	```

	## Output Format

	The output is a single dictionary with the predicted language and confidence score:

	```python
	{
	    "language": "fr",
	    "score": 1.0
	}
	```
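
When predictions feed a downstream pipeline, it can help to discard low-confidence results. The following is a minimal sketch that assumes the dictionary shape shown above. The helper name `accept_language` and the `0.9` threshold are illustrative assumptions, not part of the model's API, and the threshold should be tuned on your own data.

```python
from typing import Optional


def accept_language(prediction: dict, min_score: float = 0.9) -> Optional[str]:
    """Return the predicted language code, or None if the score is too low.

    `min_score` is a hypothetical default, not a value recommended by the
    model authors.
    """
    language = prediction.get("language")
    score = float(prediction.get("score", 0.0))
    if language is None or score < min_score:
        return None
    return language


# Example with the output shown above.
example = {"language": "fr", "score": 1.0}
print(accept_language(example))
```

In practice, `prediction` would be the value returned by `lang_pipeline(text)`.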


	## Use Cases

	- Preprocessing for OCR and NLP tasks on historical corpora
	- Document and segment-level language tagging
	- Filtering and sorting multilingual newspaper archives
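
As an illustration of the filtering and sorting use case, the sketch below groups texts by predicted language. It assumes each prediction is a dictionary with a `"language"` key, as described under Output Format. `group_by_language` and `fake_identify` are hypothetical names introduced here; `fake_identify` is only a stand-in so the example runs without downloading the model.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_language(
    texts: Iterable[str],
    identify: Callable[[str], dict],
) -> Dict[str, List[str]]:
    """Group texts by the "language" field of each prediction."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text in texts:
        prediction = identify(text)
        groups[prediction["language"]].append(text)
    return dict(groups)


def fake_identify(text: str) -> dict:
    """Hypothetical stand-in for the real pipeline, for demonstration only."""
    language = "de" if " und " in f" {text} " else "fr"
    return {"language": language, "score": 1.0}


archive = [
    "Le conseil municipal se réunit demain.",
    "Der Rat und die Bürger trafen sich gestern.",
    "Les récoltes ont été abondantes cette année.",
]
grouped = group_by_language(archive, fake_identify)
for language in sorted(grouped):
    print(language, len(grouped[language]))
```

With the real model, pass the `lang_pipeline` object from the How to Use section as `identify`.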

	## Limitations

	- Works best on sentence- or paragraph-length texts
	- May struggle with code-switching or OCR-degraded text that mixes languages
	- Primarily optimized for Impresso-like sources (19th–20th century newspapers)
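
Given these limitations, one possible workaround for mixed-language documents is to tag paragraph-sized segments separately and leave very short fragments unlabeled. The sketch below is an assumption-laden illustration, not the authors' method. `tag_segments`, the blank-line splitting rule, and the `min_chars=20` cutoff are all hypothetical choices, and `fake_identify` merely stands in for the real pipeline.

```python
import re
from typing import Callable, List, Optional, Tuple


def tag_segments(
    document: str,
    identify: Callable[[str], dict],
    min_chars: int = 20,
) -> List[Tuple[str, Optional[str], float]]:
    """Split a document on blank lines and tag each segment separately.

    Segments shorter than `min_chars` (a hypothetical cutoff) are not sent
    to the model and are returned with language None and score 0.0.
    """
    segments = [s.strip() for s in re.split(r"\n\s*\n", document) if s.strip()]
    results: List[Tuple[str, Optional[str], float]] = []
    for segment in segments:
        if len(segment) < min_chars:
            results.append((segment, None, 0.0))
            continue
        prediction = identify(segment)
        results.append(
            (segment, prediction["language"], float(prediction["score"]))
        )
    return results


def fake_identify(text: str) -> dict:
    """Hypothetical stand-in for the real pipeline, for demonstration only."""
    language = "it" if "della" in text else "en"
    return {"language": language, "score": 1.0}


document = (
    "Short.\n\n"
    "La riunione della giunta è fissata per domani mattina.\n\n"
    "The council will meet again tomorrow morning."
)
tagged = tag_segments(document, fake_identify)
for segment, language, score in tagged:
    print(language, score, segment[:30])
```

Per-segment tags can then be aggregated or inspected individually, depending on whether a document-level or segment-level label is needed.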

	## Installation

	```bash
	pip install transformers floret
	```

	## Contact

	- Website: [https://impresso-project.ch](https://impresso-project.ch)

	<p align="center">
	<img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
	</p>