---
license: gpl-3.0
---

# OCR Quality Assessment using Unigram Language Data

This HuggingFace model repository contains known word lists (a.k.a. word unigram data)
in Bloom filter format, built for efficient and robust OCR quality assessment.

## Known Word Lists as Bloom Filters

All model names start with `ocrqa-`, and the remainder specifies the following metadata (see the example below):

- **Model Name:** A short identifier (e.g. `wp` for Wikipedia)
- **Version:** A specific model version identifier (e.g. `v1.0.0`)
- **Language:** The target language (e.g. `fr`, `de`)

If available, log files from the Bloom filter compilation process contain more details
about the word lists that were used.
All words in the Bloom filters are lowercased and normalized to Unicode NFKC form.
All digits are mapped to `0`, and many punctuation characters and other non-alphanumeric symbols are replaced by a space.

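For example, the file name `ocrqa-wp_v1.0.6-de.bloom` used in the usage example below combines the model name `wp`, the version `v1.0.6`, and the language `de`. The following minimal sketch splits such a file name into its parts; the `parse_model_filename` helper and its regular expression are illustrative assumptions, not part of this repository.

```python
import re

# Illustrates the ocrqa-<name>_<version>-<lang>.bloom naming convention (hypothetical helper).
FILENAME_PATTERN = re.compile(
    r"^ocrqa-(?P<name>[^_]+)_(?P<version>v[\d.]+)-(?P<lang>[a-z]{2})\.bloom$"
)


def parse_model_filename(filename: str) -> dict:
    """Split a Bloom filter file name into model name, version and language."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename}")
    return match.groupdict()


print(parse_model_filename("ocrqa-wp_v1.0.6-de.bloom"))
# {'name': 'wp', 'version': 'v1.0.6', 'lang': 'de'}
```
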
## Installation

Tested with Python 3.11.

```bash
pip install cython pybloomfiltermmap3 huggingface_hub
```

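Once the dependencies are installed, you can check which Bloom filter files are currently published in this repository. This is a small sketch using `huggingface_hub.list_repo_files`; the exact file names it prints depend on the current contents of the repository.

```python
from huggingface_hub import list_repo_files

# List all Bloom filter files published in this model repository.
files = list_repo_files("impresso-project/OCR-quality-assessment-unigram")
print([f for f in files if f.endswith(".bloom")])
```
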
## Usage

To use these models for OCR QA in your project, you can use the following code snippet:

```python
import unicodedata
from typing import Optional

from huggingface_hub import hf_hub_download
from pybloomfilter import BloomFilter

# Define normalization table
QUOTES_PUNCT = "„•<>!\"#%&'’"
ASCII_PUNCT = "()*,./:;?"
BRACKETS_SPECIAL = "[]\\~_{}"
UNICODE_PUNCT = "\xa1\xab\xb7\xbb\xbf"
DASH_CARET = "—^`"
SPECIAL_SYMBOLS = "¦§£="
HYPHEN = "-"
DIGITS = "0123456789"

NORMALIZATION_TABLE = str.maketrans(
    {
        char: " "
        for char in (
            QUOTES_PUNCT
            + ASCII_PUNCT
            + BRACKETS_SPECIAL
            + UNICODE_PUNCT
            + DASH_CARET
            + SPECIAL_SYMBOLS
            + HYPHEN
        )
    }
    | {char: "0" for char in DIGITS}
)


def normalize_text(s: str, unicode_normalize: Optional[str] = "NFKC") -> str:
    """Normalize text by replacing punctuation with spaces and digits with '0'."""
    if unicode_normalize:
        s = unicodedata.normalize(unicode_normalize, s).lower()
    return s.translate(NORMALIZATION_TABLE)


def get_bloomfilter(model_id: str, filename: str):
    """Download a Bloom filter file from the Hugging Face Hub and open it."""
    return BloomFilter.open(hf_hub_download(repo_id=model_id, filename=filename))


def filter(text: str, bloom_filter: BloomFilter):
    """Print, for each token, whether it is known to the Bloom filter."""
    # Normalize and tokenize text
    normalized_text = normalize_text(text)
    tokens = normalized_text.split()

    # Check tokens against the bloom filter
    for token in tokens:
        if token in bloom_filter:
            print(f"'{token}' is in the bloom filter.")
        else:
            print(f"'{token}' is NOT in the bloom filter.")


def filter_text(text: str, bloom_filter: BloomFilter):
    """Return the sets of known and unknown tokens of a text."""
    knowns = set()
    unknowns = set()

    # Normalize and tokenize text
    normalized_text = normalize_text(text)
    tokens = normalized_text.split()

    # Check tokens against the bloom filter
    for token in tokens:
        if token in bloom_filter:
            print(f"'{token}' is in the bloom filter.")
            knowns.add(token)
        else:
            print(f"'{token}' is NOT in the bloom filter.")
            unknowns.add(token)

    return {"knowns": knowns, "unknowns": unknowns}


# Sample text that deliberately contains OCR/typing errors as well as correct words
DE_TEXT = """Dieser histrische Text änthält OCR-/Tippsfehler, aber auch einige korrekte Wörter."""

# Load the bloom filter
bf = get_bloomfilter(
    "impresso-project/OCR-quality-assessment-unigram", "ocrqa-wp_v1.0.6-de.bloom"
)

print(filter_text(DE_TEXT, bf))
```

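Building on the snippet above, the known/unknown sets returned by `filter_text` can be reduced to a single quality score, for example the share of distinct tokens that are known. This is only a minimal sketch for illustration, not the official impresso OCR QA metric, and the `score_text` helper is a hypothetical name.

```python
def score_text(text: str, bloom_filter: BloomFilter) -> float:
    """Return the fraction of distinct tokens that are known to the Bloom filter."""
    result = filter_text(text, bloom_filter)
    n_known = len(result["knowns"])
    n_unknown = len(result["unknowns"])
    total = n_known + n_unknown
    # An empty text gives no evidence either way; treat it as fully known here.
    return 1.0 if total == 0 else n_known / total


print(f"OCR quality score: {score_text(DE_TEXT, bf):.2f}")
```
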
## Limitations

- Only French and German are supported so far.
- New Wikipedia dumps should be used to update the word lists.

## Release info

- v1.0.6: Added more high-frequency words for German (historical spelling) and a few
  French ones. These models are planned to be used in the impresso webapp and API.
- v1.0.5: Initial release with impresso 1 word lists (only internally used, never
  available in the public webapp or API), built mostly from Wikipedia dumps from 2019.