---
license: gpl-3.0
---

# OCR Quality Assessment using Unigram Language Data

This HuggingFace model repository contains known word lists (a.k.a. word unigram data)
in Bloom filter format, built for efficient and robust OCR quality assessment.

## Known Word Lists as Bloom Filters

All model names start with `ocrqa-`, and the remainder specifies the following metadata (see the example below):

- **Model Name:** A short identifier (e.g. `wp` for Wikipedia)
- **Version:** A specific model version identifier (e.g. `v1.0.0`)
- **Language:** The target language (e.g. `fr`, `de`)

If available, log files from the Bloom filter compilation process contain more details
about the word lists that were used.
All words in the Bloom filters are lowercased and normalized to Unicode NFKC form.
All digits are mapped to `0`, and many punctuation characters and other non-alphanumeric symbols are replaced by a space.

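For example, the file name `ocrqa-wp_v1.0.6-de.bloom` used in the usage example below combines the model name `wp`, the version `v1.0.6`, and the language `de`. The following minimal sketch splits such a file name into its parts; the `parse_model_filename` helper and its regular expression are illustrative assumptions, not part of this repository.

```python
import re

# Illustrates the ocrqa-<name>_<version>-<lang>.bloom naming convention (hypothetical helper).
FILENAME_PATTERN = re.compile(
    r"^ocrqa-(?P<name>[^_]+)_(?P<version>v[\d.]+)-(?P<lang>[a-z]{2})\.bloom$"
)


def parse_model_filename(filename: str) -> dict:
    """Split a Bloom filter file name into model name, version and language."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"Unexpected file name: {filename}")
    return match.groupdict()


print(parse_model_filename("ocrqa-wp_v1.0.6-de.bloom"))
# {'name': 'wp', 'version': 'v1.0.6', 'lang': 'de'}
```
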
## Installation

Tested with Python 3.11.

```bash
pip install cython pybloomfiltermmap3 huggingface_hub
```

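Once the dependencies are installed, you can check which Bloom filter files are currently published in this repository. This is a small sketch using `huggingface_hub.list_repo_files`; the exact file names it prints depend on the current contents of the repository.

```python
from huggingface_hub import list_repo_files

# List all Bloom filter files published in this model repository.
files = list_repo_files("impresso-project/OCR-quality-assessment-unigram")
print([f for f in files if f.endswith(".bloom")])
```
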
## Usage

To use these models for OCR QA in your project, you can use the following code snippet:

```python
import unicodedata
from typing import Optional

from huggingface_hub import hf_hub_download
from pybloomfilter import BloomFilter

# Define normalization table
QUOTES_PUNCT = "„•<>!\"#%&'’"
ASCII_PUNCT = "()*,./:;?"
BRACKETS_SPECIAL = "[]\\~_{}"
UNICODE_PUNCT = "\xa1\xab\xb7\xbb\xbf"
DASH_CARET = "—^`"
SPECIAL_SYMBOLS = "¦§£="
HYPHEN = "-"
DIGITS = "0123456789"

NORMALIZATION_TABLE = str.maketrans(
    {
        char: " "
        for char in (
            QUOTES_PUNCT
            + ASCII_PUNCT
            + BRACKETS_SPECIAL
            + UNICODE_PUNCT
            + DASH_CARET
            + SPECIAL_SYMBOLS
            + HYPHEN
        )
    }
    | {char: "0" for char in DIGITS}
)


def normalize_text(s: str, unicode_normalize: Optional[str] = "NFKC") -> str:
    """Normalize text by replacing punctuation with spaces and digits with '0'."""
    if unicode_normalize:
        s = unicodedata.normalize(unicode_normalize, s).lower()
    return s.translate(NORMALIZATION_TABLE)


def get_bloomfilter(model_id: str, filename: str):
    """Download a Bloom filter file from the Hugging Face Hub and open it."""
    return BloomFilter.open(hf_hub_download(repo_id=model_id, filename=filename))


def filter(text: str, bloom_filter: BloomFilter):
    """Print, for each token, whether it is known to the Bloom filter."""
    # Normalize and tokenize text
    normalized_text = normalize_text(text)
    tokens = normalized_text.split()

    # Check tokens against the bloom filter
    for token in tokens:
        if token in bloom_filter:
            print(f"'{token}' is in the bloom filter.")
        else:
            print(f"'{token}' is NOT in the bloom filter.")


def filter_text(text: str, bloom_filter: BloomFilter):
    """Return the sets of known and unknown tokens of a text."""
    knowns = set()
    unknowns = set()

    # Normalize and tokenize text
    normalized_text = normalize_text(text)
    tokens = normalized_text.split()

    # Check tokens against the bloom filter
    for token in tokens:
        if token in bloom_filter:
            print(f"'{token}' is in the bloom filter.")
            knowns.add(token)
        else:
            print(f"'{token}' is NOT in the bloom filter.")
            unknowns.add(token)

    return {"knowns": knowns, "unknowns": unknowns}


# Sample text that deliberately contains OCR/typing errors as well as correct words
DE_TEXT = """Dieser histrische Text änthält OCR-/Tippsfehler, aber auch einige korrekte Wörter."""

# Load the bloom filter
bf = get_bloomfilter(
    "impresso-project/OCR-quality-assessment-unigram", "ocrqa-wp_v1.0.6-de.bloom"
)

print(filter_text(DE_TEXT, bf))
```

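Building on the snippet above, the known/unknown sets returned by `filter_text` can be reduced to a single quality score, for example the share of distinct tokens that are known. This is only a minimal sketch for illustration, not the official impresso OCR QA metric, and the `score_text` helper is a hypothetical name.

```python
def score_text(text: str, bloom_filter: BloomFilter) -> float:
    """Return the fraction of distinct tokens that are known to the Bloom filter."""
    result = filter_text(text, bloom_filter)
    n_known = len(result["knowns"])
    n_unknown = len(result["unknowns"])
    total = n_known + n_unknown
    # An empty text gives no evidence either way; treat it as fully known here.
    return 1.0 if total == 0 else n_known / total


print(f"OCR quality score: {score_text(DE_TEXT, bf):.2f}")
```
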
## Limitations

- Only French and German are supported so far.
- New Wikipedia dumps should be used to update the word lists.

## Release info

- v1.0.6: Added more high-frequency words for German (historical spelling) and a few
  French ones. These models are planned to be used in the impresso webapp and API.
- v1.0.5: Initial release with impresso 1 word lists (only internally used, never
  available in the public webapp or API), built mostly from Wikipedia dumps from 2019.