document_redaction/tools/load_spacy_model_custom_recognisers.py

Commit History

Added regex functionality to deny lists. Corrected Tesseract to word-level parsing. Improved review search regex capabilities. Updated documentation.
4852fb5

seanpedrickcase committed on

Again revised spaCy language model load for different languages
2f34683

seanpedrickcase committed on

Modified model load for custom languages with spaCy. Languages should load successfully now.
2148ddd

seanpedrickcase committed on

Corrected environment variable file references for log files and spacy/paddle folders for lambda_entrypoint
e347a56

seanpedrickcase committed on

Enabled export of both review pdfs and redacted pdfs from same redaction run. Added config variables for user guide url and showing redaction settings. Moved config variables around a bit. Minor GUI improvements
44d987c

seanpedrickcase committed on

Removed some extraneous test steps. Improved Example loading and feedback, and redaction feedback. Minor security updates. Fixed Adobe xfdf file parsing.
1cb1897

seanpedrickcase committed on

Fixed deprecated GitHub workflow functions. Applied linter and formatter to code throughout. Added tests for GUI load.
bafcf39

seanpedrickcase committed on

Added example data files. Greatly revised the CLI for redaction, deduplication, and AWS Textract batch calls. Various minor fixes and package updates.
d60759d

seanpedrickcase committed on

Updated documentation. Fix for ocr_output upload before pdf. Duplicate page fix.
af187f0

seanpedrickcase committed on

Improved language support functions and reporting
601fcda

seanpedrickcase committed on

Added support for other languages. Improved DynamoDB download
9ae09da

seanpedrickcase committed on

Major update. General code revision. Improved config variables. Dataframe-based review frame now includes text; items can be searched and excluded. Costs are now estimated. Added option for cost codes. Added option to extract text only.
0ea8b9e

seanpedrickcase committed on

Added features to review dataframe to filter and exclude features based on text. Text should now appear consistently in review_df (for boxes not modified). Returned to using the larger spaCy model. Gradio upgrade.
66e145d

seanpedrickcase committed on

Fixed issues with gradio version 5.16. Fixed fuzzy search error with pages with no data.
3cecbfa

seanpedrickcase committed on

Fuzzy match implementation for deny list. Added option to merge multiple review files. Review files from redaction step should now include text.
bde6e5b

seanpedrickcase committed on

Ensured the text ocr outputs have no line breaks at end. Multi-line custom text searches now possible. Files for review sent from redact button. Fixed image redaction (not review yet). Can get user pool details from headers. Gradio update.
cb349ad

seanpedrickcase committed on

Greatly improved regex for direct matching with custom entities
6ac4be4

seanpedrickcase committed on

Uploaded pdfs with review files will now include all pages that don't have redactions. Slightly improved deny list matching.
613b1b4

seanpedrickcase committed on

Refactor redaction functionality and enhance UI components: Added support for custom recognizers and whole page redaction options. Updated file handling to include new dropdowns for entity selection and improved dataframes for entity management. Enhanced the annotator with better state management and UI responsiveness. Cleaned up redundant code and improved overall performance in the redaction process.
1d772de

seanpedrickcase committed on

Updated packages. Reinstituted multithreading with page load, now with order protected. Smaller spacy model used for speed. Textract calls should now be faster
f0c28d7

seanpedrickcase committed on

Started adding in support for custom deny list. Fixed textract call issue. Removed multithreading for now as it mixes up pages
e3365ed

seanpedrickcase committed on

Comprehend now uses custom spacy recognisers on top of defaults. Added zoom functionality to annotator. Fixed some pdf mediabox issues and redacted image output issues.
ec98119

seanpedrickcase committed on

Allowed for time limits on redact to avoid timeouts. Improved review interface. Now accepts only one file at a time. Upgraded Gradio version
eea5c07

seanpedrickcase committed on

Redaction tool can now export pdfs with selectable text retained - redacted text is deleted and covered with a black box. Licence change for pymupdf use.
339a165

seanpedrickcase committed on

Generally improved OCR recognition of texts, corrected postcode regex
a748df6

seanpedrickcase committed on

Optimised Textract and Tesseract workings
8652429

seanpedrickcase committed on

Improved allow list, handwriting/signature identification, logging
6ea0852

seanpedrickcase committed on

Version 0.1. Adapted code for pyinstaller local executable conversion (Windows)
2a4b347

seanpedrickcase committed on