document_redaction/tools/custom_image_analyser_engine.py

Commit History

Added regex functionality to deny lists. Corrected tesseract to word level parsing. Improved review search regex capabilities. Updated documentation
4852fb5

seanpedrickcase committed on

Allowed Tesseract to run OCR in line-level mode and then query an LLM with line-level data. Added option for running as an MCP server. Added API for multi-word text search
419fb7d

seanpedrickcase committed on

Fixed minor bugs related to Textract API calls, pyproject format. Removed print statements and fixed some future concat deprecation issues
7bb945f

seanpedrickcase committed on

Added suffix to textract output files according to tasks included (e.g. signature analysis). Improved reporting when Textract client doesn't exist. Fixed display for cost and time taken. Changes to config variables to allow exclusion of PaddleOCR from display
25e2089

seanpedrickcase committed on

Improved paddle and hybrid OCR analysis across all options. Tried to revise requirements for spaces
2c00d05

seanpedrickcase committed on

Allowed Paddle to load at startup. Updated requirements for torch compatibility
bf83b6f

seanpedrickcase committed on

Optimised VLM model choice and prompting/parameters
ad60619

seanpedrickcase committed on

Added hybrid paddle + vlm option. Optimised word segmenters for single words. Optimised package installation in pyproject.toml
6d4f6e4

seanpedrickcase committed on

Added upgraded line to word parsing algorithm. Added dependencies and framework for Huggingface spaces deployment with ZeroGPU
c2becd8

seanpedrickcase committed on

Improved new requirements. Improved visual OCR outputs and word-level Paddle outputs and general bounding box positioning
e4493fe

seanpedrickcase committed on

Initial commit for VLM support. Created visualisations for OCR output. Corrected log_file_output_paths reference.
5e01004

seanpedrickcase committed on

Corrected environment variable file references for log files and spacy/paddle folders for lambda_entrypoint
e347a56

seanpedrickcase committed on

Modified Dockerfile and entrypoint to switch user at runtime. Updated output folder file creation for custom_image_analyser_engine.py and find_duplicate_pages.py
3dd6d75

seanpedrickcase committed on

Updated formatting and linting check
ed45a4a

seanpedrickcase committed on

Updated file save sections in custom_image_analyser_engine.py with secure file writes
87f1356

seanpedrickcase committed on

Improved PaddleOCR implementation (greater accuracy, now can save outputs with config setting). Updated Dockerfile entrypoint for Lambda to hopefully avoid permissions issues
d882db9

seanpedrickcase committed on

Updated paddleocr implementation to have a menu option on the GUI with a config value change. Minor package updates, favicon update, and update to Dockerfile to allow for Lambda function execution.
0d7ad2a

seanpedrickcase committed on

Added the possibility of saving initial redacted pdfs with redaction comments directly attached. Fix for missing Textract pages. Better Textract forms element extraction and save.
f333cf5

seanpedrickcase committed on

OCR outputs now return confidence values
a159312

seanpedrickcase committed on

General code changes and reformatting to address code vulnerabilities highlighted by CodeQL scan, and black/ruff reapplied to code. Fixes/optimisation of GitHub Actions
f957846

seanpedrickcase committed on

Fixed deprecated GitHub workflow functions. Applied linter and formatter to code throughout. Added tests for GUI load.
bafcf39

seanpedrickcase committed on

Added example data files. Greatly revised the CLI for redaction, deduplication, and AWS Textract batch calls. Various minor fixes and package updates.
d60759d

seanpedrickcase committed on

Corrected some issues with redacting multiple xlsx/docx files. Package updates.
6f96988

seanpedrickcase committed on

Corrected an issue with finding valid language entities for AWS comprehend redaction
f188b10

seanpedrickcase committed on

Updated command line redaction script with more options
3bff849

seanpedrickcase committed on

Improved language support functions and reporting
601fcda

seanpedrickcase committed on

Added support for other languages. Improved DynamoDB download
9ae09da

seanpedrickcase committed on

OCR results now return line numbers consistently. Made redaction search more resilient to punctuation at end of terms.
003292d

seanpedrickcase committed on

Can now redact terms using a new redact search tab on the Review Redactions tab. Various minor improvements
ee6b7fb

seanpedrickcase committed on

Now local OCR outputs can be saved to file and reloaded to save preparation time. Bug fixing in logs and tabular data redaction. Update to documentation
f93e49c

seanpedrickcase committed on

Improved logging format a little. Now possible to save logs to DynamoDB
0042e78

seanpedrickcase committed on

More config options. Fixed some bugs with removing elements from review page and Adobe export. Some UI rearrangements
6319afc

seanpedrickcase committed on

Integrated AWS Comprehend and fuzzy matching functions with tabular data redaction.
ff290e1

seanpedrickcase committed on

Allowed for Textract and Comprehend API calls through AWS keys. File preparation function incorporated into main redaction function to avoid needing user to 'check in' during redaction process
391712c

seanpedrickcase committed on

Fuzzy match implementation for deny list. Added option to merge multiple review files. Review files from redaction step should now include text.
bde6e5b

seanpedrickcase committed on

Ensured the text ocr outputs have no line breaks at end. Multi-line custom text searches now possible. Files for review sent from redact button. Fixed image redaction (not review yet). Can get user pool details from headers. Gradio update.
cb349ad

seanpedrickcase committed on

App should now resize images that are too large before sending to Textract. Textract now more robust to failure. Improved reliability of json conversion to review dataframe
143e2cc

seanpedrickcase committed on

Greatly improved regex for direct matching with custom entities
6ac4be4

seanpedrickcase committed on

Started adding in support for custom deny list. Fixed textract call issue. Removed multithreading for now as it mixes up pages
e3365ed

seanpedrickcase committed on

Only shows AWS options when AWS functions enabled. Can now upload previous review files to continue review later. Some review debugging.
e2aae24

seanpedrickcase committed on

Comprehend now uses custom spacy recognisers on top of defaults. Added zoom functionality to annotator. Fixed some pdf mediabox issues and redacted image output issues.
ec98119

seanpedrickcase committed on

Consolidated AWS Comprehend redaction calls to reduce total number
542c252

seanpedrickcase committed on

When on AWS, now loads in a default allow_list to exclude common words from redaction. Improved checks on AWS Comprehend calls.
390bef2

seanpedrickcase committed on