document_redaction / tools / file_redaction.py

Commit History

Greatly improved regex for direct matching with custom entities
6ac4be4

seanpedrickcase committed on

Uploaded pdfs with review files will now include all pages that don't have redactions. Slightly improved deny list matching.
613b1b4

seanpedrickcase committed on

Fixed bug where pages suggested for whole redaction are one lower than requested
e8681e8

seanpedrickcase committed on

Fix bug to identify all handwriting labels. Now only concatenates entity_type boxes if they have different labels.
0d3554e

seanpedrickcase committed on

Side review bar is mostly there. A couple of bugs fixed. Can now return identified text in initial review files. Still working on retaining found text throughout review process
a03496e

seanpedrickcase committed on

Hopefully finally fixed the duplicate image_annotation_object issue
59ff822

seanpedrickcase committed on

Refactor redaction functionality and enhance UI components: Added support for custom recognizers and whole page redaction options. Updated file handling to include new dropdowns for entity selection and improved dataframes for entity management. Enhanced the annotator with better state management and UI responsiveness. Cleaned up redundant code and improved overall performance in the redaction process.
1d772de

seanpedrickcase committed on

Enhance file handling and UI features: improved Gradio app layout with fill width option, and integrated new settings for deny lists and fully redacted lists (placeholders so far). Updated file conversion functions to handle CSV inputs and added CSV review file generation for redactions. Now retains all original and merged redaction boxes.
a770956

seanpedrickcase committed on

Can now toggle colour change for boxes. Large boxes now remove text correctly
928b1e9

seanpedrickcase committed on

Fixed issue where redactions were sometimes not removing text underneath boxes. You can now redact in different colours from review page
23f8ca3

seanpedrickcase committed on

Updated packages. Reinstituted multithreading with page load, now with order protected. Smaller spacy model used for speed. Textract calls should now be faster
f0c28d7

seanpedrickcase committed on

Started adding in support for custom deny list. Fixed textract call issue. Removed multithreading for now as it mixes up pages
e3365ed

seanpedrickcase committed on

Multithreaded file preparation. Can call Textract without signature detection
9504619

seanpedrickcase committed on

Allowed for overwriting of default output folder in choose_and_run_redactor function.
68a91f4

seanpedrickcase committed on

Added option for running redact function through CLI (i.e. not going through Gradio UI or API). Test functions for running this through AWS Lambda.
e5dfae7

seanpedrickcase committed on

Only shows AWS options when AWS functions enabled. Can now upload previous review files to continue review later. Some review debugging.
e2aae24

seanpedrickcase committed on

Submitting modified redactions will no longer overwrite default labels
e69ae00

seanpedrickcase committed on

Comprehend now uses custom spacy recognisers on top of defaults. Added zoom functionality to annotator. Fixed some pdf mediabox issues and redacted image output issues.
ec98119

seanpedrickcase committed on

AWS Comprehend query numbers in logs should now add up correctly
c71d0c1

seanpedrickcase committed on

Returned file redaction timeout (before resending request) to 105 seconds default
f5b6c1b

seanpedrickcase committed on

Logs should now only be updated once per file run
2e71433

seanpedrickcase committed on

Improved time taken reporting and readme
04d80a1

seanpedrickcase committed on

Consolidated AWS Comprehend redaction calls to reduce total number
542c252

seanpedrickcase committed on

When on AWS, now loads in a default allow_list to exclude common words from redaction. Improved checks on AWS Comprehend calls.
390bef2

seanpedrickcase committed on

Changed default options for AWS.
056204b

seanpedrickcase committed on

Added support for AWS Comprehend for PII identification. OCR and detection results now written to main output
f0f9378

seanpedrickcase committed on

Allowed for time limits on redact to avoid timeouts. Improved review interface. Now accepts only one file at a time. Upgraded Gradio version
eea5c07

seanpedrickcase committed on

Upgraded packages. Fixed some issues with review process. Better progress reporting for user.
5b4b5fb

seanpedrickcase committed on

Allowed for PIL to load truncated images to avoid some load errors
a680619

seanpedrickcase committed on

Added 'Review redactions' tab to the app. You can now visually inspect suggested redactions and modify/add with a point and click interface.
ebf9010

seanpedrickcase committed on

Adjusted outputs correctly for situations where the pdf mediabox size is different from the visible page size
15026f7

seanpedrickcase committed on

Redaction tool can now export pdfs with selectable text retained - redacted text is deleted and covered with a black box. Licence change for pymupdf use.
339a165

seanpedrickcase committed on

General improvement in quick image matching and merging
84c83c0

seanpedrickcase committed on

Generally improved OCR recognition of texts, corrected postcode regex
a748df6

seanpedrickcase committed on

Optimised Textract and Tesseract workings
8652429

seanpedrickcase committed on

Improved allow list, handwriting/signature identification, logging
6ea0852

seanpedrickcase committed on

Added AWS Textract support. Allowed for OCR logs export.
e9c4101

seanpedrickcase committed on

Updated time sum function to sum correctly
e1c402a

seanpedrickcase committed on

Updated default AWS_FUNCTION value. Logs seconds values from outputs correctly.
7aa4d5f

seanpedrickcase committed on

Should now correctly extract and sum up total processing time
f8700a5

seanpedrickcase committed on

Enhanced logging of usage. Small buffer added to redaction rectangles as it seems to miss the tops of text often.
34addbf

seanpedrickcase committed on

Works correctly with images again
230fcc3

seanpedrickcase committed on

Can now select only specific pages in document to redact. Image based redaction should work correctly now.
bc4bdbd

seanpedrickcase committed on

Handles multiple runs with multiple files correctly now. Logging and feedback improvements.
bbf818d

seanpedrickcase committed on

Updated decision making output files, log locations
93ac94f

seanpedrickcase committed on

Decision process now saved as log files. Other log files and feedback added
8c33828

seanpedrickcase committed on

Added logging, anonymising all Excel sheets, simple redaction tags, some Dockerfile optimisation
01c88c0

seanpedrickcase committed on

Can now redact text or csv/xlsx files. Can redact multiple files. Embeds redactions as image-based file by default
7810536

seanpedrickcase committed on
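
Note on order-protected multithreading (commits e3365ed and f0c28d7): the history says threading was removed because it mixed up pages, then reinstated "with order protected", but it does not show how. One common way to get that property in Python is to rely on Executor.map, which returns results in input order regardless of which worker finishes first. The sketch below illustrates that approach only; it is an assumption, not the repository's code, and prepare_page and prepare_pages_in_order are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def prepare_page(page_number: int) -> str:
    """Hypothetical stand-in for per-page work (e.g. rendering a page to an image)."""
    return f"page-{page_number}"


def prepare_pages_in_order(page_numbers: list[int], max_workers: int = 4) -> list[str]:
    # Executor.map yields results in the same order as its inputs, even when
    # worker threads complete out of order, so the page sequence is preserved.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(prepare_page, page_numbers))


if __name__ == "__main__":
    print(prepare_pages_in_order([1, 2, 3, 4, 5]))
```

Collecting futures with as_completed instead would return pages in completion order, which is the kind of shuffling the earlier commit describes.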