Commits · seanpedrickcase/document

Added regex functionality to deny lists. Corrected tesseract to word level parsing. Improved review search regex capabilities. Updated documentation

4852fb5

seanpedrickcase committed 16 days ago

Allow for tesseract to run OCR in line-level mode and then query LLM with line-level data. Added option for running as MCP server, added api for multi-word text search

419fb7d

seanpedrickcase committed 16 days ago

Fixed minor bugs related to Textract API calls, pyproject format. Removed print statements and fixed some future concat deprecation issues

7bb945f

seanpedrickcase committed 17 days ago

formatter and linter applied

ca530a1

seanpedrickcase committed 17 days ago

Added suffix to textract output files according to tasks included (e.g. signature analysis). Improved reporting when Textract client doesn't exist. Fixed display for cost and time taken. Changes to config variables to allow exclusion of PaddleOCR from display

25e2089

seanpedrickcase committed 17 days ago

Improved paddle and hybrid OCR analysis across all options. Tried to revise requirements for spaces

2c00d05

seanpedrickcase committed 18 days ago

Allowed for loading Paddle at startup. Updated requirements for torch compatibility

bf83b6f

seanpedrickcase committed 18 days ago

formatter and linter applied

bcb5ad4

seanpedrickcase committed 19 days ago

Updated word segmenter code

4440bed

seanpedrickcase committed 19 days ago

Added text rotation capability

1ff0b3d

seanpedrickcase committed 20 days ago

Optimised VLM model choice and prompting/parameters

ad60619

seanpedrickcase committed 20 days ago

Added hybrid paddle + vlm option. Optimised word segmenters for single words. Optimised package installation in pyproject.toml

6d4f6e4

seanpedrickcase committed 20 days ago

Added upgraded line-to-word parsing algorithm. Added dependencies and framework for Hugging Face Spaces deployment with ZeroGPU

c2becd8

seanpedrickcase committed 20 days ago

Improved new requirements. Improved visual OCR outputs and word-level Paddle outputs and general bounding box positioning

e4493fe

seanpedrickcase committed 25 days ago

Initial commit for VLM support. Created visualisations for OCR output. Corrected log_file_output_paths reference.

5e01004

seanpedrickcase committed 26 days ago

Corrected environment variable file references for log files and spacy/paddle folders for lambda_entrypoint

e347a56

seanpedrickcase committed about 1 month ago

Formatter and linter check

b3d51df

seanpedrickcase committed on Oct 23

Modified Dockerfile and entrypoint to switch user at runtime. Updated output folder file creation for custom_image_analyser_engine and find_duplicate_pages.py

3dd6d75

seanpedrickcase committed on Oct 23

Updated formatting and linting check

ed45a4a

seanpedrickcase committed on Oct 23

Updated file save sections in custom_image_analyser_engine.py with secure file writes

87f1356

seanpedrickcase committed on Oct 23

Updated formatting and linting

1b1fd61

seanpedrickcase committed on Oct 23

Improved PaddleOCR implementation (greater accuracy, now can save outputs with config setting). Updated Dockerfile entrypoint for Lambda to hopefully avoid permissions issues

d882db9

seanpedrickcase committed on Oct 23

Updated paddleocr implementation to have a menu option on the GUI with a config value change. Minor package updates, favicon update, and update to Dockerfile to allow for Lambda function execution.

0d7ad2a

seanpedrickcase committed on Oct 22

Added the possibility of saving initial redacted pdfs with redaction comments directly attached. Fix for missing Textract pages. Better Textract forms element extraction and save.

f333cf5

seanpedrickcase committed on Sep 26

OCR outputs now return confidence values

a159312

seanpedrickcase committed on Sep 25

General code changes and reformatting to address code vulnerabilities highlighted by CodeQL scan, and black/ruff reapplied to code. Fixes/optimisation of GitHub Actions

f957846

seanpedrickcase committed on Sep 23

Fixed deprecated GitHub workflow functions. Applied linter and formatter to code throughout. Added tests for GUI load.

bafcf39

seanpedrickcase committed on Sep 23

Added example data files. Greatly revised the CLI for redaction, deduplication, and AWS Textract batch calls. Various minor fixes and package updates.

d60759d

seanpedrickcase committed on Sep 21

Corrected some issues with redacting multiple xlsx/docx files. Package updates.

6f96988

seanpedrickcase committed on Aug 22

Corrected an issue with finding valid language entities for AWS comprehend redaction

f188b10

seanpedrickcase committed on Aug 21

Updated command line redaction script with more options

3bff849

seanpedrickcase committed on Aug 21

Improved language support functions and reporting

601fcda

seanpedrickcase committed on Aug 21

Added support for other languages. Improved DynamoDB download

9ae09da

seanpedrickcase committed on Aug 21

OCR results now return line numbers consistently. Made redaction search more resilient to punctuation at end of terms.

003292d

seanpedrickcase committed on Aug 19

Added PaddleOCR support

2878a94

seanpedrickcase committed on Aug 19

Can now redact terms using a new redact search tab on the Review Redactions tab. Various minor improvements

ee6b7fb

seanpedrickcase committed on Aug 14

Now local OCR outputs can be saved to file and reloaded to save preparation time. Bug fixing in logs and tabular data redaction. Update to documentation

f93e49c

seanpedrickcase committed on Apr 28

Improved logging format a little. Now possible to save logs to DynamoDB

0042e78

seanpedrickcase committed on Apr 27

More config options. Fixed some bugs with removing elements from review page and Adobe export. Some UI rearrangements

6319afc

seanpedrickcase committed on Mar 24

Integrated AWS Comprehend and fuzzy matching functions with tabular data redaction.

ff290e1

seanpedrickcase committed on Mar 5

Allowed for Textract and Comprehend API calls through AWS keys. File preparation function incorporated into main redaction function to avoid needing user to 'check in' during redaction process

391712c

seanpedrickcase committed on Feb 25

Fuzzy match implementation for deny list. Added option to merge multiple review files. Review files from redaction step should now include text.

bde6e5b

seanpedrickcase committed on Jan 27

Ensured the text OCR outputs have no line breaks at end. Multi-line custom text searches now possible. Files for review sent from redact button. Fixed image redaction (not review yet). Can get user pool details from headers. Gradio update.

cb349ad

seanpedrickcase committed on Jan 21

App should now resize images that are too large before sending to Textract. Textract now more robust to failure. Improved reliability of json conversion to review dataframe

143e2cc

seanpedrickcase committed on Jan 15

Greatly improved regex for direct matching with custom entities

6ac4be4

seanpedrickcase committed on Jan 14

Started adding in support for custom deny list. Fixed textract call issue. Removed multithreading for now as it mixes up pages

e3365ed

seanpedrickcase committed on Dec 17, 2024

Only shows AWS options when AWS functions enabled. Can now upload previous review files to continue review later. Some review debugging.

e2aae24

seanpedrickcase committed on Nov 18, 2024

Comprehend now uses custom spacy recognisers on top of defaults. Added zoom functionality to annotator. Fixed some pdf mediabox issues and redacted image output issues.

ec98119

seanpedrickcase committed on Nov 8, 2024

Consolidated AWS Comprehend redaction calls to reduce total number

542c252

seanpedrickcase committed on Nov 6, 2024

When on AWS, now loads in a default allow_list to exclude common words from redaction. Improved checks on AWS Comprehend calls.

390bef2

seanpedrickcase committed on Nov 6, 2024
