Commit 391712c
Parent(s): 35a1591

Allowed for Textract and Comprehend API calls through AWS keys. File preparation function incorporated into the main redaction function to avoid needing the user to 'check in' during the redaction process.
Files changed:
- .dockerignore +1 -1
- .gitignore +1 -0
- Dockerfile +1 -0
- README.md +32 -2
- app.py +15 -18
- how_to_create_exe_dist.txt +4 -2
- tools/aws_functions.py +10 -1
- tools/custom_image_analyser_engine.py +1 -5
- tools/file_conversion.py +0 -4
- tools/file_redaction.py +42 -25
.dockerignore CHANGED
@@ -16,5 +16,5 @@ build/*
 dist/*
 build_deps/*
 logs/*
-
+config/*
 user_guide/*
.gitignore CHANGED
@@ -16,5 +16,6 @@ build/*
 dist/*
 build_deps/*
 logs/*
+config/*
 doc_redaction_amplify_app/*
 user_guide/*
Dockerfile CHANGED
@@ -56,6 +56,7 @@ RUN mkdir -p /home/user/app/output \
     && mkdir -p /home/user/app/input \
     && mkdir -p /home/user/app/tld \
     && mkdir -p /home/user/app/logs \
+    && mkdir -p /home/user/app/config \
     && chown -R user:user /home/user/app

 # Copy installed packages from builder stage
README.md CHANGED
@@ -34,7 +34,16 @@ NOTE: The app is not 100% accurate, and it will miss some personal information.
 - [Handwriting and signature redaction](#handwriting-and-signature-redaction)
 - [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)

-See the [advanced user guide here](#advanced-user-guide)
+See the [advanced user guide here](#advanced-user-guide):
+- [Modifying and merging redaction review files](#modifying-and-merging-redaction-review-files)
+- [Modifying existing redaction review files](#modifying-existing-redaction-review-files)
+- [Merging existing redaction review files](#merging-existing-redaction-review-files)
+- [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
+- [Fuzzy search and redaction](#fuzzy-search-and-redaction)
+- [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
+- [Exporting to Adobe Acrobat](#exporting-to-adobe-acrobat)
+- [Importing from Adobe Acrobat](#importing-from-adobe-acrobat)
+- [Using AWS Textract and Comprehend when not running in an AWS environment](#using-aws-textract-and-comprehend-when-not-running-in-an-aws-environment)

 ## Example data files

@@ -292,4 +301,25 @@ The app also allows you to import .xfdf files from Adobe Acrobat. To do this, go

 When you click the 'convert .xfdf comment file to review_file.csv' button, the app should take you up to the top of the screen where the new review file has been created and can be downloaded.

-
+
+
+## Using AWS Textract and Comprehend when not running in an AWS environment
+
+AWS Textract and Comprehend give much better results for text extraction and document redaction than the local model options in the app. The most secure way to access them in the Redaction app is to run the app in a secure AWS environment with relevant permissions. Alternatively, you could run the app on your own system while logged in to AWS SSO with relevant permissions.
+
+However, it is possible to access these services directly via API from outside an AWS environment by creating IAM users and access keys with relevant permissions to access AWS Textract and Comprehend services. Please check with your IT and data security teams that this approach is acceptable for your data before trying the following approaches.
+
+To do the following, in your AWS environment you will need to create a new user with permissions for "textract:AnalyzeDocument", "textract:DetectDocumentText", and "comprehend:DetectPiiEntities". Under security credentials, create new access keys - note down the access key and secret key.
+
+### Direct access by passing AWS access keys through app
+The Redaction Settings tab now has boxes for entering the AWS access key and secret key. If you paste the relevant keys into these boxes before performing redaction, you should be able to use these services in the app.
+
+### Picking up AWS access keys through an .env file
+The app also has the capability of picking up AWS access key details through a .env file located in a '/config/aws_config.env' file (default), or an alternative .env file location specified by the environment variable AWS_CONFIG_PATH. The env file should look like the following, with just two lines:
+
+AWS_ACCESS_KEY=<your-access-key>
+AWS_SECRET_KEY=<your-secret-key>
+
+The app should then pick up these keys when trying to access the AWS Textract and Comprehend services during redaction.
+
+Again, a lot can potentially go wrong with AWS solutions that are insecure, so before trying the above please consult with your AWS and data security teams.
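A minimal sketch (not part of the commit) of how keys supplied either way can then be used with boto3 to call the two services the README mentions. It assumes boto3 is installed; the key values, region, and image file name below are placeholders:

```python
import boto3

# Placeholder credentials: in the app these come from the Redaction Settings
# textboxes or from a /config/aws_config.env file, never hard-coded like this.
AWS_ACCESS_KEY = "<your-access-key>"
AWS_SECRET_KEY = "<your-secret-key>"
AWS_REGION = "eu-west-2"

# Clients for the two services used during redaction
comprehend_client = boto3.client(
    "comprehend",
    aws_access_key_id=AWS_ACCESS_KEY,
    aws_secret_access_key=AWS_SECRET_KEY,
    region_name=AWS_REGION,
)
textract_client = boto3.client(
    "textract",
    aws_access_key_id=AWS_ACCESS_KEY,
    aws_secret_access_key=AWS_SECRET_KEY,
    region_name=AWS_REGION,
)

# Example calls that need the IAM permissions listed above
pii = comprehend_client.detect_pii_entities(
    Text="Contact me at jane.doe@example.com", LanguageCode="en"
)
print([entity["Type"] for entity in pii["Entities"]])

with open("example_page.png", "rb") as f:  # placeholder image file
    ocr = textract_client.detect_document_text(Document={"Bytes": f.read()})
print(len(ocr["Blocks"]), "text blocks detected")
```

If the calls fail with an access-denied error, the IAM user is most likely missing one of the three permissions named above.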
app.py CHANGED
@@ -178,12 +178,12 @@ with app:
     with gr.Tab("Redact PDFs/images"):
         with gr.Accordion("Redact document", open = True):
             in_doc_files = gr.File(label="Choose a document or image file (PDF, JPG, PNG)", file_count= "single", file_types=['.pdf', '.jpg', '.png', '.json'], height=file_input_height)
-            if RUN_AWS_FUNCTIONS == "1":
-
-
-            else:
-
-
+            # if RUN_AWS_FUNCTIONS == "1":
+            in_redaction_method = gr.Radio(label="Choose text extraction method. AWS Textract has a cost per page - $3.50 per 1,000 pages with signature detection (default), $1.50 without. Go to Redaction settings - AWS Textract options to remove signature detection.", value = default_ocr_val, choices=[text_ocr_option, tesseract_ocr_option, textract_option])
+            pii_identification_method_drop = gr.Radio(label = "Choose PII detection method. AWS Comprehend has a cost of approximately $0.01 per 10,000 characters.", value = default_pii_detector, choices=[local_pii_detector, aws_pii_detector])
+            # else:
+            #     in_redaction_method = gr.Radio(label="Choose text extraction method.", value = default_ocr_val, choices=[text_ocr_option, tesseract_ocr_option])
+            #     pii_identification_method_drop = gr.Radio(label = "Choose PII detection method.", value = default_pii_detector, choices=[local_pii_detector], visible=False)

             gr.Markdown("""If you only want to redact certain pages, or certain entities (e.g. just email addresses, or a custom list of terms), please go to the redaction settings tab.""")
             document_redact_btn = gr.Button("Redact document", variant="primary")
@@ -343,8 +343,8 @@ with app:
             in_redact_language = gr.Dropdown(value = "en", choices = ["en"], label="Redaction language (only English currently supported)", multiselect=False, visible=False)

             with gr.Row():
-                aws_access_key_textbox = gr.Textbox(value='', label="AWS access key for account with permissions for AWS Textract and Comprehend", visible=
-                aws_secret_key_textbox = gr.Textbox(value='', label="AWS secret key for account with permissions for AWS Textract and Comprehend", visible=
+                aws_access_key_textbox = gr.Textbox(value='', label="AWS access key for account with permissions for AWS Textract and Comprehend", visible=True, type="password")
+                aws_secret_key_textbox = gr.Textbox(value='', label="AWS secret key for account with permissions for AWS Textract and Comprehend", visible=True, type="password")

         with gr.Accordion("Settings for open text or xlsx/csv files", open = False):
             anon_strat = gr.Radio(choices=["replace with <REDACTED>", "replace with <ENTITY_NAME>", "redact", "hash", "mask", "encrypt", "fake_first_name"], label="Select an anonymisation method.", value = "replace with <REDACTED>")
@@ -356,8 +356,6 @@ with app:
             merge_multiple_review_files_btn = gr.Button("Merge multiple review files into one", variant="primary")


-
-
     ### UI INTERACTION ###

     ###
@@ -366,14 +364,13 @@ with app:
     in_doc_files.upload(fn=get_input_file_names, inputs=[in_doc_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list])

     document_redact_btn.click(fn = reset_state_vars, outputs=[pdf_doc_state, all_image_annotations_state, all_line_level_ocr_results_df_state, all_decision_process_table_state, comprehend_query_number, textract_metadata_textbox, annotator, output_file_list_state, log_files_output_list_state, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base]).\
-
-
-        outputs=[output_summary, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, estimated_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_state, all_decision_process_table_state, comprehend_query_number, output_review_files], api_name="redact_doc").\
+        then(fn = choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, in_redaction_method, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, output_summary, output_file_list_state, log_files_output_list_state, first_loop_state, page_min, page_max, estimated_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_state, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state],
+        outputs=[output_summary, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, estimated_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_state, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state], api_name="redact_doc").\
         then(fn=update_annotator, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base])

     # If the app has completed a batch of pages, it will run this until the end of all pages in the document
-    current_loop_page_number.change(fn = choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, in_redaction_method, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, output_summary, output_file_list_state, log_files_output_list_state, second_loop_state, page_min, page_max, estimated_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_state, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox],
-        outputs=[output_summary, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, estimated_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_state, all_decision_process_table_state, comprehend_query_number, output_review_files]).\
+    current_loop_page_number.change(fn = choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, in_redaction_method, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, output_summary, output_file_list_state, log_files_output_list_state, second_loop_state, page_min, page_max, estimated_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_state, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state],
+        outputs=[output_summary, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, estimated_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_state, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state]).\
         then(fn=update_annotator, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base])

     # If a file has been completed, the function will continue onto the next document
@@ -387,7 +384,7 @@ with app:
     # Upload previous files for modifying redactions
     upload_previous_review_file_btn.click(fn=reset_review_vars, inputs=None, outputs=[recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base]).\
         then(fn=get_input_file_names, inputs=[output_review_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list]).\
-        then(fn = prepare_image_or_pdf, inputs=[output_review_files, in_redaction_method,
+        then(fn = prepare_image_or_pdf, inputs=[output_review_files, in_redaction_method, latest_file_completed_text, output_summary, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool], outputs=[output_summary, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state], api_name="prepare_doc").\
         then(update_annotator, inputs=[all_image_annotations_state, annotate_current_page, recogniser_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number], outputs = [annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base])

     # Page controls at top
@@ -446,12 +443,12 @@ with app:

     # Convert review file to xfdf Adobe format
     convert_review_file_to_adobe_btn.click(fn=get_input_file_names, inputs=[output_review_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list]).\
-        then(fn = prepare_image_or_pdf, inputs=[output_review_files, in_redaction_method,
+        then(fn = prepare_image_or_pdf, inputs=[output_review_files, in_redaction_method, latest_file_completed_text, output_summary, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool], outputs=[output_summary, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state]).\
         then(convert_df_to_xfdf, inputs=[output_review_files, pdf_doc_state, images_pdf_state], outputs=[adobe_review_files_out])

     # Convert xfdf Adobe file back to review_file.csv
     convert_adobe_to_review_file_btn.click(fn=get_input_file_names, inputs=[adobe_review_files_out], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list]).\
-        then(fn = prepare_image_or_pdf, inputs=[adobe_review_files_out, in_redaction_method,
+        then(fn = prepare_image_or_pdf, inputs=[adobe_review_files_out, in_redaction_method, latest_file_completed_text, output_summary, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool], outputs=[output_summary, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state]).\
         then(fn=convert_xfdf_to_dataframe, inputs=[adobe_review_files_out, pdf_doc_state, images_pdf_state], outputs=[output_review_files], scroll_to_output=True)

     ###
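The event wiring above relies on Gradio's chaining: a `.click()` runs the reset function, each `.then()` runs the next step once the previous one finishes, and the `.change()` on the page counter keeps the redactor running batch by batch. A stripped-down illustration of the `.click().then()` part only (not code from the app; the component names `doc_name`, `summary`, `status` and the functions are invented for the example):

```python
import gradio as gr

def reset_outputs():
    # Clear previous results before a new run (stands in for reset_state_vars)
    return "", ""

def redact(doc_name):
    # Stand-in for the real redaction call (choose_and_run_redactor)
    return f"Redaction summary for {doc_name}", "Done"

with gr.Blocks() as demo:
    doc_name = gr.Textbox(label="Document name")
    summary = gr.Textbox(label="Summary")
    status = gr.Textbox(label="Status")
    run_btn = gr.Button("Redact document")

    # .click() starts the chain; .then() runs after the previous step finishes
    run_btn.click(fn=reset_outputs, outputs=[summary, status]).\
        then(fn=redact, inputs=[doc_name], outputs=[summary, status])

demo.launch()
```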
how_to_create_exe_dist.txt CHANGED
@@ -1,3 +1,5 @@
+Here are instructions for creating an .exe runnable version of the redaction app. Tested until Gradio version 5.17.0
+
 1. Create minimal environment to run the app in conda. E.g. 'conda create --name new_env'

 2. Activate the environment 'conda activate new_env'
@@ -14,7 +16,7 @@ NOTE: for ensuring that spaCy models are loaded into the program correctly in re

 9.Run the following (This helped me: https://github.com/pyinstaller/pyinstaller/issues/8108):

-a) In command line: pyi-makespec --additional-hooks-dir="build_deps" --add-data "tesseract/:tesseract/" --add-data "poppler/poppler-24.02.0/:poppler/poppler-24.02.0/" --collect-data=gradio_client --collect-data=gradio --hidden-import=gradio_image_annotation --collect-data=gradio_image_annotation --collect-all=gradio_image_annotation --hidden-import pyarrow.vendored.version --hidden-import pydicom.encoders --hidden-import=safehttpx --collect-all=safehttpx --hidden-import=presidio_analyzer --collect-all=presidio_analyzer --hidden-import=presidio_anonymizer --collect-all=presidio_anonymizer --hidden-import=presidio_image_redactor --collect-all=presidio_image_redactor --name DocRedactApp_0.
+a) In command line: pyi-makespec --additional-hooks-dir="build_deps" --add-data "tesseract/:tesseract/" --add-data "poppler/poppler-24.02.0/:poppler/poppler-24.02.0/" --collect-data=gradio_client --collect-data=gradio --hidden-import=gradio_image_annotation --collect-data=gradio_image_annotation --collect-all=gradio_image_annotation --hidden-import pyarrow.vendored.version --hidden-import pydicom.encoders --hidden-import=safehttpx --collect-all=safehttpx --hidden-import=presidio_analyzer --collect-all=presidio_analyzer --hidden-import=presidio_anonymizer --collect-all=presidio_anonymizer --hidden-import=presidio_image_redactor --collect-all=presidio_image_redactor --name DocRedactApp_0.3.0 app.py

 # Add --onefile to the above if you would like everything packaged as a single exe, although this will need to be extracted upon starting the app, slowing down initialisation time significantly.

@@ -30,7 +32,7 @@ a = Analysis(

 hook-presidio-image-redactor.py

-c) Back in command line, run this: pyinstaller --clean --noconfirm DocRedactApp_0.
+c) Back in command line, run this: pyinstaller --clean --noconfirm DocRedactApp_0.3.0.spec


 9. A 'dist' folder will be created with the executable inside along with all dependencies('dist\redaction').
tools/aws_functions.py CHANGED
@@ -4,18 +4,27 @@ import boto3
 import tempfile
 import os
 from tools.helper_functions import get_or_create_env_var
+from dotenv import load_dotenv

 PandasDataFrame = Type[pd.DataFrame]

 # Get AWS credentials
 bucket_name=""

-RUN_AWS_FUNCTIONS = get_or_create_env_var("RUN_AWS_FUNCTIONS", "
+RUN_AWS_FUNCTIONS = get_or_create_env_var("RUN_AWS_FUNCTIONS", "0")
 print(f'The value of RUN_AWS_FUNCTIONS is {RUN_AWS_FUNCTIONS}')

 AWS_REGION = get_or_create_env_var('AWS_REGION', 'eu-west-2')
 print(f'The value of AWS_REGION is {AWS_REGION}')

+# If you have an aws_config env file in the config folder, you can load in AWS keys this way
+AWS_CONFIG_PATH = get_or_create_env_var('AWS_CONFIG_PATH', '/env/aws_config.env')
+print(f'The value of AWS_CONFIG_PATH is {AWS_CONFIG_PATH}')
+
+if os.path.exists(AWS_CONFIG_PATH):
+    print("Loading AWS keys from config folder")
+    load_dotenv(AWS_CONFIG_PATH)
+
 AWS_ACCESS_KEY = get_or_create_env_var('AWS_ACCESS_KEY', '')
 if AWS_ACCESS_KEY:
     print(f'AWS_ACCESS_KEY found in environment variables')
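The change above loads keys from an optional env file before the existing environment-variable checks run. A standalone sketch of that pattern, assuming python-dotenv is installed and using plain os.environ lookups in place of the app's get_or_create_env_var helper; the file paths are the defaults mentioned in the diff and README, not a requirement:

```python
import os
from dotenv import load_dotenv

# Path to an optional env file holding AWS keys; AWS_CONFIG_PATH overrides the default.
# Note the diff's code default is '/env/aws_config.env' while the README describes
# '/config/aws_config.env' as the usual location.
aws_config_path = os.environ.get("AWS_CONFIG_PATH", "/env/aws_config.env")

if os.path.exists(aws_config_path):
    print("Loading AWS keys from config file:", aws_config_path)
    # Populates os.environ with AWS_ACCESS_KEY / AWS_SECRET_KEY from the file
    load_dotenv(aws_config_path)

AWS_ACCESS_KEY = os.environ.get("AWS_ACCESS_KEY", "")
AWS_SECRET_KEY = os.environ.get("AWS_SECRET_KEY", "")
if AWS_ACCESS_KEY and AWS_SECRET_KEY:
    print("AWS keys found in environment variables")
```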
tools/custom_image_analyser_engine.py CHANGED
@@ -515,6 +515,7 @@ def do_aws_comprehend_call(current_batch, current_batch_mapping, comprehend_clie

         except Exception as e:
             if attempt == max_retries - 1:
+                print("AWS Comprehend calls failed due to", e)
                 raise
             time.sleep(retry_delay)

@@ -571,7 +572,6 @@ def run_page_text_redaction(
             allow_list=allow_list
         )

-        #print("page_analyser_result:", page_analyser_result)

         all_text_line_results = map_back_entity_results(
             page_analyser_result,
@@ -579,10 +579,8 @@ def run_page_text_redaction(
             all_text_line_results
         )

-        #print("all_text_line_results:", all_text_line_results)

     elif pii_identification_method == "AWS Comprehend":
-        #print("page text:", page_text)

         # Process custom entities if any
         if custom_entities:
@@ -600,8 +598,6 @@ def run_page_text_redaction(
                 allow_list=allow_list
             )

-            print("page_analyser_result:", page_analyser_result)
-
             all_text_line_results = map_back_entity_results(
                 page_analyser_result,
                 page_text_mapping,
tools/file_conversion.py CHANGED
@@ -464,12 +464,10 @@ def redact_whole_pymupdf_page(rect_height, rect_width, image, page, custom_colou
 def prepare_image_or_pdf(
     file_paths: List[str],
     in_redact_method: str,
-    in_allow_list: Optional[List[List[str]]] = None,
     latest_file_completed: int = 0,
     out_message: List[str] = [],
     first_loop_state: bool = False,
     number_of_pages:int = 1,
-    current_loop_page_number:int=0,
     all_annotations_object:List = [],
     prepare_for_review:bool = False,
     in_fully_redacted_list:List[int]=[],
@@ -484,12 +482,10 @@ def prepare_image_or_pdf(
     Args:
         file_paths (List[str]): List of file paths to process.
         in_redact_method (str): The redaction method to use.
-        in_allow_list (optional, Optional[List[List[str]]]): List of allowed terms for redaction.
         latest_file_completed (optional, int): Index of the last completed file.
         out_message (optional, List[str]): List to store output messages.
         first_loop_state (optional, bool): Flag indicating if this is the first iteration.
         number_of_pages (optional, int): integer indicating the number of pages in the document
-        current_loop_page_number (optional, int): Current number of loop
         all_annotations_object(optional, List of annotation objects): All annotations for current document
         prepare_for_review(optional, bool): Is this preparation step preparing pdfs and json files to review current redactions?
         in_fully_redacted_list(optional, List of int): A list of pages to fully redact
tools/file_redaction.py CHANGED
@@ -29,7 +29,7 @@ from tools.custom_image_analyser_engine import CustomImageAnalyzerEngine, OCRRes
 from tools.file_conversion import process_file, image_dpi, convert_review_json_to_pandas_df, redact_whole_pymupdf_page, redact_single_box, convert_pymupdf_to_image_coords
 from tools.load_spacy_model_custom_recognisers import nlp_analyser, score_threshold, custom_entities, custom_recogniser, custom_word_list_recogniser, CustomWordFuzzyRecognizer
 from tools.helper_functions import get_file_name_without_type, output_folder, clean_unicode_text, get_or_create_env_var, tesseract_ocr_option, text_ocr_option, textract_option, local_pii_detector, aws_pii_detector
-from tools.file_conversion import process_file, is_pdf, is_pdf_or_image
+from tools.file_conversion import process_file, is_pdf, is_pdf_or_image, prepare_image_or_pdf
 from tools.aws_textract import analyse_page_with_textract, json_to_ocrresult
 from tools.presidio_analyzer_custom import recognizer_result_from_dict

@@ -99,6 +99,8 @@ def choose_and_run_redactor(file_paths:List[str],
                             match_fuzzy_whole_phrase_bool:bool=True,
                             aws_access_key_textbox:str='',
                             aws_secret_key_textbox:str='',
+                            annotate_max_pages:int=1,
+                            review_file_state=[],
                             output_folder:str=output_folder,
                             progress=gr.Progress(track_tqdm=True)):
     '''
@@ -136,6 +138,7 @@ def choose_and_run_redactor(file_paths:List[str],
     - match_fuzzy_whole_phrase_bool (bool, optional): A boolean where 'True' means that the whole phrase is fuzzy matched, and 'False' means that each word is fuzzy matched separately (excluding stop words).
     - aws_access_key_textbox (str, optional): AWS access key for account with Textract and Comprehend permissions.
     - aws_secret_key_textbox (str, optional): AWS secret key for account with Textract and Comprehend permissions.
+    - annotate_max_pages (int, optional): Maximum page value for the annotation object
     - output_folder (str, optional): Output folder for results.
     - progress (gr.Progress, optional): A progress tracker for the redaction process. Defaults to a Progress object with track_tqdm set to True.

@@ -145,6 +148,13 @@ def choose_and_run_redactor(file_paths:List[str],
     tic = time.perf_counter()
     all_request_metadata = all_request_metadata_str.split('\n') if all_request_metadata_str else []

+    # If there are no prepared PDF file paths, it is most likely that the prepare_image_or_pdf function has not been run. So do it here to get the outputs you need
+    if not pymupdf_doc:
+        print("Prepared PDF file not found, running prepare_image_or_pdf function")
+        out_message, prepared_pdf_file_paths, prepared_pdf_image_paths, annotate_max_pages, annotate_max_pages, pymupdf_doc, annotations_all_pages, review_file_state = prepare_image_or_pdf(file_paths, in_redact_method, latest_file_completed, out_message, first_loop_state, annotate_max_pages, annotations_all_pages)
+
+        annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, prepared_pdf_image_paths, review_file_state
+
     #print("prepared_pdf_file_paths:", prepared_pdf_file_paths[0])
     review_out_file_paths = [prepared_pdf_file_paths[0]]

@@ -212,7 +222,7 @@ def choose_and_run_redactor(file_paths:List[str],
         estimate_total_processing_time = sum_numbers_before_seconds(combined_out_message)
         print("Estimated total processing time:", str(estimate_total_processing_time))

-        return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths
+        return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, prepared_pdf_image_paths, review_file_state

     # If we have reached the last page, return message
     if current_loop_page >= number_of_pages:
@@ -228,7 +238,7 @@ def choose_and_run_redactor(file_paths:List[str],

         review_out_file_paths.extend(out_review_file_path)

-        return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = False, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths
+        return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = False, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, prepared_pdf_image_paths, review_file_state

     # Create allow list
     # If string, assume file path
@@ -241,45 +251,52 @@ def choose_and_run_redactor(file_paths:List[str],
     else:
         in_allow_list_flat = []

-
-    # Try to connect to AWS services only if RUN_AWS_FUNCTIONS environmental variable is 1
+    # Try to connect to AWS services directly only if RUN_AWS_FUNCTIONS environmental variable is 1, otherwise an environment variable or direct textbox input is needed.
     if pii_identification_method == "AWS Comprehend":
         print("Trying to connect to AWS Comprehend service")
-        if
-
-
+        if aws_access_key_textbox and aws_secret_key_textbox:
+            print("Connecting to Comprehend using AWS access key and secret keys from textboxes.")
+            print("aws_access_key_textbox:", aws_access_key_textbox)
+            print("aws_secret_access_key:", aws_secret_key_textbox)
             comprehend_client = boto3.client('comprehend',
                 aws_access_key_id=aws_access_key_textbox,
                 aws_secret_access_key=aws_secret_key_textbox)
+        elif RUN_AWS_FUNCTIONS == "1":
+            print("Connecting to Comprehend via existing SSO connection")
+            comprehend_client = boto3.client('comprehend')
         elif AWS_ACCESS_KEY and AWS_SECRET_KEY:
+            print("Getting Comprehend credentials from environment variables")
             comprehend_client = boto3.client('comprehend',
                 aws_access_key_id=AWS_ACCESS_KEY,
-                aws_secret_access_key=AWS_SECRET_KEY)
+                aws_secret_access_key=AWS_SECRET_KEY)
         else:
             comprehend_client = ""
-            out_message = "Cannot connect to AWS Comprehend service. Please choose another PII identification method."
+            out_message = "Cannot connect to AWS Comprehend service. Please provide access keys under Textract settings on the Redaction settings tab, or choose another PII identification method."
             print(out_message)
-            return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page, precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths
+            return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page, precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, prepared_pdf_image_paths, review_file_state
     else:
         comprehend_client = ""

     if in_redact_method == textract_option:
-        print("Trying to connect to AWS Textract service")
-        if
-
-
-            comprehend_client = boto3.client('textract',
+        print("Trying to connect to AWS Textract service")
+        if aws_access_key_textbox and aws_secret_key_textbox:
+            print("Connecting to Textract using AWS access key and secret keys from textboxes.")
+            textract_client = boto3.client('textract',
                 aws_access_key_id=aws_access_key_textbox,
                 aws_secret_access_key=aws_secret_key_textbox)
+        elif RUN_AWS_FUNCTIONS == "1":
+            print("Connecting to Textract via existing SSO connection")
+            textract_client = boto3.client('textract')
         elif AWS_ACCESS_KEY and AWS_SECRET_KEY:
-
+            print("Getting Textract credentials from environment variables.")
+            textract_client = boto3.client('textract',
                 aws_access_key_id=AWS_ACCESS_KEY,
-                aws_secret_access_key=AWS_SECRET_KEY)
+                aws_secret_access_key=AWS_SECRET_KEY)
         else:
             textract_client = ""
-            out_message = "Cannot connect to AWS Textract. Please choose another text extraction method."
+            out_message = "Cannot connect to AWS Textract. Please provide access keys under Textract settings on the Redaction settings tab, or choose another text extraction method."
             print(out_message)
-            return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page, precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths
+            return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page, precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, prepared_pdf_image_paths, review_file_state
     else:
         textract_client = ""

@@ -320,14 +337,14 @@ def choose_and_run_redactor(file_paths:List[str],
         out_message = "No file selected"
         print(out_message)

-        return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths
+        return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, prepared_pdf_image_paths, review_file_state, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, prepared_pdf_image_paths, review_file_state

     if in_redact_method == tesseract_ocr_option or in_redact_method == textract_option:

         #Analyse and redact image-based pdf or image
         if is_pdf_or_image(file_path) == False:
             out_message = "Please upload a PDF file or image file (JPG, PNG) for image analysis."
-            return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page, precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths
+            return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page, precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, prepared_pdf_image_paths, review_file_state

         print("Redacting file " + pdf_file_name_with_ext + " as an image-based file")

@@ -370,7 +387,7 @@ def choose_and_run_redactor(file_paths:List[str],

         if is_pdf(file_path) == False:
             out_message = "Please upload a PDF file for text analysis. If you have an image, select 'Image analysis'."
-            return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths
+            return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, prepared_pdf_image_paths, review_file_state

         # Analyse text-based pdf
         print('Redacting file as text-based PDF')
@@ -400,7 +417,7 @@ def choose_and_run_redactor(file_paths:List[str],
     else:
         out_message = "No redaction method selected"
         print(out_message)
-        return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths
+        return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, prepared_pdf_image_paths, review_file_state

     # If at last page, save to file
     if current_loop_page >= number_of_pages:
@@ -494,7 +511,7 @@ def choose_and_run_redactor(file_paths:List[str],
     out_file_paths = list(set(out_file_paths))
     review_out_file_paths = [prepared_pdf_file_paths[0], out_review_file_path]

-    return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page, precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths
+    return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page, precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, prepared_pdf_image_paths, review_file_state

 def convert_pikepdf_coords_to_pymupdf(pymupdf_page, pikepdf_bbox, type="pikepdf_annot"):
     '''