Commit 391712c by seanpedrickcase · 1 Parent(s): 35a1591

Allowed for Textract and Comprehend API calls through AWS keys. File preparation function incorporated into main redaction function to avoid needing user to 'check in' during redaction process

.dockerignore CHANGED
@@ -16,5 +16,5 @@ build/*
  dist/*
  build_deps/*
  logs/*
- doc_redaction_amplify_app/*
+ config/*
  user_guide/*
.gitignore CHANGED
@@ -16,5 +16,6 @@ build/*
  dist/*
  build_deps/*
  logs/*
+ config/*
  doc_redaction_amplify_app/*
  user_guide/*
Dockerfile CHANGED
@@ -56,6 +56,7 @@ RUN mkdir -p /home/user/app/output \
  && mkdir -p /home/user/app/input \
  && mkdir -p /home/user/app/tld \
  && mkdir -p /home/user/app/logs \
+ && mkdir -p /home/user/app/config \
  && chown -R user:user /home/user/app
 
  # Copy installed packages from builder stage
README.md CHANGED
@@ -34,7 +34,16 @@ NOTE: The app is not 100% accurate, and it will miss some personal information.
  - [Handwriting and signature redaction](#handwriting-and-signature-redaction)
  - [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)
 
- See the [advanced user guide here](#advanced-user-guide).
+ See the [advanced user guide here](#advanced-user-guide):
+ - [Modifying and merging redaction review files](#modifying-and-merging-redaction-review-files)
+ - [Modifying existing redaction review files](#modifying-existing-redaction-review-files)
+ - [Merging existing redaction review files](#merging-existing-redaction-review-files)
+ - [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
+ - [Fuzzy search and redaction](#fuzzy-search-and-redaction)
+ - [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
+ - [Exporting to Adobe Acrobat](#exporting-to-adobe-acrobat)
+ - [Importing from Adobe Acrobat](#importing-from-adobe-acrobat)
+ - [Using AWS Textract and Comprehend when not running in an AWS environment](#using-aws-textract-and-comprehend-when-not-running-in-an-aws-environment)
 
  ## Example data files
 
@@ -292,4 +301,25 @@ The app also allows you to import .xfdf files from Adobe Acrobat. To do this, go
 
  When you click the 'convert .xfdf comment file to review_file.csv' button, the app should take you up to the top of the screen where the new review file has been created and can be downloaded.
 
- ![Outputs from Adobe import](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/export_to_adobe/img/import_from_adobe_interface_outputs.PNG)
+ ![Outputs from Adobe import](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/export_to_adobe/img/import_from_adobe_interface_outputs.PNG)
+
+ ## Using AWS Textract and Comprehend when not running in an AWS environment
+
+ AWS Textract and Comprehend give much better results for text extraction and document redaction than the local model options in the app. The most secure way to access them in the Redaction app is to run the app in a secure AWS environment with relevant permissions. Alternatively, you could run the app on your own system while logged in to AWS SSO with relevant permissions.
+
+ However, it is possible to access these services directly via API from outside an AWS environment by creating IAM users and access keys with relevant permissions to access AWS Textract and Comprehend services. Please check with your IT and data security teams that this approach is acceptable for your data before trying the following approaches.
+
+ For either approach below, you will first need to create a new IAM user in your AWS environment with permissions for "textract:AnalyzeDocument", "textract:DetectDocumentText", and "comprehend:DetectPiiEntities". Under security credentials, create new access keys for that user and note down the access key and secret key.
+
+ ### Direct access by passing AWS access keys through the app
+ The Redaction Settings tab now has boxes for entering the AWS access key and secret key. If you paste the relevant keys into these boxes before performing redaction, you should be able to use these services in the app.
+
+ ### Picking up AWS access keys through an .env file
+ The app can also pick up AWS access key details from a .env file, located by default at '/config/aws_config.env', or at an alternative location specified by the environment variable AWS_CONFIG_PATH. The .env file should contain just the following two lines:
+
+ AWS_ACCESS_KEY=<your-access-key>
+ AWS_SECRET_KEY=<your-secret-key>
+
+ The app should then pick up these keys when trying to access the AWS Textract and Comprehend services during redaction.
+
+ Again, insecure AWS setups can cause serious problems, so please consult your AWS and data security teams before trying the above.
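
The permissions named in the README text above could be granted to the new IAM user with a policy along the following lines. This is a minimal sketch and not part of the commit: the three action names come from the README, while the `Sid`, the `"Resource": "*"` scope, and the helper name `build_redaction_policy` are assumptions made for illustration. Any real policy should be reviewed by your AWS and data security teams.

```python
import json

# The three actions named in the README section above.
REDACTION_ACTIONS = [
    "textract:AnalyzeDocument",
    "textract:DetectDocumentText",
    "comprehend:DetectPiiEntities",
]


def build_redaction_policy(actions=REDACTION_ACTIONS):
    """Return an IAM policy document (as a dict) that allows only the given actions.

    Hypothetical helper: the Sid and the wildcard Resource are assumptions.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowTextractAndComprehend",  # assumed label, any valid Sid works
                "Effect": "Allow",
                "Action": list(actions),
                "Resource": "*",  # assumption: not scoped to specific resources
            }
        ],
    }


if __name__ == "__main__":
    # Print JSON that could be pasted into the IAM policy editor.
    print(json.dumps(build_redaction_policy(), indent=2))
```

Keeping the action list this short means the access keys can call Textract and Comprehend but nothing else in the account.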
app.py CHANGED
@@ -178,12 +178,12 @@ with app:
  with gr.Tab("Redact PDFs/images"):
  with gr.Accordion("Redact document", open = True):
  in_doc_files = gr.File(label="Choose a document or image file (PDF, JPG, PNG)", file_count= "single", file_types=['.pdf', '.jpg', '.png', '.json'], height=file_input_height)
- if RUN_AWS_FUNCTIONS == "1":
- in_redaction_method = gr.Radio(label="Choose text extraction method. AWS Textract has a cost per page - $3.50 per 1,000 pages with signature detection (default), $1.50 without. Go to Redaction settings - AWS Textract options to remove signature detection.", value = default_ocr_val, choices=[text_ocr_option, tesseract_ocr_option, textract_option])
- pii_identification_method_drop = gr.Radio(label = "Choose PII detection method. AWS Comprehend has a cost of approximately $0.01 per 10,000 characters.", value = default_pii_detector, choices=[local_pii_detector, aws_pii_detector])
- else:
- in_redaction_method = gr.Radio(label="Choose text extraction method.", value = default_ocr_val, choices=[text_ocr_option, tesseract_ocr_option])
- pii_identification_method_drop = gr.Radio(label = "Choose PII detection method.", value = default_pii_detector, choices=[local_pii_detector], visible=False)
+ # if RUN_AWS_FUNCTIONS == "1":
+ in_redaction_method = gr.Radio(label="Choose text extraction method. AWS Textract has a cost per page - $3.50 per 1,000 pages with signature detection (default), $1.50 without. Go to Redaction settings - AWS Textract options to remove signature detection.", value = default_ocr_val, choices=[text_ocr_option, tesseract_ocr_option, textract_option])
+ pii_identification_method_drop = gr.Radio(label = "Choose PII detection method. AWS Comprehend has a cost of approximately $0.01 per 10,000 characters.", value = default_pii_detector, choices=[local_pii_detector, aws_pii_detector])
+ # else:
+ # in_redaction_method = gr.Radio(label="Choose text extraction method.", value = default_ocr_val, choices=[text_ocr_option, tesseract_ocr_option])
+ # pii_identification_method_drop = gr.Radio(label = "Choose PII detection method.", value = default_pii_detector, choices=[local_pii_detector], visible=False)
 
  gr.Markdown("""If you only want to redact certain pages, or certain entities (e.g. just email addresses, or a custom list of terms), please go to the redaction settings tab.""")
  document_redact_btn = gr.Button("Redact document", variant="primary")
@@ -343,8 +343,8 @@ with app:
  in_redact_language = gr.Dropdown(value = "en", choices = ["en"], label="Redaction language (only English currently supported)", multiselect=False, visible=False)
 
  with gr.Row():
- aws_access_key_textbox = gr.Textbox(value='', label="AWS access key for account with permissions for AWS Textract and Comprehend", visible=False)
- aws_secret_key_textbox = gr.Textbox(value='', label="AWS secret key for account with permissions for AWS Textract and Comprehend", visible=False)
+ aws_access_key_textbox = gr.Textbox(value='', label="AWS access key for account with permissions for AWS Textract and Comprehend", visible=True, type="password")
+ aws_secret_key_textbox = gr.Textbox(value='', label="AWS secret key for account with permissions for AWS Textract and Comprehend", visible=True, type="password")
 
  with gr.Accordion("Settings for open text or xlsx/csv files", open = False):
  anon_strat = gr.Radio(choices=["replace with <REDACTED>", "replace with <ENTITY_NAME>", "redact", "hash", "mask", "encrypt", "fake_first_name"], label="Select an anonymisation method.", value = "replace with <REDACTED>")
@@ -356,8 +356,6 @@ with app:
  merge_multiple_review_files_btn = gr.Button("Merge multiple review files into one", variant="primary")
 
 
-
-
  ### UI INTERACTION ###
 
  ###
@@ -366,14 +364,13 @@ with app:
  in_doc_files.upload(fn=get_input_file_names, inputs=[in_doc_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list])
 
  document_redact_btn.click(fn = reset_state_vars, outputs=[pdf_doc_state, all_image_annotations_state, all_line_level_ocr_results_df_state, all_decision_process_table_state, comprehend_query_number, textract_metadata_textbox, annotator, output_file_list_state, log_files_output_list_state, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base]).\
- then(fn = prepare_image_or_pdf, inputs=[in_doc_files, in_redaction_method, in_allow_list, latest_file_completed_text, output_summary, first_loop_state, annotate_max_pages, current_loop_page_number, all_image_annotations_state], outputs=[output_summary, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state], api_name="prepare_doc").\
- then(fn = choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, in_redaction_method, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, output_summary, output_file_list_state, log_files_output_list_state, first_loop_state, page_min, page_max, estimated_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_state, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox],
- outputs=[output_summary, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, estimated_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_state, all_decision_process_table_state, comprehend_query_number, output_review_files], api_name="redact_doc").\
+ then(fn = choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, in_redaction_method, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, output_summary, output_file_list_state, log_files_output_list_state, first_loop_state, page_min, page_max, estimated_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_state, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state],
+ outputs=[output_summary, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, estimated_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_state, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state], api_name="redact_doc").\
  then(fn=update_annotator, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base])
 
  # If the app has completed a batch of pages, it will run this until the end of all pages in the document
- current_loop_page_number.change(fn = choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, in_redaction_method, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, output_summary, output_file_list_state, log_files_output_list_state, second_loop_state, page_min, page_max, estimated_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_state, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox],
- outputs=[output_summary, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, estimated_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_state, all_decision_process_table_state, comprehend_query_number, output_review_files]).\
+ current_loop_page_number.change(fn = choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, in_redaction_method, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, output_summary, output_file_list_state, log_files_output_list_state, second_loop_state, page_min, page_max, estimated_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_state, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state],
+ outputs=[output_summary, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, estimated_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_state, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state]).\
  then(fn=update_annotator, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base])
 
  # If a file has been completed, the function will continue onto the next document
@@ -387,7 +384,7 @@ with app:
  # Upload previous files for modifying redactions
  upload_previous_review_file_btn.click(fn=reset_review_vars, inputs=None, outputs=[recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base]).\
  then(fn=get_input_file_names, inputs=[output_review_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list]).\
- then(fn = prepare_image_or_pdf, inputs=[output_review_files, in_redaction_method, in_allow_list, latest_file_completed_text, output_summary, second_loop_state, annotate_max_pages, current_loop_page_number, all_image_annotations_state, prepare_for_review_bool], outputs=[output_summary, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state]).\
+ then(fn = prepare_image_or_pdf, inputs=[output_review_files, in_redaction_method, latest_file_completed_text, output_summary, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool], outputs=[output_summary, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state], api_name="prepare_doc").\
  then(update_annotator, inputs=[all_image_annotations_state, annotate_current_page, recogniser_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number], outputs = [annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base])
 
  # Page controls at top
@@ -446,12 +443,12 @@ with app:
 
  # Convert review file to xfdf Adobe format
  convert_review_file_to_adobe_btn.click(fn=get_input_file_names, inputs=[output_review_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list]).\
- then(fn = prepare_image_or_pdf, inputs=[output_review_files, in_redaction_method, in_allow_list, latest_file_completed_text, output_summary, second_loop_state, annotate_max_pages, current_loop_page_number, all_image_annotations_state, prepare_for_review_bool], outputs=[output_summary, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state]).\
+ then(fn = prepare_image_or_pdf, inputs=[output_review_files, in_redaction_method, latest_file_completed_text, output_summary, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool], outputs=[output_summary, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state]).\
  then(convert_df_to_xfdf, inputs=[output_review_files, pdf_doc_state, images_pdf_state], outputs=[adobe_review_files_out])
 
  # Convert xfdf Adobe file back to review_file.csv
  convert_adobe_to_review_file_btn.click(fn=get_input_file_names, inputs=[adobe_review_files_out], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list]).\
- then(fn = prepare_image_or_pdf, inputs=[adobe_review_files_out, in_redaction_method, in_allow_list, latest_file_completed_text, output_summary, second_loop_state, annotate_max_pages, current_loop_page_number, all_image_annotations_state, prepare_for_review_bool], outputs=[output_summary, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state]).\
+ then(fn = prepare_image_or_pdf, inputs=[adobe_review_files_out, in_redaction_method, latest_file_completed_text, output_summary, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool], outputs=[output_summary, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state]).\
  then(fn=convert_xfdf_to_dataframe, inputs=[adobe_review_files_out, pdf_doc_state, images_pdf_state], outputs=[output_review_files], scroll_to_output=True)
 
  ###
how_to_create_exe_dist.txt CHANGED
@@ -1,3 +1,5 @@
+ Here are instructions for creating an .exe runnable version of the redaction app. Tested until Gradio version 5.17.0
+
  1. Create minimal environment to run the app in conda. E.g. 'conda create --name new_env'
 
  2. Activate the environment 'conda activate new_env'
@@ -14,7 +16,7 @@ NOTE: for ensuring that spaCy models are loaded into the program correctly in re
 
  9.Run the following (This helped me: https://github.com/pyinstaller/pyinstaller/issues/8108):
 
- a) In command line: pyi-makespec --additional-hooks-dir="build_deps" --add-data "tesseract/:tesseract/" --add-data "poppler/poppler-24.02.0/:poppler/poppler-24.02.0/" --collect-data=gradio_client --collect-data=gradio --hidden-import=gradio_image_annotation --collect-data=gradio_image_annotation --collect-all=gradio_image_annotation --hidden-import pyarrow.vendored.version --hidden-import pydicom.encoders --hidden-import=safehttpx --collect-all=safehttpx --hidden-import=presidio_analyzer --collect-all=presidio_analyzer --hidden-import=presidio_anonymizer --collect-all=presidio_anonymizer --hidden-import=presidio_image_redactor --collect-all=presidio_image_redactor --name DocRedactApp_0.2.0 app.py
+ a) In command line: pyi-makespec --additional-hooks-dir="build_deps" --add-data "tesseract/:tesseract/" --add-data "poppler/poppler-24.02.0/:poppler/poppler-24.02.0/" --collect-data=gradio_client --collect-data=gradio --hidden-import=gradio_image_annotation --collect-data=gradio_image_annotation --collect-all=gradio_image_annotation --hidden-import pyarrow.vendored.version --hidden-import pydicom.encoders --hidden-import=safehttpx --collect-all=safehttpx --hidden-import=presidio_analyzer --collect-all=presidio_analyzer --hidden-import=presidio_anonymizer --collect-all=presidio_anonymizer --hidden-import=presidio_image_redactor --collect-all=presidio_image_redactor --name DocRedactApp_0.3.0 app.py
 
  # Add --onefile to the above if you would like everything packaged as a single exe, although this will need to be extracted upon starting the app, slowing down initialisation time significantly.
 
@@ -30,7 +32,7 @@ a = Analysis(
 
  hook-presidio-image-redactor.py
 
- c) Back in command line, run this: pyinstaller --clean --noconfirm DocRedactApp_0.2.0.spec
+ c) Back in command line, run this: pyinstaller --clean --noconfirm DocRedactApp_0.3.0.spec
 
 
  9. A 'dist' folder will be created with the executable inside along with all dependencies('dist\redaction').
tools/aws_functions.py CHANGED
@@ -4,18 +4,27 @@ import boto3
  import tempfile
  import os
  from tools.helper_functions import get_or_create_env_var
+ from dotenv import load_dotenv
 
  PandasDataFrame = Type[pd.DataFrame]
 
  # Get AWS credentials
  bucket_name=""
 
- RUN_AWS_FUNCTIONS = get_or_create_env_var("RUN_AWS_FUNCTIONS", "1")
+ RUN_AWS_FUNCTIONS = get_or_create_env_var("RUN_AWS_FUNCTIONS", "0")
  print(f'The value of RUN_AWS_FUNCTIONS is {RUN_AWS_FUNCTIONS}')
 
  AWS_REGION = get_or_create_env_var('AWS_REGION', 'eu-west-2')
  print(f'The value of AWS_REGION is {AWS_REGION}')
 
+ # If you have an aws_config env file in the config folder, you can load in AWS keys this way
+ AWS_CONFIG_PATH = get_or_create_env_var('AWS_CONFIG_PATH', '/env/aws_config.env')
+ print(f'The value of AWS_CONFIG_PATH is {AWS_CONFIG_PATH}')
+
+ if os.path.exists(AWS_CONFIG_PATH):
+ print("Loading AWS keys from config folder")
+ load_dotenv(AWS_CONFIG_PATH)
+
  AWS_ACCESS_KEY = get_or_create_env_var('AWS_ACCESS_KEY', '')
  if AWS_ACCESS_KEY:
  print(f'AWS_ACCESS_KEY found in environment variables')
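
The `load_dotenv(AWS_CONFIG_PATH)` step added in this hunk reads `KEY=VALUE` lines from the config file and exposes them as environment variables before `AWS_ACCESS_KEY` is looked up. The sketch below imitates that behaviour using only the standard library. `load_env_file` is a hypothetical stand-in written for illustration, not the app's code and not the python-dotenv implementation; it ignores quoting and the other syntax the real library handles. Like `load_dotenv` with its default `override=False`, it leaves variables that are already set untouched.

```python
import os
import tempfile


def load_env_file(path, override=False):
    """Load KEY=VALUE pairs from `path` into os.environ.

    Hypothetical, simplified stand-in for python-dotenv's load_dotenv.
    Returns a dict of the variables that were actually set.
    """
    loaded = {}
    if not os.path.exists(path):
        # Mirror the os.path.exists() guard in the diff: a missing file is not an error.
        return loaded
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Skip blank lines, comments, and anything that is not KEY=VALUE.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            key, value = key.strip(), value.strip()
            # Existing environment variables win unless override is requested.
            if override or key not in os.environ:
                os.environ[key] = value
                loaded[key] = value
    return loaded


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp_dir:
        env_path = os.path.join(tmp_dir, "aws_config.env")
        with open(env_path, "w", encoding="utf-8") as handle:
            handle.write("AWS_ACCESS_KEY=example-access-key\n")
            handle.write("AWS_SECRET_KEY=example-secret-key\n")
        # Only the variable names are printed, never the secret values.
        print(sorted(load_env_file(env_path)))
```

Because existing variables take precedence, keys set directly in the process environment are not replaced by the file.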
tools/custom_image_analyser_engine.py CHANGED
@@ -515,6 +515,7 @@ def do_aws_comprehend_call(current_batch, current_batch_mapping, comprehend_clie
515
 
516
  except Exception as e:
517
  if attempt == max_retries - 1:
 
518
  raise
519
  time.sleep(retry_delay)
520
 
@@ -571,7 +572,6 @@ def run_page_text_redaction(
571
  allow_list=allow_list
572
  )
573
 
574
- #print("page_analyser_result:", page_analyser_result)
575
 
576
  all_text_line_results = map_back_entity_results(
577
  page_analyser_result,
@@ -579,10 +579,8 @@ def run_page_text_redaction(
579
  all_text_line_results
580
  )
581
 
582
- #print("all_text_line_results:", all_text_line_results)
583
 
584
  elif pii_identification_method == "AWS Comprehend":
585
- #print("page text:", page_text)
586
 
587
  # Process custom entities if any
588
  if custom_entities:
@@ -600,8 +598,6 @@ def run_page_text_redaction(
600
  allow_list=allow_list
601
  )
602
 
603
- print("page_analyser_result:", page_analyser_result)
604
-
605
  all_text_line_results = map_back_entity_results(
606
  page_analyser_result,
607
  page_text_mapping,
 
515
 
516
  except Exception as e:
517
  if attempt == max_retries - 1:
518
+ print("AWS Comprehend calls failed due to", e)
519
  raise
520
  time.sleep(retry_delay)
521
 
 
572
  allow_list=allow_list
573
  )
574
 
 
575
 
576
  all_text_line_results = map_back_entity_results(
577
  page_analyser_result,
 
579
  all_text_line_results
580
  )
581
 
 
582
 
583
  elif pii_identification_method == "AWS Comprehend":
 
584
 
585
  # Process custom entities if any
586
  if custom_entities:
 
598
  allow_list=allow_list
599
  )
600
 
 
 
601
  all_text_line_results = map_back_entity_results(
602
  page_analyser_result,
603
  page_text_mapping,
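Review note: the change to `do_aws_comprehend_call` logs the exception only on the final attempt and then re-raises it. A minimal sketch of that retry shape follows. `call_with_retries` and `flaky` are hypothetical names for illustration and are not part of the repository.

```python
import time


def call_with_retries(func, max_retries=3, retry_delay=0.0):
    """Call func, retrying on any exception; log and re-raise on the last attempt."""
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries - 1:
                # Only the final failure is reported, then propagated to the caller.
                print("Call failed after", max_retries, "attempts due to", e)
                raise
            time.sleep(retry_delay)


attempts = []


def flaky():
    """Fail on the first two calls, then succeed (illustrative only)."""
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("temporary failure")
    return "ok"


result = call_with_retries(flaky, max_retries=3, retry_delay=0.0)
```

Earlier failures are swallowed without any output, so intermittent errors that later succeed leave no trace in the logs under this pattern.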
tools/file_conversion.py CHANGED
@@ -464,12 +464,10 @@ def redact_whole_pymupdf_page(rect_height, rect_width, image, page, custom_colou
464
  def prepare_image_or_pdf(
465
  file_paths: List[str],
466
  in_redact_method: str,
467
- in_allow_list: Optional[List[List[str]]] = None,
468
  latest_file_completed: int = 0,
469
  out_message: List[str] = [],
470
  first_loop_state: bool = False,
471
  number_of_pages:int = 1,
472
- current_loop_page_number:int=0,
473
  all_annotations_object:List = [],
474
  prepare_for_review:bool = False,
475
  in_fully_redacted_list:List[int]=[],
@@ -484,12 +482,10 @@ def prepare_image_or_pdf(
484
  Args:
485
  file_paths (List[str]): List of file paths to process.
486
  in_redact_method (str): The redaction method to use.
487
- in_allow_list (optional, Optional[List[List[str]]]): List of allowed terms for redaction.
488
  latest_file_completed (optional, int): Index of the last completed file.
489
  out_message (optional, List[str]): List to store output messages.
490
  first_loop_state (optional, bool): Flag indicating if this is the first iteration.
491
  number_of_pages (optional, int): integer indicating the number of pages in the document
492
- current_loop_page_number (optional, int): Current number of loop
493
  all_annotations_object(optional, List of annotation objects): All annotations for current document
494
  prepare_for_review(optional, bool): Is this preparation step preparing pdfs and json files to review current redactions?
495
  in_fully_redacted_list(optional, List of int): A list of pages to fully redact
 
464
  def prepare_image_or_pdf(
465
  file_paths: List[str],
466
  in_redact_method: str,
 
467
  latest_file_completed: int = 0,
468
  out_message: List[str] = [],
469
  first_loop_state: bool = False,
470
  number_of_pages:int = 1,
 
471
  all_annotations_object:List = [],
472
  prepare_for_review:bool = False,
473
  in_fully_redacted_list:List[int]=[],
 
482
  Args:
483
  file_paths (List[str]): List of file paths to process.
484
  in_redact_method (str): The redaction method to use.
 
485
  latest_file_completed (optional, int): Index of the last completed file.
486
  out_message (optional, List[str]): List to store output messages.
487
  first_loop_state (optional, bool): Flag indicating if this is the first iteration.
488
  number_of_pages (optional, int): integer indicating the number of pages in the document
 
489
  all_annotations_object(optional, List of annotation objects): All annotations for current document
490
  prepare_for_review(optional, bool): Is this preparation step preparing pdfs and json files to review current redactions?
491
  in_fully_redacted_list(optional, List of int): A list of pages to fully redact
tools/file_redaction.py CHANGED
@@ -29,7 +29,7 @@ from tools.custom_image_analyser_engine import CustomImageAnalyzerEngine, OCRRes
29
  from tools.file_conversion import process_file, image_dpi, convert_review_json_to_pandas_df, redact_whole_pymupdf_page, redact_single_box, convert_pymupdf_to_image_coords
30
  from tools.load_spacy_model_custom_recognisers import nlp_analyser, score_threshold, custom_entities, custom_recogniser, custom_word_list_recogniser, CustomWordFuzzyRecognizer
31
  from tools.helper_functions import get_file_name_without_type, output_folder, clean_unicode_text, get_or_create_env_var, tesseract_ocr_option, text_ocr_option, textract_option, local_pii_detector, aws_pii_detector
32
- from tools.file_conversion import process_file, is_pdf, is_pdf_or_image
33
  from tools.aws_textract import analyse_page_with_textract, json_to_ocrresult
34
  from tools.presidio_analyzer_custom import recognizer_result_from_dict
35
 
@@ -99,6 +99,8 @@ def choose_and_run_redactor(file_paths:List[str],
99
  match_fuzzy_whole_phrase_bool:bool=True,
100
  aws_access_key_textbox:str='',
101
  aws_secret_key_textbox:str='',
 
 
102
  output_folder:str=output_folder,
103
  progress=gr.Progress(track_tqdm=True)):
104
  '''
@@ -136,6 +138,7 @@ def choose_and_run_redactor(file_paths:List[str],
136
  - match_fuzzy_whole_phrase_bool (bool, optional): A boolean where 'True' means that the whole phrase is fuzzy matched, and 'False' means that each word is fuzzy matched separately (excluding stop words).
137
  - aws_access_key_textbox (str, optional): AWS access key for account with Textract and Comprehend permissions.
138
  - aws_secret_key_textbox (str, optional): AWS secret key for account with Textract and Comprehend permissions.
 
139
  - output_folder (str, optional): Output folder for results.
140
  - progress (gr.Progress, optional): A progress tracker for the redaction process. Defaults to a Progress object with track_tqdm set to True.
141
 
@@ -145,6 +148,13 @@ def choose_and_run_redactor(file_paths:List[str],
145
  tic = time.perf_counter()
146
  all_request_metadata = all_request_metadata_str.split('\n') if all_request_metadata_str else []
147
 
 
 
 
 
 
 
 
148
  #print("prepared_pdf_file_paths:", prepared_pdf_file_paths[0])
149
  review_out_file_paths = [prepared_pdf_file_paths[0]]
150
 
@@ -212,7 +222,7 @@ def choose_and_run_redactor(file_paths:List[str],
212
  estimate_total_processing_time = sum_numbers_before_seconds(combined_out_message)
213
  print("Estimated total processing time:", str(estimate_total_processing_time))
214
 
215
- return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths
216
 
217
  # If we have reached the last page, return message
218
  if current_loop_page >= number_of_pages:
@@ -228,7 +238,7 @@ def choose_and_run_redactor(file_paths:List[str],
228
 
229
  review_out_file_paths.extend(out_review_file_path)
230
 
231
- return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = False, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths
232
 
233
  # Create allow list
234
  # If string, assume file path
@@ -241,45 +251,52 @@ def choose_and_run_redactor(file_paths:List[str],
241
  else:
242
  in_allow_list_flat = []
243
 
244
-
245
- # Try to connect to AWS services only if RUN_AWS_FUNCTIONS environmental variable is 1
246
  if pii_identification_method == "AWS Comprehend":
247
  print("Trying to connect to AWS Comprehend service")
248
- if RUN_AWS_FUNCTIONS == "1":
249
- comprehend_client = boto3.client('comprehend')
250
- elif aws_access_key_textbox and aws_secret_key_textbox:
 
251
  comprehend_client = boto3.client('comprehend',
252
  aws_access_key_id=aws_access_key_textbox,
253
  aws_secret_access_key=aws_secret_key_textbox)
 
 
 
254
  elif AWS_ACCESS_KEY and AWS_SECRET_KEY:
 
255
  comprehend_client = boto3.client('comprehend',
256
  aws_access_key_id=AWS_ACCESS_KEY,
257
- aws_secret_access_key=AWS_SECRET_KEY)
258
  else:
259
  comprehend_client = ""
260
- out_message = "Cannot connect to AWS Comprehend service. Please choose another PII identification method."
261
  print(out_message)
262
- return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page, precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths
263
  else:
264
  comprehend_client = ""
265
 
266
  if in_redact_method == textract_option:
267
- print("Trying to connect to AWS Textract service")
268
- if RUN_AWS_FUNCTIONS == "1":
269
- textract_client = boto3.client('textract')
270
- elif aws_access_key_textbox and aws_secret_key_textbox:
271
- comprehend_client = boto3.client('textract',
272
  aws_access_key_id=aws_access_key_textbox,
273
  aws_secret_access_key=aws_secret_key_textbox)
 
 
 
274
  elif AWS_ACCESS_KEY and AWS_SECRET_KEY:
275
- comprehend_client = boto3.client('textract',
 
276
  aws_access_key_id=AWS_ACCESS_KEY,
277
- aws_secret_access_key=AWS_SECRET_KEY)
278
  else:
279
  textract_client = ""
280
- out_message = "Cannot connect to AWS Textract. Please choose another text extraction method."
281
  print(out_message)
282
- return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page, precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths
283
  else:
284
  textract_client = ""
285
 
@@ -320,14 +337,14 @@ def choose_and_run_redactor(file_paths:List[str],
320
  out_message = "No file selected"
321
  print(out_message)
322
 
323
- return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths
324
 
325
  if in_redact_method == tesseract_ocr_option or in_redact_method == textract_option:
326
 
327
  #Analyse and redact image-based pdf or image
328
  if is_pdf_or_image(file_path) == False:
329
  out_message = "Please upload a PDF file or image file (JPG, PNG) for image analysis."
330
- return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page, precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths
331
 
332
  print("Redacting file " + pdf_file_name_with_ext + " as an image-based file")
333
 
@@ -370,7 +387,7 @@ def choose_and_run_redactor(file_paths:List[str],
370
 
371
  if is_pdf(file_path) == False:
372
  out_message = "Please upload a PDF file for text analysis. If you have an image, select 'Image analysis'."
373
- return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths
374
 
375
  # Analyse text-based pdf
376
  print('Redacting file as text-based PDF')
@@ -400,7 +417,7 @@ def choose_and_run_redactor(file_paths:List[str],
400
  else:
401
  out_message = "No redaction method selected"
402
  print(out_message)
403
- return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths
404
 
405
  # If at last page, save to file
406
  if current_loop_page >= number_of_pages:
@@ -494,7 +511,7 @@ def choose_and_run_redactor(file_paths:List[str],
494
  out_file_paths = list(set(out_file_paths))
495
  review_out_file_paths = [prepared_pdf_file_paths[0], out_review_file_path]
496
 
497
- return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page, precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths
498
 
499
  def convert_pikepdf_coords_to_pymupdf(pymupdf_page, pikepdf_bbox, type="pikepdf_annot"):
500
  '''
 
29
  from tools.file_conversion import process_file, image_dpi, convert_review_json_to_pandas_df, redact_whole_pymupdf_page, redact_single_box, convert_pymupdf_to_image_coords
30
  from tools.load_spacy_model_custom_recognisers import nlp_analyser, score_threshold, custom_entities, custom_recogniser, custom_word_list_recogniser, CustomWordFuzzyRecognizer
31
  from tools.helper_functions import get_file_name_without_type, output_folder, clean_unicode_text, get_or_create_env_var, tesseract_ocr_option, text_ocr_option, textract_option, local_pii_detector, aws_pii_detector
32
+ from tools.file_conversion import process_file, is_pdf, is_pdf_or_image, prepare_image_or_pdf
33
  from tools.aws_textract import analyse_page_with_textract, json_to_ocrresult
34
  from tools.presidio_analyzer_custom import recognizer_result_from_dict
35
 
 
99
  match_fuzzy_whole_phrase_bool:bool=True,
100
  aws_access_key_textbox:str='',
101
  aws_secret_key_textbox:str='',
102
+ annotate_max_pages:int=1,
103
+ review_file_state=[],
104
  output_folder:str=output_folder,
105
  progress=gr.Progress(track_tqdm=True)):
106
  '''
 
138
  - match_fuzzy_whole_phrase_bool (bool, optional): A boolean where 'True' means that the whole phrase is fuzzy matched, and 'False' means that each word is fuzzy matched separately (excluding stop words).
139
  - aws_access_key_textbox (str, optional): AWS access key for account with Textract and Comprehend permissions.
140
  - aws_secret_key_textbox (str, optional): AWS secret key for account with Textract and Comprehend permissions.
141
+ - annotate_max_pages (int, optional): Maximum page value for the annotation object
142
  - output_folder (str, optional): Output folder for results.
143
  - progress (gr.Progress, optional): A progress tracker for the redaction process. Defaults to a Progress object with track_tqdm set to True.
144
 
 
148
  tic = time.perf_counter()
149
  all_request_metadata = all_request_metadata_str.split('\n') if all_request_metadata_str else []
150
 
151
+ # If pymupdf_doc is empty, prepare_image_or_pdf has most likely not been run yet, so run it here to produce the outputs needed below
152
+ if not pymupdf_doc:
153
+ print("Prepared PDF file not found, running prepare_image_or_pdf function")
154
+ out_message, prepared_pdf_file_paths, prepared_pdf_image_paths, annotate_max_pages, annotate_max_pages, pymupdf_doc, annotations_all_pages, review_file_state = prepare_image_or_pdf(file_paths, in_redact_method, latest_file_completed, out_message, first_loop_state, annotate_max_pages, annotations_all_pages)
155
+
156
+ annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, prepared_pdf_image_paths, review_file_state
157
+
158
  #print("prepared_pdf_file_paths:", prepared_pdf_file_paths[0])
159
  review_out_file_paths = [prepared_pdf_file_paths[0]]
160
 
 
222
  estimate_total_processing_time = sum_numbers_before_seconds(combined_out_message)
223
  print("Estimated total processing time:", str(estimate_total_processing_time))
224
 
225
+ return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, prepared_pdf_image_paths, review_file_state
226
 
227
  # If we have reached the last page, return message
228
  if current_loop_page >= number_of_pages:
 
238
 
239
  review_out_file_paths.extend(out_review_file_path)
240
 
241
+ return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = False, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, prepared_pdf_image_paths, review_file_state
242
 
243
  # Create allow list
244
  # If string, assume file path
 
251
  else:
252
  in_allow_list_flat = []
253
 
254
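+ # AWS credential precedence: keys entered in the textboxes are used first; otherwise, if the RUN_AWS_FUNCTIONS environment variable is "1", connect with the default credentials (e.g. an existing SSO session); otherwise fall back to the AWS_ACCESS_KEY and AWS_SECRET_KEY environment variables.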
 
255
  if pii_identification_method == "AWS Comprehend":
256
  print("Trying to connect to AWS Comprehend service")
257
+ if aws_access_key_textbox and aws_secret_key_textbox:
258
+ print("Connecting to Comprehend using AWS access key and secret keys from textboxes.")
259
+ # Credential values are deliberately not printed, so they cannot leak into logs.
260
+ # The message above already records that keys from the textboxes are in use.
261
  comprehend_client = boto3.client('comprehend',
262
  aws_access_key_id=aws_access_key_textbox,
263
  aws_secret_access_key=aws_secret_key_textbox)
264
+ elif RUN_AWS_FUNCTIONS == "1":
265
+ print("Connecting to Comprehend via existing SSO connection")
266
+ comprehend_client = boto3.client('comprehend')
267
  elif AWS_ACCESS_KEY and AWS_SECRET_KEY:
268
+ print("Getting Comprehend credentials from environment variables")
269
  comprehend_client = boto3.client('comprehend',
270
  aws_access_key_id=AWS_ACCESS_KEY,
271
+ aws_secret_access_key=AWS_SECRET_KEY)
272
  else:
273
  comprehend_client = ""
274
+ out_message = "Cannot connect to AWS Comprehend service. Please provide access keys under Textract settings on the Redaction settings tab, or choose another PII identification method."
275
  print(out_message)
276
+ return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page, precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, prepared_pdf_image_paths, review_file_state
277
  else:
278
  comprehend_client = ""
279
 
280
  if in_redact_method == textract_option:
281
+ print("Trying to connect to AWS Textract service")
282
+ if aws_access_key_textbox and aws_secret_key_textbox:
283
+ print("Connecting to Textract using AWS access key and secret keys from textboxes.")
284
+ textract_client = boto3.client('textract',
 
285
  aws_access_key_id=aws_access_key_textbox,
286
  aws_secret_access_key=aws_secret_key_textbox)
287
+ elif RUN_AWS_FUNCTIONS == "1":
288
+ print("Connecting to Textract via existing SSO connection")
289
+ textract_client = boto3.client('textract')
290
  elif AWS_ACCESS_KEY and AWS_SECRET_KEY:
291
+ print("Getting Textract credentials from environment variables.")
292
+ textract_client = boto3.client('textract',
293
  aws_access_key_id=AWS_ACCESS_KEY,
294
+ aws_secret_access_key=AWS_SECRET_KEY)
295
  else:
296
  textract_client = ""
297
+ out_message = "Cannot connect to AWS Textract. Please provide access keys under Textract settings on the Redaction settings tab, or choose another text extraction method."
298
  print(out_message)
299
+ return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page, precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, prepared_pdf_image_paths, review_file_state
300
  else:
301
  textract_client = ""
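Review note: after this change, both the Comprehend and Textract branches pick credentials in the same order: textbox keys, then the default credential chain when `RUN_AWS_FUNCTIONS` is "1", then keys from environment variables. The sketch below isolates that ordering as a pure function so it can be checked without calling AWS. `choose_credential_source` and its return labels are hypothetical and do not exist in the repository.

```python
from typing import Optional


def choose_credential_source(
    textbox_key: str,
    textbox_secret: str,
    run_aws_functions: str,
    env_key: str,
    env_secret: str,
) -> Optional[str]:
    """Return a label for the credential source, following the order in the diff."""
    if textbox_key and textbox_secret:
        # Keys typed into the UI take priority over everything else.
        return "textbox"
    if run_aws_functions == "1":
        # No explicit keys: rely on the default credentials (e.g. an SSO session).
        return "default_chain"
    if env_key and env_secret:
        # Keys read from environment variables or a loaded env file.
        return "environment"
    # Nothing usable: the caller should report an error to the user.
    return None


source_textbox = choose_credential_source("id", "secret", "1", "env-id", "env-secret")
source_default = choose_credential_source("", "", "1", "env-id", "env-secret")
source_env = choose_credential_source("", "", "0", "env-id", "env-secret")
source_none = choose_credential_source("id", "", "0", "", "env-secret")
```

Factoring the decision out like this would also let the two near-identical `if`/`elif` chains share one code path, though that refactor is a suggestion rather than something this commit does.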
302
 
 
337
  out_message = "No file selected"
338
  print(out_message)
339
 
340
+ return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, prepared_pdf_image_paths, review_file_state
341
 
342
  if in_redact_method == tesseract_ocr_option or in_redact_method == textract_option:
343
 
344
  #Analyse and redact image-based pdf or image
345
  if is_pdf_or_image(file_path) == False:
346
  out_message = "Please upload a PDF file or image file (JPG, PNG) for image analysis."
347
+ return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page, precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, prepared_pdf_image_paths, review_file_state
348
 
349
  print("Redacting file " + pdf_file_name_with_ext + " as an image-based file")
350
 
 
387
 
388
  if is_pdf(file_path) == False:
389
  out_message = "Please upload a PDF file for text analysis. If you have an image, select 'Image analysis'."
390
+ return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, prepared_pdf_image_paths, review_file_state
391
 
392
  # Analyse text-based pdf
393
  print('Redacting file as text-based PDF')
 
417
  else:
418
  out_message = "No redaction method selected"
419
  print(out_message)
420
+ return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, prepared_pdf_image_paths, review_file_state
421
 
422
  # If at last page, save to file
423
  if current_loop_page >= number_of_pages:
 
511
  out_file_paths = list(set(out_file_paths))
512
  review_out_file_paths = [prepared_pdf_file_paths[0], out_review_file_path]
513
 
514
+ return out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page, precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, prepared_pdf_image_paths, review_file_state
515
 
516
  def convert_pikepdf_coords_to_pymupdf(pymupdf_page, pikepdf_bbox, type="pikepdf_annot"):
517
  '''