seanpedrickcase committed on
Commit f93e49c · 1 Parent(s): 0042e78

Local OCR outputs can now be saved to file and reloaded to save preparation time. Bug fixes in logs and tabular data redaction. Documentation updates.

README.md CHANGED
@@ -39,6 +39,7 @@ You can now [speak with a chat bot about this user guide](https://huggingface.co
39
  - [Redacting only specific pages](#redacting-only-specific-pages)
40
  - [Handwriting and signature redaction](#handwriting-and-signature-redaction)
41
  - [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)
 
42
 
43
  See the [advanced user guide here](#advanced-user-guide):
44
  - [Merging redaction review files](#merging-redaction-review-files)
@@ -119,12 +120,14 @@ Click 'Redact document'. After loading in the document, the app should be able t
119
 - **'...ocr_results.csv'** files contain the line-by-line text output from the entire document. This file is useful for later searching the document for any terms of interest (e.g. using Excel or a similar program).
120
  - **'...review_file.csv'** files are the review files that contain details and locations of all of the suggested redactions in the document. This file is key to the [review process](#reviewing-and-modifying-suggested-redactions), and should be downloaded to use later for this.
121
 
122
- ### Additional AWS Textract outputs
123
 
124
  If you have used the AWS Textract option for extracting text, you may also see a '..._textract.json' file. This file contains all the relevant extracted text information that comes from the AWS Textract service. You can keep this file and upload it at a later date alongside your input document, which will enable you to skip calling AWS Textract every single time you want to do a redaction task, as follows:
125
 
126
  ![Document upload alongside Textract](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/document_upload_with_textract.PNG)
127
 
 
 
128
  ### Downloading output files from previous redaction tasks
129
 
130
 If you are logged in via AWS Cognito and you lose your app page for some reason (e.g. a crash or a page reload), it is possible to recover your previous output files, provided the server has not been shut down since you redacted the document. Go to 'Redaction settings', then scroll to the bottom to see 'View all output files from this session'.
@@ -307,7 +310,7 @@ To filter the 'Search suggested redactions' table you can:
307
  Once you have filtered the table, you have a few options underneath on what you can do with the filtered rows:
308
 
309
 - Click the 'Exclude specific row from redactions' button to remove from the document only the redaction in the row you last clicked on.
310
- - Click the 'Exclude all items in table from redactions' button to remove all redactions visible in the table from the document.
311
 
312
 **NOTE**: After excluding redactions using either of the above options, click the 'Reset filters' button below to ensure that the dropdowns and table return to showing all remaining redactions in the document.
313
 
@@ -325,6 +328,40 @@ You can search through the extracted text by using the search bar just above the
325
 
326
  ![Search suggested redaction area](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
327
 
328
  # ADVANCED USER GUIDE
329
 
330
  This advanced user guide will go over some of the features recently added to the app, including: modifying and merging redaction review files, identifying and redacting duplicate pages across multiple PDFs, 'fuzzy' search and redact, and exporting redactions to Adobe Acrobat.
@@ -469,13 +506,12 @@ The app should then pick up these keys when trying to access the AWS Textract an
469
 
470
  Again, a lot can potentially go wrong with AWS solutions that are insecure, so before trying the above please consult with your AWS and data security teams.
471
 
472
- ## Modifying and merging redaction review files
473
 
474
  You can find the folder containing the files discussed in this section [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/).
475
 
476
 As well as serving as inputs to the document redaction app's review function, 'review_file.csv' outputs can be modified and merged with review files from other redaction attempts on the same document, giving you the flexibility to change redaction details outside of the app.
477
 
478
- ### Modifying existing redaction review files
479
 If you open up a 'review_file' csv output in spreadsheet software such as Microsoft Excel, you can easily modify redaction properties. Open the file '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local.csv)' and you should see a spreadsheet with just four suggested redactions (see below). The following instructions are for using Excel.
480
 
481
  ![Review file before](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/review_file_before.PNG)
 
39
  - [Redacting only specific pages](#redacting-only-specific-pages)
40
  - [Handwriting and signature redaction](#handwriting-and-signature-redaction)
41
  - [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)
42
+ - [Redacting tabular data files (XLSX/CSV) or copy and pasted text](#redacting-tabular-data-files-xlsxcsv-or-copy-and-pasted-text)
43
 
44
  See the [advanced user guide here](#advanced-user-guide):
45
  - [Merging redaction review files](#merging-redaction-review-files)
 
120
 - **'...ocr_results.csv'** files contain the line-by-line text output from the entire document. This file is useful for later searching the document for any terms of interest (e.g. using Excel, a similar program, or a short pandas script like the sketch below this list).
121
  - **'...review_file.csv'** files are the review files that contain details and locations of all of the suggested redactions in the document. This file is key to the [review process](#reviewing-and-modifying-suggested-redactions), and should be downloaded to use later for this.
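If you prefer to search the OCR text programmatically rather than in a spreadsheet, a minimal pandas sketch along these lines can work. The file name and the 'text' column name are assumptions here; check the header row of your own '...ocr_results.csv' first and adjust to match.

```python
import pandas as pd

# Load the line-by-line OCR output (file name is illustrative)
ocr_df = pd.read_csv("example_doc_ocr_results.csv")

# Inspect the actual column names before filtering
print(ocr_df.columns.tolist())

# Case-insensitive search for a term of interest, assuming a 'text' column
matches = ocr_df[ocr_df["text"].str.contains("invoice", case=False, na=False)]
print(matches)
```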
122
 
123
+ ### Additional AWS Textract / local OCR outputs
124
 
125
  If you have used the AWS Textract option for extracting text, you may also see a '..._textract.json' file. This file contains all the relevant extracted text information that comes from the AWS Textract service. You can keep this file and upload it at a later date alongside your input document, which will enable you to skip calling AWS Textract every single time you want to do a redaction task, as follows:
126
 
127
  ![Document upload alongside Textract](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/document_upload_with_textract.PNG)
128
 
129
+ Similarly, if you have used the 'Local OCR method' to extract text, you may see a '..._ocr_results_with_words.json' file. This file works in the same way as the AWS Textract .json output described above: upload it alongside the input document to skip local text extraction in future redaction tasks.
130
+
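If you want to check what a saved '..._textract.json' file covers before re-uploading it, a rough sketch like the one below can help. It assumes the file follows the standard Textract response layout with a top-level 'Blocks' list; the app's saved format may be wrapped differently, so treat this purely as illustrative.

```python
import json

# File name is illustrative; use your own '..._textract.json' output
with open("example_doc_textract.json", "r") as f:
    textract_data = json.load(f)

# Standard Textract responses keep results in a 'Blocks' list,
# where each block records the page it came from.
blocks = textract_data.get("Blocks", [])
pages_covered = sorted({block.get("Page", 1) for block in blocks})
print(f"{len(blocks)} blocks covering pages: {pages_covered}")
```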
131
  ### Downloading output files from previous redaction tasks
132
 
133
 If you are logged in via AWS Cognito and you lose your app page for some reason (e.g. a crash or a page reload), it is possible to recover your previous output files, provided the server has not been shut down since you redacted the document. Go to 'Redaction settings', then scroll to the bottom to see 'View all output files from this session'.
 
310
  Once you have filtered the table, you have a few options underneath on what you can do with the filtered rows:
311
 
312
 - Click the 'Exclude specific row from redactions' button to remove from the document only the redaction in the row you last clicked on.
313
+ - Click the 'Exclude all items in table from redactions' button to remove from the document all redactions currently visible in the table. **Important:** make sure you have clicked the blue tick icon next to the search box to apply your filter before doing this, otherwise you will remove every redaction in the document. If that happens, click the 'Undo last element removal' button below to restore them.
314
 
315
 **NOTE**: After excluding redactions using either of the above options, click the 'Reset filters' button below to ensure that the dropdowns and table return to showing all remaining redactions in the document.
316
 
 
328
 
329
  ![Search suggested redaction area](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
330
 
331
+ ## Redacting tabular data files (XLSX/CSV) or copy and pasted text
332
+
333
+ ### Tabular data files (XLSX/CSV)
334
+
335
+ The app can redact tabular data files such as .xlsx or .csv files. For this to work properly, your data needs to be in a simple table format, with a single table starting in the first cell (A1) and no other information in the sheet. For .xlsx files, each sheet that you want to redact should follow this same simple format.
336
+
337
+ To demonstrate this, we can use [the example csv file 'combined_case_notes.csv'](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/combined_case_notes.csv), which is a small dataset of dummy social care case notes. Go to the 'Open text or Excel/csv files' tab. Drop the file into the upload area. After the file is loaded, you should see the suggested columns for redaction in the box underneath. You can select and deselect columns to redact as you wish from this list.
338
+
339
+ ![csv upload](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/tabular_files/file_upload_csv_columns.PNG)
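To see in advance which columns the app will offer for a csv like this, you can preview it with pandas. This is an optional check; the file path is simply wherever you saved the example file.

```python
import pandas as pd

# Path is illustrative - point it at wherever you saved the example file
notes = pd.read_csv("combined_case_notes.csv")

print(notes.columns.tolist())  # the columns the app will suggest for redaction
print(notes.head())            # a quick look at the dummy case notes
```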
340
+
341
+ If you instead upload an xlsx file, you will also see a list of all the sheets in the file that can be redacted. The 'Select columns' area underneath will suggest a list of all columns in the file across all sheets.
342
+
343
+ ![xlsx upload](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/tabular_files/file_upload_xlsx_columns.PNG)
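If you are unsure whether a workbook matches the expected 'single table starting at A1' layout, a quick pandas check will list every sheet and its columns before you upload. The file name is illustrative, and reading .xlsx files requires the openpyxl package.

```python
import pandas as pd

# Read every sheet into a dict of {sheet name: DataFrame}
sheets = pd.read_excel("case_notes.xlsx", sheet_name=None)

for sheet_name, df in sheets.items():
    # Each sheet should have a single header row followed by data
    print(f"Sheet '{sheet_name}': {len(df)} rows, columns = {list(df.columns)}")
```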
344
+
345
+ Once you have chosen your input file and sheets/columns to redact, you can choose the redaction method. 'Local' will use the same local model as used for documents on the first tab. 'AWS Comprehend' will give better results, at a slight cost.
346
+
347
+ When you click 'Redact text/data files', you will see the progress of the redaction task by file and sheet, and you will receive a csv output containing the redacted data.
348
+
349
+ ### Choosing output anonymisation format
350
+ You can also choose the anonymisation format of your output results. Open the 'Anonymisation output format' section to see the options. By default, any detected PII will be replaced with the word 'REDACTED' in the cell. You can choose one of the following options as the replacement for redacted text:
351
+ - replace with 'REDACTED': Replaced by the word 'REDACTED' (default)
352
+ - replace with <ENTITY_NAME>: Replaced by the entity type, e.g. 'PERSON' for people, 'EMAIL_ADDRESS' for email addresses, etc.
353
+ - redact completely: Text is removed completely and replaced by nothing.
354
+ - hash: Replaced by a unique long ID code that is consistent for a given piece of entity text, i.e. a particular name will always map to the same ID code (see the sketch below this list).
355
+ - mask: Replaced with stars ('*').
356
+
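The key property of the 'hash' option is consistency: the same entity text always produces the same ID code, so links between rows are preserved. A minimal illustration of that idea (not necessarily how the app implements it) is:

```python
import hashlib

def pseudonymise(entity_text: str) -> str:
    # Identical inputs always map to the same code, so 'Jane Smith'
    # gets one stable ID wherever it appears.
    return hashlib.sha256(entity_text.encode("utf-8")).hexdigest()[:16]

print(pseudonymise("Jane Smith"))
print(pseudonymise("Jane Smith") == pseudonymise("Jane Smith"))  # True
```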
357
+ ### Redacting copy and pasted text
358
+ You can also paste open text into an input box and redact it using the same methods described above. To do this, write or paste text into the 'Enter open text' box in the 'Redact open text' section, then select a redaction method and an anonymisation output format as described above. The redacted text will be shown in the output textbox and also saved to a simple csv file in the output file box.
359
+
360
+ ![Text analysis output](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/tabular_files/text_anonymisation_outputs.PNG)
361
+
362
+ ### Redaction log outputs
363
+ A log of the suggested redactions from the tabular data / open text redaction is available on the 'Redaction settings' page under 'Log file outputs'.
364
+
365
  # ADVANCED USER GUIDE
366
 
367
  This advanced user guide will go over some of the features recently added to the app, including: modifying and merging redaction review files, identifying and redacting duplicate pages across multiple PDFs, 'fuzzy' search and redact, and exporting redactions to Adobe Acrobat.
 
506
 
507
  Again, a lot can potentially go wrong with AWS solutions that are insecure, so before trying the above please consult with your AWS and data security teams.
508
 
509
+ ## Modifying existing redaction review files
510
 
511
  You can find the folder containing the files discussed in this section [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/).
512
 
513
 As well as serving as inputs to the document redaction app's review function, 'review_file.csv' outputs can be modified and merged with review files from other redaction attempts on the same document, giving you the flexibility to change redaction details outside of the app.
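For example, a minimal merge outside the app might look like the following, assuming the files share the same header row. The folder path and glob pattern are illustrative, and the app's own 'Combine multiple review files' option remains the supported route.

```python
import pandas as pd
from pathlib import Path

# Paths are illustrative - point this at your own review_file.csv outputs
review_paths = sorted(Path("outputs").glob("*review_file*.csv"))

# Stack the files and drop exact duplicate rows
merged = pd.concat([pd.read_csv(p) for p in review_paths], ignore_index=True).drop_duplicates()

merged.to_csv("outputs/merged_review_file.csv", index=False)
print(f"Merged {len(review_paths)} files into {len(merged)} suggested redactions")
```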
514
 
 
515
 If you open up a 'review_file' csv output in spreadsheet software such as Microsoft Excel, you can easily modify redaction properties. Open the file '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local.csv)' and you should see a spreadsheet with just four suggested redactions (see below). The following instructions are for using Excel.
516
 
517
  ![Review file before](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/review_file_before.PNG)
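If you prefer to script these changes rather than edit them in Excel, a short pandas sketch is also possible. The column names used here ('page', 'label') are assumptions for illustration only; check the header row of your own review file and adjust to match.

```python
import pandas as pd

review_df = pd.read_csv("Partnership-Agreement-Toolkit_0_0.pdf_review_file_local.csv")
print(review_df.columns.tolist())  # confirm the real column names first

# Example edits, assuming 'page' and 'label' columns exist:
review_df = review_df[review_df["page"] != 1]                    # drop suggestions on page 1
review_df.loc[review_df["label"] == "PERSON", "label"] = "NAME"  # rename a redaction label

review_df.to_csv("modified_review_file.csv", index=False)
```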
app.py CHANGED
@@ -5,7 +5,7 @@ import gradio as gr
5
  from gradio_image_annotation import image_annotator
6
 
7
  from tools.config import OUTPUT_FOLDER, INPUT_FOLDER, RUN_DIRECT_MODE, MAX_QUEUE_SIZE, DEFAULT_CONCURRENCY_LIMIT, MAX_FILE_SIZE, GRADIO_SERVER_PORT, ROOT_PATH, GET_DEFAULT_ALLOW_LIST, ALLOW_LIST_PATH, S3_ALLOW_LIST_PATH, FEEDBACK_LOGS_FOLDER, ACCESS_LOGS_FOLDER, USAGE_LOGS_FOLDER, TESSERACT_FOLDER, POPPLER_FOLDER, REDACTION_LANGUAGE, GET_COST_CODES, COST_CODES_PATH, S3_COST_CODES_PATH, ENFORCE_COST_CODES, DISPLAY_FILE_NAMES_IN_LOGS, SHOW_COSTS, RUN_AWS_FUNCTIONS, DOCUMENT_REDACTION_BUCKET, SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS, TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET, TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER, TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_OUTPUT_SUBFOLDER, SESSION_OUTPUT_FOLDER, LOAD_PREVIOUS_TEXTRACT_JOBS_S3, TEXTRACT_JOBS_S3_LOC, TEXTRACT_JOBS_LOCAL_LOC, HOST_NAME, DEFAULT_COST_CODE, OUTPUT_COST_CODES_PATH, OUTPUT_ALLOW_LIST_PATH, COGNITO_AUTH, SAVE_LOGS_TO_CSV, SAVE_LOGS_TO_DYNAMODB, ACCESS_LOG_DYNAMODB_TABLE_NAME, DYNAMODB_ACCESS_LOG_HEADERS, CSV_ACCESS_LOG_HEADERS, FEEDBACK_LOG_DYNAMODB_TABLE_NAME, DYNAMODB_FEEDBACK_LOG_HEADERS, CSV_FEEDBACK_LOG_HEADERS, USAGE_LOG_DYNAMODB_TABLE_NAME, DYNAMODB_USAGE_LOG_HEADERS, CSV_USAGE_LOG_HEADERS
8
- from tools.helper_functions import put_columns_in_df, get_connection_params, reveal_feedback_buttons, custom_regex_load, reset_state_vars, load_in_default_allow_list, tesseract_ocr_option, text_ocr_option, textract_option, local_pii_detector, aws_pii_detector, no_redaction_option, reset_review_vars, merge_csv_files, load_all_output_files, update_dataframe, check_for_existing_textract_file, load_in_default_cost_codes, enforce_cost_codes, calculate_aws_costs, calculate_time_taken, reset_base_dataframe, reset_ocr_base_dataframe, update_cost_code_dataframe_from_dropdown_select
9
  from tools.aws_functions import upload_file_to_s3, download_file_from_s3, upload_log_file_to_s3
10
  from tools.file_redaction import choose_and_run_redactor
11
  from tools.file_conversion import prepare_image_or_pdf, get_input_file_names
@@ -47,9 +47,6 @@ else:
47
  SAVE_LOGS_TO_CSV = eval(SAVE_LOGS_TO_CSV)
48
  SAVE_LOGS_TO_DYNAMODB = eval(SAVE_LOGS_TO_DYNAMODB)
49
 
50
- print("SAVE_LOGS_TO_CSV:", SAVE_LOGS_TO_CSV)
51
- print("SAVE_LOGS_TO_DYNAMODB:", SAVE_LOGS_TO_DYNAMODB)
52
-
53
  if CSV_ACCESS_LOG_HEADERS: CSV_ACCESS_LOG_HEADERS = eval(CSV_ACCESS_LOG_HEADERS)
54
  if CSV_FEEDBACK_LOG_HEADERS: CSV_FEEDBACK_LOG_HEADERS = eval(CSV_FEEDBACK_LOG_HEADERS)
55
  if CSV_USAGE_LOG_HEADERS: CSV_USAGE_LOG_HEADERS = eval(CSV_USAGE_LOG_HEADERS)
@@ -77,6 +74,9 @@ with app:
77
  all_decision_process_table_state = gr.Dataframe(value=pd.DataFrame(), headers=None, col_count=0, row_count = (0, "dynamic"), label="all_decision_process_table", visible=False, type="pandas", wrap=True)
78
  review_file_state = gr.Dataframe(value=pd.DataFrame(), headers=None, col_count=0, row_count = (0, "dynamic"), label="review_file_df", visible=False, type="pandas", wrap=True)
79
 
 
 
 
80
  session_hash_state = gr.Textbox(label= "session_hash_state", value="", visible=False)
81
  host_name_textbox = gr.Textbox(label= "host_name_textbox", value=HOST_NAME, visible=False)
82
  s3_output_folder_state = gr.Textbox(label= "s3_output_folder_state", value="", visible=False)
@@ -121,7 +121,12 @@ with app:
121
 
122
  doc_full_file_name_textbox = gr.Textbox(label = "doc_full_file_name_textbox", value="", visible=False)
123
  doc_file_name_no_extension_textbox = gr.Textbox(label = "doc_full_file_name_textbox", value="", visible=False)
124
- blank_doc_file_name_no_extension_textbox_for_logs = gr.Textbox(label = "doc_full_file_name_textbox", value="", visible=False) # Left blank for when user does not want to report file names
 
 
 
 
 
125
  doc_file_name_with_extension_textbox = gr.Textbox(label = "doc_file_name_with_extension_textbox", value="", visible=False)
126
  doc_file_name_textbox_list = gr.Dropdown(label = "doc_file_name_textbox_list", value="", allow_custom_value=True,visible=False)
127
  latest_review_file_path = gr.Textbox(label = "latest_review_file_path", value="", visible=False) # Latest review file path output from redaction
@@ -200,6 +205,7 @@ with app:
200
  cost_code_choice_drop = gr.Dropdown(value=DEFAULT_COST_CODE, label="Choose cost code for analysis. Please contact Finance if you can't find your cost code in the given list.", choices=[DEFAULT_COST_CODE], allow_custom_value=False, visible=False)
201
 
202
  textract_output_found_checkbox = gr.Checkbox(value= False, label="Existing Textract output file found", interactive=False, visible=False)
 
203
  total_pdf_page_count = gr.Number(label = "Total page count", value=0, visible=False)
204
  estimated_aws_costs_number = gr.Number(label = "Approximate AWS Textract and/or Comprehend cost ($)", value=0, visible=False, precision=2)
205
  estimated_time_taken_number = gr.Number(label = "Approximate time taken to extract text/redact (minutes)", value=0, visible=False, precision=2)
@@ -256,10 +262,14 @@ with app:
256
  if SHOW_COSTS == "True":
257
  with gr.Accordion("Estimated costs and time taken", open = True, visible=True):
258
  with gr.Row(equal_height=True):
259
- textract_output_found_checkbox = gr.Checkbox(value= False, label="Existing Textract output file found", interactive=False, visible=True)
260
- total_pdf_page_count = gr.Number(label = "Total page count", value=0, visible=True)
261
- estimated_aws_costs_number = gr.Number(label = "Approximate AWS Textract and/or Comprehend cost (£)", value=0.00, precision=2, visible=True)
262
- estimated_time_taken_number = gr.Number(label = "Approximate time taken to extract text/redact (minutes)", value=0, visible=True, precision=2)
 
 
 
 
263
 
264
  if GET_COST_CODES == "True" or ENFORCE_COST_CODES == "True":
265
  with gr.Accordion("Apply cost code", open = True, visible=True):
@@ -397,7 +407,7 @@ with app:
397
  ###
398
  with gr.Tab(label="Open text or Excel/csv files"):
399
  gr.Markdown("""### Choose open text or a tabular data file (xlsx or csv) to redact.""")
400
- with gr.Accordion("Paste open text", open = False):
401
  in_text = gr.Textbox(label="Enter open text", lines=10)
402
  with gr.Accordion("Upload xlsx or csv files", open = True):
403
  in_data_files = gr.File(label="Choose Excel or csv files", file_count= "multiple", file_types=['.xlsx', '.xls', '.csv', '.parquet', '.csv.gz'], height=file_input_height)
@@ -407,6 +417,9 @@ with app:
407
  in_colnames = gr.Dropdown(choices=["Choose columns to anonymise"], multiselect = True, label="Select columns that you want to anonymise (showing columns present across all files).")
408
 
409
  pii_identification_method_drop_tabular = gr.Radio(label = "Choose PII detection method. AWS Comprehend has a cost of approximately $0.01 per 10,000 characters.", value = default_pii_detector, choices=[local_pii_detector, aws_pii_detector])
 
 
 
410
 
411
  tabular_data_redact_btn = gr.Button("Redact text/data files", variant="primary")
412
 
@@ -464,10 +477,10 @@ with app:
464
  aws_access_key_textbox = gr.Textbox(value='', label="AWS access key for account with permissions for AWS Textract and Comprehend", visible=True, type="password")
465
  aws_secret_key_textbox = gr.Textbox(value='', label="AWS secret key for account with permissions for AWS Textract and Comprehend", visible=True, type="password")
466
 
467
- with gr.Accordion("Settings for open text or xlsx/csv files", open = False):
468
- anon_strat = gr.Radio(choices=["replace with 'REDACTED'", "replace with <ENTITY_NAME>", "redact completely", "hash", "mask", "encrypt", "fake_first_name"], label="Select an anonymisation method.", value = "replace with 'REDACTED'")
469
 
470
- log_files_output = gr.File(label="Log file output", interactive=False)
 
471
 
472
  with gr.Accordion("Combine multiple review files", open = False):
473
  multiple_review_files_in_out = gr.File(label="Combine multiple review_file.csv files together here.", file_count='multiple', file_types=['.csv'])
@@ -493,14 +506,17 @@ with app:
493
  handwrite_signature_checkbox.change(calculate_aws_costs, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_aws_costs_number])
494
  textract_output_found_checkbox.change(calculate_aws_costs, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_aws_costs_number])
495
  only_extract_text_radio.change(calculate_aws_costs, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_aws_costs_number])
 
496
 
497
  # Calculate time taken
498
- total_pdf_page_count.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_time_taken_number])
499
- text_extract_method_radio.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_time_taken_number])
500
- pii_identification_method_drop.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_time_taken_number])
501
- handwrite_signature_checkbox.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_time_taken_number])
502
- textract_output_found_checkbox.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_time_taken_number])
503
- only_extract_text_radio.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_time_taken_number])
 
 
504
 
505
  # Allow user to select items from cost code dataframe for cost code
506
  if SHOW_COSTS=="True" and (GET_COST_CODES == "True" or ENFORCE_COST_CODES == "True"):
@@ -510,27 +526,30 @@ with app:
510
  cost_code_choice_drop.select(update_cost_code_dataframe_from_dropdown_select, inputs=[cost_code_choice_drop, cost_code_dataframe_base], outputs=[cost_code_dataframe])
511
 
512
  in_doc_files.upload(fn=get_input_file_names, inputs=[in_doc_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
513
- success(fn = prepare_image_or_pdf, inputs=[in_doc_files, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, first_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool_false, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_base]).\
514
- success(fn=check_for_existing_textract_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[textract_output_found_checkbox])
 
515
 
516
  # Run redaction function
517
  document_redact_btn.click(fn = reset_state_vars, outputs=[all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, textract_metadata_textbox, annotator, output_file_list_state, log_files_output_list_state, recogniser_entity_dataframe, recogniser_entity_dataframe_base, pdf_doc_state, duplication_file_path_outputs_list_state, redaction_output_summary_textbox, is_a_textract_api_call, textract_query_number]).\
518
  success(fn= enforce_cost_codes, inputs=[enforce_cost_code_textbox, cost_code_choice_drop, cost_code_dataframe_base]).\
519
- success(fn= choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, text_extract_method_radio, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, first_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path],
520
- outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path], api_name="redact_doc").\
521
  success(fn=update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])
522
 
523
  # If the app has completed a batch of pages, it will rerun the redaction process until the end of all pages in the document
524
- current_loop_page_number.change(fn = choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, text_extract_method_radio, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, second_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path],
525
- outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path]).\
526
  success(fn=update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])
527
 
528
  # If a file has been completed, the function will continue onto the next document
529
- latest_file_completed_text.change(fn = choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, text_extract_method_radio, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, second_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path],
530
- outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path]).\
531
  success(fn=update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state]).\
532
  success(fn=check_for_existing_textract_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[textract_output_found_checkbox]).\
533
- success(fn=reveal_feedback_buttons, outputs=[pdf_feedback_radio, pdf_further_details_text, pdf_submit_feedback_btn, pdf_feedback_title])
 
 
534
 
535
  # If the line level ocr results are changed by load in by user or by a new redaction task, replace the ocr results displayed in the table
536
  all_line_level_ocr_results_df_base.change(reset_ocr_base_dataframe, inputs=[all_line_level_ocr_results_df_base], outputs=[all_line_level_ocr_results_df])
@@ -548,8 +567,8 @@ with app:
548
  convert_textract_outputs_to_ocr_results.click(fn=check_for_existing_textract_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[textract_output_found_checkbox]).\
549
  success(fn= check_textract_outputs_exist, inputs=[textract_output_found_checkbox]).\
550
  success(fn = reset_state_vars, outputs=[all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, textract_metadata_textbox, annotator, output_file_list_state, log_files_output_list_state, recogniser_entity_dataframe, recogniser_entity_dataframe_base, pdf_doc_state, duplication_file_path_outputs_list_state, redaction_output_summary_textbox, is_a_textract_api_call]).\
551
- success(fn= choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, textract_only_method_drop, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, first_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, no_redaction_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path],
552
- outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path])
553
 
554
  ###
555
  # REVIEW PDF REDACTIONS
@@ -558,7 +577,7 @@ with app:
558
  # Upload previous files for modifying redactions
559
  upload_previous_review_file_btn.click(fn=reset_review_vars, inputs=None, outputs=[recogniser_entity_dataframe, recogniser_entity_dataframe_base]).\
560
  success(fn=get_input_file_names, inputs=[output_review_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
561
- success(fn = prepare_image_or_pdf, inputs=[output_review_files, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_base], api_name="prepare_doc").\
562
  success(update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, annotate_current_page, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs = [annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])
563
 
564
  # Page number controls
@@ -620,12 +639,12 @@ with app:
620
 
621
  # Convert review file to xfdf Adobe format
622
  convert_review_file_to_adobe_btn.click(fn=get_input_file_names, inputs=[output_review_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
623
- success(fn = prepare_image_or_pdf, inputs=[output_review_files, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_placeholder]).\
624
  success(convert_df_to_xfdf, inputs=[output_review_files, pdf_doc_state, images_pdf_state, output_folder_textbox, document_cropboxes, page_sizes], outputs=[adobe_review_files_out])
625
 
626
  # Convert xfdf Adobe file back to review_file.csv
627
  convert_adobe_to_review_file_btn.click(fn=get_input_file_names, inputs=[adobe_review_files_out], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
628
- success(fn = prepare_image_or_pdf, inputs=[adobe_review_files_out, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_placeholder]).\
629
  success(fn=convert_xfdf_to_dataframe, inputs=[adobe_review_files_out, pdf_doc_state, images_pdf_state, output_folder_textbox], outputs=[output_review_files], scroll_to_output=True)
630
 
631
  ###
@@ -634,11 +653,14 @@ with app:
634
  in_data_files.upload(fn=put_columns_in_df, inputs=[in_data_files], outputs=[in_colnames, in_excel_sheets]).\
635
  success(fn=get_input_file_names, inputs=[in_data_files], outputs=[data_file_name_no_extension_textbox, data_file_name_with_extension_textbox, data_full_file_name_textbox, data_file_name_textbox_list, total_pdf_page_count])
636
 
637
- tabular_data_redact_btn.click(fn=anonymise_data_files, inputs=[in_data_files, in_text, anon_strat, in_colnames, in_redact_language, in_redact_entities, in_allow_list_state, text_tabular_files_done, text_output_summary, text_output_file_list_state, log_files_output_list_state, in_excel_sheets, first_loop_state, output_folder_textbox, in_deny_list_state, max_fuzzy_spelling_mistakes_num, pii_identification_method_drop_tabular, in_redact_comprehend_entities, comprehend_query_number, aws_access_key_textbox, aws_secret_key_textbox], outputs=[text_output_summary, text_output_file, text_output_file_list_state, text_tabular_files_done, log_files_output, log_files_output_list_state], api_name="redact_data")
 
 
638
 
 
639
  # If the output file count text box changes, keep going with redacting each data file until done
640
- text_tabular_files_done.change(fn=anonymise_data_files, inputs=[in_data_files, in_text, anon_strat, in_colnames, in_redact_language, in_redact_entities, in_allow_list_state, text_tabular_files_done, text_output_summary, text_output_file_list_state, log_files_output_list_state, in_excel_sheets, second_loop_state, output_folder_textbox, in_deny_list_state, max_fuzzy_spelling_mistakes_num, pii_identification_method_drop_tabular, in_redact_comprehend_entities, comprehend_query_number, aws_access_key_textbox, aws_secret_key_textbox], outputs=[text_output_summary, text_output_file, text_output_file_list_state, text_tabular_files_done, log_files_output, log_files_output_list_state]).\
641
- success(fn = reveal_feedback_buttons, outputs=[data_feedback_radio, data_further_details_text, data_submit_feedback_btn, data_feedback_title])
642
 
643
  ###
644
  # IDENTIFY DUPLICATE PAGES
@@ -715,17 +737,30 @@ with app:
715
  success(fn = upload_log_file_to_s3, inputs=[access_logs_state, access_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
716
 
717
  ### FEEDBACK LOGS
718
- # User submitted feedback for pdf redactions
719
- pdf_callback = CSVLogger_custom(dataset_file_name=log_file_name)
720
- pdf_callback.setup([pdf_feedback_radio, pdf_further_details_text, doc_file_name_no_extension_textbox], FEEDBACK_LOGS_FOLDER)
721
- pdf_submit_feedback_btn.click(lambda *args: pdf_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=FEEDBACK_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_FEEDBACK_LOG_HEADERS, replacement_headers=CSV_FEEDBACK_LOG_HEADERS), [pdf_feedback_radio, pdf_further_details_text, doc_file_name_no_extension_textbox], None, preprocess=False).\
722
- success(fn = upload_log_file_to_s3, inputs=[feedback_logs_state, feedback_s3_logs_loc_state], outputs=[pdf_further_details_text])
723
-
724
- # User submitted feedback for data redactions
725
- data_callback = CSVLogger_custom(dataset_file_name=log_file_name)
726
- data_callback.setup([data_feedback_radio, data_further_details_text, data_full_file_name_textbox], FEEDBACK_LOGS_FOLDER)
727
- data_submit_feedback_btn.click(lambda *args: data_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=FEEDBACK_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_FEEDBACK_LOG_HEADERS, replacement_headers=CSV_FEEDBACK_LOG_HEADERS), [data_feedback_radio, data_further_details_text, data_full_file_name_textbox], None, preprocess=False).\
728
- success(fn = upload_log_file_to_s3, inputs=[feedback_logs_state, feedback_s3_logs_loc_state], outputs=[data_further_details_text])
 
729
 
730
  ### USAGE LOGS
731
  # Log processing usage - time taken for redaction queries, and also logs for queries to Textract/Comprehend
@@ -738,15 +773,21 @@ with app:
738
  latest_file_completed_text.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, doc_file_name_no_extension_textbox, data_full_file_name_textbox, total_pdf_page_count, actual_time_taken_number, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
739
  success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
740
 
 
 
 
741
  successful_textract_api_call_number.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, doc_file_name_no_extension_textbox, data_full_file_name_textbox, total_pdf_page_count, actual_time_taken_number, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
742
  success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
743
  else:
744
- usage_callback.setup([session_hash_textbox, blank_doc_file_name_no_extension_textbox_for_logs, data_full_file_name_textbox, actual_time_taken_number, total_pdf_page_count, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], USAGE_LOGS_FOLDER)
 
 
 
745
 
746
- latest_file_completed_text.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, blank_doc_file_name_no_extension_textbox_for_logs, data_full_file_name_textbox, actual_time_taken_number, total_pdf_page_count, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
747
  success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
748
 
749
- successful_textract_api_call_number.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, blank_doc_file_name_no_extension_textbox_for_logs, data_full_file_name_textbox, actual_time_taken_number, total_pdf_page_count, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
750
  success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
751
 
752
  if __name__ == "__main__":
 
5
  from gradio_image_annotation import image_annotator
6
 
7
  from tools.config import OUTPUT_FOLDER, INPUT_FOLDER, RUN_DIRECT_MODE, MAX_QUEUE_SIZE, DEFAULT_CONCURRENCY_LIMIT, MAX_FILE_SIZE, GRADIO_SERVER_PORT, ROOT_PATH, GET_DEFAULT_ALLOW_LIST, ALLOW_LIST_PATH, S3_ALLOW_LIST_PATH, FEEDBACK_LOGS_FOLDER, ACCESS_LOGS_FOLDER, USAGE_LOGS_FOLDER, TESSERACT_FOLDER, POPPLER_FOLDER, REDACTION_LANGUAGE, GET_COST_CODES, COST_CODES_PATH, S3_COST_CODES_PATH, ENFORCE_COST_CODES, DISPLAY_FILE_NAMES_IN_LOGS, SHOW_COSTS, RUN_AWS_FUNCTIONS, DOCUMENT_REDACTION_BUCKET, SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS, TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET, TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER, TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_OUTPUT_SUBFOLDER, SESSION_OUTPUT_FOLDER, LOAD_PREVIOUS_TEXTRACT_JOBS_S3, TEXTRACT_JOBS_S3_LOC, TEXTRACT_JOBS_LOCAL_LOC, HOST_NAME, DEFAULT_COST_CODE, OUTPUT_COST_CODES_PATH, OUTPUT_ALLOW_LIST_PATH, COGNITO_AUTH, SAVE_LOGS_TO_CSV, SAVE_LOGS_TO_DYNAMODB, ACCESS_LOG_DYNAMODB_TABLE_NAME, DYNAMODB_ACCESS_LOG_HEADERS, CSV_ACCESS_LOG_HEADERS, FEEDBACK_LOG_DYNAMODB_TABLE_NAME, DYNAMODB_FEEDBACK_LOG_HEADERS, CSV_FEEDBACK_LOG_HEADERS, USAGE_LOG_DYNAMODB_TABLE_NAME, DYNAMODB_USAGE_LOG_HEADERS, CSV_USAGE_LOG_HEADERS
8
+ from tools.helper_functions import put_columns_in_df, get_connection_params, reveal_feedback_buttons, custom_regex_load, reset_state_vars, load_in_default_allow_list, tesseract_ocr_option, text_ocr_option, textract_option, local_pii_detector, aws_pii_detector, no_redaction_option, reset_review_vars, merge_csv_files, load_all_output_files, update_dataframe, check_for_existing_textract_file, load_in_default_cost_codes, enforce_cost_codes, calculate_aws_costs, calculate_time_taken, reset_base_dataframe, reset_ocr_base_dataframe, update_cost_code_dataframe_from_dropdown_select, check_for_existing_local_ocr_file, reset_data_vars, reset_aws_call_vars
9
  from tools.aws_functions import upload_file_to_s3, download_file_from_s3, upload_log_file_to_s3
10
  from tools.file_redaction import choose_and_run_redactor
11
  from tools.file_conversion import prepare_image_or_pdf, get_input_file_names
 
47
  SAVE_LOGS_TO_CSV = eval(SAVE_LOGS_TO_CSV)
48
  SAVE_LOGS_TO_DYNAMODB = eval(SAVE_LOGS_TO_DYNAMODB)
49
 
 
 
 
50
  if CSV_ACCESS_LOG_HEADERS: CSV_ACCESS_LOG_HEADERS = eval(CSV_ACCESS_LOG_HEADERS)
51
  if CSV_FEEDBACK_LOG_HEADERS: CSV_FEEDBACK_LOG_HEADERS = eval(CSV_FEEDBACK_LOG_HEADERS)
52
  if CSV_USAGE_LOG_HEADERS: CSV_USAGE_LOG_HEADERS = eval(CSV_USAGE_LOG_HEADERS)
 
74
  all_decision_process_table_state = gr.Dataframe(value=pd.DataFrame(), headers=None, col_count=0, row_count = (0, "dynamic"), label="all_decision_process_table", visible=False, type="pandas", wrap=True)
75
  review_file_state = gr.Dataframe(value=pd.DataFrame(), headers=None, col_count=0, row_count = (0, "dynamic"), label="review_file_df", visible=False, type="pandas", wrap=True)
76
 
77
+ all_page_line_level_ocr_results = gr.State([])
78
+ all_page_line_level_ocr_results_with_children = gr.State([])
79
+
80
  session_hash_state = gr.Textbox(label= "session_hash_state", value="", visible=False)
81
  host_name_textbox = gr.Textbox(label= "host_name_textbox", value=HOST_NAME, visible=False)
82
  s3_output_folder_state = gr.Textbox(label= "s3_output_folder_state", value="", visible=False)
 
121
 
122
  doc_full_file_name_textbox = gr.Textbox(label = "doc_full_file_name_textbox", value="", visible=False)
123
  doc_file_name_no_extension_textbox = gr.Textbox(label = "doc_full_file_name_textbox", value="", visible=False)
124
+ blank_doc_file_name_no_extension_textbox_for_logs = gr.Textbox(label = "doc_full_file_name_textbox", value="", visible=False)
125
+ blank_data_file_name_no_extension_textbox_for_logs = gr.Textbox(label = "data_full_file_name_textbox", value="", visible=False)
126
+ placeholder_doc_file_name_no_extension_textbox_for_logs = gr.Textbox(label = "doc_full_file_name_textbox", value="document", visible=False)
127
+ placeholder_data_file_name_no_extension_textbox_for_logs = gr.Textbox(label = "data_full_file_name_textbox", value="data_file", visible=False)
128
+
129
+ # Left blank for when user does not want to report file names
130
  doc_file_name_with_extension_textbox = gr.Textbox(label = "doc_file_name_with_extension_textbox", value="", visible=False)
131
  doc_file_name_textbox_list = gr.Dropdown(label = "doc_file_name_textbox_list", value="", allow_custom_value=True,visible=False)
132
  latest_review_file_path = gr.Textbox(label = "latest_review_file_path", value="", visible=False) # Latest review file path output from redaction
 
205
  cost_code_choice_drop = gr.Dropdown(value=DEFAULT_COST_CODE, label="Choose cost code for analysis. Please contact Finance if you can't find your cost code in the given list.", choices=[DEFAULT_COST_CODE], allow_custom_value=False, visible=False)
206
 
207
  textract_output_found_checkbox = gr.Checkbox(value= False, label="Existing Textract output file found", interactive=False, visible=False)
208
+ local_ocr_output_found_checkbox = gr.Checkbox(value= False, label="Existing local OCR output file found", interactive=False, visible=False)
209
  total_pdf_page_count = gr.Number(label = "Total page count", value=0, visible=False)
210
  estimated_aws_costs_number = gr.Number(label = "Approximate AWS Textract and/or Comprehend cost ($)", value=0, visible=False, precision=2)
211
  estimated_time_taken_number = gr.Number(label = "Approximate time taken to extract text/redact (minutes)", value=0, visible=False, precision=2)
 
262
  if SHOW_COSTS == "True":
263
  with gr.Accordion("Estimated costs and time taken", open = True, visible=True):
264
  with gr.Row(equal_height=True):
265
+ with gr.Column(scale=1):
266
+ textract_output_found_checkbox = gr.Checkbox(value= False, label="Existing Textract output file found", interactive=False, visible=True)
267
+ local_ocr_output_found_checkbox = gr.Checkbox(value= False, label="Existing local OCR output file found", interactive=False, visible=True)
268
+ with gr.Column(scale=4):
269
+ with gr.Row(equal_height=True):
270
+ total_pdf_page_count = gr.Number(label = "Total page count", value=0, visible=True)
271
+ estimated_aws_costs_number = gr.Number(label = "Approximate AWS Textract and/or Comprehend cost (£)", value=0.00, precision=2, visible=True)
272
+ estimated_time_taken_number = gr.Number(label = "Approximate time taken to extract text/redact (minutes)", value=0, visible=True, precision=2)
273
 
274
  if GET_COST_CODES == "True" or ENFORCE_COST_CODES == "True":
275
  with gr.Accordion("Apply cost code", open = True, visible=True):
 
407
  ###
408
  with gr.Tab(label="Open text or Excel/csv files"):
409
  gr.Markdown("""### Choose open text or a tabular data file (xlsx or csv) to redact.""")
410
+ with gr.Accordion("Redact open text", open = False):
411
  in_text = gr.Textbox(label="Enter open text", lines=10)
412
  with gr.Accordion("Upload xlsx or csv files", open = True):
413
  in_data_files = gr.File(label="Choose Excel or csv files", file_count= "multiple", file_types=['.xlsx', '.xls', '.csv', '.parquet', '.csv.gz'], height=file_input_height)
 
417
  in_colnames = gr.Dropdown(choices=["Choose columns to anonymise"], multiselect = True, label="Select columns that you want to anonymise (showing columns present across all files).")
418
 
419
  pii_identification_method_drop_tabular = gr.Radio(label = "Choose PII detection method. AWS Comprehend has a cost of approximately $0.01 per 10,000 characters.", value = default_pii_detector, choices=[local_pii_detector, aws_pii_detector])
420
+
421
+ with gr.Accordion("Anonymisation output format", open = False):
422
+ anon_strat = gr.Radio(choices=["replace with 'REDACTED'", "replace with <ENTITY_NAME>", "redact completely", "hash", "mask"], label="Select an anonymisation method.", value = "replace with 'REDACTED'") # 'encrypt' and 'fake_first_name' are also available, but are not currently included as they are of limited use in their current form
423
 
424
  tabular_data_redact_btn = gr.Button("Redact text/data files", variant="primary")
425
 
 
477
  aws_access_key_textbox = gr.Textbox(value='', label="AWS access key for account with permissions for AWS Textract and Comprehend", visible=True, type="password")
478
  aws_secret_key_textbox = gr.Textbox(value='', label="AWS secret key for account with permissions for AWS Textract and Comprehend", visible=True, type="password")
479
 
480
+
 
481
 
482
+ with gr.Accordion("Log file outputs", open = False):
483
+ log_files_output = gr.File(label="Log file output", interactive=False)
484
 
485
  with gr.Accordion("Combine multiple review files", open = False):
486
  multiple_review_files_in_out = gr.File(label="Combine multiple review_file.csv files together here.", file_count='multiple', file_types=['.csv'])
 
506
  handwrite_signature_checkbox.change(calculate_aws_costs, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_aws_costs_number])
507
  textract_output_found_checkbox.change(calculate_aws_costs, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_aws_costs_number])
508
  only_extract_text_radio.change(calculate_aws_costs, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_aws_costs_number])
509
+ textract_output_found_checkbox.change(calculate_aws_costs, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_aws_costs_number])
510
 
511
  # Calculate time taken
512
+ total_pdf_page_count.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
513
+ text_extract_method_radio.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
514
+ pii_identification_method_drop.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
515
+ handwrite_signature_checkbox.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
516
+ textract_output_found_checkbox.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
517
+ only_extract_text_radio.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
518
+ textract_output_found_checkbox.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
519
+ local_ocr_output_found_checkbox.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
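
All of the time-estimate bindings above feed the same component values into `calculate_time_taken`, with the new `local_ocr_output_found_checkbox` letting the estimate fall when a saved local OCR file can simply be reloaded. A rough sketch of that kind of helper, assuming illustrative per-page timings and option strings (the app's real values will differ):

```python
# Illustrative sketch only - the per-page timings and option strings below are assumptions,
# not the values used by the app's own calculate_time_taken.
def calculate_time_taken_sketch(page_count: float, text_extract_method: str,
                                pii_method: str, textract_output_found: bool,
                                only_extract_text: bool, local_ocr_output_found: bool) -> float:
    """Return an approximate processing time in minutes for one document."""
    extract_per_page = 0.0
    if text_extract_method == "AWS Textract" and not textract_output_found:
        extract_per_page = 0.05   # assumed Textract round trip per page
    elif text_extract_method == "Local OCR" and not local_ocr_output_found:
        extract_per_page = 0.3    # assumed local OCR time per page
    pii_per_page = 0.0 if only_extract_text else 0.05   # assumed PII scan per page
    if pii_method == "AWS Comprehend":
        pii_per_page += 0.02      # assumed extra API latency per page
    return round(page_count * (extract_per_page + pii_per_page), 2)
```

Reloading an existing Textract JSON or local OCR file is what lets the extraction term drop to zero, which is the behaviour the checkbox wiring above exposes to the user.
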
520
 
521
  # Allow user to select items from cost code dataframe for cost code
522
  if SHOW_COSTS=="True" and (GET_COST_CODES == "True" or ENFORCE_COST_CODES == "True"):
 
526
  cost_code_choice_drop.select(update_cost_code_dataframe_from_dropdown_select, inputs=[cost_code_choice_drop, cost_code_dataframe_base], outputs=[cost_code_dataframe])
527
 
528
  in_doc_files.upload(fn=get_input_file_names, inputs=[in_doc_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
529
+ success(fn = prepare_image_or_pdf, inputs=[in_doc_files, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, first_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool_false, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_base, local_ocr_output_found_checkbox]).\
530
+ success(fn=check_for_existing_textract_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[textract_output_found_checkbox]).\
531
+ success(fn=check_for_existing_local_ocr_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[local_ocr_output_found_checkbox])
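
The new `check_for_existing_local_ocr_file` step mirrors the existing Textract check: if a previously saved local OCR output is already sitting in the output folder, the checkbox is ticked and the extraction step can be skipped on the next run. A minimal sketch of that kind of check, with the file suffix used here only as an assumption for illustration:

```python
import os

# Sketch only - the exact suffix the app looks for is defined in its own helper.
def check_for_existing_local_ocr_file_sketch(doc_file_name_no_extension: str, output_folder: str) -> bool:
    """Return True if a saved local OCR output already exists for this document."""
    candidate = os.path.join(output_folder, f"{doc_file_name_no_extension}_ocr_results_with_words.json")
    return os.path.exists(candidate)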
532
 
533
  # Run redaction function
534
  document_redact_btn.click(fn = reset_state_vars, outputs=[all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, textract_metadata_textbox, annotator, output_file_list_state, log_files_output_list_state, recogniser_entity_dataframe, recogniser_entity_dataframe_base, pdf_doc_state, duplication_file_path_outputs_list_state, redaction_output_summary_textbox, is_a_textract_api_call, textract_query_number]).\
535
  success(fn= enforce_cost_codes, inputs=[enforce_cost_code_textbox, cost_code_choice_drop, cost_code_dataframe_base]).\
536
+ success(fn= choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, text_extract_method_radio, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, first_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children],
537
+ outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children], api_name="redact_doc").\
538
  success(fn=update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])
539
 
540
  # If the app has completed a batch of pages, it will rerun the redaction process until the end of all pages in the document
541
+ current_loop_page_number.change(fn = choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, text_extract_method_radio, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, second_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children],
542
+ outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children]).\
543
  success(fn=update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])
544
 
545
  # If a file has been completed, the function will continue onto the next document
546
+ latest_file_completed_text.change(fn = choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, text_extract_method_radio, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, second_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children],
547
+ outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children]).\
548
  success(fn=update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state]).\
549
  success(fn=check_for_existing_textract_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[textract_output_found_checkbox]).\
550
+ success(fn=check_for_existing_local_ocr_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[local_ocr_output_found_checkbox]).\
551
+ success(fn=reveal_feedback_buttons, outputs=[pdf_feedback_radio, pdf_further_details_text, pdf_submit_feedback_btn, pdf_feedback_title]).\
552
+ success(fn = reset_aws_call_vars, outputs=[comprehend_query_number, textract_query_number])
553
 
554
  # If the line-level OCR results change, either because the user loads them in or because a new redaction task produces them, replace the OCR results displayed in the table
555
  all_line_level_ocr_results_df_base.change(reset_ocr_base_dataframe, inputs=[all_line_level_ocr_results_df_base], outputs=[all_line_level_ocr_results_df])
 
567
  convert_textract_outputs_to_ocr_results.click(fn=check_for_existing_textract_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[textract_output_found_checkbox]).\
568
  success(fn= check_textract_outputs_exist, inputs=[textract_output_found_checkbox]).\
569
  success(fn = reset_state_vars, outputs=[all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, textract_metadata_textbox, annotator, output_file_list_state, log_files_output_list_state, recogniser_entity_dataframe, recogniser_entity_dataframe_base, pdf_doc_state, duplication_file_path_outputs_list_state, redaction_output_summary_textbox, is_a_textract_api_call]).\
570
+ success(fn= choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, textract_only_method_drop, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, first_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, no_redaction_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children],
571
+ outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children])
572
 
573
  ###
574
  # REVIEW PDF REDACTIONS
 
577
  # Upload previous files for modifying redactions
578
  upload_previous_review_file_btn.click(fn=reset_review_vars, inputs=None, outputs=[recogniser_entity_dataframe, recogniser_entity_dataframe_base]).\
579
  success(fn=get_input_file_names, inputs=[output_review_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
580
+ success(fn = prepare_image_or_pdf, inputs=[output_review_files, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_base, local_ocr_output_found_checkbox], api_name="prepare_doc").\
581
  success(update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, annotate_current_page, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs = [annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])
582
 
583
  # Page number controls
 
639
 
640
  # Convert review file to xfdf Adobe format
641
  convert_review_file_to_adobe_btn.click(fn=get_input_file_names, inputs=[output_review_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
642
+ success(fn = prepare_image_or_pdf, inputs=[output_review_files, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_placeholder, local_ocr_output_found_checkbox]).\
643
  success(convert_df_to_xfdf, inputs=[output_review_files, pdf_doc_state, images_pdf_state, output_folder_textbox, document_cropboxes, page_sizes], outputs=[adobe_review_files_out])
644
 
645
  # Convert xfdf Adobe file back to review_file.csv
646
  convert_adobe_to_review_file_btn.click(fn=get_input_file_names, inputs=[adobe_review_files_out], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
647
+ success(fn = prepare_image_or_pdf, inputs=[adobe_review_files_out, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_placeholder, local_ocr_output_found_checkbox]).\
648
  success(fn=convert_xfdf_to_dataframe, inputs=[adobe_review_files_out, pdf_doc_state, images_pdf_state, output_folder_textbox], outputs=[output_review_files], scroll_to_output=True)
649
 
650
  ###
 
653
  in_data_files.upload(fn=put_columns_in_df, inputs=[in_data_files], outputs=[in_colnames, in_excel_sheets]).\
654
  success(fn=get_input_file_names, inputs=[in_data_files], outputs=[data_file_name_no_extension_textbox, data_file_name_with_extension_textbox, data_full_file_name_textbox, data_file_name_textbox_list, total_pdf_page_count])
655
 
656
+ tabular_data_redact_btn.click(reset_data_vars, outputs=[actual_time_taken_number, log_files_output_list_state, comprehend_query_number]).\
657
+ success(fn=anonymise_data_files, inputs=[in_data_files, in_text, anon_strat, in_colnames, in_redact_language, in_redact_entities, in_allow_list_state, text_tabular_files_done, text_output_summary, text_output_file_list_state, log_files_output_list_state, in_excel_sheets, first_loop_state, output_folder_textbox, in_deny_list_state, max_fuzzy_spelling_mistakes_num, pii_identification_method_drop_tabular, in_redact_comprehend_entities, comprehend_query_number, aws_access_key_textbox, aws_secret_key_textbox, actual_time_taken_number], outputs=[text_output_summary, text_output_file, text_output_file_list_state, text_tabular_files_done, log_files_output, log_files_output_list_state, actual_time_taken_number], api_name="redact_data").\
658
+ success(fn = reveal_feedback_buttons, outputs=[data_feedback_radio, data_further_details_text, data_submit_feedback_btn, data_feedback_title])
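
For reference, each anonymisation format offered in the accordion above transforms a detected span differently. A minimal sketch of what the options mean when applied to a single match (the app's own `anonymise_data_files` operates on whole files and columns; this only illustrates the output formats):

```python
import hashlib

# Sketch of the anonymisation options listed above, applied to one detected span.
def apply_anon_strategy_sketch(text: str, start: int, end: int, entity_type: str, strategy: str) -> str:
    span = text[start:end]
    if strategy == "replace with 'REDACTED'":
        replacement = "REDACTED"
    elif strategy == "replace with <ENTITY_NAME>":
        replacement = f"<{entity_type}>"
    elif strategy == "redact completely":
        replacement = ""
    elif strategy == "hash":
        replacement = hashlib.sha256(span.encode("utf-8")).hexdigest()[:12]
    elif strategy == "mask":
        replacement = "*" * len(span)
    else:
        replacement = span  # unknown option: leave the text unchanged
    return text[:start] + replacement + text[end:]
```
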
659
 
660
+ # Currently only supports redacting one data file at a time
661
  # If the output file count text box changes, keep going with redacting each data file until done
662
+ # text_tabular_files_done.change(fn=anonymise_data_files, inputs=[in_data_files, in_text, anon_strat, in_colnames, in_redact_language, in_redact_entities, in_allow_list_state, text_tabular_files_done, text_output_summary, text_output_file_list_state, log_files_output_list_state, in_excel_sheets, second_loop_state, output_folder_textbox, in_deny_list_state, max_fuzzy_spelling_mistakes_num, pii_identification_method_drop_tabular, in_redact_comprehend_entities, comprehend_query_number, aws_access_key_textbox, aws_secret_key_textbox, actual_time_taken_number], outputs=[text_output_summary, text_output_file, text_output_file_list_state, text_tabular_files_done, log_files_output, log_files_output_list_state, actual_time_taken_number]).\
663
+ # success(fn = reveal_feedback_buttons, outputs=[data_feedback_radio, data_further_details_text, data_submit_feedback_btn, data_feedback_title])
664
 
665
  ###
666
  # IDENTIFY DUPLICATE PAGES
 
737
  success(fn = upload_log_file_to_s3, inputs=[access_logs_state, access_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
738
 
739
  ### FEEDBACK LOGS
740
+ if DISPLAY_FILE_NAMES_IN_LOGS == 'True':
741
+ # User submitted feedback for pdf redactions
742
+ pdf_callback = CSVLogger_custom(dataset_file_name=log_file_name)
743
+ pdf_callback.setup([pdf_feedback_radio, pdf_further_details_text, doc_file_name_no_extension_textbox], FEEDBACK_LOGS_FOLDER)
744
+ pdf_submit_feedback_btn.click(lambda *args: pdf_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=FEEDBACK_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_FEEDBACK_LOG_HEADERS, replacement_headers=CSV_FEEDBACK_LOG_HEADERS), [pdf_feedback_radio, pdf_further_details_text, doc_file_name_no_extension_textbox], None, preprocess=False).\
745
+ success(fn = upload_log_file_to_s3, inputs=[feedback_logs_state, feedback_s3_logs_loc_state], outputs=[pdf_further_details_text])
746
+
747
+ # User submitted feedback for data redactions
748
+ data_callback = CSVLogger_custom(dataset_file_name=log_file_name)
749
+ data_callback.setup([data_feedback_radio, data_further_details_text, data_full_file_name_textbox], FEEDBACK_LOGS_FOLDER)
750
+ data_submit_feedback_btn.click(lambda *args: data_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=FEEDBACK_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_FEEDBACK_LOG_HEADERS, replacement_headers=CSV_FEEDBACK_LOG_HEADERS), [data_feedback_radio, data_further_details_text, data_full_file_name_textbox], None, preprocess=False).\
751
+ success(fn = upload_log_file_to_s3, inputs=[feedback_logs_state, feedback_s3_logs_loc_state], outputs=[data_further_details_text])
752
+ else:
753
+ # User submitted feedback for pdf redactions
754
+ pdf_callback = CSVLogger_custom(dataset_file_name=log_file_name)
755
+ pdf_callback.setup([pdf_feedback_radio, pdf_further_details_text, doc_file_name_no_extension_textbox], FEEDBACK_LOGS_FOLDER)
756
+ pdf_submit_feedback_btn.click(lambda *args: pdf_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=FEEDBACK_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_FEEDBACK_LOG_HEADERS, replacement_headers=CSV_FEEDBACK_LOG_HEADERS), [pdf_feedback_radio, pdf_further_details_text, placeholder_doc_file_name_no_extension_textbox_for_logs], None, preprocess=False).\
757
+ success(fn = upload_log_file_to_s3, inputs=[feedback_logs_state, feedback_s3_logs_loc_state], outputs=[pdf_further_details_text])
758
+
759
+ # User submitted feedback for data redactions
760
+ data_callback = CSVLogger_custom(dataset_file_name=log_file_name)
761
+ data_callback.setup([data_feedback_radio, data_further_details_text, data_full_file_name_textbox], FEEDBACK_LOGS_FOLDER)
762
+ data_submit_feedback_btn.click(lambda *args: data_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=FEEDBACK_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_FEEDBACK_LOG_HEADERS, replacement_headers=CSV_FEEDBACK_LOG_HEADERS), [data_feedback_radio, data_further_details_text, placeholder_data_file_name_no_extension_textbox_for_logs], None, preprocess=False).\
763
+ success(fn = upload_log_file_to_s3, inputs=[feedback_logs_state, feedback_s3_logs_loc_state], outputs=[data_further_details_text])
764
 
765
  ### USAGE LOGS
766
  # Log processing usage - time taken for redaction queries, and also logs for queries to Textract/Comprehend
 
773
  latest_file_completed_text.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, doc_file_name_no_extension_textbox, data_full_file_name_textbox, total_pdf_page_count, actual_time_taken_number, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
774
  success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
775
 
776
+ text_tabular_files_done.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, doc_file_name_no_extension_textbox, data_full_file_name_textbox, total_pdf_page_count, actual_time_taken_number, textract_query_number, pii_identification_method_drop_tabular, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
777
+ success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
778
+
779
  successful_textract_api_call_number.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, doc_file_name_no_extension_textbox, data_full_file_name_textbox, total_pdf_page_count, actual_time_taken_number, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
780
  success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
781
  else:
782
+ usage_callback.setup([session_hash_textbox, blank_doc_file_name_no_extension_textbox_for_logs, blank_data_file_name_no_extension_textbox_for_logs, actual_time_taken_number, total_pdf_page_count, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], USAGE_LOGS_FOLDER)
783
+
784
+ latest_file_completed_text.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, placeholder_doc_file_name_no_extension_textbox_for_logs, blank_data_file_name_no_extension_textbox_for_logs, actual_time_taken_number, total_pdf_page_count, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
785
+ success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
786
 
787
+ text_tabular_files_done.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, blank_doc_file_name_no_extension_textbox_for_logs, placeholder_data_file_name_no_extension_textbox_for_logs, actual_time_taken_number, total_pdf_page_count, textract_query_number, pii_identification_method_drop_tabular, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
788
  success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
789
 
790
+ successful_textract_api_call_number.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, placeholder_doc_file_name_no_extension_textbox_for_logs, blank_data_file_name_no_extension_textbox_for_logs, actual_time_taken_number, total_pdf_page_count, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
791
  success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
792
 
793
  if __name__ == "__main__":
tools/aws_textract.py CHANGED
@@ -108,6 +108,174 @@ def convert_pike_pdf_page_to_bytes(pdf:object, page_num:int):
108
 
109
  return pdf_bytes
110
111
  def json_to_ocrresult(json_data:dict, page_width:float, page_height:float, page_no:int):
112
  '''
113
  Convert the json response from textract to the OCRResult format used elsewhere in the code. Looks for lines, words, and signatures. Handwriting and signatures are set aside especially for later in case the user wants to override the default behaviour and redact all handwriting/signatures.
@@ -118,7 +286,7 @@ def json_to_ocrresult(json_data:dict, page_width:float, page_height:float, page_
118
  handwriting_recogniser_results = []
119
  signatures = []
120
  handwriting = []
121
- ocr_results_with_children = {}
122
  text_block={}
123
 
124
  i = 1
@@ -141,7 +309,7 @@ def json_to_ocrresult(json_data:dict, page_width:float, page_height:float, page_
141
  is_signature = False
142
  is_handwriting = False
143
 
144
- for text_block in text_blocks:
145
 
146
  if (text_block['BlockType'] == 'LINE') | (text_block['BlockType'] == 'SIGNATURE'): # (text_block['BlockType'] == 'WORD') |
147
 
@@ -244,36 +412,53 @@ def json_to_ocrresult(json_data:dict, page_width:float, page_height:float, page_
244
  'text': line_text,
245
  'bounding_box': (line_left, line_top, line_right, line_bottom)
246
  }]
247
-
248
- ocr_results_with_children["text_line_" + str(i)] = {
249
  "line": i,
250
  'text': line_text,
251
  'bounding_box': (line_left, line_top, line_right, line_bottom),
252
- 'words': words
253
- }
 
254
 
255
  # Create OCRResult with absolute coordinates
256
  ocr_result = OCRResult(line_text, line_left, line_top, width_abs, height_abs)
257
  all_ocr_results.append(ocr_result)
258
 
259
- is_signature_or_handwriting = is_signature | is_handwriting
260
 
261
- # If it is signature or handwriting, will overwrite the default behaviour of the PII analyser
262
- if is_signature_or_handwriting:
263
- if recogniser_result not in signature_or_handwriting_recogniser_results:
264
- signature_or_handwriting_recogniser_results.append(recogniser_result)
265
 
266
- if is_signature:
267
- if recogniser_result not in signature_recogniser_results:
268
- signature_recogniser_results.append(recogniser_result)
269
 
270
- if is_handwriting:
271
- if recogniser_result not in handwriting_recogniser_results:
272
- handwriting_recogniser_results.append(recogniser_result)
273
 
274
- i += 1
275
 
276
- return all_ocr_results, signature_or_handwriting_recogniser_results, signature_recogniser_results, handwriting_recogniser_results, ocr_results_with_children
277
 
278
  def load_and_convert_textract_json(textract_json_file_path:str, log_files_output_paths:str, page_sizes_df:pd.DataFrame):
279
  """
@@ -315,7 +500,7 @@ def load_and_convert_textract_json(textract_json_file_path:str, log_files_output
315
  return {}, True, log_files_output_paths # Conversion failed
316
  else:
317
  print("Invalid Textract JSON format: 'Blocks' missing.")
318
- print("textract data:", textract_data)
319
  return {}, True, log_files_output_paths # Return empty data if JSON is not recognized
320
 
321
  def restructure_textract_output(textract_output: dict, page_sizes_df:pd.DataFrame):
 
108
 
109
  return pdf_bytes
110
 
111
+ # def json_to_ocrresult(json_data:dict, page_width:float, page_height:float, page_no:int):
112
+ # '''
113
+ # Convert the json response from textract to the OCRResult format used elsewhere in the code. Looks for lines, words, and signatures. Handwriting and signatures are set aside especially for later in case the user wants to override the default behaviour and redact all handwriting/signatures.
114
+ # '''
115
+ # all_ocr_results = []
116
+ # signature_or_handwriting_recogniser_results = []
117
+ # signature_recogniser_results = []
118
+ # handwriting_recogniser_results = []
119
+ # signatures = []
120
+ # handwriting = []
121
+ # ocr_results_with_words = {}
122
+ # text_block={}
123
+
124
+ # i = 1
125
+
126
+ # # Assuming json_data is structured as a dictionary with a "pages" key
127
+ # #if "pages" in json_data:
128
+ # # Find the specific page data
129
+ # page_json_data = json_data #next((page for page in json_data["pages"] if page["page_no"] == page_no), None)
130
+
131
+ # #print("page_json_data:", page_json_data)
132
+
133
+ # if "Blocks" in page_json_data:
134
+ # # Access the data for the specific page
135
+ # text_blocks = page_json_data["Blocks"] # Access the Blocks within the page data
136
+ # # This is a new page
137
+ # elif "page_no" in page_json_data:
138
+ # text_blocks = page_json_data["data"]["Blocks"]
139
+ # else: text_blocks = []
140
+
141
+ # is_signature = False
142
+ # is_handwriting = False
143
+
144
+ # for text_block in text_blocks:
145
+
146
+ # if (text_block['BlockType'] == 'LINE') | (text_block['BlockType'] == 'SIGNATURE'): # (text_block['BlockType'] == 'WORD') |
147
+
148
+ # # Extract text and bounding box for the line
149
+ # line_bbox = text_block["Geometry"]["BoundingBox"]
150
+ # line_left = int(line_bbox["Left"] * page_width)
151
+ # line_top = int(line_bbox["Top"] * page_height)
152
+ # line_right = int((line_bbox["Left"] + line_bbox["Width"]) * page_width)
153
+ # line_bottom = int((line_bbox["Top"] + line_bbox["Height"]) * page_height)
154
+
155
+ # width_abs = int(line_bbox["Width"] * page_width)
156
+ # height_abs = int(line_bbox["Height"] * page_height)
157
+
158
+ # if text_block['BlockType'] == 'LINE':
159
+
160
+ # # Extract text and bounding box for the line
161
+ # line_text = text_block.get('Text', '')
162
+ # words = []
163
+ # current_line_handwriting_results = [] # Track handwriting results for this line
164
+
165
+ # if 'Relationships' in text_block:
166
+ # for relationship in text_block['Relationships']:
167
+ # if relationship['Type'] == 'CHILD':
168
+ # for child_id in relationship['Ids']:
169
+ # child_block = next((block for block in text_blocks if block['Id'] == child_id), None)
170
+ # if child_block and child_block['BlockType'] == 'WORD':
171
+ # word_text = child_block.get('Text', '')
172
+ # word_bbox = child_block["Geometry"]["BoundingBox"]
173
+ # confidence = child_block.get('Confidence','')
174
+ # word_left = int(word_bbox["Left"] * page_width)
175
+ # word_top = int(word_bbox["Top"] * page_height)
176
+ # word_right = int((word_bbox["Left"] + word_bbox["Width"]) * page_width)
177
+ # word_bottom = int((word_bbox["Top"] + word_bbox["Height"]) * page_height)
178
+
179
+ # # Extract BoundingBox details
180
+ # word_width = word_bbox["Width"]
181
+ # word_height = word_bbox["Height"]
182
+
183
+ # # Convert proportional coordinates to absolute coordinates
184
+ # word_width_abs = int(word_width * page_width)
185
+ # word_height_abs = int(word_height * page_height)
186
+
187
+ # words.append({
188
+ # 'text': word_text,
189
+ # 'bounding_box': (word_left, word_top, word_right, word_bottom)
190
+ # })
191
+ # # Check for handwriting
192
+ # text_type = child_block.get("TextType", '')
193
+
194
+ # if text_type == "HANDWRITING":
195
+ # is_handwriting = True
196
+ # entity_name = "HANDWRITING"
197
+ # word_end = len(word_text)
198
+
199
+ # recogniser_result = CustomImageRecognizerResult(
200
+ # entity_type=entity_name,
201
+ # text=word_text,
202
+ # score=confidence,
203
+ # start=0,
204
+ # end=word_end,
205
+ # left=word_left,
206
+ # top=word_top,
207
+ # width=word_width_abs,
208
+ # height=word_height_abs
209
+ # )
210
+
211
+ # # Add to handwriting collections immediately
212
+ # handwriting.append(recogniser_result)
213
+ # handwriting_recogniser_results.append(recogniser_result)
214
+ # signature_or_handwriting_recogniser_results.append(recogniser_result)
215
+ # current_line_handwriting_results.append(recogniser_result)
216
+
217
+ # # If handwriting or signature, add to bounding box
218
+
219
+ # elif (text_block['BlockType'] == 'SIGNATURE'):
220
+ # line_text = "SIGNATURE"
221
+ # is_signature = True
222
+ # entity_name = "SIGNATURE"
223
+ # confidence = text_block.get('Confidence', 0)
224
+ # word_end = len(line_text)
225
+
226
+ # recogniser_result = CustomImageRecognizerResult(
227
+ # entity_type=entity_name,
228
+ # text=line_text,
229
+ # score=confidence,
230
+ # start=0,
231
+ # end=word_end,
232
+ # left=line_left,
233
+ # top=line_top,
234
+ # width=width_abs,
235
+ # height=height_abs
236
+ # )
237
+
238
+ # # Add to signature collections immediately
239
+ # signatures.append(recogniser_result)
240
+ # signature_recogniser_results.append(recogniser_result)
241
+ # signature_or_handwriting_recogniser_results.append(recogniser_result)
242
+
243
+ # words = [{
244
+ # 'text': line_text,
245
+ # 'bounding_box': (line_left, line_top, line_right, line_bottom)
246
+ # }]
247
+
248
+ # ocr_results_with_words["text_line_" + str(i)] = {
249
+ # "line": i,
250
+ # 'text': line_text,
251
+ # 'bounding_box': (line_left, line_top, line_right, line_bottom),
252
+ # 'words': words
253
+ # }
254
+
255
+ # # Create OCRResult with absolute coordinates
256
+ # ocr_result = OCRResult(line_text, line_left, line_top, width_abs, height_abs)
257
+ # all_ocr_results.append(ocr_result)
258
+
259
+ # is_signature_or_handwriting = is_signature | is_handwriting
260
+
261
+ # # If it is signature or handwriting, will overwrite the default behaviour of the PII analyser
262
+ # if is_signature_or_handwriting:
263
+ # if recogniser_result not in signature_or_handwriting_recogniser_results:
264
+ # signature_or_handwriting_recogniser_results.append(recogniser_result)
265
+
266
+ # if is_signature:
267
+ # if recogniser_result not in signature_recogniser_results:
268
+ # signature_recogniser_results.append(recogniser_result)
269
+
270
+ # if is_handwriting:
271
+ # if recogniser_result not in handwriting_recogniser_results:
272
+ # handwriting_recogniser_results.append(recogniser_result)
273
+
274
+ # i += 1
275
+
276
+ # return all_ocr_results, signature_or_handwriting_recogniser_results, signature_recogniser_results, handwriting_recogniser_results, ocr_results_with_words
277
+
278
+
279
  def json_to_ocrresult(json_data:dict, page_width:float, page_height:float, page_no:int):
280
  '''
281
  Convert the json response from textract to the OCRResult format used elsewhere in the code. Looks for lines, words, and signatures. Handwriting and signatures are set aside especially for later in case the user wants to override the default behaviour and redact all handwriting/signatures.
 
286
  handwriting_recogniser_results = []
287
  signatures = []
288
  handwriting = []
289
+ ocr_results_with_words = {}
290
  text_block={}
291
 
292
  i = 1
 
309
  is_signature = False
310
  is_handwriting = False
311
 
312
+ for text_block in text_blocks:
313
 
314
  if (text_block['BlockType'] == 'LINE') | (text_block['BlockType'] == 'SIGNATURE'): # (text_block['BlockType'] == 'WORD') |
315
 
 
412
  'text': line_text,
413
  'bounding_box': (line_left, line_top, line_right, line_bottom)
414
  }]
415
+ else:
416
+ line_text = ""
417
+ words=[]
418
+ line_left = 0
419
+ line_top = 0
420
+ line_right = 0
421
+ line_bottom = 0
422
+ width_abs = 0
423
+ height_abs = 0
424
+
425
+ if line_text:
426
+
427
+ ocr_results_with_words["text_line_" + str(i)] = {
428
  "line": i,
429
  'text': line_text,
430
  'bounding_box': (line_left, line_top, line_right, line_bottom),
431
+ 'words': words,
432
+ 'page': page_no
433
+ }
434
 
435
  # Create OCRResult with absolute coordinates
436
  ocr_result = OCRResult(line_text, line_left, line_top, width_abs, height_abs)
437
  all_ocr_results.append(ocr_result)
438
 
439
+ is_signature_or_handwriting = is_signature | is_handwriting
440
+
441
+ # If it is signature or handwriting, will overwrite the default behaviour of the PII analyser
442
+ if is_signature_or_handwriting:
443
+ if recogniser_result not in signature_or_handwriting_recogniser_results:
444
+ signature_or_handwriting_recogniser_results.append(recogniser_result)
445
+
446
+ if is_signature:
447
+ if recogniser_result not in signature_recogniser_results:
448
+ signature_recogniser_results.append(recogniser_result)
449
 
450
+ if is_handwriting:
451
+ if recogniser_result not in handwriting_recogniser_results:
452
+ handwriting_recogniser_results.append(recogniser_result)
 
453
 
454
+ i += 1
 
 
455
 
456
+ # Add page key to the line level results
457
+ all_ocr_results_with_page = {"page": page_no, "results": all_ocr_results}
458
+ ocr_results_with_words_with_page = {"page": page_no, "results": ocr_results_with_words}
459
 
460
+ return all_ocr_results_with_page, signature_or_handwriting_recogniser_results, signature_recogniser_results, handwriting_recogniser_results, ocr_results_with_words_with_page
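
Wrapping the line-level results in a page-keyed dictionary gives downstream consumers a predictable shape. As a sketch of how the new `ocr_results_with_words` structure can be walked (the key names are taken from the assignments above):

```python
# Sketch of consuming the page-keyed, word-level OCR output returned above.
def iter_ocr_words_sketch(ocr_results_with_words_with_page: dict):
    """Yield (page, line number, line text, word text, word bounding box) tuples."""
    page = ocr_results_with_words_with_page["page"]
    for line in ocr_results_with_words_with_page["results"].values():
        for word in line["words"]:
            yield page, line["line"], line["text"], word["text"], word["bounding_box"]
```
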
461
 
 
462
 
463
  def load_and_convert_textract_json(textract_json_file_path:str, log_files_output_paths:str, page_sizes_df:pd.DataFrame):
464
  """
 
500
  return {}, True, log_files_output_paths # Conversion failed
501
  else:
502
  print("Invalid Textract JSON format: 'Blocks' missing.")
503
+ #print("textract data:", textract_data)
504
  return {}, True, log_files_output_paths # Return empty data if JSON is not recognized
505
 
506
  def restructure_textract_output(textract_output: dict, page_sizes_df:pd.DataFrame):
tools/config.py CHANGED
@@ -108,21 +108,7 @@ if AWS_SECRET_KEY: print(f'AWS_SECRET_KEY found in environment variables')
108
 
109
  DOCUMENT_REDACTION_BUCKET = get_or_create_env_var('DOCUMENT_REDACTION_BUCKET', '')
110
 
111
- ### WHOLE DOCUMENT API OPTIONS
112
-
113
- SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS = get_or_create_env_var('SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS', 'False') # This feature not currently implemented
114
-
115
- TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET = get_or_create_env_var('TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET', '')
116
-
117
- TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER = get_or_create_env_var('TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER', 'input')
118
 
119
- TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_OUTPUT_SUBFOLDER = get_or_create_env_var('TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_OUTPUT_SUBFOLDER', 'output')
120
-
121
- LOAD_PREVIOUS_TEXTRACT_JOBS_S3 = get_or_create_env_var('LOAD_PREVIOUS_TEXTRACT_JOBS_S3', 'False') # Whether or not to load previous Textract jobs from S3
122
-
123
- TEXTRACT_JOBS_S3_LOC = get_or_create_env_var('TEXTRACT_JOBS_S3_LOC', 'output') # Subfolder in the DOCUMENT_REDACTION_BUCKET where the Textract jobs are stored
124
-
125
- TEXTRACT_JOBS_LOCAL_LOC = get_or_create_env_var('TEXTRACT_JOBS_LOCAL_LOC', 'output') # Local subfolder where the Textract jobs are stored
126
 
127
  # Custom headers e.g. if routing traffic through Cloudfront
128
  # Retrieving or setting CUSTOM_HEADER
@@ -191,7 +177,6 @@ CSV_ACCESS_LOG_HEADERS = get_or_create_env_var('CSV_ACCESS_LOG_HEADERS', '') # I
191
  CSV_FEEDBACK_LOG_HEADERS = get_or_create_env_var('CSV_FEEDBACK_LOG_HEADERS', '') # If blank, uses component labels
192
  CSV_USAGE_LOG_HEADERS = get_or_create_env_var('CSV_USAGE_LOG_HEADERS', '["session_hash_textbox", "doc_full_file_name_textbox", "data_full_file_name_textbox", "actual_time_taken_number", "total_page_count", "textract_query_number", "pii_detection_method", "comprehend_query_number", "cost_code", "textract_handwriting_signature", "host_name_textbox", "text_extraction_method", "is_this_a_textract_api_call"]') # If blank, uses component labels
193
 
194
-
195
  ### DYNAMODB logs. Whether to save to DynamoDB, and the headers of the table
196
 
197
  SAVE_LOGS_TO_DYNAMODB = get_or_create_env_var('SAVE_LOGS_TO_DYNAMODB', 'False')
@@ -260,6 +245,8 @@ S3_ALLOW_LIST_PATH = get_or_create_env_var('S3_ALLOW_LIST_PATH', '') # default_a
260
  if ALLOW_LIST_PATH: OUTPUT_ALLOW_LIST_PATH = ALLOW_LIST_PATH
261
  else: OUTPUT_ALLOW_LIST_PATH = 'config/default_allow_list.csv'
262
 
 
 
263
  SHOW_COSTS = get_or_create_env_var('SHOW_COSTS', 'False')
264
 
265
  GET_COST_CODES = get_or_create_env_var('GET_COST_CODES', 'True')
@@ -275,4 +262,20 @@ else: OUTPUT_COST_CODES_PATH = ''
275
 
276
  ENFORCE_COST_CODES = get_or_create_env_var('ENFORCE_COST_CODES', 'False') # If you have cost codes listed, is it compulsory to choose one before redacting?
277
 
278
- if ENFORCE_COST_CODES == 'True': GET_COST_CODES = 'True'
108
 
109
  DOCUMENT_REDACTION_BUCKET = get_or_create_env_var('DOCUMENT_REDACTION_BUCKET', '')
110
 
111
 
112
 
113
  # Custom headers e.g. if routing traffic through Cloudfront
114
  # Retrieving or setting CUSTOM_HEADER
 
177
  CSV_FEEDBACK_LOG_HEADERS = get_or_create_env_var('CSV_FEEDBACK_LOG_HEADERS', '') # If blank, uses component labels
178
  CSV_USAGE_LOG_HEADERS = get_or_create_env_var('CSV_USAGE_LOG_HEADERS', '["session_hash_textbox", "doc_full_file_name_textbox", "data_full_file_name_textbox", "actual_time_taken_number", "total_page_count", "textract_query_number", "pii_detection_method", "comprehend_query_number", "cost_code", "textract_handwriting_signature", "host_name_textbox", "text_extraction_method", "is_this_a_textract_api_call"]') # If blank, uses component labels
179
 
 
180
  ### DYNAMODB logs. Whether to save to DynamoDB, and the headers of the table
181
 
182
  SAVE_LOGS_TO_DYNAMODB = get_or_create_env_var('SAVE_LOGS_TO_DYNAMODB', 'False')
 
245
  if ALLOW_LIST_PATH: OUTPUT_ALLOW_LIST_PATH = ALLOW_LIST_PATH
246
  else: OUTPUT_ALLOW_LIST_PATH = 'config/default_allow_list.csv'
247
 
248
+ ### COST CODE OPTIONS
249
+
250
  SHOW_COSTS = get_or_create_env_var('SHOW_COSTS', 'False')
251
 
252
  GET_COST_CODES = get_or_create_env_var('GET_COST_CODES', 'True')
 
262
 
263
  ENFORCE_COST_CODES = get_or_create_env_var('ENFORCE_COST_CODES', 'False') # If you have cost codes listed, is it compulsory to choose one before redacting?
264
 
265
+ if ENFORCE_COST_CODES == 'True': GET_COST_CODES = 'True'
266
+
267
+ ### WHOLE DOCUMENT API OPTIONS
268
+
269
+ SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS = get_or_create_env_var('SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS', 'False') # This feature not currently implemented
270
+
271
+ TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET = get_or_create_env_var('TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET', '')
272
+
273
+ TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER = get_or_create_env_var('TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER', 'input')
274
+
275
+ TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_OUTPUT_SUBFOLDER = get_or_create_env_var('TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_OUTPUT_SUBFOLDER', 'output')
276
+
277
+ LOAD_PREVIOUS_TEXTRACT_JOBS_S3 = get_or_create_env_var('LOAD_PREVIOUS_TEXTRACT_JOBS_S3', 'False') # Whether or not to load previous Textract jobs from S3
278
+
279
+ TEXTRACT_JOBS_S3_LOC = get_or_create_env_var('TEXTRACT_JOBS_S3_LOC', 'output') # Subfolder in the DOCUMENT_REDACTION_BUCKET where the Textract jobs are stored
280
+
281
+ TEXTRACT_JOBS_LOCAL_LOC = get_or_create_env_var('TEXTRACT_JOBS_LOCAL_LOC', 'output') # Local subfolder where the Textract jobs are stored
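
Every setting in tools/config.py goes through `get_or_create_env_var`. A minimal sketch of the usual pattern behind such a helper, assuming it simply falls back to (and registers) the default when the variable is unset - the project's own version may also log or validate:

```python
import os

# Sketch only - illustrates the typical behaviour of a get_or_create_env_var helper.
def get_or_create_env_var_sketch(var_name: str, default_value: str) -> str:
    """Return the environment variable's value, setting it to the default if it is missing."""
    value = os.environ.get(var_name)
    if value is None:
        os.environ[var_name] = default_value
        value = default_value
    return value
```
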
tools/custom_csvlogger.py CHANGED
@@ -15,6 +15,9 @@ from typing import TYPE_CHECKING, Any
15
  from gradio_client import utils as client_utils
16
  import gradio as gr
17
  from gradio import utils, wasm_utils
 
 
 
18
 
19
  if TYPE_CHECKING:
20
  from gradio.components import Component
@@ -202,12 +205,30 @@ class CSVLogger_custom(FlaggingCallback):
202
  line_count = len(list(csv.reader(csvfile))) - 1
203
 
204
  if save_to_dynamodb == True:
205
- if dynamodb_table_name is None:
206
- raise ValueError("You must provide a dynamodb_table_name if save_to_dynamodb is True")
207
-
208
- dynamodb = boto3.resource('dynamodb')
209
- client = boto3.client('dynamodb')
210
 
 
211
 
212
  if dynamodb_headers:
213
  dynamodb_headers = dynamodb_headers
 
15
  from gradio_client import utils as client_utils
16
  import gradio as gr
17
  from gradio import utils, wasm_utils
18
+ from tools.config import AWS_REGION, AWS_ACCESS_KEY, AWS_SECRET_KEY, RUN_AWS_FUNCTIONS
19
+ from botocore.exceptions import NoCredentialsError, TokenRetrievalError
20
+
21
 
22
  if TYPE_CHECKING:
23
  from gradio.components import Component
 
205
  line_count = len(list(csv.reader(csvfile))) - 1
206
 
207
  if save_to_dynamodb == True:
 
 
 
 
 
208
 
209
+ if RUN_AWS_FUNCTIONS == "1":
210
+ try:
211
+ print("Connecting to DynamoDB via existing SSO connection")
212
+ dynamodb = boto3.resource('dynamodb', region_name=AWS_REGION)
213
+ #client = boto3.client('dynamodb')
214
+
215
+ test_connection = dynamodb.meta.client.list_tables()
216
+
217
+ except Exception as e:
218
+ print("No SSO credentials found:", e)
219
+ if AWS_ACCESS_KEY and AWS_SECRET_KEY:
220
+ print("Trying DynamoDB credentials from environment variables")
221
+ dynamodb = boto3.resource('dynamodb',aws_access_key_id=AWS_ACCESS_KEY,
222
+ aws_secret_access_key=AWS_SECRET_KEY, region_name=AWS_REGION)
223
+ # client = boto3.client('dynamodb',aws_access_key_id=AWS_ACCESS_KEY,
224
+ # aws_secret_access_key=AWS_SECRET_KEY, region_name=AWS_REGION)
225
+ else:
226
+ raise Exception("AWS credentials for DynamoDB logging not found")
227
+ else:
228
+ raise Exception("AWS credentials for DynamoDB logging not found")
229
+
230
+ if dynamodb_table_name is None:
231
+ raise ValueError("You must provide a dynamodb_table_name if save_to_dynamodb is True")
232
 
233
  if dynamodb_headers:
234
  dynamodb_headers = dynamodb_headers
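
The DynamoDB changes above fold credential discovery into the logging call: try the ambient SSO/role credentials first, then fall back to explicit keys from tools.config. The same order, condensed into a standalone helper as a sketch (the region and key names are assumed to match this commit's config):

```python
import boto3

# Sketch of the credential fallback used above; aws_region, aws_access_key and
# aws_secret_key are assumed to come from tools.config as in this commit.
def get_dynamodb_resource_sketch(aws_region: str, aws_access_key: str = "", aws_secret_key: str = ""):
    try:
        dynamodb = boto3.resource("dynamodb", region_name=aws_region)
        dynamodb.meta.client.list_tables()  # cheap call to confirm the ambient credentials work
        return dynamodb
    except Exception:
        if aws_access_key and aws_secret_key:
            return boto3.resource(
                "dynamodb",
                aws_access_key_id=aws_access_key,
                aws_secret_access_key=aws_secret_key,
                region_name=aws_region,
            )
        raise Exception("AWS credentials for DynamoDB logging not found")
```
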
tools/custom_image_analyser_engine.py CHANGED
@@ -775,9 +775,52 @@ def merge_text_bounding_boxes(analyser_results:dict, characters: List[LTChar], c
775
 
776
  return analysed_bounding_boxes
777
 
778
- # Function to combine OCR results into line-level results
779
- def combine_ocr_results(ocr_results:dict, x_threshold:float=50.0, y_threshold:float=12.0):
780
- # Group OCR results into lines based on y_threshold
 
781
  lines = []
782
  current_line = []
783
  for result in sorted(ocr_results, key=lambda x: x.top):
@@ -796,26 +839,11 @@ def combine_ocr_results(ocr_results:dict, x_threshold:float=50.0, y_threshold:fl
796
  # Flatten the sorted lines back into a single list
797
  sorted_results = [result for line in lines for result in line]
798
 
799
- combined_results = []
800
- new_format_results = {}
801
  current_line = []
802
  current_bbox = None
803
- line_counter = 1
804
-
805
- def create_ocr_result_with_children(combined_results, i, current_bbox, current_line):
806
- combined_results["text_line_" + str(i)] = {
807
- "line": i,
808
- 'text': current_bbox.text,
809
- 'bounding_box': (current_bbox.left, current_bbox.top,
810
- current_bbox.left + current_bbox.width,
811
- current_bbox.top + current_bbox.height),
812
- 'words': [{'text': word.text,
813
- 'bounding_box': (word.left, word.top,
814
- word.left + word.width,
815
- word.top + word.height)}
816
- for word in current_line]
817
- }
818
- return combined_results["text_line_" + str(i)]
819
 
820
  for result in sorted_results:
821
  if not current_line:
@@ -841,22 +869,98 @@ def combine_ocr_results(ocr_results:dict, x_threshold:float=50.0, y_threshold:fl
841
  else:
842
 
843
  # Commit the current line and start a new one
844
- combined_results.append(current_bbox)
845
 
846
- new_format_results["text_line_" + str(line_counter)] = create_ocr_result_with_children(new_format_results, line_counter, current_bbox, current_line)
 
847
 
848
  line_counter += 1
849
  current_line = [result]
850
  current_bbox = result
851
-
852
  # Append the last line
853
  if current_bbox:
854
- combined_results.append(current_bbox)
 
 
 
855
 
856
- new_format_results["text_line_" + str(line_counter)] = create_ocr_result_with_children(new_format_results, line_counter, current_bbox, current_line)
 
 
857
 
 
858
 
859
- return combined_results, new_format_results
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
860
 
861
  class CustomImageAnalyzerEngine:
862
  def __init__(
@@ -910,7 +1014,7 @@ class CustomImageAnalyzerEngine:
910
  def analyze_text(
911
  self,
912
  line_level_ocr_results: List[OCRResult],
913
- ocr_results_with_children: Dict[str, Dict],
914
  chosen_redact_comprehend_entities: List[str],
915
  pii_identification_method: str = "Local",
916
  comprehend_client = "",
@@ -1035,9 +1139,9 @@ class CustomImageAnalyzerEngine:
1035
  combined_results = []
1036
  for i, text_line in enumerate(line_level_ocr_results):
1037
  line_results = next((results for idx, results in all_text_line_results if idx == i), [])
1038
- if line_results and i < len(ocr_results_with_children):
1039
- child_level_key = list(ocr_results_with_children.keys())[i]
1040
- ocr_results_with_children_line_level = ocr_results_with_children[child_level_key]
1041
 
1042
  for result in line_results:
1043
  bbox_results = self.map_analyzer_results_to_bounding_boxes(
@@ -1051,7 +1155,7 @@ class CustomImageAnalyzerEngine:
1051
  )],
1052
  text_line.text,
1053
  text_analyzer_kwargs.get('allow_list', []),
1054
- ocr_results_with_children_line_level
1055
  )
1056
  combined_results.extend(bbox_results)
1057
 
@@ -1063,14 +1167,14 @@ class CustomImageAnalyzerEngine:
1063
  redaction_relevant_ocr_results: List[OCRResult],
1064
  full_text: str,
1065
  allow_list: List[str],
1066
- ocr_results_with_children_child_info: Dict[str, Dict]
1067
  ) -> List[CustomImageRecognizerResult]:
1068
  redaction_bboxes = []
1069
 
1070
  for redaction_relevant_ocr_result in redaction_relevant_ocr_results:
1071
- #print("ocr_results_with_children_child_info:", ocr_results_with_children_child_info)
1072
 
1073
- line_text = ocr_results_with_children_child_info['text']
1074
  line_length = len(line_text)
1075
  redaction_text = redaction_relevant_ocr_result.text
1076
 
@@ -1096,7 +1200,7 @@ class CustomImageAnalyzerEngine:
1096
 
1097
  # print(f"Found match: '{matched_text}' in line")
1098
 
1099
- # for word_info in ocr_results_with_children_child_info.get('words', []):
1100
  # # Check if this word is part of our match
1101
  # if any(word.lower() in word_info['text'].lower() for word in matched_words):
1102
  # matching_word_boxes.append(word_info['bounding_box'])
@@ -1105,11 +1209,11 @@ class CustomImageAnalyzerEngine:
1105
  # Find the corresponding words in the OCR results
1106
  matching_word_boxes = []
1107
 
1108
- #print("ocr_results_with_children_child_info:", ocr_results_with_children_child_info)
1109
 
1110
  current_position = 0
1111
 
1112
- for word_info in ocr_results_with_children_child_info.get('words', []):
1113
  word_text = word_info['text']
1114
  word_length = len(word_text)
1115
 
 
775
 
776
  return analysed_bounding_boxes
777
 
778
+ def recreate_page_line_level_ocr_results_with_page(page_line_level_ocr_results_with_words: dict):
779
+ reconstructed_results = []
780
+
781
+ # Assume all lines belong to the same page, so we can just read it from one item
782
+ #page = next(iter(page_line_level_ocr_results_with_words.values()))["page"]
783
+
784
+ page = page_line_level_ocr_results_with_words["page"]
785
+
786
+ for line_data in page_line_level_ocr_results_with_words["results"].values():
787
+ bbox = line_data["bounding_box"]
788
+ text = line_data["text"]
789
+
790
+ # Recreate the OCRResult (you'll need the OCRResult class imported)
791
+ line_result = OCRResult(
792
+ text=text,
793
+ left=bbox[0],
794
+ top=bbox[1],
795
+ width=bbox[2] - bbox[0],
796
+ height=bbox[3] - bbox[1],
797
+ )
798
+ reconstructed_results.append(line_result)
799
+
800
+ page_line_level_ocr_results_with_page = {"page": page, "results": reconstructed_results}
801
+
802
+ return page_line_level_ocr_results_with_page
803
+
804
+ def create_ocr_result_with_children(combined_results:dict, i:int, current_bbox:dict, current_line:list):
805
+ combined_results["text_line_" + str(i)] = {
806
+ "line": i,
807
+ 'text': current_bbox.text,
808
+ 'bounding_box': (current_bbox.left, current_bbox.top,
809
+ current_bbox.left + current_bbox.width,
810
+ current_bbox.top + current_bbox.height),
811
+ 'words': [{'text': word.text,
812
+ 'bounding_box': (word.left, word.top,
813
+ word.left + word.width,
814
+ word.top + word.height)}
815
+ for word in current_line]
816
+ }
817
+ return combined_results["text_line_" + str(i)]
818
+
819
+ def combine_ocr_results(ocr_results: dict, x_threshold: float = 50.0, y_threshold: float = 12.0, page: int = 1):
820
+ '''
821
+ Group OCR results into lines based on y_threshold. Create line level ocr results, and word level OCR results
822
+ '''
823
+
824
  lines = []
825
  current_line = []
826
  for result in sorted(ocr_results, key=lambda x: x.top):
 
839
  # Flatten the sorted lines back into a single list
840
  sorted_results = [result for line in lines for result in line]
841
 
842
+ page_line_level_ocr_results = []
843
+ page_line_level_ocr_results_with_words = {}
844
  current_line = []
845
  current_bbox = None
846
+ line_counter = 1
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
847
 
848
  for result in sorted_results:
849
  if not current_line:
 
869
  else:
870
 
871
  # Commit the current line and start a new one
872
+ page_line_level_ocr_results.append(current_bbox)
873
 
874
+ page_line_level_ocr_results_with_words["text_line_" + str(line_counter)] = create_ocr_result_with_children(page_line_level_ocr_results_with_words, line_counter, current_bbox, current_line)
875
+ #page_line_level_ocr_results_with_words["text_line_" + str(line_counter)]["page"] = page
876
 
877
  line_counter += 1
878
  current_line = [result]
879
  current_bbox = result
 
880
  # Append the last line
881
  if current_bbox:
882
+ page_line_level_ocr_results.append(current_bbox)
883
+
884
+ page_line_level_ocr_results_with_words["text_line_" + str(line_counter)] = create_ocr_result_with_children(page_line_level_ocr_results_with_words, line_counter, current_bbox, current_line)
885
+ #page_line_level_ocr_results_with_words["text_line_" + str(line_counter)]["page"] = page
886
 
887
+ # Add page key to the line level results
888
+ page_line_level_ocr_results_with_page = {"page": page, "results": page_line_level_ocr_results}
889
+ page_line_level_ocr_results_with_words = {"page": page, "results": page_line_level_ocr_results_with_words}
890
 
891
+ return page_line_level_ocr_results_with_page, page_line_level_ocr_results_with_words
892
 
893
+
894
+ # Function to combine OCR results into line-level results
895
+ # def combine_ocr_results(ocr_results:dict, x_threshold:float=50.0, y_threshold:float=12.0):
896
+ # '''
897
+ # Group OCR results into lines based on y_threshold. Create line level ocr results, and word level OCR results
898
+ # '''
899
+
900
+ # lines = []
901
+ # current_line = []
902
+ # for result in sorted(ocr_results, key=lambda x: x.top):
903
+ # if not current_line or abs(result.top - current_line[0].top) <= y_threshold:
904
+ # current_line.append(result)
905
+ # else:
906
+ # lines.append(current_line)
907
+ # current_line = [result]
908
+ # if current_line:
909
+ # lines.append(current_line)
910
+
911
+ # # Sort each line by left position
912
+ # for line in lines:
913
+ # line.sort(key=lambda x: x.left)
914
+
915
+ # # Flatten the sorted lines back into a single list
916
+ # sorted_results = [result for line in lines for result in line]
917
+
918
+ # page_line_level_ocr_results = []
919
+ # page_line_level_ocr_results_with_words = {}
920
+ # current_line = []
921
+ # current_bbox = None
922
+ # line_counter = 1
923
+
924
+ # for result in sorted_results:
925
+ # if not current_line:
926
+ # # Start a new line
927
+ # current_line.append(result)
928
+ # current_bbox = result
929
+ # else:
930
+ # # Check if the result is on the same line (y-axis) and close horizontally (x-axis)
931
+ # last_result = current_line[-1]
932
+
933
+ # if abs(result.top - last_result.top) <= y_threshold and \
934
+ # (result.left - (last_result.left + last_result.width)) <= x_threshold:
935
+ # # Update the bounding box to include the new word
936
+ # new_right = max(current_bbox.left + current_bbox.width, result.left + result.width)
937
+ # current_bbox = OCRResult(
938
+ # text=f"{current_bbox.text} {result.text}",
939
+ # left=current_bbox.left,
940
+ # top=current_bbox.top,
941
+ # width=new_right - current_bbox.left,
942
+ # height=max(current_bbox.height, result.height)
943
+ # )
944
+ # current_line.append(result)
945
+ # else:
946
+
947
+ # # Commit the current line and start a new one
948
+ # page_line_level_ocr_results.append(current_bbox)
949
+
950
+ # page_line_level_ocr_results_with_words["text_line_" + str(line_counter)] = create_ocr_result_with_children(page_line_level_ocr_results_with_words, line_counter, current_bbox, current_line)
951
+
952
+ # line_counter += 1
953
+ # current_line = [result]
954
+ # current_bbox = result
955
+
956
+ # # Append the last line
957
+ # if current_bbox:
958
+ # page_line_level_ocr_results.append(current_bbox)
959
+
960
+ # page_line_level_ocr_results_with_words["text_line_" + str(line_counter)] = create_ocr_result_with_children(page_line_level_ocr_results_with_words, line_counter, current_bbox, current_line)
961
+
962
+
963
+ # return page_line_level_ocr_results, page_line_level_ocr_results_with_words
964
 
965
  class CustomImageAnalyzerEngine:
966
  def __init__(
 
1014
  def analyze_text(
1015
  self,
1016
  line_level_ocr_results: List[OCRResult],
1017
+ ocr_results_with_words: Dict[str, Dict],
1018
  chosen_redact_comprehend_entities: List[str],
1019
  pii_identification_method: str = "Local",
1020
  comprehend_client = "",
 
1139
  combined_results = []
1140
  for i, text_line in enumerate(line_level_ocr_results):
1141
  line_results = next((results for idx, results in all_text_line_results if idx == i), [])
1142
+ if line_results and i < len(ocr_results_with_words):
1143
+ child_level_key = list(ocr_results_with_words.keys())[i]
1144
+ ocr_results_with_words_line_level = ocr_results_with_words[child_level_key]
1145
 
1146
  for result in line_results:
1147
  bbox_results = self.map_analyzer_results_to_bounding_boxes(
 
1155
  )],
1156
  text_line.text,
1157
  text_analyzer_kwargs.get('allow_list', []),
1158
+ ocr_results_with_words_line_level
1159
  )
1160
  combined_results.extend(bbox_results)
1161
 
 
1167
  redaction_relevant_ocr_results: List[OCRResult],
1168
  full_text: str,
1169
  allow_list: List[str],
1170
+ ocr_results_with_words_child_info: Dict[str, Dict]
1171
  ) -> List[CustomImageRecognizerResult]:
1172
  redaction_bboxes = []
1173
 
1174
  for redaction_relevant_ocr_result in redaction_relevant_ocr_results:
1175
+ #print("ocr_results_with_words_child_info:", ocr_results_with_words_child_info)
1176
 
1177
+ line_text = ocr_results_with_words_child_info['text']
1178
  line_length = len(line_text)
1179
  redaction_text = redaction_relevant_ocr_result.text
1180
 
 
1200
 
1201
  # print(f"Found match: '{matched_text}' in line")
1202
 
1203
+ # for word_info in ocr_results_with_words_child_info.get('words', []):
1204
  # # Check if this word is part of our match
1205
  # if any(word.lower() in word_info['text'].lower() for word in matched_words):
1206
  # matching_word_boxes.append(word_info['bounding_box'])
 
1209
  # Find the corresponding words in the OCR results
1210
  matching_word_boxes = []
1211
 
1212
+ #print("ocr_results_with_words_child_info:", ocr_results_with_words_child_info)
1213
 
1214
  current_position = 0
1215
 
1216
+ for word_info in ocr_results_with_words_child_info.get('words', []):
1217
  word_text = word_info['text']
1218
  word_length = len(word_text)
1219
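The refactor above makes `combine_ocr_results` return page-keyed structures, which is what lets local OCR output be written to JSON and reloaded later, while `recreate_page_line_level_ocr_results_with_page` rebuilds line-level `OCRResult` objects from the word-level form. A rough sketch of the two shapes involved (field names are taken from the diff; the example values are made up):

```python
# Line-level results for one page, as returned by the new combine_ocr_results:
page_line_level_ocr_results_with_page = {
    "page": 1,
    "results": [
        # OCRResult(text="Example line", left=50, top=100, width=200, height=12), ...
    ],
}

# Word-level results for the same page, serialisable to '..._ocr_results_with_words.json':
page_line_level_ocr_results_with_words = {
    "page": 1,
    "results": {
        "text_line_1": {
            "line": 1,
            "text": "Example line",
            "bounding_box": (50, 100, 250, 112),  # (xmin, ymin, xmax, ymax)
            "words": [
                {"text": "Example", "bounding_box": (50, 100, 140, 112)},
                {"text": "line", "bounding_box": (150, 100, 250, 112)},
            ],
        },
    },
}

# recreate_page_line_level_ocr_results_with_page() rebuilds the first structure from the second,
# which is how a reloaded JSON file can stand in for a fresh local OCR pass.
```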
 
tools/data_anonymise.py CHANGED
@@ -1,10 +1,12 @@
 import re
+import os
 import secrets
 import base64
 import time
 import boto3
 import botocore
 import pandas as pd
+from openpyxl import Workbook, load_workbook
 
 from faker import Faker
 from gradio import Progress
@@ -226,6 +228,7 @@ def anonymise_data_files(file_paths: List[str],
                          comprehend_query_number:int=0,
                          aws_access_key_textbox:str='',
                          aws_secret_key_textbox:str='',
+                         actual_time_taken_number:float=0,
                          progress: Progress = Progress(track_tqdm=True)):
     """
     This function anonymises data files based on the provided parameters.
@@ -252,6 +255,7 @@ def anonymise_data_files(file_paths: List[str],
     - comprehend_query_number (int, optional): A counter tracking the number of queries to AWS Comprehend.
     - aws_access_key_textbox (str, optional): AWS access key for account with Textract and Comprehend permissions.
     - aws_secret_key_textbox (str, optional): AWS secret key for account with Textract and Comprehend permissions.
+    - actual_time_taken_number (float, optional): Time taken to do the redaction.
     - progress (Progress, optional): A Progress object to track progress. Defaults to a Progress object with track_tqdm=True.
     """
 
@@ -277,9 +281,16 @@ def anonymise_data_files(file_paths: List[str],
     if not out_file_paths:
         out_file_paths = []
 
-
-    if in_allow_list:
-        in_allow_list_flat = in_allow_list #[item for sublist in in_allow_list for item in sublist]
+    if isinstance(in_allow_list, list):
+        if in_allow_list:
+            in_allow_list_flat = in_allow_list
+        else:
+            in_allow_list_flat = []
+    elif isinstance(in_allow_list, pd.DataFrame):
+        if not in_allow_list.empty:
+            in_allow_list_flat = list(in_allow_list.iloc[:, 0].unique())
+        else:
+            in_allow_list_flat = []
     else:
         in_allow_list_flat = []
 
@@ -306,7 +317,7 @@ def anonymise_data_files(file_paths: List[str],
         else:
             comprehend_client = ""
             out_message = "Cannot connect to AWS Comprehend service. Please provide access keys under Textract settings on the Redaction settings tab, or choose another PII identification method."
-            print(out_message)
+            raise(out_message)
 
     # Check if files and text exist
     if not file_paths:
@@ -314,7 +325,7 @@ def anonymise_data_files(file_paths: List[str],
             file_paths=['open_text']
         else:
             out_message = "Please enter text or a file to redact."
-            return out_message, out_file_paths, out_file_paths, latest_file_completed, log_files_output_paths, log_files_output_paths
+            raise Exception(out_message)
 
     # If we have already redacted the last file, return the input out_message and file list to the relevant components
     if latest_file_completed >= len(file_paths):
@@ -322,18 +333,18 @@ def anonymise_data_files(file_paths: List[str],
 
         # Set to a very high number so as not to mess with subsequent file processing by the user
         latest_file_completed = 99
         final_out_message = '\n'.join(out_message)
-        return final_out_message, out_file_paths, out_file_paths, latest_file_completed, log_files_output_paths, log_files_output_paths
+        return final_out_message, out_file_paths, out_file_paths, latest_file_completed, log_files_output_paths, log_files_output_paths, actual_time_taken_number
 
     file_path_loop = [file_paths[int(latest_file_completed)]]
 
-    for anon_file in progress.tqdm(file_path_loop, desc="Anonymising files", unit = "file"):
+    for anon_file in progress.tqdm(file_path_loop, desc="Anonymising files", unit = "files"):
 
         if anon_file=='open_text':
             anon_df = pd.DataFrame(data={'text':[in_text]})
             chosen_cols=['text']
+            out_file_part = anon_file
             sheet_name = ""
             file_type = ""
-            out_file_part = anon_file
 
             out_file_paths, out_message, key_string, log_files_output_paths = anon_wrapper_func(anon_file, anon_df, chosen_cols, out_file_paths, out_file_part, out_message, sheet_name, anon_strat, language, chosen_redact_entities, in_allow_list, file_type, "", log_files_output_paths, in_deny_list, max_fuzzy_spelling_mistakes_num, pii_identification_method, chosen_redact_comprehend_entities, comprehend_query_number, comprehend_client, output_folder=OUTPUT_FOLDER)
         else:
@@ -350,26 +361,22 @@ def anonymise_data_files(file_paths: List[str],
                     out_message.append("No Excel sheets selected. Please select at least one to anonymise.")
                     continue
 
-                anon_xlsx = pd.ExcelFile(anon_file)
-
                 # Create xlsx file:
-                anon_xlsx_export_file_name = output_folder + out_file_part + "_redacted.xlsx"
-
-                from openpyxl import Workbook
-
-                wb = Workbook()
-                wb.save(anon_xlsx_export_file_name)
+                anon_xlsx = pd.ExcelFile(anon_file)
+                anon_xlsx_export_file_name = output_folder + out_file_part + "_redacted.xlsx"
 
                 # Iterate through the sheet names
-                for sheet_name in in_excel_sheets:
+                for sheet_name in progress.tqdm(in_excel_sheets, desc="Anonymising sheets", unit = "sheets"):
                     # Read each sheet into a DataFrame
                     if sheet_name not in anon_xlsx.sheet_names:
                         continue
 
                     anon_df = pd.read_excel(anon_file, sheet_name=sheet_name)
 
-                    out_file_paths, out_message, key_string, log_files_output_paths = anon_wrapper_func(anon_file, anon_df, chosen_cols, out_file_paths, out_file_part, out_message, sheet_name, anon_strat, language, chosen_redact_entities, in_allow_list, file_type, "", log_files_output_paths, in_deny_list, max_fuzzy_spelling_mistakes_num, pii_identification_method, chosen_redact_comprehend_entities, comprehend_query_number, comprehend_client, output_folder=output_folder)
-
+                    out_file_paths, out_message, key_string, log_files_output_paths = anon_wrapper_func(anon_file, anon_df, chosen_cols, out_file_paths, out_file_part, out_message, sheet_name, anon_strat, language, chosen_redact_entities, in_allow_list, file_type, anon_xlsx_export_file_name, log_files_output_paths, in_deny_list, max_fuzzy_spelling_mistakes_num, pii_identification_method, chosen_redact_comprehend_entities, comprehend_query_number, comprehend_client, output_folder=output_folder)
+
             else:
                 sheet_name = ""
                 anon_df = read_file(anon_file)
@@ -380,23 +387,28 @@ def anonymise_data_files(file_paths: List[str],
         # Increase latest file completed count unless we are at the last file
         if latest_file_completed != len(file_paths):
             print("Completed file number:", str(latest_file_completed))
-            latest_file_completed += 1
+            latest_file_completed += 1
 
         toc = time.perf_counter()
-        out_time = f"in {toc - tic:0.1f} seconds."
-        print(out_time)
-
-        if anon_strat == "encrypt":
-            out_message.append(". Your decryption key is " + key_string + ".")
+        out_time_float = toc - tic
+        out_time = f"in {out_time_float:0.1f} seconds."
+        print(out_time)
+
+        actual_time_taken_number += out_time_float
 
         out_message.append("Anonymisation of file '" + out_file_part + "' successfully completed in")
 
     out_message_out = '\n'.join(out_message)
     out_message_out = out_message_out + " " + out_time
 
+    if anon_strat == "encrypt":
+        out_message_out.append(". Your decryption key is " + key_string)
+
     out_message_out = out_message_out + "\n\nGo to to the Redaction settings tab to see redaction logs. Please give feedback on the results below to help improve this app."
+
+    out_message_out = re.sub(r'^\n+|^\. ', '', out_message_out).strip()
 
-    return out_message_out, out_file_paths, out_file_paths, latest_file_completed, log_files_output_paths, log_files_output_paths
+    return out_message_out, out_file_paths, out_file_paths, latest_file_completed, log_files_output_paths, log_files_output_paths, actual_time_taken_number
 
 def anon_wrapper_func(
     anon_file: str,
@@ -495,7 +507,6 @@ def anon_wrapper_func(
         anon_df_out = anon_df_out[all_cols_original_order]
 
         # Export file
-
         # Rename anonymisation strategy for file path naming
         if anon_strat == "replace with 'REDACTED'": anon_strat_txt = "redact_replace"
         elif anon_strat == "replace with <ENTITY_NAME>": anon_strat_txt = "redact_entity_type"
@@ -507,8 +518,14 @@ def anon_wrapper_func(
 
             anon_export_file_name = anon_xlsx_export_file_name
 
+            if not os.path.exists(anon_xlsx_export_file_name):
+                wb = Workbook()
+                ws = wb.active # Get the default active sheet
+                ws.title = excel_sheet_name
+                wb.save(anon_xlsx_export_file_name)
+
             # Create a Pandas Excel writer using XlsxWriter as the engine.
-            with pd.ExcelWriter(anon_xlsx_export_file_name, engine='openpyxl', mode='a') as writer:
+            with pd.ExcelWriter(anon_xlsx_export_file_name, engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
                 # Write each DataFrame to a different worksheet.
                 anon_df_out.to_excel(writer, sheet_name=excel_sheet_name, index=None)
 
@@ -532,7 +549,7 @@ def anon_wrapper_func(
 
     # Print result text to output text box if just anonymising open text
     if anon_file=='open_text':
-        out_message = [anon_df_out['text'][0]]
+        out_message = ["'" + anon_df_out['text'][0] + "'"]
 
     return out_file_paths, out_message, key_string, log_files_output_paths
 
@@ -551,8 +568,16 @@ def anonymise_script(df:pd.DataFrame, anon_strat:str, language:str, chosen_redac
     # DataFrame to dict
     df_dict = df.to_dict(orient="list")
 
-    if in_allow_list:
-        in_allow_list_flat = in_allow_list #[item for sublist in in_allow_list for item in sublist]
+    if isinstance(in_allow_list, list):
+        if in_allow_list:
+            in_allow_list_flat = in_allow_list
+        else:
+            in_allow_list_flat = []
+    elif isinstance(in_allow_list, pd.DataFrame):
+        if not in_allow_list.empty:
+            in_allow_list_flat = list(in_allow_list.iloc[:, 0].unique())
+        else:
+            in_allow_list_flat = []
     else:
         in_allow_list_flat = []
 
@@ -577,11 +602,8 @@ def anonymise_script(df:pd.DataFrame, anon_strat:str, language:str, chosen_redac
 
     #analyzer = nlp_analyser #AnalyzerEngine()
     batch_analyzer = BatchAnalyzerEngine(analyzer_engine=nlp_analyser)
-
     anonymizer = AnonymizerEngine()#conflict_resolution=ConflictResolutionStrategy.MERGE_SIMILAR_OR_CONTAINED)
-
-    batch_anonymizer = BatchAnonymizerEngine(anonymizer_engine = anonymizer)
-
+    batch_anonymizer = BatchAnonymizerEngine(anonymizer_engine = anonymizer)
     analyzer_results = []
 
     if pii_identification_method == "Local":
@@ -692,12 +714,6 @@ def anonymise_script(df:pd.DataFrame, anon_strat:str, language:str, chosen_redac
     analyse_time_out = f"Analysing the text took {analyse_toc - analyse_tic:0.1f} seconds."
     print(analyse_time_out)
 
-    # Create faker function (note that it has to receive a value)
-    #fake = Faker("en_UK")
-
-    #def fake_first_name(x):
-    #    return fake.first_name()
-
     # Set up the anonymization configuration WITHOUT DATE_TIME
     simple_replace_config = eval('{"DEFAULT": OperatorConfig("replace", {"new_value": "REDACTED"})}')
     replace_config = eval('{"DEFAULT": OperatorConfig("replace")}')
@@ -714,9 +730,13 @@ def anonymise_script(df:pd.DataFrame, anon_strat:str, language:str, chosen_redac
     if anon_strat == "mask": chosen_mask_config = mask_config
     if anon_strat == "encrypt":
         chosen_mask_config = people_encrypt_config
-        # Generate a 128-bit AES key. Then encode the key using base64 to get a string representation
-        key = secrets.token_bytes(16) # 128 bits = 16 bytes
+        key = secrets.token_bytes(16) # 128 bits = 16 bytes
         key_string = base64.b64encode(key).decode('utf-8')
+
+        # Now inject the key into the operator config
+        for entity, operator in chosen_mask_config.items():
+            if operator.operator_name == "encrypt":
+                operator.params = {"key": key_string}
     elif anon_strat == "fake_first_name": chosen_mask_config = fake_first_name_config
 
     # I think in general people will want to keep date / times - removed Mar 2025 as I don't want to assume for people.
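Two of the tabular-redaction fixes above are worth spelling out: the allow list is now accepted either as a list or as a single-column DataFrame, and the Excel export creates the workbook once and then appends or replaces sheets instead of recreating the file for every sheet. A standalone sketch of that export pattern (file and sheet names are placeholders):

```python
# Sketch: write one redacted DataFrame per sheet without clobbering the workbook.
import os
import pandas as pd
from openpyxl import Workbook

def write_redacted_sheet(df: pd.DataFrame, xlsx_path: str, sheet_name: str) -> None:
    # Create the workbook on first use, naming the default sheet after the first export
    if not os.path.exists(xlsx_path):
        wb = Workbook()
        wb.active.title = sheet_name
        wb.save(xlsx_path)

    # Append mode plus if_sheet_exists='replace' lets repeated runs overwrite a sheet in place
    with pd.ExcelWriter(xlsx_path, engine="openpyxl", mode="a", if_sheet_exists="replace") as writer:
        df.to_excel(writer, sheet_name=sheet_name, index=False)
```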
tools/file_conversion.py CHANGED
@@ -462,7 +462,8 @@ def prepare_image_or_pdf(
         input_folder:str=INPUT_FOLDER,
         prepare_images:bool=True,
         page_sizes:list[dict]=[],
-        textract_output_found:bool = False,
+        textract_output_found:bool = False,
+        local_ocr_output_found:bool = False,
         progress: Progress = Progress(track_tqdm=True)
 ) -> tuple[List[str], List[str]]:
     """
@@ -484,7 +485,8 @@ def prepare_image_or_pdf(
         output_folder (optional, str): The output folder for file save
         prepare_images (optional, bool): A boolean indicating whether to create images for each PDF page. Defaults to True.
         page_sizes(optional, List[dict]): A list of dicts containing information about page sizes in various formats.
-        textract_output_found (optional, bool): A boolean indicating whether textract output has already been found . Defaults to False.
+        textract_output_found (optional, bool): A boolean indicating whether Textract analysis output has already been found. Defaults to False.
+        local_ocr_output_found (optional, bool): A boolean indicating whether local OCR analysis output has already been found. Defaults to False.
         progress (optional, Progress): Progress tracker for the operation
 
 
@@ -536,7 +538,7 @@ def prepare_image_or_pdf(
             final_out_message = '\n'.join(out_message)
         else:
             final_out_message = out_message
-        return final_out_message, converted_file_paths, image_file_paths, number_of_pages, number_of_pages, pymupdf_doc, all_annotations_object, review_file_csv, original_cropboxes, page_sizes, textract_output_found, all_img_details, all_line_level_ocr_results_df
+        return final_out_message, converted_file_paths, image_file_paths, number_of_pages, number_of_pages, pymupdf_doc, all_annotations_object, review_file_csv, original_cropboxes, page_sizes, textract_output_found, all_img_details, all_line_level_ocr_results_df, local_ocr_output_found
 
     progress(0.1, desc='Preparing file')
 
@@ -639,8 +641,8 @@ def prepare_image_or_pdf(
                 # Assuming file_path is a NamedString or similar
                 all_annotations_object = json.loads(file_path) # Use loads for string content
 
-            # Assume it's a textract json
-            elif (file_extension in ['.json']) and (prepare_for_review != True):
+            # Save Textract file to folder
+            elif (file_extension in ['.json']) and '_textract' in file_path_without_ext: #(prepare_for_review != True):
                 print("Saving Textract output")
                 # Copy it to the output folder so it can be used later.
                 output_textract_json_file_name = file_path_without_ext
@@ -654,6 +656,20 @@ def prepare_image_or_pdf(
                 textract_output_found = True
                 continue
 
+            elif (file_extension in ['.json']) and '_ocr_results_with_words' in file_path_without_ext: #(prepare_for_review != True):
+                print("Saving local OCR output")
+                # Copy it to the output folder so it can be used later.
+                output_ocr_results_with_words_json_file_name = file_path_without_ext
+                if not file_path.endswith("_ocr_results_with_words.json"): output_ocr_results_with_words_json_file_name = file_path_without_ext + "_ocr_results_with_words.json"
+                else: output_ocr_results_with_words_json_file_name = file_path_without_ext + ".json"
+
+                out_ocr_results_with_words_path = os.path.join(output_folder, output_ocr_results_with_words_json_file_name)
+
+                # Use shutil to copy the file directly
+                shutil.copy2(file_path, out_ocr_results_with_words_path) # Preserves metadata
+                local_ocr_output_found = True
+                continue
+
             # NEW IF STATEMENT
             # If you have an annotations object from the above code
             if all_annotations_object:
@@ -773,7 +789,40 @@ def prepare_image_or_pdf(
 
     number_of_pages = len(page_sizes)#len(image_file_paths)
 
-    return combined_out_message, converted_file_paths, image_file_paths, number_of_pages, number_of_pages, pymupdf_doc, all_annotations_object, review_file_csv, original_cropboxes, page_sizes, textract_output_found, all_img_details, all_line_level_ocr_results_df
+    return combined_out_message, converted_file_paths, image_file_paths, number_of_pages, number_of_pages, pymupdf_doc, all_annotations_object, review_file_csv, original_cropboxes, page_sizes, textract_output_found, all_img_details, all_line_level_ocr_results_df, local_ocr_output_found
+
+def load_and_convert_ocr_results_with_words_json(ocr_results_with_words_json_file_path:str, log_files_output_paths:str, page_sizes_df:pd.DataFrame):
+    """
+    Loads Textract JSON from a file, detects if conversion is needed, and converts if necessary.
+    """
+
+    if not os.path.exists(ocr_results_with_words_json_file_path):
+        print("No existing OCR results file found.")
+        return [], True, log_files_output_paths # Return empty dict and flag indicating missing file
+
+    no_ocr_results_with_words_file = False
+    print("Found existing OCR results json results file.")
+
+    # Track log files
+    if ocr_results_with_words_json_file_path not in log_files_output_paths:
+        log_files_output_paths.append(ocr_results_with_words_json_file_path)
+
+    try:
+        with open(ocr_results_with_words_json_file_path, 'r', encoding='utf-8') as json_file:
+            ocr_results_with_words_data = json.load(json_file)
+    except json.JSONDecodeError:
+        print("Error: Failed to parse OCR results JSON file. Returning empty data.")
+        return [], True, log_files_output_paths # Indicate failure
+
+    # Check if conversion is needed
+    if "page" and "results" in ocr_results_with_words_data[0]:
+        print("JSON already in the correct format for app. No changes needed.")
+        return ocr_results_with_words_data, False, log_files_output_paths # No conversion required
+
+    else:
+        print("Invalid OCR result JSON format: 'page' or 'results' key missing.")
+        #print("OCR results with words data:", ocr_results_with_words_data)
+        return [], True, log_files_output_paths # Return empty data if JSON is not recognized
 
 def convert_text_pdf_to_img_pdf(in_file_path:str, out_text_file_path:List[str], image_dpi:float=image_dpi, output_folder:str=OUTPUT_FOLDER, input_folder:str=INPUT_FOLDER):
     file_path_without_ext = get_file_name_without_type(in_file_path)
@@ -1280,6 +1329,8 @@ def convert_annotation_data_to_dataframe(all_annotations: List[Dict[str, Any]]):
     # but it's good practice if columns could be missing for other reasons.
     final_df = final_df.reindex(columns=final_col_order, fill_value=pd.NA)
 
+    final_df = final_df.dropna(subset=["xmin", "xmax", "ymin", "ymax", "text", "id", "label"])
+
     return final_df
 
 def create_annotation_dicts_from_annotation_df(
@@ -1558,6 +1609,9 @@ def convert_annotation_json_to_review_df(
     except TypeError as e:
         print(f"Warning: Could not sort DataFrame due to type error in sort columns: {e}")
         # Proceed without sorting
+
+    review_file_df = review_file_df.dropna(subset=["xmin", "xmax", "ymin", "ymax", "text", "id", "label"])
+
     return review_file_df
 
 def fill_missing_box_ids(data_input: dict) -> dict:
@@ -1787,6 +1841,8 @@ def convert_review_df_to_annotation_json(
     Returns:
         List of dictionaries suitable for Gradio Annotation output, one dict per image/page.
     """
+    review_file_df = review_file_df.dropna(subset=["xmin", "xmax", "ymin", "ymax", "text", "id", "label"])
+
     if not page_sizes:
         raise ValueError("page_sizes argument is required and cannot be empty.")
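`prepare_image_or_pdf` now recognises a re-uploaded local OCR file by its `_ocr_results_with_words` suffix, and `load_and_convert_ocr_results_with_words_json` expects that file to be a list of per-page objects with `page` and `results` keys. A minimal sketch of writing and re-checking such a file (paths and contents are placeholders):

```python
# Sketch: persist per-page word-level OCR output so a later run can skip local OCR.
import json

all_page_results = [
    {
        "page": 1,
        "results": {
            "text_line_1": {
                "line": 1,
                "text": "Example line",
                "bounding_box": [50, 100, 250, 112],
                "words": [
                    {"text": "Example", "bounding_box": [50, 100, 140, 112]},
                    {"text": "line", "bounding_box": [150, 100, 250, 112]},
                ],
            }
        },
    }
]

# The filename suffix matters: files ending in '_ocr_results_with_words.json' are copied to the
# output folder on upload and flagged via local_ocr_output_found.
with open("example_doc_ocr_results_with_words.json", "w", encoding="utf-8") as f:
    json.dump(all_page_results, f)

# On reload, the loader checks the first element for 'page' and 'results' keys before reusing it.
with open("example_doc_ocr_results_with_words.json", "r", encoding="utf-8") as f:
    data = json.load(f)
assert "page" in data[0] and "results" in data[0]
```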
 
tools/file_redaction.py CHANGED
@@ -20,8 +20,8 @@ from gradio import Progress
20
  from collections import defaultdict # For efficient grouping
21
 
22
  from tools.config import OUTPUT_FOLDER, IMAGES_DPI, MAX_IMAGE_PIXELS, RUN_AWS_FUNCTIONS, AWS_ACCESS_KEY, AWS_SECRET_KEY, AWS_REGION, PAGE_BREAK_VALUE, MAX_TIME_VALUE, LOAD_TRUNCATED_IMAGES, INPUT_FOLDER
23
- from tools.custom_image_analyser_engine import CustomImageAnalyzerEngine, OCRResult, combine_ocr_results, CustomImageRecognizerResult, run_page_text_redaction, merge_text_bounding_boxes
24
- from tools.file_conversion import convert_annotation_json_to_review_df, redact_whole_pymupdf_page, redact_single_box, convert_pymupdf_to_image_coords, is_pdf, is_pdf_or_image, prepare_image_or_pdf, divide_coordinates_by_page_sizes, multiply_coordinates_by_page_sizes, convert_annotation_data_to_dataframe, divide_coordinates_by_page_sizes, create_annotation_dicts_from_annotation_df, remove_duplicate_images_with_blank_boxes, fill_missing_ids, fill_missing_box_ids
25
  from tools.load_spacy_model_custom_recognisers import nlp_analyser, score_threshold, custom_entities, custom_recogniser, custom_word_list_recogniser, CustomWordFuzzyRecognizer
26
  from tools.helper_functions import get_file_name_without_type, clean_unicode_text, tesseract_ocr_option, text_ocr_option, textract_option, local_pii_detector, aws_pii_detector, no_redaction_option
27
  from tools.aws_textract import analyse_page_with_textract, json_to_ocrresult, load_and_convert_textract_json
@@ -101,6 +101,8 @@ def choose_and_run_redactor(file_paths:List[str],
101
  input_folder:str=INPUT_FOLDER,
102
  total_textract_query_number:int=0,
103
  ocr_file_path:str="",
 
 
104
  prepare_images:bool=True,
105
  progress=gr.Progress(track_tqdm=True)):
106
  '''
@@ -149,7 +151,9 @@ def choose_and_run_redactor(file_paths:List[str],
149
  - review_file_path (str, optional): The latest review file path created by the app
150
  - input_folder (str, optional): The custom input path, if provided
151
  - total_textract_query_number (int, optional): The number of textract queries up until this point.
152
- - ocr_file_path (str, optional): The latest ocr file path created by the app
 
 
153
  - prepare_images (bool, optional): Boolean to determine whether to load images for the PDF.
154
  - progress (gr.Progress, optional): A progress tracker for the redaction process. Defaults to a Progress object with track_tqdm set to True.
155
 
@@ -179,9 +183,16 @@ def choose_and_run_redactor(file_paths:List[str],
179
  out_file_paths = []
180
  estimate_total_processing_time = 0
181
  estimated_time_taken_state = 0
 
 
 
 
 
182
  # If not the first time around, and the current page loop has been set to a huge number (been through all pages), reset current page to 0
183
  elif (first_loop_state == False) & (current_loop_page == 999):
184
  current_loop_page = 0
 
 
185
 
186
  # Choose the correct file to prepare
187
  if isinstance(file_paths, str): file_paths_list = [os.path.abspath(file_paths)]
@@ -219,6 +230,8 @@ def choose_and_run_redactor(file_paths:List[str],
219
  elif out_message:
220
  combined_out_message = combined_out_message + '\n' + out_message
221
 
 
 
222
  # Only send across review file if redaction has been done
223
  if pii_identification_method != no_redaction_option:
224
 
@@ -226,10 +239,15 @@ def choose_and_run_redactor(file_paths:List[str],
226
  #review_file_path = [x for x in out_file_paths if "review_file" in x]
227
  if review_file_path: review_out_file_paths.append(review_file_path)
228
 
 
 
 
 
 
229
  estimate_total_processing_time = sum_numbers_before_seconds(combined_out_message)
230
  print("Estimated total processing time:", str(estimate_total_processing_time))
231
 
232
- return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_pages_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, pdf_image_file_paths, review_file_state, page_sizes, duplication_file_path_outputs, duplication_file_path_outputs, review_file_path, total_textract_query_number, ocr_file_path
233
 
234
  #if first_loop_state == False:
235
  # Prepare documents and images as required if they don't already exist
@@ -259,7 +277,7 @@ def choose_and_run_redactor(file_paths:List[str],
259
 
260
  # Call prepare_image_or_pdf only if needed
261
  if prepare_images_flag is not None:
262
- out_message, prepared_pdf_file_paths, pdf_image_file_paths, annotate_max_pages, annotate_max_pages_bottom, pymupdf_doc, annotations_all_pages, review_file_state, document_cropboxes, page_sizes, textract_output_found, all_img_details_state, placeholder_ocr_results_df = prepare_image_or_pdf(
263
  file_paths_loop, text_extraction_method, 0, out_message, True,
264
  annotate_max_pages, annotations_all_pages, document_cropboxes, redact_whole_page_list,
265
  output_folder, prepare_images=prepare_images_flag, page_sizes=page_sizes, input_folder=input_folder
@@ -274,11 +292,15 @@ def choose_and_run_redactor(file_paths:List[str],
274
  page_sizes = page_sizes_df.to_dict(orient="records")
275
 
276
  number_of_pages = pymupdf_doc.page_count
 
277
 
278
  # If we have reached the last page, return message and outputs
279
  if current_loop_page >= number_of_pages:
280
  print("Reached last page of document:", current_loop_page)
281
 
 
 
 
282
  # Set to a very high number so as not to mix up with subsequent file processing by the user
283
  current_loop_page = 999
284
  if out_message:
@@ -291,7 +313,7 @@ def choose_and_run_redactor(file_paths:List[str],
291
  #review_file_path = [x for x in out_file_paths if "review_file" in x]
292
  if review_file_path: review_out_file_paths.append(review_file_path)
293
 
294
- return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = False, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_pages_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, pdf_image_file_paths, review_file_state, page_sizes, duplication_file_path_outputs, duplication_file_path_outputs, review_file_path, total_textract_query_number, ocr_file_path
295
 
296
  # Load/create allow list
297
  # If string, assume file path
@@ -421,7 +443,7 @@ def choose_and_run_redactor(file_paths:List[str],
421
 
422
  print("Redacting file " + pdf_file_name_with_ext + " as an image-based file")
423
 
424
- pymupdf_doc, all_pages_decision_process_table, out_file_paths, new_textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number = redact_image_pdf(file_path,
425
  pdf_image_file_paths,
426
  language,
427
  chosen_redact_entities,
@@ -447,7 +469,9 @@ def choose_and_run_redactor(file_paths:List[str],
447
  max_fuzzy_spelling_mistakes_num,
448
  match_fuzzy_whole_phrase_bool,
449
  page_sizes_df,
450
- text_extraction_only,
 
 
451
  log_files_output_paths=log_files_output_paths,
452
  output_folder=output_folder)
453
 
@@ -598,7 +622,10 @@ def choose_and_run_redactor(file_paths:List[str],
598
  if not review_file_path: review_out_file_paths = [prepared_pdf_file_paths[-1]]
599
  else: review_out_file_paths = [prepared_pdf_file_paths[-1], review_file_path]
600
 
601
- return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page, precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_pages_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, pdf_image_file_paths, review_file_state, page_sizes, duplication_file_path_outputs, duplication_file_path_outputs, review_file_path, total_textract_query_number, ocr_file_path
 
 
 
602
 
603
  def convert_pikepdf_coords_to_pymupdf(pymupdf_page:Page, pikepdf_bbox, type="pikepdf_annot"):
604
  '''
@@ -1163,7 +1190,9 @@ def redact_image_pdf(file_path:str,
1163
  max_fuzzy_spelling_mistakes_num:int=1,
1164
  match_fuzzy_whole_phrase_bool:bool=True,
1165
  page_sizes_df:pd.DataFrame=pd.DataFrame(),
1166
- text_extraction_only:bool=False,
 
 
1167
  page_break_val:int=int(PAGE_BREAK_VALUE),
1168
  log_files_output_paths:List=[],
1169
  max_time:int=int(MAX_TIME_VALUE),
@@ -1235,7 +1264,6 @@ def redact_image_pdf(file_path:str,
1235
  print(out_message_warning)
1236
  #raise Exception(out_message)
1237
 
1238
-
1239
  number_of_pages = pymupdf_doc.page_count
1240
  print("Number of pages:", str(number_of_pages))
1241
 
@@ -1253,14 +1281,24 @@ def redact_image_pdf(file_path:str,
1253
  textract_data, is_missing, log_files_output_paths = load_and_convert_textract_json(textract_json_file_path, log_files_output_paths, page_sizes_df)
1254
  original_textract_data = textract_data.copy()
1255
 
 
 
 
 
 
 
 
 
 
 
1256
  ###
1257
  if current_loop_page == 0: page_loop_start = 0
1258
  else: page_loop_start = current_loop_page
1259
 
1260
  progress_bar = tqdm(range(page_loop_start, number_of_pages), unit="pages remaining", desc="Redacting pages")
1261
 
1262
- all_pages_decision_process_table_list = [all_pages_decision_process_table]
1263
  all_line_level_ocr_results_df_list = [all_line_level_ocr_results_df]
 
1264
 
1265
  # Go through each page
1266
  for page_no in progress_bar:
@@ -1268,6 +1306,7 @@ def redact_image_pdf(file_path:str,
1268
  handwriting_or_signature_boxes = []
1269
  page_signature_recogniser_results = []
1270
  page_handwriting_recogniser_results = []
 
1271
  page_break_return = False
1272
  reported_page_number = str(page_no + 1)
1273
 
@@ -1317,8 +1356,44 @@ def redact_image_pdf(file_path:str,
1317
  #print("print(type(image_path)):", print(type(image_path)))
1318
  #if not isinstance(image_path, image_path.image_path) or not isinstance(image_path, str): raise Exception("image_path object for page", reported_page_number, "not found, cannot perform local OCR analysis.")
1319
 
1320
- page_word_level_ocr_results = image_analyser.perform_ocr(image_path)
1321
- page_line_level_ocr_results, page_line_level_ocr_results_with_children = combine_ocr_results(page_word_level_ocr_results)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1322
 
1323
  # Check if page exists in existing textract data. If not, send to service to analyse
1324
  if text_extraction_method == textract_option:
@@ -1382,16 +1457,28 @@ def redact_image_pdf(file_path:str,
1382
  # If the page exists, retrieve the data
1383
  text_blocks = next(page['data'] for page in textract_data["pages"] if page['page_no'] == reported_page_number)
1384
 
1385
-
1386
- page_line_level_ocr_results, handwriting_or_signature_boxes, page_signature_recogniser_results, page_handwriting_recogniser_results, page_line_level_ocr_results_with_children = json_to_ocrresult(text_blocks, page_width, page_height, reported_page_number)
 
 
 
 
 
 
 
 
 
 
 
 
1387
 
1388
  if pii_identification_method != no_redaction_option:
1389
  # Step 2: Analyse text and identify PII
1390
  if chosen_redact_entities or chosen_redact_comprehend_entities:
1391
 
1392
  page_redaction_bounding_boxes, comprehend_query_number_new = image_analyser.analyze_text(
1393
- page_line_level_ocr_results,
1394
- page_line_level_ocr_results_with_children,
1395
  chosen_redact_comprehend_entities = chosen_redact_comprehend_entities,
1396
  pii_identification_method = pii_identification_method,
1397
  comprehend_client=comprehend_client,
@@ -1406,7 +1493,7 @@ def redact_image_pdf(file_path:str,
1406
  else: page_redaction_bounding_boxes = []
1407
 
1408
  # Merge redaction bounding boxes that are close together
1409
- page_merged_redaction_bboxes = merge_img_bboxes(page_redaction_bounding_boxes, page_line_level_ocr_results_with_children, page_signature_recogniser_results, page_handwriting_recogniser_results, handwrite_signature_checkbox)
1410
 
1411
  else: page_merged_redaction_bboxes = []
1412
 
@@ -1492,19 +1579,6 @@ def redact_image_pdf(file_path:str,
1492
  decision_process_table = fill_missing_ids(decision_process_table)
1493
  #decision_process_table.to_csv("output/decision_process_table_with_ids.csv")
1494
 
1495
-
1496
- # Convert to DataFrame and add to ongoing logging table
1497
- line_level_ocr_results_df = pd.DataFrame([{
1498
- 'page': reported_page_number,
1499
- 'text': result.text,
1500
- 'left': result.left,
1501
- 'top': result.top,
1502
- 'width': result.width,
1503
- 'height': result.height
1504
- } for result in page_line_level_ocr_results])
1505
-
1506
- all_line_level_ocr_results_df_list.append(line_level_ocr_results_df)
1507
-
1508
  toc = time.perf_counter()
1509
 
1510
  time_taken = toc - tic
@@ -1529,6 +1603,8 @@ def redact_image_pdf(file_path:str,
1529
  # Append new annotation if it doesn't exist
1530
  annotations_all_pages.append(page_image_annotations)
1531
 
 
 
1532
  if text_extraction_method == textract_option:
1533
  if original_textract_data != textract_data:
1534
  # Write the updated existing textract data back to the JSON file
@@ -1538,12 +1614,21 @@ def redact_image_pdf(file_path:str,
1538
  if textract_json_file_path not in log_files_output_paths:
1539
  log_files_output_paths.append(textract_json_file_path)
1540
 
 
 
 
 
 
 
 
 
 
1541
  all_pages_decision_process_table = pd.concat(all_pages_decision_process_table_list)
1542
  all_line_level_ocr_results_df = pd.concat(all_line_level_ocr_results_df_list)
1543
 
1544
  current_loop_page += 1
1545
 
1546
- return pymupdf_doc, all_pages_decision_process_table, log_files_output_paths, textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number
1547
 
1548
  # If it's an image file
1549
  if is_pdf(file_path) == False:
@@ -1576,10 +1661,20 @@ def redact_image_pdf(file_path:str,
1576
  if textract_json_file_path not in log_files_output_paths:
1577
  log_files_output_paths.append(textract_json_file_path)
1578

1579
  all_pages_decision_process_table = pd.concat(all_pages_decision_process_table_list)
1580
  all_line_level_ocr_results_df = pd.concat(all_line_level_ocr_results_df_list)
1581
 
1582
- return pymupdf_doc, all_pages_decision_process_table, log_files_output_paths, textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number
1583
 
1584
  if text_extraction_method == textract_option:
1585
  # Write the updated existing textract data back to the JSON file
@@ -1591,15 +1686,24 @@ def redact_image_pdf(file_path:str,
1591
  if textract_json_file_path not in log_files_output_paths:
1592
  log_files_output_paths.append(textract_json_file_path)
1593

1594
  all_pages_decision_process_table = pd.concat(all_pages_decision_process_table_list)
1595
  all_line_level_ocr_results_df = pd.concat(all_line_level_ocr_results_df_list)
1596
 
1597
- # Convert decision table to relative coordinates
1598
  all_pages_decision_process_table = divide_coordinates_by_page_sizes(all_pages_decision_process_table, page_sizes_df, xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax")
1599
 
1600
  all_line_level_ocr_results_df = divide_coordinates_by_page_sizes(all_line_level_ocr_results_df, page_sizes_df, xmin="left", xmax="width", ymin="top", ymax="height")
1601
 
1602
- return pymupdf_doc, all_pages_decision_process_table, log_files_output_paths, textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number
1603
 
1604
 
1605
  ###
 
20
  from collections import defaultdict # For efficient grouping
21
 
22
  from tools.config import OUTPUT_FOLDER, IMAGES_DPI, MAX_IMAGE_PIXELS, RUN_AWS_FUNCTIONS, AWS_ACCESS_KEY, AWS_SECRET_KEY, AWS_REGION, PAGE_BREAK_VALUE, MAX_TIME_VALUE, LOAD_TRUNCATED_IMAGES, INPUT_FOLDER
23
+ from tools.custom_image_analyser_engine import CustomImageAnalyzerEngine, OCRResult, combine_ocr_results, CustomImageRecognizerResult, run_page_text_redaction, merge_text_bounding_boxes, recreate_page_line_level_ocr_results_with_page
24
+ from tools.file_conversion import convert_annotation_json_to_review_df, redact_whole_pymupdf_page, redact_single_box, convert_pymupdf_to_image_coords, is_pdf, is_pdf_or_image, prepare_image_or_pdf, divide_coordinates_by_page_sizes, multiply_coordinates_by_page_sizes, convert_annotation_data_to_dataframe, divide_coordinates_by_page_sizes, create_annotation_dicts_from_annotation_df, remove_duplicate_images_with_blank_boxes, fill_missing_ids, fill_missing_box_ids, load_and_convert_ocr_results_with_words_json
25
  from tools.load_spacy_model_custom_recognisers import nlp_analyser, score_threshold, custom_entities, custom_recogniser, custom_word_list_recogniser, CustomWordFuzzyRecognizer
26
  from tools.helper_functions import get_file_name_without_type, clean_unicode_text, tesseract_ocr_option, text_ocr_option, textract_option, local_pii_detector, aws_pii_detector, no_redaction_option
27
  from tools.aws_textract import analyse_page_with_textract, json_to_ocrresult, load_and_convert_textract_json
 
101
  input_folder:str=INPUT_FOLDER,
102
  total_textract_query_number:int=0,
103
  ocr_file_path:str="",
104
+ all_page_line_level_ocr_results:List=[],
105
+ all_page_line_level_ocr_results_with_words:List=[],
106
  prepare_images:bool=True,
107
  progress=gr.Progress(track_tqdm=True)):
108
  '''
 
151
  - review_file_path (str, optional): The latest review file path created by the app
152
  - input_folder (str, optional): The custom input path, if provided
153
  - total_textract_query_number (int, optional): The number of textract queries up until this point.
154
+ - ocr_file_path (str, optional): The latest ocr file path created by the app.
155
+ - all_page_line_level_ocr_results (list, optional): All line-level OCR text results for each page, with bounding boxes.
156
+ - all_page_line_level_ocr_results_with_words (list, optional): All word-level OCR text results for each page, with bounding boxes.
157
  - prepare_images (bool, optional): Boolean to determine whether to load images for the PDF.
158
  - progress (gr.Progress, optional): A progress tracker for the redaction process. Defaults to a Progress object with track_tqdm set to True.
159
 
 
183
  out_file_paths = []
184
  estimate_total_processing_time = 0
185
  estimated_time_taken_state = 0
186
+ comprehend_query_number = 0
187
+ total_textract_query_number = 0
188
+ elif current_loop_page == 0:
189
+ comprehend_query_number = 0
190
+ total_textract_query_number = 0
191
  # If not the first time around, and the current page loop has been set to a huge number (been through all pages), reset current page to 0
192
  elif (first_loop_state == False) & (current_loop_page == 999):
193
  current_loop_page = 0
194
+ total_textract_query_number = 0
195
+ comprehend_query_number = 0
196
 
197
  # Choose the correct file to prepare
198
  if isinstance(file_paths, str): file_paths_list = [os.path.abspath(file_paths)]
 
230
  elif out_message:
231
  combined_out_message = combined_out_message + '\n' + out_message
232
 
233
+ combined_out_message = re.sub(r'^\n+', '', combined_out_message).strip()
234
+
235
  # Only send across review file if redaction has been done
236
  if pii_identification_method != no_redaction_option:
237
 
 
239
  #review_file_path = [x for x in out_file_paths if "review_file" in x]
240
  if review_file_path: review_out_file_paths.append(review_file_path)
241
 
242
+ if not isinstance(pymupdf_doc, list):
243
+ number_of_pages = pymupdf_doc.page_count
244
+ if total_textract_query_number > number_of_pages:
245
+ total_textract_query_number = number_of_pages
246
+
247
  estimate_total_processing_time = sum_numbers_before_seconds(combined_out_message)
248
  print("Estimated total processing time:", str(estimate_total_processing_time))
249
 
250
+ return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_pages_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, pdf_image_file_paths, review_file_state, page_sizes, duplication_file_path_outputs, duplication_file_path_outputs, review_file_path, total_textract_query_number, ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_words
251
 
252
  #if first_loop_state == False:
253
  # Prepare documents and images as required if they don't already exist
 
277
 
278
  # Call prepare_image_or_pdf only if needed
279
  if prepare_images_flag is not None:
280
+ out_message, prepared_pdf_file_paths, pdf_image_file_paths, annotate_max_pages, annotate_max_pages_bottom, pymupdf_doc, annotations_all_pages, review_file_state, document_cropboxes, page_sizes, textract_output_found, all_img_details_state, placeholder_ocr_results_df, local_ocr_output_found_checkbox = prepare_image_or_pdf(
281
  file_paths_loop, text_extraction_method, 0, out_message, True,
282
  annotate_max_pages, annotations_all_pages, document_cropboxes, redact_whole_page_list,
283
  output_folder, prepare_images=prepare_images_flag, page_sizes=page_sizes, input_folder=input_folder
 
292
  page_sizes = page_sizes_df.to_dict(orient="records")
293
 
294
  number_of_pages = pymupdf_doc.page_count
295
+
296
 
297
  # If we have reached the last page, return message and outputs
298
  if current_loop_page >= number_of_pages:
299
  print("Reached last page of document:", current_loop_page)
300
 
301
+ if total_textract_query_number > number_of_pages:
302
+ total_textract_query_number = number_of_pages
303
+
304
  # Set to a very high number so as not to mix up with subsequent file processing by the user
305
  current_loop_page = 999
306
  if out_message:
 
313
  #review_file_path = [x for x in out_file_paths if "review_file" in x]
314
  if review_file_path: review_out_file_paths.append(review_file_path)
315
 
316
+ return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = False, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_pages_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, pdf_image_file_paths, review_file_state, page_sizes, duplication_file_path_outputs, duplication_file_path_outputs, review_file_path, total_textract_query_number, ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_words
317
 
318
  # Load/create allow list
319
  # If string, assume file path
 
443
 
444
  print("Redacting file " + pdf_file_name_with_ext + " as an image-based file")
445
 
446
+ pymupdf_doc, all_pages_decision_process_table, out_file_paths, new_textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_words = redact_image_pdf(file_path,
447
  pdf_image_file_paths,
448
  language,
449
  chosen_redact_entities,
 
469
  max_fuzzy_spelling_mistakes_num,
470
  match_fuzzy_whole_phrase_bool,
471
  page_sizes_df,
472
+ text_extraction_only,
473
+ all_page_line_level_ocr_results,
474
+ all_page_line_level_ocr_results_with_words,
475
  log_files_output_paths=log_files_output_paths,
476
  output_folder=output_folder)
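
The two extra outputs returned here are fed straight back in as inputs on the next batch call, so pages already OCR'd in this session are not re-processed. Schematically, with a hypothetical process_batch function standing in for the app's per-batch redaction call:

# Schematic of threading the OCR caches between page-batch calls.
# process_batch is a hypothetical stand-in for the app's per-batch redaction call.
def process_batch(current_page, ocr_lines_cache, ocr_words_cache):
    # ...OCR only the pages missing from ocr_words_cache, redact, then return the caches...
    return current_page + 1, ocr_lines_cache, ocr_words_cache

current_page, number_of_pages = 0, 3
lines_cache, words_cache = [], []
while current_page < number_of_pages:
    current_page, lines_cache, words_cache = process_batch(current_page, lines_cache, words_cache)
print("Processed", current_page, "pages with", len(words_cache), "cached pages of OCR")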
477
 
 
622
  if not review_file_path: review_out_file_paths = [prepared_pdf_file_paths[-1]]
623
  else: review_out_file_paths = [prepared_pdf_file_paths[-1], review_file_path]
624
 
625
+ if total_textract_query_number > number_of_pages:
626
+ total_textract_query_number = number_of_pages
627
+
628
+ return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page, precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_pages_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, pdf_image_file_paths, review_file_state, page_sizes, duplication_file_path_outputs, duplication_file_path_outputs, review_file_path, total_textract_query_number, ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_words
629
 
630
  def convert_pikepdf_coords_to_pymupdf(pymupdf_page:Page, pikepdf_bbox, type="pikepdf_annot"):
631
  '''
 
1190
  max_fuzzy_spelling_mistakes_num:int=1,
1191
  match_fuzzy_whole_phrase_bool:bool=True,
1192
  page_sizes_df:pd.DataFrame=pd.DataFrame(),
1193
+ text_extraction_only:bool=False,
1194
+ all_page_line_level_ocr_results:List=[],
1195
+ all_page_line_level_ocr_results_with_words:List=[],
1196
  page_break_val:int=int(PAGE_BREAK_VALUE),
1197
  log_files_output_paths:List=[],
1198
  max_time:int=int(MAX_TIME_VALUE),
 
1264
  print(out_message_warning)
1265
  #raise Exception(out_message)
1266
 
 
1267
  number_of_pages = pymupdf_doc.page_count
1268
  print("Number of pages:", str(number_of_pages))
1269
 
 
1281
  textract_data, is_missing, log_files_output_paths = load_and_convert_textract_json(textract_json_file_path, log_files_output_paths, page_sizes_df)
1282
  original_textract_data = textract_data.copy()
1283
 
1284
+ print("Successfully loaded in Textract analysis results from file")
1285
+
1286
+ # If running local OCR option, check if file already exists. If it does, load in existing data
1287
+ if text_extraction_method == tesseract_ocr_option:
1288
+ all_page_line_level_ocr_results_with_words_json_file_path = output_folder + file_name + "_ocr_results_with_words.json"
1289
+ all_page_line_level_ocr_results_with_words, is_missing, log_files_output_paths = load_and_convert_ocr_results_with_words_json(all_page_line_level_ocr_results_with_words_json_file_path, log_files_output_paths, page_sizes_df)
1290
+ original_all_page_line_level_ocr_results_with_words = all_page_line_level_ocr_results_with_words.copy()
1291
+
1292
+ print("Loaded in local OCR analysis results from file")
1293
+
1294
  ###
1295
  if current_loop_page == 0: page_loop_start = 0
1296
  else: page_loop_start = current_loop_page
1297
 
1298
  progress_bar = tqdm(range(page_loop_start, number_of_pages), unit="pages remaining", desc="Redacting pages")
1299
 
 
1300
  all_line_level_ocr_results_df_list = [all_line_level_ocr_results_df]
1301
+ all_pages_decision_process_table_list = [all_pages_decision_process_table]
1302
 
1303
  # Go through each page
1304
  for page_no in progress_bar:
 
1306
  handwriting_or_signature_boxes = []
1307
  page_signature_recogniser_results = []
1308
  page_handwriting_recogniser_results = []
1309
+ page_line_level_ocr_results_with_words = []
1310
  page_break_return = False
1311
  reported_page_number = str(page_no + 1)
1312
 
 
1356
  #print("print(type(image_path)):", print(type(image_path)))
1357
  #if not isinstance(image_path, image_path.image_path) or not isinstance(image_path, str): raise Exception("image_path object for page", reported_page_number, "not found, cannot perform local OCR analysis.")
1358
 
1359
+ # Check for existing page_line_level_ocr_results_with_words object:
1360
+
1361
+ # page_line_level_ocr_results = (
1362
+ # all_page_line_level_ocr_results.get('results', [])
1363
+ # if all_page_line_level_ocr_results.get('page') == reported_page_number
1364
+ # else []
1365
+ # )
1366
+
1367
+ if all_page_line_level_ocr_results_with_words:
1368
+ # Find the first dict where 'page' matches
1369
+
1370
+ #print("all_page_line_level_ocr_results_with_words:", all_page_line_level_ocr_results_with_words)
1371
+
1372
+ print("All pages available:", [item.get('page') for item in all_page_line_level_ocr_results_with_words])
1373
+ #print("Looking for page:", reported_page_number)
1374
+
1375
+ matching_page = next(
1376
+ (item for item in all_page_line_level_ocr_results_with_words if int(item.get('page', -1)) == int(reported_page_number)),
1377
+ None
1378
+ )
1379
+
1380
+ #print("matching_page:", matching_page)
1381
+
1382
+ page_line_level_ocr_results_with_words = matching_page if matching_page else []
1383
+ else: page_line_level_ocr_results_with_words = []
1384
+
1385
+ if page_line_level_ocr_results_with_words:
1386
+ print("Found OCR results for page in existing OCR with words object")
1387
+ page_line_level_ocr_results = recreate_page_line_level_ocr_results_with_page(page_line_level_ocr_results_with_words)
1388
+ else:
1389
+ page_word_level_ocr_results = image_analyser.perform_ocr(image_path)
1390
+
1391
+ print("page_word_level_ocr_results:", page_word_level_ocr_results)
1392
+ page_line_level_ocr_results, page_line_level_ocr_results_with_words = combine_ocr_results(page_word_level_ocr_results, page=reported_page_number)
1393
+
1394
+ all_page_line_level_ocr_results_with_words.append(page_line_level_ocr_results_with_words)
1395
+
1396
+ print("All pages available:", [item.get('page') for item in all_page_line_level_ocr_results_with_words])
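
In short, the block above reuses any cached OCR-with-words entry for the current page and only falls back to local Tesseract OCR when the page is missing from the cache. A minimal standalone sketch of that lookup, assuming each cached entry is a dict carrying a 'page' number as produced by combine_ocr_results, is:

# Minimal sketch of the cached-OCR lookup. Assumes each cached entry is a dict
# like {"page": 3, "results": {...}}, as produced by combine_ocr_results above.
def get_cached_page_results(all_pages_cache, page_number):
    """Return the cached OCR-with-words entry for a page, or None if it is absent."""
    for entry in all_pages_cache:
        try:
            if int(entry.get("page", -1)) == int(page_number):
                return entry
        except (TypeError, ValueError):
            continue  # skip malformed entries instead of failing the whole page loop
    return None

# Example usage: only run OCR when the page is not already cached.
cache = [{"page": 1, "results": {}}, {"page": 2, "results": {}}]
print(get_cached_page_results(cache, "2"))  # matches despite the string page number
print(get_cached_page_results(cache, 5))    # None, so OCR would run for this page

Coercing both page values to int before comparing avoids a silent mismatch between the string reported_page_number and integer page values loaded back from JSON.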
1397
 
1398
  # Check if page exists in existing textract data. If not, send to service to analyse
1399
  if text_extraction_method == textract_option:
 
1457
  # If the page exists, retrieve the data
1458
  text_blocks = next(page['data'] for page in textract_data["pages"] if page['page_no'] == reported_page_number)
1459
 
1460
+ page_line_level_ocr_results, handwriting_or_signature_boxes, page_signature_recogniser_results, page_handwriting_recogniser_results, page_line_level_ocr_results_with_words = json_to_ocrresult(text_blocks, page_width, page_height, reported_page_number)
1461
+
1462
+ # Convert to DataFrame and add to ongoing logging table
1463
+ line_level_ocr_results_df = pd.DataFrame([{
1464
+ 'page': page_line_level_ocr_results['page'],
1465
+ 'text': result.text,
1466
+ 'left': result.left,
1467
+ 'top': result.top,
1468
+ 'width': result.width,
1469
+ 'height': result.height
1470
+ } for result in page_line_level_ocr_results['results']])
1471
+
1472
+ all_line_level_ocr_results_df_list.append(line_level_ocr_results_df)
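
The logging step above flattens the page's line-level results into rows of page, text and box coordinates. A self-contained sketch of the same transformation, using a hypothetical Line dataclass in place of the app's own OCR result objects, is:

# Illustrative only: build the per-page OCR log table from line-level results.
# Line is a hypothetical stand-in for the app's own OCR result objects, which
# expose text, left, top, width and height attributes.
from dataclasses import dataclass
import pandas as pd

@dataclass
class Line:
    text: str
    left: float
    top: float
    width: float
    height: float

page_line_level_results = {"page": "1", "results": [Line("Example line", 10, 20, 200, 15)]}

line_level_ocr_results_df = pd.DataFrame([{
    "page": page_line_level_results["page"],
    "text": line.text,
    "left": line.left,
    "top": line.top,
    "width": line.width,
    "height": line.height,
} for line in page_line_level_results["results"]])

print(line_level_ocr_results_df)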
1473
+
1474
 
1475
  if pii_identification_method != no_redaction_option:
1476
  # Step 2: Analyse text and identify PII
1477
  if chosen_redact_entities or chosen_redact_comprehend_entities:
1478
 
1479
  page_redaction_bounding_boxes, comprehend_query_number_new = image_analyser.analyze_text(
1480
+ page_line_level_ocr_results['results'],
1481
+ page_line_level_ocr_results_with_words['results'],
1482
  chosen_redact_comprehend_entities = chosen_redact_comprehend_entities,
1483
  pii_identification_method = pii_identification_method,
1484
  comprehend_client=comprehend_client,
 
1493
  else: page_redaction_bounding_boxes = []
1494
 
1495
  # Merge redaction bounding boxes that are close together
1496
+ page_merged_redaction_bboxes = merge_img_bboxes(page_redaction_bounding_boxes, page_line_level_ocr_results_with_words['results'], page_signature_recogniser_results, page_handwriting_recogniser_results, handwrite_signature_checkbox)
1497
 
1498
  else: page_merged_redaction_bboxes = []
1499
 
 
1579
  decision_process_table = fill_missing_ids(decision_process_table)
1580
  #decision_process_table.to_csv("output/decision_process_table_with_ids.csv")
1581

1582
  toc = time.perf_counter()
1583
 
1584
  time_taken = toc - tic
 
1603
  # Append new annotation if it doesn't exist
1604
  annotations_all_pages.append(page_image_annotations)
1605
 
1606
+
1607
+
1608
  if text_extraction_method == textract_option:
1609
  if original_textract_data != textract_data:
1610
  # Write the updated existing textract data back to the JSON file
 
1614
  if textract_json_file_path not in log_files_output_paths:
1615
  log_files_output_paths.append(textract_json_file_path)
1616
 
1617
+ if text_extraction_method == tesseract_ocr_option:
1618
+ if original_all_page_line_level_ocr_results_with_words != all_page_line_level_ocr_results_with_words:
1619
+ # Write the updated existing textract data back to the JSON file
1620
+ with open(all_page_line_level_ocr_results_with_words_json_file_path, 'w') as json_file:
1621
+ json.dump(all_page_line_level_ocr_results_with_words, json_file, separators=(",", ":")) # indent=4 makes the JSON file pretty-printed
1622
+
1623
+ if all_page_line_level_ocr_results_with_words_json_file_path not in log_files_output_paths:
1624
+ log_files_output_paths.append(all_page_line_level_ocr_results_with_words_json_file_path)
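
This mirrors the Textract branch just above: the OCR-with-words cache is only rewritten when it has changed, and the JSON it produces is what load_and_convert_ocr_results_with_words_json picks up on a later run. A minimal save-and-reload round trip, with a hypothetical file path, looks like:

# Minimal save-and-reload round trip for the cached OCR-with-words results.
# The path below is a hypothetical example; the app builds it from the document name.
import json
import os

cache_path = "output/example_document_ocr_results_with_words.json"
all_pages_cache = [{"page": 1, "results": {"text_line_1": {"text": "Example", "words": []}}}]

os.makedirs(os.path.dirname(cache_path), exist_ok=True)
with open(cache_path, "w") as json_file:
    json.dump(all_pages_cache, json_file, separators=(",", ":"))  # compact output keeps the file small

with open(cache_path) as json_file:
    reloaded = json.load(json_file)

assert reloaded == all_pages_cache  # a later run can reuse this instead of re-running OCR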
1625
+
1626
  all_pages_decision_process_table = pd.concat(all_pages_decision_process_table_list)
1627
  all_line_level_ocr_results_df = pd.concat(all_line_level_ocr_results_df_list)
1628
 
1629
  current_loop_page += 1
1630
 
1631
+ return pymupdf_doc, all_pages_decision_process_table, log_files_output_paths, textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_words
1632
 
1633
  # If it's an image file
1634
  if is_pdf(file_path) == False:
 
1661
  if textract_json_file_path not in log_files_output_paths:
1662
  log_files_output_paths.append(textract_json_file_path)
1663
 
1664
+ if text_extraction_method == tesseract_ocr_option:
1665
+ if original_all_page_line_level_ocr_results_with_words != all_page_line_level_ocr_results_with_words:
1666
+ # Write the updated existing textract data back to the JSON file
1667
+ with open(all_page_line_level_ocr_results_with_words_json_file_path, 'w') as json_file:
1668
+ json.dump(all_page_line_level_ocr_results_with_words, json_file, separators=(",", ":")) # indent=4 makes the JSON file pretty-printed
1669
+
1670
+ if all_page_line_level_ocr_results_with_words_json_file_path not in log_files_output_paths:
1671
+ log_files_output_paths.append(all_page_line_level_ocr_results_with_words_json_file_path)
1672
+
1673
+
1674
  all_pages_decision_process_table = pd.concat(all_pages_decision_process_table_list)
1675
  all_line_level_ocr_results_df = pd.concat(all_line_level_ocr_results_df_list)
1676
 
1677
+ return pymupdf_doc, all_pages_decision_process_table, log_files_output_paths, textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_words
1678
 
1679
  if text_extraction_method == textract_option:
1680
  # Write the updated existing textract data back to the JSON file
 
1686
  if textract_json_file_path not in log_files_output_paths:
1687
  log_files_output_paths.append(textract_json_file_path)
1688
 
1689
+ if text_extraction_method == tesseract_ocr_option:
1690
+ if original_all_page_line_level_ocr_results_with_words != all_page_line_level_ocr_results_with_words:
1691
+ # Write the updated existing textract data back to the JSON file
1692
+ with open(all_page_line_level_ocr_results_with_words_json_file_path, 'w') as json_file:
1693
+ json.dump(all_page_line_level_ocr_results_with_words, json_file, separators=(",", ":")) # indent=4 makes the JSON file pretty-printed
1694
+
1695
+ if all_page_line_level_ocr_results_with_words_json_file_path not in log_files_output_paths:
1696
+ log_files_output_paths.append(all_page_line_level_ocr_results_with_words_json_file_path)
1697
+
1698
  all_pages_decision_process_table = pd.concat(all_pages_decision_process_table_list)
1699
  all_line_level_ocr_results_df = pd.concat(all_line_level_ocr_results_df_list)
1700
 
1701
+ # Convert decision table and ocr results to relative coordinates
1702
  all_pages_decision_process_table = divide_coordinates_by_page_sizes(all_pages_decision_process_table, page_sizes_df, xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax")
1703
 
1704
  all_line_level_ocr_results_df = divide_coordinates_by_page_sizes(all_line_level_ocr_results_df, page_sizes_df, xmin="left", xmax="width", ymin="top", ymax="height")
1705
 
1706
+ return pymupdf_doc, all_pages_decision_process_table, log_files_output_paths, textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_words
1707
 
1708
 
1709
  ###
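
The final conversion step divides the absolute OCR and decision-table coordinates by the page sizes so that everything downstream works in page-relative units. As a rough illustration of that idea (not the actual divide_coordinates_by_page_sizes implementation), and assuming a page_sizes table with image_width and image_height columns for this sketch:

# Rough illustration of normalising absolute OCR coordinates into 0-1 page-relative
# values. The image_width and image_height column names are assumptions for this sketch;
# the app's divide_coordinates_by_page_sizes handles this for real.
import pandas as pd

ocr_df = pd.DataFrame({"page": [1], "left": [100.0], "top": [200.0], "width": [400.0], "height": [230.0]})
page_sizes_df = pd.DataFrame({"page": [1], "image_width": [1000.0], "image_height": [2000.0]})

merged = ocr_df.merge(page_sizes_df, on="page", how="left")
merged["left"] = merged["left"] / merged["image_width"]
merged["width"] = merged["width"] / merged["image_width"]
merged["top"] = merged["top"] / merged["image_height"]
merged["height"] = merged["height"] / merged["image_height"]

print(merged[["page", "left", "top", "width", "height"]])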
tools/helper_functions.py CHANGED
@@ -39,6 +39,12 @@ def reset_ocr_results_state():
39
  def reset_review_vars():
40
  return pd.DataFrame(), pd.DataFrame()
41

42
  def load_in_default_allow_list(allow_list_file_path):
43
  if isinstance(allow_list_file_path, str):
44
  allow_list_file_path = [allow_list_file_path]
@@ -201,9 +207,6 @@ def put_columns_in_df(in_file:List[str]):
201
  df = pd.read_excel(file_name, sheet_name=sheet_name)
202
 
203
  # Process the DataFrame (e.g., print its contents)
204
- print(f"Sheet Name: {sheet_name}")
205
- print(df.head()) # Print the first few rows
206
-
207
  new_choices.extend(list(df.columns))
208
 
209
  all_sheet_names.extend(new_sheet_names)
@@ -226,7 +229,17 @@ def check_for_existing_textract_file(doc_file_name_no_extension_textbox:str, out
226
  textract_output_path = os.path.join(output_folder, doc_file_name_no_extension_textbox + "_textract.json")
227
 
228
  if os.path.exists(textract_output_path):
229
- print("Existing Textract file found.")
230
  return True
231
 
232
  else:
@@ -477,9 +490,10 @@ def calculate_time_taken(number_of_pages:str,
477
  pii_identification_method:str,
478
  textract_output_found_checkbox:bool,
479
  only_extract_text_radio:bool,
 
480
  convert_page_time:float=0.5,
481
- textract_page_time:float=1,
482
- comprehend_page_time:float=1,
483
  local_text_extraction_page_time:float=0.3,
484
  local_pii_redaction_page_time:float=0.5,
485
  local_ocr_extraction_page_time:float=1.5,
@@ -494,7 +508,9 @@ def calculate_time_taken(number_of_pages:str,
494
  - number_of_pages: The number of pages in the uploaded document(s).
495
  - text_extract_method_radio: The method of text extraction.
496
  - pii_identification_method_drop: The method of personally-identifiable information removal.
 
497
  - only_extract_text_radio (bool, optional): Option to only extract text from the document rather than redact.
 
498
  - textract_page_time (float, optional): Approximate time to query AWS Textract.
499
  - comprehend_page_time (float, optional): Approximate time to query text on a page with AWS Comprehend.
500
  - local_text_redaction_page_time (float, optional): Approximate time to extract text on a page with the local text redaction option.
@@ -522,7 +538,8 @@ def calculate_time_taken(number_of_pages:str,
522
  if textract_output_found_checkbox != True:
523
  page_extraction_time_taken = number_of_pages * textract_page_time
524
  elif text_extract_method_radio == local_ocr_option:
525
- page_extraction_time_taken = number_of_pages * local_ocr_extraction_page_time
 
526
  elif text_extract_method_radio == text_ocr_option:
527
  page_conversion_time_taken = number_of_pages * local_text_extraction_page_time
528
 
 
39
  def reset_review_vars():
40
  return pd.DataFrame(), pd.DataFrame()
41
 
42
+ def reset_data_vars():
43
+ return 0, [], 0
44
+
45
+ def reset_aws_call_vars():
46
+ return 0, 0
47
+
48
  def load_in_default_allow_list(allow_list_file_path):
49
  if isinstance(allow_list_file_path, str):
50
  allow_list_file_path = [allow_list_file_path]
 
207
  df = pd.read_excel(file_name, sheet_name=sheet_name)
208
 
209
  # Process the DataFrame (e.g., print its contents)
210
  new_choices.extend(list(df.columns))
211
 
212
  all_sheet_names.extend(new_sheet_names)
 
229
  textract_output_path = os.path.join(output_folder, doc_file_name_no_extension_textbox + "_textract.json")
230
 
231
  if os.path.exists(textract_output_path):
232
+ print("Existing Textract analysis output file found.")
233
+ return True
234
+
235
+ else:
236
+ return False
237
+
238
+ def check_for_existing_local_ocr_file(doc_file_name_no_extension_textbox:str, output_folder:str=OUTPUT_FOLDER):
239
+ local_ocr_output_path = os.path.join(output_folder, doc_file_name_no_extension_textbox + "_ocr_results_with_words.json")
240
+
241
+ if os.path.exists(local_ocr_output_path):
242
+ print("Existing local OCR analysis output file found.")
243
  return True
244
 
245
  else:
 
490
  pii_identification_method:str,
491
  textract_output_found_checkbox:bool,
492
  only_extract_text_radio:bool,
493
+ local_ocr_output_found_checkbox:bool,
494
  convert_page_time:float=0.5,
495
+ textract_page_time:float=1.2,
496
+ comprehend_page_time:float=1.2,
497
  local_text_extraction_page_time:float=0.3,
498
  local_pii_redaction_page_time:float=0.5,
499
  local_ocr_extraction_page_time:float=1.5,
 
508
  - number_of_pages: The number of pages in the uploaded document(s).
509
  - text_extract_method_radio: The method of text extraction.
510
  - pii_identification_method_drop: The method of personally-identifiable information removal.
511
+ - textract_output_found_checkbox (bool, optional): Boolean indicating if AWS Textract text extraction outputs have been found.
512
  - only_extract_text_radio (bool, optional): Option to only extract text from the document rather than redact.
513
+ - local_ocr_output_found_checkbox (bool, optional): Boolean indicating if local OCR text extraction outputs have been found.
514
  - textract_page_time (float, optional): Approximate time to query AWS Textract.
515
  - comprehend_page_time (float, optional): Approximate time to query text on a page with AWS Comprehend.
516
  - local_text_redaction_page_time (float, optional): Approximate time to extract text on a page with the local text redaction option.
 
538
  if textract_output_found_checkbox != True:
539
  page_extraction_time_taken = number_of_pages * textract_page_time
540
  elif text_extract_method_radio == local_ocr_option:
541
+ if local_ocr_output_found_checkbox != True:
542
+ page_extraction_time_taken = number_of_pages * local_ocr_extraction_page_time
543
  elif text_extract_method_radio == text_ocr_option:
544
  page_conversion_time_taken = number_of_pages * local_text_extraction_page_time
545