Sean Pedrick-Case committed on
Commit d998102 · unverified · 2 Parent(s): 4d4ca01 47a3a80

Merge pull request #18 from seanpedrick-case/dev
Improved review efficiency, logging to DynamoDB, local OCR text extraction saves, bug fixes

.dockerignore CHANGED
@@ -17,4 +17,6 @@ dist/*
 build_deps/*
 logs/*
 config/*
-user_guide/*
+user_guide/*
+cdk/*
+web/*
.gitignore CHANGED
@@ -18,4 +18,6 @@ build_deps/*
 logs/*
 config/*
 doc_redaction_amplify_app/*
-user_guide/*
+user_guide/*
+cdk/*
+web/*
README.md CHANGED
@@ -20,6 +20,12 @@ NOTE: The app is not 100% accurate, and it will miss some personal information.

 # USER GUIDE

 ## Table of contents

 - [Example data files](#example-data-files)
@@ -33,59 +39,102 @@ NOTE: The app is not 100% accurate, and it will miss some personal information.
 - [Redacting only specific pages](#redacting-only-specific-pages)
 - [Handwriting and signature redaction](#handwriting-and-signature-redaction)
 - [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)

 See the [advanced user guide here](#advanced-user-guide):
-- [Modifying and merging redaction review files](#modifying-and-merging-redaction-review-files)
-- [Modifying existing redaction review files](#modifying-existing-redaction-review-files)
-- [Merging existing redaction review files](#merging-existing-redaction-review-files)
 - [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
 - [Fuzzy search and redaction](#fuzzy-search-and-redaction)
 - [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
 - [Exporting to Adobe Acrobat](#exporting-to-adobe-acrobat)
 - [Importing from Adobe Acrobat](#importing-from-adobe-acrobat)
 - [Using AWS Textract and Comprehend when not running in an AWS environment](#using-aws-textract-and-comprehend-when-not-running-in-an-aws-environment)

 ## Example data files

-Please refer to these example files to follow this guide:
 - [Example of files sent to a professor before applying](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_of_emails_sent_to_a_professor_before_applying.pdf)
 - [Example complaint letter (jpg)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_complaint_letter.jpg)
 - [Partnership Agreement Toolkit (for signatures and more advanced usage)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/Partnership-Agreement-Toolkit_0_0.pdf)

 ## Basic redaction

-The document redaction app can detect personally-identifiable information (PII) in documents. Documents can be redacted directly, or suggested redactions can be reviewed and modified using a graphical user interface.

 Download the example PDFs above to your computer. Open up the redaction app with the link provided by email.

 ![Upload files](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/file_upload_highlight.PNG)

-Click on the upload files area, and select the three different files (they should all be stored in the same folder if you want them to be redacted at the same time).

-First, select one of the three text extraction options below:
-- 'Local model - selectable text' - This will read text directly from PDFs that have selectable text to redact (using PikePDF). This is fine for most PDFs, but will find nothing if the PDF does not have selectable text, and it is not good for handwriting or signatures. If it encounters an image file, it will send it on to the second option below.
-- 'Local OCR model - PDFs without selectable text' - This option will use a simple Optical Character Recognition (OCR) model (Tesseract) to pull out text from a PDF/image that it 'sees'. This can handle most typed text in PDFs/images without selectable text, but struggles with handwriting/signatures. If you are interested in the latter, then you should use the third option if available.
-- 'AWS Textract service - all PDF types' - Only available for instances of the app running on AWS. AWS Textract is a service that performs OCR on documents within their secure service. This is a more advanced version of OCR compared to the local option, and carries a (relatively small) cost. Textract excels at complex documents based on images, or documents that contain a lot of handwriting and signatures.

 If you are running with the AWS service enabled, here you will also have a choice for PII redaction method:
-- 'Local' - This uses the spacy package to rapidly detect PII in extracted text. This method is often sufficient if you are just interested in redacting specific terms defined in a custom list.
-- 'AWS Comprehend' - This method calls an AWS service to provide more accurate identification of PII in extracted text.

-Hit 'Redact document'. After loading in the document, the app should be able to process about 30 pages per minute (depending on the redaction methods chosen above). When ready, you should see a message saying that processing is complete, with output files appearing in the bottom right.

 ![Redaction outputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/redaction_outputs.PNG)

-- '...redacted.pdf' files contain the original pdf with suggested redacted text deleted and replaced by a black box on top of the document.
-- '...ocr_results.csv' files contain the line-by-line text outputs from the entire document. This file can be useful for later searching for any terms of interest in the document (e.g. using Excel or a similar program).
-- '...review_file.csv' files are the review files that contain details and locations of all of the suggested redactions in the document. This file is key to the [review process](#reviewing-and-modifying-suggested-redactions), and should be downloaded for use later in that process.

-Additional outputs are available under the 'Redaction settings' tab. Scroll to the bottom and you should see more files:

-![Additional processing outputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/redaction_additional_outputs.PNG)

-- '...review_file.json' is the same file as the review file above, but in .json format.
-- '...decision_process_output.csv' is also similar to the review file above, with a few more details on the location and scores of identified PII in the document.
-- If you are using AWS Textract, you should also get a .json file with the Textract outputs. It can be useful to retain this file to avoid having to repeatedly analyse the same document in future (this .json file can be uploaded into the app on the first redaction tab to load into local memory before redaction).

 We have covered redacting documents with the default redaction options. The '...redacted.pdf' file output may be enough for your purposes. But it is very likely that you will need to customise your redaction options, which we will cover below.

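The '...ocr_results.csv' output described above can also be searched programmatically rather than in Excel. A minimal pandas sketch, assuming a simple "page"/"text" column layout (an assumption for illustration — check the header of your own output file):

```python
import pandas as pd

# Stand-in for pd.read_csv("..._ocr_results.csv"); the "page" and "text"
# column names are assumptions, not the app's documented schema.
ocr = pd.DataFrame({
    "page": [1, 1, 2],
    "text": ["Dear Professor Smith,", "I am writing to complain", "Yours sincerely"],
})

# Case-insensitive search for a term of interest across all extracted lines
hits = ocr[ocr["text"].str.contains("complain", case=False)]
print(hits["page"].tolist())  # pages where the term appears
```

The same boolean-mask pattern works for any term list: build one mask per term and combine them with `|`.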
@@ -126,6 +175,16 @@ There may be full pages in a document that you want to redact. The app also prov

 Using the above approaches to allow, deny, and full page redaction lists will give you an output [like this](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/allow_list/Partnership-Agreement-Toolkit_0_0_redacted.pdf).

 ### Redacting additional types of personal information

 You may want to redact additional types of information beyond the defaults, or you may not be interested in default suggested entity types. There are dates in the example complaint letter. What if we wanted to redact those dates too?
@@ -146,7 +205,9 @@ Say also we are only interested in redacting page 1 of the loaded documents. On

 ## Handwriting and signature redaction

-The file [Partnership Agreement Toolkit (for signatures and more advanced usage)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/Partnership-Agreement-Toolkit_0_0.pdf) is provided as an example document to test AWS Textract + redaction with a document that has signatures in it. If you have access to AWS Textract in the app, try removing all entity types from redaction on the Redaction settings tab by clicking the big X to the right of 'Entities to redact'. Ensure that handwriting and signatures are enabled for redaction on the Redaction Settings tab (enabled by default):

 ![Handwriting and signatures](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/textract_handwriting_signatures.PNG)
@@ -156,72 +217,169 @@ The outputs should show handwriting/signatures redacted (see pages 5 - 7), which

 ## Reviewing and modifying suggested redactions

-Quite often there are certain terms suggested for redaction by the model that don't quite match what you intended. The app allows you to review and modify suggested redactions for the last file redacted. Refresh your browser tab. On the first tab 'PDFs/images' upload the ['Example of files sent to a professor before applying.pdf'](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_of_emails_sent_to_a_professor_before_applying.pdf) file. Let's stick with the 'Local model - selectable text' option, and click 'Redact document'. Once the outputs are created, go to the 'Review redactions' tab.

-On this tab you have a visual interface that allows you to inspect and modify redactions suggested by the app.

 ![Review redactions](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/review_redactions.PNG)

 You can change the page viewed either by clicking 'Previous page' or 'Next page', or by typing a specific page number in the 'Current page' box and pressing Enter on your keyboard. Each time you switch page, it will save redactions you have made on the page you are moving from, so you will not lose changes you have made.

-On your selected page, each redaction is highlighted with a box next to its suggested entity type. By default the interface allows you to modify existing redaction boxes. Click and hold on an existing box to move it. Click on one of the small boxes at the edges to change the size of the box. To delete a box, click on it to highlight it, then press delete on your keyboard. Alternatively, double click on a box and click 'Remove' on the box that appears.

-To change to 'add new redactions' mode, scroll to the bottom of the page. Click on the box icon, and your cursor will change into a crosshair. Now you can add new redaction boxes where you wish.

 ![Change redaction mode](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/change_review_mode.PNG)

-On the right of the screen there is a dropdown and table where you can filter to entity types that have been found throughout the document. You can choose a specific entity type to see which pages the entity is present on. If you want to go to the page specified in the table, you can click on a cell in the table and the review page will be changed to that page.

-![List and find labels](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/list_find_labels.PNG)

-Note that the table currently only shows entity types, not the specific text that was found. So, for instance, if you provide a list of specific terms to redact in the [deny list](#deny-list-example), they will all be labelled just as 'CUSTOM'. A planned near-term feature is to show the specific redacted text in this table, to give a better sense of the PII entities found.

-Once you are happy with your changes throughout the document, click 'Apply revised redactions' at the top of the page. The app will then run through all the pages in the document to update the redactions, and will output a modified PDF file. The modified PDF will appear at the top of the page in the file area. It will also output a revised '...review_file.csv' that you can then use for future review tasks.

 ![Review modified outputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/review_mod_outputs.PNG)

-Any feedback or comments on the app, please get in touch!

-# ADVANCED USER GUIDE

-This advanced user guide will go over some of the features recently added to the app, including: modifying and merging redaction review files, identifying and redacting duplicate pages across multiple PDFs, 'fuzzy' search and redact, and exporting redactions to Adobe Acrobat.

-## Table of contents

-- [Modifying and merging redaction review files](#modifying-and-merging-redaction-review-files)
-- [Modifying existing redaction review files](#modifying-existing-redaction-review-files)
-- [Merging existing redaction review files](#merging-existing-redaction-review-files)
-- [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
-- [Fuzzy search and redaction](#fuzzy-search-and-redaction)
-- [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
-- [Exporting to Adobe Acrobat](#exporting-to-adobe-acrobat)
-- [Importing from Adobe Acrobat](#importing-from-adobe-acrobat)

-## Modifying and merging redaction review files

-You can find the folder containing the files discussed in this section [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/).

-As well as serving as inputs to the document redaction app's review function, the 'review_file.csv' output can be modified outside of the app, and also merged with others from multiple redaction attempts on the same file. This gives you the flexibility to change redaction details outside of the app.

-### Modifying existing redaction review files
-If you open a 'review_file' csv output in a spreadsheet program such as Microsoft Excel, you can easily modify redaction properties. Open the file '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local.csv)', and you should see a spreadsheet with just four suggested redactions (see below). The following instructions are for Excel.

-![Review file before](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/review_file_before.PNG)

-The first thing we can do is remove the first row - 'et' is suggested as a person, but is obviously not a genuine instance of personal information. Right click on the row number and select Delete from the menu. Next, let's imagine that what the app identified as a 'phone number' was in fact another type of number, so we want to change the label. Simply click on the relevant label cell; let's change it to 'SECURITY_NUMBER'. You could also use 'Find & Select' -> 'Replace' from the top ribbon menu if you wanted to change a number of labels simultaneously.

-What if we wanted to change the colour of the 'email address' entry on the redaction review tab of the redaction app? The colours in a review file are based on an RGB scale with three numbers ranging from 0-255. [You can find suitable colours here](https://rgbcolorpicker.com). Using this scale, if I wanted my review box to be pure blue, I can change the cell value to (0,0,255).

-Imagine that a redaction box was slightly too small, and I didn't want to use the in-app options to change the size. In the review file csv, we can modify e.g. the ymin and ymax values for any box to increase the extent of the redaction box. For the 'email address' entry, let's decrease ymin by 5, and increase ymax by 5.

-I have saved an output file following the above steps as '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local_mod.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/outputs/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local_mod.csv)' in the same folder as the original. Let's upload this file to the app along with the original pdf to see how the redactions look now.

-![Review file after modification](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/partnership_redactions_after.PNG)

-We can see from the above that we have successfully removed a redaction box and changed labels, colours, and redaction box sizes.

-### Merging existing redaction review files

 Say you have run multiple redaction tasks on the same document, and you want to merge all of these redactions together. You could do this in your spreadsheet editor, but this could be fiddly, especially if dealing with multiple review files or large numbers of redactions. The app has a feature to combine multiple review files together to create a 'merged' review file.

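The spreadsheet edits described above (dropping a spurious row, relabelling, recolouring, and resizing a box) can also be scripted. A minimal pandas sketch, where the column names ("label", "color", "ymin", "ymax") and the stand-in rows are assumptions for illustration — check the header of your own review_file.csv before adapting this:

```python
import pandas as pd

# Stand-in for pd.read_csv(".../Partnership-Agreement-Toolkit_0_0.pdf_review_file_local.csv")
# with four suggested redactions; column names are assumed, not the app's documented schema.
df = pd.DataFrame({
    "label": ["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS", "DATE_TIME"],
    "color": ["(255,0,0)"] * 4,
    "ymin": [100.0, 200.0, 300.0, 400.0],
    "ymax": [110.0, 210.0, 310.0, 410.0],
})

df = df.drop(index=0).reset_index(drop=True)                        # remove the spurious 'et' row
df.loc[df["label"] == "PHONE_NUMBER", "label"] = "SECURITY_NUMBER"  # relabel the number
df.loc[df["label"] == "EMAIL_ADDRESS", "color"] = "(0,0,255)"       # pure blue box (RGB 0-255)

mask = df["label"] == "EMAIL_ADDRESS"
df.loc[mask, "ymin"] -= 5                                           # enlarge the box slightly
df.loc[mask, "ymax"] += 5

df.to_csv("review_file_mod.csv", index=False)                       # upload back into the app
```

The saved CSV can then be uploaded alongside the original PDF, just like the hand-edited Excel version above.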
@@ -303,6 +461,30 @@ When you click the 'convert .xfdf comment file to review_file.csv' button, the a

 ![Outputs from Adobe import](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/export_to_adobe/img/import_from_adobe_interface_outputs.PNG)

 ## Using AWS Textract and Comprehend when not running in an AWS environment

 AWS Textract and Comprehend give much better results for text extraction and document redaction than the local model options in the app. The most secure way to access them in the Redaction app is to run the app in a secure AWS environment with relevant permissions. Alternatively, you could run the app on your own system while logged in to AWS SSO with relevant permissions.
@@ -322,4 +504,26 @@ AWS_SECRET_KEY= your-secret-key

 The app should then pick up these keys when trying to access the AWS Textract and Comprehend services during redaction.

-Again, a lot can potentially go wrong with AWS solutions that are insecure, so before trying the above please consult with your AWS and data security teams.

 # USER GUIDE

+## Experiment with the test (public) version of the app
+You can test out many of the features described in this user guide at the [public test version of the app](https://huggingface.co/spaces/seanpedrickcase/document_redaction), which is free. AWS functions (e.g. Textract, Comprehend) are not enabled (unless you have valid API keys).
+
+## Chat over this user guide
+You can now [speak with a chat bot about this user guide](https://huggingface.co/spaces/seanpedrickcase/Light-PDF-Web-QA-Chatbot) (beta!)
+
 ## Table of contents

 - [Example data files](#example-data-files)

 - [Redacting only specific pages](#redacting-only-specific-pages)
 - [Handwriting and signature redaction](#handwriting-and-signature-redaction)
 - [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)
+- [Redacting tabular data files (CSV/XLSX) or copy and pasted text](#redacting-tabular-data-files-xlsxcsv-or-copy-and-pasted-text)

 See the [advanced user guide here](#advanced-user-guide):
+- [Merging redaction review files](#merging-redaction-review-files)
 - [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
 - [Fuzzy search and redaction](#fuzzy-search-and-redaction)
 - [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
 - [Exporting to Adobe Acrobat](#exporting-to-adobe-acrobat)
 - [Importing from Adobe Acrobat](#importing-from-adobe-acrobat)
+- [Using the AWS Textract document API](#using-the-aws-textract-document-api)
 - [Using AWS Textract and Comprehend when not running in an AWS environment](#using-aws-textract-and-comprehend-when-not-running-in-an-aws-environment)
+- [Modifying existing redaction review files](#modifying-existing-redaction-review-files)

 ## Example data files

+Please try these example files to follow along with this guide:
 - [Example of files sent to a professor before applying](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_of_emails_sent_to_a_professor_before_applying.pdf)
 - [Example complaint letter (jpg)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_complaint_letter.jpg)
 - [Partnership Agreement Toolkit (for signatures and more advanced usage)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/Partnership-Agreement-Toolkit_0_0.pdf)
+- [Dummy case note data](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/combined_case_notes.csv)
 
63
  ## Basic redaction
64
 
65
+ The document redaction app can detect personally-identifiable information (PII) in documents. Documents can be redacted directly, or suggested redactions can be reviewed and modified using a grapical user interface. Basic document redaction can be performed quickly using the default options.
66
 
67
  Download the example PDFs above to your computer. Open up the redaction app with the link provided by email.
68
 
69
  ![Upload files](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/file_upload_highlight.PNG)
70
 
71
+ ### Upload files to the app
72
+
73
+ The 'Redact PDFs/images tab' currently accepts PDFs and image files (JPG, PNG) for redaction. Click on the 'Drop files here or Click to Upload' area of the screen, and select one of the three different [example files](#example-data-files) (they should all be stored in the same folder if you want them to be redacted at the same time).
74
+
75
+ ### Text extraction
76
 
77
+ First, select one of the three text extraction options:
78
+ - **'Local model - selectable text'** - This will read text directly from PDFs that have selectable text to redact (using PikePDF). This is fine for most PDFs, but will find nothing if the PDF does not have selectable text, and it is not good for handwriting or signatures. If it encounters an image file, it will send it onto the second option below.
79
+ - **'Local OCR model - PDFs without selectable text'** - This option will use a simple Optical Character Recognition (OCR) model (Tesseract) to pull out text from a PDF/image that it 'sees'. This can handle most typed text in PDFs/images without selectable text, but struggles with handwriting/signatures. If you are interested in the latter, then you should use the third option if available.
80
+ - **'AWS Textract service - all PDF types'** - Only available for instances of the app running on AWS. AWS Textract is a service that performs OCR on documents within their secure service. This is a more advanced version of OCR compared to the local option, and carries a (relatively small) cost. Textract excels in complex documents based on images, or documents that contain a lot of handwriting and signatures.
81
+
82
+ ### Optional - select signature extraction
83
+ If you chose the AWS Textract service above, you can choose if you want handwriting and/or signatures redacted by default. Choosing signatures here will have a cost implication, as identifying signatures will cost ~£2.66 ($3.50) per 1,000 pages vs ~£1.14 ($1.50) per 1,000 pages without signature detection.
84
+
85
+ ![AWS Textract handwriting and signature options](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/textract_handwriting_signatures.PNG)
86
+
87
+ ### PII redaction method
88
 
89
  If you are running with the AWS service enabled, here you will also have a choice for PII redaction method:
90
+ - **'Only extract text - (no redaction)'** - If you are only interested in getting the text out of the document for further processing (e.g. to find duplicate pages, or to review text on the Review redactions page)
91
+ - **'Local'** - This uses the spacy package to rapidly detect PII in extracted text. This method is often sufficient if you are just interested in redacting specific terms defined in a custom list.
92
+ - **'AWS Comprehend'** - This method calls an AWS service to provide more accurate identification of PII in extracted text.
93
+
94
+ ### Optional - costs and time estimation
95
+ If the option is enabled (by your system admin, in the config file), you will see a cost and time estimate for the redaction process. 'Existing Textract output file found' will be checked automatically if previous Textract text extraction files exist in the output folder, or have been [previously uploaded by the user](#aws-textract-outputs) (saving time and money for redaction).
96
+
97
+ ![Cost and time estimation](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/costs_and_time.PNG)
98
+
99
+ ### Optional - cost code selection
100
+ If the option is enabled (by your system admin, in the config file), you may be prompted to select a cost code before continuing with the redaction task.
101
+
102
+ ![Cost code selection](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/cost_code_selection.PNG)
103
+
104
+ The relevant cost code can be found either by: 1. Using the search bar above the data table to find relevant cost codes, then clicking on the relevant row, or 2. typing it directly into the dropdown to the right, where it should filter as you type.
105
+
106
+ ### Optional - Submit whole documents to Textract API
107
+ If this option is enabled (by your system admin, in the config file), you will have the option to submit whole documents in quick succession to the AWS Textract service to get extracted text outputs quickly (faster than using the 'Redact document' process described here). This feature is described in more detail in the [advanced user guide](#using-the-aws-textract-document-api).
108
+
109
+ ![Textract document API](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/textract_document_api.PNG)
110
+
111
+ ### Redact the document
112
+
113
+ Click 'Redact document'. After loading in the document, the app should be able to process about 30 pages per minute (depending on redaction methods chose above). When ready, you should see a message saying that processing is complete, with output files appearing in the bottom right.
114
 
115
+ ### Redaction outputs
116
 
117
  ![Redaction outputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/redaction_outputs.PNG)
118
 
119
+ - **'...redacted.pdf'** files contain the original pdf with suggested redacted text deleted and replaced by a black box on top of the document.
120
+ - **'...ocr_results.csv'** files contain the line-by-line text outputs from the entire document. This file can be useful for later searching through for any terms of interest in the document (e.g. using Excel or a similar program).
121
+ - **'...review_file.csv'** files are the review files that contain details and locations of all of the suggested redactions in the document. This file is key to the [review process](#reviewing-and-modifying-suggested-redactions), and should be downloaded to use later for this.
122
 
123
+ ### Additional AWS Textract / local OCR outputs
124
 
125
+ If you have used the AWS Textract option for extracting text, you may also see a '..._textract.json' file. This file contains all the relevant extracted text information that comes from the AWS Textract service. You can keep this file and upload it at a later date alongside your input document, which will enable you to skip calling AWS Textract every single time you want to do a redaction task, as follows:
126
 
127
+ ![Document upload alongside Textract](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/document_upload_with_textract.PNG)
128
+
129
+ Similarly, if you have used the 'Local OCR method' to extract text, you may see a '..._ocr_results_with_words.json' file. This file works in the same way as the AWS Textract .json results described above, and can be uploaded alongside an input document to save time on text extraction in future in the same way.
130
+
131
+ ### Downloading output files from previous redaction tasks
132
+
133
+ If you are logged in via AWS Cognito and you lose your app page for some reason (e.g. from a crash, reloading), it is possible recover your previous output files, provided the server has not been shut down since you redacted the document. Go to 'Redaction settings', then scroll to the bottom to see 'View all output files from this session'.
134
+
135
+ ![View all output files](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/view_all_output_files.PNG)
136
+
137
+ ### Basic redaction summary
138
 
139
  We have covered redacting documents with the default redaction options. The '...redacted.pdf' file output may be enough for your purposes. But it is very likely that you will need to customise your redaction options, which we will cover below.
140
 
 
175
 
176
  Using the above approaches to allow, deny, and full page redaction lists will give you an output [like this](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/allow_list/Partnership-Agreement-Toolkit_0_0_redacted.pdf).
177
 
178
+ #### Adding to the loaded allow, deny, and whole page lists in-app
179
+
180
+ If you open the accordion below the allow list options called 'Manually modify custom allow...', you should be able to see a few tables with options to add new rows:
181
+
182
+ ![Manually modify allow or deny list](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/allow_list/manually_modify.PNG)
183
+
184
+ If the table is empty, you can add a new entry, you can add a new row by clicking on the '+' item below each table header. If there is existing data, you may need to click on the three dots to the right and select 'Add row below'. Type the item you wish to keep/remove in the cell, and then (important) press enter to add this new item to the allow/deny/whole page list. Your output tables should look something like below.
185
+
186
+ ![Manually modify allow or deny list filled](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/allow_list/manually_modify_filled.PNG)
187
+
188
  ### Redacting additional types of personal information
189
 
190
  You may want to redact additional types of information beyond the defaults, or you may not be interested in default suggested entity types. There are dates in the example complaint letter. Say we wanted to redact those dates also?
 
205
 
206
  ## Handwriting and signature redaction
207
 
+ The file [Partnership Agreement Toolkit (for signatures and more advanced usage)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/Partnership-Agreement-Toolkit_0_0.pdf) is provided as an example document for testing AWS Textract and redaction on a document that contains signatures. If you have access to AWS Textract in the app, try removing all entity types from redaction on the Redaction settings tab by clicking the big X to the right of 'Entities to redact'.
+
+ Handwriting and signature detection are enabled by default. To enable or disable them, go to the 'AWS Textract signature detection' options on the front screen:
 
  ![Handwriting and signatures](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/textract_handwriting_signatures.PNG)
 
 
  ## Reviewing and modifying suggested redactions
 
+ Sometimes the app will suggest redactions that are incorrect, or will miss personal information entirely. The app allows you to review and modify suggested redactions to compensate for this. You can do this on the 'Review redactions' tab.
+
+ We will go through ways to review suggested redactions with an example. On the first tab, 'PDFs/images', upload the ['Example of files sent to a professor before applying.pdf'](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_of_emails_sent_to_a_professor_before_applying.pdf) file. Let's stick with the 'Local model - selectable text' option, and click 'Redact document'. Once the outputs are created, go to the 'Review redactions' tab.
 
+ On the 'Review redactions' tab you have a visual interface that allows you to inspect and modify redactions suggested by the app. There are quite a few options to look at, so we'll go from top to bottom.
 
  ![Review redactions](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/review_redactions.PNG)
 
+ ### Uploading documents for review
+
+ The top area has a file upload box where you can upload the original, unredacted PDF alongside the '..._review_file.csv' that is produced by the redaction process. Once you have uploaded these two files, click the 'Review PDF...' button to load the files for review. This will allow you to visualise and modify the suggested redactions using the interface below.
+
+ Optionally, you can also upload one of the '..._ocr_output.csv' files here that comes out of a redaction task, so that you can navigate the text extracted from the document.
+
+ ![Search extracted text](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
+
+ You can upload all three review files in the box (unredacted document, '..._review_file.csv', and '..._ocr_output.csv') before clicking 'Review PDF...', as in the image below:
+
+ ![Upload three files for review](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/upload_three_files.PNG)
+
+ **NOTE:** ensure you upload the ***unredacted*** document here and not the redacted version, otherwise you will be checking over a document that already has redaction boxes applied!
+
+ ### Page navigation
+
  You can change the page viewed either by clicking 'Previous page' or 'Next page', or by typing a specific page number in the 'Current page' box and pressing Enter on your keyboard. Each time you switch page, the app saves any redactions you have made on the page you are moving from, so you will not lose the changes you have made.
 
+ You can also navigate to different pages by clicking on rows in the tables under 'Search suggested redactions' to the right, or 'Search all extracted text' (if enabled) beneath that.
 
+ ### The document viewer pane
+
+ On the selected page, each redaction is highlighted with a box next to its suggested redaction label (e.g. person, email).
+
+ ![Document view pane](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/document_viewer_pane.PNG)
+
+ There are a number of options for adding and modifying redaction boxes, and for changing pages, in the document viewer pane. To zoom in and out of the page, use your mouse wheel. To move around the page while zoomed, you need to be in modify mode. Scroll to the bottom of the document viewer to see the relevant controls. You should see a box icon, a hand icon, and two arrows pointing counter-clockwise and clockwise.
 
  ![Change redaction mode](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/change_review_mode.PNG)
 
+ Click on the hand icon to go into modify mode. When zoomed in, you can then click and hold on the document viewer to move around the page. To rotate the page, click on either of the round arrow buttons to turn in that direction.
+
+ **NOTE:** When you switch page, the viewer will stay in your selected orientation, so if it looks strange, just rotate the page again and hopefully it will look correct!
+
+ #### Modify existing redactions (hand icon)
+
+ After clicking on the hand icon, the interface allows you to modify existing redaction boxes. When in this mode, you can click and hold on an existing box to move it.
+
+ ![Modify existing redaction box](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/modify_existing_redaction_box.PNG)
+
+ Click on one of the small boxes at the edges to change the size of the box. To delete a box, click on it to highlight it, then press Delete on your keyboard. Alternatively, double click on a box and click 'Remove' on the popup that appears.
+
+ ![Remove existing redaction box](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/existing_redaction_box_remove.PNG)
+
+ #### Add new redaction boxes (box icon)
+
+ To change to 'add redaction boxes' mode, scroll to the bottom of the page. Click on the box icon, and your cursor will change into a crosshair. Now you can add new redaction boxes where you wish. A popup will appear when you create a new box so you can select a label and colour for the new box.
+
+ #### 'Locking in' new redaction box format
 
+ It is possible to lock in a chosen format for new redaction boxes so that the popup does not appear each time. When you make a new box, select the options for your 'locked' format, and then click on the lock icon on the left side of the popup, which should turn blue.
 
+ ![Lock redaction box format](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/new_redaction_box_lock_mode.PNG)
 
+ You can now add new redaction boxes without a popup appearing. If you want to change or 'unlock' your chosen box format, click on the new icon that has appeared at the bottom of the document viewer pane that looks a little like a gift tag. You can then change the defaults, or click on the lock icon again to 'unlock' the new box format - popups will then appear again each time you create a new box.
+
+ ![Change or unlock redaction box format](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/change_review_mode_with_lock.PNG)
+
+ ### Apply redactions to PDF and Save changes on current page
+
+ Once you have reviewed all the redactions in your document and you are happy with the outputs, click 'Apply revised redactions to PDF' to create a new '_redacted.pdf' output alongside a new '_review_file.csv' output.
+
+ If you are working on a page and haven't saved for a while, you can click 'Save changes on current page to file' to ensure that your changes are saved to an updated 'review_file.csv' output.
 
  ![Review modified outputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/review_mod_outputs.PNG)
 
+ ### Selecting and removing redaction boxes using the 'Search suggested redactions' table
 
+ The table shows a list of all the suggested redactions in the document alongside the page, label, and text (if available).
 
+ ![Search suggested redaction area](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/list_find_labels.PNG)
 
+ If you click on one of the rows in this table, you will be taken to the page of the redaction. Clicking on a redaction row on the same page *should* change the colour of the redaction box to blue to help you locate it in the document viewer (in the app only, not in redaction output PDFs).
 
+ ![Redaction row highlighted in the document viewer](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/review_row_highlight.PNG)
 
+ You can choose a specific entity type to see which pages the entity is present on. If you want to go to the page specified in the table, click on a cell in the table and the review page will change to that page.
 
+ To filter the 'Search suggested redactions' table you can:
+ 1. Click on one of the dropdowns (Redaction category, Page, Text) and select an option, or
+ 2. Write text in the 'Filter' box just above the table, then click the blue tick to apply the filter to the table.
 
+ Once you have filtered the table, you have a few options underneath for what you can do with the filtered rows:
 
+ - Click the 'Exclude specific row from redactions' button to remove from the document only the redaction in the last row you clicked on.
+ - Click the 'Exclude all items in table from redactions' button to remove all redactions visible in the table from the document. **Important:** ensure that you have clicked the blue tick icon next to the search box before doing this, or you will remove all redactions from the document. If you do end up doing this, click the 'Undo last element removal' button below to restore the redactions.
 
+ **NOTE**: After excluding redactions using either of the above options, click the 'Reset filters' button below to ensure that the dropdowns and table return to showing all remaining redactions in the document.
 
+ If you made a mistake, click the 'Undo last element removal' button to restore the 'Search suggested redactions' table to its previous state (only the last action can be undone).
 
+ ### Navigating through the document using the 'Search all extracted text' table
 
+ The 'Search all extracted text' table will contain text if you have just redacted a document, or if you have uploaded a '..._ocr_output.csv' file alongside a document file and review file on the Review redactions tab as [described above](#uploading-documents-for-review).
 
+ You can navigate through the document using this table. When you click on a row, the document viewer pane to the left will change to the selected page.
 
+ ![Select a row of extracted text](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/select_extracted_text.PNG)
 
+ You can search through the extracted text using the search bar just above the table, which filters as you type. To apply the filter and 'cut' the table, click on the blue tick inside the box next to your search term. To return the table to its original content, click the 'Reset OCR output table filter' button below the table.
+
+ ![Search extracted text](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/search_extracted_text.PNG)
+
+ ## Redacting tabular data files (XLSX/CSV) or copy and pasted text
+
+ ### Tabular data files (XLSX/CSV)
+
+ The app can be used to redact tabular data files such as xlsx or csv files. For this to work properly, your data file needs to be in a simple table format, with a single table starting from the first cell (A1) and no other information in the sheet. Similarly, for .xlsx files, each sheet that you want to redact should be in this simple format.
+
+ To demonstrate this, we can use [the example csv file 'combined_case_notes.csv'](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/combined_case_notes.csv), which is a small dataset of dummy social care case notes. Go to the 'Open text or Excel/csv files' tab and drop the file into the upload area. After the file is loaded, you should see the suggested columns for redaction in the box underneath. You can select and deselect columns to redact from this list.
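
The column suggestion step can be thought of as picking out the text-like columns of the table. The sketch below shows one way this might work with pandas; the helper name and the illustrative data are assumptions for this example, not the app's actual code.

```python
import pandas as pd

def candidate_columns(df: pd.DataFrame) -> list[str]:
    """Return the non-numeric columns, which are the usual candidates for redaction."""
    return [col for col in df.columns if not pd.api.types.is_numeric_dtype(df[col])]

# Illustrative data standing in for a case-notes csv in 'simple table' format
notes = pd.DataFrame({
    "case_id": [1, 2],
    "note_text": ["Spoke with John Smith today.", "Email sent to jane@example.com."],
})

print(candidate_columns(notes))  # → ['note_text']
```

In the app you then tick or untick columns from this suggested list before redacting.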
+
+ ![csv upload](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/tabular_files/file_upload_csv_columns.PNG)
+
+ If you were instead to upload an xlsx file, you would also see a list of all the sheets in the xlsx file that can be redacted. The 'Select columns' area underneath will suggest a list of all columns in the file across all sheets.
+
+ ![xlsx upload](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/tabular_files/file_upload_xlsx_columns.PNG)
+
+ Once you have chosen your input file and the sheets/columns to redact, you can choose the redaction method. 'Local' will use the same local model as used for documents on the first tab. 'AWS Comprehend' will give better results, at a slight cost.
+
+ When you click 'Redact text/data files', you will see the progress of the redaction task by file and sheet, and you will receive a csv output with the redacted data.
+
 
+ ### Choosing output anonymisation format
+
+ You can also choose the anonymisation format of your output results. Open the 'Anonymisation output format' tab to see the options. By default, any detected PII will be replaced with the word 'REDACTED' in the cell. You can choose one of the following options as the form of replacement for the redacted text:
+ - replace with 'REDACTED': Replaced by the word 'REDACTED' (default)
+ - replace with `<ENTITY_NAME>`: Replaced by e.g. 'PERSON' for people, 'EMAIL_ADDRESS' for emails, etc.
+ - redact completely: Text is removed completely and replaced by nothing.
+ - hash: Replaced by a unique long ID code that is consistent for a given entity text, i.e. a particular name will always map to the same ID code.
+ - mask: Replaced with stars '*'.
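
The five formats above can be sketched as a single replacement function. This is an illustration only: the app's actual implementation and its hashing scheme may differ (SHA-256 is an assumption here), but it shows why 'hash' gives a consistent ID for the same entity text.

```python
import hashlib

def anonymise(text: str, entity: str, fmt: str) -> str:
    """Sketch of the five replacement forms described above."""
    if fmt == "redacted":
        return "REDACTED"
    if fmt == "entity_name":
        return entity  # e.g. 'PERSON', 'EMAIL_ADDRESS'
    if fmt == "remove":
        return ""
    if fmt == "hash":
        # Hashing the entity text means the same name always maps to the same ID
        return hashlib.sha256(text.encode("utf-8")).hexdigest()
    if fmt == "mask":
        return "*" * len(text)
    raise ValueError(f"unknown format: {fmt}")

print(anonymise("John", "PERSON", "mask"))  # → ****
```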
 
+ ### Redacting copy and pasted text
+
+ You can also write or paste open text into an input box and redact it using the same methods described above. To do this, write or paste text into the 'Enter open text' box that appears when you open the 'Redact open text' tab. Then select a redaction method and an anonymisation output format as described above. The redacted text will be printed in the output textbox, and will also be saved to a simple csv file in the output file box.
+
+ ![Text analysis output](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/tabular_files/text_anonymisation_outputs.PNG)
+
+ ### Redaction log outputs
+
+ A list of the suggested redaction outputs from the tabular data / open text redaction is available on the Redaction settings page under 'Log file outputs'.
 
+ # ADVANCED USER GUIDE
+
+ This advanced user guide will go over some of the features recently added to the app, including: modifying and merging redaction review files, identifying and redacting duplicate pages across multiple PDFs, 'fuzzy' search and redact, and exporting redactions to Adobe Acrobat.
+
+ ## Table of contents
+
+ - [Merging redaction review files](#merging-redaction-review-files)
+ - [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
+ - [Fuzzy search and redaction](#fuzzy-search-and-redaction)
+ - [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
+   - [Exporting to Adobe Acrobat](#exporting-to-adobe-acrobat)
+   - [Importing from Adobe Acrobat](#importing-from-adobe-acrobat)
+ - [Using the AWS Textract document API](#using-the-aws-textract-document-api)
+ - [Using AWS Textract and Comprehend when not running in an AWS environment](#using-aws-textract-and-comprehend-when-not-running-in-an-aws-environment)
+ - [Modifying existing redaction review files](#modifying-existing-redaction-review-files)
 
 
+ ## Merging redaction review files
 
  Say you have run multiple redaction tasks on the same document, and you want to merge all of these redactions together. You could do this in your spreadsheet editor, but it could be fiddly, especially if you are dealing with multiple review files or large numbers of redactions. The app has a feature to combine multiple review files together into a single 'merged' review file.
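
Conceptually, merging review files amounts to stacking the tables and dropping boxes that appear in more than one file. A rough pandas sketch of that idea is below; the column names are illustrative (real review files carry more columns, such as coordinates, colour, and text), and the app's built-in merge feature is the recommended route.

```python
import pandas as pd

def merge_review_frames(frames: list[pd.DataFrame]) -> pd.DataFrame:
    """Stack several review files for the same document and drop
    redaction boxes that appear in more than one of them."""
    combined = pd.concat(frames, ignore_index=True)
    return combined.drop_duplicates().reset_index(drop=True)

# Two redaction runs over the same document, with one overlapping box
a = pd.DataFrame({"page": [1, 2], "label": ["PERSON", "EMAIL_ADDRESS"]})
b = pd.DataFrame({"page": [2, 3], "label": ["EMAIL_ADDRESS", "PHONE_NUMBER"]})
merged = merge_review_frames([a, b])
print(len(merged))  # → 3
```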
 
 
 
  ![Outputs from Adobe import](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/export_to_adobe/img/import_from_adobe_interface_outputs.PNG)
 
+ ## Using the AWS Textract document API
+
+ This option can be enabled by your system admin in the config file (the 'SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS' environment variable, and subsequent variables). With it enabled, you have the option to submit whole documents in quick succession to the AWS Textract service to get extracted text outputs quickly (faster than the 'Redact document' process described above).
+
+ ### Starting a new Textract API job
+
+ To use this feature, first upload a document file in the file input box [in the usual way](#upload-files-to-the-app) on the first tab of the app. Under 'AWS Textract signature detection' you can select whether or not you would like to analyse signatures (with a [cost implication](#optional---select-signature-extraction)).
+
+ Then, open the section under the heading 'Submit whole document to AWS Textract API...'.
+
+ ![Textract document API menu](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/textract_document_api.PNG)
+
+ Click 'Analyse document with AWS Textract API call'. After a few seconds, the job should be submitted to the AWS Textract service. The box 'Job ID to check status' should now have an ID filled in, and the table should have a row added with details of the new API job (alongside any previous jobs up to seven days old).
+
+ Click the button underneath, 'Check status of Textract job and download', to see progress on the job. Processing will continue in the background until the job is ready, so it is worth periodically clicking this button to see if the outputs are ready. In testing, as a rough estimate, this process takes about five seconds per page; however, it has not been tested with very large documents. Once ready, the '_textract.json' output should appear below.
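
Behind the scenes this is AWS Textract's asynchronous API: a job is started against a document on S3, and the job ID is then polled until the job reaches a terminal state. The sketch below shows the general shape using boto3; the bucket, key, and region are placeholders, and the app manages all of this for you.

```python
def start_textract_job(bucket: str, key: str, region: str = "eu-west-2") -> str:
    """Submit a whole S3-hosted document for asynchronous text extraction.
    Returns the job ID used to poll for results."""
    import boto3  # deferred import: only needed when actually calling AWS
    client = boto3.client("textract", region_name=region)
    response = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return response["JobId"]

def job_finished(status_response: dict) -> bool:
    """Textract reports IN_PROGRESS until the job reaches a terminal state."""
    return status_response.get("JobStatus") in {"SUCCEEDED", "FAILED", "PARTIAL_SUCCESS"}

# Polling sketch (requires AWS credentials and a real bucket/key):
# job_id = start_textract_job("my-bucket", "docs/example.pdf")
# status = boto3.client("textract").get_document_text_detection(JobId=job_id)
# if job_finished(status): ...
```

Polling in a loop like this is why it is worth clicking 'Check status' periodically rather than expecting instant results.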
 
+ ### Textract API job outputs
+
+ The '_textract.json' output can be used to speed up further redaction tasks as [described previously](#optional---costs-and-time-estimation): the 'Existing Textract output file found' flag should now be ticked.
+
+ ![Textract document API initial outputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/textract_api/textract_api_initial_outputs.PNG)
+
+ You can now easily get the '..._ocr_output.csv' redaction output based on this '_textract.json' (described in [Redaction outputs](#redaction-outputs)) by clicking on the 'Convert Textract job outputs to OCR results' button. You can then use this file e.g. for [identifying duplicate pages](#identifying-and-redacting-duplicate-pages), or for redaction review.
+
  ## Using AWS Textract and Comprehend when not running in an AWS environment
 
  AWS Textract and Comprehend give much better results for text extraction and document redaction than the local model options in the app. The most secure way to access them in the Redaction app is to run the app in a secure AWS environment with the relevant permissions. Alternatively, you could run the app on your own system while logged in to AWS SSO with the relevant permissions.
 
 
  The app should then pick up these keys when trying to access the AWS Textract and Comprehend services during redaction.
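
For context, boto3 (the AWS SDK the app is built on) resolves credentials from a standard chain, which includes the environment variables below. This sketch uses obvious placeholder values; never hard-code or commit real keys, and in the app itself the keys are taken from the password textboxes instead.

```python
import os

# Standard variable names checked by boto3's credential resolution chain;
# the values here are placeholders only
os.environ["AWS_ACCESS_KEY_ID"] = "AKIA-EXAMPLE-KEY-ID"
os.environ["AWS_SECRET_ACCESS_KEY"] = "example-secret-key"
os.environ["AWS_DEFAULT_REGION"] = "eu-west-2"

def textract_client():
    """boto3 picks up the credentials above automatically."""
    import boto3  # deferred import: only needed when actually calling AWS
    return boto3.client("textract")
```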
 
+ Again, a lot can go wrong with insecure AWS setups, so before trying the above please consult your AWS and data security teams.
+
+ ## Modifying existing redaction review files
+
+ You can find the folder containing the files discussed in this section [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/).
+
+ As well as serving as inputs to the document redaction app's review function, the 'review_file.csv' output can be modified outside of the app, and also merged with others from multiple redaction attempts on the same file. This gives you the flexibility to change redaction details outside of the app.
+
+ If you open a 'review_file' csv output using a spreadsheet program such as Microsoft Excel, you can easily modify redaction properties. Open the file '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local.csv)', and you should see a spreadsheet with just four suggested redactions (see below). The following instructions are for Excel.
+
+ ![Review file before](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/review_file_before.PNG)
+
+ The first thing we can do is remove the first row - 'et' is suggested as a person, but is obviously not a genuine instance of personal information. Right click on the row number and select 'Delete' from the menu. Next, let's imagine that what the app identified as a 'phone number' was in fact another type of number, and so we want to change the label. Simply click on the relevant label cell and change it, for example to 'SECURITY_NUMBER'. You could also use 'Find & Select' -> 'Replace' from the top ribbon menu if you wanted to change a number of labels simultaneously.
+
+ What if we wanted to change the colour of the 'email address' entry on the redaction review tab of the app? The colours in a review file are based on an RGB scale, with three numbers ranging from 0-255. [You can find suitable colours here](https://rgbcolorpicker.com). Using this scale, if I wanted my review box to be pure blue, I can change the cell value to (0,0,255).
+
+ Imagine that a redaction box was slightly too small, and I didn't want to use the in-app options to change its size. In the review file csv, we can modify e.g. the ymin and ymax values for any box to increase the extent of the redaction box. For the 'email address' entry, let's decrease ymin by 5 and increase ymax by 5.
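
The Excel steps above can equally be scripted. A pandas sketch is below; the data is illustrative and the column names (page, label, color, ymin, ymax) follow the review-file layout described above, so check them against your own file before relying on this.

```python
import pandas as pd

# Illustrative rows standing in for a review_file.csv
review = pd.DataFrame({
    "page":  [1, 1, 2],
    "label": ["PERSON", "PHONE_NUMBER", "EMAIL_ADDRESS"],
    "color": ["(0,0,0)", "(0,0,0)", "(0,0,0)"],
    "ymin":  [100.0, 200.0, 300.0],
    "ymax":  [110.0, 210.0, 310.0],
})

review = review[review["label"] != "PERSON"]                       # drop a false positive
review.loc[review["label"] == "PHONE_NUMBER", "label"] = "SECURITY_NUMBER"
mask = review["label"] == "EMAIL_ADDRESS"
review.loc[mask, "color"] = "(0,0,255)"                            # pure blue
review.loc[mask, ["ymin", "ymax"]] += [-5, 5]                      # enlarge the box
review.to_csv("review_file_mod.csv", index=False)
```

The saved csv can then be uploaded back to the Review redactions tab alongside the original pdf, just like a review file edited in Excel.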
 
+ I have saved an output file following the above steps as '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local_mod.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/outputs/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local_mod.csv)' in the same folder as the original. Let's upload this file to the app along with the original pdf to see how the redactions look now.
+
+ ![Review file after modification](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/partnership_redactions_after.PNG)
+
+ We can see from the above that we have successfully removed a redaction box and changed labels, colours, and redaction box sizes.
app.py CHANGED
@@ -4,11 +4,11 @@ import pandas as pd
  import gradio as gr
  from gradio_image_annotation import image_annotator
 
- from tools.config import OUTPUT_FOLDER, INPUT_FOLDER, RUN_DIRECT_MODE, MAX_QUEUE_SIZE, DEFAULT_CONCURRENCY_LIMIT, MAX_FILE_SIZE, GRADIO_SERVER_PORT, ROOT_PATH, GET_DEFAULT_ALLOW_LIST, ALLOW_LIST_PATH, S3_ALLOW_LIST_PATH, FEEDBACK_LOGS_FOLDER, ACCESS_LOGS_FOLDER, USAGE_LOGS_FOLDER, TESSERACT_FOLDER, POPPLER_FOLDER, REDACTION_LANGUAGE, GET_COST_CODES, COST_CODES_PATH, S3_COST_CODES_PATH, ENFORCE_COST_CODES, DISPLAY_FILE_NAMES_IN_LOGS, SHOW_COSTS, RUN_AWS_FUNCTIONS, DOCUMENT_REDACTION_BUCKET, SHOW_BULK_TEXTRACT_CALL_OPTIONS, TEXTRACT_BULK_ANALYSIS_BUCKET, TEXTRACT_BULK_ANALYSIS_INPUT_SUBFOLDER, TEXTRACT_BULK_ANALYSIS_OUTPUT_SUBFOLDER, SESSION_OUTPUT_FOLDER, LOAD_PREVIOUS_TEXTRACT_JOBS_S3, TEXTRACT_JOBS_S3_LOC, TEXTRACT_JOBS_LOCAL_LOC, HOST_NAME, DEFAULT_COST_CODE, OUTPUT_COST_CODES_PATH, OUTPUT_ALLOW_LIST_PATH
- from tools.helper_functions import put_columns_in_df, get_connection_params, reveal_feedback_buttons, custom_regex_load, reset_state_vars, load_in_default_allow_list, tesseract_ocr_option, text_ocr_option, textract_option, local_pii_detector, aws_pii_detector, no_redaction_option, reset_review_vars, merge_csv_files, load_all_output_files, update_dataframe, check_for_existing_textract_file, load_in_default_cost_codes, enforce_cost_codes, calculate_aws_costs, calculate_time_taken, reset_base_dataframe, reset_ocr_base_dataframe, update_cost_code_dataframe_from_dropdown_select
- from tools.aws_functions import upload_file_to_s3, download_file_from_s3
  from tools.file_redaction import choose_and_run_redactor
- from tools.file_conversion import prepare_image_or_pdf, get_input_file_names, convert_review_df_to_annotation_json
  from tools.redaction_review import apply_redactions_to_review_df_and_files, update_all_page_annotation_object_based_on_previous_page, decrease_page, increase_page, update_annotator_object_and_filter_df, update_entities_df_recogniser_entities, update_entities_df_page, update_entities_df_text, df_select_callback, convert_df_to_xfdf, convert_xfdf_to_dataframe, reset_dropdowns, exclude_selected_items_from_redaction, undo_last_removal, update_selected_review_df_row_colour, update_all_entity_df_dropdowns, df_select_callback_cost, update_other_annotator_number_from_current, update_annotator_page_from_review_df, df_select_callback_ocr, df_select_callback_textract_api
  from tools.data_anonymise import anonymise_data_files
  from tools.auth import authenticate_user
 @@ -44,6 +44,19 @@ else:
  default_ocr_val = text_ocr_option
  default_pii_detector = local_pii_detector
 
  # Create the gradio interface
  app = gr.Blocks(theme = gr.themes.Base(), fill_width=True)
 
@@ -61,6 +74,9 @@ with app:
  all_decision_process_table_state = gr.Dataframe(value=pd.DataFrame(), headers=None, col_count=0, row_count = (0, "dynamic"), label="all_decision_process_table", visible=False, type="pandas", wrap=True)
  review_file_state = gr.Dataframe(value=pd.DataFrame(), headers=None, col_count=0, row_count = (0, "dynamic"), label="review_file_df", visible=False, type="pandas", wrap=True)
 
  session_hash_state = gr.Textbox(label= "session_hash_state", value="", visible=False)
  host_name_textbox = gr.Textbox(label= "host_name_textbox", value=HOST_NAME, visible=False)
  s3_output_folder_state = gr.Textbox(label= "s3_output_folder_state", value="", visible=False)
@@ -105,7 +121,12 @@ with app:
 
  doc_full_file_name_textbox = gr.Textbox(label = "doc_full_file_name_textbox", value="", visible=False)
  doc_file_name_no_extension_textbox = gr.Textbox(label = "doc_full_file_name_textbox", value="", visible=False)
- blank_doc_file_name_no_extension_textbox_for_logs = gr.Textbox(label = "doc_full_file_name_textbox", value="", visible=False) # Left blank for when user does not want to report file names
 
  doc_file_name_with_extension_textbox = gr.Textbox(label = "doc_file_name_with_extension_textbox", value="", visible=False)
  doc_file_name_textbox_list = gr.Dropdown(label = "doc_file_name_textbox_list", value="", allow_custom_value=True,visible=False)
  latest_review_file_path = gr.Textbox(label = "latest_review_file_path", value="", visible=False) # Latest review file path output from redaction
@@ -149,9 +170,9 @@ with app:
  s3_default_allow_list_file = gr.Textbox(label = "Default allow list file", value=S3_ALLOW_LIST_PATH, visible=False)
  default_allow_list_output_folder_location = gr.Textbox(label = "Output default allow list location", value=OUTPUT_ALLOW_LIST_PATH, visible=False)
 
- s3_bulk_textract_default_bucket = gr.Textbox(label = "Default Textract bulk S3 bucket", value=TEXTRACT_BULK_ANALYSIS_BUCKET, visible=False)
- s3_bulk_textract_input_subfolder = gr.Textbox(label = "Default Textract bulk S3 input folder", value=TEXTRACT_BULK_ANALYSIS_INPUT_SUBFOLDER, visible=False)
- s3_bulk_textract_output_subfolder = gr.Textbox(label = "Default Textract bulk S3 output folder", value=TEXTRACT_BULK_ANALYSIS_OUTPUT_SUBFOLDER, visible=False)
  successful_textract_api_call_number = gr.Number(precision=0, value=0, visible=False)
  no_redaction_method_drop = gr.Radio(label = """Placeholder for no redaction method after downloading Textract outputs""", value = no_redaction_option, choices=[no_redaction_option], visible=False)
  textract_only_method_drop = gr.Radio(label="""Placeholder for Textract method after downloading Textract outputs""", value = textract_option, choices=[textract_option], visible=False)
@@ -184,6 +205,7 @@ with app:
  cost_code_choice_drop = gr.Dropdown(value=DEFAULT_COST_CODE, label="Choose cost code for analysis. Please contact Finance if you can't find your cost code in the given list.", choices=[DEFAULT_COST_CODE], allow_custom_value=False, visible=False)
 
  textract_output_found_checkbox = gr.Checkbox(value= False, label="Existing Textract output file found", interactive=False, visible=False)
  total_pdf_page_count = gr.Number(label = "Total page count", value=0, visible=False)
  estimated_aws_costs_number = gr.Number(label = "Approximate AWS Textract and/or Comprehend cost ($)", value=0, visible=False, precision=2)
  estimated_time_taken_number = gr.Number(label = "Approximate time taken to extract text/redact (minutes)", value=0, visible=False, precision=2)
@@ -240,10 +262,14 @@ with app:
  if SHOW_COSTS == "True":
  with gr.Accordion("Estimated costs and time taken", open = True, visible=True):
  with gr.Row(equal_height=True):
- textract_output_found_checkbox = gr.Checkbox(value= False, label="Existing Textract output file found", interactive=False, visible=True)
- total_pdf_page_count = gr.Number(label = "Total page count", value=0, visible=True)
- estimated_aws_costs_number = gr.Number(label = "Approximate AWS Textract and/or Comprehend cost (£)", value=0.00, precision=2, visible=True)
- estimated_time_taken_number = gr.Number(label = "Approximate time taken to extract text/redact (minutes)", value=0, visible=True, precision=2)
 
  if GET_COST_CODES == "True" or ENFORCE_COST_CODES == "True":
  with gr.Accordion("Apply cost code", open = True, visible=True):
@@ -253,7 +279,7 @@ with app:
253
  reset_cost_code_dataframe_button = gr.Button(value="Reset code code table filter")
254
  cost_code_choice_drop = gr.Dropdown(value=DEFAULT_COST_CODE, label="Choose cost code for analysis", choices=[DEFAULT_COST_CODE], allow_custom_value=False, visible=True)
- if SHOW_BULK_TEXTRACT_CALL_OPTIONS == "True":
  with gr.Accordion("Submit whole document to AWS Textract API (quicker, max 3,000 pages per document)", open = False, visible=True):
  with gr.Row(equal_height=True):
  gr.Markdown("""Document will be submitted to AWS Textract API service to extract all text in the document. Processing will take place on (secure) AWS servers, and outputs will be stored on S3 for up to 7 days. To download the results, click 'Check status' below and they will be downloaded if ready.""")
@@ -381,7 +407,7 @@ with app:
  ###
  with gr.Tab(label="Open text or Excel/csv files"):
  gr.Markdown("""### Choose open text or a tabular data file (xlsx or csv) to redact.""")
- with gr.Accordion("Paste open text", open = False):
  in_text = gr.Textbox(label="Enter open text", lines=10)
  with gr.Accordion("Upload xlsx or csv files", open = True):
  in_data_files = gr.File(label="Choose Excel or csv files", file_count= "multiple", file_types=['.xlsx', '.xls', '.csv', '.parquet', '.csv.gz'], height=file_input_height)
@@ -391,6 +417,9 @@ with app:
  in_colnames = gr.Dropdown(choices=["Choose columns to anonymise"], multiselect = True, label="Select columns that you want to anonymise (showing columns present across all files).")
  pii_identification_method_drop_tabular = gr.Radio(label = "Choose PII detection method. AWS Comprehend has a cost of approximately $0.01 per 10,000 characters.", value = default_pii_detector, choices=[local_pii_detector, aws_pii_detector])
  tabular_data_redact_btn = gr.Button("Redact text/data files", variant="primary")
@@ -448,10 +477,10 @@ with app:
  aws_access_key_textbox = gr.Textbox(value='', label="AWS access key for account with permissions for AWS Textract and Comprehend", visible=True, type="password")
  aws_secret_key_textbox = gr.Textbox(value='', label="AWS secret key for account with permissions for AWS Textract and Comprehend", visible=True, type="password")
- with gr.Accordion("Settings for open text or xlsx/csv files", open = False):
- anon_strat = gr.Radio(choices=["replace with 'REDACTED'", "replace with <ENTITY_NAME>", "redact completely", "hash", "mask", "encrypt", "fake_first_name"], label="Select an anonymisation method.", value = "replace with 'REDACTED'")
- log_files_output = gr.File(label="Log file output", interactive=False)
  with gr.Accordion("Combine multiple review files", open = False):
  multiple_review_files_in_out = gr.File(label="Combine multiple review_file.csv files together here.", file_count='multiple', file_types=['.csv'])
@@ -477,14 +506,17 @@ with app:
  handwrite_signature_checkbox.change(calculate_aws_costs, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_aws_costs_number])
  textract_output_found_checkbox.change(calculate_aws_costs, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_aws_costs_number])
  only_extract_text_radio.change(calculate_aws_costs, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_aws_costs_number])
  # Calculate time taken
- total_pdf_page_count.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_time_taken_number])
- text_extract_method_radio.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_time_taken_number])
- pii_identification_method_drop.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_time_taken_number])
- handwrite_signature_checkbox.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_time_taken_number])
- textract_output_found_checkbox.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_time_taken_number])
- only_extract_text_radio.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_time_taken_number])
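The block above wires every relevant input's `.change` event to the same `calculate_aws_costs` / `calculate_time_taken` estimators so the displayed figure is recomputed whichever control the user last touched. As a rough illustration of what such a page-count-based estimator can look like (the function name, rates and options below are assumptions for illustration, not the app's real pricing logic):

```python
# Hypothetical sketch of a page-based cost estimator like calculate_aws_costs.
# The per-page rates below are illustrative assumptions, not real AWS pricing.

TEXTRACT_RATE_PER_PAGE = 0.0015    # assumed $ per page for text extraction
COMPREHEND_RATE_PER_PAGE = 0.0001  # assumed $ per page for PII detection

def estimate_aws_cost(page_count: int,
                      use_textract: bool,
                      use_comprehend: bool,
                      textract_output_found: bool = False) -> float:
    """Return an approximate cost in dollars for the selected options."""
    cost = 0.0
    # If a previous Textract output file was found, extraction costs nothing new.
    if use_textract and not textract_output_found:
        cost += page_count * TEXTRACT_RATE_PER_PAGE
    if use_comprehend:
        cost += page_count * COMPREHEND_RATE_PER_PAGE
    return round(cost, 2)
```

Hooking one such function to several `.change` listeners, as the diff does, keeps the estimate current without a dedicated "recalculate" button.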
  # Allow user to select items from cost code dataframe for cost code
  if SHOW_COSTS=="True" and (GET_COST_CODES == "True" or ENFORCE_COST_CODES == "True"):
@@ -494,27 +526,30 @@ with app:
  cost_code_choice_drop.select(update_cost_code_dataframe_from_dropdown_select, inputs=[cost_code_choice_drop, cost_code_dataframe_base], outputs=[cost_code_dataframe])
  in_doc_files.upload(fn=get_input_file_names, inputs=[in_doc_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
- success(fn = prepare_image_or_pdf, inputs=[in_doc_files, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, first_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool_false, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_base]).\
- success(fn=check_for_existing_textract_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[textract_output_found_checkbox])
  # Run redaction function
  document_redact_btn.click(fn = reset_state_vars, outputs=[all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, textract_metadata_textbox, annotator, output_file_list_state, log_files_output_list_state, recogniser_entity_dataframe, recogniser_entity_dataframe_base, pdf_doc_state, duplication_file_path_outputs_list_state, redaction_output_summary_textbox, is_a_textract_api_call, textract_query_number]).\
  success(fn= enforce_cost_codes, inputs=[enforce_cost_code_textbox, cost_code_choice_drop, cost_code_dataframe_base]).\
- success(fn= choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, text_extract_method_radio, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, first_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path],
- outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path], api_name="redact_doc").\
  success(fn=update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])
  # If the app has completed a batch of pages, it will rerun the redaction process until the end of all pages in the document
- current_loop_page_number.change(fn = choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, text_extract_method_radio, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, second_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path],
- outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path]).\
  success(fn=update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])
  # If a file has been completed, the function will continue onto the next document
- latest_file_completed_text.change(fn = choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, text_extract_method_radio, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, second_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path],
- outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path]).\
  success(fn=update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state]).\
  success(fn=check_for_existing_textract_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[textract_output_found_checkbox]).\
- success(fn=reveal_feedback_buttons, outputs=[pdf_feedback_radio, pdf_further_details_text, pdf_submit_feedback_btn, pdf_feedback_title])
  # If the line level ocr results are changed by load in by user or by a new redaction task, replace the ocr results displayed in the table
  all_line_level_ocr_results_df_base.change(reset_ocr_base_dataframe, inputs=[all_line_level_ocr_results_df_base], outputs=[all_line_level_ocr_results_df])
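The redaction wiring in this hunk leans on Gradio's `.click(...).success(...)` chaining, where each step runs only if the previous one finished without raising. A minimal pure-Python analogue of that control flow (the step functions here are hypothetical stand-ins for the reset/redact/update-annotator steps):

```python
# Sketch of the "run each step only if the previous one succeeded" pattern
# that Gradio's .click(...).success(...) chains implement.

def run_chain(state: dict, steps) -> dict:
    """Run steps in order; stop at the first failure (like .success())."""
    for step in steps:
        try:
            state = step(state)
        except Exception as err:
            state["error"] = str(err)
            break  # later steps are skipped, as with .success() chaining
    return state

# Hypothetical steps mirroring reset -> redact -> update annotator
steps = [
    lambda s: {**s, "reset": True},
    lambda s: {**s, "redacted_pages": s["pages"]},
    lambda s: {**s, "annotator_updated": True},
]
result = run_chain({"pages": 3}, steps)
```

If any step raises, the remaining steps never fire, which is why the diff can safely put `enforce_cost_codes` ahead of `choose_and_run_redactor`: a rejected cost code stops the whole chain.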
@@ -532,8 +567,8 @@ with app:
  convert_textract_outputs_to_ocr_results.click(fn=check_for_existing_textract_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[textract_output_found_checkbox]).\
  success(fn= check_textract_outputs_exist, inputs=[textract_output_found_checkbox]).\
  success(fn = reset_state_vars, outputs=[all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, textract_metadata_textbox, annotator, output_file_list_state, log_files_output_list_state, recogniser_entity_dataframe, recogniser_entity_dataframe_base, pdf_doc_state, duplication_file_path_outputs_list_state, redaction_output_summary_textbox, is_a_textract_api_call]).\
- success(fn= choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, textract_only_method_drop, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, first_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, no_redaction_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path],
- outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path])
  ###
  # REVIEW PDF REDACTIONS
@@ -542,7 +577,7 @@ with app:
  # Upload previous files for modifying redactions
  upload_previous_review_file_btn.click(fn=reset_review_vars, inputs=None, outputs=[recogniser_entity_dataframe, recogniser_entity_dataframe_base]).\
  success(fn=get_input_file_names, inputs=[output_review_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
- success(fn = prepare_image_or_pdf, inputs=[output_review_files, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_base], api_name="prepare_doc").\
  success(update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, annotate_current_page, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs = [annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])
  # Page number controls
@@ -572,9 +607,9 @@ with app:
  text_entity_dropdown.select(update_entities_df_text, inputs=[text_entity_dropdown, recogniser_entity_dataframe_base, recogniser_entity_dropdown, page_entity_dropdown], outputs=[recogniser_entity_dataframe, recogniser_entity_dropdown, page_entity_dropdown])
  # Clicking on a cell in the recogniser entity dataframe will take you to that page, and also highlight the target redaction box in blue
- recogniser_entity_dataframe.select(df_select_callback, inputs=[recogniser_entity_dataframe], outputs=[annotate_current_page, selected_entity_dataframe_row]).\
- success(update_selected_review_df_row_colour, inputs=[selected_entity_dataframe_row, review_file_state, selected_entity_id, selected_entity_colour, page_sizes], outputs=[review_file_state, selected_entity_id, selected_entity_colour]).\
- success(update_annotator_page_from_review_df, inputs=[review_file_state, images_pdf_state, page_sizes, annotate_current_page, annotate_previous_page, all_image_annotations_state, annotator], outputs=[annotator, all_image_annotations_state])
  reset_dropdowns_btn.click(reset_dropdowns, inputs=[recogniser_entity_dataframe_base], outputs=[recogniser_entity_dropdown, text_entity_dropdown, page_entity_dropdown]).\
  success(update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, annotate_current_page, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs = [annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])
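The removed `df_select_callback` chain above turns a click on the recogniser-entity table into a page jump plus a highlighted redaction box. A hedged sketch of the row-to-page part of that lookup (the row shape and default are assumptions; the real callback also returns the selected row so it can be recoloured):

```python
# Hypothetical sketch of mapping a clicked entity-table row to its page number.

def page_for_selected_row(rows: list, row_index: int) -> int:
    """Return the page stored on the selected row, or page 1 if out of range."""
    if 0 <= row_index < len(rows):
        return int(rows[row_index].get("page", 1))
    return 1
```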
@@ -604,12 +639,12 @@ with app:
  # Convert review file to xfdf Adobe format
  convert_review_file_to_adobe_btn.click(fn=get_input_file_names, inputs=[output_review_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
- success(fn = prepare_image_or_pdf, inputs=[output_review_files, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_placeholder]).\
  success(convert_df_to_xfdf, inputs=[output_review_files, pdf_doc_state, images_pdf_state, output_folder_textbox, document_cropboxes, page_sizes], outputs=[adobe_review_files_out])
  # Convert xfdf Adobe file back to review_file.csv
  convert_adobe_to_review_file_btn.click(fn=get_input_file_names, inputs=[adobe_review_files_out], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
- success(fn = prepare_image_or_pdf, inputs=[adobe_review_files_out, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_placeholder]).\
  success(fn=convert_xfdf_to_dataframe, inputs=[adobe_review_files_out, pdf_doc_state, images_pdf_state, output_folder_textbox], outputs=[output_review_files], scroll_to_output=True)
  ###
@@ -618,11 +653,14 @@ with app:
  in_data_files.upload(fn=put_columns_in_df, inputs=[in_data_files], outputs=[in_colnames, in_excel_sheets]).\
  success(fn=get_input_file_names, inputs=[in_data_files], outputs=[data_file_name_no_extension_textbox, data_file_name_with_extension_textbox, data_full_file_name_textbox, data_file_name_textbox_list, total_pdf_page_count])
- tabular_data_redact_btn.click(fn=anonymise_data_files, inputs=[in_data_files, in_text, anon_strat, in_colnames, in_redact_language, in_redact_entities, in_allow_list_state, text_tabular_files_done, text_output_summary, text_output_file_list_state, log_files_output_list_state, in_excel_sheets, first_loop_state, output_folder_textbox, in_deny_list_state, max_fuzzy_spelling_mistakes_num, pii_identification_method_drop_tabular, in_redact_comprehend_entities, comprehend_query_number, aws_access_key_textbox, aws_secret_key_textbox], outputs=[text_output_summary, text_output_file, text_output_file_list_state, text_tabular_files_done, log_files_output, log_files_output_list_state], api_name="redact_data")
  # If the output file count text box changes, keep going with redacting each data file until done
- text_tabular_files_done.change(fn=anonymise_data_files, inputs=[in_data_files, in_text, anon_strat, in_colnames, in_redact_language, in_redact_entities, in_allow_list_state, text_tabular_files_done, text_output_summary, text_output_file_list_state, log_files_output_list_state, in_excel_sheets, second_loop_state, output_folder_textbox, in_deny_list_state, max_fuzzy_spelling_mistakes_num, pii_identification_method_drop_tabular, in_redact_comprehend_entities, comprehend_query_number, aws_access_key_textbox, aws_secret_key_textbox], outputs=[text_output_summary, text_output_file, text_output_file_list_state, text_tabular_files_done, log_files_output, log_files_output_list_state]).\
- success(fn = reveal_feedback_buttons, outputs=[data_feedback_radio, data_further_details_text, data_submit_feedback_btn, data_feedback_title])
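The tabular chain above processes one file per call and, as the comment notes, uses the `text_tabular_files_done.change` event to re-trigger itself until every uploaded file is handled. A plain-Python analogue of that one-file-per-pass loop (`process_file` is a stand-in, not the app's `anonymise_data_files`):

```python
# Sketch of the self-retriggering "one file per pass" pattern: each pass
# handles a single file and bumps a done-counter, which (in the app) fires
# the next pass via a .change event.

def process_file(name: str) -> str:
    """Hypothetical stand-in for redacting a single tabular file."""
    return f"{name}.redacted"

def run_until_done(files: list) -> list:
    outputs, files_done = [], 0
    while files_done < len(files):          # mimics the .change re-trigger
        outputs.append(process_file(files[files_done]))
        files_done += 1                     # updating the counter fires the next pass
    return outputs
```

Driving the loop through an output-counter event rather than a Python `while` keeps each Gradio call short, so progress can be streamed back to the UI between files.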
  ###
  # IDENTIFY DUPLICATE PAGES
@@ -654,7 +692,7 @@ with app:
  # Get connection details on app load
- if SHOW_BULK_TEXTRACT_CALL_OPTIONS == "True":
  app.load(get_connection_params, inputs=[output_folder_textbox, input_folder_textbox, session_output_folder_textbox, s3_bulk_textract_input_subfolder, s3_bulk_textract_output_subfolder, s3_bulk_textract_logs_subfolder, local_bulk_textract_logs_subfolder], outputs=[session_hash_state, output_folder_textbox, session_hash_textbox, input_folder_textbox, s3_bulk_textract_input_subfolder, s3_bulk_textract_output_subfolder, s3_bulk_textract_logs_subfolder, local_bulk_textract_logs_subfolder]).\
  success(load_in_textract_job_details, inputs=[load_s3_bulk_textract_logs_bool, s3_bulk_textract_logs_subfolder, local_bulk_textract_logs_subfolder], outputs=[textract_job_detail_df])
  else:
@@ -691,49 +729,71 @@ with app:
  # LOGGING
  ###
  # Log usernames and times of access to file (to know who is using the app when running on AWS)
  access_callback = CSVLogger_custom(dataset_file_name=log_file_name)
- access_callback.setup([session_hash_textbox, host_name_textbox], ACCESS_LOGS_FOLDER)
-
- session_hash_textbox.change(lambda *args: access_callback.flag(list(args)), [session_hash_textbox, host_name_textbox], None, preprocess=False).\
- success(fn = upload_file_to_s3, inputs=[access_logs_state, access_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
- # User submitted feedback for pdf redactions
- pdf_callback = CSVLogger_custom(dataset_file_name=log_file_name)
- pdf_callback.setup([pdf_feedback_radio, pdf_further_details_text, doc_file_name_no_extension_textbox], FEEDBACK_LOGS_FOLDER)
- pdf_submit_feedback_btn.click(lambda *args: pdf_callback.flag(list(args)), [pdf_feedback_radio, pdf_further_details_text, doc_file_name_no_extension_textbox], None, preprocess=False).\
- success(fn = upload_file_to_s3, inputs=[feedback_logs_state, feedback_s3_logs_loc_state], outputs=[pdf_further_details_text])
- # User submitted feedback for data redactions
- data_callback = CSVLogger_custom(dataset_file_name=log_file_name)
- data_callback.setup([data_feedback_radio, data_further_details_text, data_full_file_name_textbox], FEEDBACK_LOGS_FOLDER)
- data_submit_feedback_btn.click(lambda *args: data_callback.flag(list(args)), [data_feedback_radio, data_further_details_text, data_full_file_name_textbox], None, preprocess=False).\
- success(fn = upload_file_to_s3, inputs=[feedback_logs_state, feedback_s3_logs_loc_state], outputs=[data_further_details_text])
- # Log processing time/token usage when making a query
- usage_callback = CSVLogger_custom(dataset_file_name=log_file_name)
  if DISPLAY_FILE_NAMES_IN_LOGS == 'True':
  usage_callback.setup([session_hash_textbox, doc_file_name_no_extension_textbox, data_full_file_name_textbox, total_pdf_page_count, actual_time_taken_number, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], USAGE_LOGS_FOLDER)
- latest_file_completed_text.change(lambda *args: usage_callback.flag(list(args)), [session_hash_textbox, doc_file_name_no_extension_textbox, data_full_file_name_textbox, total_pdf_page_count, actual_time_taken_number, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
- success(fn = upload_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
- successful_textract_api_call_number.change(lambda *args: usage_callback.flag(list(args)), [session_hash_textbox, doc_file_name_no_extension_textbox, data_full_file_name_textbox, total_pdf_page_count, actual_time_taken_number, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
- success(fn = upload_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
  else:
- usage_callback.setup([session_hash_textbox, blank_doc_file_name_no_extension_textbox_for_logs, data_full_file_name_textbox, actual_time_taken_number, total_pdf_page_count, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], USAGE_LOGS_FOLDER)
- latest_file_completed_text.change(lambda *args: usage_callback.flag(list(args)), [session_hash_textbox, blank_doc_file_name_no_extension_textbox_for_logs, data_full_file_name_textbox, actual_time_taken_number, total_pdf_page_count, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
- success(fn = upload_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
- successful_textract_api_call_number.change(lambda *args: usage_callback.flag(list(args)), [session_hash_textbox, blank_doc_file_name_no_extension_textbox_for_logs, data_full_file_name_textbox, actual_time_taken_number, total_pdf_page_count, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
- success(fn = upload_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
  if __name__ == "__main__":
  if RUN_DIRECT_MODE == "0":
- if os.environ['COGNITO_AUTH'] == "1":
+ if COGNITO_AUTH == "1":
  app.queue(max_size=int(MAX_QUEUE_SIZE), default_concurrency_limit=int(DEFAULT_CONCURRENCY_LIMIT)).launch(show_error=True, inbrowser=True, auth=authenticate_user, max_file_size=MAX_FILE_SIZE, server_port=GRADIO_SERVER_PORT, root_path=ROOT_PATH)
  else:
  app.queue(max_size=int(MAX_QUEUE_SIZE), default_concurrency_limit=int(DEFAULT_CONCURRENCY_LIMIT)).launch(show_error=True, inbrowser=True, max_file_size=MAX_FILE_SIZE, server_port=GRADIO_SERVER_PORT, root_path=ROOT_PATH)
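The removed `os.environ['COGNITO_AUTH']` lookup raises `KeyError` whenever the variable is unset; a minimal defensive sketch (a hypothetical helper, not part of this commit) reads the flag with a default instead:

```python
import os

def env_flag(name: str, default: str = "0") -> bool:
    """Read a "0"/"1" environment flag without raising KeyError when unset."""
    return os.environ.get(name, default) == "1"

# e.g. choose the launch path based on the flag
use_cognito_auth = env_flag("COGNITO_AUTH")
```

Importing the value from a config module (as this commit does with `COGNITO_AUTH` from `tools.config`) achieves the same robustness, provided the config layer applies a default.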
  import gradio as gr
  from gradio_image_annotation import image_annotator
+ from tools.config import OUTPUT_FOLDER, INPUT_FOLDER, RUN_DIRECT_MODE, MAX_QUEUE_SIZE, DEFAULT_CONCURRENCY_LIMIT, MAX_FILE_SIZE, GRADIO_SERVER_PORT, ROOT_PATH, GET_DEFAULT_ALLOW_LIST, ALLOW_LIST_PATH, S3_ALLOW_LIST_PATH, FEEDBACK_LOGS_FOLDER, ACCESS_LOGS_FOLDER, USAGE_LOGS_FOLDER, TESSERACT_FOLDER, POPPLER_FOLDER, REDACTION_LANGUAGE, GET_COST_CODES, COST_CODES_PATH, S3_COST_CODES_PATH, ENFORCE_COST_CODES, DISPLAY_FILE_NAMES_IN_LOGS, SHOW_COSTS, RUN_AWS_FUNCTIONS, DOCUMENT_REDACTION_BUCKET, SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS, TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET, TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER, TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_OUTPUT_SUBFOLDER, SESSION_OUTPUT_FOLDER, LOAD_PREVIOUS_TEXTRACT_JOBS_S3, TEXTRACT_JOBS_S3_LOC, TEXTRACT_JOBS_LOCAL_LOC, HOST_NAME, DEFAULT_COST_CODE, OUTPUT_COST_CODES_PATH, OUTPUT_ALLOW_LIST_PATH, COGNITO_AUTH, SAVE_LOGS_TO_CSV, SAVE_LOGS_TO_DYNAMODB, ACCESS_LOG_DYNAMODB_TABLE_NAME, DYNAMODB_ACCESS_LOG_HEADERS, CSV_ACCESS_LOG_HEADERS, FEEDBACK_LOG_DYNAMODB_TABLE_NAME, DYNAMODB_FEEDBACK_LOG_HEADERS, CSV_FEEDBACK_LOG_HEADERS, USAGE_LOG_DYNAMODB_TABLE_NAME, DYNAMODB_USAGE_LOG_HEADERS, CSV_USAGE_LOG_HEADERS
+ from tools.helper_functions import put_columns_in_df, get_connection_params, reveal_feedback_buttons, custom_regex_load, reset_state_vars, load_in_default_allow_list, tesseract_ocr_option, text_ocr_option, textract_option, local_pii_detector, aws_pii_detector, no_redaction_option, reset_review_vars, merge_csv_files, load_all_output_files, update_dataframe, check_for_existing_textract_file, load_in_default_cost_codes, enforce_cost_codes, calculate_aws_costs, calculate_time_taken, reset_base_dataframe, reset_ocr_base_dataframe, update_cost_code_dataframe_from_dropdown_select, check_for_existing_local_ocr_file, reset_data_vars, reset_aws_call_vars
+ from tools.aws_functions import upload_file_to_s3, download_file_from_s3, upload_log_file_to_s3
  from tools.file_redaction import choose_and_run_redactor
+ from tools.file_conversion import prepare_image_or_pdf, get_input_file_names
  from tools.redaction_review import apply_redactions_to_review_df_and_files, update_all_page_annotation_object_based_on_previous_page, decrease_page, increase_page, update_annotator_object_and_filter_df, update_entities_df_recogniser_entities, update_entities_df_page, update_entities_df_text, df_select_callback, convert_df_to_xfdf, convert_xfdf_to_dataframe, reset_dropdowns, exclude_selected_items_from_redaction, undo_last_removal, update_selected_review_df_row_colour, update_all_entity_df_dropdowns, df_select_callback_cost, update_other_annotator_number_from_current, update_annotator_page_from_review_df, df_select_callback_ocr, df_select_callback_textract_api
  from tools.data_anonymise import anonymise_data_files
  from tools.auth import authenticate_user
  default_ocr_val = text_ocr_option
  default_pii_detector = local_pii_detector
+ SAVE_LOGS_TO_CSV = eval(SAVE_LOGS_TO_CSV)
+ SAVE_LOGS_TO_DYNAMODB = eval(SAVE_LOGS_TO_DYNAMODB)
+
+ if CSV_ACCESS_LOG_HEADERS: CSV_ACCESS_LOG_HEADERS = eval(CSV_ACCESS_LOG_HEADERS)
+ if CSV_FEEDBACK_LOG_HEADERS: CSV_FEEDBACK_LOG_HEADERS = eval(CSV_FEEDBACK_LOG_HEADERS)
+ if CSV_USAGE_LOG_HEADERS: CSV_USAGE_LOG_HEADERS = eval(CSV_USAGE_LOG_HEADERS)
+
+ if DYNAMODB_ACCESS_LOG_HEADERS: DYNAMODB_ACCESS_LOG_HEADERS = eval(DYNAMODB_ACCESS_LOG_HEADERS)
+ if DYNAMODB_FEEDBACK_LOG_HEADERS: DYNAMODB_FEEDBACK_LOG_HEADERS = eval(DYNAMODB_FEEDBACK_LOG_HEADERS)
+ if DYNAMODB_USAGE_LOG_HEADERS: DYNAMODB_USAGE_LOG_HEADERS = eval(DYNAMODB_USAGE_LOG_HEADERS)
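Calling `eval` on config strings executes arbitrary code from the config source. A safer sketch, assuming the values are Python-literal strings such as `"True"` or `"['col1', 'col2']"`, uses `ast.literal_eval` (the helper name here is illustrative, not the app's API):

```python
import ast

def parse_config_literal(raw, default=None):
    """Parse a config string like "True" or "['a', 'b']" into a Python value.

    ast.literal_eval accepts only literals, so a malicious config value
    cannot execute code the way eval(raw) would.
    """
    if not raw:
        return default
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return raw  # leave non-literal strings unchanged

SAVE_LOGS_TO_CSV = parse_config_literal("True", default=False)
CSV_ACCESS_LOG_HEADERS = parse_config_literal("['id', 'timestamp']", default=[])
```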
+
  # Create the gradio interface
  app = gr.Blocks(theme = gr.themes.Base(), fill_width=True)
  all_decision_process_table_state = gr.Dataframe(value=pd.DataFrame(), headers=None, col_count=0, row_count = (0, "dynamic"), label="all_decision_process_table", visible=False, type="pandas", wrap=True)
  review_file_state = gr.Dataframe(value=pd.DataFrame(), headers=None, col_count=0, row_count = (0, "dynamic"), label="review_file_df", visible=False, type="pandas", wrap=True)
+ all_page_line_level_ocr_results = gr.State([])
+ all_page_line_level_ocr_results_with_children = gr.State([])
+
  session_hash_state = gr.Textbox(label= "session_hash_state", value="", visible=False)
  host_name_textbox = gr.Textbox(label= "host_name_textbox", value=HOST_NAME, visible=False)
  s3_output_folder_state = gr.Textbox(label= "s3_output_folder_state", value="", visible=False)
  doc_full_file_name_textbox = gr.Textbox(label = "doc_full_file_name_textbox", value="", visible=False)
  doc_file_name_no_extension_textbox = gr.Textbox(label = "doc_full_file_name_textbox", value="", visible=False)
+ blank_doc_file_name_no_extension_textbox_for_logs = gr.Textbox(label = "doc_full_file_name_textbox", value="", visible=False)
+ blank_data_file_name_no_extension_textbox_for_logs = gr.Textbox(label = "data_full_file_name_textbox", value="", visible=False)
+ placeholder_doc_file_name_no_extension_textbox_for_logs = gr.Textbox(label = "doc_full_file_name_textbox", value="document", visible=False)
+ placeholder_data_file_name_no_extension_textbox_for_logs = gr.Textbox(label = "data_full_file_name_textbox", value="data_file", visible=False)
+
+ # The blank file name fields above are used in logs when the user chooses not to report file names
  doc_file_name_with_extension_textbox = gr.Textbox(label = "doc_file_name_with_extension_textbox", value="", visible=False)
  doc_file_name_textbox_list = gr.Dropdown(label = "doc_file_name_textbox_list", value="", allow_custom_value=True,visible=False)
  latest_review_file_path = gr.Textbox(label = "latest_review_file_path", value="", visible=False) # Latest review file path output from redaction
  s3_default_allow_list_file = gr.Textbox(label = "Default allow list file", value=S3_ALLOW_LIST_PATH, visible=False)
  default_allow_list_output_folder_location = gr.Textbox(label = "Output default allow list location", value=OUTPUT_ALLOW_LIST_PATH, visible=False)
+ s3_bulk_textract_default_bucket = gr.Textbox(label = "Default Textract bulk S3 bucket", value=TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET, visible=False)
+ s3_bulk_textract_input_subfolder = gr.Textbox(label = "Default Textract bulk S3 input folder", value=TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER, visible=False)
+ s3_bulk_textract_output_subfolder = gr.Textbox(label = "Default Textract bulk S3 output folder", value=TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_OUTPUT_SUBFOLDER, visible=False)
  successful_textract_api_call_number = gr.Number(precision=0, value=0, visible=False)
  no_redaction_method_drop = gr.Radio(label = """Placeholder for no redaction method after downloading Textract outputs""", value = no_redaction_option, choices=[no_redaction_option], visible=False)
  textract_only_method_drop = gr.Radio(label="""Placeholder for Textract method after downloading Textract outputs""", value = textract_option, choices=[textract_option], visible=False)
  cost_code_choice_drop = gr.Dropdown(value=DEFAULT_COST_CODE, label="Choose cost code for analysis. Please contact Finance if you can't find your cost code in the given list.", choices=[DEFAULT_COST_CODE], allow_custom_value=False, visible=False)
  textract_output_found_checkbox = gr.Checkbox(value= False, label="Existing Textract output file found", interactive=False, visible=False)
+ local_ocr_output_found_checkbox = gr.Checkbox(value= False, label="Existing local OCR output file found", interactive=False, visible=False)
  total_pdf_page_count = gr.Number(label = "Total page count", value=0, visible=False)
  estimated_aws_costs_number = gr.Number(label = "Approximate AWS Textract and/or Comprehend cost ($)", value=0, visible=False, precision=2)
  estimated_time_taken_number = gr.Number(label = "Approximate time taken to extract text/redact (minutes)", value=0, visible=False, precision=2)
  if SHOW_COSTS == "True":
  with gr.Accordion("Estimated costs and time taken", open = True, visible=True):
  with gr.Row(equal_height=True):
+ with gr.Column(scale=1):
+ textract_output_found_checkbox = gr.Checkbox(value= False, label="Existing Textract output file found", interactive=False, visible=True)
+ local_ocr_output_found_checkbox = gr.Checkbox(value= False, label="Existing local OCR output file found", interactive=False, visible=True)
+ with gr.Column(scale=4):
+ with gr.Row(equal_height=True):
+ total_pdf_page_count = gr.Number(label = "Total page count", value=0, visible=True)
+ estimated_aws_costs_number = gr.Number(label = "Approximate AWS Textract and/or Comprehend cost ($)", value=0.00, precision=2, visible=True)
+ estimated_time_taken_number = gr.Number(label = "Approximate time taken to extract text/redact (minutes)", value=0, visible=True, precision=2)
  if GET_COST_CODES == "True" or ENFORCE_COST_CODES == "True":
  with gr.Accordion("Apply cost code", open = True, visible=True):
  reset_cost_code_dataframe_button = gr.Button(value="Reset cost code table filter")
  cost_code_choice_drop = gr.Dropdown(value=DEFAULT_COST_CODE, label="Choose cost code for analysis", choices=[DEFAULT_COST_CODE], allow_custom_value=False, visible=True)
+ if SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS == "True":
  with gr.Accordion("Submit whole document to AWS Textract API (quicker, max 3,000 pages per document)", open = False, visible=True):
  with gr.Row(equal_height=True):
  gr.Markdown("""Document will be submitted to AWS Textract API service to extract all text in the document. Processing will take place on (secure) AWS servers, and outputs will be stored on S3 for up to 7 days. To download the results, click 'Check status' below and they will be downloaded if ready.""")
  ###
  with gr.Tab(label="Open text or Excel/csv files"):
  gr.Markdown("""### Choose open text or a tabular data file (xlsx or csv) to redact.""")
+ with gr.Accordion("Redact open text", open = False):
  in_text = gr.Textbox(label="Enter open text", lines=10)
  with gr.Accordion("Upload xlsx or csv files", open = True):
  in_data_files = gr.File(label="Choose Excel or csv files", file_count= "multiple", file_types=['.xlsx', '.xls', '.csv', '.parquet', '.csv.gz'], height=file_input_height)
  in_colnames = gr.Dropdown(choices=["Choose columns to anonymise"], multiselect = True, label="Select columns that you want to anonymise (showing columns present across all files).")
  pii_identification_method_drop_tabular = gr.Radio(label = "Choose PII detection method. AWS Comprehend has a cost of approximately $0.01 per 10,000 characters.", value = default_pii_detector, choices=[local_pii_detector, aws_pii_detector])
+
+ with gr.Accordion("Anonymisation output format", open = False):
+ anon_strat = gr.Radio(choices=["replace with 'REDACTED'", "replace with <ENTITY_NAME>", "redact completely", "hash", "mask"], label="Select an anonymisation method.", value = "replace with 'REDACTED'") # , "encrypt", "fake_first_name" are also available, but are not currently included as not that useful in current form
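The strategies offered by the `anon_strat` radio above can be sketched as follows (assumed semantics for illustration; the app's real logic lives in `tools/data_anonymise.py`, and the entity and hash details here are not taken from it):

```python
import hashlib

def anonymise_value(value: str, strategy: str, entity_name: str = "PII") -> str:
    """Illustrative mapping of the anonymisation output formats above."""
    if strategy == "replace with 'REDACTED'":
        return "REDACTED"
    if strategy == "replace with <ENTITY_NAME>":
        return f"<{entity_name}>"
    if strategy == "redact completely":
        return ""
    if strategy == "hash":  # stable pseudonym, not reversible in practice
        return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]
    if strategy == "mask":
        return "*" * len(value)
    return value
```

Hashing keeps repeated values linkable across rows (the same input maps to the same pseudonym), while masking and replacement do not.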
  tabular_data_redact_btn = gr.Button("Redact text/data files", variant="primary")
  aws_access_key_textbox = gr.Textbox(value='', label="AWS access key for account with permissions for AWS Textract and Comprehend", visible=True, type="password")
  aws_secret_key_textbox = gr.Textbox(value='', label="AWS secret key for account with permissions for AWS Textract and Comprehend", visible=True, type="password")
+
+ with gr.Accordion("Log file outputs", open = False):
+ log_files_output = gr.File(label="Log file output", interactive=False)
  with gr.Accordion("Combine multiple review files", open = False):
  multiple_review_files_in_out = gr.File(label="Combine multiple review_file.csv files together here.", file_count='multiple', file_types=['.csv'])
  handwrite_signature_checkbox.change(calculate_aws_costs, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_aws_costs_number])
  textract_output_found_checkbox.change(calculate_aws_costs, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_aws_costs_number])
  only_extract_text_radio.change(calculate_aws_costs, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_aws_costs_number])
  # Calculate time taken
+ total_pdf_page_count.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
+ text_extract_method_radio.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
+ pii_identification_method_drop.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
+ handwrite_signature_checkbox.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
+ only_extract_text_radio.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
+ textract_output_found_checkbox.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
+ local_ocr_output_found_checkbox.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
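The block above repeats the same input/output lists for every trigger, which is how the duplicate and mismatched registrations crept in. A loop over the trigger components keeps a single source of truth; this stand-in sketch uses a fake component class (the real wiring needs the Gradio objects, which are not reproduced here):

```python
class FakeComponent:
    """Stand-in for a Gradio component; only records .change registrations."""
    def __init__(self, name):
        self.name = name
        self.handlers = []

    def change(self, fn, inputs=None, outputs=None):
        # record the registration, as gr.Component.change would
        self.handlers.append((fn, tuple(inputs or ()), tuple(outputs or ())))

def wire_triggers(triggers, fn, inputs, outputs):
    """Register one handler on each trigger component from one definition."""
    for trigger in triggers:
        trigger.change(fn, inputs=inputs, outputs=outputs)

page_count, method, pii = (FakeComponent(n) for n in ("page_count", "method", "pii"))
shared_inputs = [page_count, method, pii]
wire_triggers(shared_inputs, fn=lambda *a: None, inputs=shared_inputs, outputs=[])
```

Recent Gradio versions also offer `gr.on(triggers=[...], ...)` for attaching one handler to several events at once, which would collapse this block further.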
  # Allow user to select items from cost code dataframe for cost code
  if SHOW_COSTS=="True" and (GET_COST_CODES == "True" or ENFORCE_COST_CODES == "True"):
  cost_code_choice_drop.select(update_cost_code_dataframe_from_dropdown_select, inputs=[cost_code_choice_drop, cost_code_dataframe_base], outputs=[cost_code_dataframe])
  in_doc_files.upload(fn=get_input_file_names, inputs=[in_doc_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
+ success(fn = prepare_image_or_pdf, inputs=[in_doc_files, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, first_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool_false, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_base, local_ocr_output_found_checkbox]).\
+ success(fn=check_for_existing_textract_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[textract_output_found_checkbox]).\
+ success(fn=check_for_existing_local_ocr_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[local_ocr_output_found_checkbox])
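`check_for_existing_textract_file` and `check_for_existing_local_ocr_file` follow the same reusable pattern: look for a previously saved output next to the document's base name so a completed extraction is not repeated. A sketch under an assumed naming convention (the real suffixes and logic live in `tools/helper_functions.py`):

```python
import tempfile
from pathlib import Path

def check_for_existing_output(file_stem: str, output_folder: str,
                              suffix: str = "_ocr_output.csv") -> bool:
    """Return True when a prior run already saved output for this document."""
    return (Path(output_folder) / f"{file_stem}{suffix}").is_file()

# demonstrate against a temporary folder
with tempfile.TemporaryDirectory() as folder:
    assert check_for_existing_output("report", folder) is False
    (Path(folder) / "report_ocr_output.csv").touch()
    assert check_for_existing_output("report", folder) is True
```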
  # Run redaction function
  document_redact_btn.click(fn = reset_state_vars, outputs=[all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, textract_metadata_textbox, annotator, output_file_list_state, log_files_output_list_state, recogniser_entity_dataframe, recogniser_entity_dataframe_base, pdf_doc_state, duplication_file_path_outputs_list_state, redaction_output_summary_textbox, is_a_textract_api_call, textract_query_number]).\
  success(fn= enforce_cost_codes, inputs=[enforce_cost_code_textbox, cost_code_choice_drop, cost_code_dataframe_base]).\
+ success(fn= choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, text_extract_method_radio, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, first_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children],
+ outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children], api_name="redact_doc").\
  success(fn=update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])
  # If the app has completed a batch of pages, it will rerun the redaction process until the end of all pages in the document
+ current_loop_page_number.change(fn = choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, text_extract_method_radio, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, second_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children],
+ outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children]).\
  success(fn=update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])
  # If a file has been completed, the function will continue onto the next document
+ latest_file_completed_text.change(fn = choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, text_extract_method_radio, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, second_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children],
+ outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children]).\
  success(fn=update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state]).\
  success(fn=check_for_existing_textract_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[textract_output_found_checkbox]).\
+ success(fn=check_for_existing_local_ocr_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[local_ocr_output_found_checkbox]).\
+ success(fn=reveal_feedback_buttons, outputs=[pdf_feedback_radio, pdf_further_details_text, pdf_submit_feedback_btn, pdf_feedback_title]).\
+ success(fn = reset_aws_call_vars, outputs=[comprehend_query_number, textract_query_number])
  # If the line level ocr results are changed by load in by user or by a new redaction task, replace the ocr results displayed in the table
  all_line_level_ocr_results_df_base.change(reset_ocr_base_dataframe, inputs=[all_line_level_ocr_results_df_base], outputs=[all_line_level_ocr_results_df])
  convert_textract_outputs_to_ocr_results.click(fn=check_for_existing_textract_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[textract_output_found_checkbox]).\
  success(fn= check_textract_outputs_exist, inputs=[textract_output_found_checkbox]).\
  success(fn = reset_state_vars, outputs=[all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, textract_metadata_textbox, annotator, output_file_list_state, log_files_output_list_state, recogniser_entity_dataframe, recogniser_entity_dataframe_base, pdf_doc_state, duplication_file_path_outputs_list_state, redaction_output_summary_textbox, is_a_textract_api_call]).\
+ success(fn= choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, textract_only_method_drop, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, first_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, no_redaction_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children],
+ outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children])
  ###
  # REVIEW PDF REDACTIONS
  # Upload previous files for modifying redactions
  upload_previous_review_file_btn.click(fn=reset_review_vars, inputs=None, outputs=[recogniser_entity_dataframe, recogniser_entity_dataframe_base]).\
  success(fn=get_input_file_names, inputs=[output_review_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
+ success(fn = prepare_image_or_pdf, inputs=[output_review_files, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_base, local_ocr_output_found_checkbox], api_name="prepare_doc").\
  success(update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, annotate_current_page, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs = [annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])
  # Page number controls
  text_entity_dropdown.select(update_entities_df_text, inputs=[text_entity_dropdown, recogniser_entity_dataframe_base, recogniser_entity_dropdown, page_entity_dropdown], outputs=[recogniser_entity_dataframe, recogniser_entity_dropdown, page_entity_dropdown])
608
 
609
  # Clicking on a cell in the recogniser entity dataframe will take you to that page, and also highlight the target redaction box in blue
610
+ recogniser_entity_dataframe.select(df_select_callback, inputs=[recogniser_entity_dataframe], outputs=[selected_entity_dataframe_row]).\
611
+ success(update_selected_review_df_row_colour, inputs=[selected_entity_dataframe_row, review_file_state, selected_entity_id, selected_entity_colour], outputs=[review_file_state, selected_entity_id, selected_entity_colour]).\
612
+ success(update_annotator_page_from_review_df, inputs=[review_file_state, images_pdf_state, page_sizes, all_image_annotations_state, annotator, selected_entity_dataframe_row, input_folder_textbox, doc_full_file_name_textbox], outputs=[annotator, all_image_annotations_state, annotate_current_page, page_sizes, review_file_state, annotate_previous_page])
613
 
614
  reset_dropdowns_btn.click(reset_dropdowns, inputs=[recogniser_entity_dataframe_base], outputs=[recogniser_entity_dropdown, text_entity_dropdown, page_entity_dropdown]).\
615
  success(update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, annotate_current_page, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs = [annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])
 
639
 
640
  # Convert review file to xfdf Adobe format
641
  convert_review_file_to_adobe_btn.click(fn=get_input_file_names, inputs=[output_review_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
642
+ success(fn = prepare_image_or_pdf, inputs=[output_review_files, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_placeholder, local_ocr_output_found_checkbox]).\
643
  success(convert_df_to_xfdf, inputs=[output_review_files, pdf_doc_state, images_pdf_state, output_folder_textbox, document_cropboxes, page_sizes], outputs=[adobe_review_files_out])
644
 
645
  # Convert xfdf Adobe file back to review_file.csv
646
  convert_adobe_to_review_file_btn.click(fn=get_input_file_names, inputs=[adobe_review_files_out], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
647
+ success(fn = prepare_image_or_pdf, inputs=[adobe_review_files_out, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_placeholder, local_ocr_output_found_checkbox]).\
648
  success(fn=convert_xfdf_to_dataframe, inputs=[adobe_review_files_out, pdf_doc_state, images_pdf_state, output_folder_textbox], outputs=[output_review_files], scroll_to_output=True)
649
 
650
  ###
 
653
  in_data_files.upload(fn=put_columns_in_df, inputs=[in_data_files], outputs=[in_colnames, in_excel_sheets]).\
654
  success(fn=get_input_file_names, inputs=[in_data_files], outputs=[data_file_name_no_extension_textbox, data_file_name_with_extension_textbox, data_full_file_name_textbox, data_file_name_textbox_list, total_pdf_page_count])
655
 
656
+ tabular_data_redact_btn.click(reset_data_vars, outputs=[actual_time_taken_number, log_files_output_list_state, comprehend_query_number]).\
657
+ success(fn=anonymise_data_files, inputs=[in_data_files, in_text, anon_strat, in_colnames, in_redact_language, in_redact_entities, in_allow_list_state, text_tabular_files_done, text_output_summary, text_output_file_list_state, log_files_output_list_state, in_excel_sheets, first_loop_state, output_folder_textbox, in_deny_list_state, max_fuzzy_spelling_mistakes_num, pii_identification_method_drop_tabular, in_redact_comprehend_entities, comprehend_query_number, aws_access_key_textbox, aws_secret_key_textbox, actual_time_taken_number], outputs=[text_output_summary, text_output_file, text_output_file_list_state, text_tabular_files_done, log_files_output, log_files_output_list_state, actual_time_taken_number], api_name="redact_data").\
658
+ success(fn = reveal_feedback_buttons, outputs=[data_feedback_radio, data_further_details_text, data_submit_feedback_btn, data_feedback_title])
659
 
660
+ # Currently only supports redacting one data file at a time
661
  # If the output file count text box changes, keep going with redacting each data file until done
662
+ # text_tabular_files_done.change(fn=anonymise_data_files, inputs=[in_data_files, in_text, anon_strat, in_colnames, in_redact_language, in_redact_entities, in_allow_list_state, text_tabular_files_done, text_output_summary, text_output_file_list_state, log_files_output_list_state, in_excel_sheets, second_loop_state, output_folder_textbox, in_deny_list_state, max_fuzzy_spelling_mistakes_num, pii_identification_method_drop_tabular, in_redact_comprehend_entities, comprehend_query_number, aws_access_key_textbox, aws_secret_key_textbox, actual_time_taken_number], outputs=[text_output_summary, text_output_file, text_output_file_list_state, text_tabular_files_done, log_files_output, log_files_output_list_state, actual_time_taken_number]).\
663
+ # success(fn = reveal_feedback_buttons, outputs=[data_feedback_radio, data_further_details_text, data_submit_feedback_btn, data_feedback_title])
664
 
665
  ###
666
  # IDENTIFY DUPLICATE PAGES
 
692
 
693
  # Get connection details on app load
694
 
695
+ if SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS == "True":
696
  app.load(get_connection_params, inputs=[output_folder_textbox, input_folder_textbox, session_output_folder_textbox, s3_bulk_textract_input_subfolder, s3_bulk_textract_output_subfolder, s3_bulk_textract_logs_subfolder, local_bulk_textract_logs_subfolder], outputs=[session_hash_state, output_folder_textbox, session_hash_textbox, input_folder_textbox, s3_bulk_textract_input_subfolder, s3_bulk_textract_output_subfolder, s3_bulk_textract_logs_subfolder, local_bulk_textract_logs_subfolder]).\
697
  success(load_in_textract_job_details, inputs=[load_s3_bulk_textract_logs_bool, s3_bulk_textract_logs_subfolder, local_bulk_textract_logs_subfolder], outputs=[textract_job_detail_df])
698
  else:
 
729
  # LOGGING
730
  ###
731
 
732
+ ### ACCESS LOGS
733
  # Log usernames and times of access to file (to know who is using the app when running on AWS)
734
  access_callback = CSVLogger_custom(dataset_file_name=log_file_name)
735
+ access_callback.setup([session_hash_textbox, host_name_textbox], ACCESS_LOGS_FOLDER)
736
+ session_hash_textbox.change(lambda *args: access_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=ACCESS_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_ACCESS_LOG_HEADERS, replacement_headers=CSV_ACCESS_LOG_HEADERS), [session_hash_textbox, host_name_textbox], None, preprocess=False).\
737
+ success(fn = upload_log_file_to_s3, inputs=[access_logs_state, access_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
 
738
 
739
+ ### FEEDBACK LOGS
740
+ if DISPLAY_FILE_NAMES_IN_LOGS == 'True':
741
+ # User submitted feedback for pdf redactions
742
+ pdf_callback = CSVLogger_custom(dataset_file_name=log_file_name)
743
+ pdf_callback.setup([pdf_feedback_radio, pdf_further_details_text, doc_file_name_no_extension_textbox], FEEDBACK_LOGS_FOLDER)
744
+ pdf_submit_feedback_btn.click(lambda *args: pdf_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=FEEDBACK_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_FEEDBACK_LOG_HEADERS, replacement_headers=CSV_FEEDBACK_LOG_HEADERS), [pdf_feedback_radio, pdf_further_details_text, doc_file_name_no_extension_textbox], None, preprocess=False).\
745
+ success(fn = upload_log_file_to_s3, inputs=[feedback_logs_state, feedback_s3_logs_loc_state], outputs=[pdf_further_details_text])
746
+
747
+ # User submitted feedback for data redactions
748
+ data_callback = CSVLogger_custom(dataset_file_name=log_file_name)
749
+ data_callback.setup([data_feedback_radio, data_further_details_text, data_full_file_name_textbox], FEEDBACK_LOGS_FOLDER)
750
+ data_submit_feedback_btn.click(lambda *args: data_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=FEEDBACK_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_FEEDBACK_LOG_HEADERS, replacement_headers=CSV_FEEDBACK_LOG_HEADERS), [data_feedback_radio, data_further_details_text, data_full_file_name_textbox], None, preprocess=False).\
751
+ success(fn = upload_log_file_to_s3, inputs=[feedback_logs_state, feedback_s3_logs_loc_state], outputs=[data_further_details_text])
752
+ else:
753
+ # User submitted feedback for pdf redactions
754
+ pdf_callback = CSVLogger_custom(dataset_file_name=log_file_name)
755
+ pdf_callback.setup([pdf_feedback_radio, pdf_further_details_text, doc_file_name_no_extension_textbox], FEEDBACK_LOGS_FOLDER)
756
+ pdf_submit_feedback_btn.click(lambda *args: pdf_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=FEEDBACK_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_FEEDBACK_LOG_HEADERS, replacement_headers=CSV_FEEDBACK_LOG_HEADERS), [pdf_feedback_radio, pdf_further_details_text, placeholder_doc_file_name_no_extension_textbox_for_logs], None, preprocess=False).\
757
+ success(fn = upload_log_file_to_s3, inputs=[feedback_logs_state, feedback_s3_logs_loc_state], outputs=[pdf_further_details_text])
758
+
759
+ # User submitted feedback for data redactions
760
+ data_callback = CSVLogger_custom(dataset_file_name=log_file_name)
761
+ data_callback.setup([data_feedback_radio, data_further_details_text, data_full_file_name_textbox], FEEDBACK_LOGS_FOLDER)
762
+ data_submit_feedback_btn.click(lambda *args: data_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=FEEDBACK_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_FEEDBACK_LOG_HEADERS, replacement_headers=CSV_FEEDBACK_LOG_HEADERS), [data_feedback_radio, data_further_details_text, placeholder_data_file_name_no_extension_textbox_for_logs], None, preprocess=False).\
763
+ success(fn = upload_log_file_to_s3, inputs=[feedback_logs_state, feedback_s3_logs_loc_state], outputs=[data_further_details_text])
764
 
765
+ ### USAGE LOGS
766
+ # Log processing usage - time taken for redaction queries, and also logs for queries to Textract/Comprehend
 
 
 
767
 
768
+ usage_callback = CSVLogger_custom(dataset_file_name=log_file_name)
 
769
 
770
  if DISPLAY_FILE_NAMES_IN_LOGS == 'True':
771
  usage_callback.setup([session_hash_textbox, doc_file_name_no_extension_textbox, data_full_file_name_textbox, total_pdf_page_count, actual_time_taken_number, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], USAGE_LOGS_FOLDER)
772
 
773
+ latest_file_completed_text.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, doc_file_name_no_extension_textbox, data_full_file_name_textbox, total_pdf_page_count, actual_time_taken_number, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
774
+ success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
775
 
776
+ text_tabular_files_done.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, doc_file_name_no_extension_textbox, data_full_file_name_textbox, total_pdf_page_count, actual_time_taken_number, textract_query_number, pii_identification_method_drop_tabular, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
777
+ success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
778
+
779
+ successful_textract_api_call_number.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, doc_file_name_no_extension_textbox, data_full_file_name_textbox, total_pdf_page_count, actual_time_taken_number, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
780
+ success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
781
  else:
782
+ usage_callback.setup([session_hash_textbox, blank_doc_file_name_no_extension_textbox_for_logs, blank_data_file_name_no_extension_textbox_for_logs, actual_time_taken_number, total_pdf_page_count, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], USAGE_LOGS_FOLDER)
783
+
784
+ latest_file_completed_text.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, placeholder_doc_file_name_no_extension_textbox_for_logs, blank_data_file_name_no_extension_textbox_for_logs, actual_time_taken_number, total_pdf_page_count, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
785
+ success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
786
 
787
+ text_tabular_files_done.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, blank_doc_file_name_no_extension_textbox_for_logs, placeholder_data_file_name_no_extension_textbox_for_logs, actual_time_taken_number, total_pdf_page_count, textract_query_number, pii_identification_method_drop_tabular, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
788
+ success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
789
 
790
+ successful_textract_api_call_number.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, placeholder_doc_file_name_no_extension_textbox_for_logs, blank_data_file_name_no_extension_textbox_for_logs, actual_time_taken_number, total_pdf_page_count, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
791
+ success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
792
 
793
  if __name__ == "__main__":
794
  if RUN_DIRECT_MODE == "0":
795
 
796
+ if COGNITO_AUTH == "1":
797
  app.queue(max_size=int(MAX_QUEUE_SIZE), default_concurrency_limit=int(DEFAULT_CONCURRENCY_LIMIT)).launch(show_error=True, inbrowser=True, auth=authenticate_user, max_file_size=MAX_FILE_SIZE, server_port=GRADIO_SERVER_PORT, root_path=ROOT_PATH)
798
  else:
799
  app.queue(max_size=int(MAX_QUEUE_SIZE), default_concurrency_limit=int(DEFAULT_CONCURRENCY_LIMIT)).launch(show_error=True, inbrowser=True, max_file_size=MAX_FILE_SIZE, server_port=GRADIO_SERVER_PORT, root_path=ROOT_PATH)
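The logging wiring above repeats one pattern many times: `component.change(lambda *args: callback.flag(list(args), ...), [inputs], None, preprocess=False)`. A Gradio-free sketch of just that forwarding step may help when reading it — `FakeLogger` below is a hypothetical stand-in for this repo's `CSVLogger_custom`, which additionally writes to CSV and (optionally) DynamoDB:

```python
# Sketch of the flag-forwarding pattern used in the event wiring above.
# FakeLogger is hypothetical; the real CSVLogger_custom persists rows
# to CSV and, when enabled, to a DynamoDB table.

class FakeLogger:
    def __init__(self):
        self.rows = []

    def flag(self, row, save_to_csv=True, save_to_dynamodb=False):
        # Record the row together with the chosen sinks
        self.rows.append({"row": row, "csv": save_to_csv, "dynamodb": save_to_dynamodb})
        return len(self.rows)

logger = FakeLogger()

# The variadic lambda collects however many component values the event
# supplies and hands them to the logger as a single list, mirroring
# `lambda *args: access_callback.flag(list(args), ...)` above.
handler = lambda *args: logger.flag(list(args), save_to_csv=True, save_to_dynamodb=False)

handler("session-123", "host-a")
print(logger.rows[0]["row"])  # ['session-123', 'host-a']
```

Because the lambda is variadic, the same shape works for the two-input access log and the thirteen-input usage log without changing the logger.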
pyproject.toml ADDED
@@ -0,0 +1,57 @@
+ [build-system]
+ requires = ["setuptools>=61.0", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "doc_redaction" # Your application's name
+ version = "0.6.0" # Your application's current version
+ description = "Redact PDF/image-based documents, or CSV/XLSX files using a Gradio-based GUI interface" # A short description
+ readme = "README.md" # Path to your project's README file
+ requires-python = ">=3.10" # The minimum Python version required
+
+ dependencies = [
+ "pdfminer.six==20240706",
+ "pdf2image==1.17.0",
+ "pymupdf==1.25.3",
+ "opencv-python==4.10.0.84",
+ "presidio_analyzer==2.2.358",
+ "presidio_anonymizer==2.2.358",
+ "presidio-image-redactor==0.0.56",
+ "pikepdf==9.5.2",
+ "pandas==2.2.3",
+ "scikit-learn==1.6.1",
+ "spacy==3.8.4",
+ # Direct URL dependency for spacy model
+ "en_core_web_lg @ https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0.tar.gz",
+ "gradio==5.27.1",
+ "boto3==1.38.4",
+ "pyarrow==19.0.1",
+ "openpyxl==3.1.5",
+ "Faker==36.1.1",
+ "python-levenshtein==0.26.1",
+ "spaczz==0.6.1",
+ # Direct URL dependency for gradio_image_annotator wheel
+ "gradio_image_annotation @ https://github.com/seanpedrick-case/gradio_image_annotator/releases/download/v0.3.2/gradio_image_annotation-0.3.2-py3-none-any.whl",
+ "rapidfuzz==3.12.1",
+ "python-dotenv==1.0.1",
+ "numpy==1.26.4",
+ "awslambdaric==3.0.1"
+ ]
+
+ [project.urls]
+ Homepage = "https://seanpedrick-case.github.io/doc_redaction/README.html"
+ repository = "https://github.com/seanpedrick-case/doc_redaction"
+
+ [project.optional-dependencies]
+ dev = ["pytest"]
+
+ # Optional: You can add configuration for tools used in your project under the [tool] section
+ # For example, configuration for a linter like Ruff:
+ [tool.ruff]
+ line-length = 88
+ select = ["E", "F", "I"]
+
+ # Optional: Configuration for a formatter like Black:
+ [tool.black]
+ line-length = 88
+ target-version = ['py310']
requirements.txt CHANGED
@@ -10,8 +10,8 @@ pandas==2.2.3
  scikit-learn==1.6.1
  spacy==3.8.4
  en_core_web_lg @ https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0.tar.gz
- gradio==5.25.2
- boto3==1.37.29
+ gradio==5.27.1
+ boto3==1.38.4
  pyarrow==19.0.1
  openpyxl==3.1.5
  Faker==36.1.1
tools/aws_functions.py CHANGED
@@ -3,7 +3,7 @@ import pandas as pd
  import boto3
  import tempfile
  import os
- from tools.config import AWS_REGION, RUN_AWS_FUNCTIONS, DOCUMENT_REDACTION_BUCKET
+ from tools.config import AWS_REGION, RUN_AWS_FUNCTIONS, DOCUMENT_REDACTION_BUCKET, SAVE_LOGS_TO_CSV
  PandasDataFrame = Type[pd.DataFrame]

  def get_assumed_role_info():
@@ -174,3 +174,59 @@ def upload_file_to_s3(local_file_paths:List[str], s3_key:str, s3_bucket:str=DOCU
  final_out_message_str = "App not set to run AWS functions"

  return final_out_message_str
+
+
+ def upload_log_file_to_s3(local_file_paths:List[str], s3_key:str, s3_bucket:str=DOCUMENT_REDACTION_BUCKET, RUN_AWS_FUNCTIONS:str = RUN_AWS_FUNCTIONS, SAVE_LOGS_TO_CSV:str=SAVE_LOGS_TO_CSV):
+ """
+ Uploads a log file from local machine to Amazon S3.
+
+ Args:
+ - local_file_path: Local file path(s) of the file(s) to upload.
+ - s3_key: Key (path) to the file in the S3 bucket.
+ - s3_bucket: Name of the S3 bucket.
+
+ Returns:
+ - Message as variable/printed to console
+ """
+ final_out_message = []
+ final_out_message_str = ""
+
+ if RUN_AWS_FUNCTIONS == "1" and SAVE_LOGS_TO_CSV == "True":
+ try:
+ if s3_bucket and s3_key and local_file_paths:
+
+ s3_client = boto3.client('s3', region_name=AWS_REGION)
+
+ if isinstance(local_file_paths, str):
+ local_file_paths = [local_file_paths]
+
+ for file in local_file_paths:
+ if s3_client:
+ #print(s3_client)
+ try:
+ # Get file name off file path
+ file_name = os.path.basename(file)
+
+ s3_key_full = s3_key + file_name
+ print("S3 key: ", s3_key_full)
+
+ s3_client.upload_file(file, s3_bucket, s3_key_full)
+ out_message = "File " + file_name + " uploaded successfully!"
+ print(out_message)
+
+ except Exception as e:
+ out_message = f"Error uploading file(s): {e}"
+ print(out_message)
+
+ final_out_message.append(out_message)
+ final_out_message_str = '\n'.join(final_out_message)
+
+ else: final_out_message_str = "Could not connect to AWS."
+ else: final_out_message_str = "At least one essential variable is empty, could not upload to S3"
+ except Exception as e:
+ final_out_message_str = "Could not upload files to S3 due to: " + str(e)
+ print(final_out_message_str)
+ else:
+ final_out_message_str = "App not set to run AWS functions"
+
+ return final_out_message_str
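The new `upload_log_file_to_s3` builds each destination key by concatenating the S3 prefix with the file's base name (`s3_key_full = s3_key + file_name`). A small self-contained sketch of just that step, with no boto3 call:

```python
import os

def build_s3_key(s3_prefix: str, local_file_path: str) -> str:
    # Mirrors the key construction inside upload_log_file_to_s3 above:
    # the object key is the prefix plus the file's base name.
    return s3_prefix + os.path.basename(local_file_path)

print(build_s3_key("usage/logs/", "/tmp/session123/usage_log.csv"))
# usage/logs/usage_log.csv
```

Note the concatenation is direct, so the prefix passed in (for example the value held in `usage_s3_logs_loc_state`) needs a trailing slash to produce a folder-like key.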
tools/aws_textract.py CHANGED
@@ -108,6 +108,174 @@ def convert_pike_pdf_page_to_bytes(pdf:object, page_num:int):
108
 
109
  return pdf_bytes
110
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
111
  def json_to_ocrresult(json_data:dict, page_width:float, page_height:float, page_no:int):
112
  '''
113
  Convert the json response from textract to the OCRResult format used elsewhere in the code. Looks for lines, words, and signatures. Handwriting and signatures are set aside especially for later in case the user wants to override the default behaviour and redact all handwriting/signatures.
@@ -118,7 +286,7 @@ def json_to_ocrresult(json_data:dict, page_width:float, page_height:float, page_
118
  handwriting_recogniser_results = []
119
  signatures = []
120
  handwriting = []
121
- ocr_results_with_children = {}
122
  text_block={}
123
 
124
  i = 1
@@ -141,7 +309,7 @@ def json_to_ocrresult(json_data:dict, page_width:float, page_height:float, page_
141
  is_signature = False
142
  is_handwriting = False
143
 
144
- for text_block in text_blocks:
145
 
146
  if (text_block['BlockType'] == 'LINE') | (text_block['BlockType'] == 'SIGNATURE'): # (text_block['BlockType'] == 'WORD') |
147
 
@@ -244,36 +412,53 @@ def json_to_ocrresult(json_data:dict, page_width:float, page_height:float, page_
244
  'text': line_text,
245
  'bounding_box': (line_left, line_top, line_right, line_bottom)
246
  }]
247
-
248
- ocr_results_with_children["text_line_" + str(i)] = {
 
 
 
 
 
 
 
 
 
 
 
249
  "line": i,
250
  'text': line_text,
251
  'bounding_box': (line_left, line_top, line_right, line_bottom),
252
- 'words': words
253
- }
 
254
 
255
  # Create OCRResult with absolute coordinates
256
  ocr_result = OCRResult(line_text, line_left, line_top, width_abs, height_abs)
257
  all_ocr_results.append(ocr_result)
258
 
259
- is_signature_or_handwriting = is_signature | is_handwriting
 
 
 
 
 
 
 
 
 
260
 
261
- # If it is signature or handwriting, will overwrite the default behaviour of the PII analyser
262
- if is_signature_or_handwriting:
263
- if recogniser_result not in signature_or_handwriting_recogniser_results:
264
- signature_or_handwriting_recogniser_results.append(recogniser_result)
265
 
266
- if is_signature:
267
- if recogniser_result not in signature_recogniser_results:
268
- signature_recogniser_results.append(recogniser_result)
269
 
270
- if is_handwriting:
271
- if recogniser_result not in handwriting_recogniser_results:
272
- handwriting_recogniser_results.append(recogniser_result)
273
 
274
- i += 1
275
 
276
- return all_ocr_results, signature_or_handwriting_recogniser_results, signature_recogniser_results, handwriting_recogniser_results, ocr_results_with_children
277
 
278
  def load_and_convert_textract_json(textract_json_file_path:str, log_files_output_paths:str, page_sizes_df:pd.DataFrame):
279
  """
@@ -315,7 +500,7 @@ def load_and_convert_textract_json(textract_json_file_path:str, log_files_output
315
  return {}, True, log_files_output_paths # Conversion failed
316
  else:
317
  print("Invalid Textract JSON format: 'Blocks' missing.")
318
- print("textract data:", textract_data)
319
  return {}, True, log_files_output_paths # Return empty data if JSON is not recognized
320
 
321
  def restructure_textract_output(textract_output: dict, page_sizes_df:pd.DataFrame):
 
108
 
109
  return pdf_bytes
110
 
111
+ # def json_to_ocrresult(json_data:dict, page_width:float, page_height:float, page_no:int):
112
+ # '''
113
+ # Convert the json response from textract to the OCRResult format used elsewhere in the code. Looks for lines, words, and signatures. Handwriting and signatures are set aside especially for later in case the user wants to override the default behaviour and redact all handwriting/signatures.
114
+ # '''
115
+ # all_ocr_results = []
116
+ # signature_or_handwriting_recogniser_results = []
117
+ # signature_recogniser_results = []
118
+ # handwriting_recogniser_results = []
119
+ # signatures = []
120
+ # handwriting = []
121
+ # ocr_results_with_words = {}
122
+ # text_block={}
123
+
124
+ # i = 1
125
+
126
+ # # Assuming json_data is structured as a dictionary with a "pages" key
127
+ # #if "pages" in json_data:
128
+ # # Find the specific page data
129
+ # page_json_data = json_data #next((page for page in json_data["pages"] if page["page_no"] == page_no), None)
130
+
131
+ # #print("page_json_data:", page_json_data)
132
+
133
+ # if "Blocks" in page_json_data:
134
+ # # Access the data for the specific page
135
+ # text_blocks = page_json_data["Blocks"] # Access the Blocks within the page data
136
+ # # This is a new page
137
+ # elif "page_no" in page_json_data:
138
+ # text_blocks = page_json_data["data"]["Blocks"]
139
+ # else: text_blocks = []
140
+
141
+ # is_signature = False
142
+ # is_handwriting = False
143
+
144
+ # for text_block in text_blocks:
145
+
146
+ # if (text_block['BlockType'] == 'LINE') | (text_block['BlockType'] == 'SIGNATURE'): # (text_block['BlockType'] == 'WORD') |
147
+
148
+ # # Extract text and bounding box for the line
149
+ # line_bbox = text_block["Geometry"]["BoundingBox"]
150
+ # line_left = int(line_bbox["Left"] * page_width)
151
+ # line_top = int(line_bbox["Top"] * page_height)
152
+ # line_right = int((line_bbox["Left"] + line_bbox["Width"]) * page_width)
153
+ # line_bottom = int((line_bbox["Top"] + line_bbox["Height"]) * page_height)
154
+
155
+ # width_abs = int(line_bbox["Width"] * page_width)
156
+ # height_abs = int(line_bbox["Height"] * page_height)
157
+
158
+ # if text_block['BlockType'] == 'LINE':
159
+
160
+ # # Extract text and bounding box for the line
161
+ # line_text = text_block.get('Text', '')
162
+ # words = []
163
+ # current_line_handwriting_results = [] # Track handwriting results for this line
+
+ # if 'Relationships' in text_block:
+ # for relationship in text_block['Relationships']:
+ # if relationship['Type'] == 'CHILD':
+ # for child_id in relationship['Ids']:
+ # child_block = next((block for block in text_blocks if block['Id'] == child_id), None)
+ # if child_block and child_block['BlockType'] == 'WORD':
+ # word_text = child_block.get('Text', '')
+ # word_bbox = child_block["Geometry"]["BoundingBox"]
+ # confidence = child_block.get('Confidence','')
+ # word_left = int(word_bbox["Left"] * page_width)
+ # word_top = int(word_bbox["Top"] * page_height)
+ # word_right = int((word_bbox["Left"] + word_bbox["Width"]) * page_width)
+ # word_bottom = int((word_bbox["Top"] + word_bbox["Height"]) * page_height)
+
+ # # Extract BoundingBox details
+ # word_width = word_bbox["Width"]
+ # word_height = word_bbox["Height"]
+
+ # # Convert proportional coordinates to absolute coordinates
+ # word_width_abs = int(word_width * page_width)
+ # word_height_abs = int(word_height * page_height)
+
+ # words.append({
+ # 'text': word_text,
+ # 'bounding_box': (word_left, word_top, word_right, word_bottom)
+ # })
+ # # Check for handwriting
+ # text_type = child_block.get("TextType", '')
+
+ # if text_type == "HANDWRITING":
+ # is_handwriting = True
+ # entity_name = "HANDWRITING"
+ # word_end = len(word_text)
+
+ # recogniser_result = CustomImageRecognizerResult(
+ # entity_type=entity_name,
+ # text=word_text,
+ # score=confidence,
+ # start=0,
+ # end=word_end,
+ # left=word_left,
+ # top=word_top,
+ # width=word_width_abs,
+ # height=word_height_abs
+ # )
+
+ # # Add to handwriting collections immediately
+ # handwriting.append(recogniser_result)
+ # handwriting_recogniser_results.append(recogniser_result)
+ # signature_or_handwriting_recogniser_results.append(recogniser_result)
+ # current_line_handwriting_results.append(recogniser_result)
+
+ # # If handwriting or signature, add to bounding box
+
+ # elif (text_block['BlockType'] == 'SIGNATURE'):
+ # line_text = "SIGNATURE"
+ # is_signature = True
+ # entity_name = "SIGNATURE"
+ # confidence = text_block.get('Confidence', 0)
+ # word_end = len(line_text)
+
+ # recogniser_result = CustomImageRecognizerResult(
+ # entity_type=entity_name,
+ # text=line_text,
+ # score=confidence,
+ # start=0,
+ # end=word_end,
+ # left=line_left,
+ # top=line_top,
+ # width=width_abs,
+ # height=height_abs
+ # )
+
+ # # Add to signature collections immediately
+ # signatures.append(recogniser_result)
+ # signature_recogniser_results.append(recogniser_result)
+ # signature_or_handwriting_recogniser_results.append(recogniser_result)
+
+ # words = [{
+ # 'text': line_text,
+ # 'bounding_box': (line_left, line_top, line_right, line_bottom)
+ # }]
+
+ # ocr_results_with_words["text_line_" + str(i)] = {
+ # "line": i,
+ # 'text': line_text,
+ # 'bounding_box': (line_left, line_top, line_right, line_bottom),
+ # 'words': words
+ # }
+
+ # # Create OCRResult with absolute coordinates
+ # ocr_result = OCRResult(line_text, line_left, line_top, width_abs, height_abs)
+ # all_ocr_results.append(ocr_result)
+
+ # is_signature_or_handwriting = is_signature | is_handwriting
+
+ # # If it is signature or handwriting, will overwrite the default behaviour of the PII analyser
+ # if is_signature_or_handwriting:
+ # if recogniser_result not in signature_or_handwriting_recogniser_results:
+ # signature_or_handwriting_recogniser_results.append(recogniser_result)
+
+ # if is_signature:
+ # if recogniser_result not in signature_recogniser_results:
+ # signature_recogniser_results.append(recogniser_result)
+
+ # if is_handwriting:
+ # if recogniser_result not in handwriting_recogniser_results:
+ # handwriting_recogniser_results.append(recogniser_result)
+
+ # i += 1
+
+ # return all_ocr_results, signature_or_handwriting_recogniser_results, signature_recogniser_results, handwriting_recogniser_results, ocr_results_with_words
+
+
 def json_to_ocrresult(json_data:dict, page_width:float, page_height:float, page_no:int):
 '''
 Convert the json response from textract to the OCRResult format used elsewhere in the code. Looks for lines, words, and signatures. Handwriting and signatures are set aside especially for later in case the user wants to override the default behaviour and redact all handwriting/signatures.

 handwriting_recogniser_results = []
 signatures = []
 handwriting = []
+ ocr_results_with_words = {}
 text_block={}

 i = 1

 is_signature = False
 is_handwriting = False

+ for text_block in text_blocks:

 if (text_block['BlockType'] == 'LINE') | (text_block['BlockType'] == 'SIGNATURE'): # (text_block['BlockType'] == 'WORD') |

 'text': line_text,
 'bounding_box': (line_left, line_top, line_right, line_bottom)
 }]
+ else:
+ line_text = ""
+ words=[]
+ line_left = 0
+ line_top = 0
+ line_right = 0
+ line_bottom = 0
+ width_abs = 0
+ height_abs = 0
+
+ if line_text:
+
+ ocr_results_with_words["text_line_" + str(i)] = {
 "line": i,
 'text': line_text,
 'bounding_box': (line_left, line_top, line_right, line_bottom),
+ 'words': words,
+ 'page': page_no
+ }

 # Create OCRResult with absolute coordinates
 ocr_result = OCRResult(line_text, line_left, line_top, width_abs, height_abs)
 all_ocr_results.append(ocr_result)

+ is_signature_or_handwriting = is_signature | is_handwriting
+
+ # If it is signature or handwriting, will overwrite the default behaviour of the PII analyser
+ if is_signature_or_handwriting:
+ if recogniser_result not in signature_or_handwriting_recogniser_results:
+ signature_or_handwriting_recogniser_results.append(recogniser_result)
+
+ if is_signature:
+ if recogniser_result not in signature_recogniser_results:
+ signature_recogniser_results.append(recogniser_result)
+
+ if is_handwriting:
+ if recogniser_result not in handwriting_recogniser_results:
+ handwriting_recogniser_results.append(recogniser_result)

+ i += 1

+ # Add page key to the line level results
+ all_ocr_results_with_page = {"page": page_no, "results": all_ocr_results}
+ ocr_results_with_words_with_page = {"page": page_no, "results": ocr_results_with_words}

+ return all_ocr_results_with_page, signature_or_handwriting_recogniser_results, signature_recogniser_results, handwriting_recogniser_results, ocr_results_with_words_with_page


 def load_and_convert_textract_json(textract_json_file_path:str, log_files_output_paths:str, page_sizes_df:pd.DataFrame):
 """

 return {}, True, log_files_output_paths # Conversion failed
 else:
 print("Invalid Textract JSON format: 'Blocks' missing.")
+ #print("textract data:", textract_data)
 return {}, True, log_files_output_paths # Return empty data if JSON is not recognized

 def restructure_textract_output(textract_output: dict, page_sizes_df:pd.DataFrame):
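The hunk above converts Textract's proportional `BoundingBox` fields into absolute pixel coordinates before building each `OCRResult`. A minimal sketch of that conversion; `bbox_to_absolute` is an illustrative helper name, since the diff performs the same arithmetic inline:

```python
def bbox_to_absolute(bbox: dict, page_width: int, page_height: int) -> tuple:
    """Convert a Textract-style proportional BoundingBox to pixel coordinates.

    Textract reports Left/Top/Width/Height as fractions of the page size,
    so each coordinate is scaled by the page dimensions and truncated to int.
    """
    left = int(bbox["Left"] * page_width)
    top = int(bbox["Top"] * page_height)
    right = int((bbox["Left"] + bbox["Width"]) * page_width)
    bottom = int((bbox["Top"] + bbox["Height"]) * page_height)
    return (left, top, right, bottom)

# A word occupying the middle fifth of a 1000 x 2000 px page:
print(bbox_to_absolute({"Left": 0.4, "Top": 0.4, "Width": 0.2, "Height": 0.2}, 1000, 2000))
# → (400, 800, 600, 1200)
```

The right/bottom edges are computed from `Left + Width` and `Top + Height` before scaling, which is why the function returns a four-tuple matching the `(left, top, right, bottom)` bounding boxes stored in `ocr_results_with_words`.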
tools/config.py CHANGED
@@ -108,19 +108,7 @@ if AWS_SECRET_KEY: print(f'AWS_SECRET_KEY found in environment variables')
 
 DOCUMENT_REDACTION_BUCKET = get_or_create_env_var('DOCUMENT_REDACTION_BUCKET', '')
 
- SHOW_BULK_TEXTRACT_CALL_OPTIONS = get_or_create_env_var('SHOW_BULK_TEXTRACT_CALL_OPTIONS', 'False') # This feature not currently implemented
- 
- TEXTRACT_BULK_ANALYSIS_BUCKET = get_or_create_env_var('TEXTRACT_BULK_ANALYSIS_BUCKET', '')
- 
- TEXTRACT_BULK_ANALYSIS_INPUT_SUBFOLDER = get_or_create_env_var('TEXTRACT_BULK_ANALYSIS_INPUT_SUBFOLDER', 'input')
- 
- TEXTRACT_BULK_ANALYSIS_OUTPUT_SUBFOLDER = get_or_create_env_var('TEXTRACT_BULK_ANALYSIS_OUTPUT_SUBFOLDER', 'output')
- 
- LOAD_PREVIOUS_TEXTRACT_JOBS_S3 = get_or_create_env_var('LOAD_PREVIOUS_TEXTRACT_JOBS_S3', 'False') # Whether or not to load previous Textract jobs from S3
- 
- TEXTRACT_JOBS_S3_LOC = get_or_create_env_var('TEXTRACT_JOBS_S3_LOC', 'output') # Subfolder in the DOCUMENT_REDACTION_BUCKET where the Textract jobs are stored
- 
- TEXTRACT_JOBS_LOCAL_LOC = get_or_create_env_var('TEXTRACT_JOBS_LOCAL_LOC', 'output') # Local subfolder where the Textract jobs are stored
 
 # Custom headers e.g. if routing traffic through Cloudfront
 # Retrieving or setting CUSTOM_HEADER
@@ -161,6 +149,8 @@ if OUTPUT_FOLDER == "TEMP" or INPUT_FOLDER == "TEMP":
 # By default, logs are put into a subfolder of today's date and the host name of the instance running the app. This is to avoid at all possible the possibility of log files from one instance overwriting the logs of another instance on S3. If running the app on one system always, or just locally, it is not necessary to make the log folders so specific.
 # Another way to address this issue would be to write logs to another type of storage, e.g. database such as dynamodb. I may look into this in future.
 
+ SAVE_LOGS_TO_CSV = get_or_create_env_var('SAVE_LOGS_TO_CSV', 'True')
+ 
 USE_LOG_SUBFOLDERS = get_or_create_env_var('USE_LOG_SUBFOLDERS', 'True')
 
 if USE_LOG_SUBFOLDERS == "True":
@@ -181,8 +171,28 @@ ensure_folder_exists(USAGE_LOGS_FOLDER)
 # Should the redacted file name be included in the logs? In some instances, the names of the files themselves could be sensitive, and should not be disclosed beyond the app. So, by default this is false.
 DISPLAY_FILE_NAMES_IN_LOGS = get_or_create_env_var('DISPLAY_FILE_NAMES_IN_LOGS', 'False')
 
+ # Further customisation options for CSV logs
+ 
+ CSV_ACCESS_LOG_HEADERS = get_or_create_env_var('CSV_ACCESS_LOG_HEADERS', '') # If blank, uses component labels
+ CSV_FEEDBACK_LOG_HEADERS = get_or_create_env_var('CSV_FEEDBACK_LOG_HEADERS', '') # If blank, uses component labels
+ CSV_USAGE_LOG_HEADERS = get_or_create_env_var('CSV_USAGE_LOG_HEADERS', '["session_hash_textbox", "doc_full_file_name_textbox", "data_full_file_name_textbox", "actual_time_taken_number", "total_page_count", "textract_query_number", "pii_detection_method", "comprehend_query_number", "cost_code", "textract_handwriting_signature", "host_name_textbox", "text_extraction_method", "is_this_a_textract_api_call"]') # If blank, uses component labels
+ 
+ ### DYNAMODB logs. Whether to save to DynamoDB, and the headers of the table
+ 
+ SAVE_LOGS_TO_DYNAMODB = get_or_create_env_var('SAVE_LOGS_TO_DYNAMODB', 'False')
+ 
+ ACCESS_LOG_DYNAMODB_TABLE_NAME = get_or_create_env_var('ACCESS_LOG_DYNAMODB_TABLE_NAME', 'redaction_access_log')
+ DYNAMODB_ACCESS_LOG_HEADERS = get_or_create_env_var('DYNAMODB_ACCESS_LOG_HEADERS', '')
+ 
+ FEEDBACK_LOG_DYNAMODB_TABLE_NAME = get_or_create_env_var('FEEDBACK_LOG_DYNAMODB_TABLE_NAME', 'redaction_feedback')
+ DYNAMODB_FEEDBACK_LOG_HEADERS = get_or_create_env_var('DYNAMODB_FEEDBACK_LOG_HEADERS', '')
+ 
+ USAGE_LOG_DYNAMODB_TABLE_NAME = get_or_create_env_var('USAGE_LOG_DYNAMODB_TABLE_NAME', 'redaction_usage')
+ DYNAMODB_USAGE_LOG_HEADERS = get_or_create_env_var('DYNAMODB_USAGE_LOG_HEADERS', '')
+ 
+ ###
+ # REDACTION
 ###
- # REDACTION CONFIG
 
 # Create Tesseract and Poppler folders if you have installed them locally
 TESSERACT_FOLDER = get_or_create_env_var('TESSERACT_FOLDER', "") # e.g. tesseract/
@@ -226,7 +236,7 @@ ROOT_PATH = get_or_create_env_var('ROOT_PATH', '')
 
 DEFAULT_CONCURRENCY_LIMIT = get_or_create_env_var('DEFAULT_CONCURRENCY_LIMIT', '3')
 
- GET_DEFAULT_ALLOW_LIST = get_or_create_env_var('GET_DEFAULT_ALLOW_LIST', 'False')
+ GET_DEFAULT_ALLOW_LIST = get_or_create_env_var('GET_DEFAULT_ALLOW_LIST', '')
 
 ALLOW_LIST_PATH = get_or_create_env_var('ALLOW_LIST_PATH', '') # config/default_allow_list.csv
@@ -235,19 +245,38 @@ S3_ALLOW_LIST_PATH = get_or_create_env_var('S3_ALLOW_LIST_PATH', '') # default_a
 if ALLOW_LIST_PATH: OUTPUT_ALLOW_LIST_PATH = ALLOW_LIST_PATH
 else: OUTPUT_ALLOW_LIST_PATH = 'config/default_allow_list.csv'
 
+ ### COST CODE OPTIONS
+ 
 SHOW_COSTS = get_or_create_env_var('SHOW_COSTS', 'False')
 
- GET_COST_CODES = get_or_create_env_var('GET_COST_CODES', 'False')
+ GET_COST_CODES = get_or_create_env_var('GET_COST_CODES', 'True')
 
 DEFAULT_COST_CODE = get_or_create_env_var('DEFAULT_COST_CODE', '')
 
 COST_CODES_PATH = get_or_create_env_var('COST_CODES_PATH', '') # 'config/COST_CENTRES.csv' # file should be a csv file with a single table in it that has two columns with a header. First column should contain cost codes, second column should contain a name or description for the cost code
 
- S3_COST_CODES_PATH = get_or_create_env_var('S3_COST_CODES_PATH', '') # COST_CENTRES.csv # This is a path within the DOCUMENT_REDACTION_BUCKET
- 
+ S3_COST_CODES_PATH = get_or_create_env_var('S3_COST_CODES_PATH', '') # COST_CENTRES.csv # This is a path within the DOCUMENT_REDACTION_BUCKET
+ 
+ # A default path in case s3 cost code location is provided but no local cost code location given
 if COST_CODES_PATH: OUTPUT_COST_CODES_PATH = COST_CODES_PATH
- else: OUTPUT_COST_CODES_PATH = 'config/COST_CENTRES.csv'
+ else: OUTPUT_COST_CODES_PATH = 'config/cost_codes.csv'
 
 ENFORCE_COST_CODES = get_or_create_env_var('ENFORCE_COST_CODES', 'False') # If you have cost codes listed, is it compulsory to choose one before redacting?
 
- if ENFORCE_COST_CODES == 'True': GET_COST_CODES = 'True'
+ if ENFORCE_COST_CODES == 'True': GET_COST_CODES = 'True'
+ 
+ ### WHOLE DOCUMENT API OPTIONS
+ 
+ SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS = get_or_create_env_var('SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS', 'False') # This feature not currently implemented
+ 
+ TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET = get_or_create_env_var('TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET', '')
+ 
+ TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER = get_or_create_env_var('TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER', 'input')
+ 
+ TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_OUTPUT_SUBFOLDER = get_or_create_env_var('TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_OUTPUT_SUBFOLDER', 'output')
+ 
+ LOAD_PREVIOUS_TEXTRACT_JOBS_S3 = get_or_create_env_var('LOAD_PREVIOUS_TEXTRACT_JOBS_S3', 'False') # Whether or not to load previous Textract jobs from S3
+ 
+ TEXTRACT_JOBS_S3_LOC = get_or_create_env_var('TEXTRACT_JOBS_S3_LOC', 'output') # Subfolder in the DOCUMENT_REDACTION_BUCKET where the Textract jobs are stored
+ 
+ TEXTRACT_JOBS_LOCAL_LOC = get_or_create_env_var('TEXTRACT_JOBS_LOCAL_LOC', 'output') # Local subfolder where the Textract jobs are stored
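Every option in `tools/config.py` flows through `get_or_create_env_var`, so any setting can be overridden per deployment without code changes. A minimal sketch of that pattern, assuming the helper registers the default back into the environment; the real implementation lives in the repo and may differ:

```python
import os

def get_or_create_env_var(var_name: str, default_value: str) -> str:
    """Read an environment variable, falling back to (and registering) a default.

    Illustrative assumption of the config helper's behaviour, not its exact code.
    """
    value = os.environ.get(var_name)
    if value is None:
        # Register the default so later reads see a consistent value
        os.environ[var_name] = default_value
        value = default_value
    return value

SAVE_LOGS_TO_DYNAMODB = get_or_create_env_var('SAVE_LOGS_TO_DYNAMODB', 'False')
print(SAVE_LOGS_TO_DYNAMODB)
```

Note that every value stays a string, which is why the config compares with `== "True"` or `== "1"` rather than truthiness.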
tools/custom_csvlogger.py CHANGED
@@ -4,6 +4,10 @@ import csv
 import datetime
 import os
 import re
+ import boto3
+ import botocore
+ import uuid
+ import time
 from collections.abc import Sequence
 from multiprocessing import Lock
 from pathlib import Path
@@ -11,6 +15,9 @@ from typing import TYPE_CHECKING, Any
 from gradio_client import utils as client_utils
 import gradio as gr
 from gradio import utils, wasm_utils
+ from tools.config import AWS_REGION, AWS_ACCESS_KEY, AWS_SECRET_KEY, RUN_AWS_FUNCTIONS
+ from botocore.exceptions import NoCredentialsError, TokenRetrievalError
+ 
 
 if TYPE_CHECKING:
 from gradio.components import Component
@@ -62,21 +69,28 @@ class CSVLogger_custom(FlaggingCallback):
 self.flagging_dir = Path(flagging_dir)
 self.first_time = True
 
- def _create_dataset_file(self, additional_headers: list[str] | None = None):
+ def _create_dataset_file(
+ self,
+ additional_headers: list[str] | None = None,
+ replacement_headers: list[str] | None = None
+ ):
 os.makedirs(self.flagging_dir, exist_ok=True)
 
- if additional_headers is None:
- additional_headers = []
- headers = (
- [
+ if replacement_headers:
+ if len(replacement_headers) != len(self.components):
+ raise ValueError(
+ f"replacement_headers must have the same length as components "
+ f"({len(replacement_headers)} provided, {len(self.components)} expected)"
+ )
+ headers = replacement_headers + ["timestamp"]
+ else:
+ if additional_headers is None:
+ additional_headers = []
+ headers = [
 getattr(component, "label", None) or f"component {idx}"
 for idx, component in enumerate(self.components)
- ]
- + additional_headers
- + [
- "timestamp",
- ]
- )
+ ] + additional_headers + ["timestamp"]
+ 
 headers = utils.sanitize_list_for_csv(headers)
 dataset_files = list(Path(self.flagging_dir).glob("dataset*.csv"))
@@ -115,18 +129,24 @@ class CSVLogger_custom(FlaggingCallback):
 print("Using existing dataset file at:", self.dataset_filepath)
 
 def flag(
- self,
- flag_data: list[Any],
- flag_option: str | None = None,
- username: str | None = None,
- ) -> int:
+ self,
+ flag_data: list[Any],
+ flag_option: str | None = None,
+ username: str | None = None,
+ save_to_csv: bool = True,
+ save_to_dynamodb: bool = False,
+ dynamodb_table_name: str | None = None,
+ dynamodb_headers: list[str] | None = None, # New: specify headers for DynamoDB
+ replacement_headers: list[str] | None = None
+ ) -> int:
 if self.first_time:
 additional_headers = []
 if flag_option is not None:
 additional_headers.append("flag")
 if username is not None:
 additional_headers.append("username")
- self._create_dataset_file(additional_headers=additional_headers)
+ additional_headers.append("id")
+ self._create_dataset_file(additional_headers=additional_headers, replacement_headers=replacement_headers)
 self.first_time = False
 
 csv_data = []
@@ -155,15 +175,131 @@ class CSVLogger_custom(FlaggingCallback):
 csv_data.append(flag_option)
 if username is not None:
 csv_data.append(username)
- csv_data.append(str(datetime.datetime.now()))
 
- with self.lock:
- with open(
- self.dataset_filepath, "a", newline="", encoding="utf-8"
- ) as csvfile:
- writer = csv.writer(csvfile)
- writer.writerow(utils.sanitize_list_for_csv(csv_data))
- with open(self.dataset_filepath, encoding="utf-8") as csvfile:
- line_count = len(list(csv.reader(csvfile))) - 1
+ 
+ timestamp = str(datetime.datetime.now())
+ csv_data.append(timestamp)
+ 
+ generated_id = str(uuid.uuid4())
+ csv_data.append(generated_id)
+ 
+ # Build the headers
+ headers = (
+ [getattr(component, "label", None) or f"component {idx}" for idx, component in enumerate(self.components)]
+ )
+ if flag_option is not None:
+ headers.append("flag")
+ if username is not None:
+ headers.append("username")
+ headers.append("timestamp")
+ headers.append("id")
+ 
+ line_count = -1
+ 
+ if save_to_csv:
+ with self.lock:
+ with open(self.dataset_filepath, "a", newline="", encoding="utf-8") as csvfile:
+ writer = csv.writer(csvfile)
+ writer.writerow(utils.sanitize_list_for_csv(csv_data))
+ with open(self.dataset_filepath, encoding="utf-8") as csvfile:
+ line_count = len(list(csv.reader(csvfile))) - 1
+ 
+ if save_to_dynamodb == True:
+ 
+ if RUN_AWS_FUNCTIONS == "1":
+ try:
+ print("Connecting to DynamoDB via existing SSO connection")
+ dynamodb = boto3.resource('dynamodb', region_name=AWS_REGION)
+ #client = boto3.client('dynamodb')
+ 
+ test_connection = dynamodb.meta.client.list_tables()
+ 
+ except Exception as e:
+ print("No SSO credentials found:", e)
+ if AWS_ACCESS_KEY and AWS_SECRET_KEY:
+ print("Trying DynamoDB credentials from environment variables")
+ dynamodb = boto3.resource('dynamodb',aws_access_key_id=AWS_ACCESS_KEY,
+ aws_secret_access_key=AWS_SECRET_KEY, region_name=AWS_REGION)
+ # client = boto3.client('dynamodb',aws_access_key_id=AWS_ACCESS_KEY,
+ # aws_secret_access_key=AWS_SECRET_KEY, region_name=AWS_REGION)
+ else:
+ raise Exception("AWS credentials for DynamoDB logging not found")
+ else:
+ raise Exception("AWS credentials for DynamoDB logging not found")
+ 
+ if dynamodb_table_name is None:
+ raise ValueError("You must provide a dynamodb_table_name if save_to_dynamodb is True")
+ 
+ if dynamodb_headers:
+ dynamodb_headers = dynamodb_headers
+ if not dynamodb_headers and replacement_headers:
+ dynamodb_headers = replacement_headers
+ elif headers:
+ dynamodb_headers = headers
+ elif not dynamodb_headers:
+ raise ValueError("Headers not found. You must provide dynamodb_headers or replacement_headers to create a new table.")
+ 
+ if flag_option is not None:
+ if "flag" not in dynamodb_headers:
+ dynamodb_headers.append("flag")
+ if username is not None:
+ if "username" not in dynamodb_headers:
+ dynamodb_headers.append("username")
+ if "timestamp" not in dynamodb_headers:
+ dynamodb_headers.append("timestamp")
+ if "id" not in dynamodb_headers:
+ dynamodb_headers.append("id")
+ 
+ # Table doesn't exist — create it
+ try:
+ table = dynamodb.Table(dynamodb_table_name)
+ table.load()
+ except botocore.exceptions.ClientError as e:
+ if e.response['Error']['Code'] == 'ResourceNotFoundException':
+ 
+ #print(f"Creating DynamoDB table '{dynamodb_table_name}'...")
+ #print("dynamodb_headers:", dynamodb_headers)
+ 
+ attribute_definitions = [
+ {'AttributeName': 'id', 'AttributeType': 'S'} # Only define key attributes here
+ ]
+ 
+ table = dynamodb.create_table(
+ TableName=dynamodb_table_name,
+ KeySchema=[
+ {'AttributeName': 'id', 'KeyType': 'HASH'} # Partition key
+ ],
+ AttributeDefinitions=attribute_definitions,
+ BillingMode='PAY_PER_REQUEST'
+ )
+ # Wait until the table exists
+ table.meta.client.get_waiter('table_exists').wait(TableName=dynamodb_table_name)
+ time.sleep(5)
+ print(f"Table '{dynamodb_table_name}' created successfully.")
+ else:
+ raise
+ 
+ # Prepare the DynamoDB item to upload
+ 
+ try:
+ item = {
+ 'id': str(generated_id), # UUID primary key
+ #'created_by': username if username else "unknown",
+ 'timestamp': timestamp,
+ }
+ 
+ #print("dynamodb_headers:", dynamodb_headers)
+ #print("csv_data:", csv_data)
+ 
+ # Map the headers to values
+ item.update({header: str(value) for header, value in zip(dynamodb_headers, csv_data)})
+ 
+ #print("item:", item)
+ 
+ table.put_item(Item=item)
+ 
+ print("Successfully uploaded log to DynamoDB")
+ except Exception as e:
+ print("Could not upload log to DynamoDB due to", e)
 
 return line_count
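The reworked `flag()` builds a single header list (component labels or positional fallbacks, plus `flag`/`username` when present, plus `timestamp` and `id`) and zips it against the row values to form the DynamoDB item. A condensed sketch of that pairing; `build_log_item` is a hypothetical name for logic the method performs inline:

```python
import datetime
import uuid

def build_log_item(labels: list, row: list, flag_option=None, username=None) -> dict:
    """Pair column headers with row values the way flag() does before put_item."""
    # Component labels, falling back to a positional name when a label is missing
    headers = [label or f"component {idx}" for idx, label in enumerate(labels)]
    if flag_option is not None:
        headers.append("flag")
        row = row + [flag_option]
    if username is not None:
        headers.append("username")
        row = row + [username]
    # Bookkeeping columns appended last, matching the CSV column order
    headers += ["timestamp", "id"]
    row = row + [str(datetime.datetime.now()), str(uuid.uuid4())]
    # DynamoDB items are plain dicts; every value is stringified as in the diff
    return {header: str(value) for header, value in zip(headers, row)}

item = build_log_item(["file_name", None], ["report.pdf", 12], username="alice")
print(sorted(item))
# → ['component 1', 'file_name', 'id', 'timestamp', 'username']
```

Keeping the header list and the row appended in the same order is what makes the `zip` safe; the generated `id` doubles as the table's partition key.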
tools/custom_image_analyser_engine.py CHANGED
@@ -775,9 +775,52 @@ def merge_text_bounding_boxes(analyser_results:dict, characters: List[LTChar], c
775
 
776
  return analysed_bounding_boxes
777
 
778
- # Function to combine OCR results into line-level results
779
- def combine_ocr_results(ocr_results:dict, x_threshold:float=50.0, y_threshold:float=12.0):
780
- # Group OCR results into lines based on y_threshold
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
781
  lines = []
782
  current_line = []
783
  for result in sorted(ocr_results, key=lambda x: x.top):
@@ -796,26 +839,11 @@ def combine_ocr_results(ocr_results:dict, x_threshold:float=50.0, y_threshold:fl
796
  # Flatten the sorted lines back into a single list
797
  sorted_results = [result for line in lines for result in line]
798
 
799
- combined_results = []
800
- new_format_results = {}
801
  current_line = []
802
  current_bbox = None
803
- line_counter = 1
804
-
805
- def create_ocr_result_with_children(combined_results, i, current_bbox, current_line):
806
- combined_results["text_line_" + str(i)] = {
807
- "line": i,
808
- 'text': current_bbox.text,
809
- 'bounding_box': (current_bbox.left, current_bbox.top,
810
- current_bbox.left + current_bbox.width,
811
- current_bbox.top + current_bbox.height),
812
- 'words': [{'text': word.text,
813
- 'bounding_box': (word.left, word.top,
814
- word.left + word.width,
815
- word.top + word.height)}
816
- for word in current_line]
817
- }
818
- return combined_results["text_line_" + str(i)]
819
 
820
  for result in sorted_results:
821
  if not current_line:
@@ -838,26 +866,101 @@ def combine_ocr_results(ocr_results:dict, x_threshold:float=50.0, y_threshold:fl
838
  height=max(current_bbox.height, result.height)
839
  )
840
  current_line.append(result)
841
- else:
842
-
843
 
844
  # Commit the current line and start a new one
845
- combined_results.append(current_bbox)
846
 
847
- new_format_results["text_line_" + str(line_counter)] = create_ocr_result_with_children(new_format_results, line_counter, current_bbox, current_line)
 
848
 
849
  line_counter += 1
850
  current_line = [result]
851
  current_bbox = result
852
-
853
  # Append the last line
854
  if current_bbox:
855
- combined_results.append(current_bbox)
 
 
 
856
 
857
- new_format_results["text_line_" + str(line_counter)] = create_ocr_result_with_children(new_format_results, line_counter, current_bbox, current_line)
 
 
858
 
 
859
 
860
- return combined_results, new_format_results
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
861
 
862
  class CustomImageAnalyzerEngine:
863
  def __init__(
@@ -911,7 +1014,7 @@ class CustomImageAnalyzerEngine:
911
  def analyze_text(
912
  self,
913
  line_level_ocr_results: List[OCRResult],
914
- ocr_results_with_children: Dict[str, Dict],
915
  chosen_redact_comprehend_entities: List[str],
916
  pii_identification_method: str = "Local",
917
  comprehend_client = "",
@@ -1036,9 +1139,9 @@ class CustomImageAnalyzerEngine:
1036
  combined_results = []
1037
  for i, text_line in enumerate(line_level_ocr_results):
1038
  line_results = next((results for idx, results in all_text_line_results if idx == i), [])
1039
- if line_results and i < len(ocr_results_with_children):
1040
- child_level_key = list(ocr_results_with_children.keys())[i]
1041
- ocr_results_with_children_line_level = ocr_results_with_children[child_level_key]
1042
 
1043
  for result in line_results:
1044
  bbox_results = self.map_analyzer_results_to_bounding_boxes(
@@ -1052,7 +1155,7 @@ class CustomImageAnalyzerEngine:
1052
  )],
1053
  text_line.text,
1054
  text_analyzer_kwargs.get('allow_list', []),
1055
- ocr_results_with_children_line_level
1056
  )
1057
  combined_results.extend(bbox_results)
1058
 
@@ -1064,14 +1167,14 @@ class CustomImageAnalyzerEngine:
1064
  redaction_relevant_ocr_results: List[OCRResult],
1065
  full_text: str,
1066
  allow_list: List[str],
1067
- ocr_results_with_children_child_info: Dict[str, Dict]
1068
  ) -> List[CustomImageRecognizerResult]:
1069
  redaction_bboxes = []
1070
 
1071
  for redaction_relevant_ocr_result in redaction_relevant_ocr_results:
1072
- #print("ocr_results_with_children_child_info:", ocr_results_with_children_child_info)
1073
 
1074
- line_text = ocr_results_with_children_child_info['text']
1075
  line_length = len(line_text)
1076
  redaction_text = redaction_relevant_ocr_result.text
1077
 
@@ -1097,7 +1200,7 @@ class CustomImageAnalyzerEngine:
1097
 
1098
  # print(f"Found match: '{matched_text}' in line")
1099
 
1100
- # for word_info in ocr_results_with_children_child_info.get('words', []):
1101
  # # Check if this word is part of our match
1102
  # if any(word.lower() in word_info['text'].lower() for word in matched_words):
1103
  # matching_word_boxes.append(word_info['bounding_box'])
@@ -1106,11 +1209,11 @@ class CustomImageAnalyzerEngine:
1106
  # Find the corresponding words in the OCR results
1107
  matching_word_boxes = []
1108
 
1109
- #print("ocr_results_with_children_child_info:", ocr_results_with_children_child_info)
1110
 
1111
  current_position = 0
1112
 
1113
- for word_info in ocr_results_with_children_child_info.get('words', []):
1114
  word_text = word_info['text']
1115
  word_length = len(word_text)
1116
 
 
         return analysed_bounding_boxes
 
+def recreate_page_line_level_ocr_results_with_page(page_line_level_ocr_results_with_words: dict):
+    reconstructed_results = []
+
+    # Assume all lines belong to the same page, so we can just read it from one item
+    #page = next(iter(page_line_level_ocr_results_with_words.values()))["page"]
+
+    page = page_line_level_ocr_results_with_words["page"]
+
+    for line_data in page_line_level_ocr_results_with_words["results"].values():
+        bbox = line_data["bounding_box"]
+        text = line_data["text"]
+
+        # Recreate the OCRResult (you'll need the OCRResult class imported)
+        line_result = OCRResult(
+            text=text,
+            left=bbox[0],
+            top=bbox[1],
+            width=bbox[2] - bbox[0],
+            height=bbox[3] - bbox[1],
+        )
+        reconstructed_results.append(line_result)
+
+    page_line_level_ocr_results_with_page = {"page": page, "results": reconstructed_results}
+
+    return page_line_level_ocr_results_with_page
+
+def create_ocr_result_with_children(combined_results: dict, i: int, current_bbox: dict, current_line: list):
+    combined_results["text_line_" + str(i)] = {
+        "line": i,
+        'text': current_bbox.text,
+        'bounding_box': (current_bbox.left, current_bbox.top,
+                         current_bbox.left + current_bbox.width,
+                         current_bbox.top + current_bbox.height),
+        'words': [{'text': word.text,
+                   'bounding_box': (word.left, word.top,
+                                    word.left + word.width,
+                                    word.top + word.height)}
+                  for word in current_line]
+    }
+    return combined_results["text_line_" + str(i)]
+
+def combine_ocr_results(ocr_results: dict, x_threshold: float = 50.0, y_threshold: float = 12.0, page: int = 1):
+    '''
+    Group OCR results into lines based on y_threshold. Create line level ocr results, and word level OCR results
+    '''
+
     lines = []
     current_line = []
     for result in sorted(ocr_results, key=lambda x: x.top):
 
     # Flatten the sorted lines back into a single list
     sorted_results = [result for line in lines for result in line]
 
+    page_line_level_ocr_results = []
+    page_line_level_ocr_results_with_words = {}
     current_line = []
     current_bbox = None
+    line_counter = 1
 
     for result in sorted_results:
         if not current_line:
 
                 height=max(current_bbox.height, result.height)
             )
             current_line.append(result)
+        else:
 
             # Commit the current line and start a new one
+            page_line_level_ocr_results.append(current_bbox)
 
+            page_line_level_ocr_results_with_words["text_line_" + str(line_counter)] = create_ocr_result_with_children(page_line_level_ocr_results_with_words, line_counter, current_bbox, current_line)
+            #page_line_level_ocr_results_with_words["text_line_" + str(line_counter)]["page"] = page
 
             line_counter += 1
             current_line = [result]
             current_bbox = result
 
     # Append the last line
     if current_bbox:
+        page_line_level_ocr_results.append(current_bbox)
+
+        page_line_level_ocr_results_with_words["text_line_" + str(line_counter)] = create_ocr_result_with_children(page_line_level_ocr_results_with_words, line_counter, current_bbox, current_line)
+        #page_line_level_ocr_results_with_words["text_line_" + str(line_counter)]["page"] = page
 
+    # Add page key to the line level results
+    page_line_level_ocr_results_with_page = {"page": page, "results": page_line_level_ocr_results}
+    page_line_level_ocr_results_with_words = {"page": page, "results": page_line_level_ocr_results_with_words}
 
+    return page_line_level_ocr_results_with_page, page_line_level_ocr_results_with_words
 
+
+# Function to combine OCR results into line-level results
+# def combine_ocr_results(ocr_results:dict, x_threshold:float=50.0, y_threshold:float=12.0):
+#     '''
+#     Group OCR results into lines based on y_threshold. Create line level ocr results, and word level OCR results
+#     '''
+
+#     lines = []
+#     current_line = []
+#     for result in sorted(ocr_results, key=lambda x: x.top):
+#         if not current_line or abs(result.top - current_line[0].top) <= y_threshold:
+#             current_line.append(result)
+#         else:
+#             lines.append(current_line)
+#             current_line = [result]
+#     if current_line:
+#         lines.append(current_line)
+
+#     # Sort each line by left position
+#     for line in lines:
+#         line.sort(key=lambda x: x.left)
+
+#     # Flatten the sorted lines back into a single list
+#     sorted_results = [result for line in lines for result in line]
+
+#     page_line_level_ocr_results = []
+#     page_line_level_ocr_results_with_words = {}
+#     current_line = []
+#     current_bbox = None
+#     line_counter = 1
+
+#     for result in sorted_results:
+#         if not current_line:
+#             # Start a new line
+#             current_line.append(result)
+#             current_bbox = result
+#         else:
+#             # Check if the result is on the same line (y-axis) and close horizontally (x-axis)
+#             last_result = current_line[-1]
+
+#             if abs(result.top - last_result.top) <= y_threshold and \
+#                (result.left - (last_result.left + last_result.width)) <= x_threshold:
+#                 # Update the bounding box to include the new word
+#                 new_right = max(current_bbox.left + current_bbox.width, result.left + result.width)
+#                 current_bbox = OCRResult(
+#                     text=f"{current_bbox.text} {result.text}",
+#                     left=current_bbox.left,
+#                     top=current_bbox.top,
+#                     width=new_right - current_bbox.left,
+#                     height=max(current_bbox.height, result.height)
+#                 )
+#                 current_line.append(result)
+#             else:
+
+#                 # Commit the current line and start a new one
+#                 page_line_level_ocr_results.append(current_bbox)
+
+#                 page_line_level_ocr_results_with_words["text_line_" + str(line_counter)] = create_ocr_result_with_children(page_line_level_ocr_results_with_words, line_counter, current_bbox, current_line)
+
+#                 line_counter += 1
+#                 current_line = [result]
+#                 current_bbox = result
+
+#     # Append the last line
+#     if current_bbox:
+#         page_line_level_ocr_results.append(current_bbox)
+
+#         page_line_level_ocr_results_with_words["text_line_" + str(line_counter)] = create_ocr_result_with_children(page_line_level_ocr_results_with_words, line_counter, current_bbox, current_line)
+
+
+#     return page_line_level_ocr_results, page_line_level_ocr_results_with_words
 
 class CustomImageAnalyzerEngine:
     def __init__(
 
     def analyze_text(
         self,
         line_level_ocr_results: List[OCRResult],
+        ocr_results_with_words: Dict[str, Dict],
         chosen_redact_comprehend_entities: List[str],
         pii_identification_method: str = "Local",
         comprehend_client = "",
 
         combined_results = []
         for i, text_line in enumerate(line_level_ocr_results):
             line_results = next((results for idx, results in all_text_line_results if idx == i), [])
+            if line_results and i < len(ocr_results_with_words):
+                child_level_key = list(ocr_results_with_words.keys())[i]
+                ocr_results_with_words_line_level = ocr_results_with_words[child_level_key]
 
                 for result in line_results:
                     bbox_results = self.map_analyzer_results_to_bounding_boxes(
 
                     )],
                     text_line.text,
                     text_analyzer_kwargs.get('allow_list', []),
+                    ocr_results_with_words_line_level
                 )
                 combined_results.extend(bbox_results)
 
         redaction_relevant_ocr_results: List[OCRResult],
         full_text: str,
         allow_list: List[str],
+        ocr_results_with_words_child_info: Dict[str, Dict]
     ) -> List[CustomImageRecognizerResult]:
         redaction_bboxes = []
 
         for redaction_relevant_ocr_result in redaction_relevant_ocr_results:
+            #print("ocr_results_with_words_child_info:", ocr_results_with_words_child_info)
 
+            line_text = ocr_results_with_words_child_info['text']
             line_length = len(line_text)
             redaction_text = redaction_relevant_ocr_result.text
 
             # print(f"Found match: '{matched_text}' in line")
 
+            # for word_info in ocr_results_with_words_child_info.get('words', []):
             #     # Check if this word is part of our match
             #     if any(word.lower() in word_info['text'].lower() for word in matched_words):
             #         matching_word_boxes.append(word_info['bounding_box'])
 
             # Find the corresponding words in the OCR results
             matching_word_boxes = []
 
+            #print("ocr_results_with_words_child_info:", ocr_results_with_words_child_info)
 
             current_position = 0
 
+            for word_info in ocr_results_with_words_child_info.get('words', []):
                 word_text = word_info['text']
                 word_length = len(word_text)
 
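The added `combine_ocr_results` first groups word-level OCR boxes into lines: a word joins the current line when its top edge is within `y_threshold` of the line's first word, and each finished line is then sorted left-to-right. A self-contained sketch of just that grouping step (the `OCRResult` dataclass here is a simplified stand-in for the app's own class, and `group_into_lines` is an illustrative name):

```python
from dataclasses import dataclass

@dataclass
class OCRResult:
    # Simplified stand-in for the app's OCRResult class
    text: str
    left: float
    top: float
    width: float
    height: float

def group_into_lines(ocr_results, y_threshold=12.0):
    """Group word-level OCR boxes into lines: a word shares a line when its
    top is within y_threshold of the first word in the current line."""
    lines = []
    current_line = []
    for result in sorted(ocr_results, key=lambda r: r.top):
        if not current_line or abs(result.top - current_line[0].top) <= y_threshold:
            current_line.append(result)
        else:
            lines.append(current_line)
            current_line = [result]
    if current_line:
        lines.append(current_line)
    # Sort each line left-to-right, as the real function does
    for line in lines:
        line.sort(key=lambda r: r.left)
    return lines

words = [
    OCRResult("world", 60, 10, 40, 12),
    OCRResult("Hello", 10, 11, 45, 12),
    OCRResult("Bye", 10, 40, 30, 12),
]
lines = group_into_lines(words)
print([[w.text for w in line] for line in lines])  # → [['Hello', 'world'], ['Bye']]
```

The commit builds on this by also emitting a per-line `words` dictionary (via `create_ocr_result_with_children`) and tagging the results with their page number, so word-level boxes can be recovered later without re-running OCR.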
tools/data_anonymise.py CHANGED
@@ -1,10 +1,12 @@
 import re
+import os
 import secrets
 import base64
 import time
 import boto3
 import botocore
 import pandas as pd
+from openpyxl import Workbook, load_workbook
 
 from faker import Faker
 from gradio import Progress
@@ -226,6 +228,7 @@ def anonymise_data_files(file_paths: List[str],
                          comprehend_query_number:int=0,
                          aws_access_key_textbox:str='',
                          aws_secret_key_textbox:str='',
+                         actual_time_taken_number:float=0,
                          progress: Progress = Progress(track_tqdm=True)):
     """
     This function anonymises data files based on the provided parameters.
@@ -252,6 +255,7 @@ def anonymise_data_files(file_paths: List[str],
     - comprehend_query_number (int, optional): A counter tracking the number of queries to AWS Comprehend.
     - aws_access_key_textbox (str, optional): AWS access key for account with Textract and Comprehend permissions.
     - aws_secret_key_textbox (str, optional): AWS secret key for account with Textract and Comprehend permissions.
+    - actual_time_taken_number (float, optional): Time taken to do the redaction.
     - progress (Progress, optional): A Progress object to track progress. Defaults to a Progress object with track_tqdm=True.
     """
@@ -277,9 +281,16 @@ def anonymise_data_files(file_paths: List[str],
     if not out_file_paths:
         out_file_paths = []
 
-
-    if in_allow_list:
-        in_allow_list_flat = in_allow_list #[item for sublist in in_allow_list for item in sublist]
+    if isinstance(in_allow_list, list):
+        if in_allow_list:
+            in_allow_list_flat = in_allow_list
+        else:
+            in_allow_list_flat = []
+    elif isinstance(in_allow_list, pd.DataFrame):
+        if not in_allow_list.empty:
+            in_allow_list_flat = list(in_allow_list.iloc[:, 0].unique())
+        else:
+            in_allow_list_flat = []
     else:
         in_allow_list_flat = []
@@ -306,7 +317,7 @@ def anonymise_data_files(file_paths: List[str],
         else:
             comprehend_client = ""
             out_message = "Cannot connect to AWS Comprehend service. Please provide access keys under Textract settings on the Redaction settings tab, or choose another PII identification method."
-            print(out_message)
+            raise(out_message)
 
     # Check if files and text exist
     if not file_paths:
@@ -314,7 +325,7 @@ def anonymise_data_files(file_paths: List[str],
             file_paths=['open_text']
         else:
            out_message = "Please enter text or a file to redact."
-            return out_message, out_file_paths, out_file_paths, latest_file_completed, log_files_output_paths, log_files_output_paths
+            raise Exception(out_message)
 
     # If we have already redacted the last file, return the input out_message and file list to the relevant components
     if latest_file_completed >= len(file_paths):
@@ -322,18 +333,18 @@ def anonymise_data_files(file_paths: List[str],
         # Set to a very high number so as not to mess with subsequent file processing by the user
         latest_file_completed = 99
         final_out_message = '\n'.join(out_message)
-        return final_out_message, out_file_paths, out_file_paths, latest_file_completed, log_files_output_paths, log_files_output_paths
+        return final_out_message, out_file_paths, out_file_paths, latest_file_completed, log_files_output_paths, log_files_output_paths, actual_time_taken_number
 
     file_path_loop = [file_paths[int(latest_file_completed)]]
 
-    for anon_file in progress.tqdm(file_path_loop, desc="Anonymising files", unit = "file"):
+    for anon_file in progress.tqdm(file_path_loop, desc="Anonymising files", unit = "files"):
 
         if anon_file=='open_text':
             anon_df = pd.DataFrame(data={'text':[in_text]})
             chosen_cols=['text']
+            out_file_part = anon_file
             sheet_name = ""
             file_type = ""
-            out_file_part = anon_file
 
             out_file_paths, out_message, key_string, log_files_output_paths = anon_wrapper_func(anon_file, anon_df, chosen_cols, out_file_paths, out_file_part, out_message, sheet_name, anon_strat, language, chosen_redact_entities, in_allow_list, file_type, "", log_files_output_paths, in_deny_list, max_fuzzy_spelling_mistakes_num, pii_identification_method, chosen_redact_comprehend_entities, comprehend_query_number, comprehend_client, output_folder=OUTPUT_FOLDER)
         else:
@@ -350,26 +361,22 @@ def anonymise_data_files(file_paths: List[str],
                     out_message.append("No Excel sheets selected. Please select at least one to anonymise.")
                     continue
 
-                anon_xlsx = pd.ExcelFile(anon_file)
-
                 # Create xlsx file:
-                anon_xlsx_export_file_name = output_folder + out_file_part + "_redacted.xlsx"
-
-                from openpyxl import Workbook
-
-                wb = Workbook()
-                wb.save(anon_xlsx_export_file_name)
+                anon_xlsx = pd.ExcelFile(anon_file)
+                anon_xlsx_export_file_name = output_folder + out_file_part + "_redacted.xlsx"
 
                 # Iterate through the sheet names
-                for sheet_name in in_excel_sheets:
+                for sheet_name in progress.tqdm(in_excel_sheets, desc="Anonymising sheets", unit = "sheets"):
                     # Read each sheet into a DataFrame
                     if sheet_name not in anon_xlsx.sheet_names:
                         continue
 
                     anon_df = pd.read_excel(anon_file, sheet_name=sheet_name)
 
-                    out_file_paths, out_message, key_string, log_files_output_paths = anon_wrapper_func(anon_file, anon_df, chosen_cols, out_file_paths, out_file_part, out_message, sheet_name, anon_strat, language, chosen_redact_entities, in_allow_list, file_type, "", log_files_output_paths, in_deny_list, max_fuzzy_spelling_mistakes_num, pii_identification_method, chosen_redact_comprehend_entities, comprehend_query_number, comprehend_client, output_folder=output_folder)
-
+                    out_file_paths, out_message, key_string, log_files_output_paths = anon_wrapper_func(anon_file, anon_df, chosen_cols, out_file_paths, out_file_part, out_message, sheet_name, anon_strat, language, chosen_redact_entities, in_allow_list, file_type, anon_xlsx_export_file_name, log_files_output_paths, in_deny_list, max_fuzzy_spelling_mistakes_num, pii_identification_method, chosen_redact_comprehend_entities, comprehend_query_number, comprehend_client, output_folder=output_folder)
+
             else:
                 sheet_name = ""
                 anon_df = read_file(anon_file)
@@ -380,23 +387,28 @@ def anonymise_data_files(file_paths: List[str],
     # Increase latest file completed count unless we are at the last file
     if latest_file_completed != len(file_paths):
         print("Completed file number:", str(latest_file_completed))
-        latest_file_completed += 1
+        latest_file_completed += 1
 
     toc = time.perf_counter()
-    out_time = f"in {toc - tic:0.1f} seconds."
-    print(out_time)
-
-    if anon_strat == "encrypt":
-        out_message.append(". Your decryption key is " + key_string + ".")
+    out_time_float = toc - tic
+    out_time = f"in {out_time_float:0.1f} seconds."
+    print(out_time)
+
+    actual_time_taken_number += out_time_float
 
     out_message.append("Anonymisation of file '" + out_file_part + "' successfully completed in")
 
     out_message_out = '\n'.join(out_message)
     out_message_out = out_message_out + " " + out_time
 
+    if anon_strat == "encrypt":
+        out_message_out.append(". Your decryption key is " + key_string)
+
     out_message_out = out_message_out + "\n\nGo to to the Redaction settings tab to see redaction logs. Please give feedback on the results below to help improve this app."
+
+    out_message_out = re.sub(r'^\n+|^\. ', '', out_message_out).strip()
 
-    return out_message_out, out_file_paths, out_file_paths, latest_file_completed, log_files_output_paths, log_files_output_paths
+    return out_message_out, out_file_paths, out_file_paths, latest_file_completed, log_files_output_paths, log_files_output_paths, actual_time_taken_number
 
 def anon_wrapper_func(
     anon_file: str,
@@ -495,7 +507,6 @@ def anon_wrapper_func(
     anon_df_out = anon_df_out[all_cols_original_order]
 
     # Export file
-
     # Rename anonymisation strategy for file path naming
     if anon_strat == "replace with 'REDACTED'": anon_strat_txt = "redact_replace"
     elif anon_strat == "replace with <ENTITY_NAME>": anon_strat_txt = "redact_entity_type"
@@ -507,8 +518,14 @@ def anon_wrapper_func(
 
         anon_export_file_name = anon_xlsx_export_file_name
 
+        if not os.path.exists(anon_xlsx_export_file_name):
+            wb = Workbook()
+            ws = wb.active # Get the default active sheet
+            ws.title = excel_sheet_name
+            wb.save(anon_xlsx_export_file_name)
+
         # Create a Pandas Excel writer using XlsxWriter as the engine.
-        with pd.ExcelWriter(anon_xlsx_export_file_name, engine='openpyxl', mode='a') as writer:
+        with pd.ExcelWriter(anon_xlsx_export_file_name, engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
             # Write each DataFrame to a different worksheet.
             anon_df_out.to_excel(writer, sheet_name=excel_sheet_name, index=None)
@@ -532,7 +549,7 @@ def anon_wrapper_func(
 
     # Print result text to output text box if just anonymising open text
    if anon_file=='open_text':
-        out_message = [anon_df_out['text'][0]]
+        out_message = ["'" + anon_df_out['text'][0] + "'"]
 
     return out_file_paths, out_message, key_string, log_files_output_paths
 
@@ -551,8 +568,16 @@ def anonymise_script(df:pd.DataFrame, anon_strat:str, language:str, chosen_redac
     # DataFrame to dict
     df_dict = df.to_dict(orient="list")
 
-    if in_allow_list:
-        in_allow_list_flat = in_allow_list #[item for sublist in in_allow_list for item in sublist]
+    if isinstance(in_allow_list, list):
+        if in_allow_list:
+            in_allow_list_flat = in_allow_list
+        else:
+            in_allow_list_flat = []
+    elif isinstance(in_allow_list, pd.DataFrame):
+        if not in_allow_list.empty:
+            in_allow_list_flat = list(in_allow_list.iloc[:, 0].unique())
+        else:
+            in_allow_list_flat = []
     else:
         in_allow_list_flat = []
@@ -577,11 +602,8 @@ def anonymise_script(df:pd.DataFrame, anon_strat:str, language:str, chosen_redac
 
     #analyzer = nlp_analyser #AnalyzerEngine()
     batch_analyzer = BatchAnalyzerEngine(analyzer_engine=nlp_analyser)
-
     anonymizer = AnonymizerEngine()#conflict_resolution=ConflictResolutionStrategy.MERGE_SIMILAR_OR_CONTAINED)
-
-    batch_anonymizer = BatchAnonymizerEngine(anonymizer_engine = anonymizer)
-
+    batch_anonymizer = BatchAnonymizerEngine(anonymizer_engine = anonymizer)
 
     analyzer_results = []
 
     if pii_identification_method == "Local":
@@ -692,12 +714,6 @@ def anonymise_script(df:pd.DataFrame, anon_strat:str, language:str, chosen_redac
     analyse_time_out = f"Analysing the text took {analyse_toc - analyse_tic:0.1f} seconds."
     print(analyse_time_out)
 
-    # Create faker function (note that it has to receive a value)
-    #fake = Faker("en_UK")
-
-    #def fake_first_name(x):
-    #    return fake.first_name()
-
     # Set up the anonymization configuration WITHOUT DATE_TIME
     simple_replace_config = eval('{"DEFAULT": OperatorConfig("replace", {"new_value": "REDACTED"})}')
     replace_config = eval('{"DEFAULT": OperatorConfig("replace")}')
@@ -714,9 +730,13 @@ def anonymise_script(df:pd.DataFrame, anon_strat:str, language:str, chosen_redac
     if anon_strat == "mask": chosen_mask_config = mask_config
     if anon_strat == "encrypt":
         chosen_mask_config = people_encrypt_config
-        # Generate a 128-bit AES key. Then encode the key using base64 to get a string representation
-        key = secrets.token_bytes(16) # 128 bits = 16 bytes
+        key = secrets.token_bytes(16) # 128 bits = 16 bytes
         key_string = base64.b64encode(key).decode('utf-8')
+
+        # Now inject the key into the operator config
+        for entity, operator in chosen_mask_config.items():
+            if operator.operator_name == "encrypt":
+                operator.params = {"key": key_string}
     elif anon_strat == "fake_first_name": chosen_mask_config = fake_first_name_config
 
     # I think in general people will want to keep date / times - removed Mar 2025 as I don't want to assume for people.
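Both allow-list hunks in this file replace a bare `if in_allow_list:` truthiness test, which is ambiguous when a pandas DataFrame is passed in (pandas raises `ValueError` for it), with explicit type handling. A standalone sketch of the same normalisation (`flatten_allow_list` is an illustrative name, not a function in the repo):

```python
import pandas as pd

def flatten_allow_list(in_allow_list):
    """Normalise an allow list that may arrive as a plain list or as a
    DataFrame (e.g. from a CSV upload); returns a flat list of terms."""
    if isinstance(in_allow_list, list):
        return in_allow_list if in_allow_list else []
    if isinstance(in_allow_list, pd.DataFrame):
        if not in_allow_list.empty:
            # First column holds the terms; unique() keeps first-appearance order
            return list(in_allow_list.iloc[:, 0].unique())
        return []
    return []

print(flatten_allow_list(pd.DataFrame({"term": ["Alice", "Bob", "Alice"]})))
```

Centralising this in one helper would avoid the duplication between `anonymise_data_files` and `anonymise_script` that the diff above shows.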
tools/file_conversion.py CHANGED
@@ -21,6 +21,7 @@ from PIL import Image
21
  from scipy.spatial import cKDTree
22
  import random
23
  import string
 
24
 
25
  IMAGE_NUM_REGEX = re.compile(r'_(\d+)\.png$')
26
 
@@ -461,7 +462,8 @@ def prepare_image_or_pdf(
461
  input_folder:str=INPUT_FOLDER,
462
  prepare_images:bool=True,
463
  page_sizes:list[dict]=[],
464
- textract_output_found:bool = False,
 
465
  progress: Progress = Progress(track_tqdm=True)
466
  ) -> tuple[List[str], List[str]]:
467
  """
@@ -483,7 +485,8 @@ def prepare_image_or_pdf(
483
  output_folder (optional, str): The output folder for file save
484
  prepare_images (optional, bool): A boolean indicating whether to create images for each PDF page. Defaults to True.
485
  page_sizes(optional, List[dict]): A list of dicts containing information about page sizes in various formats.
486
- textract_output_found (optional, bool): A boolean indicating whether textract output has already been found . Defaults to False.
 
487
  progress (optional, Progress): Progress tracker for the operation
488
 
489
 
@@ -535,7 +538,7 @@ def prepare_image_or_pdf(
535
  final_out_message = '\n'.join(out_message)
536
  else:
537
  final_out_message = out_message
538
- return final_out_message, converted_file_paths, image_file_paths, number_of_pages, number_of_pages, pymupdf_doc, all_annotations_object, review_file_csv, original_cropboxes, page_sizes, textract_output_found, all_img_details, all_line_level_ocr_results_df
539
 
540
  progress(0.1, desc='Preparing file')
541
 
@@ -617,11 +620,10 @@ def prepare_image_or_pdf(
617
 
618
  elif file_extension in ['.csv']:
619
  if '_review_file' in file_path_without_ext:
620
- #print("file_path:", file_path)
621
  review_file_csv = read_file(file_path)
622
  all_annotations_object = convert_review_df_to_annotation_json(review_file_csv, image_file_paths, page_sizes)
623
  json_from_csv = True
624
- print("Converted CSV review file to image annotation object")
625
  elif '_ocr_output' in file_path_without_ext:
626
  all_line_level_ocr_results_df = read_file(file_path)
627
  json_from_csv = False
@@ -639,8 +641,8 @@ def prepare_image_or_pdf(
639
  # Assuming file_path is a NamedString or similar
640
  all_annotations_object = json.loads(file_path) # Use loads for string content
641
 
642
- # Assume it's a textract json
643
- elif (file_extension in ['.json']) and (prepare_for_review != True):
644
  print("Saving Textract output")
645
  # Copy it to the output folder so it can be used later.
646
  output_textract_json_file_name = file_path_without_ext
@@ -654,6 +656,20 @@ def prepare_image_or_pdf(
654
  textract_output_found = True
655
  continue
656
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
657
  # NEW IF STATEMENT
658
  # If you have an annotations object from the above code
659
  if all_annotations_object:
@@ -773,7 +789,40 @@ def prepare_image_or_pdf(
773
 
774
  number_of_pages = len(page_sizes)#len(image_file_paths)
775
 
776
- return combined_out_message, converted_file_paths, image_file_paths, number_of_pages, number_of_pages, pymupdf_doc, all_annotations_object, review_file_csv, original_cropboxes, page_sizes, textract_output_found, all_img_details, all_line_level_ocr_results_df
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
777
 
778
  def convert_text_pdf_to_img_pdf(in_file_path:str, out_text_file_path:List[str], image_dpi:float=image_dpi, output_folder:str=OUTPUT_FOLDER, input_folder:str=INPUT_FOLDER):
779
  file_path_without_ext = get_file_name_without_type(in_file_path)
@@ -850,121 +899,246 @@ def remove_duplicate_images_with_blank_boxes(data: List[dict]) -> List[dict]:
850
 
851
  return result
852
 
853
- def divide_coordinates_by_page_sizes(review_file_df:pd.DataFrame, page_sizes_df:pd.DataFrame, xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax"):
854
-
855
- '''Convert data to same coordinate system. If all coordinates all greater than one, this is a absolute image coordinates - change back to relative coordinates.'''
856
-
857
-    review_file_df_out = review_file_df
-
-    if xmin in review_file_df.columns and not review_file_df.empty:
-        coord_cols = [xmin, xmax, ymin, ymax]
-        for col in coord_cols:
-            review_file_df.loc[:, col] = pd.to_numeric(review_file_df[col], errors="coerce")
-
-        review_file_df_orig = review_file_df.copy().loc[(review_file_df[xmin] <= 1) & (review_file_df[xmax] <= 1) & (review_file_df[ymin] <= 1) & (review_file_df[ymax] <= 1), :]
-
-        #print("review_file_df_orig:", review_file_df_orig)
-
-        review_file_df_div = review_file_df.loc[(review_file_df[xmin] > 1) & (review_file_df[xmax] > 1) & (review_file_df[ymin] > 1) & (review_file_df[ymax] > 1), :]
-
-        #print("review_file_df_div:", review_file_df_div)
-
-        review_file_df_div.loc[:, "page"] = pd.to_numeric(review_file_df_div["page"], errors="coerce")
-
-        if "image_width" not in review_file_df_div.columns and not page_sizes_df.empty:
-            page_sizes_df["image_width"] = page_sizes_df["image_width"].replace("<NA>", pd.NA)
-            page_sizes_df["image_height"] = page_sizes_df["image_height"].replace("<NA>", pd.NA)
-            review_file_df_div = review_file_df_div.merge(page_sizes_df[["page", "image_width", "image_height", "mediabox_width", "mediabox_height"]], on="page", how="left")
-
-        if "image_width" in review_file_df_div.columns:
-            if review_file_df_div["image_width"].isna().all():  # If all values are NaN, assume only mediabox coordinates are available
-                review_file_df_div["image_width"] = review_file_df_div["image_width"].fillna(review_file_df_div["mediabox_width"]).infer_objects()
-                review_file_df_div["image_height"] = review_file_df_div["image_height"].fillna(review_file_df_div["mediabox_height"]).infer_objects()
-
-        convert_type_cols = ["image_width", "image_height", xmin, xmax, ymin, ymax]
-        review_file_df_div[convert_type_cols] = review_file_df_div[convert_type_cols].apply(pd.to_numeric, errors="coerce")
-
-        review_file_df_div[xmin] = review_file_df_div[xmin] / review_file_df_div["image_width"]
-        review_file_df_div[xmax] = review_file_df_div[xmax] / review_file_df_div["image_width"]
-        review_file_df_div[ymin] = review_file_df_div[ymin] / review_file_df_div["image_height"]
-        review_file_df_div[ymax] = review_file_df_div[ymax] / review_file_df_div["image_height"]
-
-        # Concatenate the original and modified DataFrames
-        dfs_to_concat = [df for df in [review_file_df_orig, review_file_df_div] if not df.empty]
-        if dfs_to_concat:  # Ensure there's at least one non-empty DataFrame
-            review_file_df_out = pd.concat(dfs_to_concat)
     else:
-        review_file_df_out = review_file_df  # Return the original DataFrame instead of raising an error
-
-    # Only sort if the DataFrame is not empty and contains the required columns
-    required_sort_columns = {"page", xmin, ymin}
-    if not review_file_df_out.empty and required_sort_columns.issubset(review_file_df_out.columns):
-        review_file_df_out.sort_values(["page", ymin, xmin], inplace=True)
-
-    review_file_df_out.drop(["image_width", "image_height", "mediabox_width", "mediabox_height"], axis=1, errors="ignore")
-
-    return review_file_df_out
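The removed helper above normalises mixed coordinate systems: rows whose four coordinates are all greater than 1 are treated as absolute pixel values and divided by the page dimensions, while rows already in the 0–1 range pass through untouched. A minimal standalone sketch of that idea (the column names match the app's, but the `to_relative` helper itself is hypothetical):

```python
import pandas as pd

def to_relative(df: pd.DataFrame, width: float, height: float) -> pd.DataFrame:
    """Divide absolute pixel coordinates by page size; leave relative rows alone."""
    df = df.copy()
    # A row is treated as absolute only when all four coordinates exceed 1
    absolute = (df["xmin"] > 1) & (df["xmax"] > 1) & (df["ymin"] > 1) & (df["ymax"] > 1)
    df.loc[absolute, ["xmin", "xmax"]] = df.loc[absolute, ["xmin", "xmax"]] / width
    df.loc[absolute, ["ymin", "ymax"]] = df.loc[absolute, ["ymin", "ymax"]] / height
    return df

boxes = pd.DataFrame({
    "xmin": [100.0, 0.2], "xmax": [300.0, 0.4],
    "ymin": [50.0, 0.1], "ymax": [150.0, 0.3],
})
rel = to_relative(boxes, width=1000, height=500)
```

The second row is already relative, so it survives the call unchanged.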
-def multiply_coordinates_by_page_sizes(review_file_df: pd.DataFrame, page_sizes_df: pd.DataFrame, xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax"):
-
-    if xmin in review_file_df.columns and not review_file_df.empty:
-        coord_cols = [xmin, xmax, ymin, ymax]
-        for col in coord_cols:
-            review_file_df.loc[:, col] = pd.to_numeric(review_file_df[col], errors="coerce")
-
-        # Separate absolute vs relative coordinates
-        review_file_df_orig = review_file_df.loc[
-            (review_file_df[xmin] > 1) & (review_file_df[xmax] > 1) &
-            (review_file_df[ymin] > 1) & (review_file_df[ymax] > 1), :].copy()
-
-        review_file_df = review_file_df.loc[
-            (review_file_df[xmin] <= 1) & (review_file_df[xmax] <= 1) &
-            (review_file_df[ymin] <= 1) & (review_file_df[ymax] <= 1), :].copy()
-
-        if review_file_df.empty:
-            return review_file_df_orig  # If nothing is left, return the original absolute-coordinates DataFrame
-
-        review_file_df.loc[:, "page"] = pd.to_numeric(review_file_df["page"], errors="coerce")
-
-        if "image_width" not in review_file_df.columns and not page_sizes_df.empty:
-            page_sizes_df[['image_width', 'image_height']] = page_sizes_df[['image_width', 'image_height']].replace("<NA>", pd.NA)  # Ensure proper NA handling
-            review_file_df = review_file_df.merge(page_sizes_df, on="page", how="left")
-
-        if "image_width" in review_file_df.columns:
-            # Split into rows with/without image size info
-            review_file_df_not_na = review_file_df.loc[review_file_df["image_width"].notna()].copy()
-            review_file_df_na = review_file_df.loc[review_file_df["image_width"].isna()].copy()
-
-            if not review_file_df_not_na.empty:
-                convert_type_cols = ["image_width", "image_height", xmin, xmax, ymin, ymax]
-                review_file_df_not_na[convert_type_cols] = review_file_df_not_na[convert_type_cols].apply(pd.to_numeric, errors="coerce")
-
-                # Multiply coordinates by image sizes
-                review_file_df_not_na[xmin] *= review_file_df_not_na["image_width"]
-                review_file_df_not_na[xmax] *= review_file_df_not_na["image_width"]
-                review_file_df_not_na[ymin] *= review_file_df_not_na["image_height"]
-                review_file_df_not_na[ymax] *= review_file_df_not_na["image_height"]
-
-            # Concatenate the modified and unmodified data
-            review_file_df = pd.concat([df for df in [review_file_df_not_na, review_file_df_na] if not df.empty])
-
-    # Merge with the original absolute-coordinates DataFrame
-    dfs_to_concat = [df for df in [review_file_df_orig, review_file_df] if not df.empty]
-    if dfs_to_concat:  # Ensure there's at least one non-empty DataFrame
-        review_file_df = pd.concat(dfs_to_concat)
-    else:
-        review_file_df = pd.DataFrame()  # Return an empty DataFrame instead of raising an error
-
-    # Only sort if the DataFrame is not empty and contains the required columns
-    required_sort_columns = {"page", "xmin", "ymin"}
-    if not review_file_df.empty and required_sort_columns.issubset(review_file_df.columns):
-        review_file_df.sort_values(["page", "xmin", "ymin"], inplace=True)
-
-    return review_file_df
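This removed function is the inverse of the coordinate division above: relative 0–1 coordinates are scaled back up by per-page image sizes merged in on the `page` key. A hedged sketch of that merge-then-multiply step (frame and column names follow the app's conventions, but the `to_absolute` helper is illustrative):

```python
import pandas as pd

def to_absolute(df: pd.DataFrame, page_sizes: pd.DataFrame) -> pd.DataFrame:
    """Merge per-page dimensions, then scale relative coordinates to pixels."""
    df = df.merge(page_sizes, on="page", how="left")
    df["xmin"] *= df["image_width"]
    df["xmax"] *= df["image_width"]
    df["ymin"] *= df["image_height"]
    df["ymax"] *= df["image_height"]
    # Drop the helper columns once the scaling is done
    return df.drop(columns=["image_width", "image_height"])

boxes = pd.DataFrame({"page": [1], "xmin": [0.1], "xmax": [0.3], "ymin": [0.1], "ymax": [0.3]})
sizes = pd.DataFrame({"page": [1], "image_width": [1000], "image_height": [500]})
abs_boxes = to_absolute(boxes, sizes)
```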
 def do_proximity_match_by_page_for_text(df1:pd.DataFrame, df2:pd.DataFrame):
     '''
@@ -1018,7 +1192,6 @@ def do_proximity_match_by_page_for_text(df1:pd.DataFrame, df2:pd.DataFrame):
     return merged_df
 
-
 def do_proximity_match_all_pages_for_text(df1:pd.DataFrame, df2:pd.DataFrame, threshold:float=0.03):
     '''
     Match text from one dataframe to another based on proximity matching of coordinates across all pages.
@@ -1142,12 +1315,12 @@ def convert_annotation_data_to_dataframe(all_annotations: List[Dict[str, Any]]):
     # prevents this from being necessary.
 
     # 7. Ensure essential columns exist and set column order
-    essential_box_cols = ["xmin", "xmax", "ymin", "ymax", "text", "id"]
     for col in essential_box_cols:
         if col not in final_df.columns:
             final_df[col] = pd.NA  # Add column with NA if it wasn't present in any box
 
-    base_cols = ["image", "page"]
     extra_box_cols = [col for col in final_df.columns if col not in base_cols and col not in essential_box_cols]
     final_col_order = base_cols + essential_box_cols + sorted(extra_box_cols)
 
@@ -1156,6 +1329,8 @@ def convert_annotation_data_to_dataframe(all_annotations: List[Dict[str, Any]]):
     # but it's good practice if columns could be missing for other reasons.
     final_df = final_df.reindex(columns=final_col_order, fill_value=pd.NA)
 
     return final_df
 
 def create_annotation_dicts_from_annotation_df(
@@ -1185,7 +1360,8 @@ def create_annotation_dicts_from_annotation_df(
     available_cols = [col for col in box_cols if col in all_image_annotations_df.columns]
 
     if 'text' in all_image_annotations_df.columns:
-        all_image_annotations_df.loc[all_image_annotations_df['text'].isnull(), 'text'] = ''
 
     if not available_cols:
         print(f"Warning: None of the expected box columns ({box_cols}) found in DataFrame.")
@@ -1226,85 +1402,84 @@ def create_annotation_dicts_from_annotation_df(
 
     return result
 
-def convert_annotation_json_to_review_df(all_annotations: List[dict],
-                                         redaction_decision_output: pd.DataFrame = pd.DataFrame(),
-                                         page_sizes: List[dict] = [],
-                                         do_proximity_match: bool = True) -> pd.DataFrame:
     '''
     Convert the annotation json data to a dataframe format.
     Add on any text from the initial review_file dataframe by joining based on 'id' if available
     in both sources, otherwise falling back to joining on pages/co-ordinates (if option selected).
     '''
 
     # 1. Convert annotations to DataFrame
-    # Ensure convert_annotation_data_to_dataframe populates the 'id' column
-    # if 'id' exists in the dictionaries within all_annotations.
-
     review_file_df = convert_annotation_data_to_dataframe(all_annotations)
 
-    # Only keep rows in review_df where there are coordinates
-    review_file_df.dropna(subset='xmin', axis=0, inplace=True)
 
     # Exit early if the initial conversion results in an empty DataFrame
     if review_file_df.empty:
         # Define standard columns for an empty return DataFrame
-        check_columns = ["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax", "text", "id"]
-        # Ensure 'id' is included if it might have been expected
-        return pd.DataFrame(columns=[col for col in check_columns if col != 'id' or 'id' in review_file_df.columns])
-
-    # 2. Handle page sizes if provided
-    if not page_sizes:
-        page_sizes_df = pd.DataFrame(page_sizes)  # Ensure it's a DataFrame
-        # Safely convert page column to numeric
-        page_sizes_df["page"] = pd.to_numeric(page_sizes_df["page"], errors="coerce")
-        page_sizes_df.dropna(subset=["page"], inplace=True)  # Drop rows where conversion failed
-        page_sizes_df["page"] = page_sizes_df["page"].astype(int)  # Convert to int after handling errors/NaNs
 
-    # Apply coordinate division if page_sizes_df is not empty after processing
     if not page_sizes_df.empty:
-        # Ensure 'page' column in review_file_df is numeric for merging
-        if 'page' in review_file_df.columns:
-            review_file_df['page'] = pd.to_numeric(review_file_df['page'], errors='coerce')
-            # Drop rows with invalid pages before division
-            review_file_df.dropna(subset=['page'], inplace=True)
-            review_file_df['page'] = review_file_df['page'].astype(int)
-            review_file_df = divide_coordinates_by_page_sizes(review_file_df, page_sizes_df)
-
-        print("review_file_df after coord divide:", review_file_df)
-
-        # Also apply to redaction_decision_output if it's not empty and has page numbers
-        if not redaction_decision_output.empty and 'page' in redaction_decision_output.columns:
-            redaction_decision_output['page'] = pd.to_numeric(redaction_decision_output['page'], errors='coerce')
-            # Drop rows with invalid pages before division
-            redaction_decision_output.dropna(subset=['page'], inplace=True)
-            redaction_decision_output['page'] = redaction_decision_output['page'].astype(int)
-            redaction_decision_output = divide_coordinates_by_page_sizes(redaction_decision_output, page_sizes_df)
-
-            print("redaction_decision_output after coord divide:", redaction_decision_output)
-    else:
-        print("Warning: Page sizes DataFrame became empty after processing, skipping coordinate division.")
 
     # 3. Join additional data from redaction_decision_output if provided
     if not redaction_decision_output.empty:
-        # --- NEW LOGIC: Prioritize joining by 'id' ---
-        id_col_exists_in_review = 'id' in review_file_df.columns
-        id_col_exists_in_redaction = 'id' in redaction_decision_output.columns
-        joined_by_id = False  # Flag to track if ID join was successful
 
         if id_col_exists_in_review and id_col_exists_in_redaction:
             #print("Attempting to join data based on 'id' column.")
             try:
-                # Ensure 'id' columns are of compatible types (e.g., string) to avoid merge errors
                 review_file_df['id'] = review_file_df['id'].astype(str)
-                # Make a copy to avoid SettingWithCopyWarning if redaction_decision_output is used elsewhere
                 redaction_copy = redaction_decision_output.copy()
                 redaction_copy['id'] = redaction_copy['id'].astype(str)
 
-                # Select columns to merge from redaction output.
-                # Primarily interested in 'text', but keep 'id' for the merge key.
-                # Add other columns from redaction_copy if needed.
                 cols_to_merge = ['id']
                 if 'text' in redaction_copy.columns:
                     cols_to_merge.append('text')
@@ -1312,82 +1487,130 @@ def convert_annotation_json_to_review_df(all_annotations: List[dict],
                     print("Warning: 'text' column not found in redaction_decision_output. Cannot merge text using 'id'.")
 
                 # Perform a left merge to keep all annotations and add matching text
-                # Suffixes prevent collision if 'text' already exists and we want to compare/choose
-                original_cols = review_file_df.columns.tolist()
                 merged_df = pd.merge(
                     review_file_df,
                     redaction_copy[cols_to_merge],
                     on='id',
                     how='left',
-                    suffixes=('', '_redaction')  # Suffix applied to columns from right df if names clash
                 )
 
-                # Update the original 'text' column. Prioritize text from redaction output.
-                # If redaction output had 'text', a 'text_redaction' column now exists.
-                if 'text_redaction' in merged_df.columns:
-                    if 'text' not in merged_df.columns:  # If review_file_df didn't have text initially
-                        merged_df['text'] = merged_df['text_redaction']
-                    else:
-                        # Use text from redaction where available, otherwise keep original text
-                        merged_df['text'] = merged_df['text_redaction'].combine_first(merged_df['text'])
-
-                    # Remove the temporary column
-                    merged_df = merged_df.drop(columns=['text_redaction'])
-
-                # Ensure final columns match original expectation + potentially new 'text'
-                final_cols = original_cols
-                if 'text' not in final_cols and 'text' in merged_df.columns:
-                    final_cols.append('text')  # Make sure text column is kept if newly added
-                # Reorder/select columns if necessary, ensuring 'id' is kept
-                review_file_df = merged_df[[col for col in final_cols if col in merged_df.columns] + (['id'] if 'id' not in final_cols else [])]
-
-                #print("Successfully joined data using 'id'.")
-                joined_by_id = True
 
             except Exception as e:
-                print(f"Error during 'id'-based merge: {e}. Falling back to proximity match if enabled.")
-                # Fall through to proximity match below if an error occurred
-
-        # --- Fallback to proximity match ---
-        if not joined_by_id and do_proximity_match:
-            if not id_col_exists_in_review or not id_col_exists_in_redaction:
-                print("Could not join by 'id' (column missing in one or both sources).")
-            print("Performing proximity match to add text data.")
-            # Match text to review file using proximity
-            review_file_df = do_proximity_match_all_pages_for_text(df1=review_file_df.copy(), df2=redaction_decision_output.copy())
-        elif not joined_by_id and not do_proximity_match:
-            print("Skipping joining text data (ID join not possible, proximity match disabled).")
-        # --- End of join logic ---
-
-    # 4. Ensure required columns exist, filling with blank if they don't
-    # Define base required columns, 'id' might or might not be present initially
-    required_columns = ["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax", "text"]
-    # Add 'id' to required list if it exists in the dataframe at this point
     if 'id' in review_file_df.columns:
-        required_columns.append('id')
 
-    for col in required_columns:
         if col not in review_file_df.columns:
-            # Decide default value based on column type (e.g., '' for text, np.nan for numeric?)
-            # Using '' for simplicity here.
-            review_file_df[col] = ''
 
     # Select and order the final set of columns
-    review_file_df = review_file_df[required_columns]
 
     # 5. Final processing and sorting
-    # If colours are saved as list, convert to tuple
     if 'color' in review_file_df.columns:
-        review_file_df["color"] = review_file_df["color"].apply(lambda x: tuple(x) if isinstance(x, list) else x)
 
     # Sort the results
-    sort_columns = ['page', 'ymin', 'xmin', 'label']
     # Ensure sort columns exist before sorting
     valid_sort_columns = [col for col in sort_columns if col in review_file_df.columns]
-    if valid_sort_columns:
-        review_file_df = review_file_df.sort_values(valid_sort_columns)
 
     return review_file_df
 
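The id-based join in the hunk above rests on a standard pandas pattern: a left merge with a suffix on the clashing column, then `combine_first` so text from the redaction output wins where present. A small illustration (the frames here are made up, not the app's actual data):

```python
import pandas as pd

# Annotations may lack text; redaction output supplies it for matching ids
annotations = pd.DataFrame({"id": ["a", "b"], "text": [None, "keep me"]})
redactions = pd.DataFrame({"id": ["a"], "text": ["found name"]})

# Left merge keeps every annotation; the clashing column gets a suffix
merged = annotations.merge(redactions, on="id", how="left", suffixes=("", "_redaction"))
# Prefer redaction text where available, otherwise keep the original
merged["text"] = merged["text_redaction"].combine_first(merged["text"])
merged = merged.drop(columns=["text_redaction"])
```

`combine_first` fills the nulls of its argument with the caller's non-null values, which is exactly the "prioritise redaction text" behaviour described in the removed comments.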
@@ -1472,20 +1695,18 @@ def fill_missing_box_ids(data_input: dict) -> dict:
 
 def fill_missing_ids(df: pd.DataFrame, column_name: str = 'id', length: int = 12) -> pd.DataFrame:
     """
-    Generates unique alphanumeric IDs for rows in a DataFrame column
-    where the value is missing (NaN, None) or an empty string.
 
     Args:
         df (pd.DataFrame): The input Pandas DataFrame.
         column_name (str): The name of the column to check and fill (defaults to 'id').
                            This column will be added if it doesn't exist.
         length (int): The desired length of the generated IDs (defaults to 12).
-                      Cannot exceed the limits that guarantee uniqueness based
-                      on the number of IDs needed and character set size.
 
     Returns:
         pd.DataFrame: The DataFrame with missing/empty IDs filled in the specified column.
-                      Note: The function modifies the DataFrame in place.
     """
 
     # --- Input Validation ---
@@ -1497,43 +1718,59 @@ def fill_missing_ids(df: pd.DataFrame, column_name: str = 'id', length: int = 12
         raise ValueError("'length' must be a positive integer.")
 
     # --- Ensure Column Exists ---
     if column_name not in df.columns:
         print(f"Column '{column_name}' not found. Adding it to the DataFrame.")
-        df[column_name] = np.nan  # Initialize with NaN
 
     # --- Identify Rows Needing IDs ---
-    # Check for NaN, None, or empty strings ('')
-    # Convert to string temporarily for robust empty string check, handle potential errors
-    try:
-        df[column_name] = df[column_name].astype(str)  # handles NaN/None conversion
-        is_missing_or_empty = (
-            df[column_name].isna()
-            #| (df[column_name].astype(str).str.strip() == '')
-            #| (df[column_name] == "nan")
-            | (df[column_name].astype(str).str.len() != length)
-        )
-    except Exception as e:
-        # Fallback if conversion to string fails (e.g., column contains complex objects)
-        print(f"Warning: Could not perform reliable empty string check on column '{column_name}' due to data type issues. Checking for NaN/None only. Error: {e}")
-        is_missing_or_empty = df[column_name].isna()
 
     rows_to_fill_index = df.index[is_missing_or_empty]
     num_needed = len(rows_to_fill_index)
 
     if num_needed == 0:
-        #print(f"No missing or empty values found in column '{column_name}'.")
         return df
 
     print(f"Found {num_needed} rows requiring a unique ID in column '{column_name}'.")
 
     # --- Get Existing IDs to Ensure Uniqueness ---
-    try:
-        # Get all non-missing, non-empty string values from the column
-        existing_ids = set(df.loc[~is_missing_or_empty, column_name].astype(str))
-    except Exception as e:
-        print(f"Warning: Could not reliably get all existing string IDs from column '{column_name}' due to data type issues. Uniqueness check might be less strict. Error: {e}")
-        # Fallback: Get only non-NaN IDs, potential type issues ignored
-        existing_ids = set(df.loc[df[column_name].notna(), column_name])
 
     # --- Generate Unique IDs ---
  # --- Generate Unique IDs ---
@@ -1543,93 +1780,232 @@ def fill_missing_ids(df: pd.DataFrame, column_name: str = 'id', length: int = 12
1543
 
1544
  max_possible_ids = len(character_set) ** length
1545
  if num_needed > max_possible_ids:
1546
- raise ValueError(f"Cannot generate {num_needed} unique IDs with length {length}. Maximum possible is {max_possible_ids}.")
1547
- # Add a check for practical limits if needed, e.g., if num_needed is very close to max_possible_ids, generation could be slow.
 
 
1548
 
1549
  #print(f"Generating {num_needed} unique IDs of length {length}...")
1550
  for i in range(num_needed):
1551
  attempts = 0
1552
  while True:
1553
  candidate_id = ''.join(random.choices(character_set, k=length))
1554
- # Check against *all* existing IDs and *newly* generated ones
1555
  if candidate_id not in existing_ids and candidate_id not in generated_ids_set:
1556
  generated_ids_set.add(candidate_id)
1557
  new_ids_list.append(candidate_id)
1558
  break # Found a unique ID
1559
  attempts += 1
1560
- if attempts > num_needed * 100 and attempts > 1000 : # Safety break for unlikely infinite loop
1561
- raise RuntimeError(f"Failed to generate a unique ID after {attempts} attempts. Check length and character set or existing IDs.")
1562
 
1563
- # Optional progress update for large numbers
1564
- if (i + 1) % 1000 == 0:
1565
- print(f"Generated {i+1}/{num_needed} IDs...")
1566
 
1567
 
1568
  # --- Assign New IDs ---
1569
  # Use the previously identified index to assign the new IDs correctly
 
 
 
 
1570
  df.loc[rows_to_fill_index, column_name] = new_ids_list
1571
- #print(f"Successfully filled {len(new_ids_list)} missing values in column '{column_name}'.")
 
 
 
1572
 
1573
- # The DataFrame 'df' has been modified in place
1574
  return df
1575
 
1576
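The ID-generation loop shown in the hunks above can be condensed into a small self-contained sketch: generate fixed-length alphanumeric candidates and reject any that collide with existing IDs or with earlier candidates. The `fill_ids` function below is illustrative, not the app's implementation:

```python
import random
import string

import pandas as pd

def fill_ids(df: pd.DataFrame, col: str = "id", length: int = 12) -> pd.DataFrame:
    """Fill missing or wrong-length values in `col` with unique random IDs."""
    charset = string.ascii_letters + string.digits
    existing = set(df[col].dropna().astype(str))
    # A value needs replacing if it is null or not exactly `length` characters
    needs_id = df[col].isna() | (df[col].astype(str).str.len() != length)
    for idx in df.index[needs_id]:
        while True:
            candidate = "".join(random.choices(charset, k=length))
            if candidate not in existing:  # reject collisions, then record the new ID
                existing.add(candidate)
                df.loc[idx, col] = candidate
                break
    return df

df = pd.DataFrame({"id": [None, "abcdefghijkl", ""]})
df = fill_ids(df)
```

Adding each accepted candidate to `existing` is what guarantees the generated IDs are unique among themselves as well as against the originals.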
-def convert_review_df_to_annotation_json(review_file_df:pd.DataFrame,
-                                         image_paths:List[Image.Image],
-                                         page_sizes:List[dict]=[]) -> List[dict]:
-    '''
-    Convert a review csv to a json file for use by the Gradio Annotation object.
-    '''
-    # Make sure all relevant cols are float
-    float_cols = ["page", "xmin", "xmax", "ymin", "ymax"]
-    for col in float_cols:
-        review_file_df.loc[:, col] = pd.to_numeric(review_file_df.loc[:, col], errors='coerce')
-
-    # Convert relative co-ordinates into image coordinates for the image annotation output object
-    if page_sizes:
-        page_sizes_df = pd.DataFrame(page_sizes)
-        page_sizes_df[["page"]] = page_sizes_df[["page"]].apply(pd.to_numeric, errors="coerce")
-
-        review_file_df = multiply_coordinates_by_page_sizes(review_file_df, page_sizes_df)
-
-    review_file_df = fill_missing_ids(review_file_df)
-
-    if 'id' not in review_file_df.columns:
-        review_file_df['id'] = ''
-    review_file_df['id'] = review_file_df['id'].astype(str)
-
-    # Keep only necessary columns
-    review_file_df = review_file_df[["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax", "id", "text"]].drop_duplicates(subset=["image", "page", "xmin", "ymin", "xmax", "ymax", "label", "id"])
-
-    # If colours are saved as list, convert to tuple
-    review_file_df.loc[:, "color"] = review_file_df.loc[:, "color"].apply(lambda x: tuple(x) if isinstance(x, list) else x)
-
-    # Group the DataFrame by the 'page' column
-    grouped_csv_pages = review_file_df.groupby('page')
-
-    # Create a list to hold the JSON data
-    json_data = []
-
-    for page_no, pdf_image_path in enumerate(page_sizes_df["image_path"]):
-        reported_page_number = int(page_no + 1)
-
-        if reported_page_number in review_file_df["page"].values:
-            # Convert each relevant group to a list of box dictionaries
-            selected_csv_pages = grouped_csv_pages.get_group(reported_page_number)
-            annotation_boxes = selected_csv_pages.drop(columns=['image', 'page']).to_dict(orient='records')
-
-            annotation = {
-                "image": pdf_image_path,
-                "boxes": annotation_boxes
-            }
         else:
-            annotation = {}
-            annotation["image"] = pdf_image_path
-            annotation["boxes"] = []
-
-        # Append the structured data to the json_data list
-        json_data.append(annotation)
-
-    return json_data
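The removed function's core step groups review rows by page and emits one `{"image": ..., "boxes": [...]}` dict per page, with an empty box list for pages that have no redactions. A minimal sketch of that grouping (paths and labels here are invented):

```python
import pandas as pd

# Two boxes on page 1, nothing on page 2
review = pd.DataFrame({
    "image": ["p1.png", "p1.png"],
    "page": [1, 1],
    "label": ["NAME", "EMAIL"],
    "xmin": [0.1, 0.5],
})
grouped = review.groupby("page")

json_data = []
# One annotation dict per page; pages without boxes get an empty list
for page_no, image_path in enumerate(["p1.png", "p2.png"], start=1):
    if page_no in review["page"].values:
        boxes = grouped.get_group(page_no).drop(columns=["image", "page"]).to_dict(orient="records")
    else:
        boxes = []
    json_data.append({"image": image_path, "boxes": boxes})
```

`to_dict(orient="records")` produces exactly the list-of-box-dicts shape the Gradio annotation object expects in this codebase.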
 from scipy.spatial import cKDTree
 import random
 import string
+import warnings  # To warn about potential type changes
 
 IMAGE_NUM_REGEX = re.compile(r'_(\d+)\.png$')
 
     input_folder:str=INPUT_FOLDER,
     prepare_images:bool=True,
     page_sizes:list[dict]=[],
+    textract_output_found:bool = False,
+    local_ocr_output_found:bool = False,
     progress: Progress = Progress(track_tqdm=True)
 ) -> tuple[List[str], List[str]]:
     """
 
         output_folder (optional, str): The output folder for file save
         prepare_images (optional, bool): A boolean indicating whether to create images for each PDF page. Defaults to True.
         page_sizes (optional, List[dict]): A list of dicts containing information about page sizes in various formats.
+        textract_output_found (optional, bool): A boolean indicating whether Textract analysis output has already been found. Defaults to False.
+        local_ocr_output_found (optional, bool): A boolean indicating whether local OCR analysis output has already been found. Defaults to False.
         progress (optional, Progress): Progress tracker for the operation
 
             final_out_message = '\n'.join(out_message)
         else:
             final_out_message = out_message
+        return final_out_message, converted_file_paths, image_file_paths, number_of_pages, number_of_pages, pymupdf_doc, all_annotations_object, review_file_csv, original_cropboxes, page_sizes, textract_output_found, all_img_details, all_line_level_ocr_results_df, local_ocr_output_found
 
     progress(0.1, desc='Preparing file')
 
         elif file_extension in ['.csv']:
             if '_review_file' in file_path_without_ext:
                 review_file_csv = read_file(file_path)
                 all_annotations_object = convert_review_df_to_annotation_json(review_file_csv, image_file_paths, page_sizes)
                 json_from_csv = True
+                #print("Converted CSV review file to image annotation object")
             elif '_ocr_output' in file_path_without_ext:
                 all_line_level_ocr_results_df = read_file(file_path)
                 json_from_csv = False
 
             # Assuming file_path is a NamedString or similar
             all_annotations_object = json.loads(file_path)  # Use loads for string content
 
+        # Save Textract file to folder
+        elif (file_extension in ['.json']) and '_textract' in file_path_without_ext:  #(prepare_for_review != True):
             print("Saving Textract output")
             # Copy it to the output folder so it can be used later.
             output_textract_json_file_name = file_path_without_ext
 
             textract_output_found = True
             continue
 
+        elif (file_extension in ['.json']) and '_ocr_results_with_words' in file_path_without_ext:  #(prepare_for_review != True):
+            print("Saving local OCR output")
+            # Copy it to the output folder so it can be used later.
+            if not file_path.endswith("_ocr_results_with_words.json"):
+                output_ocr_results_with_words_json_file_name = file_path_without_ext + "_ocr_results_with_words.json"
+            else:
+                output_ocr_results_with_words_json_file_name = file_path_without_ext + ".json"
+
+            out_ocr_results_with_words_path = os.path.join(output_folder, output_ocr_results_with_words_json_file_name)
+
+            # Use shutil to copy the file directly
+            shutil.copy2(file_path, out_ocr_results_with_words_path)  # Preserves metadata
+            local_ocr_output_found = True
+            continue
+
         # NEW IF STATEMENT
         # If you have an annotations object from the above code
         if all_annotations_object:
 
     number_of_pages = len(page_sizes)  # len(image_file_paths)
 
+    return combined_out_message, converted_file_paths, image_file_paths, number_of_pages, number_of_pages, pymupdf_doc, all_annotations_object, review_file_csv, original_cropboxes, page_sizes, textract_output_found, all_img_details, all_line_level_ocr_results_df, local_ocr_output_found
+
+def load_and_convert_ocr_results_with_words_json(ocr_results_with_words_json_file_path:str, log_files_output_paths:str, page_sizes_df:pd.DataFrame):
+    """
+    Loads OCR-results-with-words JSON from a file, detects if conversion is needed, and converts if necessary.
+    """
+
+    if not os.path.exists(ocr_results_with_words_json_file_path):
+        print("No existing OCR results file found.")
+        return [], True, log_files_output_paths  # Return empty data and a flag indicating the file is missing
+
+    no_ocr_results_with_words_file = False
+    print("Found existing OCR results JSON file.")
+
+    # Track log files
+    if ocr_results_with_words_json_file_path not in log_files_output_paths:
+        log_files_output_paths.append(ocr_results_with_words_json_file_path)
+
+    try:
+        with open(ocr_results_with_words_json_file_path, 'r', encoding='utf-8') as json_file:
+            ocr_results_with_words_data = json.load(json_file)
+    except json.JSONDecodeError:
+        print("Error: Failed to parse OCR results JSON file. Returning empty data.")
+        return [], True, log_files_output_paths  # Indicate failure
+
+    # Check if conversion is needed (each key must be checked separately)
+    if "page" in ocr_results_with_words_data[0] and "results" in ocr_results_with_words_data[0]:
+        print("JSON already in the correct format for app. No changes needed.")
+        return ocr_results_with_words_data, False, log_files_output_paths  # No conversion required
+    else:
+        print("Invalid OCR result JSON format: 'page' or 'results' key missing.")
+        return [], True, log_files_output_paths  # Return empty data if JSON is not recognised
 
 def convert_text_pdf_to_img_pdf(in_file_path:str, out_text_file_path:List[str], image_dpi:float=image_dpi, output_folder:str=OUTPUT_FOLDER, input_folder:str=INPUT_FOLDER):
     file_path_without_ext = get_file_name_without_type(in_file_path)
 
     return result
 
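The defensive-loading pattern used for the OCR-results JSON above — missing file, unparseable JSON, and unexpected shape each return empty data instead of raising — can be sketched in isolation like this (the `load_ocr_json` helper and file names are illustrative, and note the membership test must check each key separately, since `"page" and "results" in d` only tests `"results"`):

```python
import json
import os
import tempfile

def load_ocr_json(path: str):
    """Return (data, needs_rerun): empty data and True on any failure."""
    if not os.path.exists(path):
        return [], True  # no file on disk
    try:
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
    except json.JSONDecodeError:
        return [], True  # file exists but is not valid JSON
    # Check each expected key separately on the first record
    if data and "page" in data[0] and "results" in data[0]:
        return data, False
    return [], True  # unrecognised shape

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "doc_ocr_results_with_words.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump([{"page": 1, "results": {}}], f)
    data, missing = load_ocr_json(path)
```

Returning a sentinel tuple rather than raising lets the caller decide whether to fall back to re-running OCR.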
+def divide_coordinates_by_page_sizes(
+    review_file_df: pd.DataFrame,
+    page_sizes_df: pd.DataFrame,
+    xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax"
+) -> pd.DataFrame:
+    """
+    Optimised function to convert absolute image coordinates (>1) to relative coordinates (<=1).
+
+    Identifies rows with absolute coordinates, merges page size information,
+    divides coordinates by dimensions, and combines with already-relative rows.
+
+    Args:
+        review_file_df: Input DataFrame with potentially mixed coordinate systems.
+        page_sizes_df: DataFrame with page dimensions ('page', 'image_width',
+                       'image_height', 'mediabox_width', 'mediabox_height').
+        xmin, xmax, ymin, ymax: Names of the coordinate columns.
+
+    Returns:
+        DataFrame with coordinates converted to the relative system, sorted.
+    """
+    if review_file_df.empty or xmin not in review_file_df.columns:
+        return review_file_df  # Return early if empty or key column missing
+
+    # --- Initial Type Conversion ---
+    coord_cols = [xmin, xmax, ymin, ymax]
+    cols_to_convert = coord_cols + ["page"]
+    temp_df = review_file_df.copy()  # Work on a copy initially
+
+    for col in cols_to_convert:
+        if col in temp_df.columns:
+            temp_df[col] = pd.to_numeric(temp_df[col], errors="coerce")
+        else:
+            # If an essential 'page' or coordinate column is missing, we cannot proceed meaningfully
+            if col == 'page' or col in coord_cols:
+                print(f"Warning: Required column '{col}' not found in review_file_df. Returning original DataFrame.")
+                return review_file_df
+
+    # --- Identify Absolute Coordinates ---
+    # Create a mask for rows where *all* coordinates are potentially absolute (> 1).
+    # Handle potential NaNs introduced by to_numeric - treat NaN as not absolute.
+    is_absolute_mask = (
+        (temp_df[xmin] > 1) & (temp_df[xmin].notna()) &
+        (temp_df[xmax] > 1) & (temp_df[xmax].notna()) &
+        (temp_df[ymin] > 1) & (temp_df[ymin].notna()) &
+        (temp_df[ymax] > 1) & (temp_df[ymax].notna())
+    )
+
+    # --- Separate DataFrames ---
+    df_rel = temp_df[~is_absolute_mask]  # Rows already relative or with NaN/mixed coords
+    df_abs = temp_df[is_absolute_mask].copy()  # Absolute rows - COPY here to allow modifications
+
+    # --- Process Absolute Coordinates ---
+    if not df_abs.empty:
+        # Merge page sizes if necessary
+        if "image_width" not in df_abs.columns and not page_sizes_df.empty:
+            ps_df_copy = page_sizes_df.copy()  # Work on a copy of page sizes
+
+            # Ensure page is numeric for merge key matching
+            ps_df_copy['page'] = pd.to_numeric(ps_df_copy['page'], errors='coerce')
+
+            # Columns to merge from page_sizes
+            merge_cols = ['page', 'image_width', 'image_height', 'mediabox_width', 'mediabox_height']
+            available_merge_cols = [col for col in merge_cols if col in ps_df_copy.columns]
+
+            # Prepare dimension columns in the copy
+            for col in ['image_width', 'image_height', 'mediabox_width', 'mediabox_height']:
+                if col in ps_df_copy.columns:
+                    # Replace "<NA>" string if present
+                    if ps_df_copy[col].dtype == 'object':
+                        ps_df_copy[col] = ps_df_copy[col].replace("<NA>", pd.NA)
+                    # Convert to numeric
+                    ps_df_copy[col] = pd.to_numeric(ps_df_copy[col], errors='coerce')
+
+            # Perform the merge
+            if 'page' in available_merge_cols:  # Check if page exists for merging
+                df_abs = df_abs.merge(
+                    ps_df_copy[available_merge_cols],
+                    on="page",
+                    how="left"
+                )
+            else:
+                print("Warning: 'page' column not found in page_sizes_df. Cannot merge dimensions.")
+
+        # Fallback to mediabox dimensions if image dimensions are missing
+        if "image_width" in df_abs.columns and "mediabox_width" in df_abs.columns:
+            # Check if image_width is entirely missing - use .isna().all() or check percentage
+            if df_abs["image_width"].isna().all():
+                print("Falling back to mediabox dimensions as image_width is entirely missing.")
+                df_abs["image_width"] = df_abs["image_width"].fillna(df_abs["mediabox_width"])
+                df_abs["image_height"] = df_abs["image_height"].fillna(df_abs["mediabox_height"])
+            else:
+                # Optional: Fill only missing image dims if some exist?
995
+ # df_abs["image_width"].fillna(df_abs["mediabox_width"], inplace=True)
996
+ # df_abs["image_height"].fillna(df_abs["mediabox_height"], inplace=True)
997
+ pass # Current logic only falls back if ALL image_width are NaN
998
+
999
+ # Ensure divisor columns are numeric before division
1000
+ divisors_numeric = True
1001
+ for col in ["image_width", "image_height"]:
1002
+ if col in df_abs.columns:
1003
+ df_abs[col] = pd.to_numeric(df_abs[col], errors='coerce')
1004
+ else:
1005
+ print(f"Warning: Dimension column '{col}' missing. Cannot perform division.")
1006
+ divisors_numeric = False
1007
+
1008
+
1009
+ # Perform division if dimensions are available and numeric
1010
+ if divisors_numeric and "image_width" in df_abs.columns and "image_height" in df_abs.columns:
1011
+ # Use np.errstate to suppress warnings about division by zero or NaN if desired
1012
+ with np.errstate(divide='ignore', invalid='ignore'):
1013
+ df_abs[xmin] = df_abs[xmin] / df_abs["image_width"]
1014
+ df_abs[xmax] = df_abs[xmax] / df_abs["image_width"]
1015
+ df_abs[ymin] = df_abs[ymin] / df_abs["image_height"]
1016
+ df_abs[ymax] = df_abs[ymax] / df_abs["image_height"]
1017
+ # Replace potential infinities with NaN (optional, depending on desired outcome)
1018
+ df_abs.replace([np.inf, -np.inf], np.nan, inplace=True)
1019
  else:
1020
+ print("Skipping coordinate division due to missing or non-numeric dimension columns.")
1021
 
 
 
 
 
1022
 
1023
+ # --- Combine Relative and Processed Absolute DataFrames ---
1024
+ dfs_to_concat = [df for df in [df_rel, df_abs] if not df.empty]
1025
 
1026
+ if dfs_to_concat:
1027
+ final_df = pd.concat(dfs_to_concat, ignore_index=True)
1028
+ else:
1029
+ # If both splits were empty, return an empty DF with original columns
1030
+ print("Warning: Both relative and absolute splits resulted in empty DataFrames.")
1031
+ final_df = pd.DataFrame(columns=review_file_df.columns)
1032
 
 
1033
 
1034
+ # --- Final Sort ---
1035
+ required_sort_columns = {"page", xmin, ymin}
1036
+ if not final_df.empty and required_sort_columns.issubset(final_df.columns):
1037
+ # Ensure sort columns are numeric before sorting
1038
+ final_df['page'] = pd.to_numeric(final_df['page'], errors='coerce')
1039
+ final_df[ymin] = pd.to_numeric(final_df[ymin], errors='coerce')
1040
+ final_df[xmin] = pd.to_numeric(final_df[xmin], errors='coerce')
1041
+ # Sort by page, ymin, xmin (note order compared to multiply function)
1042
+ final_df.sort_values(["page", ymin, xmin], inplace=True, na_position='last')
1043
 
 
1044
 
1045
+ # --- Clean Up Columns ---
1046
+ # Correctly drop columns and reassign the result
1047
+ cols_to_drop = ["image_width", "image_height", "mediabox_width", "mediabox_height"]
1048
+ final_df = final_df.drop(columns=cols_to_drop, errors="ignore")
1049
 
1050
+ return final_df
 
 
 
1051
 
1052
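The absolute/relative split and division above can be reduced to a standalone sketch (toy boxes and page dimensions, not the app's real inputs):

```python
import pandas as pd

# Toy boxes: the first row is in absolute pixels (> 1), the second already relative (<= 1)
boxes = pd.DataFrame({
    "page": [1, 1],
    "xmin": [100.0, 0.25], "xmax": [200.0, 0.5],
    "ymin": [50.0, 0.1], "ymax": [150.0, 0.2],
})
page_sizes = pd.DataFrame({"page": [1], "image_width": [1000.0], "image_height": [500.0]})

# Same masking idea as above: all four coordinates > 1 => treat the row as absolute
is_abs = (boxes[["xmin", "xmax", "ymin", "ymax"]] > 1).all(axis=1)
df_abs = boxes[is_abs].merge(page_sizes, on="page", how="left")
df_abs[["xmin", "xmax"]] = df_abs[["xmin", "xmax"]].div(df_abs["image_width"], axis=0)
df_abs[["ymin", "ymax"]] = df_abs[["ymin", "ymax"]].div(df_abs["image_height"], axis=0)

# Recombine the untouched relative rows with the newly normalised ones
result = pd.concat(
    [boxes[~is_abs], df_abs.drop(columns=["image_width", "image_height"])],
    ignore_index=True,
)
```

The relative row passes through unchanged, while the absolute row ends up divided by the merged page dimensions.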
+def multiply_coordinates_by_page_sizes(
+    review_file_df: pd.DataFrame,
+    page_sizes_df: pd.DataFrame,
+    xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax"
+):
+    """
+    Optimized function to convert relative coordinates to absolute based on page sizes.
+
+    Separates relative (<=1) and absolute (>1) coordinates, merges page sizes
+    for relative coordinates, calculates absolute pixel values, and recombines.
+    """
+    if review_file_df.empty or xmin not in review_file_df.columns:
+        return review_file_df  # Return early if empty or key column missing
+
+    coord_cols = [xmin, xmax, ymin, ymax]
+    # Initial type conversion for coordinates and page;
+    # to_numeric is safer than astype for mixed types/errors
+    for col in coord_cols + ["page"]:
+        if col in review_file_df.columns:
+            review_file_df[col] = pd.to_numeric(review_file_df[col], errors="coerce")
+
+    # --- Identify relative coordinates ---
+    # Mask rows where *all* coordinates are potentially relative (<= 1).
+    # NaNs introduced by to_numeric are treated as not relative here.
+    is_relative_mask = (
+        (review_file_df[xmin].le(1) & review_file_df[xmin].notna()) &
+        (review_file_df[xmax].le(1) & review_file_df[xmax].notna()) &
+        (review_file_df[ymin].le(1) & review_file_df[ymin].notna()) &
+        (review_file_df[ymax].le(1) & review_file_df[ymax].notna())
+    )
+
+    # Separate DataFrames (minimal copies)
+    df_abs = review_file_df[~is_relative_mask].copy()  # Keep absolute rows separately
+    df_rel = review_file_df[is_relative_mask].copy()   # Work only with relative rows
+
+    if df_rel.empty:
+        # If there are no relative coordinates, just sort and return the absolute ones (if any)
+        if not df_abs.empty and {"page", xmin, ymin}.issubset(df_abs.columns):
+            df_abs.sort_values(["page", xmin, ymin], inplace=True, na_position='last')
+        return df_abs
+
+    # --- Process relative coordinates ---
+    if "image_width" not in df_rel.columns and not page_sizes_df.empty:
+        # Prepare page_sizes_df for the merge without modifying the caller's DataFrame
+        page_sizes_df = page_sizes_df.copy()
+        page_sizes_df['page'] = pd.to_numeric(page_sizes_df['page'], errors='coerce')
+        # Ensure proper NA handling for image dimensions
+        page_sizes_df[['image_width', 'image_height']] = page_sizes_df[['image_width', 'image_height']].replace("<NA>", pd.NA)
+        page_sizes_df['image_width'] = pd.to_numeric(page_sizes_df['image_width'], errors='coerce')
+        page_sizes_df['image_height'] = pd.to_numeric(page_sizes_df['image_height'], errors='coerce')
+
+        # Merge page sizes
+        df_rel = df_rel.merge(
+            page_sizes_df[['page', 'image_width', 'image_height']],
+            on="page",
+            how="left"
+        )
+
+    # Multiply coordinates where image dimensions are available
+    if "image_width" in df_rel.columns:
+        # Mask rows in df_rel that have valid image dimensions
+        has_size_mask = df_rel["image_width"].notna() & df_rel["image_height"].notna()
+
+        # Apply the multiplication using .loc and the mask (vectorized and efficient)
+        df_rel.loc[has_size_mask, xmin] *= df_rel.loc[has_size_mask, "image_width"]
+        df_rel.loc[has_size_mask, xmax] *= df_rel.loc[has_size_mask, "image_width"]
+        df_rel.loc[has_size_mask, ymin] *= df_rel.loc[has_size_mask, "image_height"]
+        df_rel.loc[has_size_mask, ymax] *= df_rel.loc[has_size_mask, "image_height"]
+
+    # --- Combine absolute and processed relative DataFrames ---
+    # Use a list comprehension to handle potentially empty DataFrames
+    dfs_to_concat = [df for df in [df_abs, df_rel] if not df.empty]
+
+    if not dfs_to_concat:
+        return pd.DataFrame()  # Return empty if both are empty
+
+    final_df = pd.concat(dfs_to_concat, ignore_index=True)  # ignore_index after filtering/concat
+
+    # --- Final Sort ---
+    required_sort_columns = {"page", xmin, ymin}
+    if not final_df.empty and required_sort_columns.issubset(final_df.columns):
+        # Handle potential NaNs in sort columns gracefully
+        final_df.sort_values(["page", xmin, ymin], inplace=True, na_position='last')
+
+    return final_df
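The masked `.loc` multiplication used above can be illustrated standalone (hypothetical values; page 2 deliberately lacks a known image size):

```python
import pandas as pd

df = pd.DataFrame({
    "page": [1, 2],
    "xmin": [0.5, 0.25],
    "image_width": [800.0, None],  # page 2 has no known image size
})

# Multiply only the rows whose image dimensions are known; others stay untouched
has_size = df["image_width"].notna()
df.loc[has_size, "xmin"] *= df.loc[has_size, "image_width"]
```

Rows without dimensions keep their relative values, which is why the function can safely recombine both groups afterwards.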
 def do_proximity_match_by_page_for_text(df1:pd.DataFrame, df2:pd.DataFrame):
     '''

     return merged_df

 def do_proximity_match_all_pages_for_text(df1:pd.DataFrame, df2:pd.DataFrame, threshold:float=0.03):
     '''
     Match text from one dataframe to another based on proximity matching of coordinates across all pages.
     # prevents this from being necessary.

     # 7. Ensure essential columns exist and set column order
+    essential_box_cols = ["xmin", "xmax", "ymin", "ymax", "text", "id", "label"]
     for col in essential_box_cols:
         if col not in final_df.columns:
             final_df[col] = pd.NA  # Add column with NA if it wasn't present in any box

+    base_cols = ["image"]
     extra_box_cols = [col for col in final_df.columns if col not in base_cols and col not in essential_box_cols]
     final_col_order = base_cols + essential_box_cols + sorted(extra_box_cols)

     # but it's good practice if columns could be missing for other reasons.
     final_df = final_df.reindex(columns=final_col_order, fill_value=pd.NA)

+    # Drop any boxes that are missing one of the essential fields
+    final_df = final_df.dropna(subset=["xmin", "xmax", "ymin", "ymax", "text", "id", "label"])
+
     return final_df
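The ensure/reorder/drop pattern above can be sketched standalone on toy columns (hypothetical data; `extra` stands in for any non-essential column):

```python
import pandas as pd

boxes = pd.DataFrame({"xmin": [0.1], "label": ["NAME"], "extra": [1]})

essential = ["xmin", "xmax", "ymin", "ymax", "text", "id", "label"]
for col in essential:
    if col not in boxes.columns:
        boxes[col] = pd.NA  # add missing essentials as NA

# Base column first, essentials next, any extras sorted at the end
order = ["image"] + essential + sorted(c for c in boxes.columns if c not in essential + ["image"])
boxes = boxes.reindex(columns=order, fill_value=pd.NA)
```

The subsequent `dropna(subset=essential)` step then removes any box that still lacks one of the essential fields.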
 def create_annotation_dicts_from_annotation_df(

     available_cols = [col for col in box_cols if col in all_image_annotations_df.columns]

     if 'text' in all_image_annotations_df.columns:
+        all_image_annotations_df['text'] = all_image_annotations_df['text'].fillna('')

     if not available_cols:
         print(f"Warning: None of the expected box columns ({box_cols}) found in DataFrame.")

     return result

+def convert_annotation_json_to_review_df(
+    all_annotations: List[dict],
+    redaction_decision_output: pd.DataFrame = pd.DataFrame(),
+    page_sizes: List[dict] = [],
+    do_proximity_match: bool = True
+) -> pd.DataFrame:
     '''
     Convert the annotation json data to a dataframe format.
     Add on any text from the initial review_file dataframe by joining based on 'id' if available
     in both sources, otherwise falling back to joining on pages/co-ordinates (if option selected).
+
+    Refactored for improved efficiency, prioritizing the ID-based join and conditionally applying
+    coordinate division and proximity matching.
     '''

     # 1. Convert annotations to DataFrame
     review_file_df = convert_annotation_data_to_dataframe(all_annotations)

+    # Only keep rows in review_file_df that have coordinates
+    review_file_df.dropna(subset=['xmin', 'ymin', 'xmax', 'ymax'], how='any', inplace=True)

     # Exit early if the initial conversion results in an empty DataFrame
     if review_file_df.empty:
+        # Define standard columns for an empty return DataFrame,
+        # and include 'id' if the conversion produced one
+        standard_cols = ["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax", "text"]
+        if 'id' in review_file_df.columns:
+            standard_cols.append('id')
+        return pd.DataFrame(columns=standard_cols)
+
+    # Ensure the 'id' column exists for the logic flow, even if empty
+    if 'id' not in review_file_df.columns:
+        review_file_df['id'] = ''
+    # Do the same for redaction_decision_output if it is not empty
+    if not redaction_decision_output.empty and 'id' not in redaction_decision_output.columns:
+        redaction_decision_output['id'] = ''
+
+    # 2. Process page sizes if provided - potentially needed for coordinate division later.
+    # Do this once upfront if the data is available.
+    page_sizes_df = pd.DataFrame()  # Initialise as empty
+    if page_sizes:
+        page_sizes_df = pd.DataFrame(page_sizes)
         if not page_sizes_df.empty:
+            # Safely convert the page column to numeric and then int
+            page_sizes_df["page"] = pd.to_numeric(page_sizes_df["page"], errors="coerce")
+            page_sizes_df.dropna(subset=["page"], inplace=True)
+            if not page_sizes_df.empty:  # Check again after dropping NaNs
+                page_sizes_df["page"] = page_sizes_df["page"].astype(int)
+            else:
+                print("Warning: Page sizes DataFrame became empty after processing; coordinate division will be skipped.")

     # 3. Join additional data from redaction_decision_output if provided
+    text_added_successfully = False  # Flag to track whether text was added by any method
+
     if not redaction_decision_output.empty:
+        # --- Attempt to join data based on the 'id' column first ---
+        # Check that 'id' columns are present and hold non-null, non-empty values in *both* dataframes
+        id_col_exists_in_review = 'id' in review_file_df.columns and not review_file_df['id'].isnull().all() and not (review_file_df['id'] == '').all()
+        id_col_exists_in_redaction = 'id' in redaction_decision_output.columns and not redaction_decision_output['id'].isnull().all() and not (redaction_decision_output['id'] == '').all()

         if id_col_exists_in_review and id_col_exists_in_redaction:
             try:
+                # Ensure 'id' columns are of string type for robust merging
                 review_file_df['id'] = review_file_df['id'].astype(str)
+                # Use a copy for safety, since redaction_decision_output may be reused below
                 redaction_copy = redaction_decision_output.copy()
                 redaction_copy['id'] = redaction_copy['id'].astype(str)

+                # Select columns to merge from the redaction output. Prioritize 'text'.
                 cols_to_merge = ['id']
                 if 'text' in redaction_copy.columns:
                     cols_to_merge.append('text')
                 else:
                     print("Warning: 'text' column not found in redaction_decision_output. Cannot merge text using 'id'.")

                 # Perform a left merge to keep all annotations and add matching text
+                # Use a suffix for the text column coming from the right DataFrame
+                original_text_col_exists = 'text' in review_file_df.columns
+                merge_suffix = '_redaction' if original_text_col_exists else ''
+
                 merged_df = pd.merge(
                     review_file_df,
                     redaction_copy[cols_to_merge],
                     on='id',
                     how='left',
+                    suffixes=('', merge_suffix)
                 )

+                # Update the 'text' column if a new one was brought in
+                if 'text' + merge_suffix in merged_df.columns:
+                    redaction_text_col = 'text' + merge_suffix
+                    if original_text_col_exists:
+                        # Combine: use text from the redaction output where available, otherwise keep the original
+                        merged_df['text'] = merged_df[redaction_text_col].combine_first(merged_df['text'])
+                        # Drop the temporary column
+                        merged_df = merged_df.drop(columns=[redaction_text_col])
+                    else:
+                        # The redaction output had text but review_file_df did not; rename the new column
+                        merged_df = merged_df.rename(columns={redaction_text_col: 'text'})
+
+                    text_added_successfully = True  # Indicate that text was potentially added
+
+                review_file_df = merged_df  # Update the main DataFrame

             except Exception as e:
+                print(f"Error during 'id'-based merge: {e}. Checking for proximity match fallback.")
+                # Fall through to the proximity match logic below
+
+        # --- Fall back to a proximity match if the ID join was not possible/successful and it is enabled ---
+        # If either id_col_exists flag was False, the block above was skipped and we naturally fall here;
+        # if an error occurred in the try block, text_added_successfully remains False.
+        if not text_added_successfully and do_proximity_match:
+            print("Attempting proximity match to add text data.")
+
+            # Ensure 'page' columns are numeric before coordinate division and proximity matching
+            if 'page' in review_file_df.columns:
+                review_file_df['page'] = pd.to_numeric(review_file_df['page'], errors='coerce').fillna(-1).astype(int)  # Use -1 for NaN pages
+                review_file_df = review_file_df[review_file_df['page'] != -1]  # Drop rows where page conversion failed
+            if not redaction_decision_output.empty and 'page' in redaction_decision_output.columns:
+                redaction_decision_output['page'] = pd.to_numeric(redaction_decision_output['page'], errors='coerce').fillna(-1).astype(int)
+                redaction_decision_output = redaction_decision_output[redaction_decision_output['page'] != -1]
+
+            # Perform coordinate division IF page sizes were processed and the DataFrame is not empty
+            if not page_sizes_df.empty:
+                # Apply coordinate division *before* the proximity match
+                review_file_df = divide_coordinates_by_page_sizes(review_file_df, page_sizes_df)
+                if not redaction_decision_output.empty:
+                    redaction_decision_output = divide_coordinates_by_page_sizes(redaction_decision_output, page_sizes_df)
+
+            # Now perform the proximity match
+            if not redaction_decision_output.empty:
+                try:
+                    review_file_df = do_proximity_match_all_pages_for_text(
+                        df1=review_file_df,
+                        df2=redaction_decision_output
+                    )
+                    # do_proximity_match_all_pages_for_text adds/updates the 'text' column
+                    if 'text' in review_file_df.columns:
+                        text_added_successfully = True
+                    print("Proximity match completed.")
+                except Exception as e:
+                    print(f"Error during proximity match: {e}. Text data may not be added.")
+
+        elif not text_added_successfully and not do_proximity_match:
+            print("Skipping joining text data (ID join not possible/failed, proximity match disabled).")
+
+    # 4. Ensure required columns exist and are ordered
+    # Define the base required columns; 'id' and 'text' are conditionally added
+    required_columns_base = ["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax"]
+    final_columns = required_columns_base[:]  # Start with the base columns
+
+    # Add 'id' and 'text' if they exist in the DataFrame at this point
     if 'id' in review_file_df.columns:
+        final_columns.append('id')
+    if 'text' in review_file_df.columns:
+        final_columns.append('text')  # Add the text column if it was created/merged

+    # Add any missing required columns with a blank-string default
+    # (coordinate columns were already validated by the earlier dropna)
+    for col in final_columns:
         if col not in review_file_df.columns:
+            review_file_df[col] = ''

     # Select and order the final set of columns
+    # Ensure all selected columns actually exist after adding defaults
+    review_file_df = review_file_df[[col for col in final_columns if col in review_file_df.columns]]

     # 5. Final processing and sorting
+    # Convert colours from list to tuple if necessary
     if 'color' in review_file_df.columns:
+        # Check whether the column actually contains lists before applying the lambda
+        if review_file_df['color'].apply(lambda x: isinstance(x, list)).any():
+            review_file_df["color"] = review_file_df["color"].apply(lambda x: tuple(x) if isinstance(x, list) else x)

     # Sort the results
     # Ensure sort columns exist before sorting
+    sort_columns = ['page', 'ymin', 'xmin', 'label']
     valid_sort_columns = [col for col in sort_columns if col in review_file_df.columns]
+    if valid_sort_columns and not review_file_df.empty:  # Only sort a non-empty DataFrame
+        try:
+            review_file_df = review_file_df.sort_values(valid_sort_columns)
+        except TypeError as e:
+            print(f"Warning: Could not sort DataFrame due to a type error in the sort columns: {e}")
+            # Proceed without sorting
+
+    review_file_df = review_file_df.dropna(subset=["xmin", "xmax", "ymin", "ymax", "text", "id", "label"])

     return review_file_df

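The 'id'-based text join at the heart of the function above amounts to a left merge plus `combine_first`; a minimal sketch with toy data:

```python
import pandas as pd

review = pd.DataFrame({"id": ["a1", "b2"], "text": [None, "kept"]})
redaction = pd.DataFrame({"id": ["a1"], "text": ["Jane Doe"]})

merged = pd.merge(review, redaction, on="id", how="left", suffixes=("", "_redaction"))
# Prefer text from the redaction output; fall back to the existing value
merged["text"] = merged["text_redaction"].combine_first(merged["text"])
merged = merged.drop(columns=["text_redaction"])
```

Row `a1` picks up the redaction text, while row `b2` keeps its original value because the merge produced no match.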
 def fill_missing_ids(df: pd.DataFrame, column_name: str = 'id', length: int = 12) -> pd.DataFrame:
     """
+    Optimized: generates unique alphanumeric IDs for rows in a DataFrame column
+    where the value is missing (NaN, None) or an empty/whitespace string.

     Args:
         df (pd.DataFrame): The input Pandas DataFrame.
         column_name (str): The name of the column to check and fill (defaults to 'id').
                            This column will be added if it doesn't exist.
         length (int): The desired length of the generated IDs (defaults to 12).

     Returns:
         pd.DataFrame: The DataFrame with missing/empty IDs filled in the specified column.
+                      Note: the function modifies the DataFrame directly (in place).
     """

     # --- Input Validation ---

         raise ValueError("'length' must be a positive integer.")

     # --- Ensure Column Exists ---
+    original_dtype = None
     if column_name not in df.columns:
         print(f"Column '{column_name}' not found. Adding it to the DataFrame.")
+        # Initialise with None (treated as NaN by Pandas, but keeps an object dtype)
+        df[column_name] = None
+        # Set original_dtype to object so the column likely becomes string later
+        original_dtype = object
+    else:
+        original_dtype = df[column_name].dtype

     # --- Identify Rows Needing IDs ---
+    # 1. Check for actual null values (NaN, None, NaT)
+    is_null = df[column_name].isna()
+
+    # 2. Check for empty or whitespace-only strings AFTER converting potential values to string.
+    #    Only apply the string checks to rows that are *not* null, to avoid errors/warnings.
+    is_empty_str = pd.Series(False, index=df.index)  # Default to False
+    if not is_null.all():  # Only check strings if there are non-null values
+        temp_str_col = df.loc[~is_null, column_name].astype(str).str.strip()
+        is_empty_str.loc[~is_null] = (temp_str_col == '')
+
+    # Combine the conditions
+    is_missing_or_empty = is_null | is_empty_str

     rows_to_fill_index = df.index[is_missing_or_empty]
     num_needed = len(rows_to_fill_index)

     if num_needed == 0:
+        # Nothing to fill; return the DataFrame unchanged
         return df

     print(f"Found {num_needed} rows requiring a unique ID in column '{column_name}'.")

     # --- Get Existing IDs to Ensure Uniqueness ---
+    # Consider only rows that are *not* missing/empty
+    valid_rows = df.loc[~is_missing_or_empty, column_name]
+    # Drop any remaining nulls (there should be none given the mask, but belt and braces)
+    valid_rows = valid_rows.dropna()
+    # astype(str) handles mixed types in an object column; strip and collect into a set
+    existing_ids = set(valid_rows.astype(str).str.strip())
+    # Remove the empty string from existing IDs if it is present after stripping
+    existing_ids.discard('')

     # --- Generate Unique IDs ---

     max_possible_ids = len(character_set) ** length
     if num_needed > max_possible_ids:
+        raise ValueError(f"Cannot generate {num_needed} unique IDs with length {length}. Maximum possible is {max_possible_ids}.")
+
+    # Pre-calculate the safety-break limit
+    max_attempts_per_id = max(1000, num_needed * 10)  # Adjust the multiplier as needed

     for i in range(num_needed):
         attempts = 0
         while True:
             candidate_id = ''.join(random.choices(character_set, k=length))
+            # Check against *all* known existing IDs and *newly* generated ones
             if candidate_id not in existing_ids and candidate_id not in generated_ids_set:
                 generated_ids_set.add(candidate_id)
                 new_ids_list.append(candidate_id)
                 break  # Found a unique ID
             attempts += 1
+            if attempts > max_attempts_per_id:  # Safety break
+                raise RuntimeError(f"Failed to generate a unique ID after {attempts} attempts. Check the length, character set, or density of existing IDs.")

     # --- Assign New IDs ---
     # Use the previously identified index to assign the new IDs correctly
+    # Assigning string IDs may change the column's dtype to 'object'
+    if not pd.api.types.is_object_dtype(original_dtype) and not pd.api.types.is_string_dtype(original_dtype):
+        warnings.warn(f"Column '{column_name}' dtype might change from '{original_dtype}' to 'object' due to string ID assignment.", UserWarning)
+
     df.loc[rows_to_fill_index, column_name] = new_ids_list
+    print(f"Successfully assigned {len(new_ids_list)} new unique IDs to column '{column_name}'.")

     return df

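The generation loop above is rejection sampling against the set of known IDs; a standalone sketch (the character set is an assumption here, since `character_set` is defined outside this hunk):

```python
import random
import string

def gen_unique_ids(existing: set, n: int, length: int = 12) -> list:
    # Assumed character set: the real character_set is defined elsewhere in the module
    charset = string.ascii_letters + string.digits
    out, seen = [], set(existing)
    while len(out) < n:
        cand = ''.join(random.choices(charset, k=length))
        if cand not in seen:  # reject collisions with existing and newly generated IDs
            seen.add(cand)
            out.append(cand)
    return out

ids = gen_unique_ids({"existingID001"}, n=5)
```

With 62 characters and length 12 the ID space is ~3e21, so collisions (and therefore retries) are vanishingly rare at the row counts this app handles.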
+def convert_review_df_to_annotation_json(
+    review_file_df: pd.DataFrame,
+    image_paths: List[str],  # List of image file paths
+    page_sizes: List[Dict],  # List of dicts like [{'page': 1, 'image_path': '...', 'image_width': W, 'image_height': H}, ...]
+    xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax"  # Coordinate column names
+) -> List[Dict]:
+    """
+    Optimized function to convert a review DataFrame to the Gradio Annotation JSON format.
+
+    Ensures absolute coordinates, handles missing IDs, deduplicates based on key fields,
+    selects the final columns, and structures the data per image/page based on page_sizes.
+
+    Args:
+        review_file_df: Input DataFrame with annotation data.
+        image_paths: List of image file paths (currently unused, as page_sizes provides the paths).
+        page_sizes: REQUIRED list of dictionaries, each containing 'page',
+            'image_path', 'image_width', and 'image_height'. Defines the
+            output structure and the dimensions for coordinate conversion.
+        xmin, xmax, ymin, ymax: Names of the coordinate columns.
+
+    Returns:
+        A list of dictionaries suitable for Gradio Annotation output, one dict per image/page.
+    """
+    review_file_df = review_file_df.dropna(subset=["xmin", "xmax", "ymin", "ymax", "text", "id", "label"])
+
+    if not page_sizes:
+        raise ValueError("page_sizes argument is required and cannot be empty.")
+
+    # --- Prepare Page Sizes DataFrame ---
+    try:
+        page_sizes_df = pd.DataFrame(page_sizes)
+        required_ps_cols = {'page', 'image_path', 'image_width', 'image_height'}
+        if not required_ps_cols.issubset(page_sizes_df.columns):
+            missing = required_ps_cols - set(page_sizes_df.columns)
+            raise ValueError(f"page_sizes is missing required keys: {missing}")
+        # Convert the page size columns to appropriate numeric types early
+        page_sizes_df['page'] = pd.to_numeric(page_sizes_df['page'], errors='coerce')
+        page_sizes_df['image_width'] = pd.to_numeric(page_sizes_df['image_width'], errors='coerce')
+        page_sizes_df['image_height'] = pd.to_numeric(page_sizes_df['image_height'], errors='coerce')
+        # Use nullable Int64 for page number consistency
+        page_sizes_df['page'] = page_sizes_df['page'].astype('Int64')
+
+    except Exception as e:
+        raise ValueError(f"Error processing page_sizes: {e}") from e
+
+    # Handle an empty input DataFrame gracefully
+    if review_file_df.empty:
+        print("Input review_file_df is empty. Proceeding to generate the JSON structure with empty boxes.")
+        # Ensure essential columns exist, even if empty, for the later steps
+        for col in [xmin, xmax, ymin, ymax, "page", "label", "color", "id", "text"]:
+            if col not in review_file_df.columns:
+                review_file_df[col] = pd.NA
+    else:
+        # --- Coordinate Conversion (if needed) ---
+        coord_cols_to_check = [c for c in [xmin, xmax, ymin, ymax] if c in review_file_df.columns]
+        needs_multiplication = False
+        if coord_cols_to_check:
+            temp_df_numeric = review_file_df[coord_cols_to_check].apply(pd.to_numeric, errors='coerce')
+            if temp_df_numeric.le(1).any().any():  # Check whether any numeric coordinate <= 1 exists
+                needs_multiplication = True
+
+        if needs_multiplication:
+            # Relative coordinates detected or suspected: convert to absolute pixels
+            review_file_df = multiply_coordinates_by_page_sizes(
+                review_file_df.copy(),  # Pass a copy to avoid modifying the original outside this function
+                page_sizes_df,
+                xmin, xmax, ymin, ymax
+            )
+        else:
+            # No relative coordinates detected; still ensure the coordinate/page columns are numeric
+            cols_to_convert = [c for c in [xmin, xmax, ymin, ymax, "page"] if c in review_file_df.columns]
+            for col in cols_to_convert:
+                review_file_df[col] = pd.to_numeric(review_file_df[col], errors='coerce')
+
+        # Handle the potential case where the multiplication returns an empty DataFrame
+        if review_file_df.empty:
+            print("DataFrame became empty after coordinate processing.")
+            # Re-add essential columns if they were lost
+            for col in [xmin, xmax, ymin, ymax, "page", "label", "color", "id", "text"]:
+                if col not in review_file_df.columns:
+                    review_file_df[col] = pd.NA
+
+    # --- Fill Missing IDs ---
+    review_file_df = fill_missing_ids(review_file_df.copy())  # Pass a copy
+
+    # --- Deduplicate Based on Key Fields ---
+    base_dedupe_cols = ["page", xmin, ymin, xmax, ymax, "label", "id"]
+    # Identify which deduplication columns actually exist in the DataFrame
+    cols_for_dedupe = [col for col in base_dedupe_cols if col in review_file_df.columns]
+    # Add the 'image' column for deduplication IF it exists (matches the original logic's intent)
1913
+ if "image" in review_file_df.columns:
1914
+ cols_for_dedupe.append("image")
1915
+
1916
+ # Ensure placeholder columns exist if they are needed for deduplication
1917
+ # (e.g., 'label', 'id' should be present after fill_missing_ids)
1918
+ for col in ['label', 'id']:
1919
+ if col in cols_for_dedupe and col not in review_file_df.columns:
1920
+ # This might indicate an issue in fill_missing_ids or prior steps
1921
+ print(f"Warning: Column '{col}' needed for dedupe but not found. Adding NA.")
1922
+ review_file_df[col] = "" # Add default empty string
1923
+
1924
+ if cols_for_dedupe: # Only attempt dedupe if we have columns to check
1925
+ #print(f"Deduplicating based on columns: {cols_for_dedupe}")
1926
+ # Convert relevant columns to string before dedupe to avoid type issues with mixed data (optional, depends on data)
1927
+ # for col in cols_for_dedupe:
1928
+ # review_file_df[col] = review_file_df[col].astype(str)
1929
+ review_file_df = review_file_df.drop_duplicates(subset=cols_for_dedupe)
1930
  else:
1931
+ print("Skipping deduplication: No valid columns found to deduplicate by.")
1932
+
1933
+
1934
+ # --- Select and Prepare Final Output Columns ---
1935
+ required_final_cols = ["page", "label", "color", xmin, ymin, xmax, ymax, "id", "text"]
1936
+ # Identify which of the desired final columns exist in the (now potentially deduplicated) DataFrame
1937
+ available_final_cols = [col for col in required_final_cols if col in review_file_df.columns]
1938
+
1939
+ # Ensure essential output columns exist, adding defaults if missing AFTER deduplication
1940
+ for col in required_final_cols:
1941
+ if col not in review_file_df.columns:
1942
+ print(f"Adding missing final column '{col}' with default value.")
1943
+ if col in ['label', 'id', 'text']:
1944
+ review_file_df[col] = "" # Default empty string
1945
+ elif col == 'color':
1946
+ review_file_df[col] = None # Default None or a default color tuple
1947
+ else: # page, coordinates
1948
+ review_file_df[col] = pd.NA # Default NA for numeric/page
1949
+ available_final_cols.append(col) # Add to list of available columns
1950
+
1951
+ # Select only the final desired columns in the correct order
1952
+ review_file_df = review_file_df[available_final_cols]
1953
+
1954
+ # --- Final Formatting ---
1955
+ if not review_file_df.empty:
1956
+ # Convert list colors to tuples (important for some downstream uses)
1957
+ if 'color' in review_file_df.columns:
1958
+ review_file_df['color'] = review_file_df['color'].apply(
1959
+ lambda x: tuple(x) if isinstance(x, list) else x
1960
+ )
1961
+ # Ensure page column is nullable integer type for reliable grouping
1962
+ if 'page' in review_file_df.columns:
1963
+ review_file_df['page'] = review_file_df['page'].astype('Int64')
1964
+
1965
+ # --- Group Annotations by Page ---
1966
+ if 'page' in review_file_df.columns:
1967
+ grouped_annotations = review_file_df.groupby('page')
1968
+ group_keys = set(grouped_annotations.groups.keys()) # Use set for faster lookups
1969
+ else:
1970
+ # Cannot group if page column is missing
1971
+ print("Error: 'page' column missing, cannot group annotations.")
1972
+ grouped_annotations = None
1973
+ group_keys = set()
1974
+
1975
 
1976
+ # --- Build JSON Structure ---
1977
+ json_data = []
1978
+ output_cols_for_boxes = [col for col in ["label", "color", xmin, ymin, xmax, ymax, "id", "text"] if col in review_file_df.columns]
1979
+
1980
+ # Iterate through page_sizes_df to define the structure (one entry per image path)
1981
+ for _, row in page_sizes_df.iterrows():
1982
+ page_num = row['page'] # Already Int64
1983
+ pdf_image_path = row['image_path']
1984
+ annotation_boxes = [] # Default to empty list
1985
 
1986
+ # Check if the page exists in the grouped annotations (using the faster set lookup)
1987
+ # Check pd.notna because page_num could be <NA> if conversion failed
1988
+ if pd.notna(page_num) and page_num in group_keys and grouped_annotations:
1989
+ try:
1990
+ page_group_df = grouped_annotations.get_group(page_num)
1991
+ # Convert the group to list of dicts, selecting only needed box properties
1992
+ # Handle potential NaN coordinates before conversion to JSON
1993
+ annotation_boxes = page_group_df[output_cols_for_boxes].replace({np.nan: None}).to_dict(orient='records')
1994
+
1995
+ # Optional: Round coordinates here if needed AFTER potential multiplication
1996
+ # for box in annotation_boxes:
1997
+ # for coord in [xmin, ymin, xmax, ymax]:
1998
+ # if coord in box and box[coord] is not None:
1999
+ # box[coord] = round(float(box[coord]), 2) # Example: round to 2 decimals
2000
+
2001
+ except KeyError:
2002
+ print(f"Warning: Group key {page_num} not found despite being in group_keys (should not happen).")
2003
+ annotation_boxes = [] # Keep empty
2004
+
2005
+ # Append the structured data for this image/page
2006
+ json_data.append({
2007
+ "image": pdf_image_path,
2008
+ "boxes": annotation_boxes
2009
+ })
2010
+
2011
+ return json_data
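The deduplicate-then-group step added above can be sketched in isolation. This is a simplified, hypothetical reduction (the function name `boxes_by_page` and the fixed column list are illustrative, not the app's API); it shows the core idea of dropping duplicate boxes, grouping by page, and emitting one `{"image": ..., "boxes": [...]}` dict per page-size entry, with empty lists for pages that have no annotations:

```python
import pandas as pd
import numpy as np

def boxes_by_page(review_df: pd.DataFrame, page_sizes: list) -> list:
    # Hypothetical simplification of the grouping logic in the diff above.
    box_cols = ["label", "xmin", "ymin", "xmax", "ymax", "id", "text"]
    # Deduplicate on the key fields, then group remaining boxes by page
    review_df = review_df.drop_duplicates(subset=["page"] + box_cols).copy()
    review_df["page"] = review_df["page"].astype("Int64")
    grouped = review_df.groupby("page")
    keys = set(grouped.groups.keys())  # set lookup is faster than repeated scans

    json_data = []
    for entry in page_sizes:
        page = entry["page"]
        boxes = []
        if page in keys:
            # One record dict per box; NaN becomes None so it serialises as JSON null
            boxes = (grouped.get_group(page)[box_cols]
                     .replace({np.nan: None})
                     .to_dict(orient="records"))
        json_data.append({"image": entry["image_path"], "boxes": boxes})
    return json_data
```

Iterating over `page_sizes` rather than over the grouped annotations is what guarantees an output entry for every page image, even pages with no boxes.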
tools/file_redaction.py CHANGED
@@ -20,8 +20,8 @@ from gradio import Progress
20
  from collections import defaultdict # For efficient grouping
21
 
22
  from tools.config import OUTPUT_FOLDER, IMAGES_DPI, MAX_IMAGE_PIXELS, RUN_AWS_FUNCTIONS, AWS_ACCESS_KEY, AWS_SECRET_KEY, AWS_REGION, PAGE_BREAK_VALUE, MAX_TIME_VALUE, LOAD_TRUNCATED_IMAGES, INPUT_FOLDER
23
- from tools.custom_image_analyser_engine import CustomImageAnalyzerEngine, OCRResult, combine_ocr_results, CustomImageRecognizerResult, run_page_text_redaction, merge_text_bounding_boxes
24
- from tools.file_conversion import convert_annotation_json_to_review_df, redact_whole_pymupdf_page, redact_single_box, convert_pymupdf_to_image_coords, is_pdf, is_pdf_or_image, prepare_image_or_pdf, divide_coordinates_by_page_sizes, multiply_coordinates_by_page_sizes, convert_annotation_data_to_dataframe, divide_coordinates_by_page_sizes, create_annotation_dicts_from_annotation_df, remove_duplicate_images_with_blank_boxes, fill_missing_ids, fill_missing_box_ids
25
  from tools.load_spacy_model_custom_recognisers import nlp_analyser, score_threshold, custom_entities, custom_recogniser, custom_word_list_recogniser, CustomWordFuzzyRecognizer
26
  from tools.helper_functions import get_file_name_without_type, clean_unicode_text, tesseract_ocr_option, text_ocr_option, textract_option, local_pii_detector, aws_pii_detector, no_redaction_option
27
  from tools.aws_textract import analyse_page_with_textract, json_to_ocrresult, load_and_convert_textract_json
@@ -101,6 +101,8 @@ def choose_and_run_redactor(file_paths:List[str],
101
  input_folder:str=INPUT_FOLDER,
102
  total_textract_query_number:int=0,
103
  ocr_file_path:str="",
 
 
104
  prepare_images:bool=True,
105
  progress=gr.Progress(track_tqdm=True)):
106
  '''
@@ -149,7 +151,9 @@ def choose_and_run_redactor(file_paths:List[str],
149
  - review_file_path (str, optional): The latest review file path created by the app
150
  - input_folder (str, optional): The custom input path, if provided
151
  - total_textract_query_number (int, optional): The number of textract queries up until this point.
152
- - ocr_file_path (str, optional): The latest ocr file path created by the app
 
 
153
  - prepare_images (bool, optional): Boolean to determine whether to load images for the PDF.
154
  - progress (gr.Progress, optional): A progress tracker for the redaction process. Defaults to a Progress object with track_tqdm set to True.
155
 
@@ -179,9 +183,16 @@ def choose_and_run_redactor(file_paths:List[str],
179
  out_file_paths = []
180
  estimate_total_processing_time = 0
181
  estimated_time_taken_state = 0
 
 
 
 
 
182
  # If not the first time around, and the current page loop has been set to a huge number (been through all pages), reset current page to 0
183
  elif (first_loop_state == False) & (current_loop_page == 999):
184
  current_loop_page = 0
 
 
185
 
186
  # Choose the correct file to prepare
187
  if isinstance(file_paths, str): file_paths_list = [os.path.abspath(file_paths)]
@@ -219,6 +230,8 @@ def choose_and_run_redactor(file_paths:List[str],
219
  elif out_message:
220
  combined_out_message = combined_out_message + '\n' + out_message
221
 
 
 
222
  # Only send across review file if redaction has been done
223
  if pii_identification_method != no_redaction_option:
224
 
@@ -226,10 +239,15 @@ def choose_and_run_redactor(file_paths:List[str],
226
  #review_file_path = [x for x in out_file_paths if "review_file" in x]
227
  if review_file_path: review_out_file_paths.append(review_file_path)
228
 
 
 
 
 
 
229
  estimate_total_processing_time = sum_numbers_before_seconds(combined_out_message)
230
  print("Estimated total processing time:", str(estimate_total_processing_time))
231
 
232
- return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_pages_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, pdf_image_file_paths, review_file_state, page_sizes, duplication_file_path_outputs, duplication_file_path_outputs, review_file_path, total_textract_query_number, ocr_file_path
233
 
234
  #if first_loop_state == False:
235
  # Prepare documents and images as required if they don't already exist
@@ -258,9 +276,8 @@ def choose_and_run_redactor(file_paths:List[str],
258
 
259
 
260
  # Call prepare_image_or_pdf only if needed
261
- if prepare_images_flag is not None:# and first_loop_state==True:
262
- #print("Calling preparation function. prepare_images_flag:", prepare_images_flag)
263
- out_message, prepared_pdf_file_paths, pdf_image_file_paths, annotate_max_pages, annotate_max_pages_bottom, pymupdf_doc, annotations_all_pages, review_file_state, document_cropboxes, page_sizes, textract_output_found, all_img_details_state, placeholder_ocr_results_df = prepare_image_or_pdf(
264
  file_paths_loop, text_extraction_method, 0, out_message, True,
265
  annotate_max_pages, annotations_all_pages, document_cropboxes, redact_whole_page_list,
266
  output_folder, prepare_images=prepare_images_flag, page_sizes=page_sizes, input_folder=input_folder
@@ -275,11 +292,15 @@ def choose_and_run_redactor(file_paths:List[str],
275
  page_sizes = page_sizes_df.to_dict(orient="records")
276
 
277
  number_of_pages = pymupdf_doc.page_count
 
278
 
279
  # If we have reached the last page, return message and outputs
280
  if current_loop_page >= number_of_pages:
281
  print("Reached last page of document:", current_loop_page)
282
 
 
 
 
283
  # Set to a very high number so as not to mix up with subsequent file processing by the user
284
  current_loop_page = 999
285
  if out_message:
@@ -292,7 +313,7 @@ def choose_and_run_redactor(file_paths:List[str],
292
  #review_file_path = [x for x in out_file_paths if "review_file" in x]
293
  if review_file_path: review_out_file_paths.append(review_file_path)
294
 
295
- return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = False, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_pages_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, pdf_image_file_paths, review_file_state, page_sizes, duplication_file_path_outputs, duplication_file_path_outputs, review_file_path, total_textract_query_number, ocr_file_path
296
 
297
  # Load/create allow list
298
  # If string, assume file path
@@ -333,7 +354,7 @@ def choose_and_run_redactor(file_paths:List[str],
333
  # Try to connect to AWS services directly only if RUN_AWS_FUNCTIONS environmental variable is 1, otherwise an environment variable or direct textbox input is needed.
334
  if pii_identification_method == aws_pii_detector:
335
  if aws_access_key_textbox and aws_secret_key_textbox:
336
- print("Connecting to Comprehend using AWS access key and secret keys from textboxes.")
337
  comprehend_client = boto3.client('comprehend',
338
  aws_access_key_id=aws_access_key_textbox,
339
  aws_secret_access_key=aws_secret_key_textbox, region_name=AWS_REGION)
@@ -356,7 +377,7 @@ def choose_and_run_redactor(file_paths:List[str],
356
  # Try to connect to AWS Textract Client if using that text extraction method
357
  if text_extraction_method == textract_option:
358
  if aws_access_key_textbox and aws_secret_key_textbox:
359
- print("Connecting to Textract using AWS access key and secret keys from textboxes.")
360
  textract_client = boto3.client('textract',
361
  aws_access_key_id=aws_access_key_textbox,
362
  aws_secret_access_key=aws_secret_key_textbox, region_name=AWS_REGION)
@@ -401,7 +422,7 @@ def choose_and_run_redactor(file_paths:List[str],
401
  is_a_pdf = is_pdf(file_path) == True
402
  if is_a_pdf == False and text_extraction_method == text_ocr_option:
403
  # If user has not submitted a pdf, assume it's an image
404
- print("File is not a pdf, assuming that image analysis needs to be used.")
405
  text_extraction_method = tesseract_ocr_option
406
  else:
407
  out_message = "No file selected"
@@ -422,7 +443,7 @@ def choose_and_run_redactor(file_paths:List[str],
422
 
423
  print("Redacting file " + pdf_file_name_with_ext + " as an image-based file")
424
 
425
- pymupdf_doc, all_pages_decision_process_table, out_file_paths, new_textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number = redact_image_pdf(file_path,
426
  pdf_image_file_paths,
427
  language,
428
  chosen_redact_entities,
@@ -448,7 +469,9 @@ def choose_and_run_redactor(file_paths:List[str],
448
  max_fuzzy_spelling_mistakes_num,
449
  match_fuzzy_whole_phrase_bool,
450
  page_sizes_df,
451
- text_extraction_only,
 
 
452
  log_files_output_paths=log_files_output_paths,
453
  output_folder=output_folder)
454
 
@@ -599,7 +622,10 @@ def choose_and_run_redactor(file_paths:List[str],
599
  if not review_file_path: review_out_file_paths = [prepared_pdf_file_paths[-1]]
600
  else: review_out_file_paths = [prepared_pdf_file_paths[-1], review_file_path]
601
 
602
- return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page, precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_pages_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, pdf_image_file_paths, review_file_state, page_sizes, duplication_file_path_outputs, duplication_file_path_outputs, review_file_path, total_textract_query_number, ocr_file_path
 
 
 
603
 
604
  def convert_pikepdf_coords_to_pymupdf(pymupdf_page:Page, pikepdf_bbox, type="pikepdf_annot"):
605
  '''
@@ -862,17 +888,6 @@ def convert_pikepdf_annotations_to_result_annotation_box(page:Page, annot:dict,
862
 
863
  rect = Rect(pymupdf_x1, pymupdf_y1, pymupdf_x2, pymupdf_y2)
864
 
865
- # if image or image_dimensions:
866
- # print("Dividing result by image coordinates")
867
-
868
- # image_x1, image_y1, image_x2, image_y2 = convert_pymupdf_to_image_coords(page, pymupdf_x1, pymupdf_y1, pymupdf_x2, pymupdf_y2, image, image_dimensions=image_dimensions)
869
-
870
- # img_annotation_box["xmin"] = image_x1
871
- # img_annotation_box["ymin"] = image_y1
872
- # img_annotation_box["xmax"] = image_x2
873
- # img_annotation_box["ymax"] = image_y2
874
-
875
- # else:
876
  convert_df = pd.DataFrame({
877
  "page": [page_no],
878
  "xmin": [pymupdf_x1],
@@ -1016,9 +1031,6 @@ def redact_page_with_pymupdf(page:Page, page_annotations:dict, image:Image=None,
1016
 
1017
  img_annotation_box = fill_missing_box_ids(img_annotation_box)
1018
 
1019
- #print("image_dimensions:", image_dimensions)
1020
- #print("annot:", annot)
1021
-
1022
  all_image_annotation_boxes.append(img_annotation_box)
1023
 
1024
  # Redact the annotations from the document
@@ -1178,7 +1190,9 @@ def redact_image_pdf(file_path:str,
1178
  max_fuzzy_spelling_mistakes_num:int=1,
1179
  match_fuzzy_whole_phrase_bool:bool=True,
1180
  page_sizes_df:pd.DataFrame=pd.DataFrame(),
1181
- text_extraction_only:bool=False,
 
 
1182
  page_break_val:int=int(PAGE_BREAK_VALUE),
1183
  log_files_output_paths:List=[],
1184
  max_time:int=int(MAX_TIME_VALUE),
@@ -1250,7 +1264,6 @@ def redact_image_pdf(file_path:str,
1250
  print(out_message_warning)
1251
  #raise Exception(out_message)
1252
 
1253
-
1254
  number_of_pages = pymupdf_doc.page_count
1255
  print("Number of pages:", str(number_of_pages))
1256
 
@@ -1268,14 +1281,24 @@ def redact_image_pdf(file_path:str,
1268
  textract_data, is_missing, log_files_output_paths = load_and_convert_textract_json(textract_json_file_path, log_files_output_paths, page_sizes_df)
1269
  original_textract_data = textract_data.copy()
1270
 
 
 
 
 
 
 
 
 
 
 
1271
  ###
1272
  if current_loop_page == 0: page_loop_start = 0
1273
  else: page_loop_start = current_loop_page
1274
 
1275
  progress_bar = tqdm(range(page_loop_start, number_of_pages), unit="pages remaining", desc="Redacting pages")
1276
 
1277
- all_pages_decision_process_table_list = [all_pages_decision_process_table]
1278
  all_line_level_ocr_results_df_list = [all_line_level_ocr_results_df]
 
1279
 
1280
  # Go through each page
1281
  for page_no in progress_bar:
@@ -1283,10 +1306,9 @@ def redact_image_pdf(file_path:str,
1283
  handwriting_or_signature_boxes = []
1284
  page_signature_recogniser_results = []
1285
  page_handwriting_recogniser_results = []
 
1286
  page_break_return = False
1287
  reported_page_number = str(page_no + 1)
1288
-
1289
- #print("page_sizes_df for row:", page_sizes_df.loc[page_sizes_df["page"] == (page_no + 1)])
1290
 
1291
  # Try to find image location
1292
  try:
@@ -1328,14 +1350,50 @@ def redact_image_pdf(file_path:str,
1328
 
1329
  # Step 1: Perform OCR. Either with Tesseract, or with AWS Textract
1330
 
1331
- # If using Tesseract, need to check if we have page as image_path
1332
  if text_extraction_method == tesseract_ocr_option:
1333
  #print("image_path:", image_path)
1334
  #print("print(type(image_path)):", print(type(image_path)))
1335
  #if not isinstance(image_path, image_path.image_path) or not isinstance(image_path, str): raise Exception("image_path object for page", reported_page_number, "not found, cannot perform local OCR analysis.")
1336
 
1337
- page_word_level_ocr_results = image_analyser.perform_ocr(image_path)
1338
- page_line_level_ocr_results, page_line_level_ocr_results_with_children = combine_ocr_results(page_word_level_ocr_results)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1339
 
1340
  # Check if page exists in existing textract data. If not, send to service to analyse
1341
  if text_extraction_method == textract_option:
@@ -1399,16 +1457,28 @@ def redact_image_pdf(file_path:str,
1399
  # If the page exists, retrieve the data
1400
  text_blocks = next(page['data'] for page in textract_data["pages"] if page['page_no'] == reported_page_number)
1401
 
1402
-
1403
- page_line_level_ocr_results, handwriting_or_signature_boxes, page_signature_recogniser_results, page_handwriting_recogniser_results, page_line_level_ocr_results_with_children = json_to_ocrresult(text_blocks, page_width, page_height, reported_page_number)
 
 
 
 
 
 
 
 
 
 
 
 
1404
 
1405
  if pii_identification_method != no_redaction_option:
1406
  # Step 2: Analyse text and identify PII
1407
  if chosen_redact_entities or chosen_redact_comprehend_entities:
1408
 
1409
  page_redaction_bounding_boxes, comprehend_query_number_new = image_analyser.analyze_text(
1410
- page_line_level_ocr_results,
1411
- page_line_level_ocr_results_with_children,
1412
  chosen_redact_comprehend_entities = chosen_redact_comprehend_entities,
1413
  pii_identification_method = pii_identification_method,
1414
  comprehend_client=comprehend_client,
@@ -1423,7 +1493,7 @@ def redact_image_pdf(file_path:str,
1423
  else: page_redaction_bounding_boxes = []
1424
 
1425
  # Merge redaction bounding boxes that are close together
1426
- page_merged_redaction_bboxes = merge_img_bboxes(page_redaction_bounding_boxes, page_line_level_ocr_results_with_children, page_signature_recogniser_results, page_handwriting_recogniser_results, handwrite_signature_checkbox)
1427
 
1428
  else: page_merged_redaction_bboxes = []
1429
 
@@ -1449,7 +1519,6 @@ def redact_image_pdf(file_path:str,
1449
  # Assume image_path is an image
1450
  image = image_path
1451
 
1452
- print("image:", image)
1453
 
1454
  fill = (0, 0, 0) # Fill colour for redactions
1455
  draw = ImageDraw.Draw(image)
@@ -1510,19 +1579,6 @@ def redact_image_pdf(file_path:str,
1510
  decision_process_table = fill_missing_ids(decision_process_table)
1511
  #decision_process_table.to_csv("output/decision_process_table_with_ids.csv")
1512
 
1513
-
1514
- # Convert to DataFrame and add to ongoing logging table
1515
- line_level_ocr_results_df = pd.DataFrame([{
1516
- 'page': reported_page_number,
1517
- 'text': result.text,
1518
- 'left': result.left,
1519
- 'top': result.top,
1520
- 'width': result.width,
1521
- 'height': result.height
1522
- } for result in page_line_level_ocr_results])
1523
-
1524
- all_line_level_ocr_results_df_list.append(line_level_ocr_results_df)
1525
-
1526
  toc = time.perf_counter()
1527
 
1528
  time_taken = toc - tic
@@ -1547,6 +1603,8 @@ def redact_image_pdf(file_path:str,
1547
  # Append new annotation if it doesn't exist
1548
  annotations_all_pages.append(page_image_annotations)
1549
 
 
 
1550
  if text_extraction_method == textract_option:
1551
  if original_textract_data != textract_data:
1552
  # Write the updated existing textract data back to the JSON file
@@ -1556,12 +1614,21 @@ def redact_image_pdf(file_path:str,
1556
  if textract_json_file_path not in log_files_output_paths:
1557
  log_files_output_paths.append(textract_json_file_path)
1558
 
 
 
 
 
 
 
 
 
 
1559
  all_pages_decision_process_table = pd.concat(all_pages_decision_process_table_list)
1560
  all_line_level_ocr_results_df = pd.concat(all_line_level_ocr_results_df_list)
1561
 
1562
  current_loop_page += 1
1563
 
1564
- return pymupdf_doc, all_pages_decision_process_table, log_files_output_paths, textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number
1565
 
1566
  # If it's an image file
1567
  if is_pdf(file_path) == False:
@@ -1594,10 +1661,20 @@ def redact_image_pdf(file_path:str,
1594
  if textract_json_file_path not in log_files_output_paths:
1595
  log_files_output_paths.append(textract_json_file_path)
1596
 
 
 
 
 
 
 
 
 
 
 
1597
  all_pages_decision_process_table = pd.concat(all_pages_decision_process_table_list)
1598
  all_line_level_ocr_results_df = pd.concat(all_line_level_ocr_results_df_list)
1599
 
1600
- return pymupdf_doc, all_pages_decision_process_table, log_files_output_paths, textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number
1601
 
1602
  if text_extraction_method == textract_option:
1603
  # Write the updated existing textract data back to the JSON file
@@ -1609,15 +1686,24 @@ def redact_image_pdf(file_path:str,
1609
  if textract_json_file_path not in log_files_output_paths:
1610
  log_files_output_paths.append(textract_json_file_path)
1611
 
 
 
 
 
 
 
 
 
 
1612
  all_pages_decision_process_table = pd.concat(all_pages_decision_process_table_list)
1613
  all_line_level_ocr_results_df = pd.concat(all_line_level_ocr_results_df_list)
1614
 
1615
- # Convert decision table to relative coordinates
1616
  all_pages_decision_process_table = divide_coordinates_by_page_sizes(all_pages_decision_process_table, page_sizes_df, xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax")
1617
 
1618
  all_line_level_ocr_results_df = divide_coordinates_by_page_sizes(all_line_level_ocr_results_df, page_sizes_df, xmin="left", xmax="width", ymin="top", ymax="height")
1619
 
1620
- return pymupdf_doc, all_pages_decision_process_table, log_files_output_paths, textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number
1621
 
1622
 
1623
  ###
@@ -1631,8 +1717,6 @@ def get_text_container_characters(text_container:LTTextContainer):
1631
  for line in text_container
1632
  if isinstance(line, LTTextLine) or isinstance(line, LTTextLineHorizontal)
1633
  for char in line]
1634
-
1635
- #print("Initial characters:", characters)
1636
 
1637
  return characters
1638
  return []
@@ -1762,9 +1846,6 @@ def create_text_redaction_process_results(analyser_results, analysed_bounding_bo
1762
  analysed_bounding_boxes_df_new = pd.concat([analysed_bounding_boxes_df_new, analysed_bounding_boxes_df_text], axis = 1)
1763
  analysed_bounding_boxes_df_new['page'] = page_num + 1
1764
 
1765
- #analysed_bounding_boxes_df_new = fill_missing_ids(analysed_bounding_boxes_df_new)
1766
- analysed_bounding_boxes_df_new.to_csv("output/analysed_bounding_boxes_df_new_with_ids.csv")
1767
-
1768
  decision_process_table = pd.concat([decision_process_table, analysed_bounding_boxes_df_new], axis = 0).drop('result', axis=1)
1769
 
1770
  return decision_process_table
@@ -1772,7 +1853,6 @@ def create_text_redaction_process_results(analyser_results, analysed_bounding_bo
1772
  def create_pikepdf_annotations_for_bounding_boxes(analysed_bounding_boxes):
1773
  pikepdf_redaction_annotations_on_page = []
1774
  for analysed_bounding_box in analysed_bounding_boxes:
1775
- #print("analysed_bounding_box:", analysed_bounding_boxes)
1776
 
1777
  bounding_box = analysed_bounding_box["boundingBox"]
1778
  annotation = Dictionary(
@@ -1997,7 +2077,6 @@ def redact_text_pdf(
1997
  pass
1998
  #print("Not redacting page:", page_no)
1999
 
2000
- #print("page_image_annotations after page", reported_page_number, "are", page_image_annotations)
2001
 
2002
  # Join extracted text outputs for all lines together
2003
  if not page_text_ocr_outputs.empty:
 
20
  from collections import defaultdict # For efficient grouping
21
 
22
  from tools.config import OUTPUT_FOLDER, IMAGES_DPI, MAX_IMAGE_PIXELS, RUN_AWS_FUNCTIONS, AWS_ACCESS_KEY, AWS_SECRET_KEY, AWS_REGION, PAGE_BREAK_VALUE, MAX_TIME_VALUE, LOAD_TRUNCATED_IMAGES, INPUT_FOLDER
23
+ from tools.custom_image_analyser_engine import CustomImageAnalyzerEngine, OCRResult, combine_ocr_results, CustomImageRecognizerResult, run_page_text_redaction, merge_text_bounding_boxes, recreate_page_line_level_ocr_results_with_page
24
+ from tools.file_conversion import convert_annotation_json_to_review_df, redact_whole_pymupdf_page, redact_single_box, convert_pymupdf_to_image_coords, is_pdf, is_pdf_or_image, prepare_image_or_pdf, divide_coordinates_by_page_sizes, multiply_coordinates_by_page_sizes, convert_annotation_data_to_dataframe, divide_coordinates_by_page_sizes, create_annotation_dicts_from_annotation_df, remove_duplicate_images_with_blank_boxes, fill_missing_ids, fill_missing_box_ids, load_and_convert_ocr_results_with_words_json
25
  from tools.load_spacy_model_custom_recognisers import nlp_analyser, score_threshold, custom_entities, custom_recogniser, custom_word_list_recogniser, CustomWordFuzzyRecognizer
26
  from tools.helper_functions import get_file_name_without_type, clean_unicode_text, tesseract_ocr_option, text_ocr_option, textract_option, local_pii_detector, aws_pii_detector, no_redaction_option
27
  from tools.aws_textract import analyse_page_with_textract, json_to_ocrresult, load_and_convert_textract_json
 
101
  input_folder:str=INPUT_FOLDER,
102
  total_textract_query_number:int=0,
103
  ocr_file_path:str="",
104
+ all_page_line_level_ocr_results = [],
105
+ all_page_line_level_ocr_results_with_words = [],
106
  prepare_images:bool=True,
107
  progress=gr.Progress(track_tqdm=True)):
108
  '''
 
151
  - review_file_path (str, optional): The latest review file path created by the app
152
  - input_folder (str, optional): The custom input path, if provided
153
  - total_textract_query_number (int, optional): The number of textract queries up until this point.
154
+ - ocr_file_path (str, optional): The latest ocr file path created by the app.
155
+ - all_page_line_level_ocr_results (list, optional): All line level text on the page with bounding boxes.
156
+ - all_page_line_level_ocr_results_with_words (list, optional): All word level text on the page with bounding boxes.
157
  - prepare_images (bool, optional): Boolean to determine whether to load images for the PDF.
158
  - progress (gr.Progress, optional): A progress tracker for the redaction process. Defaults to a Progress object with track_tqdm set to True.
159
 
 
  out_file_paths = []
  estimate_total_processing_time = 0
  estimated_time_taken_state = 0
+ comprehend_query_number = 0
+ total_textract_query_number = 0
+ elif current_loop_page == 0:
+ comprehend_query_number = 0
+ total_textract_query_number = 0
  # If not the first time around, and the current page loop has been set to a huge number (been through all pages), reset current page to 0
  elif (first_loop_state == False) & (current_loop_page == 999):
  current_loop_page = 0
+ total_textract_query_number = 0
+ comprehend_query_number = 0
 
  # Choose the correct file to prepare
  if isinstance(file_paths, str): file_paths_list = [os.path.abspath(file_paths)]
 
  elif out_message:
  combined_out_message = combined_out_message + '\n' + out_message
 
+ combined_out_message = re.sub(r'^\n+', '', combined_out_message).strip()
+
  # Only send across review file if redaction has been done
  if pii_identification_method != no_redaction_option:
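The added `re.sub` call above trims leading blank lines from the combined status message before it is shown. A quick standalone check of that exact expression (the message text here is invented):

```python
import re

# Hypothetical accumulated status message with leading blank lines.
combined_out_message = "\n\n\nRedacted document one.\nRedacted document two."
cleaned = re.sub(r'^\n+', '', combined_out_message).strip()
print(cleaned)
```

Since `.strip()` already removes leading newlines, the `re.sub` is arguably redundant for this input; `.strip()` alone would give the same result.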
 
 
  #review_file_path = [x for x in out_file_paths if "review_file" in x]
  if review_file_path: review_out_file_paths.append(review_file_path)
 
+ if not isinstance(pymupdf_doc, list):
+ number_of_pages = pymupdf_doc.page_count
+ if total_textract_query_number > number_of_pages:
+ total_textract_query_number = number_of_pages
+
  estimate_total_processing_time = sum_numbers_before_seconds(combined_out_message)
  print("Estimated total processing time:", str(estimate_total_processing_time))
 
+ return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_pages_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, pdf_image_file_paths, review_file_state, page_sizes, duplication_file_path_outputs, duplication_file_path_outputs, review_file_path, total_textract_query_number, ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_words
 
  #if first_loop_state == False:
  # Prepare documents and images as required if they don't already exist
 
  # Call prepare_image_or_pdf only if needed
+ if prepare_images_flag is not None:
+ out_message, prepared_pdf_file_paths, pdf_image_file_paths, annotate_max_pages, annotate_max_pages_bottom, pymupdf_doc, annotations_all_pages, review_file_state, document_cropboxes, page_sizes, textract_output_found, all_img_details_state, placeholder_ocr_results_df, local_ocr_output_found_checkbox = prepare_image_or_pdf(
  file_paths_loop, text_extraction_method, 0, out_message, True,
  annotate_max_pages, annotations_all_pages, document_cropboxes, redact_whole_page_list,
  output_folder, prepare_images=prepare_images_flag, page_sizes=page_sizes, input_folder=input_folder
 
  page_sizes = page_sizes_df.to_dict(orient="records")
 
  number_of_pages = pymupdf_doc.page_count
 
  # If we have reached the last page, return message and outputs
  if current_loop_page >= number_of_pages:
  print("Reached last page of document:", current_loop_page)
 
+ if total_textract_query_number > number_of_pages:
+ total_textract_query_number = number_of_pages
+
  # Set to a very high number so as not to mix up with subsequent file processing by the user
  current_loop_page = 999
  if out_message:
 
  #review_file_path = [x for x in out_file_paths if "review_file" in x]
  if review_file_path: review_out_file_paths.append(review_file_path)
 
+ return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = False, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_pages_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, pdf_image_file_paths, review_file_state, page_sizes, duplication_file_path_outputs, duplication_file_path_outputs, review_file_path, total_textract_query_number, ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_words
 
  # Load/create allow list
  # If string, assume file path
 
  # Try to connect to AWS services directly only if RUN_AWS_FUNCTIONS environmental variable is 1, otherwise an environment variable or direct textbox input is needed.
  if pii_identification_method == aws_pii_detector:
  if aws_access_key_textbox and aws_secret_key_textbox:
+ print("Connecting to Comprehend using AWS access key and secret key from user input.")
  comprehend_client = boto3.client('comprehend',
  aws_access_key_id=aws_access_key_textbox,
  aws_secret_access_key=aws_secret_key_textbox, region_name=AWS_REGION)
 
  # Try to connect to AWS Textract Client if using that text extraction method
  if text_extraction_method == textract_option:
  if aws_access_key_textbox and aws_secret_key_textbox:
+ print("Connecting to Textract using AWS access key and secret key from user input.")
  textract_client = boto3.client('textract',
  aws_access_key_id=aws_access_key_textbox,
  aws_secret_access_key=aws_secret_key_textbox, region_name=AWS_REGION)
 
  is_a_pdf = is_pdf(file_path) == True
  if is_a_pdf == False and text_extraction_method == text_ocr_option:
  # If user has not submitted a pdf, assume it's an image
+ print("File is not a PDF, assuming that image analysis needs to be used.")
  text_extraction_method = tesseract_ocr_option
  else:
  out_message = "No file selected"
 
  print("Redacting file " + pdf_file_name_with_ext + " as an image-based file")
 
+ pymupdf_doc, all_pages_decision_process_table, out_file_paths, new_textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_words = redact_image_pdf(file_path,
  pdf_image_file_paths,
  language,
  chosen_redact_entities,
 
  max_fuzzy_spelling_mistakes_num,
  match_fuzzy_whole_phrase_bool,
  page_sizes_df,
+ text_extraction_only,
+ all_page_line_level_ocr_results,
+ all_page_line_level_ocr_results_with_words,
  log_files_output_paths=log_files_output_paths,
  output_folder=output_folder)
 
 
  if not review_file_path: review_out_file_paths = [prepared_pdf_file_paths[-1]]
  else: review_out_file_paths = [prepared_pdf_file_paths[-1], review_file_path]
 
+ if total_textract_query_number > number_of_pages:
+ total_textract_query_number = number_of_pages
+
+ return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page, precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_pages_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, pdf_image_file_paths, review_file_state, page_sizes, duplication_file_path_outputs, duplication_file_path_outputs, review_file_path, total_textract_query_number, ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_words
 
  def convert_pikepdf_coords_to_pymupdf(pymupdf_page:Page, pikepdf_bbox, type="pikepdf_annot"):
  '''
 
  rect = Rect(pymupdf_x1, pymupdf_y1, pymupdf_x2, pymupdf_y2)
 
  convert_df = pd.DataFrame({
  "page": [page_no],
  "xmin": [pymupdf_x1],
 
  img_annotation_box = fill_missing_box_ids(img_annotation_box)
 
  all_image_annotation_boxes.append(img_annotation_box)
 
  # Redact the annotations from the document
 
  max_fuzzy_spelling_mistakes_num:int=1,
  match_fuzzy_whole_phrase_bool:bool=True,
  page_sizes_df:pd.DataFrame=pd.DataFrame(),
+ text_extraction_only:bool=False,
+ all_page_line_level_ocr_results = [],
+ all_page_line_level_ocr_results_with_words = [],
  page_break_val:int=int(PAGE_BREAK_VALUE),
  log_files_output_paths:List=[],
  max_time:int=int(MAX_TIME_VALUE),
 
  print(out_message_warning)
  #raise Exception(out_message)
 
  number_of_pages = pymupdf_doc.page_count
  print("Number of pages:", str(number_of_pages))
 
 
  textract_data, is_missing, log_files_output_paths = load_and_convert_textract_json(textract_json_file_path, log_files_output_paths, page_sizes_df)
  original_textract_data = textract_data.copy()
 
+ print("Successfully loaded in Textract analysis results from file")
+
+ # If running local OCR option, check if file already exists. If it does, load in existing data
+ if text_extraction_method == tesseract_ocr_option:
+ all_page_line_level_ocr_results_with_words_json_file_path = output_folder + file_name + "_ocr_results_with_words.json"
+ all_page_line_level_ocr_results_with_words, is_missing, log_files_output_paths = load_and_convert_ocr_results_with_words_json(all_page_line_level_ocr_results_with_words_json_file_path, log_files_output_paths, page_sizes_df)
+ original_all_page_line_level_ocr_results_with_words = all_page_line_level_ocr_results_with_words.copy()
+
+ print("Loaded in local OCR analysis results from file")
+
  ###
  if current_loop_page == 0: page_loop_start = 0
  else: page_loop_start = current_loop_page
 
  progress_bar = tqdm(range(page_loop_start, number_of_pages), unit="pages remaining", desc="Redacting pages")
 
  all_line_level_ocr_results_df_list = [all_line_level_ocr_results_df]
+ all_pages_decision_process_table_list = [all_pages_decision_process_table]
 
  # Go through each page
  for page_no in progress_bar:
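The hunk above reloads per-page OCR output cached by a previous run instead of re-running OCR. A minimal sketch of that load-if-present pattern; the helper name and file layout here are hypothetical stand-ins for `load_and_convert_ocr_results_with_words_json`:

```python
import json
import os
import tempfile

def load_cached_ocr_results(json_path: str):
    # Reuse a previous run's OCR output when the JSON file exists,
    # otherwise start with an empty result set and flag it as missing.
    if os.path.exists(json_path):
        with open(json_path) as f:
            return json.load(f), False  # (results, is_missing)
    return [], True

path = os.path.join(tempfile.mkdtemp(), "doc_ocr_results_with_words.json")
results, is_missing = load_cached_ocr_results(path)  # nothing cached yet

with open(path, "w") as f:
    json.dump([{"page": "1", "results": []}], f)
cached, is_missing_now = load_cached_ocr_results(path)  # second call finds the cache
```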
 
  handwriting_or_signature_boxes = []
  page_signature_recogniser_results = []
  page_handwriting_recogniser_results = []
+ page_line_level_ocr_results_with_words = []
  page_break_return = False
  reported_page_number = str(page_no + 1)
 
  # Try to find image location
  try:
 
  # Step 1: Perform OCR. Either with Tesseract, or with AWS Textract
 
+ # If using Tesseract
  if text_extraction_method == tesseract_ocr_option:
  #print("image_path:", image_path)
  #print("print(type(image_path)):", print(type(image_path)))
  #if not isinstance(image_path, image_path.image_path) or not isinstance(image_path, str): raise Exception("image_path object for page", reported_page_number, "not found, cannot perform local OCR analysis.")
 
+ # Check for an existing page_line_level_ocr_results_with_words object for this page
+ if all_page_line_level_ocr_results_with_words:
+ # Find the first dict where 'page' matches
+ print("All pages available:", [item.get('page') for item in all_page_line_level_ocr_results_with_words])
+
+ matching_page = next(
+ (item for item in all_page_line_level_ocr_results_with_words if int(item.get('page', -1)) == int(reported_page_number)),
+ None
+ )
+
+ page_line_level_ocr_results_with_words = matching_page if matching_page else []
+ else: page_line_level_ocr_results_with_words = []
+
+ if page_line_level_ocr_results_with_words:
+ print("Found OCR results for page in existing OCR with words object")
+ page_line_level_ocr_results = recreate_page_line_level_ocr_results_with_page(page_line_level_ocr_results_with_words)
+ else:
+ page_word_level_ocr_results = image_analyser.perform_ocr(image_path)
+ page_line_level_ocr_results, page_line_level_ocr_results_with_words = combine_ocr_results(page_word_level_ocr_results, page=reported_page_number)
+
+ all_page_line_level_ocr_results_with_words.append(page_line_level_ocr_results_with_words)
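The per-page lookup above scans the cached list of per-page dicts with `next()`, comparing page keys as integers since they may be stored as strings. Isolated with toy data, the pattern behaves like this:

```python
def find_page_results(all_pages, page_number):
    # Return the first per-page OCR dict whose 'page' key matches,
    # or None when the page has no cached entry.
    return next(
        (item for item in all_pages if int(item.get('page', -1)) == int(page_number)),
        None,
    )

cached = [{"page": "1", "results": ["line a"]}, {"page": "2", "results": ["line b"]}]
print(find_page_results(cached, 2))   # matches the string-keyed page "2"
print(find_page_results(cached, 5))   # no entry: None
```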
 
  # Check if page exists in existing textract data. If not, send to service to analyse
  if text_extraction_method == textract_option:
 
  # If the page exists, retrieve the data
  text_blocks = next(page['data'] for page in textract_data["pages"] if page['page_no'] == reported_page_number)
 
+ page_line_level_ocr_results, handwriting_or_signature_boxes, page_signature_recogniser_results, page_handwriting_recogniser_results, page_line_level_ocr_results_with_words = json_to_ocrresult(text_blocks, page_width, page_height, reported_page_number)
+
+ # Convert to DataFrame and add to the ongoing logging table
+ line_level_ocr_results_df = pd.DataFrame([{
+ 'page': page_line_level_ocr_results['page'],
+ 'text': result.text,
+ 'left': result.left,
+ 'top': result.top,
+ 'width': result.width,
+ 'height': result.height
+ } for result in page_line_level_ocr_results['results']])
+
+ all_line_level_ocr_results_df_list.append(line_level_ocr_results_df)
 
  if pii_identification_method != no_redaction_option:
  # Step 2: Analyse text and identify PII
  if chosen_redact_entities or chosen_redact_comprehend_entities:
 
  page_redaction_bounding_boxes, comprehend_query_number_new = image_analyser.analyze_text(
+ page_line_level_ocr_results['results'],
+ page_line_level_ocr_results_with_words['results'],
  chosen_redact_comprehend_entities = chosen_redact_comprehend_entities,
  pii_identification_method = pii_identification_method,
  comprehend_client=comprehend_client,
 
  else: page_redaction_bounding_boxes = []
 
  # Merge redaction bounding boxes that are close together
+ page_merged_redaction_bboxes = merge_img_bboxes(page_redaction_bounding_boxes, page_line_level_ocr_results_with_words['results'], page_signature_recogniser_results, page_handwriting_recogniser_results, handwrite_signature_checkbox)
 
  else: page_merged_redaction_bboxes = []
 
  # Assume image_path is an image
  image = image_path
 
  fill = (0, 0, 0) # Fill colour for redactions
  draw = ImageDraw.Draw(image)
 
  decision_process_table = fill_missing_ids(decision_process_table)
  #decision_process_table.to_csv("output/decision_process_table_with_ids.csv")
 
  toc = time.perf_counter()
 
  time_taken = toc - tic
 
  # Append new annotation if it doesn't exist
  annotations_all_pages.append(page_image_annotations)
 
  if text_extraction_method == textract_option:
  if original_textract_data != textract_data:
  # Write the updated existing textract data back to the JSON file
 
  if textract_json_file_path not in log_files_output_paths:
  log_files_output_paths.append(textract_json_file_path)
 
+ if text_extraction_method == tesseract_ocr_option:
+ if original_all_page_line_level_ocr_results_with_words != all_page_line_level_ocr_results_with_words:
+ # Write the updated local OCR data back to the JSON file
+ with open(all_page_line_level_ocr_results_with_words_json_file_path, 'w') as json_file:
+ json.dump(all_page_line_level_ocr_results_with_words, json_file, separators=(",", ":")) # compact separators keep the JSON file small
+
+ if all_page_line_level_ocr_results_with_words_json_file_path not in log_files_output_paths:
+ log_files_output_paths.append(all_page_line_level_ocr_results_with_words_json_file_path)
+
  all_pages_decision_process_table = pd.concat(all_pages_decision_process_table_list)
  all_line_level_ocr_results_df = pd.concat(all_line_level_ocr_results_df_list)
 
  current_loop_page += 1
 
+ return pymupdf_doc, all_pages_decision_process_table, log_files_output_paths, textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_words
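The save step above only rewrites the OCR JSON when the in-memory results differ from what was loaded, and uses compact separators rather than pretty-printing. A condensed, self-contained sketch of that pattern (function name and file path are invented for illustration):

```python
import json
import os
import tempfile

def save_ocr_results_if_changed(results, original, json_path, log_paths):
    # Rewrite the cache only when the results changed since loading;
    # separators=(",", ":") produces compact JSON with no whitespace.
    if results != original:
        with open(json_path, "w") as f:
            json.dump(results, f, separators=(",", ":"))
        if json_path not in log_paths:
            log_paths.append(json_path)
    return log_paths

path = os.path.join(tempfile.mkdtemp(), "doc_ocr_results_with_words.json")
logs = save_ocr_results_if_changed([{"page": "1"}], [], path, [])
```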
 
  # If it's an image file
  if is_pdf(file_path) == False:
 
  if textract_json_file_path not in log_files_output_paths:
  log_files_output_paths.append(textract_json_file_path)
 
+ if text_extraction_method == tesseract_ocr_option:
+ if original_all_page_line_level_ocr_results_with_words != all_page_line_level_ocr_results_with_words:
+ # Write the updated local OCR data back to the JSON file
+ with open(all_page_line_level_ocr_results_with_words_json_file_path, 'w') as json_file:
+ json.dump(all_page_line_level_ocr_results_with_words, json_file, separators=(",", ":")) # compact separators keep the JSON file small
+
+ if all_page_line_level_ocr_results_with_words_json_file_path not in log_files_output_paths:
+ log_files_output_paths.append(all_page_line_level_ocr_results_with_words_json_file_path)
+
  all_pages_decision_process_table = pd.concat(all_pages_decision_process_table_list)
  all_line_level_ocr_results_df = pd.concat(all_line_level_ocr_results_df_list)
 
+ return pymupdf_doc, all_pages_decision_process_table, log_files_output_paths, textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_words
 
  if text_extraction_method == textract_option:
  # Write the updated existing textract data back to the JSON file
 
  if textract_json_file_path not in log_files_output_paths:
  log_files_output_paths.append(textract_json_file_path)
 
+ if text_extraction_method == tesseract_ocr_option:
+ if original_all_page_line_level_ocr_results_with_words != all_page_line_level_ocr_results_with_words:
+ # Write the updated local OCR data back to the JSON file
+ with open(all_page_line_level_ocr_results_with_words_json_file_path, 'w') as json_file:
+ json.dump(all_page_line_level_ocr_results_with_words, json_file, separators=(",", ":")) # compact separators keep the JSON file small
+
+ if all_page_line_level_ocr_results_with_words_json_file_path not in log_files_output_paths:
+ log_files_output_paths.append(all_page_line_level_ocr_results_with_words_json_file_path)
+
  all_pages_decision_process_table = pd.concat(all_pages_decision_process_table_list)
  all_line_level_ocr_results_df = pd.concat(all_line_level_ocr_results_df_list)
 
+ # Convert decision table and OCR results to relative coordinates
  all_pages_decision_process_table = divide_coordinates_by_page_sizes(all_pages_decision_process_table, page_sizes_df, xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax")
 
  all_line_level_ocr_results_df = divide_coordinates_by_page_sizes(all_line_level_ocr_results_df, page_sizes_df, xmin="left", xmax="width", ymin="top", ymax="height")
 
+ return pymupdf_doc, all_pages_decision_process_table, log_files_output_paths, textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_words
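The two `divide_coordinates_by_page_sizes` calls above convert absolute pixel coordinates into page-relative 0-1 values before the results are returned. A simplified sketch of the idea; the `image_width`/`image_height` column names are assumptions for illustration, not the app's actual schema:

```python
import pandas as pd

def to_relative_coords(df: pd.DataFrame, page_sizes: pd.DataFrame) -> pd.DataFrame:
    # Attach each row's page dimensions, then scale the box coordinates
    # into the 0-1 range so they are resolution-independent.
    merged = df.merge(page_sizes, on="page", how="left")
    for col in ("xmin", "xmax"):
        merged[col] = merged[col] / merged["image_width"]
    for col in ("ymin", "ymax"):
        merged[col] = merged[col] / merged["image_height"]
    return merged.drop(columns=["image_width", "image_height"])

boxes = pd.DataFrame({"page": [1], "xmin": [100.0], "xmax": [200.0], "ymin": [50.0], "ymax": [100.0]})
sizes = pd.DataFrame({"page": [1], "image_width": [1000.0], "image_height": [500.0]})
relative = to_relative_coords(boxes, sizes)
```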
 
  ###
 
  for line in text_container
  if isinstance(line, LTTextLine) or isinstance(line, LTTextLineHorizontal)
  for char in line]
 
  return characters
  return []
 
  analysed_bounding_boxes_df_new = pd.concat([analysed_bounding_boxes_df_new, analysed_bounding_boxes_df_text], axis = 1)
  analysed_bounding_boxes_df_new['page'] = page_num + 1
 
  decision_process_table = pd.concat([decision_process_table, analysed_bounding_boxes_df_new], axis = 0).drop('result', axis=1)
 
  return decision_process_table
 
  def create_pikepdf_annotations_for_bounding_boxes(analysed_bounding_boxes):
  pikepdf_redaction_annotations_on_page = []
  for analysed_bounding_box in analysed_bounding_boxes:
 
  bounding_box = analysed_bounding_box["boundingBox"]
  annotation = Dictionary(
 
  pass
  #print("Not redacting page:", page_no)
 
  # Join extracted text outputs for all lines together
  if not page_text_ocr_outputs.empty:
tools/helper_functions.py CHANGED
@@ -9,7 +9,7 @@ import unicodedata
  from typing import List
  from math import ceil
  from gradio_image_annotation import image_annotator
- from tools.config import CUSTOM_HEADER_VALUE, CUSTOM_HEADER, OUTPUT_FOLDER, INPUT_FOLDER, SESSION_OUTPUT_FOLDER, AWS_USER_POOL_ID, TEXTRACT_BULK_ANALYSIS_INPUT_SUBFOLDER, TEXTRACT_BULK_ANALYSIS_OUTPUT_SUBFOLDER, TEXTRACT_JOBS_S3_LOC, TEXTRACT_JOBS_LOCAL_LOC
+ from tools.config import CUSTOM_HEADER_VALUE, CUSTOM_HEADER, OUTPUT_FOLDER, INPUT_FOLDER, SESSION_OUTPUT_FOLDER, AWS_USER_POOL_ID, TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER, TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_OUTPUT_SUBFOLDER, TEXTRACT_JOBS_S3_LOC, TEXTRACT_JOBS_LOCAL_LOC
 
  # Names for options labels
  text_ocr_option = "Local model - selectable text"
@@ -39,6 +39,12 @@ def reset_ocr_results_state():
  def reset_review_vars():
  return pd.DataFrame(), pd.DataFrame()
 
+ def reset_data_vars():
+ return 0, [], 0
+
+ def reset_aws_call_vars():
+ return 0, 0
+
  def load_in_default_allow_list(allow_list_file_path):
  if isinstance(allow_list_file_path, str):
  allow_list_file_path = [allow_list_file_path]
@@ -201,9 +207,6 @@ def put_columns_in_df(in_file:List[str]):
  df = pd.read_excel(file_name, sheet_name=sheet_name)
 
  # Process the DataFrame (e.g., print its contents)
- print(f"Sheet Name: {sheet_name}")
- print(df.head()) # Print the first few rows
-
  new_choices.extend(list(df.columns))
 
  all_sheet_names.extend(new_sheet_names)
@@ -226,7 +229,17 @@ def check_for_existing_textract_file(doc_file_name_no_extension_textbox:str, out
  textract_output_path = os.path.join(output_folder, doc_file_name_no_extension_textbox + "_textract.json")
 
  if os.path.exists(textract_output_path):
- print("Existing Textract file found.")
+ print("Existing Textract analysis output file found.")
+ return True
+
+ else:
+ return False
+
+ def check_for_existing_local_ocr_file(doc_file_name_no_extension_textbox:str, output_folder:str=OUTPUT_FOLDER):
+ local_ocr_output_path = os.path.join(output_folder, doc_file_name_no_extension_textbox + "_ocr_results_with_words.json")
+
+ if os.path.exists(local_ocr_output_path):
+ print("Existing local OCR analysis output file found.")
  return True
 
  else:
@@ -306,8 +319,8 @@ async def get_connection_params(request: gr.Request,
  output_folder_textbox:str=OUTPUT_FOLDER,
  input_folder_textbox:str=INPUT_FOLDER,
  session_output_folder:str=SESSION_OUTPUT_FOLDER,
- textract_document_upload_input_folder:str=TEXTRACT_BULK_ANALYSIS_INPUT_SUBFOLDER,
- textract_document_upload_output_folder:str=TEXTRACT_BULK_ANALYSIS_OUTPUT_SUBFOLDER,
+ textract_document_upload_input_folder:str=TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER,
+ textract_document_upload_output_folder:str=TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_OUTPUT_SUBFOLDER,
  s3_textract_document_logs_subfolder:str=TEXTRACT_JOBS_S3_LOC,
  local_textract_document_logs_subfolder:str=TEXTRACT_JOBS_LOCAL_LOC):
@@ -477,9 +490,10 @@ def calculate_time_taken(number_of_pages:str,
  pii_identification_method:str,
  textract_output_found_checkbox:bool,
  only_extract_text_radio:bool,
+ local_ocr_output_found_checkbox:bool,
  convert_page_time:float=0.5,
- textract_page_time:float=1,
- comprehend_page_time:float=1,
+ textract_page_time:float=1.2,
+ comprehend_page_time:float=1.2,
  local_text_extraction_page_time:float=0.3,
  local_pii_redaction_page_time:float=0.5,
  local_ocr_extraction_page_time:float=1.5,
@@ -494,7 +508,9 @@ def calculate_time_taken(number_of_pages:str,
  - number_of_pages: The number of pages in the uploaded document(s).
  - text_extract_method_radio: The method of text extraction.
  - pii_identification_method_drop: The method of personally-identifiable information removal.
+ - textract_output_found_checkbox (bool, optional): Boolean indicating if AWS Textract text extraction outputs have been found.
  - only_extract_text_radio (bool, optional): Option to only extract text from the document rather than redact.
+ - local_ocr_output_found_checkbox (bool, optional): Boolean indicating if local OCR text extraction outputs have been found.
  - textract_page_time (float, optional): Approximate time to query AWS Textract.
  - comprehend_page_time (float, optional): Approximate time to query text on a page with AWS Comprehend.
  - local_text_redaction_page_time (float, optional): Approximate time to extract text on a page with the local text redaction option.
@@ -522,7 +538,8 @@
  if textract_output_found_checkbox != True:
  page_extraction_time_taken = number_of_pages * textract_page_time
  elif text_extract_method_radio == local_ocr_option:
- page_extraction_time_taken = number_of_pages * local_ocr_extraction_page_time
+ if local_ocr_output_found_checkbox != True:
+ page_extraction_time_taken = number_of_pages * local_ocr_extraction_page_time
  elif text_extract_method_radio == text_ocr_option:
  page_conversion_time_taken = number_of_pages * local_text_extraction_page_time
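The `calculate_time_taken` changes above skip the per-page extraction cost when a cached output has already been found. A hypothetical condensation of that estimate logic (option names and structure simplified from the diff; the per-page timings match its defaults):

```python
def estimate_extraction_time(number_of_pages, method, cached_output_found,
                             textract_page_time=1.2, local_ocr_page_time=1.5):
    # Per-page extraction cost applies only when no cached output exists.
    if method == "textract" and not cached_output_found:
        return number_of_pages * textract_page_time
    if method == "local_ocr" and not cached_output_found:
        return number_of_pages * local_ocr_page_time
    return 0.0

print(estimate_extraction_time(10, "local_ocr", False))  # 15.0
print(estimate_extraction_time(10, "local_ocr", True))   # 0.0
```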
 
tools/redaction_review.py CHANGED
@@ -6,12 +6,11 @@ import numpy as np
6
  from xml.etree.ElementTree import Element, SubElement, tostring, parse
7
  from xml.dom import minidom
8
  import uuid
9
- from typing import List
10
  from gradio_image_annotation import image_annotator
11
  from gradio_image_annotation.image_annotator import AnnotatedImageData
12
  from pymupdf import Document, Rect
13
  import pymupdf
14
- #from fitz
15
  from PIL import ImageDraw, Image
16
 
17
  from tools.config import OUTPUT_FOLDER, CUSTOM_BOX_COLOUR, MAX_IMAGE_PIXELS, INPUT_FOLDER
@@ -55,7 +54,6 @@ def update_zoom(current_zoom_level:int, annotate_current_page:int, decrease:bool
55
 
56
  return current_zoom_level, annotate_current_page
57
 
58
-
59
  def update_dropdown_list_based_on_dataframe(df:pd.DataFrame, column:str) -> List["str"]:
60
  '''
61
  Gather unique elements from a string pandas Series, then append 'ALL' to the start and return the list.
@@ -166,49 +164,205 @@ def update_recogniser_dataframes(page_image_annotator_object:AnnotatedImageData,
166
 
167
  return recogniser_entities_list, recogniser_dataframe_out_gr, recogniser_dataframe_out, recogniser_entities_drop, text_entities_drop, page_entities_drop
168
 
169
- def undo_last_removal(backup_review_state, backup_image_annotations_state, backup_recogniser_entity_dataframe_base):
170
  return backup_review_state, backup_image_annotations_state, backup_recogniser_entity_dataframe_base
171
 
172
- def update_annotator_page_from_review_df(review_df: pd.DataFrame,
173
- image_file_paths:List[str],
174
- page_sizes:List[dict],
175
- current_page:int,
176
- previous_page:int,
177
- current_image_annotations_state:List[str],
178
- current_page_annotator:object):
 
 
 
179
  '''
180
- Update the visible annotation object with the latest review file information
 
181
  '''
182
- out_image_annotations_state = current_image_annotations_state
183
- out_current_page_annotator = current_page_annotator
184
- gradio_annotator_current_page_number = current_page
185
 
186
  if not review_df.empty:
187
- #print("review_df just before convert_review_df:", review_df)
188
- # First, check that the image on the current page is valid, replace with what exists in page_sizes object if not
189
- if not gradio_annotator_current_page_number: gradio_annotator_current_page_number = 0
190
 
191
- # Check bounding values for current page and page max
192
- if gradio_annotator_current_page_number > 0: page_num_reported = gradio_annotator_current_page_number
193
- elif gradio_annotator_current_page_number == 0: page_num_reported = 1 # minimum possible reported page is 1
194
- else:
195
- gradio_annotator_current_page_number = 0
196
- page_num_reported = 1
197
 
198
- # Ensure page displayed can't exceed number of pages in document
199
- page_max_reported = len(out_image_annotations_state)
200
- if page_num_reported > page_max_reported: page_num_reported = page_max_reported
201
 
202
- page_num_reported_zero_indexed = page_num_reported - 1
203
- out_image_annotations_state = convert_review_df_to_annotation_json(review_df, image_file_paths, page_sizes)
 
 
 
 
204
 
205
- page_image_annotator_object, out_image_annotations_state = replace_images_in_image_annotation_object(out_image_annotations_state, out_image_annotations_state[page_num_reported_zero_indexed], page_sizes, page_num_reported)
206
 
207
- out_image_annotations_state[page_num_reported_zero_indexed] = page_image_annotator_object
208
 
209
- out_current_page_annotator = out_image_annotations_state[page_num_reported_zero_indexed]
210
 
211
- return out_current_page_annotator, out_image_annotations_state
212
 
213
  def exclude_selected_items_from_redaction(review_df: pd.DataFrame,
214
  selected_rows_df: pd.DataFrame,
@@ -216,7 +370,7 @@ def exclude_selected_items_from_redaction(review_df: pd.DataFrame,
216
  page_sizes:List[dict],
217
  image_annotations_state:dict,
218
  recogniser_entity_dataframe_base:pd.DataFrame):
219
- '''
220
  Remove selected items from the review dataframe from the annotation object and review dataframe.
221
  '''
222
 
@@ -253,149 +407,267 @@ def exclude_selected_items_from_redaction(review_df: pd.DataFrame,
253
 
254
  return out_review_df, out_image_annotations_state, out_recogniser_entity_dataframe_base, backup_review_state, backup_image_annotations_state, backup_recogniser_entity_dataframe_base
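`exclude_selected_items_from_redaction` removes the selected rows from the review dataframe; its body is elided in this hunk. One common way to implement the removal in pandas is a left anti-join via `merge(..., indicator=True)` — a hedged sketch with illustrative column names, not the repo's actual code:

```python
import pandas as pd

# Left anti-join: keep only review rows whose id is NOT in the selection
review_df = pd.DataFrame({"id": ["a", "b", "c"], "text": ["x", "y", "z"]})
selected = pd.DataFrame({"id": ["b"]})
merged = review_df.merge(selected, on="id", how="left", indicator=True)
out = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(list(out["id"]))
```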
255
 
256
- def update_annotator_object_and_filter_df(
257
- all_image_annotations:List[AnnotatedImageData],
258
- gradio_annotator_current_page_number:int,
259
- recogniser_entities_dropdown_value:str="ALL",
260
- page_dropdown_value:str="ALL",
261
- text_dropdown_value:str="ALL",
262
- recogniser_dataframe_base:gr.Dataframe=gr.Dataframe(pd.DataFrame(data={"page":[], "label":[], "text":[], "id":[]}), type="pandas", headers=["page", "label", "text", "id"], show_fullscreen_button=True, wrap=True, show_search='filter', max_height=400, static_columns=[0,1,2,3]),
263
- zoom:int=100,
264
- review_df:pd.DataFrame=[],
265
- page_sizes:List[dict]=[],
266
- doc_full_file_name_textbox:str='',
267
- input_folder:str=INPUT_FOLDER):
268
- '''
269
- Update a gradio_image_annotation object with new annotation data.
270
- '''
271
- zoom_str = str(zoom) + '%'
272
-
273
- #print("all_image_annotations at start of update_annotator_object_and_filter_df[-1]:", all_image_annotations[-1])
274
-
275
- if not gradio_annotator_current_page_number: gradio_annotator_current_page_number = 0
276
-
277
- # Check bounding values for current page and page max
278
- if gradio_annotator_current_page_number > 0: page_num_reported = gradio_annotator_current_page_number
279
- elif gradio_annotator_current_page_number == 0: page_num_reported = 1 # minimum possible reported page is 1
280
- else:
281
- gradio_annotator_current_page_number = 0
282
- page_num_reported = 1
283
 
284
- # Ensure page displayed can't exceed number of pages in document
285
- page_max_reported = len(all_image_annotations)
286
- if page_num_reported > page_max_reported: page_num_reported = page_max_reported
287
 
288
- page_num_reported_zero_indexed = page_num_reported - 1
289
 
290
- # First, check that the image on the current page is valid, replace with what exists in page_sizes object if not
291
- page_image_annotator_object, all_image_annotations = replace_images_in_image_annotation_object(all_image_annotations, all_image_annotations[page_num_reported_zero_indexed], page_sizes, page_num_reported)
292
 
293
- all_image_annotations[page_num_reported_zero_indexed] = page_image_annotator_object
294
-
295
- current_image_path = all_image_annotations[page_num_reported_zero_indexed]['image']
296
 
297
- # If image path is still not valid, load in a new image an overwrite it. Then replace all items in the image annotation object for all pages based on the updated information.
298
- page_sizes_df = pd.DataFrame(page_sizes)
299
 
300
- if not os.path.exists(current_image_path):
301
 
302
- page_num, replaced_image_path, width, height = process_single_page_for_image_conversion(doc_full_file_name_textbox, page_num_reported_zero_indexed, input_folder=input_folder)
303
 
304
- # Overwrite page_sizes values
305
- page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"] = width
306
- page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"] = height
307
- page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_path"] = replaced_image_path
308
-
309
- else:
310
- if not page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"].isnull().all():
311
- width = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"].max()
312
- height = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"].max()
313
- else:
314
- image = Image.open(current_image_path)
315
- width = image.width
316
- height = image.height
317
 
 
318
  page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"] = width
319
  page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"] = height
320
 
321
- page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_path"] = current_image_path
322
 
323
- replaced_image_path = current_image_path
324
 
325
- if review_df.empty: review_df = pd.DataFrame(columns=["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax", "text", "id"])
326
- review_df.loc[review_df["page"]==page_num_reported, 'image'] = replaced_image_path
327
 
328
- # Update dropdowns and review selection dataframe with the updated annotator object
329
- recogniser_entities_list, recogniser_dataframe_out_gr, recogniser_dataframe_modified, recogniser_entities_dropdown_value, text_entities_drop, page_entities_drop = update_recogniser_dataframes(all_image_annotations, recogniser_dataframe_base, recogniser_entities_dropdown_value, text_dropdown_value, page_dropdown_value, review_df.copy(), page_sizes)
330
-
331
- recogniser_colour_list = [(0, 0, 0) for _ in range(len(recogniser_entities_list))]
332
 
333
- # page_sizes_df has been changed - save back to page_sizes_object
334
- page_sizes = page_sizes_df.to_dict(orient='records')
335
 
336
- images_list = list(page_sizes_df["image_path"])
337
- images_list[page_num_reported_zero_indexed] = replaced_image_path
338
 
339
- all_image_annotations[page_num_reported_zero_indexed]['image'] = replaced_image_path
340
 
341
- # Multiply out image_annotation coordinates from relative to absolute if necessary
342
- all_image_annotations_df = convert_annotation_data_to_dataframe(all_image_annotations)
343
 
344
- all_image_annotations_df = multiply_coordinates_by_page_sizes(all_image_annotations_df, page_sizes_df, xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax")
345
 
346
- #print("all_image_annotations_df[-1] just before creating annotation dicts:", all_image_annotations_df.iloc[-1, :])
347
 
348
- all_image_annotations = create_annotation_dicts_from_annotation_df(all_image_annotations_df, page_sizes)
349
 
350
- #print("all_image_annotations[-1] after creating annotation dicts:", all_image_annotations[-1])
351
 
352
 
353
 
354
- # Remove blank duplicate entries
355
- all_image_annotations = remove_duplicate_images_with_blank_boxes(all_image_annotations)
356
 
357
- current_page_image_annotator_object = all_image_annotations[page_num_reported_zero_indexed]
358
 
359
- #print("current_page_image_annotator_object that goes into annotator object:", current_page_image_annotator_object)
360
 
361
- page_number_reported_gradio = gr.Number(label = "Current page", value=page_num_reported, precision=0)
362
 
363
- ###
364
- # If no data, present a blank page
365
- if not all_image_annotations:
366
- print("No all_image_annotation object found")
367
- page_num_reported = 1
368
 
369
- out_image_annotator = image_annotator(
370
- value = None,
371
- boxes_alpha=0.1,
372
- box_thickness=1,
373
- label_list=recogniser_entities_list,
374
- label_colors=recogniser_colour_list,
375
- show_label=False,
376
- height=zoom_str,
377
- width=zoom_str,
378
- box_min_size=1,
379
- box_selected_thickness=2,
380
- handle_size=4,
381
- sources=None,#["upload"],
382
- show_clear_button=False,
383
- show_share_button=False,
384
- show_remove_button=False,
385
- handles_cursor=True,
386
- interactive=True,
387
- use_default_label=True
388
- )
389
-
390
- return out_image_annotator, page_number_reported_gradio, page_number_reported_gradio, page_num_reported, recogniser_entities_dropdown_value, recogniser_dataframe_out_gr, recogniser_dataframe_modified, text_entities_drop, page_entities_drop, page_sizes, all_image_annotations
391
-
392
  else:
393
- ### Present image_annotator outputs
394
  out_image_annotator = image_annotator(
395
  value = current_page_image_annotator_object,
396
  boxes_alpha=0.1,
397
  box_thickness=1,
398
- label_list=recogniser_entities_list,
399
  label_colors=recogniser_colour_list,
400
  show_label=False,
401
  height=zoom_str,
@@ -408,41 +680,23 @@ def update_annotator_object_and_filter_df(
408
  show_share_button=False,
409
  show_remove_button=False,
410
  handles_cursor=True,
411
- interactive=True
412
  )
413
 
414
- #print("all_image_annotations at end of update_annotator...:", all_image_annotations)
415
- #print("review_df at end of update_annotator_object:", review_df)
416
-
417
- return out_image_annotator, page_number_reported_gradio, page_number_reported_gradio, page_num_reported, recogniser_entities_dropdown_value, recogniser_dataframe_out_gr, recogniser_dataframe_modified, text_entities_drop, page_entities_drop, page_sizes, all_image_annotations
418
-
419
- def replace_images_in_image_annotation_object(
420
- all_image_annotations:List[dict],
421
- page_image_annotator_object:AnnotatedImageData,
422
- page_sizes:List[dict],
423
- page:int):
424
-
425
- '''
426
- Check if the image value in an AnnotatedImageData dict is a placeholder or np.array. If either of these, replace the value with the file path of the image that is hopefully already loaded into the app related to this page.
427
- '''
428
-
429
- page_zero_index = page - 1
430
-
431
- if isinstance(all_image_annotations[page_zero_index]["image"], np.ndarray) or "placeholder_image" in all_image_annotations[page_zero_index]["image"] or isinstance(page_image_annotator_object['image'], np.ndarray):
432
- page_sizes_df = pd.DataFrame(page_sizes)
433
- page_sizes_df[["page"]] = page_sizes_df[["page"]].apply(pd.to_numeric, errors="coerce")
434
-
435
- # Check for matching pages
436
- matching_paths = page_sizes_df.loc[page_sizes_df['page'] == page, "image_path"].unique()
437
-
438
- if matching_paths.size > 0:
439
- image_path = matching_paths[0]
440
- page_image_annotator_object['image'] = image_path
441
- all_image_annotations[page_zero_index]["image"] = image_path
442
- else:
443
- print(f"No image path found for page {page}.")
444
-
445
- return page_image_annotator_object, all_image_annotations
446
 
447
  def update_all_page_annotation_object_based_on_previous_page(
448
  page_image_annotator_object:AnnotatedImageData,
@@ -459,12 +713,9 @@ def update_all_page_annotation_object_based_on_previous_page(
459
  previous_page_zero_index = previous_page -1
460
 
461
  if not current_page: current_page = 1
462
-
463
- #print("page_image_annotator_object at start of update_all_page_annotation_object:", page_image_annotator_object)
464
-
465
- page_image_annotator_object, all_image_annotations = replace_images_in_image_annotation_object(all_image_annotations, page_image_annotator_object, page_sizes, previous_page)
466
-
467
- #print("page_image_annotator_object after replace_images in update_all_page_annotation_object:", page_image_annotator_object)
468
 
469
  if clear_all == False: all_image_annotations[previous_page_zero_index] = page_image_annotator_object
470
  else: all_image_annotations[previous_page_zero_index]["boxes"] = []
@@ -493,7 +744,7 @@ def apply_redactions_to_review_df_and_files(page_image_annotator_object:Annotate
493
  page_image_annotator_object = all_image_annotations[current_page - 1]
494
 
495
  # This replaces the numpy array image object with the image file path
496
- page_image_annotator_object, all_image_annotations = replace_images_in_image_annotation_object(all_image_annotations, page_image_annotator_object, page_sizes, current_page)
497
  page_image_annotator_object['image'] = all_image_annotations[current_page - 1]["image"]
498
 
499
  if not page_image_annotator_object:
@@ -529,7 +780,7 @@ def apply_redactions_to_review_df_and_files(page_image_annotator_object:Annotate
529
  # Check if all elements are integers in the range 0-255
530
  if all(isinstance(c, int) and 0 <= c <= 255 for c in fill):
531
  pass
532
- #print("fill:", fill)
533
  else:
534
  print(f"Invalid color values: {fill}. Defaulting to black.")
535
  fill = (0, 0, 0) # Default to black if invalid
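The validation above accepts a fill colour only when every element is an int in 0-255, falling back to black otherwise. A self-contained sketch of the same check (the helper name is hypothetical):

```python
def validate_fill(fill):
    # Accept only a 3-element tuple/list of ints in 0-255;
    # otherwise default to black, as in the hunk above
    if isinstance(fill, (tuple, list)) and len(fill) == 3 and all(
        isinstance(c, int) and 0 <= c <= 255 for c in fill
    ):
        return tuple(fill)
    return (0, 0, 0)

print(validate_fill((255, 0, 0)))   # valid colour passes through
print(validate_fill(("red", 0)))    # invalid input falls back to black
```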
@@ -553,7 +804,6 @@ def apply_redactions_to_review_df_and_files(page_image_annotator_object:Annotate
553
  doc = [image]
554
 
555
  elif file_extension in '.csv':
556
- #print("This is a csv")
557
  pdf_doc = []
558
 
559
  # If working with pdfs
@@ -797,11 +1047,9 @@ def df_select_callback(df: pd.DataFrame, evt: gr.SelectData):
797
 
798
  row_value_df = pd.DataFrame(data={"page":[row_value_page], "label":[row_value_label], "text":[row_value_text], "id":[row_value_id]})
799
 
800
- return row_value_page, row_value_df
801
 
802
  def df_select_callback_textract_api(df: pd.DataFrame, evt: gr.SelectData):
803
-
804
- #print("evt.data:", evt._data)
805
 
806
  row_value_job_id = evt.row_value[0] # This is the page number value
807
  # row_value_label = evt.row_value[1] # This is the label number value
@@ -829,59 +1077,108 @@ def df_select_callback_ocr(df: pd.DataFrame, evt: gr.SelectData):
829
 
830
  return row_value_page, row_value_df
831
 
832
- def update_selected_review_df_row_colour(redaction_row_selection:pd.DataFrame, review_df:pd.DataFrame, previous_id:str="", previous_colour:str='(0, 0, 0)', page_sizes:List[dict]=[], output_folder:str=OUTPUT_FOLDER, colour:str='(1, 0, 255)'):
833
  '''
834
  Update the colour of a single redaction box based on the values in a selection row
835
  '''
836
- colour_tuple = str(tuple(colour))
837
 
838
- if "color" not in review_df.columns: review_df["color"] = '(0, 0, 0)'
839
  if "id" not in review_df.columns:
840
- review_df = fill_missing_ids(review_df)
841
 
842
- # Reset existing highlight colours
843
- review_df.loc[review_df["id"]==previous_id, "color"] = review_df.loc[review_df["id"]==previous_id, "color"].apply(lambda _: previous_colour)
844
- review_df.loc[review_df["color"].astype(str)==colour, "color"] = review_df.loc[review_df["color"].astype(str)==colour, "color"].apply(lambda _: '(0, 0, 0)')
845
 
846
  if not redaction_row_selection.empty and not review_df.empty:
847
  use_id = (
848
- "id" in redaction_row_selection.columns
849
- and "id" in review_df.columns
850
- and not redaction_row_selection["id"].isnull().all()
851
  and not review_df["id"].isnull().all()
852
  )
853
 
854
- selected_merge_cols = ["id"] if use_id else ["label", "page", "text"]
855
 
856
- review_df = review_df.merge(redaction_row_selection[selected_merge_cols], on=selected_merge_cols, indicator=True, how="left")
857
 
858
- if "_merge" in review_df.columns:
859
- filtered_reviews = review_df.loc[review_df["_merge"]=="both"]
860
- else:
861
- filtered_reviews = pd.DataFrame()
862
 
863
- if not filtered_reviews.empty:
864
- previous_colour = str(filtered_reviews["color"].values[0])
865
- previous_id = filtered_reviews["id"].values[0]
866
- review_df.loc[review_df["_merge"]=="both", "color"] = review_df.loc[review_df["_merge"] == "both", "color"].apply(lambda _: colour)
867
  else:
868
- # Handle the case where no rows match the condition
869
- print("No reviews found with _merge == 'both'")
870
- previous_colour = '(0, 0, 0)'
871
- review_df.loc[review_df["color"]==colour, "color"] = previous_colour
872
- previous_id =''
873
 
874
- review_df.drop("_merge", axis=1, inplace=True)
875
 
876
- # Ensure that all output coordinates are in proportional size
877
- #page_sizes_df = pd.DataFrame(page_sizes)
878
- #page_sizes_df .loc[:, "page"] = pd.to_numeric(page_sizes_df["page"], errors="coerce")
879
- #print("review_df before divide:", review_df)
880
- #print("page_sizes_df before divide:", page_sizes_df)
881
- #review_df = divide_coordinates_by_page_sizes(review_df, page_sizes_df)
882
- #print("review_df after divide:", review_df)
883
 
884
- review_df = review_df[["image", "page", "label", "color", "xmin","ymin", "xmax", "ymax", "text", "id"]]
885
 
886
  return review_df, previous_id, previous_colour
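The hunk above highlights the selected redaction by merging the selection into the review dataframe with `indicator=True` and recolouring rows where `_merge == "both"`. A minimal reproduction of that pattern (sample data and colour strings are illustrative):

```python
import pandas as pd

# Rows present in the selection get the highlight colour;
# all other rows keep their existing colour
review = pd.DataFrame({"id": ["a", "b"], "color": ["(0, 0, 0)", "(0, 0, 0)"]})
selection = pd.DataFrame({"id": ["b"]})
merged = review.merge(selection, on="id", how="left", indicator=True)
merged.loc[merged["_merge"] == "both", "color"] = "(1, 0, 255)"
merged = merged.drop(columns="_merge")
print(merged)
```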
887
 
@@ -988,8 +1285,6 @@ def create_xfdf(review_file_df:pd.DataFrame, pdf_path:str, pymupdf_doc:object, i
988
  page_sizes_df = pd.DataFrame(page_sizes)
989
 
990
  # If there are no image coordinates, then convert coordinates to pymupdf coordinates prior to export
991
- #print("Using pymupdf coordinates for conversion.")
992
-
993
  pages_are_images = False
994
 
995
  if "mediabox_width" not in review_file_df.columns:
@@ -1041,33 +1336,9 @@ def create_xfdf(review_file_df:pd.DataFrame, pdf_path:str, pymupdf_doc:object, i
1041
  raise ValueError(f"Invalid cropbox format: {document_cropboxes[page_python_format]}")
1042
  else:
1043
  print("Document cropboxes not found.")
1044
-
1045
 
1046
  pdf_page_height = pymupdf_page.mediabox.height
1047
- pdf_page_width = pymupdf_page.mediabox.width
1048
-
1049
- # Check if image dimensions for page exist in page_sizes_df
1050
- # image_dimensions = {}
1051
-
1052
- # image_dimensions['image_width'] = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"].max()
1053
- # image_dimensions['image_height'] = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"].max()
1054
-
1055
- # if pd.isna(image_dimensions['image_width']):
1056
- # image_dimensions = {}
1057
-
1058
- # image = image_paths[page_python_format]
1059
-
1060
- # if image_dimensions:
1061
- # image_page_width, image_page_height = image_dimensions["image_width"], image_dimensions["image_height"]
1062
- # if isinstance(image, str) and 'placeholder' not in image:
1063
- # image = Image.open(image)
1064
- # image_page_width, image_page_height = image.size
1065
- # else:
1066
- # try:
1067
- # image = Image.open(image)
1068
- # image_page_width, image_page_height = image.size
1069
- # except Exception as e:
1070
- # print("Could not get image sizes due to:", e)
1071
 
1072
  # Create redaction annotation
1073
  redact_annot = SubElement(annots, 'redact')
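`create_xfdf` builds the Adobe export tree by nesting elements with `xml.etree.ElementTree.SubElement`, as in the `redact` annotation above. A minimal sketch of that construction (the attribute shown is a placeholder, not the full XFDF schema):

```python
from xml.etree.ElementTree import Element, SubElement, tostring

# Build a tiny XFDF-style tree: root -> annots -> redact
xfdf = Element("xfdf")
annots = SubElement(xfdf, "annots")
redact = SubElement(annots, "redact")
redact.set("page", "0")
print(tostring(xfdf).decode())
```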
@@ -1345,8 +1616,6 @@ def convert_xfdf_to_dataframe(file_paths_list:List[str], pymupdf_doc, image_path
1345
  # Optionally, you can add the image path or other relevant information
1346
  df.loc[_, 'image'] = image_path
1347
 
1348
- #print('row:', row)
1349
-
1350
  out_file_path = output_folder + file_path_name + "_review_file.csv"
1351
  df.to_csv(out_file_path, index=None)
1352
 
 
6
  from xml.etree.ElementTree import Element, SubElement, tostring, parse
7
  from xml.dom import minidom
8
  import uuid
9
+ from typing import List, Tuple
10
  from gradio_image_annotation import image_annotator
11
  from gradio_image_annotation.image_annotator import AnnotatedImageData
12
  from pymupdf import Document, Rect
13
  import pymupdf
 
14
  from PIL import ImageDraw, Image
15
 
16
  from tools.config import OUTPUT_FOLDER, CUSTOM_BOX_COLOUR, MAX_IMAGE_PIXELS, INPUT_FOLDER
 
54
 
55
  return current_zoom_level, annotate_current_page
56
 
57
  def update_dropdown_list_based_on_dataframe(df:pd.DataFrame, column:str) -> List["str"]:
58
  '''
59
  Gather unique elements from a string pandas Series, then append 'ALL' to the start and return the list.
 
164
 
165
  return recogniser_entities_list, recogniser_dataframe_out_gr, recogniser_dataframe_out, recogniser_entities_drop, text_entities_drop, page_entities_drop
166
 
167
+ def undo_last_removal(backup_review_state:pd.DataFrame, backup_image_annotations_state:list[dict], backup_recogniser_entity_dataframe_base:pd.DataFrame):
168
  return backup_review_state, backup_image_annotations_state, backup_recogniser_entity_dataframe_base
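`undo_last_removal` implements undo by simply returning the backup states captured before the last removal. The snapshot-then-restore pattern it relies on can be sketched as follows (the `remove_rows` helper is hypothetical, shown only to illustrate the pattern):

```python
import pandas as pd

def remove_rows(df, ids):
    # Snapshot the dataframe before the destructive edit
    backup = df.copy()
    out = df[~df["id"].isin(ids)]
    return out, backup

def undo_last_removal(backup):
    # Undo is just handing the snapshot back
    return backup

df = pd.DataFrame({"id": ["a", "b"]})
df2, backup = remove_rows(df, ["b"])
restored = undo_last_removal(backup)
print(len(df2), len(restored))
```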
169
 
170
+ def update_annotator_page_from_review_df(
171
+ review_df: pd.DataFrame,
172
+ image_file_paths:List[str], # Note: This input doesn't seem used in the original logic flow after the first line was removed
173
+ page_sizes:List[dict],
174
+ current_image_annotations_state:List[str], # This should ideally be List[dict] based on its usage
175
+ current_page_annotator:object, # Should be dict or a custom annotation object for one page
176
+ selected_recogniser_entity_df_row:pd.DataFrame,
177
+ input_folder:str,
178
+ doc_full_file_name_textbox:str
179
+ ) -> Tuple[object, List[dict], int, List[dict], pd.DataFrame, int]: # Correcting return types based on usage
180
  '''
181
+ Update the visible annotation object and related objects with the latest review file information,
182
+ optimizing by processing only the current page's data.
183
  '''
184
+ # Assume current_image_annotations_state is List[dict] and current_page_annotator is dict
185
+ out_image_annotations_state: List[dict] = list(current_image_annotations_state) # Make a copy to avoid modifying input in place
186
+ out_current_page_annotator: dict = current_page_annotator
187
+
188
+ # Get the target page number from the selected row
189
+ # Safely access the page number, handling potential errors or empty DataFrame
190
+ gradio_annotator_current_page_number: int = 0
191
+ annotate_previous_page: int = 0 # Renaming for clarity if needed, matches original output
192
+ if not selected_recogniser_entity_df_row.empty and 'page' in selected_recogniser_entity_df_row.columns:
193
+ try:
194
+ # Use .iloc[0] and .item() for robust scalar extraction
195
+ gradio_annotator_current_page_number = int(selected_recogniser_entity_df_row['page'].iloc[0])
196
+ annotate_previous_page = gradio_annotator_current_page_number # Store original page number
197
+ except (IndexError, ValueError, TypeError):
198
+ print("Warning: Could not extract valid page number from selected_recogniser_entity_df_row. Defaulting to page 0 (or 1).")
199
+ gradio_annotator_current_page_number = 1 # Or 0 depending on 1-based vs 0-based indexing elsewhere
200
+
201
+ # Ensure page number is valid and 1-based for external display/logic
202
+ if gradio_annotator_current_page_number <= 0:
203
+ gradio_annotator_current_page_number = 1
204
+
205
+ page_max_reported = len(out_image_annotations_state)
206
+ if gradio_annotator_current_page_number > page_max_reported:
207
+ gradio_annotator_current_page_number = page_max_reported # Cap at max pages
208
+
209
+ page_num_reported_zero_indexed = gradio_annotator_current_page_number - 1
210
+
211
+ # Process page sizes DataFrame early, as it's needed for image path handling and potentially coordinate multiplication
212
+ page_sizes_df = pd.DataFrame(page_sizes)
213
+ if not page_sizes_df.empty:
214
+ # Safely convert page column to numeric and then int
215
+ page_sizes_df["page"] = pd.to_numeric(page_sizes_df["page"], errors="coerce")
216
+ page_sizes_df.dropna(subset=["page"], inplace=True)
217
+ if not page_sizes_df.empty:
218
+ page_sizes_df["page"] = page_sizes_df["page"].astype(int)
219
+ else:
220
+ print("Warning: Page sizes DataFrame became empty after processing.")
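The page-number coercion above (coerce to numeric, drop unparseable rows, cast to int) can be exercised in isolation; the sample data here is illustrative:

```python
import pandas as pd

# Safe coercion of a 'page' column that may contain strings or nulls
page_sizes_df = pd.DataFrame({"page": ["1", "2", "bad", None]})
page_sizes_df["page"] = pd.to_numeric(page_sizes_df["page"], errors="coerce")
page_sizes_df = page_sizes_df.dropna(subset=["page"])
page_sizes_df["page"] = page_sizes_df["page"].astype(int)
print(list(page_sizes_df["page"]))
```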
221
 
222
+ # --- OPTIMIZATION: Process only the current page's data from review_df ---
223
  if not review_df.empty:
224
+ # Filter review_df for the current page
225
+ # Ensure 'page' column in review_df is comparable to page_num_reported
226
+ if 'page' in review_df.columns:
227
+ review_df['page'] = pd.to_numeric(review_df['page'], errors='coerce').fillna(-1).astype(int)
228
 
229
+ current_image_path = out_image_annotations_state[page_num_reported_zero_indexed]['image']
230
 
231
+ replaced_image_path, page_sizes_df = replace_placeholder_image_with_real_image(doc_full_file_name_textbox, current_image_path, page_sizes_df, gradio_annotator_current_page_number, input_folder)
232
 
233
+ # page_sizes_df has been changed - save back to page_sizes_object
234
+ page_sizes = page_sizes_df.to_dict(orient='records')
235
+ review_df.loc[review_df["page"]==gradio_annotator_current_page_number, 'image'] = replaced_image_path
236
+ images_list = list(page_sizes_df["image_path"])
237
+ images_list[page_num_reported_zero_indexed] = replaced_image_path
238
+ out_image_annotations_state[page_num_reported_zero_indexed]['image'] = replaced_image_path
239
 
240
+ current_page_review_df = review_df[review_df['page'] == gradio_annotator_current_page_number].copy()
241
+ current_page_review_df = multiply_coordinates_by_page_sizes(current_page_review_df, page_sizes_df)
242
 
243
+ else:
244
+ print(f"Warning: 'page' column not found in review_df. Cannot filter for page {gradio_annotator_current_page_number}. Skipping update from review_df.")
245
+ current_page_review_df = pd.DataFrame() # Empty dataframe if filter fails
246
+
247
+ if not current_page_review_df.empty:
248
+ # Convert the current page's review data to annotation list format for *this page*
249
+
250
+ current_page_annotations_list = []
251
+ # Define expected annotation dict keys, including 'image', 'page', coords, 'label', 'text', 'color' etc.
252
+ # Assuming review_df has compatible columns
253
+ expected_annotation_keys = ['label', 'color', 'xmin', 'ymin', 'xmax', 'ymax', 'text', 'id'] # Add/remove as needed
254
+
255
+ # Ensure necessary columns exist in current_page_review_df before converting rows
256
+ for key in expected_annotation_keys:
257
+ if key not in current_page_review_df.columns:
258
+ # Add missing column with default value
259
+ # Use np.nan for numeric, '' for string/object
260
+ default_value = np.nan if key in ['xmin', 'ymin', 'xmax', 'ymax'] else ''
261
+ current_page_review_df[key] = default_value
262
+
263
+ # Convert filtered DataFrame rows to list of dicts
264
+ # Using .to_dict(orient='records') is efficient for this
265
+ current_page_annotations_list_raw = current_page_review_df[expected_annotation_keys].to_dict(orient='records')
266
+
267
+ current_page_annotations_list = current_page_annotations_list_raw
268
+
269
+ # Update the annotations state for the current page
270
+ # Each entry in out_image_annotations_state seems to be a dict containing keys like 'image', 'page', 'annotations' (List[dict])
271
+ # Need to update the 'annotations' list for the specific page.
272
+ # Find the entry for the current page in the state
273
+ page_state_entry_found = False
274
+ for i, page_state_entry in enumerate(out_image_annotations_state):
275
+ # Assuming page_state_entry has a 'page' key (1-based)
276
+
277
+ match = re.search(r"(\d+)\.png$", page_state_entry['image'])
278
+ if match: page_no = int(match.group(1))
279
+ else: page_no = -1
280
+
281
+ if 'image' in page_state_entry and page_no == page_num_reported_zero_indexed:
282
+ # Replace the annotations list for this page with the new list from review_df
283
+ out_image_annotations_state[i]['boxes'] = current_page_annotations_list
284
+
285
+ # Update the image path as well, based on review_df if available, or keep existing
286
+ # Assuming review_df has an 'image' column for this page
287
+ if 'image' in current_page_review_df.columns and not current_page_review_df.empty:
288
+ # Use the image path from the first row of the filtered review_df
289
+ out_image_annotations_state[i]['image'] = current_page_review_df['image'].iloc[0]
290
+ page_state_entry_found = True
291
+ break
292
+
293
+ if not page_state_entry_found:
294
+ # This scenario might happen if the current_image_annotations_state didn't initially contain
295
+ # an entry for this page number. Depending on the application logic, you might need to
296
+ # add a new entry here, but based on the original code's structure, it seems
297
+ # out_image_annotations_state is pre-populated for all pages.
298
+ print(f"Warning: Entry for page {gradio_annotator_current_page_number} not found in current_image_annotations_state. Cannot update page annotations.")
299
+
300
+
301
+ # --- Image Path and Page Size Handling (already seems focused on current page, keep similar logic) ---
+ # Get the image path for the current page from the updated state
+ # Ensure the entry exists before accessing
+ current_image_path = None
+ if len(out_image_annotations_state) > page_num_reported_zero_indexed and 'image' in out_image_annotations_state[page_num_reported_zero_indexed]:
+ current_image_path = out_image_annotations_state[page_num_reported_zero_indexed]['image']
+ else:
+ print(f"Warning: Could not get image path from state for page index {page_num_reported_zero_indexed}.")
+
+ # Replace placeholder image with real image path if needed
+ if current_image_path and not page_sizes_df.empty:
+ try:
+ replaced_image_path, page_sizes_df = replace_placeholder_image_with_real_image(
+ doc_full_file_name_textbox, current_image_path, page_sizes_df,
+ gradio_annotator_current_page_number, input_folder # Use 1-based page number
+ )

+ # Update state and review_df with the potentially replaced image path
+ if len(out_image_annotations_state) > page_num_reported_zero_indexed:
+ out_image_annotations_state[page_num_reported_zero_indexed]['image'] = replaced_image_path

+ if 'page' in review_df.columns and 'image' in review_df.columns:
+ review_df.loc[review_df["page"]==gradio_annotator_current_page_number, 'image'] = replaced_image_path
+
+ except Exception as e:
+ print(f"Error during image path replacement for page {gradio_annotator_current_page_number}: {e}")
+
+ # Save back page_sizes_df to page_sizes list format
+ if not page_sizes_df.empty:
+ page_sizes = page_sizes_df.to_dict(orient='records')
+ else:
+ page_sizes = [] # Ensure page_sizes is a list if df is empty
+
+ # --- Re-evaluate Coordinate Multiplication and Duplicate Removal ---
+ # The original code multiplied coordinates for the *entire* document and removed duplicates
+ # across the *entire* document *after* converting the full review_df to state.
+ # With the optimized approach, only the current page's annotations are updated in the state.
+
+ # remove_duplicate_images_with_blank_boxes expects the raw list-of-dicts state format:
+ try:
+ out_image_annotations_state = remove_duplicate_images_with_blank_boxes(out_image_annotations_state)
+ except Exception as e:
+ print(f"Error during duplicate removal: {e}. Proceeding without duplicate removal.")
+
+ # Select the current page's annotation object from the (potentially updated) state
+ if len(out_image_annotations_state) > page_num_reported_zero_indexed:
+ out_current_page_annotator = out_image_annotations_state[page_num_reported_zero_indexed]
+ else:
+ print(f"Warning: Cannot select current page annotator object for index {page_num_reported_zero_indexed}.")
+ out_current_page_annotator = {} # Or None, depending on expected output type
+
+ # The original code returns gradio_annotator_current_page_number as the 3rd value,
+ # which was potentially updated by bounding checks. Keep this.
+ final_page_number_returned = gradio_annotator_current_page_number
+
+ return (out_current_page_annotator,
+ out_image_annotations_state,
+ final_page_number_returned,
+ page_sizes,
+ review_df, # review_df might have its 'page' column type changed; keep as is or revert if necessary
+ annotate_previous_page) # The original page number from selected_recogniser_entity_df_row

 def exclude_selected_items_from_redaction(review_df: pd.DataFrame,
 selected_rows_df: pd.DataFrame,
 page_sizes:List[dict],
 image_annotations_state:dict,
 recogniser_entity_dataframe_base:pd.DataFrame):
+ '''
 Remove the selected rows of the review dataframe from the annotation object and the review dataframe.
 '''

 return out_review_df, out_image_annotations_state, out_recogniser_entity_dataframe_base, backup_review_state, backup_image_annotations_state, backup_recogniser_entity_dataframe_base

+ def replace_annotator_object_img_np_array_with_page_sizes_image_path(
+ all_image_annotations:List[dict],
+ page_image_annotator_object:AnnotatedImageData,
+ page_sizes:List[dict],
+ page:int):
+ '''
+ Check whether the image value in an AnnotatedImageData dict is a placeholder or an np.array. If it is either, replace the value with the file path of the image for this page, which should already be loaded into the app.
+ '''
+ page_zero_index = page - 1
+
+ if isinstance(all_image_annotations[page_zero_index]["image"], np.ndarray) or "placeholder_image" in all_image_annotations[page_zero_index]["image"] or isinstance(page_image_annotator_object['image'], np.ndarray):
+ page_sizes_df = pd.DataFrame(page_sizes)
+ page_sizes_df[["page"]] = page_sizes_df[["page"]].apply(pd.to_numeric, errors="coerce")
+
+ # Check for matching pages
+ matching_paths = page_sizes_df.loc[page_sizes_df['page'] == page, "image_path"].unique()
+
+ if matching_paths.size > 0:
+ image_path = matching_paths[0]
+ page_image_annotator_object['image'] = image_path
+ all_image_annotations[page_zero_index]["image"] = image_path
+ else:
+ print(f"No image path found for page {page}.")
+
+ return page_image_annotator_object, all_image_annotations

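The page lookup in this helper boils down to coercing the `page` column to numeric and filtering for a matching `image_path`. A minimal self-contained sketch of that pandas pattern (the records below are hypothetical, not the app's real `page_sizes` data):

```python
import pandas as pd

# Hypothetical page_sizes records; note the string page numbers
# that pd.to_numeric must coerce before filtering.
page_sizes = [
    {"page": "1", "image_path": "input/page_1.png"},
    {"page": "2", "image_path": "input/page_2.png"},
]

page_sizes_df = pd.DataFrame(page_sizes)
page_sizes_df[["page"]] = page_sizes_df[["page"]].apply(pd.to_numeric, errors="coerce")

# Look up the image path for page 2, as the helper does
matching_paths = page_sizes_df.loc[page_sizes_df["page"] == 2, "image_path"].unique()
image_path = matching_paths[0] if matching_paths.size > 0 else None
```

Coercing with `errors="coerce"` means malformed page values become `NaN` rather than raising, so the equality filter simply skips them.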
+ def replace_placeholder_image_with_real_image(doc_full_file_name_textbox:str, current_image_path:str, page_sizes_df:pd.DataFrame, page_num_reported:int, input_folder:str):
+ ''' If the image path is still not valid, load in a new image and overwrite it. Then replace all items in the image annotation object for all pages based on the updated information.'''
+ page_num_reported_zero_indexed = page_num_reported - 1

+ if not os.path.exists(current_image_path):

+ page_num, replaced_image_path, width, height = process_single_page_for_image_conversion(doc_full_file_name_textbox, page_num_reported_zero_indexed, input_folder=input_folder)

+ # Overwrite page_sizes values
 page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"] = width
 page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"] = height
+ page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_path"] = replaced_image_path
+
+ else:
+ if not page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"].isnull().all():
+ width = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"].max()
+ height = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"].max()
+ else:
+ image = Image.open(current_image_path)
+ width = image.width
+ height = image.height

+ page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"] = width
+ page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"] = height

+ page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_path"] = current_image_path

+ replaced_image_path = current_image_path
+
+ return replaced_image_path, page_sizes_df

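The repeated `page_sizes_df.loc[...]` updates above are boolean-mask assignments that overwrite one page's entries in place. A small sketch of the same pattern with made-up values:

```python
import pandas as pd

# Toy page_sizes table (hypothetical values, for illustration only)
page_sizes_df = pd.DataFrame({
    "page": [1, 2],
    "image_width": [None, 595.0],
    "image_height": [None, 842.0],
    "image_path": ["placeholder_image.png", "input/page_2.png"],
})

# Overwrite the entries for page 1 in one vectorized step per column
mask = page_sizes_df["page"] == 1
page_sizes_df.loc[mask, "image_width"] = 612.0
page_sizes_df.loc[mask, "image_height"] = 792.0
page_sizes_df.loc[mask, "image_path"] = "input/page_1.png"
```

Reusing a precomputed mask avoids re-evaluating the `page` comparison for each column, which is a minor but free saving over the repeated inline filters.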
+ def update_annotator_object_and_filter_df(
+ all_image_annotations:List[AnnotatedImageData],
+ gradio_annotator_current_page_number:int,
+ recogniser_entities_dropdown_value:str="ALL",
+ page_dropdown_value:str="ALL",
+ text_dropdown_value:str="ALL",
+ recogniser_dataframe_base:gr.Dataframe=None, # Simplified default
+ zoom:int=100,
+ review_df:pd.DataFrame=None, # Use None for default empty DataFrame
+ page_sizes:List[dict]=[],
+ doc_full_file_name_textbox:str='',
+ input_folder:str=INPUT_FOLDER
+ ) -> Tuple[image_annotator, gr.Number, gr.Number, int, str, gr.Dataframe, pd.DataFrame, List[str], List[str], List[dict], List[AnnotatedImageData]]:
+ '''
+ Update a gradio_image_annotation object with new annotation data for the current page
+ and update filter dataframes, optimizing by processing only the current page's data for display.
+ '''
+ zoom_str = str(zoom) + '%'
+
+ # Handle default empty review_df and recogniser_dataframe_base
+ if review_df is None or not isinstance(review_df, pd.DataFrame):
+ review_df = pd.DataFrame(columns=["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax", "text", "id"])
+ if recogniser_dataframe_base is None: # Create a simple default if None
+ recogniser_dataframe_base = gr.Dataframe(pd.DataFrame(data={"page":[], "label":[], "text":[], "id":[]}))
+
+ # Handle empty all_image_annotations state early
+ if not all_image_annotations:
+ print("No all_image_annotation object found")
+ # Return blank/default outputs
+ blank_annotator = gr.ImageAnnotator(
+ value = None, boxes_alpha=0.1, box_thickness=1, label_list=[], label_colors=[],
+ show_label=False, height=zoom_str, width=zoom_str, box_min_size=1,
+ box_selected_thickness=2, handle_size=4, sources=None,
+ show_clear_button=False, show_share_button=False, show_remove_button=False,
+ handles_cursor=True, interactive=True, use_default_label=True
+ )
+ blank_df_out_gr = gr.Dataframe(pd.DataFrame(columns=["page", "label", "text", "id"]))
+ blank_df_modified = pd.DataFrame(columns=["page", "label", "text", "id"])

+ return (blank_annotator, gr.Number(value=1), gr.Number(value=1), 1,
+ recogniser_entities_dropdown_value, blank_df_out_gr, blank_df_modified,
+ [], [], [], []) # Return empty lists/defaults for other outputs
+
+ # Validate and bound the current page number (1-based logic)
+ page_num_reported = max(1, gradio_annotator_current_page_number) # Minimum page is 1
+ page_max_reported = len(all_image_annotations)
+ if page_num_reported > page_max_reported:
+ page_num_reported = page_max_reported

+ page_num_reported_zero_indexed = page_num_reported - 1
+ annotate_previous_page = page_num_reported # Store the determined page number

+ # --- Process page sizes DataFrame ---
+ page_sizes_df = pd.DataFrame(page_sizes)
+ if not page_sizes_df.empty:
+ page_sizes_df["page"] = pd.to_numeric(page_sizes_df["page"], errors="coerce")
+ page_sizes_df.dropna(subset=["page"], inplace=True)
+ if not page_sizes_df.empty:
+ page_sizes_df["page"] = page_sizes_df["page"].astype(int)
+ else:
+ print("Warning: Page sizes DataFrame became empty after processing.")
+
+ # --- Handle Image Path Replacement for the Current Page ---
+ # This modifies the specific page entry within all_image_annotations list
+ # Assuming replace_annotator_object_img_np_array_with_page_sizes_image_path
+ # correctly updates the image path within the list element.
+ if len(all_image_annotations) > page_num_reported_zero_indexed:
+ # Make a shallow copy of the list and deep copy the specific page dict before modification
+ # to avoid modifying the input list unexpectedly if it's used elsewhere.
+ # However, the original code modified the list in place, so we'll stick to that
+ # pattern but acknowledge it.
+ page_object_to_update = all_image_annotations[page_num_reported_zero_indexed]
+
+ # Use the helper function to replace the image path within the page object
+ # Note: This helper returns the potentially modified page_object and the full state.
+ # The full state return seems redundant if only page_object_to_update is modified.
+ # Let's call it and assume it correctly updates the item in the list.
+ updated_page_object, all_image_annotations_after_img_replace = replace_annotator_object_img_np_array_with_page_sizes_image_path(
+ all_image_annotations, page_object_to_update, page_sizes, page_num_reported)
+
+ # The original code immediately re-assigns all_image_annotations.
+ # We'll rely on the function modifying the list element in place or returning the updated list.
+ # Assuming it returns the updated list for robustness:
+ all_image_annotations = all_image_annotations_after_img_replace
+
+ # Now handle the actual image file path replacement using replace_placeholder_image_with_real_image
+ current_image_path = updated_page_object.get('image') # Get potentially updated image path
+
+ if current_image_path and not page_sizes_df.empty:
+ try:
+ replaced_image_path, page_sizes_df = replace_placeholder_image_with_real_image(
+ doc_full_file_name_textbox, current_image_path, page_sizes_df,
+ page_num_reported, input_folder=input_folder # Use 1-based page num
+ )

+ # Update the image path in the state and review_df for the current page
+ # Find the correct entry in all_image_annotations list again by index
+ if len(all_image_annotations) > page_num_reported_zero_indexed:
+ all_image_annotations[page_num_reported_zero_indexed]['image'] = replaced_image_path

+ # Update review_df's image path for this page
+ if 'page' in review_df.columns and 'image' in review_df.columns:
+ # Ensure review_df page column is numeric for filtering
+ review_df['page'] = pd.to_numeric(review_df['page'], errors='coerce').fillna(-1).astype(int)
+ review_df.loc[review_df["page"]==page_num_reported, 'image'] = replaced_image_path

+ except Exception as e:
+ print(f"Error during image path replacement for page {page_num_reported}: {e}")
+ else:
+ print(f"Warning: Page index {page_num_reported_zero_indexed} out of bounds for all_image_annotations list.")

+ # Save back page_sizes_df to page_sizes list format
+ if not page_sizes_df.empty:
+ page_sizes = page_sizes_df.to_dict(orient='records')
+ else:
+ page_sizes = [] # Ensure page_sizes is a list if df is empty
+
+ # --- OPTIMIZATION: Prepare data *only* for the current page for display ---
+ current_page_image_annotator_object = None
+ if len(all_image_annotations) > page_num_reported_zero_indexed:
+ page_data_for_display = all_image_annotations[page_num_reported_zero_indexed]
+
+ # Convert current page annotations list to DataFrame for coordinate multiplication IF needed
+ # Assuming coordinate multiplication IS needed for display if state stores relative coords
+ current_page_annotations_df = convert_annotation_data_to_dataframe([page_data_for_display])
+
+ if not current_page_annotations_df.empty and not page_sizes_df.empty:
+ # Multiply coordinates *only* for this page's DataFrame
+ try:
+ # Need the specific page's size for multiplication
+ page_size_row = page_sizes_df[page_sizes_df['page'] == page_num_reported]
+ if not page_size_row.empty:
+ current_page_annotations_df = multiply_coordinates_by_page_sizes(
+ current_page_annotations_df, page_size_row, # Pass only the row for the current page
+ xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax"
+ )
+
+ except Exception as e:
+ print(f"Warning: Error during coordinate multiplication for page {page_num_reported}: {e}. Using original coordinates.")
+ # If error, proceed with original coordinates or handle as needed
+
+ if "color" not in current_page_annotations_df.columns:
+ current_page_annotations_df['color'] = '(0, 0, 0)'
+
+ # Convert the processed DataFrame back to the list of dicts format for the annotator
+ processed_current_page_annotations_list = current_page_annotations_df[["xmin", "xmax", "ymin", "ymax", "label", "color", "text", "id"]].to_dict(orient='records')
+
+ # Construct the final object expected by the Gradio ImageAnnotator value parameter
+ current_page_image_annotator_object: AnnotatedImageData = {
+ 'image': page_data_for_display.get('image'), # Use the (potentially updated) image path
+ 'boxes': processed_current_page_annotations_list
+ }

+ # --- Update Dropdowns and Review DataFrame ---
+ # This external function still operates on potentially large DataFrames.
+ # It receives all_image_annotations and a copy of review_df.
+ try:
+ recogniser_entities_list, recogniser_dataframe_out_gr, recogniser_dataframe_modified, recogniser_entities_dropdown_value, text_entities_drop, page_entities_drop = update_recogniser_dataframes(
+ all_image_annotations, # Pass the updated full state
+ recogniser_dataframe_base,
+ recogniser_entities_dropdown_value,
+ text_dropdown_value,
+ page_dropdown_value,
+ review_df.copy(), # Keep the copy as per original function call
+ page_sizes # Pass updated page sizes
+ )
+ # Generate default black colors for labels if needed by image_annotator
+ recogniser_colour_list = [(0, 0, 0) for _ in range(len(recogniser_entities_list))]

+ except Exception as e:
+ print(f"Error calling update_recogniser_dataframes: {e}. Returning empty/default filter data.")
+ recogniser_entities_list = []
+ recogniser_colour_list = []
+ recogniser_dataframe_out_gr = gr.Dataframe(pd.DataFrame(columns=["page", "label", "text", "id"]))
+ recogniser_dataframe_modified = pd.DataFrame(columns=["page", "label", "text", "id"])
+ text_entities_drop = []
+ page_entities_drop = []

+ # --- Final Output Components ---
+ page_number_reported_gradio_comp = gr.Number(label = "Current page", value=page_num_reported, precision=0)

+ ### Present image_annotator outputs
+ # Handle the case where current_page_image_annotator_object couldn't be prepared
+ if current_page_image_annotator_object is None:
+ # This should ideally be covered by the initial empty check for all_image_annotations,
+ # but as a safeguard:
+ print("Warning: Could not prepare annotator object for the current page.")
+ out_image_annotator = image_annotator(value=None, interactive=False) # Present blank/non-interactive
 else:
 out_image_annotator = image_annotator(
 value = current_page_image_annotator_object,
 boxes_alpha=0.1,
 box_thickness=1,
+ label_list=recogniser_entities_list, # Use labels from update_recogniser_dataframes
 label_colors=recogniser_colour_list,
 show_label=False,
 height=zoom_str,

 show_share_button=False,
 show_remove_button=False,
 handles_cursor=True,
+ interactive=True # Keep interactive if data is present
 )

+ # The original code returned page_number_reported_gradio twice;
+ # returning the Gradio component and the plain integer value.
+ # Let's match the output signature.
+ return (out_image_annotator,
+ page_number_reported_gradio_comp,
+ page_number_reported_gradio_comp, # Redundant, but matches original return signature
+ page_num_reported, # Plain integer value
+ recogniser_entities_dropdown_value,
+ recogniser_dataframe_out_gr,
+ recogniser_dataframe_modified,
+ text_entities_drop, # List of text entities for dropdown
+ page_entities_drop, # List of page numbers for dropdown
+ page_sizes, # Updated page_sizes list
+ all_image_annotations) # Return the updated full state

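The per-page optimization above scales only the current page's relative (0-1) coordinates up to pixel coordinates before display. A toy sketch of what `multiply_coordinates_by_page_sizes` is assumed to do for a single page (column names mirror the ones used above; the values are invented):

```python
import pandas as pd

# One page's annotation boxes with relative (0-1) coordinates
boxes = pd.DataFrame({
    "xmin": [0.1], "xmax": [0.5], "ymin": [0.2], "ymax": [0.4],
})
# Hypothetical size entry for this page only
page_size = {"image_width": 1000, "image_height": 500}

# Multiply only this page's coordinates, mirroring the per-page approach
for col in ("xmin", "xmax"):
    boxes[col] = boxes[col] * page_size["image_width"]
for col in ("ymin", "ymax"):
    boxes[col] = boxes[col] * page_size["image_height"]
```

Because only one page's rows are touched, the cost no longer grows with document length, which is the point of the optimization.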
 def update_all_page_annotation_object_based_on_previous_page(
 page_image_annotator_object:AnnotatedImageData,

 previous_page_zero_index = previous_page -1

 if not current_page: current_page = 1
+
+ # This replaces the numpy array image object with the image file path
+ page_image_annotator_object, all_image_annotations = replace_annotator_object_img_np_array_with_page_sizes_image_path(all_image_annotations, page_image_annotator_object, page_sizes, previous_page)

 if clear_all == False: all_image_annotations[previous_page_zero_index] = page_image_annotator_object
 else: all_image_annotations[previous_page_zero_index]["boxes"] = []

 page_image_annotator_object = all_image_annotations[current_page - 1]

 # This replaces the numpy array image object with the image file path
+ page_image_annotator_object, all_image_annotations = replace_annotator_object_img_np_array_with_page_sizes_image_path(all_image_annotations, page_image_annotator_object, page_sizes, current_page)
 page_image_annotator_object['image'] = all_image_annotations[current_page - 1]["image"]

 if not page_image_annotator_object:

 # Check if all elements are integers in the range 0-255
 if all(isinstance(c, int) and 0 <= c <= 255 for c in fill):
 pass
+
 else:
 print(f"Invalid color values: {fill}. Defaulting to black.")
 fill = (0, 0, 0) # Default to black if invalid

 doc = [image]

 elif file_extension in '.csv':

 pdf_doc = []

 # If working with pdfs

 row_value_df = pd.DataFrame(data={"page":[row_value_page], "label":[row_value_label], "text":[row_value_text], "id":[row_value_id]})

+ return row_value_df

 def df_select_callback_textract_api(df: pd.DataFrame, evt: gr.SelectData):

 row_value_job_id = evt.row_value[0] # This is the job ID value
 # row_value_label = evt.row_value[1] # This is the label number value


 return row_value_page, row_value_df

+ def update_selected_review_df_row_colour(
+ redaction_row_selection: pd.DataFrame,
+ review_df: pd.DataFrame,
+ previous_id: str = "",
+ previous_colour: str = '(0, 0, 0)',
+ colour: str = '(1, 0, 255)'
+ ) -> tuple[pd.DataFrame, str, str]:
 '''
 Update the colour of a single redaction box based on the values in a selection row
+ (Optimized Version)
 '''

+ # Ensure 'color' column exists, defaulting to previous_colour if previous_id is provided
+ if "color" not in review_df.columns:
+ review_df["color"] = previous_colour if previous_id else '(0, 0, 0)'
+
+ # Ensure 'id' column exists
 if "id" not in review_df.columns:
+ # Assuming fill_missing_ids is a defined function that returns a DataFrame.
+ # It is more efficient to handle this outside the function if possible.
+ print("Warning: 'id' column not found. Calling fill_missing_ids.")
+ review_df = fill_missing_ids(review_df) # Keep this if necessary, but note it can be slow
+
+ # --- Optimization 1 & 2: Reset existing highlight colours using vectorized assignment ---
+ # Reset the color of the previously highlighted row
+ if previous_id and previous_id in review_df["id"].values:
+ review_df.loc[review_df["id"] == previous_id, "color"] = previous_colour
+
+ # Reset the color of any row that currently has the highlight colour
+ # (handles cases where previous_id was not tracked correctly).
+ # Assuming 'color' is consistently in string format like '(R, G, B)'
+ review_df.loc[review_df["color"] == colour, "color"] = '(0, 0, 0)'

 if not redaction_row_selection.empty and not review_df.empty:
 use_id = (
+ "id" in redaction_row_selection.columns
+ and "id" in review_df.columns
+ and not redaction_row_selection["id"].isnull().all()
 and not review_df["id"].isnull().all()
 )

+ selected_merge_cols = ["id"] if use_id else ["label", "page", "text"]

+ # --- Optimization 3: Use inner merge directly ---
+ # Merge to find rows in review_df that match redaction_row_selection
+ merged_reviews = review_df.merge(
+ redaction_row_selection[selected_merge_cols],
+ on=selected_merge_cols,
+ how="inner" # Use inner join as we only care about matches
+ )

+ if not merged_reviews.empty:
+ # Assuming we only expect one match when highlighting a single row.
+ # If multiple matches are possible and all should be highlighted,
+ # the logic for previous_id and previous_colour needs adjustment.
+ new_previous_colour = str(merged_reviews["color"].iloc[0])
+ new_previous_id = merged_reviews["id"].iloc[0]
+
+ # --- Optimization 1 & 2: Update color of the matched row using vectorized assignment ---
+ if use_id:
+ # Faster update if using unique 'id' as merge key
+ review_df.loc[review_df["id"].isin(merged_reviews["id"]), "color"] = colour
+ else:
+ # More general case using multiple columns - might be slower.
+ # Create a temporary key for comparison
+ def create_merge_key(df, cols):
+ return df[cols].astype(str).agg('_'.join, axis=1)
+
+ review_df_key = create_merge_key(review_df, selected_merge_cols)
+ merged_reviews_key = create_merge_key(merged_reviews, selected_merge_cols)
+
+ review_df.loc[review_df_key.isin(merged_reviews_key), "color"] = colour
+
+ previous_colour = new_previous_colour
+ previous_id = new_previous_id
+ else:
+ # No rows matched the selection
+ print("No reviews found matching selection criteria")
+ # The reset logic at the beginning already handles clearing the highlight colour.
+ previous_colour = '(0, 0, 0)' # Reset previous_colour as no row was highlighted
+ previous_id = '' # Reset previous_id

 else:
+ # If selection is empty, reset any existing highlights
+ review_df.loc[review_df["color"] == colour, "color"] = '(0, 0, 0)'
+ previous_colour = '(0, 0, 0)'
+ previous_id = ''

+ # Ensure column order is maintained if necessary, though pandas generally preserves order.
+ # Creating a new DataFrame here may copy data; consider whether this is strictly needed.
+ if set(["image", "page", "label", "color", "xmin","ymin", "xmax", "ymax", "text", "id"]).issubset(review_df.columns):
+ review_df = review_df[["image", "page", "label", "color", "xmin","ymin", "xmax", "ymax", "text", "id"]]
+ else:
+ print("Warning: Not all expected columns are present in review_df for reordering.")

 return review_df, previous_id, previous_colour

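The highlight update relies on two vectorized pandas steps: an inner merge to find the selected rows, then a `.loc`/`.isin` assignment to recolour them. A self-contained sketch of that pattern with toy data (the ids and colour strings are invented):

```python
import pandas as pd

review_df = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "color": ["(0, 0, 0)", "(0, 0, 0)", "(0, 0, 0)"],
})
selection = pd.DataFrame({"id": ["b2"]})

# Inner merge keeps only the rows of review_df that match the selection...
merged = review_df.merge(selection[["id"]], on=["id"], how="inner")

# ...then a vectorized .loc/.isin assignment highlights them in place
review_df.loc[review_df["id"].isin(merged["id"]), "color"] = "(1, 0, 255)"
```

Both steps are O(n) over the dataframe and avoid the per-row iteration that a loop-based highlight would require.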

 page_sizes_df = pd.DataFrame(page_sizes)

 # If there are no image coordinates, then convert coordinates to pymupdf coordinates prior to export

 pages_are_images = False

 if "mediabox_width" not in review_file_df.columns:

 raise ValueError(f"Invalid cropbox format: {document_cropboxes[page_python_format]}")
 else:
 print("Document cropboxes not found.")

 pdf_page_height = pymupdf_page.mediabox.height
+ pdf_page_width = pymupdf_page.mediabox.width

 # Create redaction annotation
 redact_annot = SubElement(annots, 'redact')

 # Optionally, you can add the image path or other relevant information
 df.loc[_, 'image'] = image_path

 out_file_path = output_folder + file_path_name + "_review_file.csv"
 df.to_csv(out_file_path, index=None)

tools/textract_batch_call.py CHANGED
@@ -10,7 +10,7 @@ from io import StringIO
 from urllib.parse import urlparse
 from botocore.exceptions import ClientError, NoCredentialsError, PartialCredentialsError, TokenRetrievalError
 
- from tools.config import TEXTRACT_BULK_ANALYSIS_BUCKET, OUTPUT_FOLDER, AWS_REGION, DOCUMENT_REDACTION_BUCKET, LOAD_PREVIOUS_TEXTRACT_JOBS_S3, TEXTRACT_JOBS_S3_LOC, TEXTRACT_JOBS_LOCAL_LOC
+ from tools.config import TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET, OUTPUT_FOLDER, AWS_REGION, DOCUMENT_REDACTION_BUCKET, LOAD_PREVIOUS_TEXTRACT_JOBS_S3, TEXTRACT_JOBS_S3_LOC, TEXTRACT_JOBS_LOCAL_LOC
 #from tools.aws_textract import json_to_ocrresult
 
 def analyse_document_with_textract_api(
@@ -18,7 +18,7 @@ def analyse_document_with_textract_api(
 s3_input_prefix: str,
 s3_output_prefix: str,
 job_df:pd.DataFrame,
- s3_bucket_name: str = TEXTRACT_BULK_ANALYSIS_BUCKET,
+ s3_bucket_name: str = TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET,
 local_output_dir: str = OUTPUT_FOLDER,
 analyse_signatures:List[str] = [],
 successful_job_number:int=0,
@@ -328,7 +328,7 @@ def poll_bulk_textract_analysis_progress_and_download(
 s3_output_prefix: str,
 pdf_filename:str,
 job_df:pd.DataFrame,
- s3_bucket_name: str = TEXTRACT_BULK_ANALYSIS_BUCKET,
+ s3_bucket_name: str = TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET,
 local_output_dir: str = OUTPUT_FOLDER,
 load_s3_jobs_loc:str=TEXTRACT_JOBS_S3_LOC,
 load_local_jobs_loc:str=TEXTRACT_JOBS_LOCAL_LOC,