Merge pull request #18 from seanpedrick-case/dev
Improved review efficiency, logging to DynamoDB, local OCR text extraction saves, bug fixes
- .dockerignore +3 -1
- .gitignore +3 -1
- README.md +260 -56
- app.py +130 -70
- pyproject.toml +57 -0
- requirements.txt +2 -2
- tools/aws_functions.py +57 -1
- tools/aws_textract.py +205 -20
- tools/config.py +48 -19
- tools/custom_csvlogger.py +162 -26
- tools/custom_image_analyser_engine.py +143 -40
- tools/data_anonymise.py +62 -42
- tools/file_conversion.py +665 -289
- tools/file_redaction.py +145 -66
- tools/helper_functions.py +27 -10
- tools/redaction_review.py +518 -249
- tools/textract_batch_call.py +3 -3
.dockerignore
CHANGED
@@ -17,4 +17,6 @@ dist/*
 build_deps/*
 logs/*
 config/*
-user_guide/*
+user_guide/*
+cdk/*
+web/*
.gitignore
CHANGED
@@ -18,4 +18,6 @@ build_deps/*
 logs/*
 config/*
 doc_redaction_amplify_app/*
-user_guide/*
+user_guide/*
+cdk/*
+web/*
README.md
CHANGED
@@ -156,72 +217,169 @@
-If you open up a 'review_file' csv output using a spreadsheet software program such as Microsoft Excel, you can easily modify redaction properties. Open the file '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local.csv)', and you should see a spreadsheet with just four suggested redactions (see below). The following instructions are for using Excel.
-We can see from the above that we have successfully removed a redaction box, changed labels, colours, and redaction box sizes.
 Say you have run multiple redaction tasks on the same document, and you want to merge all of these redactions together. You could do this in your spreadsheet editor, but this could be fiddly especially if dealing with multiple review files or large numbers of redactions. The app has a feature to combine multiple review files together to create a 'merged' review file.
@@ -303,6 +461,30 @@ When you click the 'convert .xfdf comment file to review_file.csv' button, the a
 
 
 
 ## Using AWS Textract and Comprehend when not running in an AWS environment
 
 AWS Textract and Comprehend give much better results for text extraction and document redaction than the local model options in the app. The most secure way to access them in the Redaction app is to run the app in a secure AWS environment with relevant permissions. Alternatively, you could run the app on your own system while logged in to AWS SSO with relevant permissions.
@@ -322,4 +504,26 @@ AWS_SECRET_KEY= your-secret-key
 
 The app should then pick up these keys when trying to access the AWS Textract and Comprehend services during redaction.
 
-Again, a lot can potentially go wrong with AWS solutions that are insecure, so before trying the above please consult with your AWS and data security teams.
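The "pick up these keys" step above can be sketched with a small startup check. This is a hypothetical helper, not the app's actual code; the variable names follow the example config shown above, though note that boto3 and most AWS tooling expect the standard `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` names instead.

```python
import os

def aws_keys_present() -> bool:
    # Variable names follow the example config above; many AWS tools instead
    # expect AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY.
    return bool(os.environ.get("AWS_ACCESS_KEY")) and bool(os.environ.get("AWS_SECRET_KEY"))
```

An app doing this kind of check can then fall back to local-only text extraction and PII detection when the keys are absent.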
@@ -20,6 +20,12 @@ NOTE: The app is not 100% accurate, and it will miss some personal information.
 
# USER GUIDE

## Experiment with the test (public) version of the app
You can test out many of the features described in this user guide at the [public test version of the app](https://huggingface.co/spaces/seanpedrickcase/document_redaction), which is free. AWS functions (e.g. Textract, Comprehend) are not enabled (unless you have valid API keys).

## Chat over this user guide
You can now [speak with a chat bot about this user guide](https://huggingface.co/spaces/seanpedrickcase/Light-PDF-Web-QA-Chatbot) (beta!).

## Table of contents

- [Example data files](#example-data-files)

@@ -33,59 +39,102 @@ NOTE: The app is not 100% accurate, and it will miss some personal information.
- [Redacting only specific pages](#redacting-only-specific-pages)
- [Handwriting and signature redaction](#handwriting-and-signature-redaction)
- [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)
- [Redacting tabular data files (CSV/XLSX) or copy and pasted text](#redacting-tabular-data-files-xlsxcsv-or-copy-and-pasted-text)

See the [advanced user guide here](#advanced-user-guide):
- [Merging redaction review files](#merging-redaction-review-files)
- [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
- [Fuzzy search and redaction](#fuzzy-search-and-redaction)
- [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
- [Exporting to Adobe Acrobat](#exporting-to-adobe-acrobat)
- [Importing from Adobe Acrobat](#importing-from-adobe-acrobat)
- [Using the AWS Textract document API](#using-the-aws-textract-document-api)
- [Using AWS Textract and Comprehend when not running in an AWS environment](#using-aws-textract-and-comprehend-when-not-running-in-an-aws-environment)
- [Modifying existing redaction review files](#modifying-existing-redaction-review-files)

## Example data files

Please try these example files to follow along with this guide:
- [Example of files sent to a professor before applying](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_of_emails_sent_to_a_professor_before_applying.pdf)
- [Example complaint letter (jpg)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_complaint_letter.jpg)
- [Partnership Agreement Toolkit (for signatures and more advanced usage)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/Partnership-Agreement-Toolkit_0_0.pdf)
- [Dummy case note data](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/combined_case_notes.csv)

## Basic redaction

The document redaction app can detect personally-identifiable information (PII) in documents. Documents can be redacted directly, or suggested redactions can be reviewed and modified using a graphical user interface. Basic document redaction can be performed quickly using the default options.

Download the example PDFs above to your computer. Open up the redaction app with the link provided by email.



### Upload files to the app

The 'Redact PDFs/images' tab currently accepts PDFs and image files (JPG, PNG) for redaction. Click on the 'Drop files here or Click to Upload' area of the screen, and select one of the three different [example files](#example-data-files) (they should all be stored in the same folder if you want them to be redacted at the same time).

### Text extraction

First, select one of the three text extraction options:
- **'Local model - selectable text'** - This will read text directly from PDFs that have selectable text to redact (using PikePDF). This is fine for most PDFs, but will find nothing if the PDF does not have selectable text, and it is not good for handwriting or signatures. If it encounters an image file, it will send it onto the second option below.
- **'Local OCR model - PDFs without selectable text'** - This option will use a simple Optical Character Recognition (OCR) model (Tesseract) to pull out text from a PDF/image that it 'sees'. This can handle most typed text in PDFs/images without selectable text, but struggles with handwriting/signatures. If you are interested in the latter, then you should use the third option if available.
- **'AWS Textract service - all PDF types'** - Only available for instances of the app running on AWS. AWS Textract is a service that performs OCR on documents within their secure service. This is a more advanced version of OCR compared to the local option, and carries a (relatively small) cost. Textract excels in complex documents based on images, or documents that contain a lot of handwriting and signatures.

### Optional - select signature extraction
If you chose the AWS Textract service above, you can choose whether you want handwriting and/or signatures redacted by default. Choosing signatures here will have a cost implication, as identifying signatures will cost ~£2.66 ($3.50) per 1,000 pages vs ~£1.14 ($1.50) per 1,000 pages without signature detection.

![Handwriting and signatures](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/select_handwriting_signatures.PNG)
### PII redaction method

If you are running with the AWS service enabled, here you will also have a choice for PII redaction method:
- **'Only extract text - (no redaction)'** - Choose this if you are only interested in getting the text out of the document for further processing (e.g. to find duplicate pages, or to review text on the Review redactions page).
- **'Local'** - This uses the spacy package to rapidly detect PII in extracted text. This method is often sufficient if you are just interested in redacting specific terms defined in a custom list.
- **'AWS Comprehend'** - This method calls an AWS service to provide more accurate identification of PII in extracted text.

### Optional - costs and time estimation
If the option is enabled (by your system admin, in the config file), you will see a cost and time estimate for the redaction process. 'Existing Textract output file found' will be checked automatically if previous Textract text extraction files exist in the output folder, or have been [previously uploaded by the user](#aws-textract-outputs) (saving time and money for redaction).

![Costs and time estimation](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/costs_and_time_estimation.PNG)

### Optional - cost code selection
If the option is enabled (by your system admin, in the config file), you may be prompted to select a cost code before continuing with the redaction task.

![Cost code selection](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/cost_code_selection.PNG)

The relevant cost code can be found either by: 1. Using the search bar above the data table to find relevant cost codes, then clicking on the relevant row, or 2. Typing it directly into the dropdown to the right, where it should filter as you type.

### Optional - Submit whole documents to Textract API
If this option is enabled (by your system admin, in the config file), you will have the option to submit whole documents in quick succession to the AWS Textract service to get extracted text outputs quickly (faster than using the 'Redact document' process described here). This feature is described in more detail in the [advanced user guide](#using-the-aws-textract-document-api).

![Textract document API](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/textract_document_api.PNG)

### Redact the document

Click 'Redact document'. After loading in the document, the app should be able to process about 30 pages per minute (depending on the redaction methods chosen above). When ready, you should see a message saying that processing is complete, with output files appearing in the bottom right.

### Redaction outputs

![Redaction outputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redaction_outputs.PNG)

- **'...redacted.pdf'** files contain the original pdf with suggested redacted text deleted and replaced by a black box on top of the document.
- **'...ocr_results.csv'** files contain the line-by-line text outputs from the entire document. This file can be useful for later searching through for any terms of interest in the document (e.g. using Excel or a similar program).
- **'...review_file.csv'** files are the review files that contain details and locations of all of the suggested redactions in the document. This file is key to the [review process](#reviewing-and-modifying-suggested-redactions), and should be downloaded to use later for this.
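As an illustration of why the review file is useful downstream, here is a minimal sketch of reading one programmatically. The column names and rows below are invented for illustration only; open one of your own '..._review_file.csv' outputs to see the real columns:

```python
import csv
import io

# Hypothetical review_file contents -- the actual column names produced by the
# app may differ, so check a real output before relying on this.
sample = io.StringIO(
    "page,label,xmin,ymin,xmax,ymax\n"
    "1,PERSON,0.12,0.20,0.31,0.24\n"
    "2,EMAIL_ADDRESS,0.40,0.52,0.63,0.56\n"
)
rows = list(csv.DictReader(sample))
pages_with_redactions = sorted({int(r["page"]) for r in rows})
print(pages_with_redactions)  # [1, 2]
```

The same file can be filtered or edited in a spreadsheet program and re-uploaded for review, which is what the review process below relies on.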
### Additional AWS Textract / local OCR outputs

If you have used the AWS Textract option for extracting text, you may also see a '..._textract.json' file. This file contains all the relevant extracted text information that comes from the AWS Textract service. You can keep this file and upload it at a later date alongside your input document, which will enable you to skip calling AWS Textract every single time you want to do a redaction task, as follows:

![Document upload alongside textract](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/document_upload_with_textract.PNG)

Similarly, if you have used the 'Local OCR method' to extract text, you may see a '..._ocr_results_with_words.json' file. This file works in the same way as the AWS Textract .json results described above, and can be uploaded alongside an input document to save time on text extraction in future in the same way.
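The "reuse a previous extraction" idea can be sketched as a simple file lookup. The '..._textract.json' naming next to the input document is an assumption based on the description above, and the helper name is hypothetical; check your own output folder for the real convention:

```python
from pathlib import Path
from typing import Optional

# Sketch: before calling AWS Textract, look for a previously saved
# '..._textract.json' next to the input document so the call can be skipped.
def find_existing_textract_output(doc_path: str) -> Optional[Path]:
    doc = Path(doc_path)
    candidate = doc.parent / (doc.stem + "_textract.json")
    return candidate if candidate.exists() else None
```

If a match is found, the saved JSON can be loaded instead of re-submitting the document, saving both time and the per-page Textract cost.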
|
132 |
+
|
133 |
+
If you are logged in via AWS Cognito and you lose your app page for some reason (e.g. from a crash, reloading), it is possible recover your previous output files, provided the server has not been shut down since you redacted the document. Go to 'Redaction settings', then scroll to the bottom to see 'View all output files from this session'.
|
134 |
+
|
135 |
+

|
136 |
+
|
137 |
+
### Basic redaction summary
|
138 |
|
139 |
We have covered redacting documents with the default redaction options. The '...redacted.pdf' file output may be enough for your purposes. But it is very likely that you will need to customise your redaction options, which we will cover below.
|
140 |
|
|
|
175 |
|
176 |
Using the above approaches to allow, deny, and full page redaction lists will give you an output [like this](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/allow_list/Partnership-Agreement-Toolkit_0_0_redacted.pdf).
|
177 |
|
178 |
+
#### Adding to the loaded allow, deny, and whole page lists in-app
|
179 |
+
|
180 |
+
If you open the accordion below the allow list options called 'Manually modify custom allow...', you should be able to see a few tables with options to add new rows:
|
181 |
+
|
182 |
+

|
183 |
+
|
184 |
+
If the table is empty, you can add a new entry, you can add a new row by clicking on the '+' item below each table header. If there is existing data, you may need to click on the three dots to the right and select 'Add row below'. Type the item you wish to keep/remove in the cell, and then (important) press enter to add this new item to the allow/deny/whole page list. Your output tables should look something like below.
|
185 |
+
|
186 |
+

|
187 |
+
|
188 |
### Redacting additional types of personal information
|
189 |
|
190 |
You may want to redact additional types of information beyond the defaults, or you may not be interested in default suggested entity types. There are dates in the example complaint letter. Say we wanted to redact those dates also?
|
|
|
205 |
|
206 |
## Handwriting and signature redaction
|
207 |
|
208 |
+
The file [Partnership Agreement Toolkit (for signatures and more advanced usage)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/Partnership-Agreement-Toolkit_0_0.pdf) is provided as an example document to test AWS Textract + redaction with a document that has signatures in. If you have access to AWS Textract in the app, try removing all entity types from redaction on the Redaction settings and clicking the big X to the right of 'Entities to redact'.
|
209 |
+
|
210 |
+
To ensure that handwriting and signatures are enabled (enabled by default), on the front screen go the 'AWS Textract signature detection' to enable/disable the following options :
|
211 |
|
212 |

|
213 |
|
|
|
217 |
|
218 |
## Reviewing and modifying suggested redactions
|
219 |
|
220 |
+
Sometimes the app will suggest redactions that are incorrect, or will miss personal information entirely. The app allows you to review and modify suggested redactions to compensate for this. You can do this on the 'Review redactions' tab.
|
221 |
+
|
222 |
+
We will go through ways to review suggested redactions with an example.On the first tab 'PDFs/images' upload the ['Example of files sent to a professor before applying.pdf'](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_of_emails_sent_to_a_professor_before_applying.pdf) file. Let's stick with the 'Local model - selectable text' option, and click 'Redact document'. Once the outputs are created, go to the 'Review redactions' tab.
|
223 |
|
224 |
+
On the 'Review redactions' tab you have a visual interface that allows you to inspect and modify redactions suggested by the app. There are quite a few options to look at, so we'll go from top to bottom.
|
225 |
|
226 |

|
227 |
|
228 |
+
### Uploading documents for review
|
229 |
+
|
230 |
+
The top area has a file upload area where you can upload original, unredacted PDFs, alongside the '..._review_file.csv' that is produced by the redaction process. Once you have uploaded these two files, click the 'Review PDF...' button to load in the files for review. This will allow you to visualise and modify the suggested redactions using the interface below.
|
231 |
+
|
232 |
+
Optionally, you can also upload one of the '..._ocr_output.csv' files here that comes out of a redaction task, so that you can navigate the extracted text from the document.
|
233 |
+
|
234 |
+

|
235 |
+
|
236 |
+
You can upload the three review files in the box (unredacted document, '..._review_file.csv' and '..._ocr_output.csv' file) before clicking 'Review PDF...', as in the image below:
|
237 |
+
|
238 |
+

|
239 |
+
|
240 |
+
**NOTE:** ensure you upload the ***unredacted*** document here and not the redacted version, otherwise you will be checking over a document that already has redaction boxes applied!
|
241 |
+
|
242 |
+
### Page navigation
|
243 |
+
|
244 |
You can change the page viewed either by clicking 'Previous page' or 'Next page', or by typing a specific page number in the 'Current page' box and pressing Enter on your keyboard. Each time you switch page, it will save redactions you have made on the page you are moving from, so you will not lose changes you have made.
|
245 |
|
246 |
+
You can also navigate to different pages by clicking on rows in the tables under 'Search suggested redactions' to the right, or 'search all extracted text' (if enabled) beneath that.
|
247 |
|
248 |
+
### The document viewer pane
|
249 |
+
|
250 |
+
On the selected page, each redaction is highlighted with a box next to its suggested redaction label (e.g. person, email).
|
251 |
+
|
252 |
+

|
253 |
+
|
254 |
+
There are a number of different options to add and modify redaction boxes and page on the document viewer pane. To zoom in and out of the page, use your mouse wheel. To move around the page while zoomed, you need to be in modify mode. Scroll to the bottom of the document viewer to see the relevant controls. You should see a box icon, a hand icon, and two arrows pointing counter-clockwise and clockwise.
|
255 |
|
256 |

|
257 |
|
258 |
+
Click on the hand icon to go into modify mode. When you click and hold on the document viewer, This will allow you to move around the page when zoomed in. To rotate the page, you can click on either of the round arrow buttons to turn in that direction.
|
259 |
+
|
260 |
+
**NOTE:** When you switch page, the viewer will stay in your selected orientation, so if it looks strange, just rotate the page again and hopefully it will look correct!
|
261 |
+
|
262 |
+
#### Modify existing redactions (hand icon)
|
263 |
+
|
264 |
+
After clicking on the hand icon, the interface allows you to modify existing redaction boxes. When in this mode, you can click and hold on an existing box to move it.
|
265 |
+
|
266 |
+

|
267 |
+
|
268 |
+
Click on one of the small boxes at the edges to change the size of the box. To delete a box, click on it to highlight it, then press delete on your keyboard. Alternatively, double click on a box and click 'Remove' on the box that appears.
|
269 |
+
|
270 |
+

|
271 |
+
|
272 |
+
#### Add new redaction boxes (box icon)
|
273 |
+
|
274 |
+
To change to 'add redaction boxes' mode, scroll to the bottom of the page. Click on the box icon, and your cursor will change into a crosshair. Now you can add new redaction boxes where you wish. A popup will appear when you create a new box so you can select a label and colour for the new box.
|
275 |
+
|
276 |
+
#### 'Locking in' new redaction box format
|
277 |
|
278 |
+
It is possible to lock in a chosen format for new redaction boxes so that you don't have the popup appearing each time. When you make a new box, select the options for your 'locked' format, and then click on the lock icon on the left side of the popup, which should turn blue.
|
279 |
|
280 |
+

|
281 |
|
282 |
+
You can now add new redaction boxes without a popup appearing. If you want to change or 'unlock' the your chosen box format, you can click on the new icon that has appeared at the bottom of the document viewer pane that looks a little like a gift tag. You can then change the defaults, or click on the lock icon again to 'unlock' the new box format - then popups will appear again each time you create a new box.
|
283 |
+
|
284 |
+

|
285 |
+
|
286 |
+
### Apply redactions to PDF and Save changes on current page
|
287 |
+
|
288 |
+
Once you have reviewed all the redactions in your document and you are happy with the outputs, you can click 'Apply revised redactions to PDF' to create a new '_redacted.pdf' output alongside a new '_review_file.csv' output.
|
289 |
+
|
290 |
+
If you are working on a page and haven't saved for a while, you can click 'Save changes on current page to file' to ensure that they are saved to an updated 'review_file.csv' output.
|
291 |
|
292 |

|
293 |
|
294 |
+
### Selecting and removing redaction boxes using the 'Search suggested redactions' table
|
295 |
|
296 |
+
The table shows a list of all the suggested redactions in the document alongside the page, label, and text (if available).
|
297 |
|
298 |
+

|
299 |
|
300 |
+
If you click on one of the rows in this table, you will be taken to the page of the redaction. Clicking on a redaction row on the same page *should* change the colour of redaction box to blue to help you locate it in the document viewer (just in app, not in redaction output PDFs).
|
301 |
|
302 |
+

|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
303 |
|
304 |
+
You can choose a specific entity type to see which pages the entity is present on. If you want to go to the page specified in the table, you can click on a cell in the table and the review page will be changed to that page.
|
305 |
|
306 |
+
To filter the 'Search suggested redactions' table you can:

1. Click on one of the dropdowns (Redaction category, Page, Text) and select an option, or
2. Write text in the 'Filter' box just above the table, then click the blue box to apply the filter to the table.

Once you have filtered the table, you have a few options underneath for what you can do with the filtered rows:

- Click the 'Exclude specific row from redactions' button to remove only the redaction from the last row you clicked on from the document.
- Click the 'Exclude all items in table from redactions' button to remove all redactions visible in the table from the document. **Important:** ensure that you have clicked the blue tick icon next to the search box before doing this, or you will remove all redactions from the document. If you do end up doing this, click the 'Undo last element removal' button below to restore the redactions.

**NOTE**: After excluding redactions using either of the above options, click the 'Reset filters' button below to ensure that the dropdowns and table show all remaining redactions in the document.

If you make a mistake, click the 'Undo last element removal' button to restore the 'Search suggested redactions' table to its previous state (only the last action can be undone).

### Navigating through the document using the 'Search all extracted text' table

The 'Search all extracted text' table will contain text if you have just redacted a document, or if you have uploaded a '..._ocr_output.csv' file alongside a document file and review file on the Review redactions tab as [described above](#uploading-documents-for-review).

You can navigate through the document using this table. When you click on a row, the Document viewer pane to the left will change to the selected page.

![Search extracted text](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/select_text_page.PNG)

You can search through the extracted text using the search bar just above the table, which should filter as you type. To apply the filter and 'cut' the table, click on the blue tick inside the box next to your search term. To return the table to its original content, click the 'Reset OCR output table filter' button below the table.

![Search extracted text filtered](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/review_redactions/ocr_results_table_filter.PNG)

## Redacting tabular data files (XLSX/CSV) or copy and pasted text

### Tabular data files (XLSX/CSV)

The app can be used to redact tabular data files such as xlsx or csv files. For this to work properly, your data file needs to be in a simple table format, with a single table starting from the first cell (A1) and no other information in the sheet. Similarly, for .xlsx files, each sheet in the file that you want to redact should be in this simple format.
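If you are unsure whether a file follows this layout, a quick check along the following lines can help. This is illustrative only (the file content and column names are made up, and this is not the app's own validation code):

```python
import csv
import io

# A made-up csv in the simple single-table format the app expects:
# headers in the first row, data rows immediately after, nothing else.
csv_text = "case_id,case_note\n1,Met with Mr Smith today\n2,Called 07700 900123\n"
rows = list(csv.reader(io.StringIO(csv_text)))

header, data = rows[0], rows[1:]
print(header)     # → ['case_id', 'case_note']
print(len(data))  # → 2
```

The header row here is what the app would offer as the list of columns you can choose to redact.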

To demonstrate this, we can use [the example csv file 'combined_case_notes.csv'](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/combined_case_notes.csv), which is a small dataset of dummy social care case notes. Go to the 'Open text or Excel/csv files' tab and drop the file into the upload area. After the file is loaded, you should see the suggested columns for redaction in the box underneath. You can select and deselect columns to redact as you wish from this list.

![csv upload](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/tabular_data/csv_upload.PNG)

If you were instead to upload an xlsx file, you would also see a list of all the sheets in the xlsx file that can be redacted. The 'Select columns' area underneath will suggest a list of all columns in the file across all sheets.

![xlsx upload](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/tabular_data/xlsx_upload.PNG)

Once you have chosen your input file and sheets/columns to redact, you can choose the redaction method. 'Local' will use the same local model as used for documents on the first tab. 'AWS Comprehend' will give better results, at a slight cost.

When you click 'Redact text/data files', you will see the progress of the redaction task by file and sheet, and you will receive a csv output with the redacted data.

### Choosing output anonymisation format

You can also choose the anonymisation format of your output results. Open the 'Anonymisation output format' tab to see the options. By default, any detected PII will be replaced with the word 'REDACTED' in the cell. You can choose one of the following options as the form of replacement for the redacted text:

- replace with 'REDACTED': replaced by the word 'REDACTED' (default)
- replace with <ENTITY_NAME>: replaced by the entity type, e.g. 'PERSON' for people, 'EMAIL_ADDRESS' for emails
- redact completely: text is removed completely and replaced by nothing
- hash: replaced by a unique long ID code that is consistent for a given entity text, i.e. a particular name will always have the same ID code
- mask: replaced with stars '*'
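The five output formats above can be sketched as follows. This is an illustrative re-implementation, not the app's actual code; the function name and the exact strategy strings are assumptions for the example:

```python
import hashlib

def anonymise(text: str, entity_type: str, strategy: str) -> str:
    """Illustrative sketch of the anonymisation output formats above."""
    if strategy == "replace with 'REDACTED'":
        return "REDACTED"
    if strategy == "replace with <ENTITY_NAME>":
        return entity_type  # e.g. 'PERSON', 'EMAIL_ADDRESS'
    if strategy == "redact completely":
        return ""
    if strategy == "hash":
        # The same entity text always hashes to the same ID code
        return hashlib.sha256(text.encode("utf-8")).hexdigest()[:16]
    if strategy == "mask":
        return "*" * len(text)
    raise ValueError(f"Unknown strategy: {strategy}")

print(anonymise("Jane Smith", "PERSON", "replace with <ENTITY_NAME>"))  # → PERSON
```

Note how the 'hash' branch is deterministic: redacting the same name twice gives the same ID code, which preserves links between records without revealing the name.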

### Redacting copy and pasted text

You can also write open text into an input box and redact it using the same methods described above. To do this, write or paste text into the 'Enter open text' box that appears when you open the 'Redact open text' tab. Then select a redaction method and an anonymisation output format as described above. The redacted text will be printed in the output textbox, and will also be saved to a simple csv file in the output file box.

![Open text redaction](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/tabular_data/open_text.PNG)

### Redaction log outputs

A list of the suggested redaction outputs from the tabular data / open text data redaction is available on the Redaction settings page under 'Log file outputs'.

# ADVANCED USER GUIDE

This advanced user guide will go over some of the features recently added to the app, including: modifying and merging redaction review files, identifying and redacting duplicate pages across multiple PDFs, 'fuzzy' search and redact, and exporting redactions to Adobe Acrobat.

## Table of contents

- [Merging redaction review files](#merging-redaction-review-files)
- [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
- [Fuzzy search and redaction](#fuzzy-search-and-redaction)
- [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
  - [Exporting to Adobe Acrobat](#exporting-to-adobe-acrobat)
  - [Importing from Adobe Acrobat](#importing-from-adobe-acrobat)
- [Using the AWS Textract document API](#using-the-aws-textract-document-api)
- [Using AWS Textract and Comprehend when not running in an AWS environment](#using-aws-textract-and-comprehend-when-not-running-in-an-aws-environment)
- [Modifying existing redaction review files](#modifying-existing-redaction-review-files)

## Merging redaction review files

Say you have run multiple redaction tasks on the same document and want to merge all of these redactions together. You could do this in your spreadsheet editor, but this can be fiddly, especially when dealing with multiple review files or large numbers of redactions. The app has a feature to combine multiple review files together to create a 'merged' review file.
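Conceptually, the merge is close to concatenating the review files and dropping duplicate rows. A rough sketch with made-up data and simplified columns (not the app's implementation):

```python
import io

import pandas as pd

# Two toy review files for the same document; real files have more columns.
file_a = io.StringIO("page,label,text\n1,PERSON,John Smith\n2,EMAIL_ADDRESS,a@b.com\n")
file_b = io.StringIO("page,label,text\n1,PERSON,John Smith\n3,PHONE_NUMBER,07700 900123\n")

merged = (
    pd.concat([pd.read_csv(f) for f in (file_a, file_b)])
    .drop_duplicates()       # a redaction present in both files is kept once
    .sort_values("page")
    .reset_index(drop=True)
)
print(len(merged))  # → 3 (the shared 'John Smith' row appears only once)
```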

![Combined review files](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/merged_review_files.PNG)

## Using the AWS Textract document API

This option can be enabled by your system admin via the config file (the 'SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS' environment variable and subsequent variables). With it enabled, you can submit whole documents in quick succession to the AWS Textract service to get extracted text outputs quickly (faster than the 'Redact document' process described above).

### Starting a new Textract API job

To use this feature, first upload a document file in the file input box [in the usual way](#upload-files-to-the-app) on the first tab of the app. Under AWS Textract signature detection you can select whether or not you would like to analyse signatures (with a [cost implication](#optional---select-signature-extraction)).

Then, open the section under the heading 'Submit whole document to AWS Textract API...'.

![Textract API call](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/textract_api/textract_api_call.PNG)

Click 'Analyse document with AWS Textract API call'. After a few seconds, the job should be submitted to the AWS Textract service. The box 'Job ID to check status' should now have an ID filled in. If it is not already filled with previous jobs (up to seven days old), the table should have a row added with details of the new API job.

Click the button underneath, 'Check status of Textract job and download', to see progress on the job. Processing will continue in the background until the job is ready, so it is worth periodically clicking this button to see if the outputs are ready. In testing, and as a rough estimate, it seems like this process takes about five seconds per page. However, this has not been tested with very large documents. Once ready, the '_textract.json' output should appear below.
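Under the hood, a whole-document Textract call follows the asynchronous start/poll pattern that the two buttons above correspond to. A minimal boto3 sketch (the bucket and file names are placeholders, and this is not the app's own code):

```python
def build_start_request(bucket: str, key: str) -> dict:
    # Request shape for an asynchronous Textract text-detection job
    return {"DocumentLocation": {"S3Object": {"Bucket": bucket, "Name": key}}}

def start_textract_job(bucket: str, key: str) -> str:
    import boto3  # requires AWS credentials with Textract and S3 permissions

    textract = boto3.client("textract")
    response = textract.start_document_text_detection(**build_start_request(bucket, key))
    return response["JobId"]  # the ID you poll with while the job runs

def check_job(job_id: str) -> str:
    import boto3

    textract = boto3.client("textract")
    # 'JobStatus' is IN_PROGRESS until complete, then SUCCEEDED or FAILED
    return textract.get_document_text_detection(JobId=job_id)["JobStatus"]
```

Polling periodically, as the 'Check status' button does, is the standard pattern because Textract offers no synchronous whole-document call.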
### Textract API job outputs

The '_textract.json' output can be used to speed up further redaction tasks as [described previously](#optional---costs-and-time-estimation); the 'Existing Textract output file found' flag should now be ticked.

![Textract outputs ready](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/textract_api/textract_api_outputs.PNG)

You can now easily get the '..._ocr_output.csv' redaction output based on this '_textract.json' (described in [Redaction outputs](#redaction-outputs)) by clicking the 'Convert Textract job outputs to OCR results' button. You can then use this file e.g. for [identifying duplicate pages](#identifying-and-redacting-duplicate-pages), or for redaction review.

## Using AWS Textract and Comprehend when not running in an AWS environment

AWS Textract and Comprehend give much better results for text extraction and document redaction than the local model options in the app. The most secure way to access them in the Redaction app is to run the app in a secure AWS environment with relevant permissions. Alternatively, you could run the app on your own system while logged in to AWS SSO with relevant permissions.

The app should then pick up these keys when trying to access the AWS Textract and Comprehend services during redaction.
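For reference, this is roughly how boto3 clients accept explicit keys; the region is a placeholder, and hard-coding credentials in source is insecure, so treat this purely as a sketch:

```python
def client_kwargs(access_key: str, secret_key: str, region: str = "eu-west-2") -> dict:
    # Keyword arguments boto3 clients accept for explicit credentials.
    # In practice prefer environment variables, SSO profiles or instance roles.
    return {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        "region_name": region,
    }

def make_comprehend_client(access_key: str, secret_key: str, region: str = "eu-west-2"):
    import boto3  # requires boto3 and valid keys at call time

    return boto3.client("comprehend", **client_kwargs(access_key, secret_key, region))
```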

Again, a lot can potentially go wrong with insecure AWS configurations, so before trying the above please consult with your AWS and data security teams.

## Modifying existing redaction review files

You can find the folder containing the files discussed in this section [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/).

As well as serving as inputs to the document redaction app's review function, the 'review_file.csv' output can be modified outside of the app, and also merged with others from multiple redaction attempts on the same file. This gives you the flexibility to change redaction details outside of the app.

If you open a 'review_file' csv output in spreadsheet software such as Microsoft Excel, you can easily modify redaction properties. Open the file '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local.csv)', and you should see a spreadsheet with just four suggested redactions (see below). The following instructions are for Excel.

![Review file before](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/review_file_before.PNG)

The first thing we can do is remove the first row - 'et' is suggested as a person, but is obviously not a genuine instance of personal information. Right click on the row number and select 'Delete' from the menu. Next, let's imagine that what the app identified as a 'phone number' was in fact another type of number, and we want to change the label. Click on the relevant label cell and change the value to 'SECURITY_NUMBER'. You could also use 'Find & Select' -> 'Replace' from the top ribbon menu if you wanted to change a number of labels simultaneously.

What if we wanted to change the colour of the 'email address' entry on the redaction review tab of the app? The colours in a review file are based on an RGB scale, with three numbers ranging from 0-255. [You can find suitable colours here](https://rgbcolorpicker.com). Using this scale, if I wanted my review box to be pure blue, I could change the cell value to (0,0,255).

Imagine that a redaction box was slightly too small, and I didn't want to use the in-app options to change the size. In the review file csv, we can modify e.g. the ymin and ymax values for any box to increase the extent of the redaction box. For the 'email address' entry, let's decrease ymin by 5, and increase ymax by 5.
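The same edits can be scripted with pandas instead of Excel. The column names below (label, color, ymin, ymax, text) are assumptions based on a typical 'review_file.csv', and the data is made up for the example:

```python
import io

import pandas as pd

# Toy stand-in for a review_file.csv; real files have more columns.
csv_text = (
    "page,label,color,xmin,ymin,xmax,ymax,text\n"
    '1,PERSON,"(0, 0, 0)",10,100,60,110,et\n'
    '2,PHONE_NUMBER,"(0, 0, 0)",10,200,90,210,0123 456789\n'
    '3,EMAIL_ADDRESS,"(0, 0, 0)",10,300,95,310,jane@example.com\n'
)
df = pd.read_csv(io.StringIO(csv_text))

df = df[df["text"] != "et"]                                  # drop the spurious row
df.loc[df["label"] == "PHONE_NUMBER", "label"] = "SECURITY_NUMBER"
email = df["label"] == "EMAIL_ADDRESS"
df.loc[email, "color"] = "(0, 0, 255)"                       # pure blue
df.loc[email, "ymin"] -= 5                                   # grow the box vertically
df.loc[email, "ymax"] += 5
df.to_csv("review_file_mod.csv", index=False)                # save for upload to the app
```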

I have saved an output file following the above steps as '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local_mod.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/outputs/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local_mod.csv)' in the same folder as the original. Let's upload this file to the app along with the original pdf to see how the redactions look now.

![Review file after](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/after_revised_redactions.PNG)

We can see from the above that we have successfully removed a redaction box, and changed labels, colours, and redaction box sizes.

app.py
CHANGED
@@ -4,11 +4,11 @@ import pandas as pd
import gradio as gr
from gradio_image_annotation import image_annotator

from tools.config import OUTPUT_FOLDER, INPUT_FOLDER, RUN_DIRECT_MODE, MAX_QUEUE_SIZE, DEFAULT_CONCURRENCY_LIMIT, MAX_FILE_SIZE, GRADIO_SERVER_PORT, ROOT_PATH, GET_DEFAULT_ALLOW_LIST, ALLOW_LIST_PATH, S3_ALLOW_LIST_PATH, FEEDBACK_LOGS_FOLDER, ACCESS_LOGS_FOLDER, USAGE_LOGS_FOLDER, TESSERACT_FOLDER, POPPLER_FOLDER, REDACTION_LANGUAGE, GET_COST_CODES, COST_CODES_PATH, S3_COST_CODES_PATH, ENFORCE_COST_CODES, DISPLAY_FILE_NAMES_IN_LOGS, SHOW_COSTS, RUN_AWS_FUNCTIONS, DOCUMENT_REDACTION_BUCKET,
from tools.helper_functions import put_columns_in_df, get_connection_params, reveal_feedback_buttons, custom_regex_load, reset_state_vars, load_in_default_allow_list, tesseract_ocr_option, text_ocr_option, textract_option, local_pii_detector, aws_pii_detector, no_redaction_option, reset_review_vars, merge_csv_files, load_all_output_files, update_dataframe, check_for_existing_textract_file, load_in_default_cost_codes, enforce_cost_codes, calculate_aws_costs, calculate_time_taken, reset_base_dataframe, reset_ocr_base_dataframe, update_cost_code_dataframe_from_dropdown_select
from tools.aws_functions import upload_file_to_s3, download_file_from_s3
from tools.file_redaction import choose_and_run_redactor
from tools.file_conversion import prepare_image_or_pdf, get_input_file_names
from tools.redaction_review import apply_redactions_to_review_df_and_files, update_all_page_annotation_object_based_on_previous_page, decrease_page, increase_page, update_annotator_object_and_filter_df, update_entities_df_recogniser_entities, update_entities_df_page, update_entities_df_text, df_select_callback, convert_df_to_xfdf, convert_xfdf_to_dataframe, reset_dropdowns, exclude_selected_items_from_redaction, undo_last_removal, update_selected_review_df_row_colour, update_all_entity_df_dropdowns, df_select_callback_cost, update_other_annotator_number_from_current, update_annotator_page_from_review_df, df_select_callback_ocr, df_select_callback_textract_api
from tools.data_anonymise import anonymise_data_files
from tools.auth import authenticate_user
@@ -44,6 +44,19 @@ else:
default_ocr_val = text_ocr_option
default_pii_detector = local_pii_detector

# Create the gradio interface
app = gr.Blocks(theme = gr.themes.Base(), fill_width=True)
@@ -61,6 +74,9 @@ with app:
all_decision_process_table_state = gr.Dataframe(value=pd.DataFrame(), headers=None, col_count=0, row_count = (0, "dynamic"), label="all_decision_process_table", visible=False, type="pandas", wrap=True)
review_file_state = gr.Dataframe(value=pd.DataFrame(), headers=None, col_count=0, row_count = (0, "dynamic"), label="review_file_df", visible=False, type="pandas", wrap=True)

session_hash_state = gr.Textbox(label= "session_hash_state", value="", visible=False)
host_name_textbox = gr.Textbox(label= "host_name_textbox", value=HOST_NAME, visible=False)
s3_output_folder_state = gr.Textbox(label= "s3_output_folder_state", value="", visible=False)
@@ -105,7 +121,12 @@ with app:
doc_full_file_name_textbox = gr.Textbox(label = "doc_full_file_name_textbox", value="", visible=False)
doc_file_name_no_extension_textbox = gr.Textbox(label = "doc_full_file_name_textbox", value="", visible=False)
blank_doc_file_name_no_extension_textbox_for_logs = gr.Textbox(label = "doc_full_file_name_textbox", value="", visible=False)
doc_file_name_with_extension_textbox = gr.Textbox(label = "doc_file_name_with_extension_textbox", value="", visible=False)
doc_file_name_textbox_list = gr.Dropdown(label = "doc_file_name_textbox_list", value="", allow_custom_value=True,visible=False)
latest_review_file_path = gr.Textbox(label = "latest_review_file_path", value="", visible=False) # Latest review file path output from redaction
@@ -149,9 +170,9 @@ with app:
s3_default_allow_list_file = gr.Textbox(label = "Default allow list file", value=S3_ALLOW_LIST_PATH, visible=False)
default_allow_list_output_folder_location = gr.Textbox(label = "Output default allow list location", value=OUTPUT_ALLOW_LIST_PATH, visible=False)

s3_bulk_textract_default_bucket = gr.Textbox(label = "Default Textract bulk S3 bucket", value=
s3_bulk_textract_input_subfolder = gr.Textbox(label = "Default Textract bulk S3 input folder", value=
s3_bulk_textract_output_subfolder = gr.Textbox(label = "Default Textract bulk S3 output folder", value=
successful_textract_api_call_number = gr.Number(precision=0, value=0, visible=False)
no_redaction_method_drop = gr.Radio(label = """Placeholder for no redaction method after downloading Textract outputs""", value = no_redaction_option, choices=[no_redaction_option], visible=False)
textract_only_method_drop = gr.Radio(label="""Placeholder for Textract method after downloading Textract outputs""", value = textract_option, choices=[textract_option], visible=False)
@@ -184,6 +205,7 @@ with app:
cost_code_choice_drop = gr.Dropdown(value=DEFAULT_COST_CODE, label="Choose cost code for analysis. Please contact Finance if you can't find your cost code in the given list.", choices=[DEFAULT_COST_CODE], allow_custom_value=False, visible=False)

textract_output_found_checkbox = gr.Checkbox(value= False, label="Existing Textract output file found", interactive=False, visible=False)
total_pdf_page_count = gr.Number(label = "Total page count", value=0, visible=False)
estimated_aws_costs_number = gr.Number(label = "Approximate AWS Textract and/or Comprehend cost ($)", value=0, visible=False, precision=2)
estimated_time_taken_number = gr.Number(label = "Approximate time taken to extract text/redact (minutes)", value=0, visible=False, precision=2)
@@ -240,10 +262,14 @@ with app:
if SHOW_COSTS == "True":
with gr.Accordion("Estimated costs and time taken", open = True, visible=True):
with gr.Row(equal_height=True):

if GET_COST_CODES == "True" or ENFORCE_COST_CODES == "True":
with gr.Accordion("Apply cost code", open = True, visible=True):
@@ -253,7 +279,7 @@ with app:
reset_cost_code_dataframe_button = gr.Button(value="Reset code code table filter")
cost_code_choice_drop = gr.Dropdown(value=DEFAULT_COST_CODE, label="Choose cost code for analysis", choices=[DEFAULT_COST_CODE], allow_custom_value=False, visible=True)

if
with gr.Accordion("Submit whole document to AWS Textract API (quicker, max 3,000 pages per document)", open = False, visible=True):
with gr.Row(equal_height=True):
gr.Markdown("""Document will be submitted to AWS Textract API service to extract all text in the document. Processing will take place on (secure) AWS servers, and outputs will be stored on S3 for up to 7 days. To download the results, click 'Check status' below and they will be downloaded if ready.""")
@@ -381,7 +407,7 @@ with app:
###
with gr.Tab(label="Open text or Excel/csv files"):
gr.Markdown("""### Choose open text or a tabular data file (xlsx or csv) to redact.""")
with gr.Accordion("
in_text = gr.Textbox(label="Enter open text", lines=10)
with gr.Accordion("Upload xlsx or csv files", open = True):
in_data_files = gr.File(label="Choose Excel or csv files", file_count= "multiple", file_types=['.xlsx', '.xls', '.csv', '.parquet', '.csv.gz'], height=file_input_height)
@@ -391,6 +417,9 @@ with app:
in_colnames = gr.Dropdown(choices=["Choose columns to anonymise"], multiselect = True, label="Select columns that you want to anonymise (showing columns present across all files).")

pii_identification_method_drop_tabular = gr.Radio(label = "Choose PII detection method. AWS Comprehend has a cost of approximately $0.01 per 10,000 characters.", value = default_pii_detector, choices=[local_pii_detector, aws_pii_detector])

tabular_data_redact_btn = gr.Button("Redact text/data files", variant="primary")
@@ -448,10 +477,10 @@ with app:
aws_access_key_textbox = gr.Textbox(value='', label="AWS access key for account with permissions for AWS Textract and Comprehend", visible=True, type="password")
aws_secret_key_textbox = gr.Textbox(value='', label="AWS secret key for account with permissions for AWS Textract and Comprehend", visible=True, type="password")

anon_strat = gr.Radio(choices=["replace with 'REDACTED'", "replace with <ENTITY_NAME>", "redact completely", "hash", "mask", "encrypt", "fake_first_name"], label="Select an anonymisation method.", value = "replace with 'REDACTED'")

with gr.Accordion("Combine multiple review files", open = False):
multiple_review_files_in_out = gr.File(label="Combine multiple review_file.csv files together here.", file_count='multiple', file_types=['.csv'])
@@ -477,14 +506,17 @@ with app:
handwrite_signature_checkbox.change(calculate_aws_costs, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_aws_costs_number])
textract_output_found_checkbox.change(calculate_aws_costs, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_aws_costs_number])
only_extract_text_radio.change(calculate_aws_costs, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_aws_costs_number])
# Calculate time taken
total_pdf_page_count.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_time_taken_number])
text_extract_method_radio.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_time_taken_number])
pii_identification_method_drop.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_time_taken_number])
handwrite_signature_checkbox.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_time_taken_number])
textract_output_found_checkbox.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_time_taken_number])
only_extract_text_radio.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_time_taken_number])
# Allow user to select items from cost code dataframe for cost code
if SHOW_COSTS=="True" and (GET_COST_CODES == "True" or ENFORCE_COST_CODES == "True"):
@@ -494,27 +526,30 @@ with app:
cost_code_choice_drop.select(update_cost_code_dataframe_from_dropdown_select, inputs=[cost_code_choice_drop, cost_code_dataframe_base], outputs=[cost_code_dataframe])
in_doc_files.upload(fn=get_input_file_names, inputs=[in_doc_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
success(fn = prepare_image_or_pdf, inputs=[in_doc_files, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, first_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool_false, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_base]).\
success(fn=check_for_existing_textract_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[textract_output_found_checkbox])
# Run redaction function
document_redact_btn.click(fn = reset_state_vars, outputs=[all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, textract_metadata_textbox, annotator, output_file_list_state, log_files_output_list_state, recogniser_entity_dataframe, recogniser_entity_dataframe_base, pdf_doc_state, duplication_file_path_outputs_list_state, redaction_output_summary_textbox, is_a_textract_api_call, textract_query_number]).\
success(fn= enforce_cost_codes, inputs=[enforce_cost_code_textbox, cost_code_choice_drop, cost_code_dataframe_base]).\
success(fn= choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, text_extract_method_radio, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, first_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path],
outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path], api_name="redact_doc").\
success(fn=update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])
# If the app has completed a batch of pages, it will rerun the redaction process until the end of all pages in the document
current_loop_page_number.change(fn = choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, text_extract_method_radio, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, second_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path],
outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path]).\
success(fn=update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])
# If a file has been completed, the function will continue onto the next document
latest_file_completed_text.change(fn = choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, text_extract_method_radio, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, second_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path],
|
514 |
-
outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path]).\
|
515 |
success(fn=update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state]).\
|
516 |
success(fn=check_for_existing_textract_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[textract_output_found_checkbox]).\
|
517 |
-
success(fn=
|
|
|
|
|
518 |
|
519 |
# If the line level ocr results are changed by load in by user or by a new redaction task, replace the ocr results displayed in the table
|
520 |
all_line_level_ocr_results_df_base.change(reset_ocr_base_dataframe, inputs=[all_line_level_ocr_results_df_base], outputs=[all_line_level_ocr_results_df])
|
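The `.change(...).success(...)` chains above run each handler only after the previous one completes without an error. A minimal plain-Python sketch of that semantics (the helper and the two callbacks here are illustrative, not the app's real functions):

```python
# Sketch of Gradio-style event chaining: each .success() step runs only if
# the previous step finished without raising; later steps are skipped on failure.
def run_event_chain(steps, state):
    for fn in steps:
        try:
            state = fn(state)
        except Exception:
            break  # remaining handlers do not run, mirroring .success()
    return state

reset_vars = lambda s: {**s, "annotations": []}
run_redactor = lambda s: {**s, "redacted": True}

final_state = run_event_chain([reset_vars, run_redactor], {"annotations": ["old"]})
```

A handler that raises stops the rest of the chain, which is why the app resets state variables in an early step before running the redactor.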
@@ -532,8 +567,8 @@ with app:
532   convert_textract_outputs_to_ocr_results.click(fn=check_for_existing_textract_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[textract_output_found_checkbox]).\
533   success(fn= check_textract_outputs_exist, inputs=[textract_output_found_checkbox]).\
534   success(fn = reset_state_vars, outputs=[all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, textract_metadata_textbox, annotator, output_file_list_state, log_files_output_list_state, recogniser_entity_dataframe, recogniser_entity_dataframe_base, pdf_doc_state, duplication_file_path_outputs_list_state, redaction_output_summary_textbox, is_a_textract_api_call]).\
535 - success(fn= choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, textract_only_method_drop, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, first_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, no_redaction_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path],
536 - outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path])
537
538   ###
539   # REVIEW PDF REDACTIONS
@@ -542,7 +577,7 @@ with app:
542   # Upload previous files for modifying redactions
543   upload_previous_review_file_btn.click(fn=reset_review_vars, inputs=None, outputs=[recogniser_entity_dataframe, recogniser_entity_dataframe_base]).\
544   success(fn=get_input_file_names, inputs=[output_review_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
545 - success(fn = prepare_image_or_pdf, inputs=[output_review_files, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_base], api_name="prepare_doc").\
546   success(update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, annotate_current_page, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs = [annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])
547
548   # Page number controls
@@ -572,9 +607,9 @@ with app:
572   text_entity_dropdown.select(update_entities_df_text, inputs=[text_entity_dropdown, recogniser_entity_dataframe_base, recogniser_entity_dropdown, page_entity_dropdown], outputs=[recogniser_entity_dataframe, recogniser_entity_dropdown, page_entity_dropdown])
573
574   # Clicking on a cell in the recogniser entity dataframe will take you to that page, and also highlight the target redaction box in blue
575 - recogniser_entity_dataframe.select(df_select_callback, inputs=[recogniser_entity_dataframe], outputs=[
576 - success(update_selected_review_df_row_colour, inputs=[selected_entity_dataframe_row, review_file_state, selected_entity_id, selected_entity_colour
577 - success(update_annotator_page_from_review_df, inputs=[review_file_state, images_pdf_state, page_sizes,
578
579   reset_dropdowns_btn.click(reset_dropdowns, inputs=[recogniser_entity_dataframe_base], outputs=[recogniser_entity_dropdown, text_entity_dropdown, page_entity_dropdown]).\
580   success(update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, annotate_current_page, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs = [annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])
@@ -604,12 +639,12 @@ with app:
604
605   # Convert review file to xfdf Adobe format
606   convert_review_file_to_adobe_btn.click(fn=get_input_file_names, inputs=[output_review_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
607 - success(fn = prepare_image_or_pdf, inputs=[output_review_files, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_placeholder]).\
608   success(convert_df_to_xfdf, inputs=[output_review_files, pdf_doc_state, images_pdf_state, output_folder_textbox, document_cropboxes, page_sizes], outputs=[adobe_review_files_out])
609
610   # Convert xfdf Adobe file back to review_file.csv
611   convert_adobe_to_review_file_btn.click(fn=get_input_file_names, inputs=[adobe_review_files_out], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
612 - success(fn = prepare_image_or_pdf, inputs=[adobe_review_files_out, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_placeholder]).\
613   success(fn=convert_xfdf_to_dataframe, inputs=[adobe_review_files_out, pdf_doc_state, images_pdf_state, output_folder_textbox], outputs=[output_review_files], scroll_to_output=True)
614
615   ###
@@ -618,11 +653,14 @@ with app:
618   in_data_files.upload(fn=put_columns_in_df, inputs=[in_data_files], outputs=[in_colnames, in_excel_sheets]).\
619   success(fn=get_input_file_names, inputs=[in_data_files], outputs=[data_file_name_no_extension_textbox, data_file_name_with_extension_textbox, data_full_file_name_textbox, data_file_name_textbox_list, total_pdf_page_count])
620
621 - tabular_data_redact_btn.click(
622
623   # If the output file count text box changes, keep going with redacting each data file until done
624 - text_tabular_files_done.change(fn=anonymise_data_files, inputs=[in_data_files, in_text, anon_strat, in_colnames, in_redact_language, in_redact_entities, in_allow_list_state, text_tabular_files_done, text_output_summary, text_output_file_list_state, log_files_output_list_state, in_excel_sheets, second_loop_state, output_folder_textbox, in_deny_list_state, max_fuzzy_spelling_mistakes_num, pii_identification_method_drop_tabular, in_redact_comprehend_entities, comprehend_query_number, aws_access_key_textbox, aws_secret_key_textbox], outputs=[text_output_summary, text_output_file, text_output_file_list_state, text_tabular_files_done, log_files_output, log_files_output_list_state]).\
625 - success(fn = reveal_feedback_buttons, outputs=[data_feedback_radio, data_further_details_text, data_submit_feedback_btn, data_feedback_title])
626
627   ###
628   # IDENTIFY DUPLICATE PAGES
@@ -654,7 +692,7 @@ with app:
654
655   # Get connection details on app load
656
657 - if
658   app.load(get_connection_params, inputs=[output_folder_textbox, input_folder_textbox, session_output_folder_textbox, s3_bulk_textract_input_subfolder, s3_bulk_textract_output_subfolder, s3_bulk_textract_logs_subfolder, local_bulk_textract_logs_subfolder], outputs=[session_hash_state, output_folder_textbox, session_hash_textbox, input_folder_textbox, s3_bulk_textract_input_subfolder, s3_bulk_textract_output_subfolder, s3_bulk_textract_logs_subfolder, local_bulk_textract_logs_subfolder]).\
659   success(load_in_textract_job_details, inputs=[load_s3_bulk_textract_logs_bool, s3_bulk_textract_logs_subfolder, local_bulk_textract_logs_subfolder], outputs=[textract_job_detail_df])
660   else:
@@ -691,49 +729,71 @@ with app:
691   # LOGGING
692   ###
693
694   # Log usernames and times of access to file (to know who is using the app when running on AWS)
695   access_callback = CSVLogger_custom(dataset_file_name=log_file_name)
696 - access_callback.setup([session_hash_textbox, host_name_textbox], ACCESS_LOGS_FOLDER)
697 -
698 -
699 - success(fn = upload_file_to_s3, inputs=[access_logs_state, access_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
700
701 -
702 -
703 -
704 -
705 -
706
707 -
708 -
709 - data_callback.setup([data_feedback_radio, data_further_details_text, data_full_file_name_textbox], FEEDBACK_LOGS_FOLDER)
710 - data_submit_feedback_btn.click(lambda *args: data_callback.flag(list(args)), [data_feedback_radio, data_further_details_text, data_full_file_name_textbox], None, preprocess=False).\
711 - success(fn = upload_file_to_s3, inputs=[feedback_logs_state, feedback_s3_logs_loc_state], outputs=[data_further_details_text])
712
713 -
714 - usage_callback = CSVLogger_custom(dataset_file_name=log_file_name)
715
716   if DISPLAY_FILE_NAMES_IN_LOGS == 'True':
717   usage_callback.setup([session_hash_textbox, doc_file_name_no_extension_textbox, data_full_file_name_textbox, total_pdf_page_count, actual_time_taken_number, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], USAGE_LOGS_FOLDER)
718
719 - latest_file_completed_text.change(lambda *args: usage_callback.flag(list(args)), [session_hash_textbox, doc_file_name_no_extension_textbox, data_full_file_name_textbox, total_pdf_page_count, actual_time_taken_number, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
720 - success(fn =
721
722 -
723 - success(fn =
724   else:
725 - usage_callback.setup([session_hash_textbox, blank_doc_file_name_no_extension_textbox_for_logs,
726
727 -
728 - success(fn =
729
730 - successful_textract_api_call_number.change(lambda *args: usage_callback.flag(list(args)), [session_hash_textbox,
731 - success(fn =
732
733   if __name__ == "__main__":
734   if RUN_DIRECT_MODE == "0":
735
736 - if
737   app.queue(max_size=int(MAX_QUEUE_SIZE), default_concurrency_limit=int(DEFAULT_CONCURRENCY_LIMIT)).launch(show_error=True, inbrowser=True, auth=authenticate_user, max_file_size=MAX_FILE_SIZE, server_port=GRADIO_SERVER_PORT, root_path=ROOT_PATH)
738   else:
739   app.queue(max_size=int(MAX_QUEUE_SIZE), default_concurrency_limit=int(DEFAULT_CONCURRENCY_LIMIT)).launch(show_error=True, inbrowser=True, max_file_size=MAX_FILE_SIZE, server_port=GRADIO_SERVER_PORT, root_path=ROOT_PATH)
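The `CSVLogger_custom` callbacks above append one row per event (access, feedback, usage) and then push the log file to S3. A minimal stdlib-only sketch of the local half of that pattern (the class name and columns here are illustrative, not the real `tools/custom_csvlogger.py` implementation):

```python
import csv
import os

class SimpleCSVLogger:
    """Append-only CSV event log: write headers once, then one row per flag()."""
    def __init__(self, dataset_file_name: str):
        self.path = dataset_file_name

    def setup(self, headers, folder):
        os.makedirs(folder, exist_ok=True)
        self.path = os.path.join(folder, self.path)
        if not os.path.exists(self.path):
            with open(self.path, "w", newline="") as f:
                csv.writer(f).writerow(headers)

    def flag(self, row):
        # Each logged event is a single appended row
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerow(row)

logger = SimpleCSVLogger("usage_log.csv")
logger.setup(["session_hash", "file_name"], "logs")
logger.flag(["abc123", "report.pdf"])
```

In the app, a lambda wraps `flag()` as a Gradio event handler, and a follow-up `.success()` step uploads the resulting file to S3 (or, per this PR, mirrors rows to DynamoDB).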
4     import gradio as gr
5     from gradio_image_annotation import image_annotator
6
7 +   from tools.config import OUTPUT_FOLDER, INPUT_FOLDER, RUN_DIRECT_MODE, MAX_QUEUE_SIZE, DEFAULT_CONCURRENCY_LIMIT, MAX_FILE_SIZE, GRADIO_SERVER_PORT, ROOT_PATH, GET_DEFAULT_ALLOW_LIST, ALLOW_LIST_PATH, S3_ALLOW_LIST_PATH, FEEDBACK_LOGS_FOLDER, ACCESS_LOGS_FOLDER, USAGE_LOGS_FOLDER, TESSERACT_FOLDER, POPPLER_FOLDER, REDACTION_LANGUAGE, GET_COST_CODES, COST_CODES_PATH, S3_COST_CODES_PATH, ENFORCE_COST_CODES, DISPLAY_FILE_NAMES_IN_LOGS, SHOW_COSTS, RUN_AWS_FUNCTIONS, DOCUMENT_REDACTION_BUCKET, SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS, TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET, TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER, TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_OUTPUT_SUBFOLDER, SESSION_OUTPUT_FOLDER, LOAD_PREVIOUS_TEXTRACT_JOBS_S3, TEXTRACT_JOBS_S3_LOC, TEXTRACT_JOBS_LOCAL_LOC, HOST_NAME, DEFAULT_COST_CODE, OUTPUT_COST_CODES_PATH, OUTPUT_ALLOW_LIST_PATH, COGNITO_AUTH, SAVE_LOGS_TO_CSV, SAVE_LOGS_TO_DYNAMODB, ACCESS_LOG_DYNAMODB_TABLE_NAME, DYNAMODB_ACCESS_LOG_HEADERS, CSV_ACCESS_LOG_HEADERS, FEEDBACK_LOG_DYNAMODB_TABLE_NAME, DYNAMODB_FEEDBACK_LOG_HEADERS, CSV_FEEDBACK_LOG_HEADERS, USAGE_LOG_DYNAMODB_TABLE_NAME, DYNAMODB_USAGE_LOG_HEADERS, CSV_USAGE_LOG_HEADERS
8 +   from tools.helper_functions import put_columns_in_df, get_connection_params, reveal_feedback_buttons, custom_regex_load, reset_state_vars, load_in_default_allow_list, tesseract_ocr_option, text_ocr_option, textract_option, local_pii_detector, aws_pii_detector, no_redaction_option, reset_review_vars, merge_csv_files, load_all_output_files, update_dataframe, check_for_existing_textract_file, load_in_default_cost_codes, enforce_cost_codes, calculate_aws_costs, calculate_time_taken, reset_base_dataframe, reset_ocr_base_dataframe, update_cost_code_dataframe_from_dropdown_select, check_for_existing_local_ocr_file, reset_data_vars, reset_aws_call_vars
9 +   from tools.aws_functions import upload_file_to_s3, download_file_from_s3, upload_log_file_to_s3
10    from tools.file_redaction import choose_and_run_redactor
11 +  from tools.file_conversion import prepare_image_or_pdf, get_input_file_names
12    from tools.redaction_review import apply_redactions_to_review_df_and_files, update_all_page_annotation_object_based_on_previous_page, decrease_page, increase_page, update_annotator_object_and_filter_df, update_entities_df_recogniser_entities, update_entities_df_page, update_entities_df_text, df_select_callback, convert_df_to_xfdf, convert_xfdf_to_dataframe, reset_dropdowns, exclude_selected_items_from_redaction, undo_last_removal, update_selected_review_df_row_colour, update_all_entity_df_dropdowns, df_select_callback_cost, update_other_annotator_number_from_current, update_annotator_page_from_review_df, df_select_callback_ocr, df_select_callback_textract_api
13    from tools.data_anonymise import anonymise_data_files
14    from tools.auth import authenticate_user
44    default_ocr_val = text_ocr_option
45    default_pii_detector = local_pii_detector
46
47 +  SAVE_LOGS_TO_CSV = eval(SAVE_LOGS_TO_CSV)
48 +  SAVE_LOGS_TO_DYNAMODB = eval(SAVE_LOGS_TO_DYNAMODB)
49 +
50 +  if CSV_ACCESS_LOG_HEADERS: CSV_ACCESS_LOG_HEADERS = eval(CSV_ACCESS_LOG_HEADERS)
51 +  if CSV_FEEDBACK_LOG_HEADERS: CSV_FEEDBACK_LOG_HEADERS = eval(CSV_FEEDBACK_LOG_HEADERS)
52 +  if CSV_USAGE_LOG_HEADERS: CSV_USAGE_LOG_HEADERS = eval(CSV_USAGE_LOG_HEADERS)
53 +
54 +  if DYNAMODB_ACCESS_LOG_HEADERS: DYNAMODB_ACCESS_LOG_HEADERS = eval(DYNAMODB_ACCESS_LOG_HEADERS)
55 +  if DYNAMODB_FEEDBACK_LOG_HEADERS: DYNAMODB_FEEDBACK_LOG_HEADERS = eval(DYNAMODB_FEEDBACK_LOG_HEADERS)
56 +  if DYNAMODB_USAGE_LOG_HEADERS: DYNAMODB_USAGE_LOG_HEADERS = eval(DYNAMODB_USAGE_LOG_HEADERS)
57 +
58 +  print
59 +
60    # Create the gradio interface
61    app = gr.Blocks(theme = gr.themes.Base(), fill_width=True)
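The log-header values above arrive as strings from config/environment variables and are parsed with `eval()`. Since they are plain Python list literals, `ast.literal_eval` would do the same job without executing arbitrary code — a sketch under that assumption (the function and variable names here are illustrative, not the app's real helpers):

```python
import ast

def parse_literal_config(raw: str, default=None):
    """Safely parse a config string such as "['col_a', 'col_b']" into a Python value."""
    if not raw:
        return default
    # literal_eval accepts only literals (lists, strings, numbers, ...), never code,
    # and raises ValueError for anything else, e.g. a function call
    return ast.literal_eval(raw)

csv_usage_headers = parse_literal_config("['session_hash', 'file_name', 'time_taken']")
```

This keeps a malicious or malformed environment variable from running code at import time, at no cost for well-formed values.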
74    all_decision_process_table_state = gr.Dataframe(value=pd.DataFrame(), headers=None, col_count=0, row_count = (0, "dynamic"), label="all_decision_process_table", visible=False, type="pandas", wrap=True)
75    review_file_state = gr.Dataframe(value=pd.DataFrame(), headers=None, col_count=0, row_count = (0, "dynamic"), label="review_file_df", visible=False, type="pandas", wrap=True)
76
77 +  all_page_line_level_ocr_results = gr.State([])
78 +  all_page_line_level_ocr_results_with_children = gr.State([])
79 +
80    session_hash_state = gr.Textbox(label= "session_hash_state", value="", visible=False)
81    host_name_textbox = gr.Textbox(label= "host_name_textbox", value=HOST_NAME, visible=False)
82    s3_output_folder_state = gr.Textbox(label= "s3_output_folder_state", value="", visible=False)

121
122   doc_full_file_name_textbox = gr.Textbox(label = "doc_full_file_name_textbox", value="", visible=False)
123   doc_file_name_no_extension_textbox = gr.Textbox(label = "doc_full_file_name_textbox", value="", visible=False)
124 + blank_doc_file_name_no_extension_textbox_for_logs = gr.Textbox(label = "doc_full_file_name_textbox", value="", visible=False)
125 + blank_data_file_name_no_extension_textbox_for_logs = gr.Textbox(label = "data_full_file_name_textbox", value="", visible=False)
126 + placeholder_doc_file_name_no_extension_textbox_for_logs = gr.Textbox(label = "doc_full_file_name_textbox", value="document", visible=False)
127 + placeholder_data_file_name_no_extension_textbox_for_logs = gr.Textbox(label = "data_full_file_name_textbox", value="data_file", visible=False)
128 +
129 + # Left blank for when user does not want to report file names
130   doc_file_name_with_extension_textbox = gr.Textbox(label = "doc_file_name_with_extension_textbox", value="", visible=False)
131   doc_file_name_textbox_list = gr.Dropdown(label = "doc_file_name_textbox_list", value="", allow_custom_value=True,visible=False)
132   latest_review_file_path = gr.Textbox(label = "latest_review_file_path", value="", visible=False) # Latest review file path output from redaction
170   s3_default_allow_list_file = gr.Textbox(label = "Default allow list file", value=S3_ALLOW_LIST_PATH, visible=False)
171   default_allow_list_output_folder_location = gr.Textbox(label = "Output default allow list location", value=OUTPUT_ALLOW_LIST_PATH, visible=False)
172
173 + s3_bulk_textract_default_bucket = gr.Textbox(label = "Default Textract bulk S3 bucket", value=TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET, visible=False)
174 + s3_bulk_textract_input_subfolder = gr.Textbox(label = "Default Textract bulk S3 input folder", value=TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER, visible=False)
175 + s3_bulk_textract_output_subfolder = gr.Textbox(label = "Default Textract bulk S3 output folder", value=TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_OUTPUT_SUBFOLDER, visible=False)
176   successful_textract_api_call_number = gr.Number(precision=0, value=0, visible=False)
177   no_redaction_method_drop = gr.Radio(label = """Placeholder for no redaction method after downloading Textract outputs""", value = no_redaction_option, choices=[no_redaction_option], visible=False)
178   textract_only_method_drop = gr.Radio(label="""Placeholder for Textract method after downloading Textract outputs""", value = textract_option, choices=[textract_option], visible=False)
205   cost_code_choice_drop = gr.Dropdown(value=DEFAULT_COST_CODE, label="Choose cost code for analysis. Please contact Finance if you can't find your cost code in the given list.", choices=[DEFAULT_COST_CODE], allow_custom_value=False, visible=False)
206
207   textract_output_found_checkbox = gr.Checkbox(value= False, label="Existing Textract output file found", interactive=False, visible=False)
208 + local_ocr_output_found_checkbox = gr.Checkbox(value= False, label="Existing local OCR output file found", interactive=False, visible=False)
209   total_pdf_page_count = gr.Number(label = "Total page count", value=0, visible=False)
210   estimated_aws_costs_number = gr.Number(label = "Approximate AWS Textract and/or Comprehend cost ($)", value=0, visible=False, precision=2)
211   estimated_time_taken_number = gr.Number(label = "Approximate time taken to extract text/redact (minutes)", value=0, visible=False, precision=2)
262   if SHOW_COSTS == "True":
263   with gr.Accordion("Estimated costs and time taken", open = True, visible=True):
264   with gr.Row(equal_height=True):
265 + with gr.Column(scale=1):
266 + textract_output_found_checkbox = gr.Checkbox(value= False, label="Existing Textract output file found", interactive=False, visible=True)
267 + local_ocr_output_found_checkbox = gr.Checkbox(value= False, label="Existing local OCR output file found", interactive=False, visible=True)
268 + with gr.Column(scale=4):
269 + with gr.Row(equal_height=True):
270 + total_pdf_page_count = gr.Number(label = "Total page count", value=0, visible=True)
271 + estimated_aws_costs_number = gr.Number(label = "Approximate AWS Textract and/or Comprehend cost (£)", value=0.00, precision=2, visible=True)
272 + estimated_time_taken_number = gr.Number(label = "Approximate time taken to extract text/redact (minutes)", value=0, visible=True, precision=2)
273
274   if GET_COST_CODES == "True" or ENFORCE_COST_CODES == "True":
275   with gr.Accordion("Apply cost code", open = True, visible=True):

279   reset_cost_code_dataframe_button = gr.Button(value="Reset code code table filter")
280   cost_code_choice_drop = gr.Dropdown(value=DEFAULT_COST_CODE, label="Choose cost code for analysis", choices=[DEFAULT_COST_CODE], allow_custom_value=False, visible=True)
281
282 + if SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS == "True":
283   with gr.Accordion("Submit whole document to AWS Textract API (quicker, max 3,000 pages per document)", open = False, visible=True):
284   with gr.Row(equal_height=True):
285   gr.Markdown("""Document will be submitted to AWS Textract API service to extract all text in the document. Processing will take place on (secure) AWS servers, and outputs will be stored on S3 for up to 7 days. To download the results, click 'Check status' below and they will be downloaded if ready.""")

407   ###
408   with gr.Tab(label="Open text or Excel/csv files"):
409   gr.Markdown("""### Choose open text or a tabular data file (xlsx or csv) to redact.""")
410 + with gr.Accordion("Redact open text", open = False):
411   in_text = gr.Textbox(label="Enter open text", lines=10)
412   with gr.Accordion("Upload xlsx or csv files", open = True):
413   in_data_files = gr.File(label="Choose Excel or csv files", file_count= "multiple", file_types=['.xlsx', '.xls', '.csv', '.parquet', '.csv.gz'], height=file_input_height)
417   in_colnames = gr.Dropdown(choices=["Choose columns to anonymise"], multiselect = True, label="Select columns that you want to anonymise (showing columns present across all files).")
418
419   pii_identification_method_drop_tabular = gr.Radio(label = "Choose PII detection method. AWS Comprehend has a cost of approximately $0.01 per 10,000 characters.", value = default_pii_detector, choices=[local_pii_detector, aws_pii_detector])
420 +
421 + with gr.Accordion("Anonymisation output format", open = False):
422 + anon_strat = gr.Radio(choices=["replace with 'REDACTED'", "replace with <ENTITY_NAME>", "redact completely", "hash", "mask"], label="Select an anonymisation method.", value = "replace with 'REDACTED'") # , "encrypt", "fake_first_name" are also available, but are not currently included as not that useful in current form
423
424   tabular_data_redact_btn = gr.Button("Redact text/data files", variant="primary")
425
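The `anon_strat` radio above selects how matched PII is transformed in tabular outputs. A hedged sketch of three of the listed strategies (the use of SHA-256 for "hash" is an assumption here, not necessarily what `tools/data_anonymise.py` does):

```python
import hashlib

def anonymise(text: str, strategy: str) -> str:
    """Transform one matched PII string according to the chosen strategy."""
    if strategy == "replace with 'REDACTED'":
        return "REDACTED"
    if strategy == "hash":
        # Assumed: a short stable digest so repeated values stay linkable
        return hashlib.sha256(text.encode()).hexdigest()[:12]
    if strategy == "mask":
        return "*" * len(text)  # preserve length, hide content
    raise ValueError(f"Unknown strategy: {strategy}")

masked = anonymise("Jane Doe", "mask")
```

Masking preserves string length (useful for layout), while hashing keeps distinct values distinguishable without revealing them.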
477   aws_access_key_textbox = gr.Textbox(value='', label="AWS access key for account with permissions for AWS Textract and Comprehend", visible=True, type="password")
478   aws_secret_key_textbox = gr.Textbox(value='', label="AWS secret key for account with permissions for AWS Textract and Comprehend", visible=True, type="password")
479
480 +
481
482 + with gr.Accordion("Log file outputs", open = False):
483 + log_files_output = gr.File(label="Log file output", interactive=False)
484
485   with gr.Accordion("Combine multiple review files", open = False):
486   multiple_review_files_in_out = gr.File(label="Combine multiple review_file.csv files together here.", file_count='multiple', file_types=['.csv'])
506   handwrite_signature_checkbox.change(calculate_aws_costs, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_aws_costs_number])
507   textract_output_found_checkbox.change(calculate_aws_costs, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_aws_costs_number])
508   only_extract_text_radio.change(calculate_aws_costs, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_aws_costs_number])
509 + textract_output_found_checkbox.change(calculate_aws_costs, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio], outputs=[estimated_aws_costs_number])
510
511   # Calculate time taken
512 + total_pdf_page_count.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
513 + text_extract_method_radio.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
514 + pii_identification_method_drop.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
515 + handwrite_signature_checkbox.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
516 + textract_output_found_checkbox.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, handwrite_signature_checkbox, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
517 + only_extract_text_radio.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
518 + textract_output_found_checkbox.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
519 + local_ocr_output_found_checkbox.change(calculate_time_taken, inputs=[total_pdf_page_count, text_extract_method_radio, pii_identification_method_drop, textract_output_found_checkbox, only_extract_text_radio, local_ocr_output_found_checkbox], outputs=[estimated_time_taken_number])
520
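`calculate_time_taken` above is re-run whenever any of its inputs change, so the displayed estimate tracks the current settings. A hedged sketch of what such an estimator can look like (the per-page rates below are made-up placeholders, not the app's real constants):

```python
def estimate_minutes(page_count: int, use_textract: bool, textract_output_found: bool) -> float:
    """Rough processing-time estimate in minutes for one document."""
    if use_textract and not textract_output_found:
        seconds_per_page = 2.0   # assumed rate when a remote OCR call is still needed
    else:
        seconds_per_page = 0.5   # assumed rate when existing OCR output can be reused
    return round(page_count * seconds_per_page / 60, 2)

estimate = estimate_minutes(120, use_textract=True, textract_output_found=False)
```

The key design point mirrored from the app: finding an existing Textract or local OCR output short-circuits the expensive extraction step, which is why the "output found" checkboxes feed both the cost and the time estimates.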
521 |
# Allow user to select items from cost code dataframe for cost code
|
522 |
if SHOW_COSTS=="True" and (GET_COST_CODES == "True" or ENFORCE_COST_CODES == "True"):
|
|
|
526 |
cost_code_choice_drop.select(update_cost_code_dataframe_from_dropdown_select, inputs=[cost_code_choice_drop, cost_code_dataframe_base], outputs=[cost_code_dataframe])
|
527 |
|
528 |
in_doc_files.upload(fn=get_input_file_names, inputs=[in_doc_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
|
529 |
+
success(fn = prepare_image_or_pdf, inputs=[in_doc_files, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, first_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool_false, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_base, local_ocr_output_found_checkbox]).\
|
530 |
+
success(fn=check_for_existing_textract_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[textract_output_found_checkbox]).\
|
531 |
+
success(fn=check_for_existing_local_ocr_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[local_ocr_output_found_checkbox])

# Run redaction function
document_redact_btn.click(fn = reset_state_vars, outputs=[all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, textract_metadata_textbox, annotator, output_file_list_state, log_files_output_list_state, recogniser_entity_dataframe, recogniser_entity_dataframe_base, pdf_doc_state, duplication_file_path_outputs_list_state, redaction_output_summary_textbox, is_a_textract_api_call, textract_query_number]).\
success(fn= enforce_cost_codes, inputs=[enforce_cost_code_textbox, cost_code_choice_drop, cost_code_dataframe_base]).\
+success(fn= choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, text_extract_method_radio, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, first_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children],
+outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children], api_name="redact_doc").\
success(fn=update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])

# If the app has completed a batch of pages, it will rerun the redaction process until the end of all pages in the document
+current_loop_page_number.change(fn = choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, text_extract_method_radio, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, second_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children],
+outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children]).\
success(fn=update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])
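The `current_loop_page_number.change` handler above implements a self-retriggering loop: each run of `choose_and_run_redactor` processes one batch of pages and writes the next start page back to the same component, whose change event fires the function again until the document is done. A minimal, framework-free sketch of that pattern (batch size and names are illustrative assumptions):

```python
# Illustrative simulation of the rerun-until-done pattern: each call
# processes one batch and returns the next start page; the caller keeps
# re-invoking until the whole document is covered, mirroring how the
# .change event re-fires choose_and_run_redactor.
BATCH_SIZE = 5  # assumed batch size, not the app's real value

def process_batch(pages: list, start: int) -> tuple:
    """Process one batch of pages; return (processed_pages, next_start)."""
    end = min(start + BATCH_SIZE, len(pages))
    processed = [f"redacted:{p}" for p in pages[start:end]]
    return processed, end

def run_all(pages: list) -> list:
    # Stand-in for the chain of .change events: keep re-running while the
    # returned "current loop page" is short of the document length.
    results, start = [], 0
    while start < len(pages):
        batch, start = process_batch(pages, start)
        results.extend(batch)
    return results
```

Driving the loop through a component's change event keeps the UI responsive between batches, since each batch is a separate Gradio event rather than one long-running call.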

# If a file has been completed, the function will continue onto the next document
+latest_file_completed_text.change(fn = choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, text_extract_method_radio, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, second_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, pii_identification_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children],
+outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children]).\
success(fn=update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, page_min, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs=[annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state]).\
success(fn=check_for_existing_textract_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[textract_output_found_checkbox]).\
+success(fn=check_for_existing_local_ocr_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[local_ocr_output_found_checkbox]).\
+success(fn=reveal_feedback_buttons, outputs=[pdf_feedback_radio, pdf_further_details_text, pdf_submit_feedback_btn, pdf_feedback_title]).\
+success(fn = reset_aws_call_vars, outputs=[comprehend_query_number, textract_query_number])

# If the line level ocr results are changed by load in by user or by a new redaction task, replace the ocr results displayed in the table
all_line_level_ocr_results_df_base.change(reset_ocr_base_dataframe, inputs=[all_line_level_ocr_results_df_base], outputs=[all_line_level_ocr_results_df])

convert_textract_outputs_to_ocr_results.click(fn=check_for_existing_textract_file, inputs=[doc_file_name_no_extension_textbox, output_folder_textbox], outputs=[textract_output_found_checkbox]).\
success(fn= check_textract_outputs_exist, inputs=[textract_output_found_checkbox]).\
success(fn = reset_state_vars, outputs=[all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, textract_metadata_textbox, annotator, output_file_list_state, log_files_output_list_state, recogniser_entity_dataframe, recogniser_entity_dataframe_base, pdf_doc_state, duplication_file_path_outputs_list_state, redaction_output_summary_textbox, is_a_textract_api_call]).\
+success(fn= choose_and_run_redactor, inputs=[in_doc_files, prepared_pdf_state, images_pdf_state, in_redact_language, in_redact_entities, in_redact_comprehend_entities, textract_only_method_drop, in_allow_list_state, in_deny_list_state, in_fully_redacted_list_state, latest_file_completed_text, redaction_output_summary_textbox, output_file_list_state, log_files_output_list_state, first_loop_state, page_min, page_max, actual_time_taken_number, handwrite_signature_checkbox, textract_metadata_textbox, all_image_annotations_state, all_line_level_ocr_results_df_base, all_decision_process_table_state, pdf_doc_state, current_loop_page_number, page_break_return, no_redaction_method_drop, comprehend_query_number, max_fuzzy_spelling_mistakes_num, match_fuzzy_whole_phrase_bool, aws_access_key_textbox, aws_secret_key_textbox, annotate_max_pages, review_file_state, output_folder_textbox, document_cropboxes, page_sizes, textract_output_found_checkbox, only_extract_text_radio, duplication_file_path_outputs_list_state, latest_review_file_path, input_folder_textbox, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children],
+outputs=[redaction_output_summary_textbox, output_file, output_file_list_state, latest_file_completed_text, log_files_output, log_files_output_list_state, actual_time_taken_number, textract_metadata_textbox, pdf_doc_state, all_image_annotations_state, current_loop_page_number, page_break_return, all_line_level_ocr_results_df_base, all_decision_process_table_state, comprehend_query_number, output_review_files, annotate_max_pages, annotate_max_pages_bottom, prepared_pdf_state, images_pdf_state, review_file_state, page_sizes, duplication_file_path_outputs_list_state, in_duplicate_pages, latest_review_file_path, textract_query_number, latest_ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_children])

###
# REVIEW PDF REDACTIONS

# Upload previous files for modifying redactions
upload_previous_review_file_btn.click(fn=reset_review_vars, inputs=None, outputs=[recogniser_entity_dataframe, recogniser_entity_dataframe_base]).\
success(fn=get_input_file_names, inputs=[output_review_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
+success(fn = prepare_image_or_pdf, inputs=[output_review_files, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_base, local_ocr_output_found_checkbox], api_name="prepare_doc").\
success(update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, annotate_current_page, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs = [annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])

# Page number controls

text_entity_dropdown.select(update_entities_df_text, inputs=[text_entity_dropdown, recogniser_entity_dataframe_base, recogniser_entity_dropdown, page_entity_dropdown], outputs=[recogniser_entity_dataframe, recogniser_entity_dropdown, page_entity_dropdown])

# Clicking on a cell in the recogniser entity dataframe will take you to that page, and also highlight the target redaction box in blue
+recogniser_entity_dataframe.select(df_select_callback, inputs=[recogniser_entity_dataframe], outputs=[selected_entity_dataframe_row]).\
+success(update_selected_review_df_row_colour, inputs=[selected_entity_dataframe_row, review_file_state, selected_entity_id, selected_entity_colour], outputs=[review_file_state, selected_entity_id, selected_entity_colour]).\
+success(update_annotator_page_from_review_df, inputs=[review_file_state, images_pdf_state, page_sizes, all_image_annotations_state, annotator, selected_entity_dataframe_row, input_folder_textbox, doc_full_file_name_textbox], outputs=[annotator, all_image_annotations_state, annotate_current_page, page_sizes, review_file_state, annotate_previous_page])

reset_dropdowns_btn.click(reset_dropdowns, inputs=[recogniser_entity_dataframe_base], outputs=[recogniser_entity_dropdown, text_entity_dropdown, page_entity_dropdown]).\
success(update_annotator_object_and_filter_df, inputs=[all_image_annotations_state, annotate_current_page, recogniser_entity_dropdown, page_entity_dropdown, text_entity_dropdown, recogniser_entity_dataframe_base, annotator_zoom_number, review_file_state, page_sizes, doc_full_file_name_textbox, input_folder_textbox], outputs = [annotator, annotate_current_page, annotate_current_page_bottom, annotate_previous_page, recogniser_entity_dropdown, recogniser_entity_dataframe, recogniser_entity_dataframe_base, text_entity_dropdown, page_entity_dropdown, page_sizes, all_image_annotations_state])

# Convert review file to xfdf Adobe format
convert_review_file_to_adobe_btn.click(fn=get_input_file_names, inputs=[output_review_files], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
+success(fn = prepare_image_or_pdf, inputs=[output_review_files, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_placeholder, local_ocr_output_found_checkbox]).\
success(convert_df_to_xfdf, inputs=[output_review_files, pdf_doc_state, images_pdf_state, output_folder_textbox, document_cropboxes, page_sizes], outputs=[adobe_review_files_out])

# Convert xfdf Adobe file back to review_file.csv
convert_adobe_to_review_file_btn.click(fn=get_input_file_names, inputs=[adobe_review_files_out], outputs=[doc_file_name_no_extension_textbox, doc_file_name_with_extension_textbox, doc_full_file_name_textbox, doc_file_name_textbox_list, total_pdf_page_count]).\
+success(fn = prepare_image_or_pdf, inputs=[adobe_review_files_out, text_extract_method_radio, latest_file_completed_text, redaction_output_summary_textbox, second_loop_state, annotate_max_pages, all_image_annotations_state, prepare_for_review_bool, in_fully_redacted_list_state, output_folder_textbox, input_folder_textbox, prepare_images_bool_false], outputs=[redaction_output_summary_textbox, prepared_pdf_state, images_pdf_state, annotate_max_pages, annotate_max_pages_bottom, pdf_doc_state, all_image_annotations_state, review_file_state, document_cropboxes, page_sizes, textract_output_found_checkbox, all_img_details_state, all_line_level_ocr_results_df_placeholder, local_ocr_output_found_checkbox]).\
success(fn=convert_xfdf_to_dataframe, inputs=[adobe_review_files_out, pdf_doc_state, images_pdf_state, output_folder_textbox], outputs=[output_review_files], scroll_to_output=True)
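The `convert_df_to_xfdf` step above exports review rows as Adobe XFDF so redaction boxes can be inspected in Acrobat. A minimal sketch of the shape such a conversion produces is below; the element names follow the standard XFDF namespace, but the row fields are assumptions, not the app's exact `review_file.csv` schema:

```python
# Illustrative: serialise review rows into a minimal XFDF annotation file.
# Row keys (page, xmin, ymin, xmax, ymax, label) are assumed for the sketch.
import xml.etree.ElementTree as ET

XFDF_NS = "http://ns.adobe.com/xfdf/"

def rows_to_xfdf(rows: list) -> str:
    """rows: dicts with 0-based page and box coordinates in PDF points."""
    ET.register_namespace("", XFDF_NS)
    xfdf = ET.Element(f"{{{XFDF_NS}}}xfdf")
    annots = ET.SubElement(xfdf, f"{{{XFDF_NS}}}annots")
    for row in rows:
        # One square annotation per redaction box.
        ET.SubElement(annots, f"{{{XFDF_NS}}}square", {
            "page": str(row["page"]),
            "rect": f'{row["xmin"]},{row["ymin"]},{row["xmax"]},{row["ymax"]}',
            "title": row.get("label", "Redaction"),
        })
    return ET.tostring(xfdf, encoding="unicode")
```

The reverse path (`convert_xfdf_to_dataframe`) would parse the same elements back into rows, which is why the round trip through Adobe preserves the review state.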

###

in_data_files.upload(fn=put_columns_in_df, inputs=[in_data_files], outputs=[in_colnames, in_excel_sheets]).\
success(fn=get_input_file_names, inputs=[in_data_files], outputs=[data_file_name_no_extension_textbox, data_file_name_with_extension_textbox, data_full_file_name_textbox, data_file_name_textbox_list, total_pdf_page_count])

+tabular_data_redact_btn.click(reset_data_vars, outputs=[actual_time_taken_number, log_files_output_list_state, comprehend_query_number]).\
+success(fn=anonymise_data_files, inputs=[in_data_files, in_text, anon_strat, in_colnames, in_redact_language, in_redact_entities, in_allow_list_state, text_tabular_files_done, text_output_summary, text_output_file_list_state, log_files_output_list_state, in_excel_sheets, first_loop_state, output_folder_textbox, in_deny_list_state, max_fuzzy_spelling_mistakes_num, pii_identification_method_drop_tabular, in_redact_comprehend_entities, comprehend_query_number, aws_access_key_textbox, aws_secret_key_textbox, actual_time_taken_number], outputs=[text_output_summary, text_output_file, text_output_file_list_state, text_tabular_files_done, log_files_output, log_files_output_list_state, actual_time_taken_number], api_name="redact_data").\
+success(fn = reveal_feedback_buttons, outputs=[data_feedback_radio, data_further_details_text, data_submit_feedback_btn, data_feedback_title])

+# Currently only supports redacting one data file at a time
# If the output file count text box changes, keep going with redacting each data file until done
+# text_tabular_files_done.change(fn=anonymise_data_files, inputs=[in_data_files, in_text, anon_strat, in_colnames, in_redact_language, in_redact_entities, in_allow_list_state, text_tabular_files_done, text_output_summary, text_output_file_list_state, log_files_output_list_state, in_excel_sheets, second_loop_state, output_folder_textbox, in_deny_list_state, max_fuzzy_spelling_mistakes_num, pii_identification_method_drop_tabular, in_redact_comprehend_entities, comprehend_query_number, aws_access_key_textbox, aws_secret_key_textbox, actual_time_taken_number], outputs=[text_output_summary, text_output_file, text_output_file_list_state, text_tabular_files_done, log_files_output, log_files_output_list_state, actual_time_taken_number]).\
+# success(fn = reveal_feedback_buttons, outputs=[data_feedback_radio, data_further_details_text, data_submit_feedback_btn, data_feedback_title])

###
# IDENTIFY DUPLICATE PAGES

# Get connection details on app load

+if SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS == "True":
    app.load(get_connection_params, inputs=[output_folder_textbox, input_folder_textbox, session_output_folder_textbox, s3_bulk_textract_input_subfolder, s3_bulk_textract_output_subfolder, s3_bulk_textract_logs_subfolder, local_bulk_textract_logs_subfolder], outputs=[session_hash_state, output_folder_textbox, session_hash_textbox, input_folder_textbox, s3_bulk_textract_input_subfolder, s3_bulk_textract_output_subfolder, s3_bulk_textract_logs_subfolder, local_bulk_textract_logs_subfolder]).\
    success(load_in_textract_job_details, inputs=[load_s3_bulk_textract_logs_bool, s3_bulk_textract_logs_subfolder, local_bulk_textract_logs_subfolder], outputs=[textract_job_detail_df])
else:

# LOGGING
###

+### ACCESS LOGS
# Log usernames and times of access to file (to know who is using the app when running on AWS)
access_callback = CSVLogger_custom(dataset_file_name=log_file_name)
+access_callback.setup([session_hash_textbox, host_name_textbox], ACCESS_LOGS_FOLDER)
+session_hash_textbox.change(lambda *args: access_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=ACCESS_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_ACCESS_LOG_HEADERS, replacement_headers=CSV_ACCESS_LOG_HEADERS), [session_hash_textbox, host_name_textbox], None, preprocess=False).\
+success(fn = upload_log_file_to_s3, inputs=[access_logs_state, access_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
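The access-log wiring above appends one row per session via `CSVLogger_custom.flag()`. A self-contained sketch of that append-a-timestamped-row behaviour is below; the class and method shapes are assumptions inferred from the calls in this section, not the real `tools/custom_csvlogger.py` API:

```python
# Illustrative mini-logger: create the dataset CSV with headers on first
# use, then append one timestamped row per flag() call.
import csv
import datetime
from pathlib import Path

class MiniCSVLogger:
    def setup(self, headers: list, folder: str, dataset_file_name: str = "log.csv"):
        self.path = Path(folder) / dataset_file_name
        self.headers = list(headers) + ["timestamp"]
        self.path.parent.mkdir(parents=True, exist_ok=True)
        if not self.path.exists():
            with open(self.path, "w", newline="") as f:
                csv.writer(f).writerow(self.headers)

    def flag(self, values: list) -> int:
        """Append one row; return the total number of logged rows."""
        row = list(values) + [datetime.datetime.now().isoformat()]
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerow(row)
        with open(self.path, newline="") as f:
            return sum(1 for _ in f) - 1  # exclude the header row
```

The `lambda *args: callback.flag(list(args), ...)` pattern in the app exists because Gradio passes each input component's value as a positional argument; wrapping them in a list lets one logger signature handle any set of inputs.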

+### FEEDBACK LOGS
+if DISPLAY_FILE_NAMES_IN_LOGS == 'True':
+    # User submitted feedback for pdf redactions
+    pdf_callback = CSVLogger_custom(dataset_file_name=log_file_name)
+    pdf_callback.setup([pdf_feedback_radio, pdf_further_details_text, doc_file_name_no_extension_textbox], FEEDBACK_LOGS_FOLDER)
+    pdf_submit_feedback_btn.click(lambda *args: pdf_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=FEEDBACK_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_FEEDBACK_LOG_HEADERS, replacement_headers=CSV_FEEDBACK_LOG_HEADERS), [pdf_feedback_radio, pdf_further_details_text, doc_file_name_no_extension_textbox], None, preprocess=False).\
+    success(fn = upload_log_file_to_s3, inputs=[feedback_logs_state, feedback_s3_logs_loc_state], outputs=[pdf_further_details_text])
+
+    # User submitted feedback for data redactions
+    data_callback = CSVLogger_custom(dataset_file_name=log_file_name)
+    data_callback.setup([data_feedback_radio, data_further_details_text, data_full_file_name_textbox], FEEDBACK_LOGS_FOLDER)
+    data_submit_feedback_btn.click(lambda *args: data_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=FEEDBACK_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_FEEDBACK_LOG_HEADERS, replacement_headers=CSV_FEEDBACK_LOG_HEADERS), [data_feedback_radio, data_further_details_text, data_full_file_name_textbox], None, preprocess=False).\
+    success(fn = upload_log_file_to_s3, inputs=[feedback_logs_state, feedback_s3_logs_loc_state], outputs=[data_further_details_text])
+else:
+    # User submitted feedback for pdf redactions
+    pdf_callback = CSVLogger_custom(dataset_file_name=log_file_name)
+    pdf_callback.setup([pdf_feedback_radio, pdf_further_details_text, doc_file_name_no_extension_textbox], FEEDBACK_LOGS_FOLDER)
+    pdf_submit_feedback_btn.click(lambda *args: pdf_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=FEEDBACK_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_FEEDBACK_LOG_HEADERS, replacement_headers=CSV_FEEDBACK_LOG_HEADERS), [pdf_feedback_radio, pdf_further_details_text, placeholder_doc_file_name_no_extension_textbox_for_logs], None, preprocess=False).\
+    success(fn = upload_log_file_to_s3, inputs=[feedback_logs_state, feedback_s3_logs_loc_state], outputs=[pdf_further_details_text])
+
+    # User submitted feedback for data redactions
+    data_callback = CSVLogger_custom(dataset_file_name=log_file_name)
+    data_callback.setup([data_feedback_radio, data_further_details_text, data_full_file_name_textbox], FEEDBACK_LOGS_FOLDER)
+    data_submit_feedback_btn.click(lambda *args: data_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=FEEDBACK_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_FEEDBACK_LOG_HEADERS, replacement_headers=CSV_FEEDBACK_LOG_HEADERS), [data_feedback_radio, data_further_details_text, placeholder_data_file_name_no_extension_textbox_for_logs], None, preprocess=False).\
+    success(fn = upload_log_file_to_s3, inputs=[feedback_logs_state, feedback_s3_logs_loc_state], outputs=[data_further_details_text])

+### USAGE LOGS
+# Log processing usage - time taken for redaction queries, and also logs for queries to Textract/Comprehend

+usage_callback = CSVLogger_custom(dataset_file_name=log_file_name)

if DISPLAY_FILE_NAMES_IN_LOGS == 'True':
    usage_callback.setup([session_hash_textbox, doc_file_name_no_extension_textbox, data_full_file_name_textbox, total_pdf_page_count, actual_time_taken_number, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], USAGE_LOGS_FOLDER)

+    latest_file_completed_text.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, doc_file_name_no_extension_textbox, data_full_file_name_textbox, total_pdf_page_count, actual_time_taken_number, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
+    success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])

+    text_tabular_files_done.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, doc_file_name_no_extension_textbox, data_full_file_name_textbox, total_pdf_page_count, actual_time_taken_number, textract_query_number, pii_identification_method_drop_tabular, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
+    success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])

+    successful_textract_api_call_number.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, doc_file_name_no_extension_textbox, data_full_file_name_textbox, total_pdf_page_count, actual_time_taken_number, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
+    success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
else:
+    usage_callback.setup([session_hash_textbox, blank_doc_file_name_no_extension_textbox_for_logs, blank_data_file_name_no_extension_textbox_for_logs, actual_time_taken_number, total_pdf_page_count, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], USAGE_LOGS_FOLDER)

+    latest_file_completed_text.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, placeholder_doc_file_name_no_extension_textbox_for_logs, blank_data_file_name_no_extension_textbox_for_logs, actual_time_taken_number, total_pdf_page_count, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
+    success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])

+    text_tabular_files_done.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, blank_doc_file_name_no_extension_textbox_for_logs, placeholder_data_file_name_no_extension_textbox_for_logs, actual_time_taken_number, total_pdf_page_count, textract_query_number, pii_identification_method_drop_tabular, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
+    success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])

+    successful_textract_api_call_number.change(lambda *args: usage_callback.flag(list(args), save_to_csv=SAVE_LOGS_TO_CSV, save_to_dynamodb=SAVE_LOGS_TO_DYNAMODB, dynamodb_table_name=USAGE_LOG_DYNAMODB_TABLE_NAME, dynamodb_headers=DYNAMODB_USAGE_LOG_HEADERS, replacement_headers=CSV_USAGE_LOG_HEADERS), [session_hash_textbox, placeholder_doc_file_name_no_extension_textbox_for_logs, blank_data_file_name_no_extension_textbox_for_logs, actual_time_taken_number, total_pdf_page_count, textract_query_number, pii_identification_method_drop, comprehend_query_number, cost_code_choice_drop, handwrite_signature_checkbox, host_name_textbox, text_extract_method_radio, is_a_textract_api_call], None, preprocess=False).\
+    success(fn = upload_log_file_to_s3, inputs=[usage_logs_state, usage_s3_logs_loc_state], outputs=[s3_logs_output_textbox])
|
792 |
|
793 |
if __name__ == "__main__":
|
794 |
if RUN_DIRECT_MODE == "0":
|
795 |
|
796 |
+
if COGNITO_AUTH == "1":
|
797 |
app.queue(max_size=int(MAX_QUEUE_SIZE), default_concurrency_limit=int(DEFAULT_CONCURRENCY_LIMIT)).launch(show_error=True, inbrowser=True, auth=authenticate_user, max_file_size=MAX_FILE_SIZE, server_port=GRADIO_SERVER_PORT, root_path=ROOT_PATH)
|
798 |
else:
|
799 |
app.queue(max_size=int(MAX_QUEUE_SIZE), default_concurrency_limit=int(DEFAULT_CONCURRENCY_LIMIT)).launch(show_error=True, inbrowser=True, max_file_size=MAX_FILE_SIZE, server_port=GRADIO_SERVER_PORT, root_path=ROOT_PATH)
|
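The three `.change(...).success(...)` chains above all follow the same pattern: flag one row of usage metadata to a local CSV and, when DynamoDB logging is enabled, to a second sink, then push the CSV to S3. A minimal standard-library sketch of that dual-sink flagging idea (the function `log_usage_row` and the `dynamodb_sink` callable are illustrative stand-ins, not the app's actual API):

```python
import csv
import os
from typing import Callable, Optional

def log_usage_row(
    row: dict,
    csv_path: str,
    save_to_csv: bool = True,
    dynamodb_sink: Optional[Callable[[dict], None]] = None,
) -> None:
    """Append one usage-log row to a CSV file and, if configured, a second sink."""
    if save_to_csv:
        write_header = not os.path.exists(csv_path)
        with open(csv_path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(row))
            if write_header:
                writer.writeheader()  # header only on first write
            writer.writerow(row)
    if dynamodb_sink is not None:
        # In the real app this second sink would be a boto3 DynamoDB write.
        dynamodb_sink(row)
```

In the app itself the second sink targets the table named by `USAGE_LOG_DYNAMODB_TABLE_NAME`, and the CSV is subsequently uploaded by `upload_log_file_to_s3`.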
pyproject.toml
ADDED
@@ -0,0 +1,57 @@
1 + [build-system]
2 + requires = ["setuptools>=61.0", "wheel"]
3 + build-backend = "setuptools.build_meta"
4 +
5 + [project]
6 + name = "doc_redaction" # Your application's name
7 + version = "0.6.0" # Your application's current version
8 + description = "Redact PDF/image-based documents, or CSV/XLSX files using a Gradio-based GUI interface" # A short description
9 + readme = "README.md" # Path to your project's README file
10 + requires-python = ">=3.10" # The minimum Python version required
11 +
12 + dependencies = [
13 +     "pdfminer.six==20240706",
14 +     "pdf2image==1.17.0",
15 +     "pymupdf==1.25.3",
16 +     "opencv-python==4.10.0.84",
17 +     "presidio_analyzer==2.2.358",
18 +     "presidio_anonymizer==2.2.358",
19 +     "presidio-image-redactor==0.0.56",
20 +     "pikepdf==9.5.2",
21 +     "pandas==2.2.3",
22 +     "scikit-learn==1.6.1",
23 +     "spacy==3.8.4",
24 +     # Direct URL dependency for spacy model
25 +     "en_core_web_lg @ https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0.tar.gz",
26 +     "gradio==5.27.1",
27 +     "boto3==1.38.4",
28 +     "pyarrow==19.0.1",
29 +     "openpyxl==3.1.5",
30 +     "Faker==36.1.1",
31 +     "python-levenshtein==0.26.1",
32 +     "spaczz==0.6.1",
33 +     # Direct URL dependency for gradio_image_annotator wheel
34 +     "gradio_image_annotation @ https://github.com/seanpedrick-case/gradio_image_annotator/releases/download/v0.3.2/gradio_image_annotation-0.3.2-py3-none-any.whl",
35 +     "rapidfuzz==3.12.1",
36 +     "python-dotenv==1.0.1",
37 +     "numpy==1.26.4",
38 +     "awslambdaric==3.0.1"
39 + ]
40 +
41 + [project.urls]
42 + Homepage = "https://seanpedrick-case.github.io/doc_redaction/README.html"
43 + repository = "https://github.com/seanpedrick-case/doc_redaction"
44 +
45 + [project.optional-dependencies]
46 + dev = ["pytest"]
47 +
48 + # Optional: You can add configuration for tools used in your project under the [tool] section
49 + # For example, configuration for a linter like Ruff:
50 + [tool.ruff]
51 + line-length = 88
52 + select = ["E", "F", "I"]
53 +
54 + # Optional: Configuration for a formatter like Black:
55 + [tool.black]
56 + line-length = 88
57 + target-version = ['py310']
requirements.txt
CHANGED
@@ -10,8 +10,8 @@ pandas==2.2.3
10 | scikit-learn==1.6.1
11 | spacy==3.8.4
12 | en_core_web_lg @ https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.8.0/en_core_web_lg-3.8.0.tar.gz
13 - gradio==5.
13 + gradio==5.27.1
14 - boto3==1.
14 + boto3==1.38.4
15 | pyarrow==19.0.1
16 | openpyxl==3.1.5
17 | Faker==36.1.1
tools/aws_functions.py
CHANGED
@@ -3,7 +3,7 @@ import pandas as pd
3 | import boto3
4 | import tempfile
5 | import os
6 - from tools.config import AWS_REGION, RUN_AWS_FUNCTIONS, DOCUMENT_REDACTION_BUCKET
6 + from tools.config import AWS_REGION, RUN_AWS_FUNCTIONS, DOCUMENT_REDACTION_BUCKET, SAVE_LOGS_TO_CSV
7 | PandasDataFrame = Type[pd.DataFrame]
8 |
9 | def get_assumed_role_info():
@@ -174,3 +174,59 @@ def upload_file_to_s3(local_file_paths:List[str], s3_key:str, s3_bucket:str=DOCU
174 |         final_out_message_str = "App not set to run AWS functions"
175 |
176 |     return final_out_message_str
177 +
178 +
179 + def upload_log_file_to_s3(local_file_paths:List[str], s3_key:str, s3_bucket:str=DOCUMENT_REDACTION_BUCKET, RUN_AWS_FUNCTIONS:str = RUN_AWS_FUNCTIONS, SAVE_LOGS_TO_CSV:str=SAVE_LOGS_TO_CSV):
180 +     """
181 +     Uploads a log file from local machine to Amazon S3.
182 +
183 +     Args:
184 +     - local_file_path: Local file path(s) of the file(s) to upload.
185 +     - s3_key: Key (path) to the file in the S3 bucket.
186 +     - s3_bucket: Name of the S3 bucket.
187 +
188 +     Returns:
189 +     - Message as variable/printed to console
190 +     """
191 +     final_out_message = []
192 +     final_out_message_str = ""
193 +
194 +     if RUN_AWS_FUNCTIONS == "1" and SAVE_LOGS_TO_CSV == "True":
195 +         try:
196 +             if s3_bucket and s3_key and local_file_paths:
197 +
198 +                 s3_client = boto3.client('s3', region_name=AWS_REGION)
199 +
200 +                 if isinstance(local_file_paths, str):
201 +                     local_file_paths = [local_file_paths]
202 +
203 +                 for file in local_file_paths:
204 +                     if s3_client:
205 +                         #print(s3_client)
206 +                         try:
207 +                             # Get file name off file path
208 +                             file_name = os.path.basename(file)
209 +
210 +                             s3_key_full = s3_key + file_name
211 +                             print("S3 key: ", s3_key_full)
212 +
213 +                             s3_client.upload_file(file, s3_bucket, s3_key_full)
214 +                             out_message = "File " + file_name + " uploaded successfully!"
215 +                             print(out_message)
216 +
217 +                         except Exception as e:
218 +                             out_message = f"Error uploading file(s): {e}"
219 +                             print(out_message)
220 +
221 +                         final_out_message.append(out_message)
222 +                         final_out_message_str = '\n'.join(final_out_message)
223 +
224 +                     else: final_out_message_str = "Could not connect to AWS."
225 +             else: final_out_message_str = "At least one essential variable is empty, could not upload to S3"
226 +         except Exception as e:
227 +             final_out_message_str = "Could not upload files to S3 due to: " + str(e)
228 +             print(final_out_message_str)
229 +     else:
230 +         final_out_message_str = "App not set to run AWS functions"
231 +
232 +     return final_out_message_str
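One detail of `upload_log_file_to_s3` worth flagging: the object key is built by plain string concatenation (`s3_key_full = s3_key + file_name`), so the prefix passed in must already end in `/`, otherwise the file name fuses into the last path segment. A defensive join helper might look like this (hypothetical sketch, not part of the PR):

```python
import os

def build_s3_key(prefix: str, local_path: str) -> str:
    """Join an S3 key prefix and a local file's basename, guaranteeing one '/'."""
    file_name = os.path.basename(local_path)
    if prefix and not prefix.endswith("/"):
        prefix += "/"  # avoid 'logs/usagefile.csv' when caller omits the slash
    return prefix + file_name
```

With the function as merged, callers are responsible for passing a `/`-terminated `s3_key`.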
tools/aws_textract.py
CHANGED
@@ -108,6 +108,174 @@ def convert_pike_pdf_page_to_bytes(pdf:object, page_num:int):
108 |
109 |     return pdf_bytes
110 |
111 + # def json_to_ocrresult(json_data:dict, page_width:float, page_height:float, page_no:int):
112 + #     '''
113 + #     Convert the json response from textract to the OCRResult format used elsewhere in the code. Looks for lines, words, and signatures. Handwriting and signatures are set aside especially for later in case the user wants to override the default behaviour and redact all handwriting/signatures.
114 + #     '''
115 + #     all_ocr_results = []
116 + #     signature_or_handwriting_recogniser_results = []
117 + #     signature_recogniser_results = []
118 + #     handwriting_recogniser_results = []
119 + #     signatures = []
120 + #     handwriting = []
121 + #     ocr_results_with_words = {}
122 + #     text_block={}
123 +
124 + #     i = 1
125 +
126 + #     # Assuming json_data is structured as a dictionary with a "pages" key
127 + #     #if "pages" in json_data:
128 + #     # Find the specific page data
129 + #     page_json_data = json_data #next((page for page in json_data["pages"] if page["page_no"] == page_no), None)
130 +
131 + #     #print("page_json_data:", page_json_data)
132 +
133 + #     if "Blocks" in page_json_data:
134 + #         # Access the data for the specific page
135 + #         text_blocks = page_json_data["Blocks"] # Access the Blocks within the page data
136 + #     # This is a new page
137 + #     elif "page_no" in page_json_data:
138 + #         text_blocks = page_json_data["data"]["Blocks"]
139 + #     else: text_blocks = []
140 +
141 + #     is_signature = False
142 + #     is_handwriting = False
143 +
144 + #     for text_block in text_blocks:
145 +
146 + #         if (text_block['BlockType'] == 'LINE') | (text_block['BlockType'] == 'SIGNATURE'): # (text_block['BlockType'] == 'WORD') |
147 +
148 + #             # Extract text and bounding box for the line
149 + #             line_bbox = text_block["Geometry"]["BoundingBox"]
150 + #             line_left = int(line_bbox["Left"] * page_width)
151 + #             line_top = int(line_bbox["Top"] * page_height)
152 + #             line_right = int((line_bbox["Left"] + line_bbox["Width"]) * page_width)
153 + #             line_bottom = int((line_bbox["Top"] + line_bbox["Height"]) * page_height)
154 +
155 + #             width_abs = int(line_bbox["Width"] * page_width)
156 + #             height_abs = int(line_bbox["Height"] * page_height)
157 +
158 + #             if text_block['BlockType'] == 'LINE':
159 +
160 + #                 # Extract text and bounding box for the line
161 + #                 line_text = text_block.get('Text', '')
162 + #                 words = []
163 + #                 current_line_handwriting_results = [] # Track handwriting results for this line
164 +
165 + #                 if 'Relationships' in text_block:
166 + #                     for relationship in text_block['Relationships']:
167 + #                         if relationship['Type'] == 'CHILD':
168 + #                             for child_id in relationship['Ids']:
169 + #                                 child_block = next((block for block in text_blocks if block['Id'] == child_id), None)
170 + #                                 if child_block and child_block['BlockType'] == 'WORD':
171 + #                                     word_text = child_block.get('Text', '')
172 + #                                     word_bbox = child_block["Geometry"]["BoundingBox"]
173 + #                                     confidence = child_block.get('Confidence','')
174 + #                                     word_left = int(word_bbox["Left"] * page_width)
175 + #                                     word_top = int(word_bbox["Top"] * page_height)
176 + #                                     word_right = int((word_bbox["Left"] + word_bbox["Width"]) * page_width)
177 + #                                     word_bottom = int((word_bbox["Top"] + word_bbox["Height"]) * page_height)
178 +
179 + #                                     # Extract BoundingBox details
180 + #                                     word_width = word_bbox["Width"]
181 + #                                     word_height = word_bbox["Height"]
182 +
183 + #                                     # Convert proportional coordinates to absolute coordinates
184 + #                                     word_width_abs = int(word_width * page_width)
185 + #                                     word_height_abs = int(word_height * page_height)
186 +
187 + #                                     words.append({
188 + #                                         'text': word_text,
189 + #                                         'bounding_box': (word_left, word_top, word_right, word_bottom)
190 + #                                     })
191 + #                                     # Check for handwriting
192 + #                                     text_type = child_block.get("TextType", '')
193 +
194 + #                                     if text_type == "HANDWRITING":
195 + #                                         is_handwriting = True
196 + #                                         entity_name = "HANDWRITING"
197 + #                                         word_end = len(word_text)
198 +
199 + #                                         recogniser_result = CustomImageRecognizerResult(
200 + #                                             entity_type=entity_name,
201 + #                                             text=word_text,
202 + #                                             score=confidence,
203 + #                                             start=0,
204 + #                                             end=word_end,
205 + #                                             left=word_left,
206 + #                                             top=word_top,
207 + #                                             width=word_width_abs,
208 + #                                             height=word_height_abs
209 + #                                         )
210 +
211 + #                                         # Add to handwriting collections immediately
212 + #                                         handwriting.append(recogniser_result)
213 + #                                         handwriting_recogniser_results.append(recogniser_result)
214 + #                                         signature_or_handwriting_recogniser_results.append(recogniser_result)
215 + #                                         current_line_handwriting_results.append(recogniser_result)
216 +
217 + #             # If handwriting or signature, add to bounding box
218 +
219 + #             elif (text_block['BlockType'] == 'SIGNATURE'):
220 + #                 line_text = "SIGNATURE"
221 + #                 is_signature = True
222 + #                 entity_name = "SIGNATURE"
223 + #                 confidence = text_block.get('Confidence', 0)
224 + #                 word_end = len(line_text)
225 +
226 + #                 recogniser_result = CustomImageRecognizerResult(
227 + #                     entity_type=entity_name,
228 + #                     text=line_text,
229 + #                     score=confidence,
230 + #                     start=0,
231 + #                     end=word_end,
232 + #                     left=line_left,
233 + #                     top=line_top,
234 + #                     width=width_abs,
235 + #                     height=height_abs
236 + #                 )
237 +
238 + #                 # Add to signature collections immediately
239 + #                 signatures.append(recogniser_result)
240 + #                 signature_recogniser_results.append(recogniser_result)
241 + #                 signature_or_handwriting_recogniser_results.append(recogniser_result)
242 +
243 + #                 words = [{
244 + #                     'text': line_text,
245 + #                     'bounding_box': (line_left, line_top, line_right, line_bottom)
246 + #                 }]
247 +
248 + #             ocr_results_with_words["text_line_" + str(i)] = {
249 + #                 "line": i,
250 + #                 'text': line_text,
251 + #                 'bounding_box': (line_left, line_top, line_right, line_bottom),
252 + #                 'words': words
253 + #             }
254 +
255 + #             # Create OCRResult with absolute coordinates
256 + #             ocr_result = OCRResult(line_text, line_left, line_top, width_abs, height_abs)
257 + #             all_ocr_results.append(ocr_result)
258 +
259 + #             is_signature_or_handwriting = is_signature | is_handwriting
260 +
261 + #             # If it is signature or handwriting, will overwrite the default behaviour of the PII analyser
262 + #             if is_signature_or_handwriting:
263 + #                 if recogniser_result not in signature_or_handwriting_recogniser_results:
264 + #                     signature_or_handwriting_recogniser_results.append(recogniser_result)
265 +
266 + #                 if is_signature:
267 + #                     if recogniser_result not in signature_recogniser_results:
268 + #                         signature_recogniser_results.append(recogniser_result)
269 +
270 + #                 if is_handwriting:
271 + #                     if recogniser_result not in handwriting_recogniser_results:
272 + #                         handwriting_recogniser_results.append(recogniser_result)
273 +
274 + #             i += 1
275 +
276 + #     return all_ocr_results, signature_or_handwriting_recogniser_results, signature_recogniser_results, handwriting_recogniser_results, ocr_results_with_words
277 +
278 +
279 | def json_to_ocrresult(json_data:dict, page_width:float, page_height:float, page_no:int):
280 |     '''
281 |     Convert the json response from textract to the OCRResult format used elsewhere in the code. Looks for lines, words, and signatures. Handwriting and signatures are set aside especially for later in case the user wants to override the default behaviour and redact all handwriting/signatures.
@@ -118,7 +286,7 @@ def json_to_ocrresult(json_data:dict, page_width:float, page_height:float, page_
118 -     handwriting_recogniser_results = []
119 |     signatures = []
120 |     handwriting = []
121 -
122 |     text_block={}
123 |
124 |     i = 1
286 |     handwriting_recogniser_results = []
287 |     signatures = []
288 |     handwriting = []
289 +     ocr_results_with_words = {}
290 |     text_block={}
291 |
292 |     i = 1
@@ -141,7 +309,7 @@ def json_to_ocrresult(json_data:dict, page_width:float, page_height:float, page_
141 |     is_signature = False
142 |     is_handwriting = False
143 |
144 -     for text_block in text_blocks:
145 |
146 |         if (text_block['BlockType'] == 'LINE') | (text_block['BlockType'] == 'SIGNATURE'): # (text_block['BlockType'] == 'WORD') |
147 |
309 |     is_signature = False
310 |     is_handwriting = False
311 |
312 +     for text_block in text_blocks:
313 |
314 |         if (text_block['BlockType'] == 'LINE') | (text_block['BlockType'] == 'SIGNATURE'): # (text_block['BlockType'] == 'WORD') |
315 |
@@ -244,36 +412,53 @@ def json_to_ocrresult(json_data:dict, page_width:float, page_height:float, page_
244 |                 'text': line_text,
245 |                 'bounding_box': (line_left, line_top, line_right, line_bottom)
246 |                 }]
247 -
248 -
249 |                 "line": i,
250 |                 'text': line_text,
251 |                 'bounding_box': (line_left, line_top, line_right, line_bottom),
252 -                 'words': words
253 -
254 |
255 |             # Create OCRResult with absolute coordinates
256 |             ocr_result = OCRResult(line_text, line_left, line_top, width_abs, height_abs)
257 |             all_ocr_results.append(ocr_result)
258 |
259 -
260 |
261 -
262 -
263 -
264 -             signature_or_handwriting_recogniser_results.append(recogniser_result)
265 |
266 -
267 -             if recogniser_result not in signature_recogniser_results:
268 -                 signature_recogniser_results.append(recogniser_result)
269 |
270 -
271 -
272 -
273 |
274 -
275 |
276 -     return all_ocr_results, signature_or_handwriting_recogniser_results, signature_recogniser_results, handwriting_recogniser_results, ocr_results_with_children
277 |
278 | def load_and_convert_textract_json(textract_json_file_path:str, log_files_output_paths:str, page_sizes_df:pd.DataFrame):
279 |     """
412 |                 'text': line_text,
413 |                 'bounding_box': (line_left, line_top, line_right, line_bottom)
414 |                 }]
415 +             else:
416 +                 line_text = ""
417 +                 words=[]
418 +                 line_left = 0
419 +                 line_top = 0
420 +                 line_right = 0
421 +                 line_bottom = 0
422 +                 width_abs = 0
423 +                 height_abs = 0
424 +
425 +             if line_text:
426 +
427 +                 ocr_results_with_words["text_line_" + str(i)] = {
428 |                     "line": i,
429 |                     'text': line_text,
430 |                     'bounding_box': (line_left, line_top, line_right, line_bottom),
431 +                     'words': words,
432 +                     'page': page_no
433 +                 }
434 |
435 |             # Create OCRResult with absolute coordinates
436 |             ocr_result = OCRResult(line_text, line_left, line_top, width_abs, height_abs)
437 |             all_ocr_results.append(ocr_result)
438 |
439 +             is_signature_or_handwriting = is_signature | is_handwriting
440 +
441 +             # If it is signature or handwriting, will overwrite the default behaviour of the PII analyser
442 +             if is_signature_or_handwriting:
443 +                 if recogniser_result not in signature_or_handwriting_recogniser_results:
444 +                     signature_or_handwriting_recogniser_results.append(recogniser_result)
445 +
446 +                 if is_signature:
447 +                     if recogniser_result not in signature_recogniser_results:
448 +                         signature_recogniser_results.append(recogniser_result)
449 |
450 +                 if is_handwriting:
451 +                     if recogniser_result not in handwriting_recogniser_results:
452 +                         handwriting_recogniser_results.append(recogniser_result)
453 |
454 +             i += 1
455 |
456 +     # Add page key to the line level results
457 +     all_ocr_results_with_page = {"page": page_no, "results": all_ocr_results}
458 +     ocr_results_with_words_with_page = {"page": page_no, "results": ocr_results_with_words}
459 |
460 +     return all_ocr_results_with_page, signature_or_handwriting_recogniser_results, signature_recogniser_results, handwriting_recogniser_results, ocr_results_with_words_with_page
461 |
462 |
463 | def load_and_convert_textract_json(textract_json_file_path:str, log_files_output_paths:str, page_sizes_df:pd.DataFrame):
464 |     """
@@ -315,7 +500,7 @@ def load_and_convert_textract_json(textract_json_file_path:str, log_files_output
315 |             return {}, True, log_files_output_paths # Conversion failed
316 |     else:
317 |         print("Invalid Textract JSON format: 'Blocks' missing.")
318 -         print("textract data:", textract_data)
319 |         return {}, True, log_files_output_paths # Return empty data if JSON is not recognized
500 |             return {}, True, log_files_output_paths # Conversion failed
501 |     else:
502 |         print("Invalid Textract JSON format: 'Blocks' missing.")
503 +         #print("textract data:", textract_data)
504 |         return {}, True, log_files_output_paths # Return empty data if JSON is not recognized
505 |
506 | def restructure_textract_output(textract_output: dict, page_sizes_df:pd.DataFrame):
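The reworked `json_to_ocrresult` now returns page-tagged structures, `{"page": page_no, "results": ...}`, with line-level entries keyed `text_line_<i>`. A heavily simplified sketch of that output shape, fed by a toy Textract-style `Blocks` list (illustrative only: it skips signatures, handwriting, and the WORD child-block lookup that the real function performs):

```python
def blocks_to_lines(blocks: list[dict], page_width: int, page_height: int, page_no: int) -> dict:
    """Collect LINE blocks into the page-keyed, line-indexed dict shape."""
    lines = {}
    for i, block in enumerate(b for b in blocks if b["BlockType"] == "LINE"):
        bbox = block["Geometry"]["BoundingBox"]
        # Textract coordinates are proportional; convert to absolute pixels.
        left = int(bbox["Left"] * page_width)
        top = int(bbox["Top"] * page_height)
        right = int((bbox["Left"] + bbox["Width"]) * page_width)
        bottom = int((bbox["Top"] + bbox["Height"]) * page_height)
        lines[f"text_line_{i + 1}"] = {
            "line": i + 1,
            "text": block.get("Text", ""),
            "bounding_box": (left, top, right, bottom),
            "page": page_no,
        }
    return {"page": page_no, "results": lines}
```

The merged function additionally carries a per-line `words` list and returns the signature/handwriting recogniser results alongside this structure.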
tools/config.py
CHANGED
@@ -108,19 +108,7 @@ if AWS_SECRET_KEY: print(f'AWS_SECRET_KEY found in environment variables')
|
|
108 |
|
109 |
DOCUMENT_REDACTION_BUCKET = get_or_create_env_var('DOCUMENT_REDACTION_BUCKET', '')
|
110 |
|
111 |
-
SHOW_BULK_TEXTRACT_CALL_OPTIONS = get_or_create_env_var('SHOW_BULK_TEXTRACT_CALL_OPTIONS', 'False') # This feature not currently implemented
|
112 |
|
113 |
-
TEXTRACT_BULK_ANALYSIS_BUCKET = get_or_create_env_var('TEXTRACT_BULK_ANALYSIS_BUCKET', '')
|
114 |
-
|
115 |
-
TEXTRACT_BULK_ANALYSIS_INPUT_SUBFOLDER = get_or_create_env_var('TEXTRACT_BULK_ANALYSIS_INPUT_SUBFOLDER', 'input')
|
116 |
-
|
117 |
-
TEXTRACT_BULK_ANALYSIS_OUTPUT_SUBFOLDER = get_or_create_env_var('TEXTRACT_BULK_ANALYSIS_OUTPUT_SUBFOLDER', 'output')
|
118 |
-
|
119 |
-
LOAD_PREVIOUS_TEXTRACT_JOBS_S3 = get_or_create_env_var('LOAD_PREVIOUS_TEXTRACT_JOBS_S3', 'False') # Whether or not to load previous Textract jobs from S3
|
120 |
-
|
121 |
-
TEXTRACT_JOBS_S3_LOC = get_or_create_env_var('TEXTRACT_JOBS_S3_LOC', 'output') # Subfolder in the DOCUMENT_REDACTION_BUCKET where the Textract jobs are stored
|
122 |
-
|
123 |
-
TEXTRACT_JOBS_LOCAL_LOC = get_or_create_env_var('TEXTRACT_JOBS_LOCAL_LOC', 'output') # Local subfolder where the Textract jobs are stored
|
124 |
|
125 |
# Custom headers e.g. if routing traffic through Cloudfront
|
126 |
# Retrieving or setting CUSTOM_HEADER
|
@@ -161,6 +149,8 @@ if OUTPUT_FOLDER == "TEMP" or INPUT_FOLDER == "TEMP":
|
|
161 |
# By default, logs are put into a subfolder of today's date and the host name of the instance running the app. This is to avoid at all possible the possibility of log files from one instance overwriting the logs of another instance on S3. If running the app on one system always, or just locally, it is not necessary to make the log folders so specific.
|
162 |
# Another way to address this issue would be to write logs to another type of storage, e.g. database such as dynamodb. I may look into this in future.
|
163 |
|
|
|
|
|
164 |
USE_LOG_SUBFOLDERS = get_or_create_env_var('USE_LOG_SUBFOLDERS', 'True')
|
165 |
|
166 |
if USE_LOG_SUBFOLDERS == "True":
|
@@ -181,8 +171,28 @@ ensure_folder_exists(USAGE_LOGS_FOLDER)
|
|
181 |
# Should the redacted file name be included in the logs? In some instances, the names of the files themselves could be sensitive, and should not be disclosed beyond the app. So, by default this is false.
|
182 |
DISPLAY_FILE_NAMES_IN_LOGS = get_or_create_env_var('DISPLAY_FILE_NAMES_IN_LOGS', 'False')
|
183 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
184 |
###
|
185 |
-
# REDACTION CONFIG
|
186 |
|
187 |
# Create Tesseract and Poppler folders if you have installed them locally
|
188 |
TESSERACT_FOLDER = get_or_create_env_var('TESSERACT_FOLDER', "") # e.g. tesseract/
|
@@ -226,7 +236,7 @@ ROOT_PATH = get_or_create_env_var('ROOT_PATH', '')
|
|
226 |
|
227 |
DEFAULT_CONCURRENCY_LIMIT = get_or_create_env_var('DEFAULT_CONCURRENCY_LIMIT', '3')
|
228 |
|
229 |
-
GET_DEFAULT_ALLOW_LIST = get_or_create_env_var('GET_DEFAULT_ALLOW_LIST', '
|
230 |
|
231 |
ALLOW_LIST_PATH = get_or_create_env_var('ALLOW_LIST_PATH', '') # config/default_allow_list.csv
|
232 |
|
@@ -235,19 +245,38 @@ S3_ALLOW_LIST_PATH = get_or_create_env_var('S3_ALLOW_LIST_PATH', '') # default_a
|
|
235 |
if ALLOW_LIST_PATH: OUTPUT_ALLOW_LIST_PATH = ALLOW_LIST_PATH
|
236 |
else: OUTPUT_ALLOW_LIST_PATH = 'config/default_allow_list.csv'
|
237 |
|
|
|
|
|
238 |
SHOW_COSTS = get_or_create_env_var('SHOW_COSTS', 'False')
|
239 |
|
240 |
-
GET_COST_CODES = get_or_create_env_var('GET_COST_CODES', '
|
241 |
|
242 |
DEFAULT_COST_CODE = get_or_create_env_var('DEFAULT_COST_CODE', '')
|
243 |
|
244 |
COST_CODES_PATH = get_or_create_env_var('COST_CODES_PATH', '') # 'config/COST_CENTRES.csv' # file should be a csv file with a single table in it that has two columns with a header. First column should contain cost codes, second column should contain a name or description for the cost code
|
245 |
|
246 |
-
S3_COST_CODES_PATH = get_or_create_env_var('S3_COST_CODES_PATH', '') # COST_CENTRES.csv # This is a path within the DOCUMENT_REDACTION_BUCKET
|
247 |
-
|
|
|
248 |
if COST_CODES_PATH: OUTPUT_COST_CODES_PATH = COST_CODES_PATH
|
249 |
-
else: OUTPUT_COST_CODES_PATH = 'config/
|
250 |
|
251 |
ENFORCE_COST_CODES = get_or_create_env_var('ENFORCE_COST_CODES', 'False') # If you have cost codes listed, is it compulsory to choose one before redacting?
|
252 |
|
253 |
-
if ENFORCE_COST_CODES == 'True': GET_COST_CODES = 'True'
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
108 |
|
109 |
DOCUMENT_REDACTION_BUCKET = get_or_create_env_var('DOCUMENT_REDACTION_BUCKET', '')
|
110 |
|
|
|
111 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
112 |
|
113 |
# Custom headers e.g. if routing traffic through Cloudfront
|
114 |
# Retrieving or setting CUSTOM_HEADER
|
|
|
149 |
# By default, logs are put into a subfolder of today's date and the host name of the instance running the app. This is to avoid at all possible the possibility of log files from one instance overwriting the logs of another instance on S3. If running the app on one system always, or just locally, it is not necessary to make the log folders so specific.
|
150 |
# Another way to address this issue would be to write logs to another type of storage, e.g. database such as dynamodb. I may look into this in future.
|
151 |
|
152 |
+
SAVE_LOGS_TO_CSV = get_or_create_env_var('SAVE_LOGS_TO_CSV', 'True')
|
153 |
+
|
154 |
USE_LOG_SUBFOLDERS = get_or_create_env_var('USE_LOG_SUBFOLDERS', 'True')
|
155 |
|
156 |
if USE_LOG_SUBFOLDERS == "True":
|
|
|
171 |
# Should the redacted file name be included in the logs? In some instances, the names of the files themselves could be sensitive, and should not be disclosed beyond the app. So, by default this is false.
|
172 |
DISPLAY_FILE_NAMES_IN_LOGS = get_or_create_env_var('DISPLAY_FILE_NAMES_IN_LOGS', 'False')
|
173 |
|
174 |
+
# Further customisation options for CSV logs
|
175 |
+
|
176 |
+
CSV_ACCESS_LOG_HEADERS = get_or_create_env_var('CSV_ACCESS_LOG_HEADERS', '') # If blank, uses component labels
|
177 |
+
CSV_FEEDBACK_LOG_HEADERS = get_or_create_env_var('CSV_FEEDBACK_LOG_HEADERS', '') # If blank, uses component labels
|
178 |
+
CSV_USAGE_LOG_HEADERS = get_or_create_env_var('CSV_USAGE_LOG_HEADERS', '["session_hash_textbox", "doc_full_file_name_textbox", "data_full_file_name_textbox", "actual_time_taken_number", "total_page_count", "textract_query_number", "pii_detection_method", "comprehend_query_number", "cost_code", "textract_handwriting_signature", "host_name_textbox", "text_extraction_method", "is_this_a_textract_api_call"]') # If blank, uses component labels
|
179 |
+
|
180 |
+
### DYNAMODB logs. Whether to save to DynamoDB, and the headers of the table
|
181 |
+
|
182 |
+
SAVE_LOGS_TO_DYNAMODB = get_or_create_env_var('SAVE_LOGS_TO_DYNAMODB', 'False')
|
183 |
+
|
184 |
+
ACCESS_LOG_DYNAMODB_TABLE_NAME = get_or_create_env_var('ACCESS_LOG_DYNAMODB_TABLE_NAME', 'redaction_access_log')
|
185 |
+
DYNAMODB_ACCESS_LOG_HEADERS = get_or_create_env_var('DYNAMODB_ACCESS_LOG_HEADERS', '')
|
186 |
+
|
187 |
+
```python
FEEDBACK_LOG_DYNAMODB_TABLE_NAME = get_or_create_env_var('FEEDBACK_LOG_DYNAMODB_TABLE_NAME', 'redaction_feedback')
DYNAMODB_FEEDBACK_LOG_HEADERS = get_or_create_env_var('DYNAMODB_FEEDBACK_LOG_HEADERS', '')

USAGE_LOG_DYNAMODB_TABLE_NAME = get_or_create_env_var('USAGE_LOG_DYNAMODB_TABLE_NAME', 'redaction_usage')
DYNAMODB_USAGE_LOG_HEADERS = get_or_create_env_var('DYNAMODB_USAGE_LOG_HEADERS', '')

###
# REDACTION
###

# Create Tesseract and Poppler folders if you have installed them locally
TESSERACT_FOLDER = get_or_create_env_var('TESSERACT_FOLDER', "")  # e.g. tesseract/

...

DEFAULT_CONCURRENCY_LIMIT = get_or_create_env_var('DEFAULT_CONCURRENCY_LIMIT', '3')

GET_DEFAULT_ALLOW_LIST = get_or_create_env_var('GET_DEFAULT_ALLOW_LIST', '')

ALLOW_LIST_PATH = get_or_create_env_var('ALLOW_LIST_PATH', '')  # e.g. config/default_allow_list.csv

...

if ALLOW_LIST_PATH: OUTPUT_ALLOW_LIST_PATH = ALLOW_LIST_PATH
else: OUTPUT_ALLOW_LIST_PATH = 'config/default_allow_list.csv'

### COST CODE OPTIONS

SHOW_COSTS = get_or_create_env_var('SHOW_COSTS', 'False')

GET_COST_CODES = get_or_create_env_var('GET_COST_CODES', 'True')

DEFAULT_COST_CODE = get_or_create_env_var('DEFAULT_COST_CODE', '')

COST_CODES_PATH = get_or_create_env_var('COST_CODES_PATH', '')  # e.g. 'config/COST_CENTRES.csv'. The file should be a csv file with a single table in it that has two columns with a header. The first column should contain cost codes, the second a name or description for each cost code.

S3_COST_CODES_PATH = get_or_create_env_var('S3_COST_CODES_PATH', '')  # e.g. COST_CENTRES.csv. This is a path within the DOCUMENT_REDACTION_BUCKET.

# A default path in case an S3 cost code location is provided but no local cost code location is given
if COST_CODES_PATH: OUTPUT_COST_CODES_PATH = COST_CODES_PATH
else: OUTPUT_COST_CODES_PATH = 'config/cost_codes.csv'

ENFORCE_COST_CODES = get_or_create_env_var('ENFORCE_COST_CODES', 'False')  # If you have cost codes listed, is it compulsory to choose one before redacting?

if ENFORCE_COST_CODES == 'True': GET_COST_CODES = 'True'

### WHOLE DOCUMENT API OPTIONS

SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS = get_or_create_env_var('SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS', 'False')  # This feature is not currently implemented

TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET = get_or_create_env_var('TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET', '')
TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER = get_or_create_env_var('TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER', 'input')
TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_OUTPUT_SUBFOLDER = get_or_create_env_var('TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_OUTPUT_SUBFOLDER', 'output')

LOAD_PREVIOUS_TEXTRACT_JOBS_S3 = get_or_create_env_var('LOAD_PREVIOUS_TEXTRACT_JOBS_S3', 'False')  # Whether or not to load previous Textract jobs from S3
TEXTRACT_JOBS_S3_LOC = get_or_create_env_var('TEXTRACT_JOBS_S3_LOC', 'output')  # Subfolder in the DOCUMENT_REDACTION_BUCKET where the Textract jobs are stored
TEXTRACT_JOBS_LOCAL_LOC = get_or_create_env_var('TEXTRACT_JOBS_LOCAL_LOC', 'output')  # Local subfolder where the Textract jobs are stored
```
tools/custom_csvlogger.py
CHANGED

```python
import csv
import datetime
import os
import re
import boto3
import botocore
import uuid
import time
from collections.abc import Sequence
from multiprocessing import Lock
from pathlib import Path
from typing import TYPE_CHECKING, Any

from gradio_client import utils as client_utils
import gradio as gr
from gradio import utils, wasm_utils
from tools.config import AWS_REGION, AWS_ACCESS_KEY, AWS_SECRET_KEY, RUN_AWS_FUNCTIONS
from botocore.exceptions import NoCredentialsError, TokenRetrievalError

if TYPE_CHECKING:
    from gradio.components import Component
```

```python
class CSVLogger_custom(FlaggingCallback):
    ...
        self.flagging_dir = Path(flagging_dir)
        self.first_time = True

    def _create_dataset_file(
        self,
        additional_headers: list[str] | None = None,
        replacement_headers: list[str] | None = None
    ):
        os.makedirs(self.flagging_dir, exist_ok=True)

        if replacement_headers:
            if len(replacement_headers) != len(self.components):
                raise ValueError(
                    f"replacement_headers must have the same length as components "
                    f"({len(replacement_headers)} provided, {len(self.components)} expected)"
                )
            headers = replacement_headers + ["timestamp"]
        else:
            if additional_headers is None:
                additional_headers = []
            headers = [
                getattr(component, "label", None) or f"component {idx}"
                for idx, component in enumerate(self.components)
            ] + additional_headers + ["timestamp"]

        headers = utils.sanitize_list_for_csv(headers)
        dataset_files = list(Path(self.flagging_dir).glob("dataset*.csv"))
        ...
        print("Using existing dataset file at:", self.dataset_filepath)

    def flag(
        self,
        flag_data: list[Any],
        flag_option: str | None = None,
        username: str | None = None,
        save_to_csv: bool = True,
        save_to_dynamodb: bool = False,
        dynamodb_table_name: str | None = None,
        dynamodb_headers: list[str] | None = None,  # New: specify headers for DynamoDB
        replacement_headers: list[str] | None = None
    ) -> int:
        if self.first_time:
            additional_headers = []
            if flag_option is not None:
                additional_headers.append("flag")
            if username is not None:
                additional_headers.append("username")
            additional_headers.append("id")
            self._create_dataset_file(additional_headers=additional_headers, replacement_headers=replacement_headers)
            self.first_time = False

        csv_data = []
        ...
        if flag_option is not None:
            csv_data.append(flag_option)
        if username is not None:
            csv_data.append(username)

        timestamp = str(datetime.datetime.now())
        csv_data.append(timestamp)

        generated_id = str(uuid.uuid4())
        csv_data.append(generated_id)

        # Build the headers
        headers = [
            getattr(component, "label", None) or f"component {idx}"
            for idx, component in enumerate(self.components)
        ]
        if flag_option is not None:
            headers.append("flag")
        if username is not None:
            headers.append("username")
        headers.append("timestamp")
        headers.append("id")

        line_count = -1

        if save_to_csv:
            with self.lock:
                with open(self.dataset_filepath, "a", newline="", encoding="utf-8") as csvfile:
                    writer = csv.writer(csvfile)
                    writer.writerow(utils.sanitize_list_for_csv(csv_data))
                with open(self.dataset_filepath, encoding="utf-8") as csvfile:
                    line_count = len(list(csv.reader(csvfile))) - 1

        if save_to_dynamodb:
            if RUN_AWS_FUNCTIONS == "1":
                try:
                    print("Connecting to DynamoDB via existing SSO connection")
                    dynamodb = boto3.resource('dynamodb', region_name=AWS_REGION)
                    dynamodb.meta.client.list_tables()  # Test the connection
                except Exception as e:
                    print("No SSO credentials found:", e)
                    if AWS_ACCESS_KEY and AWS_SECRET_KEY:
                        print("Trying DynamoDB credentials from environment variables")
                        dynamodb = boto3.resource(
                            'dynamodb',
                            aws_access_key_id=AWS_ACCESS_KEY,
                            aws_secret_access_key=AWS_SECRET_KEY,
                            region_name=AWS_REGION
                        )
                    else:
                        raise Exception("AWS credentials for DynamoDB logging not found")
            else:
                raise Exception("AWS credentials for DynamoDB logging not found")

            if dynamodb_table_name is None:
                raise ValueError("You must provide a dynamodb_table_name if save_to_dynamodb is True")

            # Resolve which headers to use for the DynamoDB item
            if not dynamodb_headers:
                if replacement_headers:
                    dynamodb_headers = replacement_headers
                elif headers:
                    dynamodb_headers = headers
                else:
                    raise ValueError("Headers not found. You must provide dynamodb_headers or replacement_headers to create a new table.")

            if flag_option is not None and "flag" not in dynamodb_headers:
                dynamodb_headers.append("flag")
            if username is not None and "username" not in dynamodb_headers:
                dynamodb_headers.append("username")
            if "timestamp" not in dynamodb_headers:
                dynamodb_headers.append("timestamp")
            if "id" not in dynamodb_headers:
                dynamodb_headers.append("id")

            # Create the table if it doesn't exist yet
            try:
                table = dynamodb.Table(dynamodb_table_name)
                table.load()
            except botocore.exceptions.ClientError as e:
                if e.response['Error']['Code'] == 'ResourceNotFoundException':
                    attribute_definitions = [
                        {'AttributeName': 'id', 'AttributeType': 'S'}  # Only define key attributes here
                    ]
                    table = dynamodb.create_table(
                        TableName=dynamodb_table_name,
                        KeySchema=[
                            {'AttributeName': 'id', 'KeyType': 'HASH'}  # Partition key
                        ],
                        AttributeDefinitions=attribute_definitions,
                        BillingMode='PAY_PER_REQUEST'
                    )
                    # Wait until the table exists
                    table.meta.client.get_waiter('table_exists').wait(TableName=dynamodb_table_name)
                    time.sleep(5)
                    print(f"Table '{dynamodb_table_name}' created successfully.")
                else:
                    raise

            # Prepare the DynamoDB item to upload
            try:
                item = {
                    'id': str(generated_id),  # UUID primary key
                    'timestamp': timestamp,
                }
                # Map the headers to values
                item.update({header: str(value) for header, value in zip(dynamodb_headers, csv_data)})
                table.put_item(Item=item)
                print("Successfully uploaded log to DynamoDB")
            except Exception as e:
                print("Could not upload log to DynamoDB due to", e)

        return line_count
```
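The key design point in the DynamoDB path above is that the table is created with only the `id` partition key; every other column rides along as an arbitrary string attribute, so the CSV headers map straight onto item attributes. That mapping can be isolated as a pure function (`build_log_item` is a hypothetical helper for illustration, not part of the PR):

```python
import uuid
import datetime

def build_log_item(headers: list, values: list) -> dict:
    # DynamoDB is schemaless beyond the key attributes, so each log row
    # becomes a string map keyed by the CSV headers, plus 'id' and 'timestamp'
    item = {
        'id': str(uuid.uuid4()),                    # UUID partition key
        'timestamp': str(datetime.datetime.now()),
    }
    item.update({header: str(value) for header, value in zip(headers, values)})
    return item

item = build_log_item(['file_name', 'flag', 'username'], ['report.pdf', 'ok', 'sean'])
```

Because only `id` appears in the key schema, new log columns can be added later without migrating the table.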
tools/custom_image_analyser_engine.py
CHANGED

```python
def recreate_page_line_level_ocr_results_with_page(page_line_level_ocr_results_with_words: dict):
    reconstructed_results = []

    # Assume all lines belong to the same page, so we can just read it from one item
    page = page_line_level_ocr_results_with_words["page"]

    for line_data in page_line_level_ocr_results_with_words["results"].values():
        bbox = line_data["bounding_box"]
        text = line_data["text"]

        # Recreate the OCRResult from the stored line-level bounding box
        line_result = OCRResult(
            text=text,
            left=bbox[0],
            top=bbox[1],
            width=bbox[2] - bbox[0],
            height=bbox[3] - bbox[1],
        )
        reconstructed_results.append(line_result)

    page_line_level_ocr_results_with_page = {"page": page, "results": reconstructed_results}

    return page_line_level_ocr_results_with_page


def create_ocr_result_with_children(combined_results: dict, i: int, current_bbox: dict, current_line: list):
    combined_results["text_line_" + str(i)] = {
        "line": i,
        'text': current_bbox.text,
        'bounding_box': (current_bbox.left, current_bbox.top,
                         current_bbox.left + current_bbox.width,
                         current_bbox.top + current_bbox.height),
        'words': [{'text': word.text,
                   'bounding_box': (word.left, word.top,
                                    word.left + word.width,
                                    word.top + word.height)}
                  for word in current_line]
    }
    return combined_results["text_line_" + str(i)]


def combine_ocr_results(ocr_results: dict, x_threshold: float = 50.0, y_threshold: float = 12.0, page: int = 1):
    '''
    Group OCR results into lines based on y_threshold. Create line-level OCR results, and word-level OCR results.
    '''
    lines = []
    current_line = []
    for result in sorted(ocr_results, key=lambda x: x.top):
        if not current_line or abs(result.top - current_line[0].top) <= y_threshold:
            current_line.append(result)
        else:
            lines.append(current_line)
            current_line = [result]
    if current_line:
        lines.append(current_line)

    # Sort each line by left position
    for line in lines:
        line.sort(key=lambda x: x.left)

    # Flatten the sorted lines back into a single list
    sorted_results = [result for line in lines for result in line]

    page_line_level_ocr_results = []
    page_line_level_ocr_results_with_words = {}
    current_line = []
    current_bbox = None
    line_counter = 1

    for result in sorted_results:
        if not current_line:
            # Start a new line
            current_line.append(result)
            current_bbox = result
        else:
            # Check if the result is on the same line (y-axis) and close horizontally (x-axis)
            last_result = current_line[-1]
            if abs(result.top - last_result.top) <= y_threshold and \
               (result.left - (last_result.left + last_result.width)) <= x_threshold:
                # Update the bounding box to include the new word
                new_right = max(current_bbox.left + current_bbox.width, result.left + result.width)
                current_bbox = OCRResult(
                    text=f"{current_bbox.text} {result.text}",
                    left=current_bbox.left,
                    top=current_bbox.top,
                    width=new_right - current_bbox.left,
                    height=max(current_bbox.height, result.height)
                )
                current_line.append(result)
            else:
                # Commit the current line and start a new one
                page_line_level_ocr_results.append(current_bbox)
                page_line_level_ocr_results_with_words["text_line_" + str(line_counter)] = create_ocr_result_with_children(
                    page_line_level_ocr_results_with_words, line_counter, current_bbox, current_line)
                line_counter += 1
                current_line = [result]
                current_bbox = result

    # Append the last line
    if current_bbox:
        page_line_level_ocr_results.append(current_bbox)
        page_line_level_ocr_results_with_words["text_line_" + str(line_counter)] = create_ocr_result_with_children(
            page_line_level_ocr_results_with_words, line_counter, current_bbox, current_line)

    # Add the page key to the line-level results
    page_line_level_ocr_results_with_page = {"page": page, "results": page_line_level_ocr_results}
    page_line_level_ocr_results_with_words = {"page": page, "results": page_line_level_ocr_results_with_words}

    return page_line_level_ocr_results_with_page, page_line_level_ocr_results_with_words
```

The previous `combine_ocr_results` implementation is retained in the file as a commented-out block. `analyze_text` now takes the word-level OCR results alongside the line-level results, and passes the per-line child information through to `map_analyzer_results_to_bounding_boxes`:

```python
class CustomImageAnalyzerEngine:
    def __init__(
        ...
    def analyze_text(
        self,
        line_level_ocr_results: List[OCRResult],
        ocr_results_with_words: Dict[str, Dict],
        chosen_redact_comprehend_entities: List[str],
        pii_identification_method: str = "Local",
        comprehend_client = "",
        ...
        combined_results = []
        for i, text_line in enumerate(line_level_ocr_results):
            line_results = next((results for idx, results in all_text_line_results if idx == i), [])
            if line_results and i < len(ocr_results_with_words):
                child_level_key = list(ocr_results_with_words.keys())[i]
                ocr_results_with_words_line_level = ocr_results_with_words[child_level_key]

                for result in line_results:
                    bbox_results = self.map_analyzer_results_to_bounding_boxes(
                        ...,
                        text_line.text,
                        text_analyzer_kwargs.get('allow_list', []),
                        ocr_results_with_words_line_level
                    )
                    combined_results.extend(bbox_results)
        ...

    def map_analyzer_results_to_bounding_boxes(
        self,
        ...
        redaction_relevant_ocr_results: List[OCRResult],
        full_text: str,
        allow_list: List[str],
        ocr_results_with_words_child_info: Dict[str, Dict]
    ) -> List[CustomImageRecognizerResult]:
        redaction_bboxes = []

        for redaction_relevant_ocr_result in redaction_relevant_ocr_results:
            line_text = ocr_results_with_words_child_info['text']
            line_length = len(line_text)
            redaction_text = redaction_relevant_ocr_result.text
            ...
            # Find the corresponding words in the OCR results
            matching_word_boxes = []
            current_position = 0

            for word_info in ocr_results_with_words_child_info.get('words', []):
                word_text = word_info['text']
                word_length = len(word_text)
                ...
```
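The grouping logic in `combine_ocr_results` can be isolated: words whose tops fall within `y_threshold` of the first word in the current line are treated as the same line, and each line is then ordered left-to-right. A stripped-down sketch of that first pass (the `Box` dataclass here stands in for the real `OCRResult`):

```python
from dataclasses import dataclass

@dataclass
class Box:
    text: str
    left: float
    top: float
    width: float
    height: float

def group_into_lines(results: list, y_threshold: float = 12.0) -> list:
    # First pass: bucket results into lines by vertical proximity
    lines, current = [], []
    for r in sorted(results, key=lambda x: x.top):
        if not current or abs(r.top - current[0].top) <= y_threshold:
            current.append(r)
        else:
            lines.append(current)
            current = [r]
    if current:
        lines.append(current)
    # Second pass: order each line left-to-right
    for line in lines:
        line.sort(key=lambda x: x.left)
    return lines

lines = group_into_lines([
    Box('world', 100, 10, 40, 12),
    Box('hello', 10, 12, 40, 12),
    Box('next', 10, 40, 40, 12),
])
```

Anchoring the comparison on the first word of the line (rather than the previous word) keeps slowly drifting baselines from merging adjacent lines.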
tools/data_anonymise.py
CHANGED

```python
import re
import secrets
import base64
import time
import boto3
import botocore
import pandas as pd

from faker import Faker
from gradio import Progress
```

```python
def anonymise_data_files(file_paths: List[str],
        ...
        comprehend_query_number: int = 0,
        aws_access_key_textbox: str = '',
        aws_secret_key_textbox: str = '',
        ...
        progress: Progress = Progress(track_tqdm=True)):
    """
    This function anonymises data files based on the provided parameters.
    ...
    - comprehend_query_number (int, optional): A counter tracking the number of queries to AWS Comprehend.
    - aws_access_key_textbox (str, optional): AWS access key for account with Textract and Comprehend permissions.
    - aws_secret_key_textbox (str, optional): AWS secret key for account with Textract and Comprehend permissions.
    ...
    - progress (Progress, optional): A Progress object to track progress. Defaults to a Progress object with track_tqdm=True.
    """
    ...
    if not out_file_paths:
        out_file_paths = []
    ...
    else:
        in_allow_list_flat = []
    ...
    else:
        comprehend_client = ""
        out_message = "Cannot connect to AWS Comprehend service. Please provide access keys under Textract settings on the Redaction settings tab, or choose another PII identification method."

    # Check if files and text exist
    if not file_paths:
        ...
        file_paths = ['open_text']
    else:
        out_message = "Please enter text or a file to redact."

    # If we have already redacted the last file, return the input out_message and file list to the relevant components
    if latest_file_completed >= len(file_paths):
        ...
        # Set to a very high number so as not to mess with subsequent file processing by the user
        latest_file_completed = 99
        final_out_message = '\n'.join(out_message)
        return final_out_message, out_file_paths, out_file_paths, latest_file_completed, log_files_output_paths, log_files_output_paths

    file_path_loop = [file_paths[int(latest_file_completed)]]

    for anon_file in progress.tqdm(file_path_loop, desc="Anonymising files", unit="…"):
        if anon_file == 'open_text':
            anon_df = pd.DataFrame(data={'text': [in_text]})
            chosen_cols = ['text']
            sheet_name = ""
            file_type = ""
            out_file_part = anon_file

            out_file_paths, out_message, key_string, log_files_output_paths = anon_wrapper_func(anon_file, anon_df, chosen_cols, out_file_paths, out_file_part, out_message, sheet_name, anon_strat, language, chosen_redact_entities, in_allow_list, file_type, "", log_files_output_paths, in_deny_list, max_fuzzy_spelling_mistakes_num, pii_identification_method, chosen_redact_comprehend_entities, comprehend_query_number, comprehend_client, output_folder=OUTPUT_FOLDER)
        else:
            ...
            out_message.append("No Excel sheets selected. Please select at least one to anonymise.")
            continue

            anon_xlsx = pd.ExcelFile(anon_file)

            # Create xlsx file:
            from openpyxl import Workbook
            ...
            wb.save(anon_xlsx_export_file_name)

            # Iterate through the sheet names
            for sheet_name in in_excel_sheets:
                # Read each sheet into a DataFrame
                if sheet_name not in anon_xlsx.sheet_names:
                    continue

                anon_df = pd.read_excel(anon_file, sheet_name=sheet_name)

                out_file_paths, out_message, key_string, log_files_output_paths = anon_wrapper_func(anon_file, anon_df, chosen_cols, out_file_paths, out_file_part, out_message, sheet_name, anon_strat, language, chosen_redact_entities, in_allow_list, file_type, ...)
        else:
            sheet_name = ""
            anon_df = read_file(anon_file)
            ...

    # Increase latest file completed count unless we are at the last file
    if latest_file_completed != len(file_paths):
        print("Completed file number:", str(latest_file_completed))
        latest_file_completed += 1

    toc = time.perf_counter()
    ...
    out_message.append("Anonymisation of file '" + out_file_part + "' successfully completed in")

    out_message_out = '\n'.join(out_message)
    out_message_out = out_message_out + " " + out_time
    ...
    out_message_out = out_message_out + "\n\nGo to the Redaction settings tab to see redaction logs. Please give feedback on the results below to help improve this app."

    return out_message_out, out_file_paths, out_file_paths, latest_file_completed, log_files_output_paths, log_files_output_paths

def anon_wrapper_func(
    anon_file: str,
    ...
    anon_df_out = anon_df_out[all_cols_original_order]

    # Export file
    # Rename anonymisation strategy for file path naming
    if anon_strat == "replace with 'REDACTED'": anon_strat_txt = "redact_replace"
    elif anon_strat == "replace with <ENTITY_NAME>": anon_strat_txt = "redact_entity_type"
    ...
    anon_export_file_name = anon_xlsx_export_file_name
    ...
    # Create a Pandas Excel writer using openpyxl as the engine, in append mode.
    with pd.ExcelWriter(anon_xlsx_export_file_name, engine='openpyxl', mode='a') as writer:
        # Write each DataFrame to a different worksheet.
        anon_df_out.to_excel(writer, sheet_name=excel_sheet_name, index=None)
    ...
    # Print result text to output text box if just anonymising open text
    if anon_file == 'open_text':
        out_message = [anon_df_out['text'][0]]

    return out_file_paths, out_message, key_string, log_files_output_paths

def anonymise_script(df: pd.DataFrame, anon_strat: str, language: str, chosen_redact_entities, ...):
    ...
    # DataFrame to dict
    df_dict = df.to_dict(orient="list")

    if in_allow_list:
        ...
    else:
        in_allow_list_flat = []
    ...
    #analyzer = nlp_analyser #AnalyzerEngine()
    batch_analyzer = BatchAnalyzerEngine(analyzer_engine=nlp_analyser)
    anonymizer = AnonymizerEngine()  #conflict_resolution=ConflictResolutionStrategy.MERGE_SIMILAR_OR_CONTAINED)
    batch_anonymizer = BatchAnonymizerEngine(anonymizer_engine=anonymizer)

    analyzer_results = []

    if pii_identification_method == "Local":
        ...
```
|
|
692 |
analyse_time_out = f"Analysing the text took {analyse_toc - analyse_tic:0.1f} seconds."
|
693 |
print(analyse_time_out)
|
694 |
|
695 |
-
# Create faker function (note that it has to receive a value)
|
696 |
-
#fake = Faker("en_UK")
|
697 |
-
|
698 |
-
#def fake_first_name(x):
|
699 |
-
# return fake.first_name()
|
700 |
-
|
701 |
# Set up the anonymization configuration WITHOUT DATE_TIME
|
702 |
simple_replace_config = eval('{"DEFAULT": OperatorConfig("replace", {"new_value": "REDACTED"})}')
|
703 |
replace_config = eval('{"DEFAULT": OperatorConfig("replace")}')
|
@@ -714,9 +730,13 @@ def anonymise_script(df:pd.DataFrame, anon_strat:str, language:str, chosen_redac
|
|
714 |
if anon_strat == "mask": chosen_mask_config = mask_config
|
715 |
if anon_strat == "encrypt":
|
716 |
chosen_mask_config = people_encrypt_config
|
717 |
-
|
718 |
-
key = secrets.token_bytes(16) # 128 bits = 16 bytes
|
719 |
key_string = base64.b64encode(key).decode('utf-8')
|
|
|
|
|
|
|
|
|
|
|
720 |
elif anon_strat == "fake_first_name": chosen_mask_config = fake_first_name_config
|
721 |
|
722 |
# I think in general people will want to keep date / times - removed Mar 2025 as I don't want to assume for people.
|
|
|
1 |
import re
|
2 |
+
import os
|
3 |
import secrets
|
4 |
import base64
|
5 |
import time
|
6 |
import boto3
|
7 |
import botocore
|
8 |
import pandas as pd
|
9 |
+
from openpyxl import Workbook, load_workbook
|
10 |
|
11 |
from faker import Faker
|
12 |
from gradio import Progress
|
|
|
228 |
comprehend_query_number:int=0,
|
229 |
aws_access_key_textbox:str='',
|
230 |
aws_secret_key_textbox:str='',
|
231 |
+
actual_time_taken_number:float=0,
|
232 |
progress: Progress = Progress(track_tqdm=True)):
|
233 |
"""
|
234 |
This function anonymises data files based on the provided parameters.
|
|
|
255 |
- comprehend_query_number (int, optional): A counter tracking the number of queries to AWS Comprehend.
|
256 |
- aws_access_key_textbox (str, optional): AWS access key for account with Textract and Comprehend permissions.
|
257 |
- aws_secret_key_textbox (str, optional): AWS secret key for account with Textract and Comprehend permissions.
|
258 |
+
- actual_time_taken_number (float, optional): Time taken to do the redaction.
|
259 |
- progress (Progress, optional): A Progress object to track progress. Defaults to a Progress object with track_tqdm=True.
|
260 |
"""
|
261 |
|
|
|
281 |
if not out_file_paths:
|
282 |
out_file_paths = []
|
283 |
|
284 |
+
if isinstance(in_allow_list, list):
|
285 |
+
if in_allow_list:
|
286 |
+
in_allow_list_flat = in_allow_list
|
287 |
+
else:
|
288 |
+
in_allow_list_flat = []
|
289 |
+
elif isinstance(in_allow_list, pd.DataFrame):
|
290 |
+
if not in_allow_list.empty:
|
291 |
+
in_allow_list_flat = list(in_allow_list.iloc[:, 0].unique())
|
292 |
+
else:
|
293 |
+
in_allow_list_flat = []
|
294 |
else:
|
295 |
in_allow_list_flat = []
|
296 |
|
|
|
317 |
else:
|
318 |
comprehend_client = ""
|
319 |
out_message = "Cannot connect to AWS Comprehend service. Please provide access keys under Textract settings on the Redaction settings tab, or choose another PII identification method."
|
320 |
+
raise(out_message)
|
321 |
|
322 |
# Check if files and text exist
|
323 |
if not file_paths:
|
|
|
325 |
file_paths=['open_text']
|
326 |
else:
|
327 |
out_message = "Please enter text or a file to redact."
|
328 |
+
raise Exception(out_message)
|
329 |
|
330 |
# If we have already redacted the last file, return the input out_message and file list to the relevant components
|
331 |
if latest_file_completed >= len(file_paths):
|
|
|
333 |
# Set to a very high number so as not to mess with subsequent file processing by the user
|
334 |
latest_file_completed = 99
|
335 |
final_out_message = '\n'.join(out_message)
|
336 |
+
return final_out_message, out_file_paths, out_file_paths, latest_file_completed, log_files_output_paths, log_files_output_paths, actual_time_taken_number
|
337 |
|
338 |
file_path_loop = [file_paths[int(latest_file_completed)]]
|
339 |
|
340 |
+
for anon_file in progress.tqdm(file_path_loop, desc="Anonymising files", unit = "files"):
|
341 |
|
342 |
if anon_file=='open_text':
|
343 |
anon_df = pd.DataFrame(data={'text':[in_text]})
|
344 |
chosen_cols=['text']
|
345 |
+
out_file_part = anon_file
|
346 |
sheet_name = ""
|
347 |
file_type = ""
|
|
|
348 |
|
349 |
out_file_paths, out_message, key_string, log_files_output_paths = anon_wrapper_func(anon_file, anon_df, chosen_cols, out_file_paths, out_file_part, out_message, sheet_name, anon_strat, language, chosen_redact_entities, in_allow_list, file_type, "", log_files_output_paths, in_deny_list, max_fuzzy_spelling_mistakes_num, pii_identification_method, chosen_redact_comprehend_entities, comprehend_query_number, comprehend_client, output_folder=OUTPUT_FOLDER)
|
350 |
else:
|
|
|
361 |
out_message.append("No Excel sheets selected. Please select at least one to anonymise.")
|
362 |
continue
|
363 |
|
|
|
|
|
364 |
# Create xlsx file:
|
365 |
+
anon_xlsx = pd.ExcelFile(anon_file)
|
366 |
+
anon_xlsx_export_file_name = output_folder + out_file_part + "_redacted.xlsx"
|
|
|
367 |
|
368 |
+
|
|
|
369 |
|
370 |
# Iterate through the sheet names
|
371 |
+
for sheet_name in progress.tqdm(in_excel_sheets, desc="Anonymising sheets", unit = "sheets"):
|
372 |
# Read each sheet into a DataFrame
|
373 |
if sheet_name not in anon_xlsx.sheet_names:
|
374 |
continue
|
375 |
|
376 |
anon_df = pd.read_excel(anon_file, sheet_name=sheet_name)
|
377 |
|
378 |
+
out_file_paths, out_message, key_string, log_files_output_paths = anon_wrapper_func(anon_file, anon_df, chosen_cols, out_file_paths, out_file_part, out_message, sheet_name, anon_strat, language, chosen_redact_entities, in_allow_list, file_type, anon_xlsx_export_file_name, log_files_output_paths, in_deny_list, max_fuzzy_spelling_mistakes_num, pii_identification_method, chosen_redact_comprehend_entities, comprehend_query_number, comprehend_client, output_folder=output_folder)
|
379 |
+
|
380 |
else:
|
381 |
sheet_name = ""
|
382 |
anon_df = read_file(anon_file)
|
|
|
387 |
# Increase latest file completed count unless we are at the last file
|
388 |
if latest_file_completed != len(file_paths):
|
389 |
print("Completed file number:", str(latest_file_completed))
|
390 |
+
latest_file_completed += 1
|
391 |
|
392 |
toc = time.perf_counter()
|
393 |
+
out_time_float = toc - tic
|
394 |
+
out_time = f"in {out_time_float:0.1f} seconds."
|
395 |
+
print(out_time)
|
396 |
+
|
397 |
+
actual_time_taken_number += out_time_float
|
398 |
|
399 |
out_message.append("Anonymisation of file '" + out_file_part + "' successfully completed in")
|
400 |
|
401 |
out_message_out = '\n'.join(out_message)
|
402 |
out_message_out = out_message_out + " " + out_time
|
403 |
|
404 |
+
if anon_strat == "encrypt":
|
405 |
+
out_message_out.append(". Your decryption key is " + key_string)
|
406 |
+
|
407 |
out_message_out = out_message_out + "\n\nGo to to the Redaction settings tab to see redaction logs. Please give feedback on the results below to help improve this app."
|
408 |
+
|
409 |
+
out_message_out = re.sub(r'^\n+|^\. ', '', out_message_out).strip()
|
410 |
|
411 |
+
return out_message_out, out_file_paths, out_file_paths, latest_file_completed, log_files_output_paths, log_files_output_paths, actual_time_taken_number
|
412 |
|
413 |
def anon_wrapper_func(
|
414 |
anon_file: str,
|
|
|
507 |
anon_df_out = anon_df_out[all_cols_original_order]
|
508 |
|
509 |
# Export file
|
|
|
510 |
# Rename anonymisation strategy for file path naming
|
511 |
if anon_strat == "replace with 'REDACTED'": anon_strat_txt = "redact_replace"
|
512 |
elif anon_strat == "replace with <ENTITY_NAME>": anon_strat_txt = "redact_entity_type"
|
|
|
518 |
|
519 |
anon_export_file_name = anon_xlsx_export_file_name
|
520 |
|
521 |
+
if not os.path.exists(anon_xlsx_export_file_name):
|
522 |
+
wb = Workbook()
|
523 |
+
ws = wb.active # Get the default active sheet
|
524 |
+
ws.title = excel_sheet_name
|
525 |
+
wb.save(anon_xlsx_export_file_name)
|
526 |
+
|
527 |
# Create a Pandas Excel writer using XlsxWriter as the engine.
|
528 |
+
with pd.ExcelWriter(anon_xlsx_export_file_name, engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
|
529 |
# Write each DataFrame to a different worksheet.
|
530 |
anon_df_out.to_excel(writer, sheet_name=excel_sheet_name, index=None)
|
531 |
|
|
|
549 |
|
550 |
# Print result text to output text box if just anonymising open text
|
551 |
if anon_file=='open_text':
|
552 |
+
out_message = ["'" + anon_df_out['text'][0] + "'"]
|
553 |
|
554 |
return out_file_paths, out_message, key_string, log_files_output_paths
|
555 |
|
|
|
568 |
# DataFrame to dict
|
569 |
df_dict = df.to_dict(orient="list")
|
570 |
|
571 |
+
if isinstance(in_allow_list, list):
|
572 |
+
if in_allow_list:
|
573 |
+
in_allow_list_flat = in_allow_list
|
574 |
+
else:
|
575 |
+
in_allow_list_flat = []
|
576 |
+
elif isinstance(in_allow_list, pd.DataFrame):
|
577 |
+
if not in_allow_list.empty:
|
578 |
+
in_allow_list_flat = list(in_allow_list.iloc[:, 0].unique())
|
579 |
+
else:
|
580 |
+
in_allow_list_flat = []
|
581 |
else:
|
582 |
in_allow_list_flat = []
|
583 |
|
|
|
602 |
|
603 |
#analyzer = nlp_analyser #AnalyzerEngine()
|
604 |
batch_analyzer = BatchAnalyzerEngine(analyzer_engine=nlp_analyser)
|
|
|
605 |
anonymizer = AnonymizerEngine()#conflict_resolution=ConflictResolutionStrategy.MERGE_SIMILAR_OR_CONTAINED)
|
606 |
+
batch_anonymizer = BatchAnonymizerEngine(anonymizer_engine = anonymizer)
|
|
|
|
|
607 |
analyzer_results = []
|
608 |
|
609 |
if pii_identification_method == "Local":
|
|
|
714 |
analyse_time_out = f"Analysing the text took {analyse_toc - analyse_tic:0.1f} seconds."
|
715 |
print(analyse_time_out)
|
716 |
|
|
|
|
|
|
|
|
|
|
|
|
|
717 |
# Set up the anonymization configuration WITHOUT DATE_TIME
|
718 |
simple_replace_config = eval('{"DEFAULT": OperatorConfig("replace", {"new_value": "REDACTED"})}')
|
719 |
replace_config = eval('{"DEFAULT": OperatorConfig("replace")}')
|
|
|
730 |
if anon_strat == "mask": chosen_mask_config = mask_config
|
731 |
if anon_strat == "encrypt":
|
732 |
chosen_mask_config = people_encrypt_config
|
733 |
+
key = secrets.token_bytes(16) # 128 bits = 16 bytes
|
|
|
734 |
key_string = base64.b64encode(key).decode('utf-8')
|
735 |
+
|
736 |
+
# Now inject the key into the operator config
|
737 |
+
for entity, operator in chosen_mask_config.items():
|
738 |
+
if operator.operator_name == "encrypt":
|
739 |
+
operator.params = {"key": key_string}
|
740 |
elif anon_strat == "fake_first_name": chosen_mask_config = fake_first_name_config
|
741 |
|
742 |
# I think in general people will want to keep date / times - removed Mar 2025 as I don't want to assume for people.
|
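The allow-list handling in this diff accepts either a flat Python list or a DataFrame loaded from an uploaded CSV, and appears twice (in `anonymise_data_files` and `anonymise_script`). A minimal standalone sketch of that normalisation, with a hypothetical helper name:

```python
import pandas as pd

def flatten_allow_list(in_allow_list) -> list:
    """Normalise an allow list that may be a list, a DataFrame, or missing."""
    if isinstance(in_allow_list, list):
        # An empty list and a populated list are both valid inputs
        return in_allow_list if in_allow_list else []
    if isinstance(in_allow_list, pd.DataFrame):
        if not in_allow_list.empty:
            # Take the first column only, de-duplicated, as in the diff
            return list(in_allow_list.iloc[:, 0].unique())
        return []
    # Anything else (None, a string, etc.) falls back to no allow list
    return []
```

Factoring the branch into one helper would also avoid the duplication between the two functions.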
tools/file_conversion.py
CHANGED
@@ -21,6 +21,7 @@ from PIL import Image
 21 |  from scipy.spatial import cKDTree
 22 |  import random
 23 |  import string
 24 |
 25 |  IMAGE_NUM_REGEX = re.compile(r'_(\d+)\.png$')
 26 |

@@ -461,7 +462,8 @@ def prepare_image_or_pdf(
462 |      input_folder:str=INPUT_FOLDER,
463 |      prepare_images:bool=True,
464 |      page_sizes:list[dict]=[],
    |-     textract_output_found:bool = False,
465 |      progress: Progress = Progress(track_tqdm=True)
466 |  ) -> tuple[List[str], List[str]]:
467 |      """
@@ -483,7 +485,8 @@ def prepare_image_or_pdf(
485 |      output_folder (optional, str): The output folder for file save
486 |      prepare_images (optional, bool): A boolean indicating whether to create images for each PDF page. Defaults to True.
487 |      page_sizes(optional, List[dict]): A list of dicts containing information about page sizes in various formats.
    |-     textract_output_found (optional, bool): A boolean indicating whether
488 |      progress (optional, Progress): Progress tracker for the operation
489 |
@@ -535,7 +538,7 @@ def prepare_image_or_pdf(
538 |          final_out_message = '\n'.join(out_message)
539 |      else:
540 |          final_out_message = out_message
    |-     return final_out_message, converted_file_paths, image_file_paths, number_of_pages, number_of_pages, pymupdf_doc, all_annotations_object, review_file_csv, original_cropboxes, page_sizes, textract_output_found, all_img_details, all_line_level_ocr_results_df
541 |
542 |      progress(0.1, desc='Preparing file')
543 |
@@ -617,11 +620,10 @@ def prepare_image_or_pdf(
620 |
621 |          elif file_extension in ['.csv']:
622 |              if '_review_file' in file_path_without_ext:
    |-                 #print("file_path:", file_path)
623 |                  review_file_csv = read_file(file_path)
624 |                  all_annotations_object = convert_review_df_to_annotation_json(review_file_csv, image_file_paths, page_sizes)
625 |                  json_from_csv = True
    |-                 print("Converted CSV review file to image annotation object")
626 |              elif '_ocr_output' in file_path_without_ext:
627 |                  all_line_level_ocr_results_df = read_file(file_path)
628 |                  json_from_csv = False
@@ -639,8 +641,8 @@ def prepare_image_or_pdf(
641 |              # Assuming file_path is a NamedString or similar
642 |              all_annotations_object = json.loads(file_path) # Use loads for string content
643 |
    |-         #
    |-         elif (file_extension in ['.json']) and (prepare_for_review != True):
644 |              print("Saving Textract output")
645 |              # Copy it to the output folder so it can be used later.
646 |              output_textract_json_file_name = file_path_without_ext
@@ -654,6 +656,20 @@ def prepare_image_or_pdf(
656 |              textract_output_found = True
657 |              continue
658 |
659 |          # NEW IF STATEMENT
660 |          # If you have an annotations object from the above code
661 |          if all_annotations_object:
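The hunk above copies an uploaded Textract JSON output into the output folder so it can be reused on later runs. A minimal sketch of that save-a-copy step (helper name and parsing-before-saving behaviour are assumptions, not the repository's exact code):

```python
import json
from pathlib import Path

def save_json_copy(src_path: str, output_folder: str) -> str:
    """Copy a JSON file into output_folder, validating it parses on the way."""
    # Parse then re-serialise so invalid JSON fails here rather than downstream
    data = json.loads(Path(src_path).read_text())
    out_path = Path(output_folder) / Path(src_path).name
    out_path.write_text(json.dumps(data))
    return str(out_path)
```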
@@ -773,7 +789,40 @@ def prepare_image_or_pdf(
789 |
790 |      number_of_pages = len(page_sizes)#len(image_file_paths)
791 |
    |-     return combined_out_message, converted_file_paths, image_file_paths, number_of_pages, number_of_pages, pymupdf_doc, all_annotations_object, review_file_csv, original_cropboxes, page_sizes, textract_output_found, all_img_details, all_line_level_ocr_results_df
792 |
793 |  def convert_text_pdf_to_img_pdf(in_file_path:str, out_text_file_path:List[str], image_dpi:float=image_dpi, output_folder:str=OUTPUT_FOLDER, input_folder:str=INPUT_FOLDER):
794 |      file_path_without_ext = get_file_name_without_type(in_file_path)
@@ -850,121 +899,246 @@ def remove_duplicate_images_with_blank_boxes(data: List[dict]) -> List[dict]:
899 |
900 |      return result
901 |
    |- def divide_coordinates_by_page_sizes(
    |-     coord_cols = [xmin, xmax, ymin, ymax]
    |-     for col in coord_cols:
    |-         review_file_df.loc[:, col] = pd.to_numeric(review_file_df[col], errors="coerce")
    |-
    |-     review_file_df_orig = review_file_df.copy().loc[(review_file_df[xmin] <= 1) & (review_file_df[xmax] <= 1) & (review_file_df[ymin] <= 1) & (review_file_df[ymax] <= 1),:]
    |-
    |-     #print("review_file_df_orig:", review_file_df_orig)
    |-
    |-     review_file_df_div = review_file_df.loc[(review_file_df[xmin] > 1) & (review_file_df[xmax] > 1) & (review_file_df[ymin] > 1) & (review_file_df[ymax] > 1),:]
    |-
    |-     #print("review_file_df_div:", review_file_df_div)
    |-
    |-     review_file_df_div.loc[:, "page"] = pd.to_numeric(review_file_df_div["page"], errors="coerce")
    |      else:
    |-
    |-     # Only sort if the DataFrame is not empty and contains the required columns
    |-     required_sort_columns = {"page", xmin, ymin}
    |-     if not review_file_df_out.empty and required_sort_columns.issubset(review_file_df_out.columns):
    |-         review_file_df_out.sort_values(["page", ymin, xmin], inplace=True)
    |
    |- def multiply_coordinates_by_page_sizes(review_file_df: pd.DataFrame, page_sizes_df: pd.DataFrame, xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax"):
    |-     if xmin in review_file_df.columns and not review_file_df.empty:
    |-         review_file_df_orig = review_file_df.loc[
    |-             (review_file_df[xmin] > 1) & (review_file_df[xmax] > 1) &
    |-             (review_file_df[ymin] > 1) & (review_file_df[ymax] > 1), :].copy()
    |-
    |-         review_file_df_na = review_file_df.loc[review_file_df["image_width"].isna()].copy()
    |-
    |-         # Multiply coordinates by image sizes
    |-         review_file_df_not_na[xmin] *= review_file_df_not_na["image_width"]
    |-         review_file_df_not_na[xmax] *= review_file_df_not_na["image_width"]
    |-         review_file_df_not_na[ymin] *= review_file_df_not_na["image_height"]
    |-         review_file_df_not_na[ymax] *= review_file_df_not_na["image_height"]
    |-
    |-         if dfs_to_concat: # Ensure there's at least one non-empty DataFrame
    |-             review_file_df = pd.concat(dfs_to_concat)
    |-         else:
    |-             review_file_df = pd.DataFrame() # Return an empty DataFrame instead of raising an error
    |-
    |-         required_sort_columns = {"page", "xmin", "ymin"}
    |-         if not review_file_df.empty and required_sort_columns.issubset(review_file_df.columns):
    |-             review_file_df.sort_values(["page", "xmin", "ymin"], inplace=True)
    |
1143 |
1144 |  def do_proximity_match_by_page_for_text(df1:pd.DataFrame, df2:pd.DataFrame):
1145 |      '''
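The deleted `divide_coordinates_by_page_sizes` / `multiply_coordinates_by_page_sizes` bodies above convert between absolute pixel coordinates and page-relative (0-1) coordinates, splitting rows on whether any coordinate exceeds 1. A simplified pandas sketch of the divide direction, under the assumption that values ≤ 1 are already relative (this is an illustration, not the refactored code from the PR):

```python
import pandas as pd

def divide_coords(df: pd.DataFrame, width: float, height: float) -> pd.DataFrame:
    """Scale absolute pixel box coordinates down to the 0-1 range."""
    out = df.copy()
    # Rows with any coordinate > 1 are treated as absolute pixels
    absolute = (out[["xmin", "xmax", "ymin", "ymax"]] > 1).any(axis=1)
    out.loc[absolute, ["xmin", "xmax"]] /= width
    out.loc[absolute, ["ymin", "ymax"]] /= height
    return out
```

The multiply direction is the same pattern with `*=` and an "already absolute" mask instead.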
@@ -1018,7 +1192,6 @@ def do_proximity_match_by_page_for_text(df1:pd.DataFrame, df2:pd.DataFrame):
1192 |
1193 |      return merged_df
1194 |
    |-
1195 |  def do_proximity_match_all_pages_for_text(df1:pd.DataFrame, df2:pd.DataFrame, threshold:float=0.03):
1196 |      '''
1197 |      Match text from one dataframe to another based on proximity matching of coordinates across all pages.
@@ -1142,12 +1315,12 @@ def convert_annotation_data_to_dataframe(all_annotations: List[Dict[str, Any]]):
1315 |      # prevents this from being necessary.
1316 |
1317 |      # 7. Ensure essential columns exist and set column order
     |-     essential_box_cols = ["xmin", "xmax", "ymin", "ymax", "text", "id"]
1318 |      for col in essential_box_cols:
1319 |          if col not in final_df.columns:
1320 |              final_df[col] = pd.NA # Add column with NA if it wasn't present in any box
1321 |
     |-     base_cols = ["image"
1322 |      extra_box_cols = [col for col in final_df.columns if col not in base_cols and col not in essential_box_cols]
1323 |      final_col_order = base_cols + essential_box_cols + sorted(extra_box_cols)
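The column-ordering step above adds any missing essential box columns as NA and then imposes a stable order with `reindex`. A self-contained sketch of that pattern (helper name is illustrative):

```python
import pandas as pd

def ensure_columns(df: pd.DataFrame, essential: list, base: list) -> pd.DataFrame:
    """Guarantee essential columns exist, then order: base, essential, extras."""
    for col in essential:
        if col not in df.columns:
            # Missing essential columns are added filled with NA
            df[col] = pd.NA
    extra = [c for c in df.columns if c not in base and c not in essential]
    # reindex adds any still-missing base columns and fixes the order
    return df.reindex(columns=base + essential + sorted(extra), fill_value=pd.NA)
```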
@@ -1156,6 +1329,8 @@ def convert_annotation_data_to_dataframe(all_annotations: List[Dict[str, Any]]):
1329 |      # but it's good practice if columns could be missing for other reasons.
1330 |      final_df = final_df.reindex(columns=final_col_order, fill_value=pd.NA)
1331 |
1332 |      return final_df
1333 |
1334 |  def create_annotation_dicts_from_annotation_df(
@@ -1185,7 +1360,8 @@ def create_annotation_dicts_from_annotation_df(
1360 |      available_cols = [col for col in box_cols if col in all_image_annotations_df.columns]
1361 |
1362 |      if 'text' in all_image_annotations_df.columns:
     |-         all_image_annotations_df
1363 |
1364 |      if not available_cols:
1365 |          print(f"Warning: None of the expected box columns ({box_cols}) found in DataFrame.")
@@ -1226,85 +1402,84 @@ def create_annotation_dicts_from_annotation_df(
1402 |
1403 |      return result
1404 |
     |- def convert_annotation_json_to_review_df(
1405 |      '''
1406 |      Convert the annotation json data to a dataframe format.
1407 |      Add on any text from the initial review_file dataframe by joining based on 'id' if available
1408 |      in both sources, otherwise falling back to joining on pages/co-ordinates (if option selected).
1409 |      '''
1410 |
1411 |      # 1. Convert annotations to DataFrame
     |-     # Ensure convert_annotation_data_to_dataframe populates the 'id' column
     |-     # if 'id' exists in the dictionaries within all_annotations.
1412 |      review_file_df = convert_annotation_data_to_dataframe(all_annotations)
1413 |
     |-     # Only keep rows in review_df where there are coordinates
1414 |
1415 |      # Exit early if the initial conversion results in an empty DataFrame
1416 |      if review_file_df.empty:
1417 |          # Define standard columns for an empty return DataFrame
     |
1418 |      if not page_sizes_df.empty:
     |-         print("review_file_df after coord divide:", review_file_df)
     |-
     |-         # Also apply to redaction_decision_output if it's not empty and has page numbers
     |-         if not redaction_decision_output.empty and 'page' in redaction_decision_output.columns:
     |-             redaction_decision_output['page'] = pd.to_numeric(redaction_decision_output['page'], errors='coerce')
     |-             # Drop rows with invalid pages before division
     |-             redaction_decision_output.dropna(subset=['page'], inplace=True)
     |-             redaction_decision_output['page'] = redaction_decision_output['page'].astype(int)
     |-             redaction_decision_output = divide_coordinates_by_page_sizes(redaction_decision_output, page_sizes_df)
     |-
     |-             print("redaction_decision_output after coord divide:", redaction_decision_output)
     |-     else:
     |-         print("Warning: Page sizes DataFrame became empty after processing, skipping coordinate division.")
1419 |
1420 |
1421 |      # 3. Join additional data from redaction_decision_output if provided
1422 |      if not redaction_decision_output.empty:
     |-         # ---
1423 |
1424 |          if id_col_exists_in_review and id_col_exists_in_redaction:
1425 |              #print("Attempting to join data based on 'id' column.")
1426 |              try:
     |-                 # Ensure 'id' columns are of
1427 |                  review_file_df['id'] = review_file_df['id'].astype(str)
     |-                 # Make a copy to avoid
1428 |                  redaction_copy = redaction_decision_output.copy()
1429 |                  redaction_copy['id'] = redaction_copy['id'].astype(str)
1430 |
     |-                 # Select columns to merge from redaction output.
     |-                 # Primarily interested in 'text', but keep 'id' for the merge key.
     |-                 # Add other columns from redaction_copy if needed.
1431 |                  cols_to_merge = ['id']
1432 |                  if 'text' in redaction_copy.columns:
1433 |                      cols_to_merge.append('text')
@@ -1312,82 +1487,130 @@ def convert_annotation_json_to_review_df(all_annotations: List[dict],
1487 |                      print("Warning: 'text' column not found in redaction_decision_output. Cannot merge text using 'id'.")
1488 |
1489 |                  # Perform a left merge to keep all annotations and add matching text
1490 |                  merged_df = pd.merge(
1491 |                      review_file_df,
1492 |                      redaction_copy[cols_to_merge],
1493 |                      on='id',
1494 |                      how='left',
     |-                     suffixes=('',
1495 |                  )
1496 |
     |-                 # Update the
     |-                 merged_df
     |-
     |-                 merged_df = merged_df.drop(columns=['text_redaction'])
1497 |
     |-                 final_cols = original_cols
     |-                 if 'text' not in final_cols and 'text' in merged_df.columns:
     |-                     final_cols.append('text') # Make sure text column is kept if newly added
     |-                 # Reorder/select columns if necessary, ensuring 'id' is kept
     |-                 review_file_df = merged_df[[col for col in final_cols if col in merged_df.columns] + (['id'] if 'id' not in final_cols else [])]
1498 |
     |-                 #print("Successfully joined data using 'id'.")
     |-                 joined_by_id = True
1499 |
1500 |              except Exception as e:
     |-                 print(f"Error during 'id'-based merge: {e}.
     |-                 # Fall through to proximity match below
     |-
     |-             # --- Fallback to proximity match ---
1501 |      if 'id' in review_file_df.columns:
1502 |
1503 |          if col not in review_file_df.columns:
     |-             #
     |-             #
     |-             review_file_df[col] = ''
1504 |
1505 |      # Select and order the final set of columns
1506 |
1507 |      # 5. Final processing and sorting
     |-     #
1508 |      if 'color' in review_file_df.columns:
1509 |
1510 |      # Sort the results
     |-     sort_columns = ['page', 'ymin', 'xmin', 'label']
1511 |      # Ensure sort columns exist before sorting
1512 |      valid_sort_columns = [col for col in sort_columns if col in review_file_df.columns]
     |-     if valid_sort_columns:
1513 |
1514 |      return review_file_df
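The hunk above joins review annotations to the redaction output on `'id'` to pull text across, casting both key columns to string first and handling a suffixed duplicate `text` column. A sketch of that left-merge pattern; the `fillna` fallback for existing text is an assumption here, since the PR's exact combine step is truncated in the diff:

```python
import pandas as pd

def join_text_by_id(review_df: pd.DataFrame, redaction_df: pd.DataFrame) -> pd.DataFrame:
    """Left-merge redaction text onto review annotations via the 'id' column."""
    review_df = review_df.copy()
    redaction_df = redaction_df.copy()
    # Merge keys must share a dtype; force both sides to string
    review_df["id"] = review_df["id"].astype(str)
    redaction_df["id"] = redaction_df["id"].astype(str)
    merged = pd.merge(review_df, redaction_df[["id", "text"]],
                      on="id", how="left", suffixes=("", "_redaction"))
    if "text_redaction" in merged.columns:
        # Prefer existing text, fall back to the redaction side, drop the helper column
        merged["text"] = merged["text"].fillna(merged["text_redaction"])
        merged = merged.drop(columns=["text_redaction"])
    return merged
```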
@@ -1472,20 +1695,18 @@ def fill_missing_box_ids(data_input: dict) -> dict:
1695 |
1696 |  def fill_missing_ids(df: pd.DataFrame, column_name: str = 'id', length: int = 12) -> pd.DataFrame:
1697 |      """
     |-     Generates unique alphanumeric IDs for rows in a DataFrame column
     |-     where the value is missing (NaN, None) or an empty string.
1698 |
1699 |      Args:
1700 |          df (pd.DataFrame): The input Pandas DataFrame.
1701 |          column_name (str): The name of the column to check and fill (defaults to 'id').
1702 |                             This column will be added if it doesn't exist.
1703 |          length (int): The desired length of the generated IDs (defaults to 12).
     |-                       Cannot exceed the limits that guarantee uniqueness based
     |-                       on the number of IDs needed and character set size.
1704 |
1705 |      Returns:
1706 |          pd.DataFrame: The DataFrame with missing/empty IDs filled in the specified column.
     |-                       Note: The function modifies the DataFrame in
1707 |      """
1708 |
1709 |      # --- Input Validation ---
@@ -1497,43 +1718,59 @@ def fill_missing_ids(df: pd.DataFrame, column_name: str = 'id', length: int = 12
1718 |          raise ValueError("'length' must be a positive integer.")
1719 |
1720 |      # --- Ensure Column Exists ---
1721 |      if column_name not in df.columns:
1722 |          print(f"Column '{column_name}' not found. Adding it to the DataFrame.")
1723 |
1724 |      # --- Identify Rows Needing IDs ---
     |-     # Check for NaN, None,
     |-     is_missing_or_empty = df[column_name].isna()
1725 |
1726 |      rows_to_fill_index = df.index[is_missing_or_empty]
1727 |      num_needed = len(rows_to_fill_index)
1728 |
1729 |      if num_needed == 0:
     |-         #
1730 |          return df
1731 |
1732 |      print(f"Found {num_needed} rows requiring a unique ID in column '{column_name}'.")
1733 |
1734 |      # --- Get Existing IDs to Ensure Uniqueness ---
1735 |
1736 |
1737 |      # --- Generate Unique IDs ---
@@ -1543,93 +1780,232 @@ def fill_missing_ids(df: pd.DataFrame, column_name: str = 'id', length: int = 12
1780 |
1781 |      max_possible_ids = len(character_set) ** length
1782 |      if num_needed > max_possible_ids:
1783 |
1784 |      #print(f"Generating {num_needed} unique IDs of length {length}...")
1785 |      for i in range(num_needed):
1786 |          attempts = 0
1787 |          while True:
1788 |              candidate_id = ''.join(random.choices(character_set, k=length))
     |-             # Check against *all* existing IDs and *newly* generated ones
1789 |              if candidate_id not in existing_ids and candidate_id not in generated_ids_set:
1790 |                  generated_ids_set.add(candidate_id)
1791 |                  new_ids_list.append(candidate_id)
1792 |                  break # Found a unique ID
1793 |              attempts += 1
     |-             if attempts >
1794 |
     |-         # Optional progress update
     |-         if (i + 1) % 1000 == 0:
1795 |
1796 |
1797 |      # --- Assign New IDs ---
1798 |      # Use the previously identified index to assign the new IDs correctly
1799 |      df.loc[rows_to_fill_index, column_name] = new_ids_list
1800 |
     |-     # The DataFrame 'df' has been modified in place
1801 |      return df
1802 |
     |- def convert_review_df_to_annotation_json(
1803 |      '''
     |-     for col in float_cols:
     |-         review_file_df.loc[:, col] = pd.to_numeric(review_file_df.loc[:, col], errors='coerce')
     |-
     |-     # Convert relative co-ordinates into image coordinates for the image annotation output object
     |-     if page_sizes:
     |-         page_sizes_df = pd.DataFrame(page_sizes)
     |-         page_sizes_df[["page"]] = page_sizes_df[["page"]].apply(pd.to_numeric, errors="coerce")
1804 |
     |-     review_file_df = fill_missing_ids(review_file_df)
     |-     review_file_df
1805 |
     |-         reported_page_number = int(page_no + 1)
1806 |
     |-         if reported_page_number in review_file_df["page"].values:
1807 |
1808 |      else:
21 |   from scipy.spatial import cKDTree
22 |   import random
23 |   import string
24 | + import warnings # To warn about potential type changes
26 |   IMAGE_NUM_REGEX = re.compile(r'_(\d+)\.png$')
462 |       input_folder:str=INPUT_FOLDER,
463 |       prepare_images:bool=True,
464 |       page_sizes:list[dict]=[],
465 | +     textract_output_found:bool = False,
466 | +     local_ocr_output_found:bool = False,
467 |       progress: Progress = Progress(track_tqdm=True)
468 |   ) -> tuple[List[str], List[str]]:
469 |   """
485 |       output_folder (optional, str): The output folder for file save
486 |       prepare_images (optional, bool): A boolean indicating whether to create images for each PDF page. Defaults to True.
487 |       page_sizes(optional, List[dict]): A list of dicts containing information about page sizes in various formats.
488 | +     textract_output_found (optional, bool): A boolean indicating whether Textract analysis output has already been found. Defaults to False.
489 | +     local_ocr_output_found (optional, bool): A boolean indicating whether local OCR analysis output has already been found. Defaults to False.
490 |       progress (optional, Progress): Progress tracker for the operation
538 |       final_out_message = '\n'.join(out_message)
539 |   else:
540 |       final_out_message = out_message
541 | + return final_out_message, converted_file_paths, image_file_paths, number_of_pages, number_of_pages, pymupdf_doc, all_annotations_object, review_file_csv, original_cropboxes, page_sizes, textract_output_found, all_img_details, all_line_level_ocr_results_df, local_ocr_output_found
543 |   progress(0.1, desc='Preparing file')
621 |   elif file_extension in ['.csv']:
622 |       if '_review_file' in file_path_without_ext:
623 |           review_file_csv = read_file(file_path)
624 |           all_annotations_object = convert_review_df_to_annotation_json(review_file_csv, image_file_paths, page_sizes)
625 |           json_from_csv = True
626 | +         #print("Converted CSV review file to image annotation object")
627 |       elif '_ocr_output' in file_path_without_ext:
628 |           all_line_level_ocr_results_df = read_file(file_path)
629 |           json_from_csv = False
641 |       # Assuming file_path is a NamedString or similar
642 |       all_annotations_object = json.loads(file_path) # Use loads for string content
644 | + # Save Textract file to folder
645 | + elif (file_extension in ['.json']) and '_textract' in file_path_without_ext: #(prepare_for_review != True):
646 |       print("Saving Textract output")
647 |       # Copy it to the output folder so it can be used later.
648 |       output_textract_json_file_name = file_path_without_ext
656 |       textract_output_found = True
657 |       continue
659 | + elif (file_extension in ['.json']) and '_ocr_results_with_words' in file_path_without_ext: #(prepare_for_review != True):
660 | +     print("Saving local OCR output")
661 | +     # Copy it to the output folder so it can be used later.
662 | +     output_ocr_results_with_words_json_file_name = file_path_without_ext
663 | +     if not file_path.endswith("_ocr_results_with_words.json"): output_ocr_results_with_words_json_file_name = file_path_without_ext + "_ocr_results_with_words.json"
664 | +     else: output_ocr_results_with_words_json_file_name = file_path_without_ext + ".json"
666 | +     out_ocr_results_with_words_path = os.path.join(output_folder, output_ocr_results_with_words_json_file_name)
668 | +     # Use shutil to copy the file directly
669 | +     shutil.copy2(file_path, out_ocr_results_with_words_path) # Preserves metadata
670 | +     local_ocr_output_found = True
671 | +     continue
673 |   # NEW IF STATEMENT
674 |   # If you have an annotations object from the above code
675 |   if all_annotations_object:
790 |   number_of_pages = len(page_sizes) #len(image_file_paths)
792 | + return combined_out_message, converted_file_paths, image_file_paths, number_of_pages, number_of_pages, pymupdf_doc, all_annotations_object, review_file_csv, original_cropboxes, page_sizes, textract_output_found, all_img_details, all_line_level_ocr_results_df, local_ocr_output_found
794 | + def load_and_convert_ocr_results_with_words_json(ocr_results_with_words_json_file_path:str, log_files_output_paths:str, page_sizes_df:pd.DataFrame):
795 | +     """
796 | +     Loads OCR-results-with-words JSON from a file, detects if conversion is needed, and converts if necessary.
797 | +     """
799 | +     if not os.path.exists(ocr_results_with_words_json_file_path):
800 | +         print("No existing OCR results file found.")
801 | +         return [], True, log_files_output_paths # Return empty list and flag indicating missing file
803 | +     no_ocr_results_with_words_file = False
804 | +     print("Found existing OCR results json results file.")
806 | +     # Track log files
807 | +     if ocr_results_with_words_json_file_path not in log_files_output_paths:
808 | +         log_files_output_paths.append(ocr_results_with_words_json_file_path)
810 | +     try:
811 | +         with open(ocr_results_with_words_json_file_path, 'r', encoding='utf-8') as json_file:
812 | +             ocr_results_with_words_data = json.load(json_file)
813 | +     except json.JSONDecodeError:
814 | +         print("Error: Failed to parse OCR results JSON file. Returning empty data.")
815 | +         return [], True, log_files_output_paths # Indicate failure
817 | +     # Check if conversion is needed (both keys must be present on the first record)
818 | +     if "page" in ocr_results_with_words_data[0] and "results" in ocr_results_with_words_data[0]:
819 | +         print("JSON already in the correct format for app. No changes needed.")
820 | +         return ocr_results_with_words_data, False, log_files_output_paths # No conversion required
822 | +     else:
823 | +         print("Invalid OCR result JSON format: 'page' or 'results' key missing.")
824 | +         #print("OCR results with words data:", ocr_results_with_words_data)
825 | +         return [], True, log_files_output_paths # Return empty data if JSON is not recognized
def convert_text_pdf_to_img_pdf(in_file_path:str, out_text_file_path:List[str], image_dpi:float=image_dpi, output_folder:str=OUTPUT_FOLDER, input_folder:str=INPUT_FOLDER):
|
828 |
file_path_without_ext = get_file_name_without_type(in_file_path)
|
|
|
899 |
|
900 |
return result
|
901 |
|
902 |
+
def divide_coordinates_by_page_sizes(
|
903 |
+
review_file_df: pd.DataFrame,
|
904 |
+
page_sizes_df: pd.DataFrame,
|
905 |
+
xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax"
|
906 |
+
) -> pd.DataFrame:
|
907 |
+
"""
|
908 |
+
Optimized function to convert absolute image coordinates (>1) to relative coordinates (<=1).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
909 |
|
910 |
+
Identifies rows with absolute coordinates, merges page size information,
|
911 |
+
divides coordinates by dimensions, and combines with already-relative rows.
|
912 |
|
913 |
+
Args:
|
914 |
+
review_file_df: Input DataFrame with potentially mixed coordinate systems.
|
915 |
+
page_sizes_df: DataFrame with page dimensions ('page', 'image_width',
|
916 |
+
'image_height', 'mediabox_width', 'mediabox_height').
|
917 |
+
xmin, xmax, ymin, ymax: Names of the coordinate columns.
|
918 |
|
919 |
+
Returns:
|
920 |
+
DataFrame with coordinates converted to relative system, sorted.
|
921 |
+
"""
|
922 |
+
if review_file_df.empty or xmin not in review_file_df.columns:
|
923 |
+
return review_file_df # Return early if empty or key column missing
|
924 |
|
925 |
+
# --- Initial Type Conversion ---
|
926 |
+
coord_cols = [xmin, xmax, ymin, ymax]
|
927 |
+
cols_to_convert = coord_cols + ["page"]
|
928 |
+
temp_df = review_file_df.copy() # Work on a copy initially
|
929 |
|
930 |
+
for col in cols_to_convert:
|
931 |
+
if col in temp_df.columns:
|
932 |
+
temp_df[col] = pd.to_numeric(temp_df[col], errors="coerce")
|
933 |
+
else:
|
934 |
+
# If essential 'page' or coord column missing, cannot proceed meaningfully
|
935 |
+
if col == 'page' or col in coord_cols:
|
936 |
+
print(f"Warning: Required column '{col}' not found in review_file_df. Returning original DataFrame.")
|
937 |
+
return review_file_df
|
938 |
+
|
939 |
+
# --- Identify Absolute Coordinates ---
|
940 |
+
# Create mask for rows where *all* coordinates are potentially absolute (> 1)
|
941 |
+
# Handle potential NaNs introduced by to_numeric - treat NaN as not absolute.
|
942 |
+
is_absolute_mask = (
|
943 |
+
(temp_df[xmin] > 1) & (temp_df[xmin].notna()) &
|
944 |
+
(temp_df[xmax] > 1) & (temp_df[xmax].notna()) &
|
945 |
+
(temp_df[ymin] > 1) & (temp_df[ymin].notna()) &
|
946 |
+
(temp_df[ymax] > 1) & (temp_df[ymax].notna())
|
947 |
+
)
|
948 |
|
949 |
+
# --- Separate DataFrames ---
|
950 |
+
df_rel = temp_df[~is_absolute_mask] # Rows already relative or with NaN/mixed coords
|
951 |
+
df_abs = temp_df[is_absolute_mask].copy() # Absolute rows - COPY here to allow modifications
|
952 |
+
|
953 |
+
# --- Process Absolute Coordinates ---
|
954 |
+
if not df_abs.empty:
|
955 |
+
# Merge page sizes if necessary
|
956 |
+
if "image_width" not in df_abs.columns and not page_sizes_df.empty:
|
957 |
+
ps_df_copy = page_sizes_df.copy() # Work on a copy of page sizes
|
958 |
+
|
959 |
+
# Ensure page is numeric for merge key matching
|
960 |
+
ps_df_copy['page'] = pd.to_numeric(ps_df_copy['page'], errors='coerce')
|
961 |
+
|
962 |
+
# Columns to merge from page_sizes
|
963 |
+
merge_cols = ['page', 'image_width', 'image_height', 'mediabox_width', 'mediabox_height']
|
964 |
+
available_merge_cols = [col for col in merge_cols if col in ps_df_copy.columns]
|
965 |
+
|
966 |
+
# Prepare dimension columns in the copy
|
967 |
+
for col in ['image_width', 'image_height', 'mediabox_width', 'mediabox_height']:
|
968 |
+
if col in ps_df_copy.columns:
|
969 |
+
# Replace "<NA>" string if present
|
970 |
+
if ps_df_copy[col].dtype == 'object':
|
971 |
+
ps_df_copy[col] = ps_df_copy[col].replace("<NA>", pd.NA)
|
972 |
+
# Convert to numeric
|
973 |
+
ps_df_copy[col] = pd.to_numeric(ps_df_copy[col], errors='coerce')
|
974 |
+
|
975 |
+
# Perform the merge
|
976 |
+
if 'page' in available_merge_cols: # Check if page exists for merging
|
977 |
+
df_abs = df_abs.merge(
|
978 |
+
ps_df_copy[available_merge_cols],
|
979 |
+
on="page",
|
980 |
+
how="left"
|
981 |
+
)
|
982 |
+
else:
|
983 |
+
print("Warning: 'page' column not found in page_sizes_df. Cannot merge dimensions.")
|
984 |
+
|
985 |
+
|
986 |
+
# Fallback to mediabox dimensions if image dimensions are missing
|
987 |
+
if "image_width" in df_abs.columns and "mediabox_width" in df_abs.columns:
|
988 |
+
# Check if image_width mostly missing - use .isna().all() or check percentage
|
989 |
+
if df_abs["image_width"].isna().all():
|
990 |
+
print("Falling back to mediabox dimensions as image_width is entirely missing.")
|
991 |
+
df_abs["image_width"] = df_abs["image_width"].fillna(df_abs["mediabox_width"])
|
992 |
+
df_abs["image_height"] = df_abs["image_height"].fillna(df_abs["mediabox_height"])
|
993 |
+
else:
|
994 |
+
# Optional: Fill only missing image dims if some exist?
|
995 |
+
# df_abs["image_width"].fillna(df_abs["mediabox_width"], inplace=True)
|
996 |
+
# df_abs["image_height"].fillna(df_abs["mediabox_height"], inplace=True)
|
997 |
+
pass # Current logic only falls back if ALL image_width are NaN
|
998 |
+
|
999 |
+
# Ensure divisor columns are numeric before division
|
1000 |
+
divisors_numeric = True
|
1001 |
+
for col in ["image_width", "image_height"]:
|
1002 |
+
if col in df_abs.columns:
|
1003 |
+
df_abs[col] = pd.to_numeric(df_abs[col], errors='coerce')
|
1004 |
+
else:
|
1005 |
+
print(f"Warning: Dimension column '{col}' missing. Cannot perform division.")
|
1006 |
+
divisors_numeric = False
|
1007 |
+
|
1008 |
+
|
1009 |
+
# Perform division if dimensions are available and numeric
|
1010 |
+
if divisors_numeric and "image_width" in df_abs.columns and "image_height" in df_abs.columns:
|
1011 |
+
# Use np.errstate to suppress warnings about division by zero or NaN if desired
|
1012 |
+
with np.errstate(divide='ignore', invalid='ignore'):
|
1013 |
+
df_abs[xmin] = df_abs[xmin] / df_abs["image_width"]
|
1014 |
+
df_abs[xmax] = df_abs[xmax] / df_abs["image_width"]
|
1015 |
+
df_abs[ymin] = df_abs[ymin] / df_abs["image_height"]
|
1016 |
+
df_abs[ymax] = df_abs[ymax] / df_abs["image_height"]
|
1017 |
+
# Replace potential infinities with NaN (optional, depending on desired outcome)
|
1018 |
+
df_abs.replace([np.inf, -np.inf], np.nan, inplace=True)
|
1019 |
else:
|
1020 |
+
print("Skipping coordinate division due to missing or non-numeric dimension columns.")
|
1021 |
|
|
|
|
|
|
|
|
|
1022 |
|
1023 |
+
# --- Combine Relative and Processed Absolute DataFrames ---
|
1024 |
+
dfs_to_concat = [df for df in [df_rel, df_abs] if not df.empty]
|
1025 |
|
1026 |
+
if dfs_to_concat:
|
1027 |
+
final_df = pd.concat(dfs_to_concat, ignore_index=True)
|
1028 |
+
else:
|
1029 |
+
# If both splits were empty, return an empty DF with original columns
|
1030 |
+
print("Warning: Both relative and absolute splits resulted in empty DataFrames.")
|
1031 |
+
final_df = pd.DataFrame(columns=review_file_df.columns)
|
1032 |
|
|
|
1033 |
|
1034 |
+
# --- Final Sort ---
|
1035 |
+
required_sort_columns = {"page", xmin, ymin}
|
1036 |
+
if not final_df.empty and required_sort_columns.issubset(final_df.columns):
|
1037 |
+
# Ensure sort columns are numeric before sorting
|
1038 |
+
final_df['page'] = pd.to_numeric(final_df['page'], errors='coerce')
|
1039 |
+
final_df[ymin] = pd.to_numeric(final_df[ymin], errors='coerce')
|
1040 |
+
final_df[xmin] = pd.to_numeric(final_df[xmin], errors='coerce')
|
1041 |
+
# Sort by page, ymin, xmin (note order compared to multiply function)
|
1042 |
+
final_df.sort_values(["page", ymin, xmin], inplace=True, na_position='last')
|
1043 |
|
|
|
1044 |
|
1045 |
+
# --- Clean Up Columns ---
|
1046 |
+
# Correctly drop columns and reassign the result
|
1047 |
+
cols_to_drop = ["image_width", "image_height", "mediabox_width", "mediabox_height"]
|
1048 |
+
final_df = final_df.drop(columns=cols_to_drop, errors="ignore")
|
1049 |
|
1050 |
+
return final_df
|
|
|
|
|
|
|
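As an illustration of the absolute-to-relative conversion that `divide_coordinates_by_page_sizes` performs, here is a stripped-down sketch with made-up page dimensions (it ignores the mediabox fallback and column clean-up that the full function handles):

```python
import pandas as pd

boxes = pd.DataFrame({
    "page": [1, 1],
    "xmin": [100.0, 0.2], "xmax": [200.0, 0.4],
    "ymin": [50.0, 0.1],  "ymax": [150.0, 0.3],
})
page_sizes = pd.DataFrame({"page": [1], "image_width": [1000.0], "image_height": [500.0]})

# Rows with all coordinates > 1 are treated as absolute pixel values
is_absolute = (boxes[["xmin", "xmax", "ymin", "ymax"]] > 1).all(axis=1)
abs_rows = boxes[is_absolute].merge(page_sizes, on="page", how="left")
abs_rows["xmin"] /= abs_rows["image_width"]
abs_rows["xmax"] /= abs_rows["image_width"]
abs_rows["ymin"] /= abs_rows["image_height"]
abs_rows["ymax"] /= abs_rows["image_height"]

print(abs_rows[["xmin", "xmax", "ymin", "ymax"]].iloc[0].tolist())  # [0.1, 0.2, 0.1, 0.3]
```

The second row is already in relative units (all values <= 1), so the real function leaves it untouched and concatenates it back afterwards.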
1052 | + def multiply_coordinates_by_page_sizes(
1053 | +     review_file_df: pd.DataFrame,
1054 | +     page_sizes_df: pd.DataFrame,
1055 | +     xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax"
1056 | + ):
1057 | +     """
1058 | +     Optimized function to convert relative coordinates to absolute based on page sizes.
1060 | +     Separates relative (<=1) and absolute (>1) coordinates, merges page sizes
1061 | +     for relative coordinates, calculates absolute pixel values, and recombines.
1062 | +     """
1063 | +     if review_file_df.empty or xmin not in review_file_df.columns:
1064 | +         return review_file_df # Return early if empty or key column missing
1066 | +     coord_cols = [xmin, xmax, ymin, ymax]
1067 | +     # Initial type conversion for coordinates and page
1068 | +     for col in coord_cols + ["page"]:
1069 | +         if col in review_file_df.columns:
1070 | +             # Use astype for potentially faster conversion if confident,
1071 | +             # but to_numeric is safer for mixed types/errors
1072 | +             review_file_df[col] = pd.to_numeric(review_file_df[col], errors="coerce")
1074 | +     # --- Identify relative coordinates ---
1075 | +     # Create mask for rows where *all* coordinates are potentially relative (<= 1)
1076 | +     # Handle potential NaNs introduced by to_numeric - treat NaN as not relative here.
1077 | +     is_relative_mask = (
1078 | +         (review_file_df[xmin].le(1) & review_file_df[xmin].notna()) &
1079 | +         (review_file_df[xmax].le(1) & review_file_df[xmax].notna()) &
1080 | +         (review_file_df[ymin].le(1) & review_file_df[ymin].notna()) &
1081 | +         (review_file_df[ymax].le(1) & review_file_df[ymax].notna())
1082 | +     )
1084 | +     # Separate DataFrames (minimal copies)
1085 | +     df_abs = review_file_df[~is_relative_mask].copy() # Keep absolute rows separately
1086 | +     df_rel = review_file_df[is_relative_mask].copy() # Work only with relative rows
1088 | +     if df_rel.empty:
1089 | +         # If no relative coordinates, just sort and return absolute ones (if any)
1090 | +         if not df_abs.empty and {"page", xmin, ymin}.issubset(df_abs.columns):
1091 | +             df_abs.sort_values(["page", xmin, ymin], inplace=True, na_position='last')
1092 | +         return df_abs
1094 | +     # --- Process relative coordinates ---
1095 | +     if "image_width" not in df_rel.columns and not page_sizes_df.empty:
1096 | +         # Prepare page_sizes_df for merge
1097 | +         page_sizes_df = page_sizes_df.copy() # Avoid modifying original page_sizes_df
1098 | +         page_sizes_df['page'] = pd.to_numeric(page_sizes_df['page'], errors='coerce')
1099 | +         # Ensure proper NA handling for image dimensions
1100 | +         page_sizes_df[['image_width', 'image_height']] = page_sizes_df[['image_width','image_height']].replace("<NA>", pd.NA)
1101 | +         page_sizes_df['image_width'] = pd.to_numeric(page_sizes_df['image_width'], errors='coerce')
1102 | +         page_sizes_df['image_height'] = pd.to_numeric(page_sizes_df['image_height'], errors='coerce')
1104 | +         # Merge page sizes
1105 | +         df_rel = df_rel.merge(
1106 | +             page_sizes_df[['page', 'image_width', 'image_height']],
1107 | +             on="page",
1108 | +             how="left"
1109 | +         )
1111 | +     # Multiply coordinates where image dimensions are available
1112 | +     if "image_width" in df_rel.columns:
1113 | +         # Create mask for rows in df_rel that have valid image dimensions
1114 | +         has_size_mask = df_rel["image_width"].notna() & df_rel["image_height"].notna()
1116 | +         # Apply multiplication using .loc and the mask (vectorized and efficient)
1117 | +         # Ensure columns are numeric before multiplication (might be redundant if types are good)
1118 | +         # df_rel.loc[has_size_mask, coord_cols + ['image_width', 'image_height']] = df_rel.loc[has_size_mask, coord_cols + ['image_width', 'image_height']].apply(pd.to_numeric, errors='coerce')
1120 | +         df_rel.loc[has_size_mask, xmin] *= df_rel.loc[has_size_mask, "image_width"]
1121 | +         df_rel.loc[has_size_mask, xmax] *= df_rel.loc[has_size_mask, "image_width"]
1122 | +         df_rel.loc[has_size_mask, ymin] *= df_rel.loc[has_size_mask, "image_height"]
1123 | +         df_rel.loc[has_size_mask, ymax] *= df_rel.loc[has_size_mask, "image_height"]
1126 | +     # --- Combine absolute and processed relative DataFrames ---
1127 | +     # Use list comprehension to handle potentially empty DataFrames
1128 | +     dfs_to_concat = [df for df in [df_abs, df_rel] if not df.empty]
1130 | +     if not dfs_to_concat:
1131 | +         return pd.DataFrame() # Return empty if both are empty
1133 | +     final_df = pd.concat(dfs_to_concat, ignore_index=True) # ignore_index is good practice after filtering/concat
1135 | +     # --- Final Sort ---
1136 | +     required_sort_columns = {"page", xmin, ymin}
1137 | +     if not final_df.empty and required_sort_columns.issubset(final_df.columns):
1138 | +         # Handle potential NaNs in sort columns gracefully
1139 | +         final_df.sort_values(["page", xmin, ymin], inplace=True, na_position='last')
1141 | +     return final_df
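The inverse transform in `multiply_coordinates_by_page_sizes` should round-trip with `divide_coordinates_by_page_sizes`; a minimal sketch of the masked multiplication with assumed page dimensions:

```python
import pandas as pd

rel = pd.DataFrame({"page": [1], "xmin": [0.1], "xmax": [0.2], "ymin": [0.1], "ymax": [0.3]})
sizes = pd.DataFrame({"page": [1], "image_width": [1000.0], "image_height": [500.0]})

df = rel.merge(sizes, on="page", how="left")
# Only multiply rows that actually have page dimensions after the merge
has_size = df["image_width"].notna() & df["image_height"].notna()
df.loc[has_size, "xmin"] *= df.loc[has_size, "image_width"]
df.loc[has_size, "xmax"] *= df.loc[has_size, "image_width"]
df.loc[has_size, "ymin"] *= df.loc[has_size, "image_height"]
df.loc[has_size, "ymax"] *= df.loc[has_size, "image_height"]

print(df[["xmin", "xmax", "ymin", "ymax"]].iloc[0].tolist())  # [100.0, 200.0, 50.0, 150.0]
```

Rows whose dimensions are missing after the left merge keep their relative values, mirroring the `has_size_mask` guard in the function above.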
1143 |   def do_proximity_match_by_page_for_text(df1:pd.DataFrame, df2:pd.DataFrame):
1144 |       '''
1193 |       return merged_df
1195 |   def do_proximity_match_all_pages_for_text(df1:pd.DataFrame, df2:pd.DataFrame, threshold:float=0.03):
1196 |       '''
1197 |       Match text from one dataframe to another based on proximity matching of coordinates across all pages.
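`do_proximity_match_all_pages_for_text` matches text between dataframes by coordinate proximity with a default threshold of 0.03. Given that `cKDTree` is imported at the top of this module, the nearest-neighbour lookup plausibly resembles the hypothetical helper below (an illustrative assumption, not the repository's actual code):

```python
import numpy as np
from scipy.spatial import cKDTree


def match_text_by_proximity(coords1, coords2, texts2, threshold=0.03):
    """For each (xmin, ymin) point in coords1, pull text from the nearest
    point in coords2 if it lies within `threshold` (relative units)."""
    tree = cKDTree(np.asarray(coords2))
    distances, indices = tree.query(np.asarray(coords1), k=1)
    return [texts2[i] if d <= threshold else None
            for d, i in zip(distances, indices)]


matched = match_text_by_proximity(
    coords1=[(0.10, 0.20), (0.90, 0.90)],
    coords2=[(0.11, 0.21), (0.50, 0.50)],
    texts2=["John Smith", "Acme Ltd"],
)
print(matched)  # ['John Smith', None]
```

A k-d tree keeps the per-page matching near O(n log n) instead of the O(n²) cost of comparing every box pair, which fits the "improved review efficiency" theme of this PR.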
|
1315 |   # prevents this from being necessary.
1317 |   # 7. Ensure essential columns exist and set column order
1318 | + essential_box_cols = ["xmin", "xmax", "ymin", "ymax", "text", "id", "label"]
1319 |   for col in essential_box_cols:
1320 |       if col not in final_df.columns:
1321 |           final_df[col] = pd.NA # Add column with NA if it wasn't present in any box
1323 | + base_cols = ["image"]
1324 |   extra_box_cols = [col for col in final_df.columns if col not in base_cols and col not in essential_box_cols]
1325 |   final_col_order = base_cols + essential_box_cols + sorted(extra_box_cols)
1329 |   # but it's good practice if columns could be missing for other reasons.
1330 |   final_df = final_df.reindex(columns=final_col_order, fill_value=pd.NA)
1332 | + final_df = final_df.dropna(subset=["xmin", "xmax", "ymin", "ymax", "text", "id", "label"])
1334 |   return final_df
1336 |   def create_annotation_dicts_from_annotation_df(
1360 |   available_cols = [col for col in box_cols if col in all_image_annotations_df.columns]
1362 |   if 'text' in all_image_annotations_df.columns:
1363 | +     all_image_annotations_df['text'] = all_image_annotations_df['text'].fillna('')
1364 | +     #all_image_annotations_df.loc[all_image_annotations_df['text'].isnull(), 'text'] = ''
1366 |   if not available_cols:
1367 |       print(f"Warning: None of the expected box columns ({box_cols}) found in DataFrame.")
1403 |       return result
1405 |
+
def convert_annotation_json_to_review_df(
|
1406 |
+
all_annotations: List[dict],
|
1407 |
+
redaction_decision_output: pd.DataFrame = pd.DataFrame(),
|
1408 |
+
page_sizes: List[dict] = [],
|
1409 |
+
do_proximity_match: bool = True
|
1410 |
+
) -> pd.DataFrame:
|
1411 |
'''
|
1412 |
Convert the annotation json data to a dataframe format.
|
1413 |
Add on any text from the initial review_file dataframe by joining based on 'id' if available
|
1414 |
in both sources, otherwise falling back to joining on pages/co-ordinates (if option selected).
|
1415 |
+
|
1416 |
+
Refactored for improved efficiency, prioritizing ID-based join and conditionally applying
|
1417 |
+
coordinate division and proximity matching.
|
1418 |
'''
|
1419 |
|
1420 |
# 1. Convert annotations to DataFrame
|
|
|
|
|
|
|
1421 |
review_file_df = convert_annotation_data_to_dataframe(all_annotations)
|
1422 |
|
1423 |
+
# Only keep rows in review_df where there are coordinates (assuming xmin is representative)
|
1424 |
+
# Use .notna() for robustness with potential None or NaN values
|
1425 |
+
review_file_df.dropna(subset=['xmin', 'ymin', 'xmax', 'ymax'], how='any', inplace=True)
|
1426 |
|
1427 |
# Exit early if the initial conversion results in an empty DataFrame
|
1428 |
if review_file_df.empty:
|
1429 |
# Define standard columns for an empty return DataFrame
|
1430 |
+
# Ensure 'id' is included if it was potentially expected based on input structure
|
1431 |
+
# We don't know the columns from convert_annotation_data_to_dataframe without seeing it,
|
1432 |
+
# but let's assume a standard set and add 'id' if it appeared.
|
1433 |
+
standard_cols = ["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax", "text"]
|
1434 |
+
if 'id' in review_file_df.columns:
|
1435 |
+
standard_cols.append('id')
|
1436 |
+
return pd.DataFrame(columns=standard_cols)
|
1437 |
+
|
1438 |
+
# Ensure 'id' column exists for logic flow, even if empty
|
1439 |
+
if 'id' not in review_file_df.columns:
|
1440 |
+
review_file_df['id'] = ''
|
1441 |
+
# Do the same for redaction_decision_output if it's not empty
|
1442 |
+
if not redaction_decision_output.empty and 'id' not in redaction_decision_output.columns:
|
1443 |
+
redaction_decision_output['id'] = ''
|
1444 |
|
1445 |
|
1446 |
+
# 2. Process page sizes if provided - needed potentially for coordinate division later
|
1447 |
+
# Process this once upfront if the data is available
|
1448 |
+
page_sizes_df = pd.DataFrame() # Initialize as empty
|
1449 |
+
if page_sizes:
|
1450 |
+
page_sizes_df = pd.DataFrame(page_sizes)
|
1451 |
if not page_sizes_df.empty:
|
1452 |
+
# Safely convert page column to numeric and then int
|
1453 |
+
page_sizes_df["page"] = pd.to_numeric(page_sizes_df["page"], errors="coerce")
|
1454 |
+
page_sizes_df.dropna(subset=["page"], inplace=True)
|
1455 |
+
if not page_sizes_df.empty: # Check again after dropping NaNs
|
1456 |
+
page_sizes_df["page"] = page_sizes_df["page"].astype(int)
|
1457 |
+
else:
|
1458 |
+
print("Warning: Page sizes DataFrame became empty after processing, coordinate division will be skipped.")
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1459 |
|
1460 |
|
1461 |
# 3. Join additional data from redaction_decision_output if provided
|
1462 |
+
text_added_successfully = False # Flag to track if text was added by any method
|
1463 |
+
|
1464 |
if not redaction_decision_output.empty:
|
1465 |
+
# --- Attempt to join data based on 'id' column first ---
|
1466 |
+
|
1467 |
+
# Check if 'id' columns are present and have non-null values in *both* dataframes
|
1468 |
+
id_col_exists_in_review = 'id' in review_file_df.columns and not review_file_df['id'].isnull().all() and not (review_file_df['id'] == '').all()
|
1469 |
+
id_col_exists_in_redaction = 'id' in redaction_decision_output.columns and not redaction_decision_output['id'].isnull().all() and not (redaction_decision_output['id'] == '').all()
|
1470 |
+
|
1471 |
|
1472 |
if id_col_exists_in_review and id_col_exists_in_redaction:
|
1473 |
#print("Attempting to join data based on 'id' column.")
|
1474 |
try:
|
1475 |
+
# Ensure 'id' columns are of string type for robust merging
|
1476 |
review_file_df['id'] = review_file_df['id'].astype(str)
|
1477 |
+
# Make a copy if needed, but try to avoid if redaction_decision_output isn't modified later
|
1478 |
+
# Let's use a copy for safety as in the original code
|
1479 |
redaction_copy = redaction_decision_output.copy()
|
1480 |
redaction_copy['id'] = redaction_copy['id'].astype(str)
|
1481 |
|
1482 |
+
# Select columns to merge from redaction output. Prioritize 'text'.
|
|
|
|
|
1483 |
cols_to_merge = ['id']
|
1484 |
if 'text' in redaction_copy.columns:
|
1485 |
cols_to_merge.append('text')
|
|
|
1487 |
print("Warning: 'text' column not found in redaction_decision_output. Cannot merge text using 'id'.")
|
1488 |
|
1489 |
# Perform a left merge to keep all annotations and add matching text
|
1490 |
+
# Use a suffix for the text column from the right DataFrame
|
1491 |
+
original_text_col_exists = 'text' in review_file_df.columns
|
1492 |
+
merge_suffix = '_redaction' if original_text_col_exists else ''
|
1493 |
+
|
1494 |
merged_df = pd.merge(
|
1495 |
review_file_df,
|
1496 |
redaction_copy[cols_to_merge],
|
1497 |
on='id',
|
1498 |
how='left',
|
1499 |
+
suffixes=('', merge_suffix)
|
1500 |
)
|
1501 |
|
1502 |
+
# Update the 'text' column if a new one was brought in
|
1503 |
+
if 'text' + merge_suffix in merged_df.columns:
|
1504 |
+
redaction_text_col = 'text' + merge_suffix
|
1505 |
+
if original_text_col_exists:
|
1506 |
+
# Combine: Use text from redaction where available, otherwise keep original
|
1507 |
+
merged_df['text'] = merged_df[redaction_text_col].combine_first(merged_df['text'])
|
1508 |
+
# Drop the temporary column
|
1509 |
+
merged_df = merged_df.drop(columns=[redaction_text_col])
|
1510 |
+
else:
|
1511 |
+
# Redaction output had text, but review_file_df didn't. Rename the new column.
|
1512 |
+
merged_df = merged_df.rename(columns={redaction_text_col: 'text'})
|
1513 |
|
1514 |
+
text_added_successfully = True # Indicate text was potentially added
|
|
|
1515 |
|
1516 |
+
review_file_df = merged_df # Update the main DataFrame
|
|
|
|
|
|
|
|
|
|
|
1517 |
|
1518 |
+
#print("Successfully attempted to join data using 'id'.") # Note: Text might not have been in redaction data
|
|
|
|
|
1519 |
|
1520 |
except Exception as e:
|
1521 |
+
print(f"Error during 'id'-based merge: {e}. Checking for proximity match fallback.")
|
1522 |
+
# Fall through to proximity match logic below
|
1523 |
+
|
1524 |
+
# --- Fallback to proximity match if ID join wasn't possible/successful and enabled ---
|
1525 |
+
# Note: If id_col_exists_in_review or id_col_exists_in_redaction was False,
|
1526 |
+
# the block above was skipped, and we naturally fall here.
|
1527 |
+
# If an error occurred in the try block, joined_by_id would implicitly be False
|
1528 |
+
# because text_added_successfully wasn't set to True.
|
1529 |
+
|
1530 |
+
# Only attempt proximity match if text wasn't added by ID join and proximity is requested
|
1531 |
+
1531 +     if not text_added_successfully and do_proximity_match:
1532 +         print("Attempting proximity match to add text data.")
1533 +
1534 +         # Ensure 'page' columns are numeric before coordinate division and proximity match
1535 +         # (Assuming divide_coordinates_by_page_sizes and do_proximity_match_all_pages_for_text need this)
1536 +         if 'page' in review_file_df.columns:
1537 +             review_file_df['page'] = pd.to_numeric(review_file_df['page'], errors='coerce').fillna(-1).astype(int)  # Use -1 for NaN pages
1538 +             review_file_df = review_file_df[review_file_df['page'] != -1]  # Drop rows where page conversion failed
1539 +         if not redaction_decision_output.empty and 'page' in redaction_decision_output.columns:
1540 +             redaction_decision_output['page'] = pd.to_numeric(redaction_decision_output['page'], errors='coerce').fillna(-1).astype(int)
1541 +             redaction_decision_output = redaction_decision_output[redaction_decision_output['page'] != -1]
1542 +
1543 +         # Perform coordinate division IF page_sizes were processed and DataFrame is not empty
1544 +         if not page_sizes_df.empty:
1545 +             # Apply coordinate division *before* proximity match
1546 +             review_file_df = divide_coordinates_by_page_sizes(review_file_df, page_sizes_df)
1547 +             if not redaction_decision_output.empty:
1548 +                 redaction_decision_output = divide_coordinates_by_page_sizes(redaction_decision_output, page_sizes_df)
1549 +
1550 +         # Now perform the proximity match
1551 +         # Note: Potential DataFrame copies happen inside do_proximity_match based on its implementation
1552 +         if not redaction_decision_output.empty:
1553 +             try:
1554 +                 review_file_df = do_proximity_match_all_pages_for_text(
1555 +                     df1=review_file_df,  # Pass directly, avoid caller copy if possible by modifying function signature
1556 +                     df2=redaction_decision_output  # Pass directly
1557 +                 )
1558 +                 # Assuming do_proximity_match_all_pages_for_text adds/updates the 'text' column
1559 +                 if 'text' in review_file_df.columns:
1560 +                     text_added_successfully = True
1561 +                     print("Proximity match completed.")
1562 +             except Exception as e:
1563 +                 print(f"Error during proximity match: {e}. Text data may not be added.")
1564 +
1565 +     elif not text_added_successfully and not do_proximity_match:
1566 +         print("Skipping joining text data (ID join not possible/failed, proximity match disabled).")
1567 +
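The coordinate division in this hunk normalises absolute pixel coordinates to page-relative (0–1) values before the proximity match. A minimal sketch of what `divide_coordinates_by_page_sizes` is assumed to do (the real helper lives in `tools/file_conversion.py`; the toy function and column values below are illustrative only):

```python
import pandas as pd

def divide_coordinates_by_page_sizes_sketch(df: pd.DataFrame, page_sizes_df: pd.DataFrame) -> pd.DataFrame:
    # Join each annotation row to its page's pixel dimensions, then divide
    merged = df.merge(page_sizes_df[["page", "image_width", "image_height"]], on="page", how="left")
    for col in ("xmin", "xmax"):
        merged[col] = merged[col] / merged["image_width"]
    for col in ("ymin", "ymax"):
        merged[col] = merged[col] / merged["image_height"]
    return merged.drop(columns=["image_width", "image_height"])

boxes = pd.DataFrame({"page": [1], "xmin": [100.0], "xmax": [300.0], "ymin": [50.0], "ymax": [150.0]})
sizes = pd.DataFrame({"page": [1], "image_width": [1000], "image_height": [500]})
rel = divide_coordinates_by_page_sizes_sketch(boxes, sizes)  # xmin becomes 0.1, ymax becomes 0.3
```

Normalised coordinates make boxes comparable across pages rendered at different DPIs, which is why the division happens before matching.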
1568 +     # 4. Ensure required columns exist and are ordered
1569 +     # Define base required columns. 'id' and 'text' are conditionally added.
1570 +     required_columns_base = ["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax"]
1571 +     final_columns = required_columns_base[:]  # Start with base columns
1572 +
1573 +     # Add 'id' and 'text' if they exist in the DataFrame at this point
1574       if 'id' in review_file_df.columns:
1575 +         final_columns.append('id')
1576 +     if 'text' in review_file_df.columns:
1577 +         final_columns.append('text')  # Add text column if it was created/merged
1578
1579 +     # Add any missing required columns with a default value (e.g., blank string)
1580 +     for col in final_columns:
1581           if col not in review_file_df.columns:
1582 +             # Use appropriate default based on expected type, '' for text/id, np.nan for coords?
1583 +             # Sticking to '' as in original for simplicity, but consider data types.
1584 +             review_file_df[col] = ''  # Or np.nan for numerical, but coords already checked by dropna
1585
1586       # Select and order the final set of columns
1587 +     # Ensure all selected columns actually exist after adding defaults
1588 +     review_file_df = review_file_df[[col for col in final_columns if col in review_file_df.columns]]
1589 +
|
1591 |
# 5. Final processing and sorting
|
1592 |
+
# Convert colours from list to tuple if necessary - apply is okay here unless lists are vast
|
1593 |
if 'color' in review_file_df.columns:
|
1594 |
+
# Check if the column actually contains lists before applying lambda
|
1595 |
+
if review_file_df['color'].apply(lambda x: isinstance(x, list)).any():
|
1596 |
+
review_file_df["color"] = review_file_df["color"].apply(lambda x: tuple(x) if isinstance(x, list) else x)
|
1597 |
|
1598 |
# Sort the results
|
|
|
1599 |
# Ensure sort columns exist before sorting
|
1600 |
+
sort_columns = ['page', 'ymin', 'xmin', 'label']
|
1601 |
valid_sort_columns = [col for col in sort_columns if col in review_file_df.columns]
|
1602 |
+
if valid_sort_columns and not review_file_df.empty: # Only sort non-empty df
|
1603 |
+
# Convert potential numeric sort columns to appropriate types if necessary
|
1604 |
+
# (e.g., 'page', 'ymin', 'xmin') to ensure correct sorting.
|
1605 |
+
# dropna(subset=[...], inplace=True) earlier should handle NaNs in coords.
|
1606 |
+
# page conversion already done before proximity match.
|
1607 |
+
try:
|
1608 |
+
review_file_df = review_file_df.sort_values(valid_sort_columns)
|
1609 |
+
except TypeError as e:
|
1610 |
+
print(f"Warning: Could not sort DataFrame due to type error in sort columns: {e}")
|
1611 |
+
# Proceed without sorting
|
1612 |
+
|
1613 |
+
review_file_df = review_file_df.dropna(subset=["xmin", "xmax", "ymin", "ymax", "text", "id", "label"])
|
1614 |
|
1615 |
return review_file_df
|
1616 |
|
|
|
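The column-normalisation and sorting steps in this hunk (steps 4 and 5) can be condensed with `DataFrame.reindex`, which adds missing columns and enforces ordering in one call. A sketch under that assumption (column names mirror the diff; the sample data is made up):

```python
import pandas as pd

final_columns = ["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax", "id", "text"]

df = pd.DataFrame({"page": [2, 1], "label": ["B", "A"], "ymin": [0.2, 0.5], "xmin": [0.1, 0.1]})

# reindex adds any missing columns (filled with NaN) and enforces the final ordering
df = df.reindex(columns=final_columns)

# Sort only by columns that actually exist, as the diff does with valid_sort_columns
valid_sort_columns = [c for c in ["page", "ymin", "xmin", "label"] if c in df.columns]
df = df.sort_values(valid_sort_columns)
```

This trades the explicit per-column loop for a single vectorised call; the loop form in the diff keeps finer control over per-column defaults (`''` vs `np.nan`).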
1695
1696   def fill_missing_ids(df: pd.DataFrame, column_name: str = 'id', length: int = 12) -> pd.DataFrame:
1697       """
1698 +     Optimized: Generates unique alphanumeric IDs for rows in a DataFrame column
1699 +     where the value is missing (NaN, None) or an empty/whitespace string.
1700
1701       Args:
1702           df (pd.DataFrame): The input Pandas DataFrame.
1703           column_name (str): The name of the column to check and fill (defaults to 'id').
1704               This column will be added if it doesn't exist.
1705           length (int): The desired length of the generated IDs (defaults to 12).
1706
1707       Returns:
1708           pd.DataFrame: The DataFrame with missing/empty IDs filled in the specified column.
1709 +             Note: The function modifies the DataFrame directly (in-place).
1710       """
1711
1712       # --- Input Validation ---
1718           raise ValueError("'length' must be a positive integer.")
1719
1720       # --- Ensure Column Exists ---
1721 +     original_dtype = None
1722       if column_name not in df.columns:
1723           print(f"Column '{column_name}' not found. Adding it to the DataFrame.")
1724 +         # Initialize with None (which Pandas often treats as NaN but allows object dtype)
1725 +         df[column_name] = None
1726 +         # Set original_dtype to object so it likely becomes string later
1727 +         original_dtype = object
1728 +     else:
1729 +         original_dtype = df[column_name].dtype
1730
1731       # --- Identify Rows Needing IDs ---
1732 +     # 1. Check for actual null values (NaN, None, NaT)
1733 +     is_null = df[column_name].isna()
1734 +
1735 +     # 2. Check for empty or whitespace-only strings AFTER converting potential values to string
1736 +     # Only apply string checks on rows that are *not* null to avoid errors/warnings
1737 +     # Fill NaN temporarily for string operations, then check length or equality
1738 +     is_empty_str = pd.Series(False, index=df.index)  # Default to False
1739 +     if not is_null.all():  # Only check strings if there are non-null values
1740 +         temp_str_col = df.loc[~is_null, column_name].astype(str).str.strip()
1741 +         is_empty_str.loc[~is_null] = (temp_str_col == '')
1742 +
1743 +     # Combine the conditions
1744 +     is_missing_or_empty = is_null | is_empty_str
1745
1746       rows_to_fill_index = df.index[is_missing_or_empty]
1747       num_needed = len(rows_to_fill_index)
1748
1749       if num_needed == 0:
1750 +         # Ensure final column type is consistent if nothing was done
1751 +         if pd.api.types.is_object_dtype(original_dtype) or pd.api.types.is_string_dtype(original_dtype):
1752 +             pass  # Likely already object or string
1753 +         else:
1754 +             # If original was numeric/etc., but might contain strings now? Unlikely here.
1755 +             pass  # Or convert to object: df[column_name] = df[column_name].astype(object)
1756 +         # print(f"No missing or empty values found requiring IDs in column '{column_name}'.")
1757           return df
1758
1759       print(f"Found {num_needed} rows requiring a unique ID in column '{column_name}'.")
1760
1761       # --- Get Existing IDs to Ensure Uniqueness ---
1762 +     # Consider only rows that are *not* missing/empty
1763 +     valid_rows = df.loc[~is_missing_or_empty, column_name]
1764 +     # Drop any remaining nulls (shouldn't be any based on mask, but belts and braces)
1765 +     valid_rows = valid_rows.dropna()
1766 +     # Convert to string *only* if not already string/object, then filter out empty strings again
1767 +     if not pd.api.types.is_object_dtype(valid_rows.dtype) and not pd.api.types.is_string_dtype(valid_rows.dtype):
1768 +         existing_ids = set(valid_rows.astype(str).str.strip())
1769 +     else:  # Already string or object, just strip and convert to set
1770 +         existing_ids = set(valid_rows.astype(str).str.strip())  # astype(str) handles mixed types in object column
1771 +
1772 +     # Remove empty string from existing IDs if it's there after stripping
1773 +     existing_ids.discard('')
1774
1775
1776       # --- Generate Unique IDs ---
1780
1781       max_possible_ids = len(character_set) ** length
1782       if num_needed > max_possible_ids:
1783 +         raise ValueError(f"Cannot generate {num_needed} unique IDs with length {length}. Maximum possible is {max_possible_ids}.")
1784 +
1785 +     # Pre-calculate safety break limit
1786 +     max_attempts_per_id = max(1000, num_needed * 10)  # Adjust multiplier as needed
1787
1788       #print(f"Generating {num_needed} unique IDs of length {length}...")
1789       for i in range(num_needed):
1790           attempts = 0
1791           while True:
1792               candidate_id = ''.join(random.choices(character_set, k=length))
1793 +             # Check against *all* known existing IDs and *newly* generated ones
1794               if candidate_id not in existing_ids and candidate_id not in generated_ids_set:
1795                   generated_ids_set.add(candidate_id)
1796                   new_ids_list.append(candidate_id)
1797                   break  # Found a unique ID
1798               attempts += 1
1799 +             if attempts > max_attempts_per_id:  # Safety break
1800 +                 raise RuntimeError(f"Failed to generate a unique ID after {attempts} attempts. Check length, character set, or density of existing IDs.")
1801
1802 +         # Optional progress update
1803 +         # if (i + 1) % 1000 == 0:
1804 +         #     print(f"Generated {i+1}/{num_needed} IDs...")
1805
1806
1807       # --- Assign New IDs ---
1808       # Use the previously identified index to assign the new IDs correctly
1809 +     # Assigning string IDs might change the column's dtype to 'object'
1810 +     if not pd.api.types.is_object_dtype(original_dtype) and not pd.api.types.is_string_dtype(original_dtype):
1811 +         warnings.warn(f"Column '{column_name}' dtype might change from '{original_dtype}' to 'object' due to string ID assignment.", UserWarning)
1812 +
1813       df.loc[rows_to_fill_index, column_name] = new_ids_list
1814 +     print(f"Successfully assigned {len(new_ids_list)} new unique IDs to column '{column_name}'.")
1815 +
1816 +     # Optional: Convert the entire column to string type at the end for consistency
1817 +     # df[column_name] = df[column_name].astype(str)
1818
1819       return df
1820
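The uniqueness loop in `fill_missing_ids` is rejection sampling: draw a random candidate, accept it only if it collides with neither pre-existing nor freshly generated IDs. That core idea can be exercised standalone (the character set and function name below are illustrative, not taken from the diff):

```python
import random
import string

def generate_unique_ids(existing_ids: set, num_needed: int, length: int = 12) -> list:
    character_set = string.ascii_letters + string.digits
    generated = set()
    out = []
    while len(out) < num_needed:
        candidate = ''.join(random.choices(character_set, k=length))
        # Reject candidates colliding with known or freshly generated IDs
        if candidate not in existing_ids and candidate not in generated:
            generated.add(candidate)
            out.append(candidate)
    return out

ids = generate_unique_ids({"abc123abc123"}, num_needed=5, length=12)
```

With 62 characters and length 12 the space holds 62^12 IDs, so collisions are vanishingly rare; the `max_attempts_per_id` safety break in the diff only matters when the existing-ID set is dense relative to that space.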
1821 + def convert_review_df_to_annotation_json(
1822 +     review_file_df: pd.DataFrame,
1823 +     image_paths: List[str],  # List of image file paths
1824 +     page_sizes: List[Dict],  # List of dicts like [{'page': 1, 'image_path': '...', 'image_width': W, 'image_height': H}, ...]
1825 +     xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax"  # Coordinate column names
1826 + ) -> List[Dict]:
1827 +     """
1828 +     Optimized function to convert review DataFrame to Gradio Annotation JSON format.
1829
1830 +     Ensures absolute coordinates, handles missing IDs, deduplicates based on key fields,
1831 +     selects final columns, and structures data per image/page based on page_sizes.
1832
1833 +     Args:
1834 +         review_file_df: Input DataFrame with annotation data.
1835 +         image_paths: List of image file paths (Note: currently unused if page_sizes provides paths).
1836 +         page_sizes: REQUIRED list of dictionaries, each containing 'page',
1837 +             'image_path', 'image_width', and 'image_height'. Defines
1838 +             output structure and dimensions for coordinate conversion.
1839 +         xmin, xmax, ymin, ymax: Names of the coordinate columns.
1840
1841 +     Returns:
1842 +         List of dictionaries suitable for Gradio Annotation output, one dict per image/page.
1843 +     """
1844 +     review_file_df = review_file_df.dropna(subset=["xmin", "xmax", "ymin", "ymax", "text", "id", "label"])
1845
1846 +     if not page_sizes:
1847 +         raise ValueError("page_sizes argument is required and cannot be empty.")
1848
1849 +     # --- Prepare Page Sizes DataFrame ---
1850 +     try:
1851 +         page_sizes_df = pd.DataFrame(page_sizes)
1852 +         required_ps_cols = {'page', 'image_path', 'image_width', 'image_height'}
1853 +         if not required_ps_cols.issubset(page_sizes_df.columns):
1854 +             missing = required_ps_cols - set(page_sizes_df.columns)
1855 +             raise ValueError(f"page_sizes is missing required keys: {missing}")
1856 +         # Convert page sizes columns to appropriate numeric types early
1857 +         page_sizes_df['page'] = pd.to_numeric(page_sizes_df['page'], errors='coerce')
1858 +         page_sizes_df['image_width'] = pd.to_numeric(page_sizes_df['image_width'], errors='coerce')
1859 +         page_sizes_df['image_height'] = pd.to_numeric(page_sizes_df['image_height'], errors='coerce')
1860 +         # Use nullable Int64 for page number consistency
1861 +         page_sizes_df['page'] = page_sizes_df['page'].astype('Int64')
1862
1863 +     except Exception as e:
1864 +         raise ValueError(f"Error processing page_sizes: {e}") from e
1865
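The page_sizes validation in this hunk (required keys, early numeric coercion, nullable `Int64` pages) can be checked in isolation. A sketch of the same checks with made-up input values:

```python
import pandas as pd

# String-typed input, as might arrive from JSON or a form submission
page_sizes = [{"page": "1", "image_path": "page_1.png", "image_width": "612", "image_height": "792"}]
page_sizes_df = pd.DataFrame(page_sizes)

# Fail fast if any required key is absent
required_ps_cols = {"page", "image_path", "image_width", "image_height"}
missing = required_ps_cols - set(page_sizes_df.columns)
if missing:
    raise ValueError(f"page_sizes is missing required keys: {missing}")

# Coerce to numeric early so later arithmetic and grouping are reliable
for col in ("page", "image_width", "image_height"):
    page_sizes_df[col] = pd.to_numeric(page_sizes_df[col], errors="coerce")
# Nullable Int64 keeps pages integral while tolerating failed conversions (<NA>)
page_sizes_df["page"] = page_sizes_df["page"].astype("Int64")
```

`errors="coerce"` turns unparseable values into NaN rather than raising, which is why the diff later filters on `pd.notna(page_num)` before grouping.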
1866
1867 +     # Handle empty input DataFrame gracefully
1868 +     if review_file_df.empty:
1869 +         print("Input review_file_df is empty. Proceeding to generate JSON structure with empty boxes.")
1870 +         # Ensure essential columns exist even if empty for later steps
1871 +         for col in [xmin, xmax, ymin, ymax, "page", "label", "color", "id", "text"]:
1872 +             if col not in review_file_df.columns:
1873 +                 review_file_df[col] = pd.NA
1874 +     else:
1875 +         # --- Coordinate Conversion (if needed) ---
1876 +         coord_cols_to_check = [c for c in [xmin, xmax, ymin, ymax] if c in review_file_df.columns]
1877 +         needs_multiplication = False
1878 +         if coord_cols_to_check:
1879 +             temp_df_numeric = review_file_df[coord_cols_to_check].apply(pd.to_numeric, errors='coerce')
1880 +             if temp_df_numeric.le(1).any().any():  # Check if any numeric coord <= 1 exists
1881 +                 needs_multiplication = True
1882 +
1883 +         if needs_multiplication:
1884 +             #print("Relative coordinates detected or suspected, running multiplication...")
1885 +             review_file_df = multiply_coordinates_by_page_sizes(
1886 +                 review_file_df.copy(),  # Pass a copy to avoid modifying original outside function
1887 +                 page_sizes_df,
1888 +                 xmin, xmax, ymin, ymax
1889 +             )
1890 +         else:
1891 +             #print("No relative coordinates detected or required columns missing, skipping multiplication.")
1892 +             # Still ensure essential coordinate/page columns are numeric if they exist
1893 +             cols_to_convert = [c for c in [xmin, xmax, ymin, ymax, "page"] if c in review_file_df.columns]
1894 +             for col in cols_to_convert:
1895 +                 review_file_df[col] = pd.to_numeric(review_file_df[col], errors='coerce')
1896
1897 +         # Handle potential case where multiplication returns an empty DF
1898 +         if review_file_df.empty:
1899 +             print("DataFrame became empty after coordinate processing.")
1900 +             # Re-add essential columns if they were lost
1901 +             for col in [xmin, xmax, ymin, ymax, "page", "label", "color", "id", "text"]:
1902 +                 if col not in review_file_df.columns:
1903 +                     review_file_df[col] = pd.NA
1904 +
1905 +     # --- Fill Missing IDs ---
1906 +     review_file_df = fill_missing_ids(review_file_df.copy())  # Pass a copy
1907 +
1908 +     # --- Deduplicate Based on Key Fields ---
1909 +     base_dedupe_cols = ["page", xmin, ymin, xmax, ymax, "label", "id"]
1910 +     # Identify which deduplication columns actually exist in the DataFrame
1911 +     cols_for_dedupe = [col for col in base_dedupe_cols if col in review_file_df.columns]
1912 +     # Add 'image' column for deduplication IF it exists (matches original logic intent)
1913 +     if "image" in review_file_df.columns:
1914 +         cols_for_dedupe.append("image")
1915 +
1916 +     # Ensure placeholder columns exist if they are needed for deduplication
1917 +     # (e.g., 'label', 'id' should be present after fill_missing_ids)
1918 +     for col in ['label', 'id']:
1919 +         if col in cols_for_dedupe and col not in review_file_df.columns:
1920 +             # This might indicate an issue in fill_missing_ids or prior steps
1921 +             print(f"Warning: Column '{col}' needed for dedupe but not found. Adding NA.")
1922 +             review_file_df[col] = ""  # Add default empty string
1923 +
1924 +     if cols_for_dedupe:  # Only attempt dedupe if we have columns to check
1925 +         #print(f"Deduplicating based on columns: {cols_for_dedupe}")
1926 +         # Convert relevant columns to string before dedupe to avoid type issues with mixed data (optional, depends on data)
1927 +         # for col in cols_for_dedupe:
1928 +         #     review_file_df[col] = review_file_df[col].astype(str)
1929 +         review_file_df = review_file_df.drop_duplicates(subset=cols_for_dedupe)
1930       else:
1931 +         print("Skipping deduplication: No valid columns found to deduplicate by.")
1932 +
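The `le(1)` check in this hunk is a heuristic: any coordinate ≤ 1 is taken as evidence the whole frame holds relative (0–1) coordinates, triggering multiplication back to pixels. A sketch of the detection plus an inline version of what `multiply_coordinates_by_page_sizes` is assumed to do (the merge-and-multiply below is illustrative, not the helper's actual body):

```python
import pandas as pd

df = pd.DataFrame({"page": [1], "xmin": [0.1], "xmax": [0.3], "ymin": [0.1], "ymax": [0.3]})
sizes = pd.DataFrame({"page": [1], "image_width": [1000], "image_height": [500]})

coord_cols = ["xmin", "xmax", "ymin", "ymax"]
# Heuristic: any value <= 1 suggests relative coordinates
needs_multiplication = df[coord_cols].apply(pd.to_numeric, errors="coerce").le(1).any().any()

if needs_multiplication:
    df = df.merge(sizes, on="page", how="left")
    # Scale x by page width, y by page height (axis=0 aligns the Series by row)
    df[["xmin", "xmax"]] = df[["xmin", "xmax"]].mul(df["image_width"], axis=0)
    df[["ymin", "ymax"]] = df[["ymin", "ymax"]].mul(df["image_height"], axis=0)
    df = df.drop(columns=["image_width", "image_height"])
```

One caveat worth keeping in mind: a genuine absolute box that happens to sit within one pixel of the origin would also satisfy `le(1)`, so the heuristic assumes real pixel coordinates are comfortably larger than 1.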
1933
1934 +     # --- Select and Prepare Final Output Columns ---
1935 +     required_final_cols = ["page", "label", "color", xmin, ymin, xmax, ymax, "id", "text"]
1936 +     # Identify which of the desired final columns exist in the (now potentially deduplicated) DataFrame
1937 +     available_final_cols = [col for col in required_final_cols if col in review_file_df.columns]
1938 +
1939 +     # Ensure essential output columns exist, adding defaults if missing AFTER deduplication
1940 +     for col in required_final_cols:
1941 +         if col not in review_file_df.columns:
1942 +             print(f"Adding missing final column '{col}' with default value.")
1943 +             if col in ['label', 'id', 'text']:
1944 +                 review_file_df[col] = ""  # Default empty string
1945 +             elif col == 'color':
1946 +                 review_file_df[col] = None  # Default None or a default color tuple
1947 +             else:  # page, coordinates
1948 +                 review_file_df[col] = pd.NA  # Default NA for numeric/page
1949 +             available_final_cols.append(col)  # Add to list of available columns
1950 +
1951 +     # Select only the final desired columns in the correct order
1952 +     review_file_df = review_file_df[available_final_cols]
1953 +
1954 +     # --- Final Formatting ---
1955 +     if not review_file_df.empty:
1956 +         # Convert list colors to tuples (important for some downstream uses)
1957 +         if 'color' in review_file_df.columns:
1958 +             review_file_df['color'] = review_file_df['color'].apply(
1959 +                 lambda x: tuple(x) if isinstance(x, list) else x
1960 +             )
1961 +         # Ensure page column is nullable integer type for reliable grouping
1962 +         if 'page' in review_file_df.columns:
1963 +             review_file_df['page'] = review_file_df['page'].astype('Int64')
1964 +
1965 +     # --- Group Annotations by Page ---
1966 +     if 'page' in review_file_df.columns:
1967 +         grouped_annotations = review_file_df.groupby('page')
1968 +         group_keys = set(grouped_annotations.groups.keys())  # Use set for faster lookups
1969 +     else:
1970 +         # Cannot group if page column is missing
1971 +         print("Error: 'page' column missing, cannot group annotations.")
1972 +         grouped_annotations = None
1973 +         group_keys = set()
1974 +
1975
1976 +     # --- Build JSON Structure ---
1977 +     json_data = []
1978 +     output_cols_for_boxes = [col for col in ["label", "color", xmin, ymin, xmax, ymax, "id", "text"] if col in review_file_df.columns]
1979 +
1980 +     # Iterate through page_sizes_df to define the structure (one entry per image path)
1981 +     for _, row in page_sizes_df.iterrows():
1982 +         page_num = row['page']  # Already Int64
1983 +         pdf_image_path = row['image_path']
1984 +         annotation_boxes = []  # Default to empty list
1985
1986 +         # Check if the page exists in the grouped annotations (using the faster set lookup)
1987 +         # Check pd.notna because page_num could be <NA> if conversion failed
1988 +         if pd.notna(page_num) and page_num in group_keys and grouped_annotations:
1989 +             try:
1990 +                 page_group_df = grouped_annotations.get_group(page_num)
1991 +                 # Convert the group to list of dicts, selecting only needed box properties
1992 +                 # Handle potential NaN coordinates before conversion to JSON
1993 +                 annotation_boxes = page_group_df[output_cols_for_boxes].replace({np.nan: None}).to_dict(orient='records')
1994 +
1995 +                 # Optional: Round coordinates here if needed AFTER potential multiplication
1996 +                 # for box in annotation_boxes:
1997 +                 #     for coord in [xmin, ymin, xmax, ymax]:
1998 +                 #         if coord in box and box[coord] is not None:
1999 +                 #             box[coord] = round(float(box[coord]), 2)  # Example: round to 2 decimals
2000 +
2001 +             except KeyError:
2002 +                 print(f"Warning: Group key {page_num} not found despite being in group_keys (should not happen).")
2003 +                 annotation_boxes = []  # Keep empty
2004 +
2005 +         # Append the structured data for this image/page
2006 +         json_data.append({
2007 +             "image": pdf_image_path,
2008 +             "boxes": annotation_boxes
2009 +         })
2010 +
2011 +     return json_data
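The grouping-and-assembly loop in this function emits one `{"image": ..., "boxes": [...]}` dict per page, with empty `boxes` for pages that have no annotations. The same pattern in miniature (paths, labels, and coordinate values below are made up):

```python
import numpy as np
import pandas as pd

review_df = pd.DataFrame({
    "page": pd.Series([1, 1], dtype="Int64"),
    "label": ["PERSON", "EMAIL"],
    "xmin": [10.0, 40.0], "ymin": [20.0, 50.0],
    "xmax": [30.0, 60.0], "ymax": [25.0, 55.0],
})
page_sizes_df = pd.DataFrame({
    "page": pd.Series([1, 2], dtype="Int64"),
    "image_path": ["page_1.png", "page_2.png"],
})

grouped = review_df.groupby("page")
group_keys = set(grouped.groups.keys())  # set lookup is O(1) per page

json_data = []
for _, row in page_sizes_df.iterrows():
    boxes = []
    if pd.notna(row["page"]) and row["page"] in group_keys:
        # NaN -> None so the payload serialises cleanly to JSON
        boxes = (grouped.get_group(row["page"])
                        .drop(columns="page")
                        .replace({np.nan: None})
                        .to_dict(orient="records"))
    json_data.append({"image": row["image_path"], "boxes": boxes})
```

Driving the loop from `page_sizes_df` rather than from the annotations guarantees an output entry for every page image, which the Gradio annotator needs even for pages without boxes.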
tools/file_redaction.py
CHANGED
@@ -20,8 +20,8 @@ from gradio import Progress
20   from collections import defaultdict # For efficient grouping
21
22   from tools.config import OUTPUT_FOLDER, IMAGES_DPI, MAX_IMAGE_PIXELS, RUN_AWS_FUNCTIONS, AWS_ACCESS_KEY, AWS_SECRET_KEY, AWS_REGION, PAGE_BREAK_VALUE, MAX_TIME_VALUE, LOAD_TRUNCATED_IMAGES, INPUT_FOLDER
23 + from tools.custom_image_analyser_engine import CustomImageAnalyzerEngine, OCRResult, combine_ocr_results, CustomImageRecognizerResult, run_page_text_redaction, merge_text_bounding_boxes
24 + from tools.file_conversion import convert_annotation_json_to_review_df, redact_whole_pymupdf_page, redact_single_box, convert_pymupdf_to_image_coords, is_pdf, is_pdf_or_image, prepare_image_or_pdf, divide_coordinates_by_page_sizes, multiply_coordinates_by_page_sizes, convert_annotation_data_to_dataframe, create_annotation_dicts_from_annotation_df, remove_duplicate_images_with_blank_boxes, fill_missing_ids, fill_missing_box_ids
25   from tools.load_spacy_model_custom_recognisers import nlp_analyser, score_threshold, custom_entities, custom_recogniser, custom_word_list_recogniser, CustomWordFuzzyRecognizer
26   from tools.helper_functions import get_file_name_without_type, clean_unicode_text, tesseract_ocr_option, text_ocr_option, textract_option, local_pii_detector, aws_pii_detector, no_redaction_option
27   from tools.aws_textract import analyse_page_with_textract, json_to_ocrresult, load_and_convert_textract_json

@@ -101,6 +101,8 @@ def choose_and_run_redactor(file_paths:List[str],
101       input_folder:str=INPUT_FOLDER,
102       total_textract_query_number:int=0,
103 +     ocr_file_path:str="",
104 +     prepare_images:bool=True,
105       progress=gr.Progress(track_tqdm=True)):
106       '''

@@ -149,7 +151,9 @@ def choose_and_run_redactor(file_paths:List[str],
149       - review_file_path (str, optional): The latest review file path created by the app
150       - input_folder (str, optional): The custom input path, if provided
151       - total_textract_query_number (int, optional): The number of textract queries up until this point.
152 +     - ocr_file_path (str, optional): The latest ocr file path created by the app
153 +     - prepare_images (bool, optional): Boolean to determine whether to load images for the PDF.
154       - progress (gr.Progress, optional): A progress tracker for the redaction process. Defaults to a Progress object with track_tqdm set to True.
155

@@ -179,9 +183,16 @@ def choose_and_run_redactor(file_paths:List[str],
179       out_file_paths = []
180       estimate_total_processing_time = 0
181       estimated_time_taken_state = 0
182       # If not the first time around, and the current page loop has been set to a huge number (been through all pages), reset current page to 0
183       elif (first_loop_state == False) & (current_loop_page == 999):
184           current_loop_page = 0
185
186       # Choose the correct file to prepare
187       if isinstance(file_paths, str): file_paths_list = [os.path.abspath(file_paths)]

@@ -219,6 +230,8 @@ def choose_and_run_redactor(file_paths:List[str],
219       elif out_message:
220           combined_out_message = combined_out_message + '\n' + out_message
221
222       # Only send across review file if redaction has been done
223       if pii_identification_method != no_redaction_option:
224

@@ -226,10 +239,15 @@ def choose_and_run_redactor(file_paths:List[str],
226       #review_file_path = [x for x in out_file_paths if "review_file" in x]
227       if review_file_path: review_out_file_paths.append(review_file_path)
228
229       estimate_total_processing_time = sum_numbers_before_seconds(combined_out_message)
230       print("Estimated total processing time:", str(estimate_total_processing_time))
231
232 +     return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_pages_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, pdf_image_file_paths, review_file_state, page_sizes, duplication_file_path_outputs, duplication_file_path_outputs, review_file_path, total_textract_query_number, ocr_file_path
233
234   #if first_loop_state == False:
235   # Prepare documents and images as required if they don't already exist

@@ -258,9 +276,8 @@ def choose_and_run_redactor(file_paths:List[str],
258
259
260       # Call prepare_image_or_pdf only if needed
261 +     if prepare_images_flag is not None:
262 +
263 +         out_message, prepared_pdf_file_paths, pdf_image_file_paths, annotate_max_pages, annotate_max_pages_bottom, pymupdf_doc, annotations_all_pages, review_file_state, document_cropboxes, page_sizes, textract_output_found, all_img_details_state, placeholder_ocr_results_df = prepare_image_or_pdf(
264               file_paths_loop, text_extraction_method, 0, out_message, True,
265               annotate_max_pages, annotations_all_pages, document_cropboxes, redact_whole_page_list,
266               output_folder, prepare_images=prepare_images_flag, page_sizes=page_sizes, input_folder=input_folder

@@ -275,11 +292,15 @@ def choose_and_run_redactor(file_paths:List[str],
275       page_sizes = page_sizes_df.to_dict(orient="records")
276
277       number_of_pages = pymupdf_doc.page_count
278
279       # If we have reached the last page, return message and outputs
280       if current_loop_page >= number_of_pages:
281           print("Reached last page of document:", current_loop_page)
282
283           # Set to a very high number so as not to mix up with subsequent file processing by the user
284           current_loop_page = 999
285           if out_message:

@@ -292,7 +313,7 @@ def choose_and_run_redactor(file_paths:List[str],
292       #review_file_path = [x for x in out_file_paths if "review_file" in x]
293       if review_file_path: review_out_file_paths.append(review_file_path)
294
295 +     return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = False, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_pages_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, pdf_image_file_paths, review_file_state, page_sizes, duplication_file_path_outputs, duplication_file_path_outputs, review_file_path, total_textract_query_number, ocr_file_path
296
297   # Load/create allow list
298   # If string, assume file path

@@ -333,7 +354,7 @@ def choose_and_run_redactor(file_paths:List[str],
333       # Try to connect to AWS services directly only if RUN_AWS_FUNCTIONS environmental variable is 1, otherwise an environment variable or direct textbox input is needed.
334       if pii_identification_method == aws_pii_detector:
335           if aws_access_key_textbox and aws_secret_key_textbox:
336 +             print("Connecting to Comprehend using AWS access key and secret keys from textbox inputs.")
337               comprehend_client = boto3.client('comprehend',
338                   aws_access_key_id=aws_access_key_textbox,
339                   aws_secret_access_key=aws_secret_key_textbox, region_name=AWS_REGION)

@@ -356,7 +377,7 @@ def choose_and_run_redactor(file_paths:List[str],
356       # Try to connect to AWS Textract Client if using that text extraction method
357       if text_extraction_method == textract_option:
358           if aws_access_key_textbox and aws_secret_key_textbox:
359 +             print("Connecting to Textract using AWS access key and secret keys from textbox inputs.")
360               textract_client = boto3.client('textract',
361                   aws_access_key_id=aws_access_key_textbox,
362                   aws_secret_access_key=aws_secret_key_textbox, region_name=AWS_REGION)

@@ -401,7 +422,7 @@ def choose_and_run_redactor(file_paths:List[str],
401       is_a_pdf = is_pdf(file_path) == True
402       if is_a_pdf == False and text_extraction_method == text_ocr_option:
403           # If user has not submitted a pdf, assume it's an image
404 +         print("File is not a PDF, assuming it is an image.")
405           text_extraction_method = tesseract_ocr_option
406       else:
407           out_message = "No file selected"

@@ -422,7 +443,7 @@ def choose_and_run_redactor(file_paths:List[str],
422
423       print("Redacting file " + pdf_file_name_with_ext + " as an image-based file")
424
425 +     pymupdf_doc, all_pages_decision_process_table, out_file_paths, new_textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number = redact_image_pdf(file_path,
426           pdf_image_file_paths,
427           language,
428           chosen_redact_entities,

@@ -448,7 +469,9 @@ def choose_and_run_redactor(file_paths:List[str],
448           max_fuzzy_spelling_mistakes_num,
449           match_fuzzy_whole_phrase_bool,
450           page_sizes_df,
451 +         text_extraction_only,
452           log_files_output_paths=log_files_output_paths,
453           output_folder=output_folder)
454

@@ -599,7 +622,10 @@ def choose_and_run_redactor(file_paths:List[str],
599       if not review_file_path: review_out_file_paths = [prepared_pdf_file_paths[-1]]
600       else: review_out_file_paths = [prepared_pdf_file_paths[-1], review_file_path]
601
603
604   def convert_pikepdf_coords_to_pymupdf(pymupdf_page:Page, pikepdf_bbox, type="pikepdf_annot"):
605       '''

@@ -862,17 +888,6 @@ def convert_pikepdf_annotations_to_result_annotation_box(page:Page, annot:dict,
862
863       rect = Rect(pymupdf_x1, pymupdf_y1, pymupdf_x2, pymupdf_y2)
864
865 -     # if image or image_dimensions:
866 -     #     print("Dividing result by image coordinates")
867 -
868 -     #     image_x1, image_y1, image_x2, image_y2 = convert_pymupdf_to_image_coords(page, pymupdf_x1, pymupdf_y1, pymupdf_x2, pymupdf_y2, image, image_dimensions=image_dimensions)
869 -
870 -     #     img_annotation_box["xmin"] = image_x1
871 -     #     img_annotation_box["ymin"] = image_y1
872 -     #     img_annotation_box["xmax"] = image_x2
873 -     #     img_annotation_box["ymax"] = image_y2
874 -
875 -     # else:
876       convert_df = pd.DataFrame({
877           "page": [page_no],
878           "xmin": [pymupdf_x1],

@@ -1016,9 +1031,6 @@ def redact_page_with_pymupdf(page:Page, page_annotations:dict, image:Image=None,
1016
1017       img_annotation_box = fill_missing_box_ids(img_annotation_box)
1018
1019 -     #print("image_dimensions:", image_dimensions)
1020 -     #print("annot:", annot)
1021 -
1022       all_image_annotation_boxes.append(img_annotation_box)
1023
1024       # Redact the annotations from the document

@@ -1178,7 +1190,9 @@ def redact_image_pdf(file_path:str,
1178       max_fuzzy_spelling_mistakes_num:int=1,
1179       match_fuzzy_whole_phrase_bool:bool=True,
1180       page_sizes_df:pd.DataFrame=pd.DataFrame(),
1181 +     text_extraction_only:bool=False,
1182       page_break_val:int=int(PAGE_BREAK_VALUE),
1183       log_files_output_paths:List=[],
1184       max_time:int=int(MAX_TIME_VALUE),

@@ -1250,7 +1264,6 @@ def redact_image_pdf(file_path:str,
1250       print(out_message_warning)
1251       #raise Exception(out_message)
1252
1253 -
1254   number_of_pages = pymupdf_doc.page_count
1255   print("Number of pages:", str(number_of_pages))
1256

@@ -1268,14 +1281,24 @@ def redact_image_pdf(file_path:str,
1268       textract_data, is_missing, log_files_output_paths = load_and_convert_textract_json(textract_json_file_path, log_files_output_paths, page_sizes_df)
1269       original_textract_data = textract_data.copy()
1270
1271       ###
1272       if current_loop_page == 0: page_loop_start = 0
1273       else: page_loop_start = current_loop_page
1274
1275       progress_bar = tqdm(range(page_loop_start, number_of_pages), unit="pages remaining", desc="Redacting pages")
1276
1277 +     all_pages_decision_process_table_list = [all_pages_decision_process_table]
1278       all_line_level_ocr_results_df_list = [all_line_level_ocr_results_df]
1279
1280       # Go through each page
1281       for page_no in progress_bar:

@@ -1283,10 +1306,9 @@ def redact_image_pdf(file_path:str,
1283           handwriting_or_signature_boxes = []
1284           page_signature_recogniser_results = []
1285           page_handwriting_recogniser_results = []
1286           page_break_return = False
1287           reported_page_number = str(page_no + 1)
1288 -
1289 -         #print("page_sizes_df for row:", page_sizes_df.loc[page_sizes_df["page"] == (page_no + 1)])
1290
1291           # Try to find image location
1292           try:

@@ -1328,14 +1350,50 @@ def redact_image_pdf(file_path:str,
1328
1329       # Step 1: Perform OCR. Either with Tesseract, or with AWS Textract
1330
1331 -     # If using Tesseract
1332       if text_extraction_method == tesseract_ocr_option:
|
1333 |
#print("image_path:", image_path)
|
1334 |
#print("print(type(image_path)):", print(type(image_path)))
|
1335 |
#if not isinstance(image_path, image_path.image_path) or not isinstance(image_path, str): raise Exception("image_path object for page", reported_page_number, "not found, cannot perform local OCR analysis.")
|
1336 |
|
1337 |
-
|
1338 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1339 |
|
1340 |
# Check if page exists in existing textract data. If not, send to service to analyse
|
1341 |
if text_extraction_method == textract_option:
|
@@ -1399,16 +1457,28 @@ def redact_image_pdf(file_path:str,
|
|
1399 |
# If the page exists, retrieve the data
|
1400 |
text_blocks = next(page['data'] for page in textract_data["pages"] if page['page_no'] == reported_page_number)
|
1401 |
|
1402 |
-
|
1403 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1404 |
|
1405 |
if pii_identification_method != no_redaction_option:
|
1406 |
# Step 2: Analyse text and identify PII
|
1407 |
if chosen_redact_entities or chosen_redact_comprehend_entities:
|
1408 |
|
1409 |
page_redaction_bounding_boxes, comprehend_query_number_new = image_analyser.analyze_text(
|
1410 |
-
page_line_level_ocr_results,
|
1411 |
-
|
1412 |
chosen_redact_comprehend_entities = chosen_redact_comprehend_entities,
|
1413 |
pii_identification_method = pii_identification_method,
|
1414 |
comprehend_client=comprehend_client,
|
@@ -1423,7 +1493,7 @@ def redact_image_pdf(file_path:str,
|
|
1423 |
else: page_redaction_bounding_boxes = []
|
1424 |
|
1425 |
# Merge redaction bounding boxes that are close together
|
1426 |
-
page_merged_redaction_bboxes = merge_img_bboxes(page_redaction_bounding_boxes,
|
1427 |
|
1428 |
else: page_merged_redaction_bboxes = []
|
1429 |
|
@@ -1449,7 +1519,6 @@ def redact_image_pdf(file_path:str,
|
|
1449 |
# Assume image_path is an image
|
1450 |
image = image_path
|
1451 |
|
1452 |
-
print("image:", image)
|
1453 |
|
1454 |
fill = (0, 0, 0) # Fill colour for redactions
|
1455 |
draw = ImageDraw.Draw(image)
|
@@ -1510,19 +1579,6 @@ def redact_image_pdf(file_path:str,
|
|
1510 |
decision_process_table = fill_missing_ids(decision_process_table)
|
1511 |
#decision_process_table.to_csv("output/decision_process_table_with_ids.csv")
|
1512 |
|
1513 |
-
|
1514 |
-
# Convert to DataFrame and add to ongoing logging table
|
1515 |
-
line_level_ocr_results_df = pd.DataFrame([{
|
1516 |
-
'page': reported_page_number,
|
1517 |
-
'text': result.text,
|
1518 |
-
'left': result.left,
|
1519 |
-
'top': result.top,
|
1520 |
-
'width': result.width,
|
1521 |
-
'height': result.height
|
1522 |
-
} for result in page_line_level_ocr_results])
|
1523 |
-
|
1524 |
-
all_line_level_ocr_results_df_list.append(line_level_ocr_results_df)
|
1525 |
-
|
1526 |
toc = time.perf_counter()
|
1527 |
|
1528 |
time_taken = toc - tic
|
@@ -1547,6 +1603,8 @@ def redact_image_pdf(file_path:str,
|
|
1547 |
# Append new annotation if it doesn't exist
|
1548 |
annotations_all_pages.append(page_image_annotations)
|
1549 |
|
|
|
|
|
1550 |
if text_extraction_method == textract_option:
|
1551 |
if original_textract_data != textract_data:
|
1552 |
# Write the updated existing textract data back to the JSON file
|
@@ -1556,12 +1614,21 @@ def redact_image_pdf(file_path:str,
|
|
1556 |
if textract_json_file_path not in log_files_output_paths:
|
1557 |
log_files_output_paths.append(textract_json_file_path)
|
1558 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1559 |
all_pages_decision_process_table = pd.concat(all_pages_decision_process_table_list)
|
1560 |
all_line_level_ocr_results_df = pd.concat(all_line_level_ocr_results_df_list)
|
1561 |
|
1562 |
current_loop_page += 1
|
1563 |
|
1564 |
-
return pymupdf_doc, all_pages_decision_process_table, log_files_output_paths, textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number
|
1565 |
|
1566 |
# If it's an image file
|
1567 |
if is_pdf(file_path) == False:
|
@@ -1594,10 +1661,20 @@ def redact_image_pdf(file_path:str,
|
|
1594 |
if textract_json_file_path not in log_files_output_paths:
|
1595 |
log_files_output_paths.append(textract_json_file_path)
|
1596 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1597 |
all_pages_decision_process_table = pd.concat(all_pages_decision_process_table_list)
|
1598 |
all_line_level_ocr_results_df = pd.concat(all_line_level_ocr_results_df_list)
|
1599 |
|
1600 |
-
return pymupdf_doc, all_pages_decision_process_table, log_files_output_paths, textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number
|
1601 |
|
1602 |
if text_extraction_method == textract_option:
|
1603 |
# Write the updated existing textract data back to the JSON file
|
@@ -1609,15 +1686,24 @@ def redact_image_pdf(file_path:str,
|
|
1609 |
if textract_json_file_path not in log_files_output_paths:
|
1610 |
log_files_output_paths.append(textract_json_file_path)
|
1611 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1612 |
all_pages_decision_process_table = pd.concat(all_pages_decision_process_table_list)
|
1613 |
all_line_level_ocr_results_df = pd.concat(all_line_level_ocr_results_df_list)
|
1614 |
|
1615 |
-
# Convert decision table to relative coordinates
|
1616 |
all_pages_decision_process_table = divide_coordinates_by_page_sizes(all_pages_decision_process_table, page_sizes_df, xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax")
|
1617 |
|
1618 |
all_line_level_ocr_results_df = divide_coordinates_by_page_sizes(all_line_level_ocr_results_df, page_sizes_df, xmin="left", xmax="width", ymin="top", ymax="height")
|
1619 |
|
1620 |
-
return pymupdf_doc, all_pages_decision_process_table, log_files_output_paths, textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number
|
1621 |
|
1622 |
|
1623 |
###
|
@@ -1631,8 +1717,6 @@ def get_text_container_characters(text_container:LTTextContainer):
|
|
1631 |
for line in text_container
|
1632 |
if isinstance(line, LTTextLine) or isinstance(line, LTTextLineHorizontal)
|
1633 |
for char in line]
|
1634 |
-
|
1635 |
-
#print("Initial characters:", characters)
|
1636 |
|
1637 |
return characters
|
1638 |
return []
|
@@ -1762,9 +1846,6 @@ def create_text_redaction_process_results(analyser_results, analysed_bounding_bo
|
|
1762 |
analysed_bounding_boxes_df_new = pd.concat([analysed_bounding_boxes_df_new, analysed_bounding_boxes_df_text], axis = 1)
|
1763 |
analysed_bounding_boxes_df_new['page'] = page_num + 1
|
1764 |
|
1765 |
-
#analysed_bounding_boxes_df_new = fill_missing_ids(analysed_bounding_boxes_df_new)
|
1766 |
-
analysed_bounding_boxes_df_new.to_csv("output/analysed_bounding_boxes_df_new_with_ids.csv")
|
1767 |
-
|
1768 |
decision_process_table = pd.concat([decision_process_table, analysed_bounding_boxes_df_new], axis = 0).drop('result', axis=1)
|
1769 |
|
1770 |
return decision_process_table
|
@@ -1772,7 +1853,6 @@ def create_text_redaction_process_results(analyser_results, analysed_bounding_bo
|
|
1772 |
def create_pikepdf_annotations_for_bounding_boxes(analysed_bounding_boxes):
|
1773 |
pikepdf_redaction_annotations_on_page = []
|
1774 |
for analysed_bounding_box in analysed_bounding_boxes:
|
1775 |
-
#print("analysed_bounding_box:", analysed_bounding_boxes)
|
1776 |
|
1777 |
bounding_box = analysed_bounding_box["boundingBox"]
|
1778 |
annotation = Dictionary(
|
@@ -1997,7 +2077,6 @@ def redact_text_pdf(
|
|
1997 |
pass
|
1998 |
#print("Not redacting page:", page_no)
|
1999 |
|
2000 |
-
#print("page_image_annotations after page", reported_page_number, "are", page_image_annotations)
|
2001 |
|
2002 |
# Join extracted text outputs for all lines together
|
2003 |
if not page_text_ocr_outputs.empty:
|
|
|
20 |
from collections import defaultdict # For efficient grouping
|
21 |
|
22 |
from tools.config import OUTPUT_FOLDER, IMAGES_DPI, MAX_IMAGE_PIXELS, RUN_AWS_FUNCTIONS, AWS_ACCESS_KEY, AWS_SECRET_KEY, AWS_REGION, PAGE_BREAK_VALUE, MAX_TIME_VALUE, LOAD_TRUNCATED_IMAGES, INPUT_FOLDER
|
23 |
+
from tools.custom_image_analyser_engine import CustomImageAnalyzerEngine, OCRResult, combine_ocr_results, CustomImageRecognizerResult, run_page_text_redaction, merge_text_bounding_boxes, recreate_page_line_level_ocr_results_with_page
|
24 |
+
from tools.file_conversion import convert_annotation_json_to_review_df, redact_whole_pymupdf_page, redact_single_box, convert_pymupdf_to_image_coords, is_pdf, is_pdf_or_image, prepare_image_or_pdf, divide_coordinates_by_page_sizes, multiply_coordinates_by_page_sizes, convert_annotation_data_to_dataframe, divide_coordinates_by_page_sizes, create_annotation_dicts_from_annotation_df, remove_duplicate_images_with_blank_boxes, fill_missing_ids, fill_missing_box_ids, load_and_convert_ocr_results_with_words_json
|
25 |
from tools.load_spacy_model_custom_recognisers import nlp_analyser, score_threshold, custom_entities, custom_recogniser, custom_word_list_recogniser, CustomWordFuzzyRecognizer
|
26 |
from tools.helper_functions import get_file_name_without_type, clean_unicode_text, tesseract_ocr_option, text_ocr_option, textract_option, local_pii_detector, aws_pii_detector, no_redaction_option
|
27 |
from tools.aws_textract import analyse_page_with_textract, json_to_ocrresult, load_and_convert_textract_json
|
|
|
101 |
input_folder:str=INPUT_FOLDER,
|
102 |
total_textract_query_number:int=0,
|
103 |
ocr_file_path:str="",
|
104 |
+
all_page_line_level_ocr_results = [],
|
105 |
+
all_page_line_level_ocr_results_with_words = [],
|
106 |
prepare_images:bool=True,
|
107 |
progress=gr.Progress(track_tqdm=True)):
|
108 |
'''
|
|
|
151 |
- review_file_path (str, optional): The latest review file path created by the app
|
152 |
- input_folder (str, optional): The custom input path, if provided
|
153 |
- total_textract_query_number (int, optional): The number of textract queries up until this point.
|
154 |
+
- ocr_file_path (str, optional): The latest ocr file path created by the app.
|
155 |
+
- all_page_line_level_ocr_results (list, optional): All line level text on the page with bounding boxes.
|
156 |
+
- all_page_line_level_ocr_results_with_words (list, optional): All word level text on the page with bounding boxes.
|
157 |
- prepare_images (bool, optional): Boolean to determine whether to load images for the PDF.
|
158 |
- progress (gr.Progress, optional): A progress tracker for the redaction process. Defaults to a Progress object with track_tqdm set to True.
|
159 |
|
|
|
183 |
out_file_paths = []
|
184 |
estimate_total_processing_time = 0
|
185 |
estimated_time_taken_state = 0
|
186 |
+
comprehend_query_number = 0
|
187 |
+
total_textract_query_number = 0
|
188 |
+
elif current_loop_page == 0:
|
189 |
+
comprehend_query_number = 0
|
190 |
+
total_textract_query_number = 0
|
191 |
# If not the first time around, and the current page loop has been set to a huge number (been through all pages), reset current page to 0
|
192 |
elif (first_loop_state == False) & (current_loop_page == 999):
|
193 |
current_loop_page = 0
|
194 |
+
total_textract_query_number = 0
|
195 |
+
comprehend_query_number = 0
|
196 |
|
197 |
# Choose the correct file to prepare
|
198 |
if isinstance(file_paths, str): file_paths_list = [os.path.abspath(file_paths)]
|
|
|
230 |
elif out_message:
|
231 |
combined_out_message = combined_out_message + '\n' + out_message
|
232 |
|
233 |
+
combined_out_message = re.sub(r'^\n+', '', combined_out_message).strip()
|
234 |
+
|
235 |
# Only send across review file if redaction has been done
|
236 |
if pii_identification_method != no_redaction_option:
|
237 |
|
|
|
239 |
#review_file_path = [x for x in out_file_paths if "review_file" in x]
|
240 |
if review_file_path: review_out_file_paths.append(review_file_path)
|
241 |
|
242 |
+
if not isinstance(pymupdf_doc, list):
|
243 |
+
number_of_pages = pymupdf_doc.page_count
|
244 |
+
if total_textract_query_number > number_of_pages:
|
245 |
+
total_textract_query_number = number_of_pages
|
246 |
+
|
247 |
estimate_total_processing_time = sum_numbers_before_seconds(combined_out_message)
|
248 |
print("Estimated total processing time:", str(estimate_total_processing_time))
|
249 |
|
250 |
+
return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_pages_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, pdf_image_file_paths, review_file_state, page_sizes, duplication_file_path_outputs, duplication_file_path_outputs, review_file_path, total_textract_query_number, ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_words
|
251 |
|
252 |
#if first_loop_state == False:
|
253 |
# Prepare documents and images as required if they don't already exist
|
|
|
276 |
|
277 |
|
278 |
# Call prepare_image_or_pdf only if needed
|
279 |
+
if prepare_images_flag is not None:
|
280 |
+
out_message, prepared_pdf_file_paths, pdf_image_file_paths, annotate_max_pages, annotate_max_pages_bottom, pymupdf_doc, annotations_all_pages, review_file_state, document_cropboxes, page_sizes, textract_output_found, all_img_details_state, placeholder_ocr_results_df, local_ocr_output_found_checkbox = prepare_image_or_pdf(
|
|
|
281 |
file_paths_loop, text_extraction_method, 0, out_message, True,
|
282 |
annotate_max_pages, annotations_all_pages, document_cropboxes, redact_whole_page_list,
|
283 |
output_folder, prepare_images=prepare_images_flag, page_sizes=page_sizes, input_folder=input_folder
|
|
|
292 |
page_sizes = page_sizes_df.to_dict(orient="records")
|
293 |
|
294 |
number_of_pages = pymupdf_doc.page_count
|
295 |
+
|
296 |
|
297 |
# If we have reached the last page, return message and outputs
|
298 |
if current_loop_page >= number_of_pages:
|
299 |
print("Reached last page of document:", current_loop_page)
|
300 |
|
301 |
+
if total_textract_query_number > number_of_pages:
|
302 |
+
total_textract_query_number = number_of_pages
|
303 |
+
|
304 |
# Set to a very high number so as not to mix up with subsequent file processing by the user
|
305 |
current_loop_page = 999
|
306 |
if out_message:
|
|
|
313 |
#review_file_path = [x for x in out_file_paths if "review_file" in x]
|
314 |
if review_file_path: review_out_file_paths.append(review_file_path)
|
315 |
|
316 |
+
return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page,precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = False, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_pages_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, pdf_image_file_paths, review_file_state, page_sizes, duplication_file_path_outputs, duplication_file_path_outputs, review_file_path, total_textract_query_number, ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_words
|
317 |
|
318 |
# Load/create allow list
|
319 |
# If string, assume file path
|
|
|
354 |
# Try to connect to AWS services directly only if RUN_AWS_FUNCTIONS environmental variable is 1, otherwise an environment variable or direct textbox input is needed.
|
355 |
if pii_identification_method == aws_pii_detector:
|
356 |
if aws_access_key_textbox and aws_secret_key_textbox:
|
357 |
+
print("Connecting to Comprehend using AWS access key and secret keys from user input.")
|
358 |
comprehend_client = boto3.client('comprehend',
|
359 |
aws_access_key_id=aws_access_key_textbox,
|
360 |
aws_secret_access_key=aws_secret_key_textbox, region_name=AWS_REGION)
|
|
|
377 |
# Try to connect to AWS Textract Client if using that text extraction method
|
378 |
if text_extraction_method == textract_option:
|
379 |
if aws_access_key_textbox and aws_secret_key_textbox:
|
380 |
+
print("Connecting to Textract using AWS access key and secret keys from user input.")
|
381 |
textract_client = boto3.client('textract',
|
382 |
aws_access_key_id=aws_access_key_textbox,
|
383 |
aws_secret_access_key=aws_secret_key_textbox, region_name=AWS_REGION)
|
|
|
422 |
is_a_pdf = is_pdf(file_path) == True
|
423 |
if is_a_pdf == False and text_extraction_method == text_ocr_option:
|
424 |
# If user has not submitted a pdf, assume it's an image
|
425 |
+
print("File is not a PDF, assuming that image analysis needs to be used.")
|
426 |
text_extraction_method = tesseract_ocr_option
|
427 |
else:
|
428 |
out_message = "No file selected"
|
|
|
443 |
|
444 |
print("Redacting file " + pdf_file_name_with_ext + " as an image-based file")
|
445 |
|
446 |
+
pymupdf_doc, all_pages_decision_process_table, out_file_paths, new_textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_words = redact_image_pdf(file_path,
|
447 |
pdf_image_file_paths,
|
448 |
language,
|
449 |
chosen_redact_entities,
|
|
|
469 |
max_fuzzy_spelling_mistakes_num,
|
470 |
match_fuzzy_whole_phrase_bool,
|
471 |
page_sizes_df,
|
472 |
+
text_extraction_only,
|
473 |
+
all_page_line_level_ocr_results,
|
474 |
+
all_page_line_level_ocr_results_with_words,
|
475 |
log_files_output_paths=log_files_output_paths,
|
476 |
output_folder=output_folder)
|
477 |
|
|
|
622 |
if not review_file_path: review_out_file_paths = [prepared_pdf_file_paths[-1]]
|
623 |
else: review_out_file_paths = [prepared_pdf_file_paths[-1], review_file_path]
|
624 |
|
625 |
+
if total_textract_query_number > number_of_pages:
|
626 |
+
total_textract_query_number = number_of_pages
|
627 |
+
|
628 |
+
return combined_out_message, out_file_paths, out_file_paths, gr.Number(value=latest_file_completed, label="Number of documents redacted", interactive=False, visible=False), log_files_output_paths, log_files_output_paths, estimated_time_taken_state, all_request_metadata_str, pymupdf_doc, annotations_all_pages, gr.Number(value=current_loop_page, precision=0, interactive=False, label = "Last redacted page in document", visible=False), gr.Checkbox(value = True, label="Page break reached", visible=False), all_line_level_ocr_results_df, all_pages_decision_process_table, comprehend_query_number, review_out_file_paths, annotate_max_pages, annotate_max_pages, prepared_pdf_file_paths, pdf_image_file_paths, review_file_state, page_sizes, duplication_file_path_outputs, duplication_file_path_outputs, review_file_path, total_textract_query_number, ocr_file_path, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_words
|
629 |
|
630 |
def convert_pikepdf_coords_to_pymupdf(pymupdf_page:Page, pikepdf_bbox, type="pikepdf_annot"):
|
631 |
'''
|
|
|
888 |
|
889 |
rect = Rect(pymupdf_x1, pymupdf_y1, pymupdf_x2, pymupdf_y2)
|
890 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
891 |
convert_df = pd.DataFrame({
|
892 |
"page": [page_no],
|
893 |
"xmin": [pymupdf_x1],
|
|
|
1031 |
|
1032 |
img_annotation_box = fill_missing_box_ids(img_annotation_box)
|
1033 |
|
|
|
|
|
|
|
1034 |
all_image_annotation_boxes.append(img_annotation_box)
|
1035 |
|
1036 |
# Redact the annotations from the document
|
|
|
1190 |
max_fuzzy_spelling_mistakes_num:int=1,
|
1191 |
match_fuzzy_whole_phrase_bool:bool=True,
|
1192 |
page_sizes_df:pd.DataFrame=pd.DataFrame(),
|
1193 |
+
text_extraction_only:bool=False,
|
1194 |
+
all_page_line_level_ocr_results = [],
|
1195 |
+
all_page_line_level_ocr_results_with_words = [],
|
1196 |
page_break_val:int=int(PAGE_BREAK_VALUE),
|
1197 |
log_files_output_paths:List=[],
|
1198 |
max_time:int=int(MAX_TIME_VALUE),
|
|
|
1264 |
print(out_message_warning)
|
1265 |
#raise Exception(out_message)
|
1266 |
|
|
|
1267 |
number_of_pages = pymupdf_doc.page_count
|
1268 |
print("Number of pages:", str(number_of_pages))
|
1269 |
|
|
|
1281 |
textract_data, is_missing, log_files_output_paths = load_and_convert_textract_json(textract_json_file_path, log_files_output_paths, page_sizes_df)
|
1282 |
original_textract_data = textract_data.copy()
|
1283 |
|
1284 |
+
print("Successfully loaded in Textract analysis results from file")
|
1285 |
+
|
1286 |
+
# If running local OCR option, check if file already exists. If it does, load in existing data
|
1287 |
+
if text_extraction_method == tesseract_ocr_option:
|
1288 |
+
all_page_line_level_ocr_results_with_words_json_file_path = output_folder + file_name + "_ocr_results_with_words.json"
|
1289 |
+
all_page_line_level_ocr_results_with_words, is_missing, log_files_output_paths = load_and_convert_ocr_results_with_words_json(all_page_line_level_ocr_results_with_words_json_file_path, log_files_output_paths, page_sizes_df)
|
1290 |
+
original_all_page_line_level_ocr_results_with_words = all_page_line_level_ocr_results_with_words.copy()
|
1291 |
+
|
1292 |
+
print("Loaded in local OCR analysis results from file")
|
1293 |
+
|
1294 |
###
|
1295 |
if current_loop_page == 0: page_loop_start = 0
|
1296 |
else: page_loop_start = current_loop_page
|
1297 |
|
1298 |
progress_bar = tqdm(range(page_loop_start, number_of_pages), unit="pages remaining", desc="Redacting pages")
|
1299 |
|
|
|
1300 |
all_line_level_ocr_results_df_list = [all_line_level_ocr_results_df]
|
1301 |
+
all_pages_decision_process_table_list = [all_pages_decision_process_table]
|
1302 |
|
1303 |
# Go through each page
|
1304 |
for page_no in progress_bar:
|
|
|
1306 |
handwriting_or_signature_boxes = []
|
1307 |
page_signature_recogniser_results = []
|
1308 |
page_handwriting_recogniser_results = []
|
1309 |
+
page_line_level_ocr_results_with_words = []
|
1310 |
page_break_return = False
|
1311 |
reported_page_number = str(page_no + 1)
|
|
|
|
|
1312 |
|
1313 |
# Try to find image location
|
1314 |
try:
|
|
|
1350 |
|
1351 |
# Step 1: Perform OCR. Either with Tesseract, or with AWS Textract
|
1352 |
|
1353 |
+
# If using Tesseract
|
1354 |
if text_extraction_method == tesseract_ocr_option:
|
1355 |
#print("image_path:", image_path)
|
1356 |
#print("print(type(image_path)):", print(type(image_path)))
|
1357 |
#if not isinstance(image_path, image_path.image_path) or not isinstance(image_path, str): raise Exception("image_path object for page", reported_page_number, "not found, cannot perform local OCR analysis.")
|
1358 |
|
1359 |
+
# Check for existing page_line_level_ocr_results_with_words object:
|
1360 |
+
|
1361 |
+
# page_line_level_ocr_results = (
|
1362 |
+
# all_page_line_level_ocr_results.get('results', [])
|
1363 |
+
# if all_page_line_level_ocr_results.get('page') == reported_page_number
|
1364 |
+
# else []
|
1365 |
+
# )
|
1366 |
+
|
1367 |
+
if all_page_line_level_ocr_results_with_words:
|
1368 |
+
# Find the first dict where 'page' matches
|
1369 |
+
|
1370 |
+
#print("all_page_line_level_ocr_results_with_words:", all_page_line_level_ocr_results_with_words)
|
1371 |
+
|
1372 |
+
print("All pages available:", [item.get('page') for item in all_page_line_level_ocr_results_with_words])
|
1373 |
+
#print("Looking for page:", reported_page_number)
|
1374 |
+
|
1375 |
+
matching_page = next(
|
1376 |
+
(item for item in all_page_line_level_ocr_results_with_words if int(item.get('page', -1)) == int(reported_page_number)),
|
1377 |
+
None
|
1378 |
+
)
|
1379 |
+
|
1380 |
+
#print("matching_page:", matching_page)
|
1381 |
+
|
1382 |
+
page_line_level_ocr_results_with_words = matching_page if matching_page else []
|
1383 |
+
else: page_line_level_ocr_results_with_words = []
|
1384 |
+
|
1385 |
+
if page_line_level_ocr_results_with_words:
|
1386 |
+
print("Found OCR results for page in existing OCR with words object")
|
1387 |
+
page_line_level_ocr_results = recreate_page_line_level_ocr_results_with_page(page_line_level_ocr_results_with_words)
|
1388 |
+
else:
|
1389 |
+
page_word_level_ocr_results = image_analyser.perform_ocr(image_path)
|
1390 |
+
|
1391 |
+
print("page_word_level_ocr_results:", page_word_level_ocr_results)
|
1392 |
+
page_line_level_ocr_results, page_line_level_ocr_results_with_words = combine_ocr_results(page_word_level_ocr_results, page=reported_page_number)
|
1393 |
+
|
1394 |
+
all_page_line_level_ocr_results_with_words.append(page_line_level_ocr_results_with_words)
|
1395 |
+
|
1396 |
+
print("All pages available:", [item.get('page') for item in all_page_line_level_ocr_results_with_words])
|
1397 |
|
1398 |
# Check if page exists in existing textract data. If not, send to service to analyse
|
1399 |
if text_extraction_method == textract_option:
|
|
|
1457 |
# If the page exists, retrieve the data
|
1458 |
text_blocks = next(page['data'] for page in textract_data["pages"] if page['page_no'] == reported_page_number)
|
1459 |
|
1460 |
+
page_line_level_ocr_results, handwriting_or_signature_boxes, page_signature_recogniser_results, page_handwriting_recogniser_results, page_line_level_ocr_results_with_words = json_to_ocrresult(text_blocks, page_width, page_height, reported_page_number)
|
1461 |
+
|
1462 |
+
# Convert to DataFrame and add to ongoing logging table
|
1463 |
+
line_level_ocr_results_df = pd.DataFrame([{
|
1464 |
+
'page': page_line_level_ocr_results['page'],
|
1465 |
+
'text': result.text,
|
1466 |
+
'left': result.left,
|
1467 |
+
'top': result.top,
|
1468 |
+
'width': result.width,
|
1469 |
+
'height': result.height
|
1470 |
+
} for result in page_line_level_ocr_results['results']])
|
1471 |
+
|
1472 |
+
all_line_level_ocr_results_df_list.append(line_level_ocr_results_df)
|
1473 |
+
|
1474 |
|
1475 |
if pii_identification_method != no_redaction_option:
|
1476 |
# Step 2: Analyse text and identify PII
|
1477 |
if chosen_redact_entities or chosen_redact_comprehend_entities:
|
1478 |
|
1479 |
page_redaction_bounding_boxes, comprehend_query_number_new = image_analyser.analyze_text(
|
1480 |
+
page_line_level_ocr_results['results'],
|
1481 |
+
page_line_level_ocr_results_with_words['results'],
|
1482 |
chosen_redact_comprehend_entities = chosen_redact_comprehend_entities,
|
1483 |
pii_identification_method = pii_identification_method,
|
1484 |
comprehend_client=comprehend_client,
|
|
|
1493 |
else: page_redaction_bounding_boxes = []
|
1494 |
|
1495 |
# Merge redaction bounding boxes that are close together
|
1496 |
+
page_merged_redaction_bboxes = merge_img_bboxes(page_redaction_bounding_boxes, page_line_level_ocr_results_with_words['results'], page_signature_recogniser_results, page_handwriting_recogniser_results, handwrite_signature_checkbox)
|
1497 |
|
1498 |
else: page_merged_redaction_bboxes = []
|
1499 |
|
|
|
1519 |
# Assume image_path is an image
|
1520 |
image = image_path
|
1521 |
|
|
|
1522 |
|
1523 |
fill = (0, 0, 0) # Fill colour for redactions
|
1524 |
draw = ImageDraw.Draw(image)
|
|
|
1579 |
decision_process_table = fill_missing_ids(decision_process_table)
|
1580 |
#decision_process_table.to_csv("output/decision_process_table_with_ids.csv")
|
1581 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1582 |
toc = time.perf_counter()
|
1583 |
|
1584 |
time_taken = toc - tic
|
|
|
1603 |
# Append new annotation if it doesn't exist
|
1604 |
annotations_all_pages.append(page_image_annotations)
|
1605 |
|
1606 |
+
|
1607 |
+
|
1608 |
if text_extraction_method == textract_option:
|
1609 |
if original_textract_data != textract_data:
|
1610 |
# Write the updated existing textract data back to the JSON file
|
|
|
1614 |
if textract_json_file_path not in log_files_output_paths:
|
1615 |
log_files_output_paths.append(textract_json_file_path)
|
1616 |
|
1617 |
+
if text_extraction_method == tesseract_ocr_option:
|
1618 |
+
if original_all_page_line_level_ocr_results_with_words != all_page_line_level_ocr_results_with_words:
|
1619 |
+
# Write the updated existing textract data back to the JSON file
|
1620 |
+
with open(all_page_line_level_ocr_results_with_words_json_file_path, 'w') as json_file:
|
1621 |
+
json.dump(all_page_line_level_ocr_results_with_words, json_file, separators=(",", ":")) # indent=4 makes the JSON file pretty-printed
|
1622 |
+
|
1623 |
+
if all_page_line_level_ocr_results_with_words_json_file_path not in log_files_output_paths:
|
1624 |
+
log_files_output_paths.append(all_page_line_level_ocr_results_with_words_json_file_path)
|
1625 |
+
|
1626 |
all_pages_decision_process_table = pd.concat(all_pages_decision_process_table_list)
|
1627 |
all_line_level_ocr_results_df = pd.concat(all_line_level_ocr_results_df_list)
|
1628 |
|
1629 |
current_loop_page += 1
|
1630 |
|
1631 |
+
return pymupdf_doc, all_pages_decision_process_table, log_files_output_paths, textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_words
|
1632 |
|
1633 |
# If it's an image file
1634     if is_pdf(file_path) == False:

1661         if textract_json_file_path not in log_files_output_paths:
1662             log_files_output_paths.append(textract_json_file_path)
1663 
1664 +       if text_extraction_method == tesseract_ocr_option:
1665 +           if original_all_page_line_level_ocr_results_with_words != all_page_line_level_ocr_results_with_words:
1666 +               # Write the updated existing local OCR data back to the JSON file
1667 +               with open(all_page_line_level_ocr_results_with_words_json_file_path, 'w') as json_file:
1668 +                   json.dump(all_page_line_level_ocr_results_with_words, json_file, separators=(",", ":")) # separators=(",", ":") writes compact JSON rather than pretty-printed output
1669 +
1670 +           if all_page_line_level_ocr_results_with_words_json_file_path not in log_files_output_paths:
1671 +               log_files_output_paths.append(all_page_line_level_ocr_results_with_words_json_file_path)
1672 +
1673 +
1674         all_pages_decision_process_table = pd.concat(all_pages_decision_process_table_list)
1675         all_line_level_ocr_results_df = pd.concat(all_line_level_ocr_results_df_list)
1676 
1677 +       return pymupdf_doc, all_pages_decision_process_table, log_files_output_paths, textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_words
1678 
1679 |
if text_extraction_method == textract_option:
1680         # Write the updated existing textract data back to the JSON file

1686         if textract_json_file_path not in log_files_output_paths:
1687             log_files_output_paths.append(textract_json_file_path)
1688 
1689 +   if text_extraction_method == tesseract_ocr_option:
1690 +       if original_all_page_line_level_ocr_results_with_words != all_page_line_level_ocr_results_with_words:
1691 +           # Write the updated existing local OCR data back to the JSON file
1692 +           with open(all_page_line_level_ocr_results_with_words_json_file_path, 'w') as json_file:
1693 +               json.dump(all_page_line_level_ocr_results_with_words, json_file, separators=(",", ":")) # separators=(",", ":") writes compact JSON rather than pretty-printed output
1694 +
1695 +       if all_page_line_level_ocr_results_with_words_json_file_path not in log_files_output_paths:
1696 +           log_files_output_paths.append(all_page_line_level_ocr_results_with_words_json_file_path)
1697 +
1698     all_pages_decision_process_table = pd.concat(all_pages_decision_process_table_list)
1699     all_line_level_ocr_results_df = pd.concat(all_line_level_ocr_results_df_list)
1700 
1701 +   # Convert decision table and OCR results to relative coordinates
1702     all_pages_decision_process_table = divide_coordinates_by_page_sizes(all_pages_decision_process_table, page_sizes_df, xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax")
1703 
1704     all_line_level_ocr_results_df = divide_coordinates_by_page_sizes(all_line_level_ocr_results_df, page_sizes_df, xmin="left", xmax="width", ymin="top", ymax="height")
1705 
1706 +   return pymupdf_doc, all_pages_decision_process_table, log_files_output_paths, textract_request_metadata, annotations_all_pages, current_loop_page, page_break_return, all_line_level_ocr_results_df, comprehend_query_number, all_page_line_level_ocr_results, all_page_line_level_ocr_results_with_words
|
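`divide_coordinates_by_page_sizes` (defined in the repo's file conversion tools, not shown in this diff) normalises absolute pixel coordinates into 0-1 relative coordinates using each page's image dimensions. A minimal sketch of that behaviour, assuming the column layout seen at the call sites above — the real implementation may differ:

```python
import pandas as pd


def divide_coordinates_by_page_sizes(df: pd.DataFrame, page_sizes_df: pd.DataFrame,
                                     xmin: str = "xmin", xmax: str = "xmax",
                                     ymin: str = "ymin", ymax: str = "ymax") -> pd.DataFrame:
    # Join each row to its page's image dimensions, then divide so that
    # coordinates become fractions of the page width/height (0-1 range)
    out = df.merge(page_sizes_df[["page", "image_width", "image_height"]],
                   on="page", how="left")
    out[xmin] = out[xmin] / out["image_width"]
    out[xmax] = out[xmax] / out["image_width"]
    out[ymin] = out[ymin] / out["image_height"]
    out[ymax] = out[ymax] / out["image_height"]
    return out.drop(columns=["image_width", "image_height"])
```

Relative coordinates survive re-rendering at a different DPI, which is why the decision table and OCR results are converted before being returned.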
1709 |
###

1717             for line in text_container
1718             if isinstance(line, LTTextLine) or isinstance(line, LTTextLineHorizontal)
1719             for char in line]
1720 
1721         return characters
1722     return []
|
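The truncated comprehension above flattens pdfminer text containers into a list of characters. A self-contained sketch of the same pattern, using stand-in classes in place of pdfminer's `LTTextLineHorizontal`/`LTChar` (assumed shapes — the real types come from `pdfminer.layout`):

```python
class LTChar:
    """Stand-in for pdfminer.layout.LTChar."""
    def __init__(self, ch: str):
        self.ch = ch


class LTTextLineHorizontal(list):
    """Stand-in for pdfminer's horizontal text-line container."""


def get_characters(text_container) -> list:
    # Flatten every character of every text line, skipping non-line objects,
    # mirroring the nested list comprehension in the diff
    characters = [char
                  for line in text_container
                  if isinstance(line, LTTextLineHorizontal)
                  for char in line]
    if characters:
        return characters
    return []
```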
|
|
1846 |
analysed_bounding_boxes_df_new = pd.concat([analysed_bounding_boxes_df_new, analysed_bounding_boxes_df_text], axis = 1)
1847     analysed_bounding_boxes_df_new['page'] = page_num + 1
1848 

1849     decision_process_table = pd.concat([decision_process_table, analysed_bounding_boxes_df_new], axis = 0).drop('result', axis=1)
1850 
1851     return decision_process_table

1853 def create_pikepdf_annotations_for_bounding_boxes(analysed_bounding_boxes):
1854     pikepdf_redaction_annotations_on_page = []
1855     for analysed_bounding_box in analysed_bounding_boxes:
1856 
1857         bounding_box = analysed_bounding_box["boundingBox"]
1858         annotation = Dictionary(

2077             pass
2078             #print("Not redacting page:", page_no)
2079 
2080 
2081     # Join extracted text outputs for all lines together
2082     if not page_text_ocr_outputs.empty:
tools/helper_functions.py
CHANGED
@@ -9,7 +9,7 @@ import unicodedata
9   from typing import List
10  from math import ceil
11  from gradio_image_annotation import image_annotator
12  -from tools.config import CUSTOM_HEADER_VALUE, CUSTOM_HEADER, OUTPUT_FOLDER, INPUT_FOLDER, SESSION_OUTPUT_FOLDER, AWS_USER_POOL_ID,
13  
14  # Names for options labels
15  text_ocr_option = "Local model - selectable text"
@@ -39,6 +39,12 @@ def reset_ocr_results_state():
39  def reset_review_vars():
40      return pd.DataFrame(), pd.DataFrame()
41  
42  def load_in_default_allow_list(allow_list_file_path):
43      if isinstance(allow_list_file_path, str):
44          allow_list_file_path = [allow_list_file_path]
@@ -201,9 +207,6 @@ def put_columns_in_df(in_file:List[str]):
201         df = pd.read_excel(file_name, sheet_name=sheet_name)
202 
203         # Process the DataFrame (e.g., print its contents)
204 -       print(f"Sheet Name: {sheet_name}")
205 -       print(df.head()) # Print the first few rows
206 -
207         new_choices.extend(list(df.columns))
208 
209     all_sheet_names.extend(new_sheet_names)
@@ -226,7 +229,17 @@ def check_for_existing_textract_file(doc_file_name_no_extension_textbox:str, out
226     textract_output_path = os.path.join(output_folder, doc_file_name_no_extension_textbox + "_textract.json")
227 
228     if os.path.exists(textract_output_path):
229 -       print("Existing Textract file found.")
230         return True
231 
232     else:
@@ -306,8 +319,8 @@ async def get_connection_params(request: gr.Request,
306     output_folder_textbox:str=OUTPUT_FOLDER,
307     input_folder_textbox:str=INPUT_FOLDER,
308     session_output_folder:str=SESSION_OUTPUT_FOLDER,
309 -   textract_document_upload_input_folder:str=
310 -   textract_document_upload_output_folder:str=
311     s3_textract_document_logs_subfolder:str=TEXTRACT_JOBS_S3_LOC,
312     local_textract_document_logs_subfolder:str=TEXTRACT_JOBS_LOCAL_LOC):
313 
@@ -477,9 +490,10 @@ def calculate_time_taken(number_of_pages:str,
477     pii_identification_method:str,
478     textract_output_found_checkbox:bool,
479     only_extract_text_radio:bool,
480     convert_page_time:float=0.5,
481 -   textract_page_time:float=1,
482 -   comprehend_page_time:float=1,
483     local_text_extraction_page_time:float=0.3,
484     local_pii_redaction_page_time:float=0.5,
485     local_ocr_extraction_page_time:float=1.5,
@@ -494,7 +508,9 @@ def calculate_time_taken(number_of_pages:str,
494     - number_of_pages: The number of pages in the uploaded document(s).
495     - text_extract_method_radio: The method of text extraction.
496     - pii_identification_method_drop: The method of personally-identifiable information removal.
497     - only_extract_text_radio (bool, optional): Option to only extract text from the document rather than redact.
498     - textract_page_time (float, optional): Approximate time to query AWS Textract.
499     - comprehend_page_time (float, optional): Approximate time to query text on a page with AWS Comprehend.
500     - local_text_redaction_page_time (float, optional): Approximate time to extract text on a page with the local text redaction option.
@@ -522,7 +538,8 @@ def calculate_time_taken(number_of_pages:str,
522     if textract_output_found_checkbox != True:
523         page_extraction_time_taken = number_of_pages * textract_page_time
524     elif text_extract_method_radio == local_ocr_option:
525 -
526     elif text_extract_method_radio == text_ocr_option:
527         page_conversion_time_taken = number_of_pages * local_text_extraction_page_time
528 
9 |
from typing import List
10  from math import ceil
11  from gradio_image_annotation import image_annotator
12  +from tools.config import CUSTOM_HEADER_VALUE, CUSTOM_HEADER, OUTPUT_FOLDER, INPUT_FOLDER, SESSION_OUTPUT_FOLDER, AWS_USER_POOL_ID, TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER, TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_OUTPUT_SUBFOLDER, TEXTRACT_JOBS_S3_LOC, TEXTRACT_JOBS_LOCAL_LOC
13  
14  # Names for options labels
15  text_ocr_option = "Local model - selectable text"

39  def reset_review_vars():
40      return pd.DataFrame(), pd.DataFrame()
41  
42  +def reset_data_vars():
43  +    return 0, [], 0
44  +
45  +def reset_aws_call_vars():
46  +    return 0, 0
47  +
48  def load_in_default_allow_list(allow_list_file_path):
49      if isinstance(allow_list_file_path, str):
50          allow_list_file_path = [allow_list_file_path]

207         df = pd.read_excel(file_name, sheet_name=sheet_name)
208 
209         # Process the DataFrame (e.g., print its contents)
210         new_choices.extend(list(df.columns))
211 
212     all_sheet_names.extend(new_sheet_names)

229     textract_output_path = os.path.join(output_folder, doc_file_name_no_extension_textbox + "_textract.json")
230 
231     if os.path.exists(textract_output_path):
232 +       print("Existing Textract analysis output file found.")
233 +       return True
234 +
235 +   else:
236 +       return False
237 +
238 +def check_for_existing_local_ocr_file(doc_file_name_no_extension_textbox:str, output_folder:str=OUTPUT_FOLDER):
239 +    local_ocr_output_path = os.path.join(output_folder, doc_file_name_no_extension_textbox + "_ocr_results_with_words.json")
240 +
241 +    if os.path.exists(local_ocr_output_path):
242 +        print("Existing local OCR analysis output file found.")
243         return True
244 
245     else:
|
319 |
output_folder_textbox:str=OUTPUT_FOLDER,
320     input_folder_textbox:str=INPUT_FOLDER,
321     session_output_folder:str=SESSION_OUTPUT_FOLDER,
322 +   textract_document_upload_input_folder:str=TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER,
323 +   textract_document_upload_output_folder:str=TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_OUTPUT_SUBFOLDER,
324     s3_textract_document_logs_subfolder:str=TEXTRACT_JOBS_S3_LOC,
325     local_textract_document_logs_subfolder:str=TEXTRACT_JOBS_LOCAL_LOC):
326 

490     pii_identification_method:str,
491     textract_output_found_checkbox:bool,
492     only_extract_text_radio:bool,
493 +   local_ocr_output_found_checkbox:bool,
494     convert_page_time:float=0.5,
495 +   textract_page_time:float=1.2,
496 +   comprehend_page_time:float=1.2,
497     local_text_extraction_page_time:float=0.3,
498     local_pii_redaction_page_time:float=0.5,
499     local_ocr_extraction_page_time:float=1.5,

508     - number_of_pages: The number of pages in the uploaded document(s).
509     - text_extract_method_radio: The method of text extraction.
510     - pii_identification_method_drop: The method of personally-identifiable information removal.
511 +   - textract_output_found_checkbox (bool, optional): Boolean indicating if AWS Textract text extraction outputs have been found.
512     - only_extract_text_radio (bool, optional): Option to only extract text from the document rather than redact.
513 +   - local_ocr_output_found_checkbox (bool, optional): Boolean indicating if local OCR text extraction outputs have been found.
514     - textract_page_time (float, optional): Approximate time to query AWS Textract.
515     - comprehend_page_time (float, optional): Approximate time to query text on a page with AWS Comprehend.
516     - local_text_redaction_page_time (float, optional): Approximate time to extract text on a page with the local text redaction option.

538     if textract_output_found_checkbox != True:
539         page_extraction_time_taken = number_of_pages * textract_page_time
540     elif text_extract_method_radio == local_ocr_option:
541 +       if local_ocr_output_found_checkbox != True:
542 +           page_extraction_time_taken = number_of_pages * local_ocr_extraction_page_time
543     elif text_extract_method_radio == text_ocr_option:
544         page_conversion_time_taken = number_of_pages * local_text_extraction_page_time
545 
|
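The time estimate above only charges the per-page OCR cost when no cached output was found. A minimal sketch of that branch — the function name and the exact option-label string are assumptions; the per-page constant follows the diff:

```python
# Placeholder label; the real string lives in tools/helper_functions.py
local_ocr_option = "Local OCR model"


def estimate_ocr_extraction_time(number_of_pages: int,
                                 text_extract_method_radio: str,
                                 local_ocr_output_found_checkbox: bool,
                                 local_ocr_extraction_page_time: float = 1.5) -> float:
    # Charge the per-page OCR cost only when no cached
    # *_ocr_results_with_words.json output was found for this document
    page_extraction_time_taken = 0.0
    if text_extract_method_radio == local_ocr_option:
        if local_ocr_output_found_checkbox != True:
            page_extraction_time_taken = number_of_pages * local_ocr_extraction_page_time
    return page_extraction_time_taken
```

So a 10-page document estimated at 1.5 s/page costs 15 s on a first run and 0 s when the cached OCR output is reused.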
tools/redaction_review.py
CHANGED
@@ -6,12 +6,11 @@ import numpy as np
6   from xml.etree.ElementTree import Element, SubElement, tostring, parse
7   from xml.dom import minidom
8   import uuid
9   -from typing import List
10  from gradio_image_annotation import image_annotator
11  from gradio_image_annotation.image_annotator import AnnotatedImageData
12  from pymupdf import Document, Rect
13  import pymupdf
14  -#from fitz
15  from PIL import ImageDraw, Image
16  
17  from tools.config import OUTPUT_FOLDER, CUSTOM_BOX_COLOUR, MAX_IMAGE_PIXELS, INPUT_FOLDER
@@ -55,7 +54,6 @@ def update_zoom(current_zoom_level:int, annotate_current_page:int, decrease:bool
55  
56      return current_zoom_level, annotate_current_page
57  
58  -
59  def update_dropdown_list_based_on_dataframe(df:pd.DataFrame, column:str) -> List["str"]:
60      '''
61      Gather unique elements from a string pandas Series, then append 'ALL' to the start and return the list.
@@ -166,49 +164,205 @@ def update_recogniser_dataframes(page_image_annotator_object:AnnotatedImageData,
166 
167     return recogniser_entities_list, recogniser_dataframe_out_gr, recogniser_dataframe_out, recogniser_entities_drop, text_entities_drop, page_entities_drop
168 
169 -def undo_last_removal(backup_review_state, backup_image_annotations_state, backup_recogniser_entity_dataframe_base):
170     return backup_review_state, backup_image_annotations_state, backup_recogniser_entity_dataframe_base
171 
172 -def update_annotator_page_from_review_df(
173 -
174 -
175 -
176 -
177 -
178 -
179     '''
180 -   Update the visible annotation object with the latest review file information
181     '''
182 -
183 -
184 -
185 
186     if not review_df.empty:
187 -       #
188 -       #
189 -       if
190 
191 -
192 -       if gradio_annotator_current_page_number > 0: page_num_reported = gradio_annotator_current_page_number
193 -       elif gradio_annotator_current_page_number == 0: page_num_reported = 1 # minimum possible reported page is 1
194 -       else:
195 -           gradio_annotator_current_page_number = 0
196 -           page_num_reported = 1
197 
198 -
199 -       page_max_reported = len(out_image_annotations_state)
200 -       if page_num_reported > page_max_reported: page_num_reported = page_max_reported
201 
202 -
203 -
204 
205 -
206 
207 -
208 
209 -
210 
211 -
212 
213 def exclude_selected_items_from_redaction(review_df: pd.DataFrame,
214     selected_rows_df: pd.DataFrame,
@@ -216,7 +370,7 @@ def exclude_selected_items_from_redaction(review_df: pd.DataFrame,
216     page_sizes:List[dict],
217     image_annotations_state:dict,
218     recogniser_entity_dataframe_base:pd.DataFrame):
219 -   '''
220     Remove selected items from the review dataframe from the annotation object and review dataframe.
221     '''
222 
@@ -253,149 +407,267 @@ def exclude_selected_items_from_redaction(review_df: pd.DataFrame,
253 
254     return out_review_df, out_image_annotations_state, out_recogniser_entity_dataframe_base, backup_review_state, backup_image_annotations_state, backup_recogniser_entity_dataframe_base
255 
256 -def update_annotator_object_and_filter_df(
257 -
258 -
259 -
260 -
261 -   text_dropdown_value:str="ALL",
262 -   recogniser_dataframe_base:gr.Dataframe=gr.Dataframe(pd.DataFrame(data={"page":[], "label":[], "text":[], "id":[]}), type="pandas", headers=["page", "label", "text", "id"], show_fullscreen_button=True, wrap=True, show_search='filter', max_height=400, static_columns=[0,1,2,3]),
263 -   zoom:int=100,
264 -   review_df:pd.DataFrame=[],
265 -   page_sizes:List[dict]=[],
266 -   doc_full_file_name_textbox:str='',
267 -   input_folder:str=INPUT_FOLDER):
268 -   '''
269 -   Update a gradio_image_annotation object with new annotation data.
270 -   '''
271 -   zoom_str = str(zoom) + '%'
272 -
273 -   #print("all_image_annotations at start of update_annotator_object_and_filter_df[-1]:", all_image_annotations[-1])
274 -
275 -   if not gradio_annotator_current_page_number: gradio_annotator_current_page_number = 0
276 -
277 -   # Check bounding values for current page and page max
278 -   if gradio_annotator_current_page_number > 0: page_num_reported = gradio_annotator_current_page_number
279 -   elif gradio_annotator_current_page_number == 0: page_num_reported = 1 # minimum possible reported page is 1
280 -   else:
281 -       gradio_annotator_current_page_number = 0
282 -       page_num_reported = 1
283 
284 -
285 -
286 -
287 
288 -
289 
290 -
291 -
292 
293 -
294 -
295 -
296 
297 -
298 -   page_sizes_df = pd.DataFrame(page_sizes)
299 
300 -
301 
302 -
303 
304 -
305 -   page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"] = width
306 -   page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"] = height
307 -   page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_path"] = replaced_image_path
308 -
309 -   else:
310 -       if not page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"].isnull().all():
311 -           width = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"].max()
312 -           height = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"].max()
313 -       else:
314 -           image = Image.open(current_image_path)
315 -           width = image.width
316 -           height = image.height
317 
318     page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"] = width
319     page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"] = height
320 
321 -
322 
323 -
324 
325 -
326 -
327 
328 -
329 -
330 -
331 -
332 
333 -
334 -
335 
336 -
337 -
338 
339 -
340 
341 -
342 -
343 
344 -
345 
346 -   #print("all_image_annotations_df[-1] just before creating annotation dicts:", all_image_annotations_df.iloc[-1, :])
347 
348 -
349 
350 -   #print("all_image_annotations[-1] after creating annotation dicts:", all_image_annotations[-1])
351 
352 
353 
354 -
355 -
356 
357 -   current_page_image_annotator_object = all_image_annotations[page_num_reported_zero_indexed]
358 
359 -   #
360 
361 -   page_number_reported_gradio = gr.Number(label = "Current page", value=page_num_reported, precision=0)
362 
363 -   ###
364 -   # If no data, present a blank page
365 -   if not all_image_annotations:
366 -       print("No all_image_annotation object found")
367 -       page_num_reported = 1
368 
369 -
370 -
371 -
372 -
373 -
374 -
375 -
376 -       height=zoom_str,
377 -       width=zoom_str,
378 -       box_min_size=1,
379 -       box_selected_thickness=2,
380 -       handle_size=4,
381 -       sources=None,#["upload"],
382 -       show_clear_button=False,
383 -       show_share_button=False,
384 -       show_remove_button=False,
385 -       handles_cursor=True,
386 -       interactive=True,
387 -       use_default_label=True
388 -   )
389 -
390 -   return out_image_annotator, page_number_reported_gradio, page_number_reported_gradio, page_num_reported, recogniser_entities_dropdown_value, recogniser_dataframe_out_gr, recogniser_dataframe_modified, text_entities_drop, page_entities_drop, page_sizes, all_image_annotations
391 -
392     else:
393 -       ### Present image_annotator outputs
394     out_image_annotator = image_annotator(
395         value = current_page_image_annotator_object,
396         boxes_alpha=0.1,
397         box_thickness=1,
398 -       label_list=recogniser_entities_list,
399     label_colors=recogniser_colour_list,
400     show_label=False,
401     height=zoom_str,
@@ -408,41 +680,23 @@ def update_annotator_object_and_filter_df(
408     show_share_button=False,
409     show_remove_button=False,
410     handles_cursor=True,
411 -   interactive=True
412     )
413 
414 -   #
415 -   #
416 -
417 -   return out_image_annotator,
418 -
419 -
420 -
421 -
422 -
423 -
424 -
425 -
426 -
427 -
428 -
429 -   page_zero_index = page - 1
430 -
431 -   if isinstance(all_image_annotations[page_zero_index]["image"], np.ndarray) or "placeholder_image" in all_image_annotations[page_zero_index]["image"] or isinstance(page_image_annotator_object['image'], np.ndarray):
432 -       page_sizes_df = pd.DataFrame(page_sizes)
433 -       page_sizes_df[["page"]] = page_sizes_df[["page"]].apply(pd.to_numeric, errors="coerce")
434 -
435 -       # Check for matching pages
436 -       matching_paths = page_sizes_df.loc[page_sizes_df['page'] == page, "image_path"].unique()
437 -
438 -       if matching_paths.size > 0:
439 -           image_path = matching_paths[0]
440 -           page_image_annotator_object['image'] = image_path
441 -           all_image_annotations[page_zero_index]["image"] = image_path
442 -       else:
443 -           print(f"No image path found for page {page}.")
444 -
445 -   return page_image_annotator_object, all_image_annotations
446 
447 def update_all_page_annotation_object_based_on_previous_page(
448     page_image_annotator_object:AnnotatedImageData,
@@ -459,12 +713,9 @@ def update_all_page_annotation_object_based_on_previous_page(
459     previous_page_zero_index = previous_page -1
460 
461     if not current_page: current_page = 1
462 -
463 -   #
464 -
465 -   page_image_annotator_object, all_image_annotations = replace_images_in_image_annotation_object(all_image_annotations, page_image_annotator_object, page_sizes, previous_page)
466 -
467 -   #print("page_image_annotator_object after replace_images in update_all_page_annotation_object:", page_image_annotator_object)
468 
469     if clear_all == False: all_image_annotations[previous_page_zero_index] = page_image_annotator_object
470     else: all_image_annotations[previous_page_zero_index]["boxes"] = []
@@ -493,7 +744,7 @@ def apply_redactions_to_review_df_and_files(page_image_annotator_object:Annotate
493     page_image_annotator_object = all_image_annotations[current_page - 1]
494 
495     # This replaces the numpy array image object with the image file path
496 -   page_image_annotator_object, all_image_annotations =
497     page_image_annotator_object['image'] = all_image_annotations[current_page - 1]["image"]
498 
499     if not page_image_annotator_object:
@@ -529,7 +780,7 @@ def apply_redactions_to_review_df_and_files(page_image_annotator_object:Annotate
529     # Check if all elements are integers in the range 0-255
530     if all(isinstance(c, int) and 0 <= c <= 255 for c in fill):
531         pass
532 -
533     else:
534         print(f"Invalid color values: {fill}. Defaulting to black.")
535         fill = (0, 0, 0) # Default to black if invalid
@@ -553,7 +804,6 @@ def apply_redactions_to_review_df_and_files(page_image_annotator_object:Annotate
553     doc = [image]
554 
555     elif file_extension in '.csv':
556 -   #print("This is a csv")
557     pdf_doc = []
558 
559     # If working with pdfs
@@ -797,11 +1047,9 @@ def df_select_callback(df: pd.DataFrame, evt: gr.SelectData):
797 
798     row_value_df = pd.DataFrame(data={"page":[row_value_page], "label":[row_value_label], "text":[row_value_text], "id":[row_value_id]})
799 
800 -   return
801 
802 def df_select_callback_textract_api(df: pd.DataFrame, evt: gr.SelectData):
803 -
804 -   #print("evt.data:", evt._data)
805 
806     row_value_job_id = evt.row_value[0] # This is the job ID value
807     # row_value_label = evt.row_value[1] # This is the label number value
@@ -829,59 +1077,108 @@ def df_select_callback_ocr(df: pd.DataFrame, evt: gr.SelectData):
829 
830     return row_value_page, row_value_df
831 
832 -def update_selected_review_df_row_colour(
833     '''
834     Update the colour of a single redaction box based on the values in a selection row
835     '''
836 -   colour_tuple = str(tuple(colour))
837 
838 -
839     if "id" not in review_df.columns:
840 -
841 
842 -   # Reset existing highlight colours
843 -   review_df.loc[review_df["id"]==previous_id, "color"] = review_df.loc[review_df["id"]==previous_id, "color"].apply(lambda _: previous_colour)
844 -   review_df.loc[review_df["color"].astype(str)==colour, "color"] = review_df.loc[review_df["color"].astype(str)==colour, "color"].apply(lambda _: '(0, 0, 0)')
845 
846     if not redaction_row_selection.empty and not review_df.empty:
847         use_id = (
848 -           "id" in redaction_row_selection.columns
849 -           and "id" in review_df.columns
850 -           and not redaction_row_selection["id"].isnull().all()
851             and not review_df["id"].isnull().all()
852         )
853 
854 -       selected_merge_cols = ["id"] if use_id else ["label", "page", "text"]
855 
856 -
857 
858 -
859 -
860 -
861 -
862 
863 -       if not filtered_reviews.empty:
864 -           previous_colour = str(filtered_reviews["color"].values[0])
865 -           previous_id = filtered_reviews["id"].values[0]
866 -           review_df.loc[review_df["_merge"]=="both", "color"] = review_df.loc[review_df["_merge"] == "both", "color"].apply(lambda _: colour)
867     else:
868 -
869 -
870 -
871 -
872 -           previous_id =''
873 
874 -   review_df.drop("_merge", axis=1, inplace=True)
875 
876 -   # Ensure
877 -   #
878 -
879 -
880 -
881 -
882 -   #print("review_df after divide:", review_df)
883 
884 -   review_df = review_df[["image", "page", "label", "color", "xmin","ymin", "xmax", "ymax", "text", "id"]]
885 
886     return review_df, previous_id, previous_colour
887 
|
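The highlight logic above matches the selected row against the review dataframe; `merge(..., indicator=True)` adds a `_merge` column marking which rows matched, so only those rows get recoloured before the column is dropped again. A small sketch of that step with toy data (column names follow the diff):

```python
import pandas as pd

# All boxes start black, in the "(R, G, B)" string form used by the review file
review_df = pd.DataFrame({"id": ["a", "b", "c"], "color": ["(0, 0, 0)"] * 3})
# The row the user clicked in the review table
selection = pd.DataFrame({"id": ["b"]})

# indicator=True adds a "_merge" column: "both" for matched rows,
# "left_only" for the rest
merged = review_df.merge(selection, on="id", how="left", indicator=True)
merged.loc[merged["_merge"] == "both", "color"] = "(255, 0, 0)"
merged = merged.drop("_merge", axis=1)
```

When the selection lacks usable `id` values, the function falls back to merging on `["label", "page", "text"]` instead, as the `use_id` flag shows.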
@@ -988,8 +1285,6 @@ def create_xfdf(review_file_df:pd.DataFrame, pdf_path:str, pymupdf_doc:object, i
988     page_sizes_df = pd.DataFrame(page_sizes)
989 
990     # If there are no image coordinates, then convert coordinates to pymupdf coordinates prior to export
991 -   #print("Using pymupdf coordinates for conversion.")
992 -
993     pages_are_images = False
994 
995     if "mediabox_width" not in review_file_df.columns:
@@ -1041,33 +1336,9 @@ def create_xfdf(review_file_df:pd.DataFrame, pdf_path:str, pymupdf_doc:object, i
1041     raise ValueError(f"Invalid cropbox format: {document_cropboxes[page_python_format]}")
1042     else:
1043         print("Document cropboxes not found.")
1044 -
1045 
1046     pdf_page_height = pymupdf_page.mediabox.height
1047 -   pdf_page_width = pymupdf_page.mediabox.width
1048 -
1049 -   # Check if image dimensions for page exist in page_sizes_df
1050 -   # image_dimensions = {}
1051 -
1052 -   # image_dimensions['image_width'] = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"].max()
1053 -   # image_dimensions['image_height'] = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"].max()
1054 -
1055 -   # if pd.isna(image_dimensions['image_width']):
1056 -   #     image_dimensions = {}
1057 -
1058 -   # image = image_paths[page_python_format]
1059 -
1060 -   # if image_dimensions:
1061 -   #     image_page_width, image_page_height = image_dimensions["image_width"], image_dimensions["image_height"]
1062 -   # if isinstance(image, str) and 'placeholder' not in image:
1063 -   #     image = Image.open(image)
1064 -   #     image_page_width, image_page_height = image.size
1065 -   # else:
1066 -   #     try:
1067 -   #         image = Image.open(image)
1068 -   #         image_page_width, image_page_height = image.size
1069 -   #     except Exception as e:
1070 -   #         print("Could not get image sizes due to:", e)
1071 
1072     # Create redaction annotation
1073     redact_annot = SubElement(annots, 'redact')
@@ -1345,8 +1616,6 @@ def convert_xfdf_to_dataframe(file_paths_list:List[str], pymupdf_doc, image_path
1345     # Optionally, you can add the image path or other relevant information
1346     df.loc[_, 'image'] = image_path
1347 
1348 -   #print('row:', row)
1349 -
1350     out_file_path = output_folder + file_path_name + "_review_file.csv"
1351     df.to_csv(out_file_path, index=None)
1352 
6 |
from xml.etree.ElementTree import Element, SubElement, tostring, parse
|
7 |
from xml.dom import minidom
|
8 |
import uuid
|
9 |
+
from typing import List, Tuple
|
10 |
from gradio_image_annotation import image_annotator
|
11 |
from gradio_image_annotation.image_annotator import AnnotatedImageData
|
12 |
from pymupdf import Document, Rect
|
13 |
import pymupdf
|
|
|
14 |
from PIL import ImageDraw, Image
|
15 |
|
16 |
from tools.config import OUTPUT_FOLDER, CUSTOM_BOX_COLOUR, MAX_IMAGE_PIXELS, INPUT_FOLDER
|
|
|
54 |
|
55 |
return current_zoom_level, annotate_current_page
|
56 |
|
|
|
57 |
def update_dropdown_list_based_on_dataframe(df:pd.DataFrame, column:str) -> List["str"]:
|
58 |
'''
|
59 |
Gather unique elements from a string pandas Series, then append 'ALL' to the start and return the list.
|
|
|
164 |
|
165 |
return recogniser_entities_list, recogniser_dataframe_out_gr, recogniser_dataframe_out, recogniser_entities_drop, text_entities_drop, page_entities_drop
|
166 |
|
167 |
+
def undo_last_removal(backup_review_state:pd.DataFrame, backup_image_annotations_state:list[dict], backup_recogniser_entity_dataframe_base:pd.DataFrame):
|
168 |
return backup_review_state, backup_image_annotations_state, backup_recogniser_entity_dataframe_base
|
169 |
|
+def update_annotator_page_from_review_df(
+    review_df: pd.DataFrame,
+    image_file_paths:List[str], # Note: this input doesn't appear to be used after the first line was removed
+    page_sizes:List[dict],
+    current_image_annotations_state:List[dict],
+    current_page_annotator:object, # A dict or custom annotation object for one page
+    selected_recogniser_entity_df_row:pd.DataFrame,
+    input_folder:str,
+    doc_full_file_name_textbox:str
+) -> Tuple[object, List[dict], int, List[dict], pd.DataFrame, int]:
    '''
+    Update the visible annotation object and related objects with the latest review file information,
+    optimizing by processing only the current page's data.
    '''
+    out_image_annotations_state: List[dict] = list(current_image_annotations_state) # Copy to avoid modifying the input in place
+    out_current_page_annotator: dict = current_page_annotator
+
+    # Safely extract the target page number from the selected row, handling errors and empty DataFrames
+    gradio_annotator_current_page_number: int = 0
+    annotate_previous_page: int = 0
+    if not selected_recogniser_entity_df_row.empty and 'page' in selected_recogniser_entity_df_row.columns:
+        try:
+            gradio_annotator_current_page_number = int(selected_recogniser_entity_df_row['page'].iloc[0])
+            annotate_previous_page = gradio_annotator_current_page_number # Store the original page number
+        except (IndexError, ValueError, TypeError):
+            print("Warning: Could not extract a valid page number from selected_recogniser_entity_df_row. Defaulting to page 1.")
+            gradio_annotator_current_page_number = 1
+
+    # Ensure the page number is valid and 1-based for external display/logic
+    if gradio_annotator_current_page_number <= 0:
+        gradio_annotator_current_page_number = 1
+
+    page_max_reported = len(out_image_annotations_state)
+    if gradio_annotator_current_page_number > page_max_reported:
+        gradio_annotator_current_page_number = page_max_reported # Cap at the last page
+
+    page_num_reported_zero_indexed = gradio_annotator_current_page_number - 1
+
+    # Process the page sizes DataFrame early, as it is needed for image path handling and coordinate multiplication
+    page_sizes_df = pd.DataFrame(page_sizes)
+    if not page_sizes_df.empty:
+        page_sizes_df["page"] = pd.to_numeric(page_sizes_df["page"], errors="coerce")
+        page_sizes_df.dropna(subset=["page"], inplace=True)
+        if not page_sizes_df.empty:
+            page_sizes_df["page"] = page_sizes_df["page"].astype(int)
+        else:
+            print("Warning: page sizes DataFrame became empty after processing.")

+    # --- OPTIMIZATION: process only the current page's data from review_df ---
    if not review_df.empty:
+        # Ensure the 'page' column in review_df is comparable to the reported page number
+        if 'page' in review_df.columns:
+            review_df['page'] = pd.to_numeric(review_df['page'], errors='coerce').fillna(-1).astype(int)

+            current_image_path = out_image_annotations_state[page_num_reported_zero_indexed]['image']

+            replaced_image_path, page_sizes_df = replace_placeholder_image_with_real_image(doc_full_file_name_textbox, current_image_path, page_sizes_df, gradio_annotator_current_page_number, input_folder)

+            # page_sizes_df has been changed - save it back to the page_sizes object
+            page_sizes = page_sizes_df.to_dict(orient='records')
+            review_df.loc[review_df["page"]==gradio_annotator_current_page_number, 'image'] = replaced_image_path
+            images_list = list(page_sizes_df["image_path"])
+            images_list[page_num_reported_zero_indexed] = replaced_image_path
+            out_image_annotations_state[page_num_reported_zero_indexed]['image'] = replaced_image_path

+            current_page_review_df = review_df[review_df['page'] == gradio_annotator_current_page_number].copy()
+            current_page_review_df = multiply_coordinates_by_page_sizes(current_page_review_df, page_sizes_df)

+        else:
+            print(f"Warning: 'page' column not found in review_df. Cannot filter for page {gradio_annotator_current_page_number}. Skipping update from review_df.")
+            current_page_review_df = pd.DataFrame() # Empty DataFrame if the filter fails
+
+    if not current_page_review_df.empty:
+        # Convert the current page's review data to the annotation list format for this page
+        current_page_annotations_list = []
+        # Expected annotation dict keys; add or remove as needed
+        expected_annotation_keys = ['label', 'color', 'xmin', 'ymin', 'xmax', 'ymax', 'text', 'id']
+
+        # Ensure the necessary columns exist in current_page_review_df before converting rows
+        for key in expected_annotation_keys:
+            if key not in current_page_review_df.columns:
+                # Add the missing column with a default value: np.nan for numeric, '' for string/object
+                default_value = np.nan if key in ['xmin', 'ymin', 'xmax', 'ymax'] else ''
+                current_page_review_df[key] = default_value
+
+        # Convert the filtered DataFrame rows to a list of dicts; .to_dict(orient='records') is efficient for this
+        current_page_annotations_list_raw = current_page_review_df[expected_annotation_keys].to_dict(orient='records')
+
+        current_page_annotations_list = current_page_annotations_list_raw
+
+        # Update the annotations state for the current page. Each entry in out_image_annotations_state
+        # is a dict with keys like 'image' and 'boxes' (List[dict]); find the entry for this page.
+        page_state_entry_found = False
+        for i, page_state_entry in enumerate(out_image_annotations_state):
+            # Derive the page number from the image filename
+            match = re.search(r"(\d+)\.png$", page_state_entry['image'])
+            if match: page_no = int(match.group(1))
+            else: page_no = -1
+
+            if 'image' in page_state_entry and page_no == page_num_reported_zero_indexed:
+                # Replace the annotations list for this page with the new list from review_df
+                out_image_annotations_state[i]['boxes'] = current_page_annotations_list
+
+                # Update the image path as well, based on review_df if available
+                if 'image' in current_page_review_df.columns and not current_page_review_df.empty:
+                    # Use the image path from the first row of the filtered review_df
+                    out_image_annotations_state[i]['image'] = current_page_review_df['image'].iloc[0]
+                page_state_entry_found = True
+                break
+
+        if not page_state_entry_found:
+            # This can happen if current_image_annotations_state did not initially contain an entry
+            # for this page number; the state is expected to be pre-populated for all pages.
+            print(f"Warning: Entry for page {gradio_annotator_current_page_number} not found in current_image_annotations_state. Cannot update page annotations.")
+
+    # --- Image path and page size handling for the current page ---
+    # Get the image path for the current page from the updated state, checking the entry exists
+    current_image_path = None
+    if len(out_image_annotations_state) > page_num_reported_zero_indexed and 'image' in out_image_annotations_state[page_num_reported_zero_indexed]:
+        current_image_path = out_image_annotations_state[page_num_reported_zero_indexed]['image']
+    else:
+        print(f"Warning: Could not get image path from state for page index {page_num_reported_zero_indexed}.")
+
+    # Replace the placeholder image with the real image path if needed
+    if current_image_path and not page_sizes_df.empty:
+        try:
+            replaced_image_path, page_sizes_df = replace_placeholder_image_with_real_image(
+                doc_full_file_name_textbox, current_image_path, page_sizes_df,
+                gradio_annotator_current_page_number, input_folder # Use the 1-based page number
+            )

+            # Update the state and review_df with the potentially replaced image path
+            if len(out_image_annotations_state) > page_num_reported_zero_indexed:
+                out_image_annotations_state[page_num_reported_zero_indexed]['image'] = replaced_image_path

+            if 'page' in review_df.columns and 'image' in review_df.columns:
+                review_df.loc[review_df["page"]==gradio_annotator_current_page_number, 'image'] = replaced_image_path
+
+        except Exception as e:
+            print(f"Error during image path replacement for page {gradio_annotator_current_page_number}: {e}")
+
+    # Save page_sizes_df back to the page_sizes list format
+    if not page_sizes_df.empty:
+        page_sizes = page_sizes_df.to_dict(orient='records')
+    else:
+        page_sizes = [] # Ensure page_sizes is a list if the DataFrame is empty
+
+    # The original code multiplied coordinates and removed duplicates across the *entire* document
+    # after converting the full review_df to state. With the optimized approach only one page's
+    # annotations were updated, so only duplicate removal on the raw list-of-dicts state is needed.
+    try:
+        out_image_annotations_state = remove_duplicate_images_with_blank_boxes(out_image_annotations_state)
+    except Exception as e:
+        print(f"Error during duplicate removal: {e}. Proceeding without duplicate removal.")
+
+    # Select the current page's annotation object from the (potentially updated) state
+    if len(out_image_annotations_state) > page_num_reported_zero_indexed:
+        out_current_page_annotator = out_image_annotations_state[page_num_reported_zero_indexed]
+    else:
+        print(f"Warning: Cannot select current page annotator object for index {page_num_reported_zero_indexed}.")
+        out_current_page_annotator = {} # Or None, depending on the expected output type
+
+    # The page number returned as the third value may have been updated by the bounding checks above
+    final_page_number_returned = gradio_annotator_current_page_number
+
+    return (out_current_page_annotator,
+            out_image_annotations_state,
+            final_page_number_returned,
+            page_sizes,
+            review_df, # review_df's 'page' column type may have changed
+            annotate_previous_page) # The original page number from selected_recogniser_entity_df_row

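The per-page optimisation above comes down to coercing the `page` column to integers, filtering the review DataFrame to the current page, and converting just those rows into the annotator's list-of-dicts format. A standalone sketch with made-up data (not the app's real review file):

```python
import pandas as pd

review_df = pd.DataFrame({
    "page": ["1", "1", "2"],  # page values may arrive as strings from CSV
    "label": ["EMAIL", "NAME", "PHONE"],
    "xmin": [0.1, 0.2, 0.3], "ymin": [0.1, 0.2, 0.3],
    "xmax": [0.4, 0.5, 0.6], "ymax": [0.4, 0.5, 0.6],
})

page = 1
# Coerce 'page' to int so the equality comparison cannot silently fail on strings
review_df["page"] = pd.to_numeric(review_df["page"], errors="coerce").fillna(-1).astype(int)
current = review_df[review_df["page"] == page].copy()
boxes = current[["label", "xmin", "ymin", "xmax", "ymax"]].to_dict(orient="records")
```

Only the filtered rows are converted, so the cost scales with the size of one page rather than the whole document.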
def exclude_selected_items_from_redaction(review_df: pd.DataFrame,
                                          selected_rows_df: pd.DataFrame,
                                          page_sizes:List[dict],
                                          image_annotations_state:dict,
                                          recogniser_entity_dataframe_base:pd.DataFrame):
+    '''
    Remove selected items from the review dataframe from the annotation object and review dataframe.
    '''

    return out_review_df, out_image_annotations_state, out_recogniser_entity_dataframe_base, backup_review_state, backup_image_annotations_state, backup_recogniser_entity_dataframe_base

+def replace_annotator_object_img_np_array_with_page_sizes_image_path(
+    all_image_annotations:List[dict],
+    page_image_annotator_object:AnnotatedImageData,
+    page_sizes:List[dict],
+    page:int):
+    '''
+    Check if the image value in an AnnotatedImageData dict is a placeholder or np.array. If it is either of these, replace the value with the file path of the image that should already be loaded into the app for this page.
+    '''
+    page_zero_index = page - 1
+
+    if isinstance(all_image_annotations[page_zero_index]["image"], np.ndarray) or "placeholder_image" in all_image_annotations[page_zero_index]["image"] or isinstance(page_image_annotator_object['image'], np.ndarray):
+        page_sizes_df = pd.DataFrame(page_sizes)
+        page_sizes_df[["page"]] = page_sizes_df[["page"]].apply(pd.to_numeric, errors="coerce")

+        # Check for matching pages
+        matching_paths = page_sizes_df.loc[page_sizes_df['page'] == page, "image_path"].unique()

+        if matching_paths.size > 0:
+            image_path = matching_paths[0]
+            page_image_annotator_object['image'] = image_path
+            all_image_annotations[page_zero_index]["image"] = image_path
+        else:
+            print(f"No image path found for page {page}.")

+    return page_image_annotator_object, all_image_annotations

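The lookup in the function above hinges on coercing the page column to numeric, filtering the page-sizes table by page number, and taking the unique image paths. In isolation, with toy data:

```python
import pandas as pd

page_sizes_df = pd.DataFrame({
    "page": ["1", "2"],  # page numbers may arrive as strings
    "image_path": ["page_1.png", "page_2.png"],
})
# Coerce the page column so numeric comparison works regardless of input type
page_sizes_df[["page"]] = page_sizes_df[["page"]].apply(pd.to_numeric, errors="coerce")

matching = page_sizes_df.loc[page_sizes_df["page"] == 2, "image_path"].unique()
image_path = matching[0] if matching.size > 0 else None
```

`.unique()` guards against duplicate rows for the same page; the first match wins.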
+def replace_placeholder_image_with_real_image(doc_full_file_name_textbox:str, current_image_path:str, page_sizes_df:pd.DataFrame, page_num_reported:int, input_folder:str):
+    '''If the image path is still not valid, load in a new image and overwrite it. Then replace all items in the image annotation object for all pages based on the updated information.'''
+    page_num_reported_zero_indexed = page_num_reported - 1

+    if not os.path.exists(current_image_path):

+        page_num, replaced_image_path, width, height = process_single_page_for_image_conversion(doc_full_file_name_textbox, page_num_reported_zero_indexed, input_folder=input_folder)

+        # Overwrite page_sizes values
        page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"] = width
        page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"] = height
+        page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_path"] = replaced_image_path
+
+    else:
+        if not page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"].isnull().all():
+            width = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"].max()
+            height = page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"].max()
+        else:
+            image = Image.open(current_image_path)
+            width = image.width
+            height = image.height

+        page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_width"] = width
+        page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_height"] = height

+        page_sizes_df.loc[page_sizes_df['page']==page_num_reported, "image_path"] = current_image_path

+        replaced_image_path = current_image_path
+
+    return replaced_image_path, page_sizes_df

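The else-branch above prefers dimensions already recorded in the page-sizes table and falls back to opening the image with PIL only when they are missing. The null check can be sketched on its own (hypothetical helper name, toy data):

```python
from typing import Optional

import numpy as np
import pandas as pd

page_sizes_df = pd.DataFrame({"page": [1, 2], "image_width": [np.nan, 612.0]})

def width_for(page: int) -> Optional[float]:
    col = page_sizes_df.loc[page_sizes_df["page"] == page, "image_width"]
    if not col.isnull().all():
        return col.max()  # a recorded value wins
    return None           # caller falls back to PIL's Image.open(...).width
```

Using `.isnull().all()` rather than a truthiness check matters: a Series of all-NaN values is still "truthy" by length, so only an explicit null test detects the missing dimensions.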
469 |
+
def update_annotator_object_and_filter_df(
|
470 |
+
all_image_annotations:List[AnnotatedImageData],
|
471 |
+
gradio_annotator_current_page_number:int,
|
472 |
+
recogniser_entities_dropdown_value:str="ALL",
|
473 |
+
page_dropdown_value:str="ALL",
|
474 |
+
text_dropdown_value:str="ALL",
|
475 |
+
recogniser_dataframe_base:gr.Dataframe=None, # Simplified default
|
476 |
+
zoom:int=100,
|
477 |
+
review_df:pd.DataFrame=None, # Use None for default empty DataFrame
|
478 |
+
page_sizes:List[dict]=[],
|
479 |
+
doc_full_file_name_textbox:str='',
|
480 |
+
input_folder:str=INPUT_FOLDER
|
481 |
+
) -> Tuple[image_annotator, gr.Number, gr.Number, int, str, gr.Dataframe, pd.DataFrame, List[str], List[str], List[dict], List[AnnotatedImageData]]:
|
482 |
+
'''
|
483 |
+
Update a gradio_image_annotation object with new annotation data for the current page
|
484 |
+
and update filter dataframes, optimizing by processing only the current page's data for display.
|
485 |
+
'''
|
486 |
+
zoom_str = str(zoom) + '%'
|
487 |
+
|
488 |
+
# Handle default empty review_df and recogniser_dataframe_base
|
489 |
+
if review_df is None or not isinstance(review_df, pd.DataFrame):
|
490 |
+
review_df = pd.DataFrame(columns=["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax", "text", "id"])
|
491 |
+
if recogniser_dataframe_base is None: # Create a simple default if None
|
492 |
+
recogniser_dataframe_base = gr.Dataframe(pd.DataFrame(data={"page":[], "label":[], "text":[], "id":[]}))
|
493 |
+
|
494 |
+
|
495 |
+
# Handle empty all_image_annotations state early
|
496 |
+
if not all_image_annotations:
|
497 |
+
print("No all_image_annotation object found")
|
498 |
+
# Return blank/default outputs
|
499 |
+
blank_annotator = gr.ImageAnnotator(
|
500 |
+
value = None, boxes_alpha=0.1, box_thickness=1, label_list=[], label_colors=[],
|
501 |
+
show_label=False, height=zoom_str, width=zoom_str, box_min_size=1,
|
502 |
+
box_selected_thickness=2, handle_size=4, sources=None,
|
503 |
+
show_clear_button=False, show_share_button=False, show_remove_button=False,
|
504 |
+
handles_cursor=True, interactive=True, use_default_label=True
|
505 |
+
)
|
506 |
+
blank_df_out_gr = gr.Dataframe(pd.DataFrame(columns=["page", "label", "text", "id"]))
|
507 |
+
blank_df_modified = pd.DataFrame(columns=["page", "label", "text", "id"])
|
508 |
|
509 |
+
return (blank_annotator, gr.Number(value=1), gr.Number(value=1), 1,
|
510 |
+
recogniser_entities_dropdown_value, blank_df_out_gr, blank_df_modified,
|
511 |
+
[], [], [], []) # Return empty lists/defaults for other outputs
|
512 |
+
|
513 |
+
# Validate and bound the current page number (1-based logic)
|
514 |
+
page_num_reported = max(1, gradio_annotator_current_page_number) # Minimum page is 1
|
515 |
+
page_max_reported = len(all_image_annotations)
|
516 |
+
if page_num_reported > page_max_reported:
|
517 |
+
page_num_reported = page_max_reported
|
518 |
|
519 |
+
page_num_reported_zero_indexed = page_num_reported - 1
|
520 |
+
annotate_previous_page = page_num_reported # Store the determined page number
|
521 |
|
522 |
+
# --- Process page sizes DataFrame ---
|
523 |
+
page_sizes_df = pd.DataFrame(page_sizes)
|
524 |
+
if not page_sizes_df.empty:
|
525 |
+
page_sizes_df["page"] = pd.to_numeric(page_sizes_df["page"], errors="coerce")
|
526 |
+
page_sizes_df.dropna(subset=["page"], inplace=True)
|
527 |
+
if not page_sizes_df.empty:
|
528 |
+
page_sizes_df["page"] = page_sizes_df["page"].astype(int)
|
529 |
+
else:
|
530 |
+
print("Warning: Page sizes DataFrame became empty after processing.")
|
531 |
+
|
532 |
+
# --- Handle Image Path Replacement for the Current Page ---
|
533 |
+
# This modifies the specific page entry within all_image_annotations list
|
534 |
+
# Assuming replace_annotator_object_img_np_array_with_page_sizes_image_path
|
535 |
+
# correctly updates the image path within the list element.
|
536 |
+
if len(all_image_annotations) > page_num_reported_zero_indexed:
|
537 |
+
# Make a shallow copy of the list and deep copy the specific page dict before modification
|
538 |
+
# to avoid modifying the input list unexpectedly if it's used elsewhere.
|
539 |
+
# However, the original code modified the list in place, so we'll stick to that
|
540 |
+
# pattern but acknowledge it.
|
541 |
+
page_object_to_update = all_image_annotations[page_num_reported_zero_indexed]
|
542 |
+
|
543 |
+
# Use the helper function to replace the image path within the page object
|
544 |
+
# Note: This helper returns the potentially modified page_object and the full state.
|
545 |
+
# The full state return seems redundant if only page_object_to_update is modified.
|
546 |
+
# Let's call it and assume it correctly updates the item in the list.
|
547 |
+
updated_page_object, all_image_annotations_after_img_replace = replace_annotator_object_img_np_array_with_page_sizes_image_path(
|
548 |
+
all_image_annotations, page_object_to_update, page_sizes, page_num_reported)
|
549 |
+
|
550 |
+
# The original code immediately re-assigns all_image_annotations.
|
551 |
+
# We'll rely on the function modifying the list element in place or returning the updated list.
|
552 |
+
# Assuming it returns the updated list for robustness:
|
553 |
+
all_image_annotations = all_image_annotations_after_img_replace
|
554 |
+
|
555 |
+
|
556 |
+
# Now handle the actual image file path replacement using replace_placeholder_image_with_real_image
|
557 |
+
current_image_path = updated_page_object.get('image') # Get potentially updated image path
|
558 |
+
|
559 |
+
if current_image_path and not page_sizes_df.empty:
|
560 |
+
try:
|
561 |
+
replaced_image_path, page_sizes_df = replace_placeholder_image_with_real_image(
|
562 |
+
doc_full_file_name_textbox, current_image_path, page_sizes_df,
|
563 |
+
page_num_reported, input_folder=input_folder # Use 1-based page num
|
564 |
+
)
|
565 |
|
566 |
+
# Update the image path in the state and review_df for the current page
|
567 |
+
# Find the correct entry in all_image_annotations list again by index
|
568 |
+
if len(all_image_annotations) > page_num_reported_zero_indexed:
|
569 |
+
all_image_annotations[page_num_reported_zero_indexed]['image'] = replaced_image_path
|
570 |
|
571 |
+
# Update review_df's image path for this page
|
572 |
+
if 'page' in review_df.columns and 'image' in review_df.columns:
|
573 |
+
# Ensure review_df page column is numeric for filtering
|
574 |
+
review_df['page'] = pd.to_numeric(review_df['page'], errors='coerce').fillna(-1).astype(int)
|
575 |
+
review_df.loc[review_df["page"]==page_num_reported, 'image'] = replaced_image_path
|
576 |
|
|
|
577 |
|
578 |
+
except Exception as e:
|
579 |
+
print(f"Error during image path replacement for page {page_num_reported}: {e}")
|
580 |
+
else:
|
581 |
+
print(f"Warning: Page index {page_num_reported_zero_indexed} out of bounds for all_image_annotations list.")
|
582 |
|
|
|
583 |
|
584 |
+
# Save back page_sizes_df to page_sizes list format
|
585 |
+
if not page_sizes_df.empty:
|
586 |
+
page_sizes = page_sizes_df.to_dict(orient='records')
|
587 |
+
else:
|
588 |
+
page_sizes = [] # Ensure page_sizes is a list if df is empty
|
589 |
+
|
590 |
+
# --- OPTIMIZATION: Prepare data *only* for the current page for display ---
|
591 |
+
current_page_image_annotator_object = None
|
592 |
+
if len(all_image_annotations) > page_num_reported_zero_indexed:
|
593 |
+
page_data_for_display = all_image_annotations[page_num_reported_zero_indexed]
|
594 |
+
|
595 |
+
# Convert current page annotations list to DataFrame for coordinate multiplication IF needed
|
596 |
+
# Assuming coordinate multiplication IS needed for display if state stores relative coords
|
597 |
+
current_page_annotations_df = convert_annotation_data_to_dataframe([page_data_for_display])
|
598 |
+
|
599 |
+
|
600 |
+
if not current_page_annotations_df.empty and not page_sizes_df.empty:
|
601 |
+
# Multiply coordinates *only* for this page's DataFrame
|
602 |
+
try:
|
603 |
+
# Need the specific page's size for multiplication
|
604 |
+
page_size_row = page_sizes_df[page_sizes_df['page'] == page_num_reported]
|
605 |
+
if not page_size_row.empty:
|
606 |
+
current_page_annotations_df = multiply_coordinates_by_page_sizes(
|
607 |
+
current_page_annotations_df, page_size_row, # Pass only the row for the current page
|
608 |
+
xmin="xmin", xmax="xmax", ymin="ymin", ymax="ymax"
|
609 |
+
)
|
610 |
+
|
611 |
+
except Exception as e:
|
612 |
+
print(f"Warning: Error during coordinate multiplication for page {page_num_reported}: {e}. Using original coordinates.")
|
613 |
+
# If error, proceed with original coordinates or handle as needed
|
614 |
+
|
615 |
+
if "color" not in current_page_annotations_df.columns:
|
616 |
+
current_page_annotations_df['color'] = '(0, 0, 0)'
|
617 |
+
|
618 |
+
# Convert the processed DataFrame back to the list of dicts format for the annotator
|
619 |
+
processed_current_page_annotations_list = current_page_annotations_df[["xmin", "xmax", "ymin", "ymax", "label", "color", "text", "id"]].to_dict(orient='records')
|
620 |
+
|
621 |
+
# Construct the final object expected by the Gradio ImageAnnotator value parameter
|
622 |
+
current_page_image_annotator_object: AnnotatedImageData = {
|
623 |
+
'image': page_data_for_display.get('image'), # Use the (potentially updated) image path
|
624 |
+
'boxes': processed_current_page_annotations_list
|
625 |
+
}
|
626 |
|
627 |
+
# --- Update Dropdowns and Review DataFrame ---
|
628 |
+
# This external function still operates on potentially large DataFrames.
|
629 |
+
# It receives all_image_annotations and a copy of review_df.
|
630 |
+
try:
|
631 |
+
recogniser_entities_list, recogniser_dataframe_out_gr, recogniser_dataframe_modified, recogniser_entities_dropdown_value, text_entities_drop, page_entities_drop = update_recogniser_dataframes(
|
632 |
+
all_image_annotations, # Pass the updated full state
|
633 |
+
recogniser_dataframe_base,
|
634 |
+
recogniser_entities_dropdown_value,
|
635 |
+
text_dropdown_value,
|
636 |
+
page_dropdown_value,
|
637 |
+
review_df.copy(), # Keep the copy as per original function call
|
638 |
+
page_sizes # Pass updated page sizes
|
639 |
+
)
|
640 |
+
# Generate default black colors for labels if needed by image_annotator
|
641 |
+
recogniser_colour_list = [(0, 0, 0) for _ in range(len(recogniser_entities_list))]
|
642 |
|
643 |
+
except Exception as e:
|
644 |
+
print(f"Error calling update_recogniser_dataframes: {e}. Returning empty/default filter data.")
|
645 |
+
recogniser_entities_list = []
|
646 |
+
recogniser_colour_list = []
|
647 |
+
recogniser_dataframe_out_gr = gr.Dataframe(pd.DataFrame(columns=["page", "label", "text", "id"]))
|
648 |
+
recogniser_dataframe_modified = pd.DataFrame(columns=["page", "label", "text", "id"])
|
649 |
+
text_entities_drop = []
|
650 |
+
page_entities_drop = []
|
651 |
|
|
|
652 |
|
653 |
+
# --- Final Output Components ---
|
654 |
+
page_number_reported_gradio_comp = gr.Number(label = "Current page", value=page_num_reported, precision=0)
|
655 |
|
|
|
656 |
|
|
|
|
|
|
|
|
|
|
|
657 |
|
658 |
+
### Present image_annotator outputs
|
659 |
+
# Handle the case where current_page_image_annotator_object couldn't be prepared
|
660 |
+
if current_page_image_annotator_object is None:
|
661 |
+
# This should ideally be covered by the initial empty check for all_image_annotations,
|
662 |
+
# but as a safeguard:
|
663 |
+
print("Warning: Could not prepare annotator object for the current page.")
|
664 |
+
out_image_annotator = image_annotator(value=None, interactive=False) # Present blank/non-interactive
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
665 |
else:
|
|
|
666 |
out_image_annotator = image_annotator(
|
667 |
value = current_page_image_annotator_object,
|
668 |
boxes_alpha=0.1,
|
669 |
box_thickness=1,
|
670 |
+
label_list=recogniser_entities_list, # Use labels from update_recogniser_dataframes
|
671 |
label_colors=recogniser_colour_list,
|
672 |
show_label=False,
|
673 |
height=zoom_str,
|
|
|
680 |
show_share_button=False,
|
681 |
show_remove_button=False,
|
682 |
handles_cursor=True,
|
683 |
+
interactive=True # Keep interactive if data is present
|
684 |
)
|
685 |
|
686 |
+
# The original code returned page_number_reported_gradio twice;
|
687 |
+
# returning the Gradio component and the plain integer value.
|
688 |
+
# Let's match the output signature.
|
689 |
+
return (out_image_annotator,
|
690 |
+
page_number_reported_gradio_comp,
|
691 |
+
page_number_reported_gradio_comp, # Redundant, but matches original return signature
|
692 |
+
page_num_reported, # Plain integer value
|
693 |
+
recogniser_entities_dropdown_value,
|
694 |
+
recogniser_dataframe_out_gr,
|
695 |
+
recogniser_dataframe_modified,
|
696 |
+
text_entities_drop, # List of text entities for dropdown
|
697 |
+
page_entities_drop, # List of page numbers for dropdown
|
698 |
+
page_sizes, # Updated page_sizes list
|
699 |
+
all_image_annotations) # Return the updated full state
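Both annotator-update functions above clamp the requested page to the valid 1-based range before indexing the state list. The clamping logic in isolation (hypothetical helper name):

```python
def clamp_page(requested: int, page_count: int) -> int:
    # 1-based page numbers; anything out of range snaps to the nearest valid page
    page = max(1, requested)
    return min(page, page_count)
```

Clamping before computing the zero-based index guarantees `state[page - 1]` can never raise an `IndexError`, whatever value the UI hands over.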

def update_all_page_annotation_object_based_on_previous_page(
    page_image_annotator_object:AnnotatedImageData,

    previous_page_zero_index = previous_page -1

    if not current_page: current_page = 1
+
+    # This replaces the numpy array image object with the image file path
+    page_image_annotator_object, all_image_annotations = replace_annotator_object_img_np_array_with_page_sizes_image_path(all_image_annotations, page_image_annotator_object, page_sizes, previous_page)

    if clear_all == False: all_image_annotations[previous_page_zero_index] = page_image_annotator_object
    else: all_image_annotations[previous_page_zero_index]["boxes"] = []

    page_image_annotator_object = all_image_annotations[current_page - 1]

    # This replaces the numpy array image object with the image file path
+    page_image_annotator_object, all_image_annotations = replace_annotator_object_img_np_array_with_page_sizes_image_path(all_image_annotations, page_image_annotator_object, page_sizes, current_page)
    page_image_annotator_object['image'] = all_image_annotations[current_page - 1]["image"]

    if not page_image_annotator_object:

    # Check if all elements are integers in the range 0-255
    if all(isinstance(c, int) and 0 <= c <= 255 for c in fill):
        pass
+
    else:
        print(f"Invalid color values: {fill}. Defaulting to black.")
        fill = (0, 0, 0) # Default to black if invalid

        doc = [image]

    elif file_extension in '.csv':
        pdf_doc = []

    # If working with pdfs

    row_value_df = pd.DataFrame(data={"page":[row_value_page], "label":[row_value_label], "text":[row_value_text], "id":[row_value_id]})

+    return row_value_df

def df_select_callback_textract_api(df: pd.DataFrame, evt: gr.SelectData):

    row_value_job_id = evt.row_value[0] # This is the job ID value
    # row_value_label = evt.row_value[1] # This is the label number value

    return row_value_page, row_value_df

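The select callbacks above all follow the same pattern: pull positional values out of a Gradio `SelectData` event's `row_value` and wrap them in a one-row DataFrame for downstream filtering. A standalone sketch with a hypothetical payload standing in for the event:

```python
import pandas as pd

# Hypothetical stand-in for gr.SelectData's row_value payload
row_value = ["job-123", "IN_PROGRESS"]

# Each value is wrapped in a single-element list so pandas builds exactly one row
row_value_df = pd.DataFrame(data={"job_id": [row_value[0]], "status": [row_value[1]]})
```

The one-row DataFrame keeps the selection in the same shape as the tables it will later be merged or compared against.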
1080 |
+
def update_selected_review_df_row_colour(
|
1081 |
+
redaction_row_selection: pd.DataFrame,
|
1082 |
+
review_df: pd.DataFrame,
|
1083 |
+
previous_id: str = "",
|
1084 |
+
previous_colour: str = '(0, 0, 0)',
|
1085 |
+
colour: str = '(1, 0, 255)'
|
1086 |
+
) -> tuple[pd.DataFrame, str, str]:
|
1087 |
'''
|
1088 |
Update the colour of a single redaction box based on the values in a selection row
|
1089 |
+
(Optimized Version)
|
1090 |
'''
|
|
|
1091 |
|
1092 |
+
# Ensure 'color' column exists, default to previous_colour if previous_id is provided
|
1093 |
+
if "color" not in review_df.columns:
|
1094 |
+
review_df["color"] = previous_colour if previous_id else '(0, 0, 0)'
|
1095 |
+
|
1096 |
+
# Ensure 'id' column exists
|
1097 |
if "id" not in review_df.columns:
|
1098 |
+
# Assuming fill_missing_ids is a defined function that returns a DataFrame
|
1099 |
+
# It's more efficient if this is handled outside if possible,
|
1100 |
+
# or optimized internally.
|
1101 |
+
print("Warning: 'id' column not found. Calling fill_missing_ids.")
|
1102 |
+
review_df = fill_missing_ids(review_df) # Keep this if necessary, but note it can be slow
|
1103 |
+
|
1104 |
+
# --- Optimization 1 & 2: Reset existing highlight colours using vectorized assignment ---
|
1105 |
+
# Reset the color of the previously highlighted row
|
1106 |
+
if previous_id and previous_id in review_df["id"].values:
|
1107 |
+
review_df.loc[review_df["id"] == previous_id, "color"] = previous_colour
|
1108 |
+
|
1109 |
+
# Reset the color of any row that currently has the highlight colour (handle cases where previous_id might not have been tracked correctly)
|
1110 |
+
# Convert to string for comparison only if the dtype might be mixed or not purely string
|
1111 |
+
# If 'color' is consistently string, the .astype(str) might be avoidable.
|
1112 |
+
# Assuming color is consistently string format like '(R, G, B)'
|
1113 |
+
review_df.loc[review_df["color"] == colour, "color"] = '(0, 0, 0)'
|
1114 |
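The "Optimization 1 & 2" reset above relies on boolean-mask assignment via `DataFrame.loc`, which updates every matching row in one vectorized step instead of a Python loop. A minimal sketch of the reset-then-highlight pattern, with made-up rows and the same colour-string convention:

```python
import pandas as pd

HIGHLIGHT = '(1, 0, 255)'
DEFAULT = '(0, 0, 0)'

review_df = pd.DataFrame({
    "id": ["a", "b", "c"],
    "color": [DEFAULT, HIGHLIGHT, DEFAULT],
})

# Reset every row currently carrying the highlight colour in one assignment
review_df.loc[review_df["color"] == HIGHLIGHT, "color"] = DEFAULT

# Highlight the newly selected row by its unique id
review_df.loc[review_df["id"] == "c", "color"] = HIGHLIGHT

print(review_df["color"].tolist())  # ['(0, 0, 0)', '(0, 0, 0)', '(1, 0, 255)']
```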
1115 |
1116 |     if not redaction_row_selection.empty and not review_df.empty:
1117 |         use_id = (
1118 | +            "id" in redaction_row_selection.columns
1119 | +            and "id" in review_df.columns
1120 | +            and not redaction_row_selection["id"].isnull().all()
1121 |             and not review_df["id"].isnull().all()
1122 |         )
1123 |
1124 | +        selected_merge_cols = ["id"] if use_id else ["label", "page", "text"]
1125 |
1126 | +        # --- Optimization 3: Use inner merge directly ---
1127 | +        # Merge to find rows in review_df that match redaction_row_selection
1128 | +        merged_reviews = review_df.merge(
1129 | +            redaction_row_selection[selected_merge_cols],
1130 | +            on=selected_merge_cols,
1131 | +            how="inner"  # Use inner join as we only care about matches
1132 | +        )
1133 |
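The "Optimization 3" inner merge can be seen in isolation below. An inner join keeps only the `review_df` rows whose key columns also appear in the selection, so no row-by-row scan is needed; the frames here are invented.

```python
import pandas as pd

review_df = pd.DataFrame({
    "id": ["a", "b", "c"],
    "label": ["NAME", "EMAIL", "NAME"],
    "color": ["(0, 0, 0)", "(0, 0, 0)", "(0, 0, 0)"],
})
selection = pd.DataFrame({"id": ["b"]})

# Inner join: only rows present in both frames survive
merged_reviews = review_df.merge(selection[["id"]], on=["id"], how="inner")

print(merged_reviews["label"].tolist())  # ['EMAIL']
```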
1134 | +        if not merged_reviews.empty:
1135 | +            # Assuming we only expect one match for highlighting a single row.
1136 | +            # If multiple matches are possible and you want to highlight all,
1137 | +            # the logic for previous_id and previous_colour needs adjustment.
1138 | +            new_previous_colour = str(merged_reviews["color"].iloc[0])
1139 | +            new_previous_id = merged_reviews["id"].iloc[0]
1140 | +
1141 | +            # --- Optimization 1 & 2: Update color of the matched row using vectorized assignment ---
1142 | +
1143 | +            if use_id:
1144 | +                # Faster update if using unique 'id' as merge key
1145 | +                review_df.loc[review_df["id"].isin(merged_reviews["id"]), "color"] = colour
1146 | +            else:
1147 | +                # More general case using multiple columns - might be slower.
1148 | +                # Create a temporary key for comparison
1149 | +                def create_merge_key(df, cols):
1150 | +                    return df[cols].astype(str).agg('_'.join, axis=1)
1151 | +
1152 | +                review_df_key = create_merge_key(review_df, selected_merge_cols)
1153 | +                merged_reviews_key = create_merge_key(merged_reviews, selected_merge_cols)
1154 | +
1155 | +                review_df.loc[review_df_key.isin(merged_reviews_key), "color"] = colour
1156 | +
1157 | +            previous_colour = new_previous_colour
1158 | +            previous_id = new_previous_id
1159 | +        else:
1160 | +            # No rows matched the selection
1161 | +            print("No reviews found matching selection criteria")
1162 | +            # The reset logic at the beginning already handles setting color to (0, 0, 0)
1163 | +            # if it was the highlight colour and didn't match.
1164 | +            # No specific action needed here for color reset beyond what's done initially.
1165 | +            previous_colour = '(0, 0, 0)'  # Reset previous_colour as no row was highlighted
1166 | +            previous_id = ''  # Reset previous_id
1167 |
1168 |     else:
1169 | +        # If selection is empty, reset any existing highlights
1170 | +        review_df.loc[review_df["color"] == colour, "color"] = '(0, 0, 0)'
1171 | +        previous_colour = '(0, 0, 0)'
1172 | +        previous_id = ''
1173 |
1174 |
1175 | +    # Ensure column order is maintained if necessary, though pandas generally preserves order.
1176 | +    # Creating a new DataFrame here might involve copying data, consider if this is strictly needed.
1177 | +    if set(["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax", "text", "id"]).issubset(review_df.columns):
1178 | +        review_df = review_df[["image", "page", "label", "color", "xmin", "ymin", "xmax", "ymax", "text", "id"]]
1179 | +    else:
1180 | +        print("Warning: Not all expected columns are present in review_df for reordering.")
1181 |
1182 |
1183 |     return review_df, previous_id, previous_colour
1184 |
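The `create_merge_key` fallback above concatenates several columns into a single string key, so multi-column matching reduces to one vectorized `isin()` test. A sketch with invented rows:

```python
import pandas as pd

def create_merge_key(df, cols):
    # Stringify each column, then join the values row-wise with '_'
    return df[cols].astype(str).agg('_'.join, axis=1)

review_df = pd.DataFrame({
    "label": ["NAME", "EMAIL"],
    "page": [1, 2],
    "text": ["Jane", "jane@example.com"],
})

keys = create_merge_key(review_df, ["label", "page", "text"])
print(keys.tolist())  # ['NAME_1_Jane', 'EMAIL_2_jane@example.com']
```

One caveat of this scheme: values containing the `'_'` separator can collide (e.g. `("a_b", "c")` and `("a", "b_c")` produce the same key), which is acceptable here only because the key is a temporary comparison aid.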
1285 |     page_sizes_df = pd.DataFrame(page_sizes)
1286 |
1287 |     # If there are no image coordinates, then convert coordinates to pymupdf coordinates prior to export
1288 |     pages_are_images = False
1289 |
1290 |     if "mediabox_width" not in review_file_df.columns:
1336 |                 raise ValueError(f"Invalid cropbox format: {document_cropboxes[page_python_format]}")
1337 |         else:
1338 |             print("Document cropboxes not found.")
1339 |
1340 |         pdf_page_height = pymupdf_page.mediabox.height
1341 | +        pdf_page_width = pymupdf_page.mediabox.width
1342 |
1343 |         # Create redaction annotation
1344 |         redact_annot = SubElement(annots, 'redact')
1616 |         # Optionally, you can add the image path or other relevant information
1617 |         df.loc[_, 'image'] = image_path
1618 |
1619 |     out_file_path = output_folder + file_path_name + "_review_file.csv"
1620 |     df.to_csv(out_file_path, index=None)
1621 |
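A note on the export above: `to_csv`'s `index` argument is documented as a bool, and `index=None` is simply falsy, so it appears to behave like `index=False` (no index column is written). A quick round-trip check with a throwaway path (the file name here is invented):

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"page": [1], "label": ["NAME"]})

# Write without the DataFrame index; index=None is treated as falsy
out_file_path = os.path.join(tempfile.gettempdir(), "example_review_file.csv")
df.to_csv(out_file_path, index=None)

# Reading back shows only the original columns, no 'Unnamed: 0'
round_trip = pd.read_csv(out_file_path)
print(list(round_trip.columns))  # ['page', 'label']
```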
tools/textract_batch_call.py
CHANGED
@@ -10,7 +10,7 @@ from io import StringIO
10 | from urllib.parse import urlparse
11 | from botocore.exceptions import ClientError, NoCredentialsError, PartialCredentialsError, TokenRetrievalError
12 |
13 | -from tools.config import
13 | +from tools.config import TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET, OUTPUT_FOLDER, AWS_REGION, DOCUMENT_REDACTION_BUCKET, LOAD_PREVIOUS_TEXTRACT_JOBS_S3, TEXTRACT_JOBS_S3_LOC, TEXTRACT_JOBS_LOCAL_LOC
14 | #from tools.aws_textract import json_to_ocrresult
15 |
16 | def analyse_document_with_textract_api(
@@ -18,7 +18,7 @@ def analyse_document_with_textract_api(
18 |     s3_input_prefix: str,
19 |     s3_output_prefix: str,
20 |     job_df:pd.DataFrame,
21 | -    s3_bucket_name: str =
21 | +    s3_bucket_name: str = TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET,
22 |     local_output_dir: str = OUTPUT_FOLDER,
23 |     analyse_signatures:List[str] = [],
24 |     successful_job_number:int=0,
@@ -328,7 +328,7 @@ def poll_bulk_textract_analysis_progress_and_download(
328 |     s3_output_prefix: str,
329 |     pdf_filename:str,
330 |     job_df:pd.DataFrame,
331 | -    s3_bucket_name: str =
331 | +    s3_bucket_name: str = TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET,
332 |     local_output_dir: str = OUTPUT_FOLDER,
333 |     load_s3_jobs_loc:str=TEXTRACT_JOBS_S3_LOC,
334 |     load_local_jobs_loc:str=TEXTRACT_JOBS_LOCAL_LOC,
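The fixed defaults above follow the module-level-constant pattern: configuration names imported from `tools.config` are used directly as keyword defaults. Worth remembering that Python binds such defaults once, when the `def` statement executes at import time, so later changes to the constant do not affect already-defined functions. A stripped-down stand-in (the constant values are invented and the body is a placeholder, not the real Textract call):

```python
# Stand-in config constants; the real values live in tools.config
TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET = "example-textract-bucket"  # assumption
OUTPUT_FOLDER = "output/"  # assumption

def analyse_document_with_textract_api(
        s3_bucket_name: str = TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET,
        local_output_dir: str = OUTPUT_FOLDER) -> tuple:
    # Defaults were captured when this def ran, at import time
    return s3_bucket_name, local_output_dir

print(analyse_document_with_textract_api())  # ('example-textract-bucket', 'output/')
```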
|