seanpedrickcase committed on
Commit
c543ba0
·
1 Parent(s): febacad

Updated user guide and app settings. Updated some additional lambda_entrypoint arguments. Ensured that examples are correctly displayed on GUI.

Files changed (7)
  1. README.md +312 -55
  2. app.py +30 -41
  3. lambda_entrypoint.py +6 -4
  4. pyproject.toml +1 -1
  5. src/app_settings.qmd +534 -490
  6. src/user_guide.qmd +310 -53
  7. tools/config.py +48 -32
README.md CHANGED
@@ -10,7 +10,7 @@ license: agpl-3.0
10
  ---
11
  # Document redaction
12
 
13
- version: 1.4.0
14
 
15
  Redact personally identifiable information (PII) from documents (pdf, png, jpg), Word files (docx), or tabular data (xlsx/csv/parquet). Please see the [User Guide](#user-guide) for a full walkthrough of all the features in the app.
16
 
@@ -204,11 +204,12 @@ These settings are only relevant if you intend to use AWS services like Textract
204
 
205
  Now you have the app installed, what follows is a guide on how to use it for basic and advanced redaction.
206
 
207
- # User Guide
208
 
209
  ## Table of contents
210
 
211
- - [Example data files](#example-data-files)
 
212
  - [Basic redaction](#basic-redaction)
213
  - [Customising redaction options](#customising-redaction-options)
214
  - [Custom allow, deny, and page redaction lists](#custom-allow-deny-and-page-redaction-lists)
@@ -220,21 +221,60 @@ Now you have the app installed, what follows is a guide on how to use it for bas
220
  - [Handwriting and signature redaction](#handwriting-and-signature-redaction)
221
  - [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)
222
  - [Redacting Word, tabular data files (CSV/XLSX) or copy and pasted text](#redacting-word-tabular-data-files-xlsxcsv-or-copy-and-pasted-text)
223
-
224
- See the [advanced user guide here](#advanced-user-guide):
225
- - [Merging redaction review files](#merging-redaction-review-files)
226
  - [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
 
 
 
227
  - [Fuzzy search and redaction](#fuzzy-search-and-redaction)
228
  - [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
 
229
  - [Exporting to Adobe Acrobat](#exporting-to-adobe-acrobat)
230
  - [Importing from Adobe Acrobat](#importing-from-adobe-acrobat)
231
  - [Using the AWS Textract document API](#using-the-aws-textract-document-api)
232
  - [Using AWS Textract and Comprehend when not running in an AWS environment](#using-aws-textract-and-comprehend-when-not-running-in-an-aws-environment)
233
  - [Modifying existing redaction review files](#modifying-existing-redaction-review-files)
234
 
235
- ## Example data files
236
 
237
- Please try these example files to follow along with this guide:
238
  - [Example of files sent to a professor before applying](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_of_emails_sent_to_a_professor_before_applying.pdf)
239
  - [Example complaint letter (jpg)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_complaint_letter.jpg)
240
  - [Partnership Agreement Toolkit (for signatures and more advanced usage)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/Partnership-Agreement-Toolkit_0_0.pdf)
@@ -254,16 +294,20 @@ The 'Redact PDFs/images tab' currently accepts PDFs and image files (JPG, PNG) f
254
 
255
  ### Text extraction
256
 
257
- First, select one of the three text extraction options:
 
 
258
  - **'Local model - selectable text'** - This will read text directly from PDFs that have selectable text to redact (using PikePDF). This is fine for most PDFs, but will find nothing if the PDF does not have selectable text, and it is not good for handwriting or signatures. If it encounters an image file, it will send it onto the second option below.
259
  - **'Local OCR model - PDFs without selectable text'** - This option will use a simple Optical Character Recognition (OCR) model (Tesseract) to pull out text from a PDF/image that it 'sees'. This can handle most typed text in PDFs/images without selectable text, but struggles with handwriting/signatures. If you are interested in the latter, then you should use the third option if available.
260
  - **'AWS Textract service - all PDF types'** - Only available for instances of the app running on AWS. AWS Textract is a service that performs OCR on documents within their secure service. This is a more advanced version of OCR compared to the local option, and carries a (relatively small) cost. Textract excels in complex documents based on images, or documents that contain a lot of handwriting and signatures.
261
 
262
- ### Optional - select signature extraction
263
  If you chose the AWS Textract service above, you can choose if you want handwriting and/or signatures redacted by default. Choosing signatures here will have a cost implication, as identifying signatures will cost ~£2.66 ($3.50) per 1,000 pages vs ~£1.14 ($1.50) per 1,000 pages without signature detection.
264
 
265
  ![AWS Textract handwriting and signature options](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/textract_handwriting_signatures.PNG)
266
 
 
 
267
  ### PII redaction method
268
 
269
  If you are running with the AWS service enabled, here you will also have a choice for PII redaction method:
@@ -297,6 +341,7 @@ Click 'Redact document'. After loading in the document, the app should be able t
297
  ![Redaction outputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/redaction_outputs.PNG)
298
 
299
  - **'...redacted.pdf'** files contain the original pdf with suggested redacted text deleted and replaced by a black box on top of the document.
 
300
  - **'...ocr_results.csv'** files contain the line-by-line text outputs from the entire document. This file can be useful for later searching through for any terms of interest in the document (e.g. using Excel or a similar program).
301
  - **'...review_file.csv'** files are the review files that contain details and locations of all of the suggested redactions in the document. This file is key to the [review process](#reviewing-and-modifying-suggested-redactions), and should be downloaded to use later for this.
302
 
@@ -365,8 +410,6 @@ If the table is empty, you can add a new entry, you can add a new row by clickin
365
 
366
  ![Manually modify allow or deny list filled](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/allow_list/manually_modify_filled.PNG)
367
 
368
- **Note:** As of version 0.7.0 you can now apply your whole page redaction list directly to the document file currently under review by clicking the 'Apply whole page redaction list to document currently under review' button that appears here.
369
-
370
  ### Redacting additional types of personal information
371
 
372
  You may want to redact additional types of information beyond the defaults, or you may not be interested in default suggested entity types. There are dates in the example complaint letter. Say we wanted to redact those dates also?
@@ -381,7 +424,7 @@ If you want to redact different files, I suggest you refresh your browser page t
381
 
382
  ## Redacting only specific pages
383
 
384
- Say also we are only interested in redacting page 1 of the loaded documents. On the Redaction settings tab, select 'Lowest page to redact' as 1, and 'Highest page to redact' also as 1. When you next redact your documents, only the first page will be modified.
385
 
386
  ![Selecting specific pages to redact](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/allow_list/select_pages.PNG)
387
 
@@ -618,39 +661,16 @@ You can also write open text into an input box and redact that using the same me
618
  ### Redaction log outputs
619
  A list of the suggested redaction outputs from the tabular data / open text data redaction is available on the Redaction settings page under 'Log file outputs'.
620
 
621
- # ADVANCED USER GUIDE
622
-
623
- This advanced user guide will go over some of the features recently added to the app, including: modifying and merging redaction review files, identifying and redacting duplicate pages across multiple PDFs, 'fuzzy' search and redact, and exporting redactions to Adobe Acrobat.
624
-
625
- ## Table of contents
626
-
627
- - [Merging redaction review files](#merging-redaction-review-files)
628
- - [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
629
- - [Fuzzy search and redaction](#fuzzy-search-and-redaction)
630
- - [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
631
- - [Exporting to Adobe Acrobat](#exporting-to-adobe-acrobat)
632
- - [Importing from Adobe Acrobat](#importing-from-adobe-acrobat)
633
- - [Using the AWS Textract document API](#using-the-aws-textract-document-api)
634
- - [Using AWS Textract and Comprehend when not running in an AWS environment](#using-aws-textract-and-comprehend-when-not-running-in-an-aws-environment)
635
- - [Modifying existing redaction review files](#modifying-existing-redaction-review-files)
636
-
637
-
638
- ## Merging redaction review files
639
-
640
- Say you have run multiple redaction tasks on the same document, and you want to merge all of these redactions together. You could do this in your spreadsheet editor, but this could be fiddly especially if dealing with multiple review files or large numbers of redactions. The app has a feature to combine multiple review files together to create a 'merged' review file.
641
-
642
- ![Merging review files in the user interface](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/merge_review_files_interface.PNG)
643
-
644
- You can find this option at the bottom of the 'Redaction Settings' tab. Upload multiple review files here to get a single output 'merged' review_file. In the examples file, merging the 'review_file_custom.csv' and 'review_file_local.csv' files give you an output containing redaction boxes from both. This combined review file can then be uploaded into the review tab following the usual procedure.
645
-
646
- ![Merging review files outputs in spreadsheet](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/merged_review_file_outputs_csv.PNG)
647
-
648
  ## Identifying and redacting duplicate pages
649
 
650
  The files for this section are stored [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/duplicate_page_find_in_app/).
651
 
652
  Some redaction tasks involve removing duplicate pages of text that may exist across multiple documents. This feature helps you find and remove duplicate content that may exist in single or multiple documents. It can identify everything from single identical pages to multi-page sections (subdocuments). The process involves three main steps: configuring the analysis, reviewing the results in the interactive interface, and then using the generated files to perform the redactions.
653
 
 
 
 
 
654
  ![Example duplicate page inputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/duplicate_page_find_in_app/img/duplicate_page_input_interface_new.PNG)
655
 
656
  **Step 1: Upload and Configure the Analysis**
@@ -695,11 +715,43 @@ The analysis also generates a set of downloadable files for your records and for
695
 
696
  If you want to combine the results from this redaction process with previous redaction tasks for the same PDF, you could merge review file outputs following the steps described in [Merging existing redaction review files](#merging-existing-redaction-review-files) above.
697
 
698
  ## Fuzzy search and redaction
699
 
700
  The files for this section are stored [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/fuzzy_search/).
701
 
702
- Sometimes you may be searching for terns that are slightly mispelled throughout a document, for example names. The document redaction app gives the option for searching for long phrases that may contain spelling mistakes, a method called 'fuzzy matching'.
703
 
704
  To do this, go to the Redaction Settings, and the 'Select entity types to redact' area. In the box below relevant to your chosen redaction method (local or AWS Comprehend), select 'CUSTOM_FUZZY' from the list. Next, we can select the maximum number of spelling mistakes allowed in the search (up to nine). Here, you can either type in a number or use the small arrows to the right of the box. Change this option to 3. This will allow for a maximum of three 'changes' in text needed to match to the desired search terms.
705
 
@@ -719,9 +771,20 @@ Using these deny list with spelling mistakes, the app fuzzy match these terms to
719
 
720
  Files for this section are stored [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/export_to_adobe/).
721
 
722
- ### Exporting to Adobe Acrobat
 
 
723
 
724
- The Document Redaction app has a feature to export suggested redactions to Adobe, and likewise to import Adobe comment files into the app. The file format used is the .xfdf Adobe comment file format - [you can find more information about how to use these files here](https://helpx.adobe.com/uk/acrobat/using/importing-exporting-comments.html).
 
 
 
 
 
 
 
 
 
725
 
726
  To convert suggested redactions to Adobe format, you need to have the original PDF and a review file csv in the input box at the top of the Review redactions page.
727
 
@@ -769,6 +832,46 @@ The '_textract.json' output can be used to speed up further redaction tasks as [
769
 
770
  You can now easily get the '..._ocr_output.csv' redaction output based on this '_textract.json' (described in [Redaction outputs](#redaction-outputs)) by clicking on the button 'Convert Textract job outputs to OCR results'. You can now use this file e.g. for [identifying duplicate pages](#identifying-and-redacting-duplicate-pages), or for redaction review.
771
 
772
  ## Using AWS Textract and Comprehend when not running in an AWS environment
773
 
774
  AWS Textract and Comprehend give much better results for text extraction and document redaction than the local model options in the app. The most secure way to access them in the Redaction app is to run the app in a secure AWS environment with relevant permissions. Alternatively, you could run the app on your own system while logged in to AWS SSO with relevant permissions.
@@ -790,26 +893,180 @@ The app should then pick up these keys when trying to access the AWS Textract an
790
 
791
  Again, a lot can potentially go wrong with AWS solutions that are insecure, so before trying the above please consult with your AWS and data security teams.
792
 
793
- ## Modifying existing redaction review files
794
 
795
- *Note:* As of version 0.7.0 you can now modify redaction review files directly in the app on the 'Review redactions' tab. Open the accordion 'View and edit review data' under the file input area. You can edit review file data cells here - press Enter to apply changes. You should see the effect on the current page if you click the 'Save changes on current page to file' button to the right.
796
 
797
- You can find the folder containing the files discussed in this section [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/).
798
 
799
- As well as serving as inputs to the document redaction app's review function, the 'review_file.csv' output can be modified outside of the app, and also merged with others from multiple redaction attempts on the same file. This gives you the flexibility to change redaction details outside of the app.
 
 
800
 
801
- If you open up a 'review_file' csv output using a spreadsheet software program such as Microsoft Excel you can easily modify redaction properties. Open the file '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local.csv)', and you should see a spreadshet with just four suggested redactions (see below). The following instructions are for using Excel.
802
 
803
- ![Review file before](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/review_file_before.PNG)
804
 
805
- The first thing we can do is remove the first row - 'et' is suggested as a person, but is obviously not a genuine instance of personal information. Right click on the row number and select delete on this menu. Next, let's imagine that what the app identified as a 'phone number' was in fact another type of number and so we wanted to change the label. Simply click on the relevant label cells, let's change it to 'SECURITY_NUMBER'. You could also use 'Find & Select' -> 'Replace' from the top ribbon menu if you wanted to change a number of labels simultaneously.
 
 
806
 
807
- How about we wanted to change the colour of the 'email address' entry on the redaction review tab of the redaction app? The colours in a review file are based on an RGB scale with three numbers ranging from 0-255. [You can find suitable colours here](https://rgbcolorpicker.com). Using this scale, if I wanted my review box to be pure blue, I can change the cell value to (0,0,255).
 
 
 
808
 
809
- Imagine that a redaction box was slightly too small, and I didn't want to use the in-app options to change the size. In the review file csv, we can modify e.g. the ymin and ymax values for any box to increase the extent of the redaction box. For the 'email address' entry, let's decrease ymin by 5, and increase ymax by 5.
810
 
811
- I have saved an output file following the above steps as '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local_mod.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/outputs/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local_mod.csv)' in the same folder that the original was found. Let's upload this file to the app along with the original pdf to see how the redactions look now.
812
 
813
- ![Review file after modification](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/partnership_redactions_after.PNG)
814
 
815
- We can see from the above that we have successfully removed a redaction box, changed labels, colours, and redaction box sizes.
 
10
  ---
11
  # Document redaction
12
 
13
+ version: 1.4.1
14
 
15
  Redact personally identifiable information (PII) from documents (pdf, png, jpg), Word files (docx), or tabular data (xlsx/csv/parquet). Please see the [User Guide](#user-guide) for a full walkthrough of all the features in the app.
16
 
 
204
 
205
  Now you have the app installed, what follows is a guide on how to use it for basic and advanced redaction.
206
 
207
+ # User guide
208
 
209
  ## Table of contents
210
 
211
+ ### Getting Started
212
+ - [Built-in example data](#built-in-example-data)
213
  - [Basic redaction](#basic-redaction)
214
  - [Customising redaction options](#customising-redaction-options)
215
  - [Custom allow, deny, and page redaction lists](#custom-allow-deny-and-page-redaction-lists)
 
221
  - [Handwriting and signature redaction](#handwriting-and-signature-redaction)
222
  - [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)
223
  - [Redacting Word, tabular data files (CSV/XLSX) or copy and pasted text](#redacting-word-tabular-data-files-xlsxcsv-or-copy-and-pasted-text)
 
 
 
224
  - [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
225
+
226
+ ### Advanced user guide
227
+ - [Advanced user guide](#advanced-user-guide)
228
  - [Fuzzy search and redaction](#fuzzy-search-and-redaction)
229
  - [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
230
+ - [Using _for_review.pdf files with Adobe Acrobat](#using-_for_reviewpdf-files-with-adobe-acrobat)
231
  - [Exporting to Adobe Acrobat](#exporting-to-adobe-acrobat)
232
  - [Importing from Adobe Acrobat](#importing-from-adobe-acrobat)
233
  - [Using the AWS Textract document API](#using-the-aws-textract-document-api)
234
  - [Using AWS Textract and Comprehend when not running in an AWS environment](#using-aws-textract-and-comprehend-when-not-running-in-an-aws-environment)
235
  - [Modifying existing redaction review files](#modifying-existing-redaction-review-files)
236
+ - [Merging redaction review files](#merging-redaction-review-files)
237
+
238
+ ### Features for expert users/system administrators
239
+ - [Features for expert users/system administrators](#features-for-expert-userssystem-administrators)
240
+ - [Advanced OCR options (Hybrid OCR)](#advanced-ocr-options-hybrid-ocr)
241
+ - [Command Line Interface (CLI)](#command-line-interface-cli)
242
+
243
+ ## Built-in example data
244
+
245
+ The app now includes built-in example files that you can use to quickly test different features. These examples are automatically loaded and can be accessed directly from the interface without needing to download files separately.
246
+
247
+ ### Using built-in examples
248
+
249
+ **For PDF/image redaction:** On the 'Redact PDFs/images' tab, you'll see a section titled "Try an example - Click on an example below and then the 'Extract text and redact document' button". Simply click on any of the available examples to load them with pre-configured settings:
250
+
251
+ - **PDF with selectable text redaction** - Uses local text extraction with standard PII detection
252
+ - **Image redaction with local OCR** - Processes an image file using OCR
253
+ - **PDF redaction with custom entities** - Demonstrates custom entity selection (Titles, Person, Dates)
254
+ - **PDF redaction with AWS services and signature detection** - Shows AWS Textract with signature extraction (if AWS is enabled)
255
+ - **PDF redaction with custom deny list and whole page redaction** - Demonstrates advanced redaction features
256
+
257
+ Once you have clicked on an example, you can click the 'Extract text and redact document' button to load the example into the app and redact it.
258
+
259
+ **For tabular data:** On the 'Word or Excel/csv files' tab, you'll find examples for both redaction and duplicate detection:
260
 
261
+ - **CSV file redaction** - Shows how to redact specific columns in tabular data
262
+ - **Word document redaction** - Demonstrates Word document processing
263
+ - **Excel file duplicate detection** - Shows how to find duplicate rows in spreadsheet data
264
+
265
+ Once you have clicked on an example, you can click the 'Redact text/data files' button to load the example into the app and redact it. For the duplicate detection example, you can click the 'Find duplicate cells/rows' button to load the example into the app and find duplicates.
266
+
267
+ **For duplicate page detection:** On the 'Identify duplicate pages' tab, you'll find examples for finding duplicate content in documents:
268
+
269
+ - **Find duplicate pages of text in document OCR outputs** - Uses page-level analysis with a similarity threshold of 0.95 and minimum word count of 10
270
+ - **Find duplicate text lines in document OCR outputs** - Uses line-level analysis with a similarity threshold of 0.95 and minimum word count of 3
271
+
272
+ Once you have clicked on an example, you can click the 'Identify duplicate pages/subdocuments' button to load the example into the app and find duplicate content.
273
+
274
+ ### External example files (optional)
275
+
276
+ If you prefer to use your own example files or want to follow along with specific tutorials, you can still download these external example files:
277
 
 
278
  - [Example of files sent to a professor before applying](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_of_emails_sent_to_a_professor_before_applying.pdf)
279
  - [Example complaint letter (jpg)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_complaint_letter.jpg)
280
  - [Partnership Agreement Toolkit (for signatures and more advanced usage)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/Partnership-Agreement-Toolkit_0_0.pdf)
 
294
 
295
  ### Text extraction
296
 
297
+ You can modify the default text extraction method by clicking on the 'Change default text extraction method...' box.
298
+
299
+ Here you can select one of the three text extraction options:
300
  - **'Local model - selectable text'** - This will read text directly from PDFs that have selectable text to redact (using PikePDF). This is fine for most PDFs, but will find nothing if the PDF does not have selectable text, and it is not good for handwriting or signatures. If it encounters an image file, it will send it onto the second option below.
301
  - **'Local OCR model - PDFs without selectable text'** - This option will use a simple Optical Character Recognition (OCR) model (Tesseract) to pull out text from a PDF/image that it 'sees'. This can handle most typed text in PDFs/images without selectable text, but struggles with handwriting/signatures. If you are interested in the latter, then you should use the third option if available.
302
  - **'AWS Textract service - all PDF types'** - Only available for instances of the app running on AWS. AWS Textract is a service that performs OCR on documents within their secure service. This is a more advanced version of OCR compared to the local option, and carries a (relatively small) cost. Textract excels in complex documents based on images, or documents that contain a lot of handwriting and signatures.
303
 
304
+ ### Enable AWS Textract signature extraction
305
  If you chose the AWS Textract service above, you can choose if you want handwriting and/or signatures redacted by default. Choosing signatures here will have a cost implication, as identifying signatures will cost ~£2.66 ($3.50) per 1,000 pages vs ~£1.14 ($1.50) per 1,000 pages without signature detection.
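
The cost difference quoted above can be worked through with a short calculation. This is a minimal sketch, not part of the app: the function name `estimate_textract_cost` and the price table are hypothetical, and the rates are simply the USD figures from the paragraph above, which may not match current AWS pricing.

```python
# Per-1,000-page prices in USD, copied from the guide text above.
# These are assumptions for illustration; check current AWS pricing before relying on them.
PRICE_PER_1000_PAGES_USD = {
    "text_only": 1.50,
    "with_signatures": 3.50,
}


def estimate_textract_cost(pages: int, detect_signatures: bool = False) -> float:
    """Return a rough Textract cost estimate in USD for the given page count."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    key = "with_signatures" if detect_signatures else "text_only"
    return pages * PRICE_PER_1000_PAGES_USD[key] / 1000


if __name__ == "__main__":
    for pages in (100, 1000, 20000):
        plain = estimate_textract_cost(pages)
        signed = estimate_textract_cost(pages, detect_signatures=True)
        print(f"{pages} pages: ${plain:.2f} without signatures, ${signed:.2f} with signatures")
```

Under these rates, signature detection costs a little over twice as much per page, so it is worth enabling only for documents where signatures are expected.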
306
 
307
  ![AWS Textract handwriting and signature options](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/textract_handwriting_signatures.PNG)
308
 
309
+ **NOTE:** it is also possible to enable form extraction, layout extraction, and table extraction with AWS Textract. This is not enabled by default, but it is possible for your system admin to enable this feature in the config file.
310
+
311
  ### PII redaction method
312
 
313
  If you are running with the AWS service enabled, here you will also have a choice for PII redaction method:
 
341
  ![Redaction outputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/redaction_outputs.PNG)
342
 
343
  - **'...redacted.pdf'** files contain the original pdf with suggested redacted text deleted and replaced by a black box on top of the document.
344
+ - **'...redactions_for_review.pdf'** files contain the original PDF with redaction boxes overlaid but the original text still visible underneath. This file is designed for use in Adobe Acrobat and other PDF viewers where you can see the suggested redactions without the text being permanently removed. This is particularly useful for reviewing redactions before finalising them.
345
  - **'...ocr_results.csv'** files contain the line-by-line text outputs from the entire document. This file can be useful for later searching through for any terms of interest in the document (e.g. using Excel or a similar program).
346
  - **'...review_file.csv'** files are the review files that contain details and locations of all of the suggested redactions in the document. This file is key to the [review process](#reviewing-and-modifying-suggested-redactions), and should be downloaded to use later for this.
347
 
 
410
 
411
  ![Manually modify allow or deny list filled](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/allow_list/manually_modify_filled.PNG)
412
 
 
 
413
  ### Redacting additional types of personal information
414
 
415
  You may want to redact additional types of information beyond the defaults, or you may not be interested in default suggested entity types. There are dates in the example complaint letter. Say we wanted to redact those dates also?
 
424
 
425
  ## Redacting only specific pages
426
 
427
+ Say also we are only interested in redacting page 1 of the loaded documents. On the Redaction settings tab, select 'Lowest page to redact' as 1, and 'Highest page to redact' also as 1. When you next redact your documents, only the first page will be modified. The output files should now have a suffix similar to '..._1_1.pdf', indicating the lowest and highest page numbers that were redacted.
428
 
429
  ![Selecting specific pages to redact](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/allow_list/select_pages.PNG)
430
 
 
661
  ### Redaction log outputs
662
  A list of the suggested redaction outputs from the tabular data / open text data redaction is available on the Redaction settings page under 'Log file outputs'.
663
 
664
  ## Identifying and redacting duplicate pages
665
 
666
  The files for this section are stored [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/duplicate_page_find_in_app/).
667
 
668
  Some redaction tasks involve removing duplicate pages of text that may exist across multiple documents. This feature helps you find and remove duplicate content that may exist in single or multiple documents. It can identify everything from single identical pages to multi-page sections (subdocuments). The process involves three main steps: configuring the analysis, reviewing the results in the interactive interface, and then using the generated files to perform the redactions.
669
 
670
+ ### Duplicate page detection in documents
671
+
672
+ This section covers finding duplicate pages across PDF documents using OCR output files.
673
+
674
  ![Example duplicate page inputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/duplicate_page_find_in_app/img/duplicate_page_input_interface_new.PNG)
675
 
676
  **Step 1: Upload and Configure the Analysis**
 
715
 
716
  If you want to combine the results from this redaction process with previous redaction tasks for the same PDF, you could merge review file outputs following the steps described in [Merging existing redaction review files](#merging-existing-redaction-review-files) above.
717
 
718
+ ### Duplicate detection in tabular data
719
+
720
+ The app also includes functionality to find duplicate cells or rows in CSV, Excel, or Parquet files. This is particularly useful for cleaning datasets where you need to identify and remove duplicate entries.
721
+
722
+ **Step 1: Upload files and configure analysis**
723
+
724
+ Navigate to the 'Word or Excel/csv files' tab and scroll down to the "Find duplicate cells in tabular data" section. Upload your tabular files (CSV, Excel, or Parquet) and configure the analysis parameters:
725
+
726
+ - **Similarity threshold**: Score (0-1) to consider cells a match. 1 = perfect match
727
+ - **Minimum word count**: Cells with fewer words than this value are ignored
728
+ - **Do initial clean of text**: Remove URLs, HTML tags, and non-ASCII characters
729
+ - **Remove duplicate rows**: Automatically remove duplicate rows from deduplicated files
730
+ - **Select Excel sheet names**: Choose which sheets to analyze (for Excel files)
731
+ - **Select text columns**: Choose which columns contain text to analyze
732
+
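The interaction between the similarity threshold and the minimum word count can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `find_duplicate_cells` is hypothetical, and the similarity score here comes from the standard library's `difflib.SequenceMatcher`, which may well differ from the scoring method the app uses internally.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_duplicate_cells(cells, similarity_threshold=0.95, min_word_count=3):
    """Return (index_a, index_b, score) for pairs of cells that look like duplicates.

    Hypothetical helper for illustration; not the app's implementation.
    """
    # Cells with fewer words than min_word_count are ignored entirely.
    candidates = [
        (index, text)
        for index, text in enumerate(cells)
        if len(text.split()) >= min_word_count
    ]

    matches = []
    for (index_a, text_a), (index_b, text_b) in combinations(candidates, 2):
        # ratio() returns a score between 0 and 1, where 1 is a perfect match.
        score = SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()
        if score >= similarity_threshold:
            matches.append((index_a, index_b, round(score, 3)))
    return matches


cells = [
    "The client attended the meeting on Monday",
    "Follow up next week",
    "The client attended the meeting on Monday",
    "Yes",
]
print(find_duplicate_cells(cells))
```

Lowering the threshold makes near-identical cells count as matches, while raising the minimum word count stops short cells such as "Yes" from being compared at all.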
733
+ **Step 2: Review results**
734
+
735
+ After clicking "Find duplicate cells/rows", the results will be displayed in a table showing:
736
+ - File1, Row1, File2, Row2
737
+ - Similarity_Score
738
+ - Text1, Text2 (the actual text content being compared)
739
+
740
+ Click on any row to see more details about the duplicate match in the preview boxes below.
741
+
742
+ **Step 3: Remove duplicates**
743
+
744
+ Select a file from the dropdown and click "Remove duplicate rows from selected file" to create a cleaned version with duplicates removed. The cleaned file will be available for download.
745
+
746
+ # Advanced user guide
747
+
748
+ This advanced user guide covers features that require system administration access or command-line usage. These features are typically used by system administrators or advanced users who need more control over the redaction process.
749
+
750
  ## Fuzzy search and redaction
751
 
752
  The files for this section are stored [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/fuzzy_search/).
753
 
754
+ Sometimes you may be searching for terms that are slightly misspelled throughout a document, for example names. The document redaction app gives the option of searching for long phrases that may contain spelling mistakes, a method called 'fuzzy matching'.
755
 
756
  To do this, go to the Redaction Settings, and the 'Select entity types to redact' area. In the box below relevant to your chosen redaction method (local or AWS Comprehend), select 'CUSTOM_FUZZY' from the list. Next, we can select the maximum number of spelling mistakes allowed in the search (up to nine). Here, you can either type in a number or use the small arrows to the right of the box. Change this option to 3. This will allow for a maximum of three 'changes' in text needed to match to the desired search terms.
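To make the idea of a maximum number of 'changes' concrete, the sketch below counts single-character insertions, deletions, and substitutions between two strings (the Levenshtein edit distance) and accepts a candidate when the count is within the allowed number of mistakes. This is an assumption about how the changes are counted, offered for illustration only; the app's own fuzzy matcher may score matches differently, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,  # delete char_a
                    current[j - 1] + 1,  # insert char_b
                    previous[j - 1] + substitution_cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


def matches_within(term: str, candidate: str, max_mistakes: int = 3) -> bool:
    """True if candidate is within max_mistakes edits of the search term."""
    return edit_distance(term.lower(), candidate.lower()) <= max_mistakes


print(matches_within("Sister City", "Sister Citty", max_mistakes=1))
print(matches_within("redact", "redaction", max_mistakes=2))
```

With the setting at 3, a search term still matches text that differs from it by up to three such edits, which is why a higher value finds more misspellings but also risks more false positives.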
757
 
 
771
 
772
  Files for this section are stored [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/export_to_adobe/).
773
 
774
+ The Document Redaction app has enhanced features for working with Adobe Acrobat. You can now export suggested redactions to Adobe, import Adobe comment files into the app, and use the new `_for_review.pdf` files directly in Adobe Acrobat.
775
+
776
+ ### Using _for_review.pdf files with Adobe Acrobat
777
 
778
+ The app now generates `...redactions_for_review.pdf` files that contain the original PDF with redaction boxes overlaid but the original text still visible underneath. These files are specifically designed for use in Adobe Acrobat and other PDF viewers where you can:
779
+
780
+ - See the suggested redactions without the text being permanently removed
781
+ - Review redactions before finalising them
782
+ - Use Adobe Acrobat's built-in redaction tools to modify or apply the redactions
783
+ - Export the final redacted version directly from Adobe
784
+
785
+ Simply open the `...redactions_for_review.pdf` file in Adobe Acrobat to begin reviewing and modifying the suggested redactions.
786
+
787
+ ### Exporting to Adobe Acrobat
788
 
789
  To convert suggested redactions to Adobe format, you need to have the original PDF and a review file csv in the input box at the top of the Review redactions page.
790
 
 
832
 
833
  You can now easily get the '..._ocr_output.csv' redaction output based on this '_textract.json' (described in [Redaction outputs](#redaction-outputs)) by clicking on the button 'Convert Textract job outputs to OCR results'. You can now use this file e.g. for [identifying duplicate pages](#identifying-and-redacting-duplicate-pages), or for redaction review.
834
 
835
+
836
+
837
+ ## Modifying existing redaction review files
838
+ You can find the folder containing the files discussed in this section [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/).
839
+
840
+ As well as serving as inputs to the document redaction app's review function, the 'review_file.csv' output can be modified inside or outside of the app. This gives you the flexibility to change redaction details outside of the app.
841
+
842
+ ### Inside the app
843
+ You can now modify redaction review files directly in the app on the 'Review redactions' tab. Open the accordion 'View and edit review data' under the file input area. You can edit review file data cells here - press Enter to apply changes. You should see the effect on the current page if you click the 'Save changes on current page to file' button to the right.
844
+
845
+ ### Outside the app
846
+ If you open up a 'review_file' csv output using a spreadsheet software program such as Microsoft Excel, you can easily modify redaction properties. Open the file '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local.csv)', and you should see a spreadsheet with just four suggested redactions (see below). The following instructions are for using Excel.
847
+
848
+ ![Review file before](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/review_file_before.PNG)
849
+
850
+ The first thing we can do is remove the first row - 'et' is suggested as a person, but is obviously not a genuine instance of personal information. Right click on the row number and select 'Delete' from the menu. Next, let's imagine that what the app identified as a 'phone number' was in fact another type of number, and so we want to change the label. Simply click on the relevant label cell and change it to 'SECURITY_NUMBER'. You could also use 'Find & Select' -> 'Replace' from the top ribbon menu if you wanted to change a number of labels simultaneously.
851
+
852
+ What if we wanted to change the colour of the 'email address' entry on the redaction review tab of the redaction app? The colours in a review file are based on an RGB scale with three numbers ranging from 0-255. [You can find suitable colours here](https://rgbcolorpicker.com). Using this scale, if I wanted my review box to be pure blue, I can change the cell value to (0,0,255).
853
+
854
+ Imagine that a redaction box was slightly too small, and I didn't want to use the in-app options to change the size. In the review file csv, we can modify e.g. the ymin and ymax values for any box to increase the extent of the redaction box. For the 'email address' entry, let's decrease ymin by 5, and increase ymax by 5.
855
+
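The same edits can be scripted rather than made by hand in a spreadsheet. The sketch below uses Python's standard `csv` module on a small in-memory excerpt. The column names (`page`, `label`, `color`, `ymin`, `ymax`), the sample values, and the helper `edit_review_rows` are assumptions made for illustration, so check the header row of your own review file before adapting it.

```python
import csv
import io

# Hypothetical review-file excerpt; the real file's columns and values may differ.
SAMPLE_REVIEW_CSV = (
    "page,label,color,ymin,ymax\n"
    '1,PHONE_NUMBER,"(0, 0, 0)",100,112\n'
    '1,EMAIL_ADDRESS,"(0, 0, 0)",200,212\n'
)


def edit_review_rows(rows, target_label, new_label=None, new_colour=None, grow_by=0):
    """Return copies of rows with edits applied to rows matching target_label."""
    edited = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        if row["label"] == target_label:
            if new_colour is not None:
                row["color"] = str(new_colour)  # e.g. "(0, 0, 255)" for pure blue
            # Extend the box vertically: lower ymin, raise ymax.
            row["ymin"] = str(float(row["ymin"]) - grow_by)
            row["ymax"] = str(float(row["ymax"]) + grow_by)
            if new_label is not None:
                row["label"] = new_label
        edited.append(row)
    return edited


rows = list(csv.DictReader(io.StringIO(SAMPLE_REVIEW_CSV)))
rows = edit_review_rows(rows, "PHONE_NUMBER", new_label="SECURITY_NUMBER")
rows = edit_review_rows(rows, "EMAIL_ADDRESS", new_colour=(0, 0, 255), grow_by=5)

output = io.StringIO()
writer = csv.DictWriter(output, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(output.getvalue())
```

To work on a real file, replace the `io.StringIO` objects with files opened via `open(path, newline="")`, and write the result to a new file name so the original review file is kept.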
856
+ I have saved an output file following the above steps as '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local_mod.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/outputs/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local_mod.csv)' in the same folder that the original was found. Let's upload this file to the app along with the original pdf to see how the redactions look now.
857
+
858
+ ![Review file after modification](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/partnership_redactions_after.PNG)
859
+
860
+ We can see from the above that we have successfully removed a redaction box, changed labels, colours, and redaction box sizes.
861
+
862
+ ## Merging redaction review files
863
+
864
+ Say you have run multiple redaction tasks on the same document, and you want to merge all of these redactions together. You could do this in your spreadsheet editor, but this could be fiddly especially if dealing with multiple review files or large numbers of redactions. The app has a feature to combine multiple review files together to create a 'merged' review file.
865
+
866
+ ![Merging review files in the user interface](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/merge_review_files_interface.PNG)
867
+
868
+ You can find this option at the bottom of the 'Redaction Settings' tab. Upload multiple review files here to get a single output 'merged' review_file. In the examples file, merging the 'review_file_custom.csv' and 'review_file_local.csv' files gives you an output containing redaction boxes from both. This combined review file can then be uploaded into the review tab following the usual procedure.
869
+
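Conceptually, merging review files amounts to stacking their rows and dropping exact repeats. The sketch below shows that idea on rows already loaded as dictionaries (for example with `csv.DictReader`). It is an assumption about the general approach rather than a description of the app's merge logic, and the helper name and column names (`page`, `label`, `xmin`) are hypothetical.

```python
def merge_review_rows(*tables):
    """Combine several lists of review rows, dropping exact duplicate rows."""
    merged = []
    seen = set()
    for rows in tables:
        for row in rows:
            # A row counts as a duplicate only if every column value is identical.
            key = tuple(sorted(row.items()))
            if key not in seen:
                seen.add(key)
                merged.append(dict(row))
    # Order by page so the merged file reads in document order.
    merged.sort(key=lambda row: int(row["page"]))
    return merged


local_rows = [{"page": "2", "label": "PERSON", "xmin": "10"}]
custom_rows = [
    {"page": "1", "label": "CUSTOM", "xmin": "5"},
    {"page": "2", "label": "PERSON", "xmin": "10"},  # repeats the row in local_rows
]
for row in merge_review_rows(local_rows, custom_rows):
    print(row)
```

Rows that overlap on the page but differ in any column (for example a slightly different box size) are kept as separate redactions in this sketch, so a manual check of the merged file in the review tab is still worthwhile.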
870
+ ![Merging review files outputs in spreadsheet](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/merged_review_file_outputs_csv.PNG)
871
+
872
+ # Features for expert users/system administrators
873
+ This section covers features that require system administration access or command-line usage. These options are not enabled by default and are not available to users who only have access to the graphical user interface; they must be configured by your system administrator. They are typically used by system administrators or advanced users who need more control over the redaction process.
874
+
875
  ## Using AWS Textract and Comprehend when not running in an AWS environment
876
 
877
  AWS Textract and Comprehend give much better results for text extraction and document redaction than the local model options in the app. The most secure way to access them in the Redaction app is to run the app in a secure AWS environment with relevant permissions. Alternatively, you could run the app on your own system while logged in to AWS SSO with relevant permissions.
 
893
 
894
  Again, a lot can potentially go wrong with AWS solutions that are insecure, so before trying the above please consult with your AWS and data security teams.
895
 
896
+ ## Advanced OCR options (Hybrid OCR)
897
 
898
+ The app supports advanced OCR options that combine multiple OCR engines for improved accuracy. These options are not enabled by default but can be configured by your system administrator.
899
 
900
+ ### Available OCR models
901
 
902
+ - **Tesseract** (default): The standard OCR engine that works well for most documents
903
+ - **PaddleOCR**: More accurate for whole line text extraction, but word-level bounding boxes may be less precise
904
+ - **Hybrid**: Combines Tesseract and PaddleOCR - uses Tesseract for initial extraction, then PaddleOCR for re-extraction of low-confidence text
905
 
906
+ ### Enabling advanced OCR options
907
 
908
+ To enable these options, your system administrator needs to modify the configuration file (`config.py`) and set:
909
 
910
+ ```
911
+ SHOW_LOCAL_OCR_MODEL_OPTIONS = "True"
912
+ ```
913
 
914
+ Once enabled, users will see a "Change default local OCR model" section in the redaction settings where they can choose between:
915
+ - tesseract
916
+ - hybrid
917
+ - paddle
918
 
919
+ ### Hybrid OCR configuration
920
 
921
+ The hybrid OCR mode uses several configurable parameters:
922
 
923
+ - **HYBRID_OCR_CONFIDENCE_THRESHOLD** (default: 65): Tesseract confidence score below which PaddleOCR will be used for re-extraction
924
+ - **HYBRID_OCR_PADDING** (default: 1): Padding added to word bounding boxes before re-extraction
925
+ - **SAVE_EXAMPLE_TESSERACT_VS_PADDLE_IMAGES** (default: False): Save comparison images when using hybrid mode
926
+ - **SAVE_PADDLE_VISUALISATIONS** (default: False): Save images with PaddleOCR bounding boxes overlaid
927
+
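The hybrid flow described above can be sketched as follows, using the two documented defaults. Everything else is an assumption for illustration: the `Word` class, the `pad_box` and `hybrid_ocr` helpers, and the `reextract` callback (a stand-in for a second OCR engine) are hypothetical and do not reflect the app's internal code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

HYBRID_OCR_CONFIDENCE_THRESHOLD = 65  # documented default
HYBRID_OCR_PADDING = 1  # documented default

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Word:
    text: str
    confidence: float
    box: Box


def pad_box(box: Box, padding: int) -> Box:
    """Grow a bounding box by padding pixels on every side, clamped at zero."""
    left, top, right, bottom = box
    return (max(0, left - padding), max(0, top - padding), right + padding, bottom + padding)


def hybrid_ocr(
    words: List[Word],
    reextract: Callable[[Box], Tuple[str, float]],
    threshold: float = HYBRID_OCR_CONFIDENCE_THRESHOLD,
    padding: int = HYBRID_OCR_PADDING,
) -> List[Word]:
    """Keep confident first-pass words; re-extract the rest with a second engine."""
    results = []
    for word in words:
        if word.confidence < threshold:
            padded = pad_box(word.box, padding)
            text, confidence = reextract(padded)  # second engine reads the padded region
            results.append(Word(text, confidence, padded))
        else:
            results.append(word)
    return results


# Stand-in for the second engine: always returns the same reading.
def fake_second_engine(box: Box) -> Tuple[str, float]:
    return ("world", 88.0)


first_pass = [Word("Hello", 92.0, (10, 10, 50, 20)), Word("w0rld", 40.0, (55, 10, 95, 20))]
print(hybrid_ocr(first_pass, fake_second_engine))
```

Raising the threshold sends more words to the second engine, which is slower but may recover more low-quality text; the padding gives that engine a little context around each word.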
928
+ ### When to use different OCR models
929
+
930
+ - **Tesseract**: Best for general use, good balance of speed and accuracy
931
+ - **PaddleOCR**: Best for documents with clear, well-formatted text where line-level accuracy is more important than word-level precision
932
+ - **Hybrid**: Best for challenging documents where some text has low confidence scores, providing the benefits of both engines
933
 
934
+
935
+
936
+
937
+
938
+ ## Command Line Interface (CLI)
939
+
940
+ The app includes a comprehensive command-line interface (`cli_redact.py`) that allows you to perform redaction, deduplication, and AWS Textract operations directly from the terminal. This is particularly useful for batch processing, automation, and integration with other systems.
941
+
942
+ ### Getting started with the CLI
943
+
944
+ To use the CLI, you need to:
945
+
946
+ 1. Open a terminal window
947
+ 2. Navigate to the app folder containing `cli_redact.py`
948
+ 3. Activate your virtual environment (conda or venv)
949
+ 4. Run commands using `python cli_redact.py` followed by your options
950
+
951
+ ### Basic CLI syntax
952
+
953
+ ```bash
954
+ python cli_redact.py --task [redact|deduplicate|textract] --input_file [file_path] [additional_options]
955
+ ```
956
+
957
+ ### Redaction examples
958
+
959
+ **Basic PDF redaction with default settings:**
960
+ ```bash
961
+ python cli_redact.py --input_file example_data/example_of_emails_sent_to_a_professor_before_applying.pdf
962
+ ```
963
+
964
+ **Extract text without PII detection, redacting only the listed whole pages:**
965
+ ```bash
966
+ python cli_redact.py --input_file example_data/Partnership-Agreement-Toolkit_0_0.pdf --redact_whole_page_file example_data/partnership_toolkit_redact_some_pages.csv --pii_detector None
967
+ ```
968
+
969
+ **Redact with custom entities and allow list:**
970
+ ```bash
971
+ python cli_redact.py --input_file example_data/graduate-job-example-cover-letter.pdf --allow_list_file example_data/test_allow_list_graduate.csv --local_redact_entities TITLES PERSON DATE_TIME
972
+ ```
973
+
974
+ **Redact with fuzzy matching and custom deny list:**
975
+ ```bash
976
+ python cli_redact.py --input_file example_data/Partnership-Agreement-Toolkit_0_0.pdf --deny_list_file example_data/Partnership-Agreement-Toolkit_test_deny_list_para_single_spell.csv --local_redact_entities CUSTOM_FUZZY --page_min 1 --page_max 3 --fuzzy_mistakes 3
977
+ ```
978
+
979
+ **Redact with AWS services:**
980
+ ```bash
981
+ python cli_redact.py --input_file example_data/example_of_emails_sent_to_a_professor_before_applying.pdf --ocr_method "AWS Textract" --pii_detector "AWS Comprehend"
982
+ ```
983
+
984
+ **Redact specific pages with signature extraction:**
985
+ ```bash
986
+ python cli_redact.py --input_file example_data/Partnership-Agreement-Toolkit_0_0.pdf --page_min 6 --page_max 7 --ocr_method "AWS Textract" --handwrite_signature_extraction "Extract handwriting" "Extract signatures"
987
+ ```
988
+
989
+ ### Tabular data redaction
990
+
991
+ **Anonymize CSV file with specific columns:**
992
+ ```bash
993
+ python cli_redact.py --input_file example_data/combined_case_notes.csv --text_columns "Case Note" "Client" --anon_strategy replace_redacted
994
+ ```
995
+
996
+ **Anonymize Excel file:**
997
+ ```bash
998
+ python cli_redact.py --input_file example_data/combined_case_notes.xlsx --text_columns "Case Note" "Client" --excel_sheets combined_case_notes --anon_strategy redact
999
+ ```
1000
+
1001
+ **Anonymize Word document:**
1002
+ ```bash
1003
+ python cli_redact.py --input_file "example_data/Bold minimalist professional cover letter.docx" --anon_strategy replace_redacted
1004
+ ```
1005
+
1006
+ ### Duplicate detection
1007
+
1008
+ **Find duplicate pages in OCR files:**
1009
+ ```bash
1010
+ python cli_redact.py --task deduplicate --input_file example_data/example_outputs/doubled_output_joined.pdf_ocr_output.csv --duplicate_type pages --similarity_threshold 0.95
1011
+ ```
1012
+
1013
+ **Find duplicates at line level:**
1014
+ ```bash
1015
+ python cli_redact.py --task deduplicate --input_file example_data/example_outputs/doubled_output_joined.pdf_ocr_output.csv --duplicate_type pages --similarity_threshold 0.95 --combine_pages False --min_word_count 3
1016
+ ```
1017
+
1018
+ **Find duplicate rows in tabular data:**
1019
+ ```bash
1020
+ python cli_redact.py --task deduplicate --input_file example_data/Lambeth_2030-Our_Future_Our_Lambeth.pdf.csv --duplicate_type tabular --text_columns "text" --similarity_threshold 0.95
1021
+ ```
1022
+
1023
+ ### AWS Textract operations
1024
+
1025
+ **Submit document for analysis:**
1026
+ ```bash
1027
+ python cli_redact.py --task textract --textract_action submit --input_file example_data/example_of_emails_sent_to_a_professor_before_applying.pdf
1028
+ ```
1029
+
1030
+ **Submit with signature extraction:**
1031
+ ```bash
1032
+ python cli_redact.py --task textract --textract_action submit --input_file example_data/Partnership-Agreement-Toolkit_0_0.pdf --extract_signatures
1033
+ ```
1034
+
1035
+ **Retrieve results by job ID:**
1036
+ ```bash
1037
+ python cli_redact.py --task textract --textract_action retrieve --job_id 12345678-1234-1234-1234-123456789012
1038
+ ```
1039
+
1040
+ **List recent jobs:**
1041
+ ```bash
1042
+ python cli_redact.py --task textract --textract_action list
1043
+ ```
1044
+
1045
+ ### Common CLI options
1046
+
1047
+ - `--task`: Choose between "redact", "deduplicate", or "textract"
1048
+ - `--input_file`: Path to input file(s)
1049
+ - `--output_dir`: Directory for output files (default: output/)
1050
+ - `--page_min` / `--page_max`: Process only specific page range
1051
+ - `--ocr_method`: Choose text extraction method
1052
+ - `--pii_detector`: Choose PII detection method
1053
+ - `--local_redact_entities`: Specify local entities to redact
1054
+ - `--allow_list_file` / `--deny_list_file`: Custom lists
1055
+ - `--redact_whole_page_file`: List of pages to redact completely
1056
+ - `--fuzzy_mistakes`: Number of spelling mistakes allowed in fuzzy matching
1057
+ - `--similarity_threshold`: Threshold for duplicate detection
1058
+ - `--anon_strategy`: Anonymization strategy for tabular data
1059
+
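For batch processing, the options above can be assembled programmatically. The sketch below builds the argument list for one redaction run and loops over the PDFs in a folder. It uses only flags listed in this guide (`--task`, `--input_file`, `--output_dir`, `--page_min`, `--page_max`); the helper names are hypothetical, and it assumes the script is run from the folder containing `cli_redact.py`.

```python
import subprocess
import sys
from pathlib import Path


def build_redact_command(input_file, output_dir="output/", page_min=None, page_max=None):
    """Return the argument list for a single cli_redact.py redaction run."""
    command = [
        sys.executable,  # the Python interpreter of the active environment
        "cli_redact.py",
        "--task", "redact",
        "--input_file", str(input_file),
        "--output_dir", str(output_dir),
    ]
    if page_min is not None:
        command += ["--page_min", str(page_min)]
    if page_max is not None:
        command += ["--page_max", str(page_max)]
    return command


def redact_folder(folder, **options):
    """Run the CLI once per PDF in folder, stopping at the first failure."""
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        subprocess.run(build_redact_command(pdf_path, **options), check=True)


# Show the command that would be run for one file, without executing it.
print(" ".join(build_redact_command("example_data/Partnership-Agreement-Toolkit_0_0.pdf", page_min=1, page_max=3)))
```

Passing the arguments as a list rather than a single string avoids quoting problems with file names that contain spaces, such as the Word document example above.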
1060
+ ### Output files
1061
+
1062
+ The CLI generates the same output files as the GUI:
1063
+ - `...redacted.pdf`: Final redacted document
1064
+ - `...redactions_for_review.pdf`: Document with redaction boxes for review
1065
+ - `...review_file.csv`: Detailed redaction information
1066
+ - `..._ocr_output.csv`: Extracted text results
1067
+ - `..._textract.json`: AWS Textract results (if applicable)
1068
+
1069
+ For more advanced options and configuration, refer to the help text by running:
1070
+ ```bash
1071
+ python cli_redact.py --help
1072
+ ```
app.py CHANGED
@@ -22,14 +22,12 @@ from tools.config import (
22
  CHOSEN_LOCAL_OCR_MODEL,
23
  CHOSEN_REDACT_ENTITIES,
24
  COGNITO_AUTH,
25
- COMPRESS_REDACTED_PDF,
26
  CONFIG_FOLDER,
27
  COST_CODES_PATH,
28
  CSV_ACCESS_LOG_HEADERS,
29
  CSV_FEEDBACK_LOG_HEADERS,
30
  CSV_USAGE_LOG_HEADERS,
31
  CUSTOM_BOX_COLOUR,
32
- DEFAULT_COMBINE_PAGES,
33
  DEFAULT_CONCURRENCY_LIMIT,
34
  DEFAULT_COST_CODE,
35
  DEFAULT_DUPLICATE_DETECTION_THRESHOLD,
@@ -48,11 +46,37 @@ from tools.config import (
48
  DEFAULT_TEXT_COLUMNS,
49
  DEFAULT_TEXT_EXTRACTION_MODEL,
50
  DENY_LIST_PATH,
 
 
 
 
51
  DIRECT_MODE_DEFAULT_USER,
52
  DIRECT_MODE_DUPLICATE_TYPE,
 
 
 
 
 
 
 
53
  DIRECT_MODE_INPUT_FILE,
 
 
 
 
 
 
 
54
  DIRECT_MODE_OUTPUT_DIR,
 
 
 
 
 
 
 
55
  DIRECT_MODE_TASK,
 
56
  DISPLAY_FILE_NAMES_IN_LOGS,
57
  DO_INITIAL_TABULAR_DATA_CLEAN,
58
  DOCUMENT_REDACTION_BUCKET,
@@ -76,11 +100,9 @@ from tools.config import (
76
  GRADIO_TEMP_DIR,
77
  HANDWRITE_SIGNATURE_TEXTBOX_FULL_OPTIONS,
78
  HOST_NAME,
79
- IMAGES_DPI,
80
  INPUT_FOLDER,
81
  LOAD_PREVIOUS_TEXTRACT_JOBS_S3,
82
  LOCAL_OCR_MODEL_OPTIONS,
83
- LOCAL_PII_OPTION,
84
  LOG_FILE_NAME,
85
  MAX_FILE_SIZE,
86
  MAX_OPEN_TEXT_CHARACTERS,
@@ -91,38 +113,10 @@ from tools.config import (
91
  OUTPUT_FOLDER,
92
  PADDLE_MODEL_PATH,
93
  PII_DETECTION_MODELS,
94
- PREPROCESS_LOCAL_OCR_IMAGES,
95
  REMOVE_DUPLICATE_ROWS,
96
- RETURN_REDACTED_PDF,
97
  ROOT_PATH,
98
  RUN_AWS_FUNCTIONS,
99
  RUN_DIRECT_MODE,
100
- # Additional direct mode configuration options
101
- DIRECT_MODE_LANGUAGE,
102
- DIRECT_MODE_PII_DETECTOR,
103
- DIRECT_MODE_OCR_METHOD,
104
- DIRECT_MODE_PAGE_MIN,
105
- DIRECT_MODE_PAGE_MAX,
106
- DIRECT_MODE_IMAGES_DPI,
107
- DIRECT_MODE_CHOSEN_LOCAL_OCR_MODEL,
108
- DIRECT_MODE_PREPROCESS_LOCAL_OCR_IMAGES,
109
- DIRECT_MODE_COMPRESS_REDACTED_PDF,
110
- DIRECT_MODE_RETURN_PDF_END_OF_REDACTION,
111
- DIRECT_MODE_EXTRACT_FORMS,
112
- DIRECT_MODE_EXTRACT_TABLES,
113
- DIRECT_MODE_EXTRACT_LAYOUT,
114
- DIRECT_MODE_EXTRACT_SIGNATURES,
115
- DIRECT_MODE_MATCH_FUZZY_WHOLE_PHRASE_BOOL,
116
- DIRECT_MODE_ANON_STRATEGY,
117
- DIRECT_MODE_FUZZY_MISTAKES,
118
- DIRECT_MODE_SIMILARITY_THRESHOLD,
119
- DIRECT_MODE_MIN_WORD_COUNT,
120
- DIRECT_MODE_MIN_CONSECUTIVE_PAGES,
121
- DIRECT_MODE_GREEDY_MATCH,
122
- DIRECT_MODE_COMBINE_PAGES,
123
- DIRECT_MODE_REMOVE_DUPLICATE_ROWS,
124
- DIRECT_MODE_TEXTRACT_ACTION,
125
- DIRECT_MODE_JOB_ID,
126
  RUN_FASTAPI,
127
  S3_ACCESS_LOGS_FOLDER,
128
  S3_ALLOW_LIST_PATH,
@@ -141,7 +135,6 @@ from tools.config import (
141
  SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS,
142
  SPACY_MODEL_PATH,
143
  TABULAR_PII_DETECTION_MODELS,
144
- TESSERACT_TEXT_EXTRACT_OPTION,
145
  TEXT_EXTRACTION_MODELS,
146
  TEXTRACT_JOBS_LOCAL_LOC,
147
  TEXTRACT_JOBS_S3_INPUT_LOC,
@@ -1045,7 +1038,7 @@ with blocks:
1045
  with gr.Tab("Redact PDFs/images"):
1046
 
1047
  # Examples for PDF/image redaction
1048
- if SHOW_EXAMPLES is True:
1049
  gr.Markdown(
1050
  "### Try an example - Click on an example below and then the 'Extract text and redact document' button:"
1051
  )
@@ -1834,7 +1827,7 @@ with blocks:
1834
  )
1835
 
1836
  # Examples for duplicate page detection
1837
- if SHOW_EXAMPLES == "True":
1838
  gr.Markdown(
1839
  "### Try an example - Click on an example below and then the 'Identify duplicate pages/subdocuments' button:"
1840
  )
@@ -1989,7 +1982,7 @@ with blocks:
1989
  )
1990
 
1991
  # Examples for Word/Excel/csv redaction and tabular duplicate detection
1992
- if SHOW_EXAMPLES == "True":
1993
  gr.Markdown(
1994
  "### Try an example - Click on an example below and then the 'Redact text/data files' button for redaction, or the 'Find duplicate cells/rows' button for duplicate detection:"
1995
  )
@@ -6578,14 +6571,11 @@ with blocks:
6578
  "extract_layout": DIRECT_MODE_EXTRACT_LAYOUT,
6579
  "extract_signatures": DIRECT_MODE_EXTRACT_SIGNATURES,
6580
  "match_fuzzy_whole_phrase_bool": DIRECT_MODE_MATCH_FUZZY_WHOLE_PHRASE_BOOL,
6581
-
6582
  # Word/Tabular Anonymisation Arguments
6583
-
6584
  "anon_strategy": DIRECT_MODE_ANON_STRATEGY,
6585
  "text_columns": DEFAULT_TEXT_COLUMNS,
6586
  "excel_sheets": DEFAULT_EXCEL_SHEETS,
6587
  "fuzzy_mistakes": DIRECT_MODE_FUZZY_MISTAKES,
6588
-
6589
  # Duplicate Detection Arguments
6590
  "duplicate_type": DIRECT_MODE_DUPLICATE_TYPE,
6591
  "similarity_threshold": DIRECT_MODE_SIMILARITY_THRESHOLD,
@@ -6594,10 +6584,9 @@ with blocks:
6594
  "greedy_match": DIRECT_MODE_GREEDY_MATCH,
6595
  "combine_pages": DIRECT_MODE_COMBINE_PAGES,
6596
  "remove_duplicate_rows": DIRECT_MODE_REMOVE_DUPLICATE_ROWS,
6597
-
6598
  # Textract Batch Operations Arguments
6599
  "textract_action": DIRECT_MODE_TEXTRACT_ACTION,
6600
- "job_id": DIRECT_MODE_JOB_ID,
6601
  "textract_bucket": TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET,
6602
  "textract_input_prefix": TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER,
6603
  "textract_output_prefix": TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_OUTPUT_SUBFOLDER,
 
22
  CHOSEN_LOCAL_OCR_MODEL,
23
  CHOSEN_REDACT_ENTITIES,
24
  COGNITO_AUTH,
 
25
  CONFIG_FOLDER,
26
  COST_CODES_PATH,
27
  CSV_ACCESS_LOG_HEADERS,
28
  CSV_FEEDBACK_LOG_HEADERS,
29
  CSV_USAGE_LOG_HEADERS,
30
  CUSTOM_BOX_COLOUR,
 
31
  DEFAULT_CONCURRENCY_LIMIT,
32
  DEFAULT_COST_CODE,
33
  DEFAULT_DUPLICATE_DETECTION_THRESHOLD,
 
46
  DEFAULT_TEXT_COLUMNS,
47
  DEFAULT_TEXT_EXTRACTION_MODEL,
48
  DENY_LIST_PATH,
49
+ DIRECT_MODE_ANON_STRATEGY,
50
+ DIRECT_MODE_CHOSEN_LOCAL_OCR_MODEL,
51
+ DIRECT_MODE_COMBINE_PAGES,
52
+ DIRECT_MODE_COMPRESS_REDACTED_PDF,
53
  DIRECT_MODE_DEFAULT_USER,
54
  DIRECT_MODE_DUPLICATE_TYPE,
55
+ DIRECT_MODE_EXTRACT_FORMS,
56
+ DIRECT_MODE_EXTRACT_LAYOUT,
57
+ DIRECT_MODE_EXTRACT_SIGNATURES,
58
+ DIRECT_MODE_EXTRACT_TABLES,
59
+ DIRECT_MODE_FUZZY_MISTAKES,
60
+ DIRECT_MODE_GREEDY_MATCH,
61
+ DIRECT_MODE_IMAGES_DPI,
62
  DIRECT_MODE_INPUT_FILE,
63
+ DIRECT_MODE_JOB_ID,
64
+ # Additional direct mode configuration options
65
+ DIRECT_MODE_LANGUAGE,
66
+ DIRECT_MODE_MATCH_FUZZY_WHOLE_PHRASE_BOOL,
67
+ DIRECT_MODE_MIN_CONSECUTIVE_PAGES,
68
+ DIRECT_MODE_MIN_WORD_COUNT,
69
+ DIRECT_MODE_OCR_METHOD,
70
  DIRECT_MODE_OUTPUT_DIR,
71
+ DIRECT_MODE_PAGE_MAX,
72
+ DIRECT_MODE_PAGE_MIN,
73
+ DIRECT_MODE_PII_DETECTOR,
74
+ DIRECT_MODE_PREPROCESS_LOCAL_OCR_IMAGES,
75
+ DIRECT_MODE_REMOVE_DUPLICATE_ROWS,
76
+ DIRECT_MODE_RETURN_PDF_END_OF_REDACTION,
77
+ DIRECT_MODE_SIMILARITY_THRESHOLD,
78
  DIRECT_MODE_TASK,
79
+ DIRECT_MODE_TEXTRACT_ACTION,
80
  DISPLAY_FILE_NAMES_IN_LOGS,
81
  DO_INITIAL_TABULAR_DATA_CLEAN,
82
  DOCUMENT_REDACTION_BUCKET,
 
100
  GRADIO_TEMP_DIR,
101
  HANDWRITE_SIGNATURE_TEXTBOX_FULL_OPTIONS,
102
  HOST_NAME,
 
103
  INPUT_FOLDER,
104
  LOAD_PREVIOUS_TEXTRACT_JOBS_S3,
105
  LOCAL_OCR_MODEL_OPTIONS,
 
106
  LOG_FILE_NAME,
107
  MAX_FILE_SIZE,
108
  MAX_OPEN_TEXT_CHARACTERS,
 
113
  OUTPUT_FOLDER,
114
  PADDLE_MODEL_PATH,
115
  PII_DETECTION_MODELS,
 
116
  REMOVE_DUPLICATE_ROWS,
 
117
  ROOT_PATH,
118
  RUN_AWS_FUNCTIONS,
119
  RUN_DIRECT_MODE,
 
120
  RUN_FASTAPI,
121
  S3_ACCESS_LOGS_FOLDER,
122
  S3_ALLOW_LIST_PATH,
 
135
  SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS,
136
  SPACY_MODEL_PATH,
137
  TABULAR_PII_DETECTION_MODELS,
 
138
  TEXT_EXTRACTION_MODELS,
139
  TEXTRACT_JOBS_LOCAL_LOC,
140
  TEXTRACT_JOBS_S3_INPUT_LOC,
 
1038
  with gr.Tab("Redact PDFs/images"):
1039
 
1040
  # Examples for PDF/image redaction
1041
+ if SHOW_EXAMPLES:
1042
  gr.Markdown(
1043
  "### Try an example - Click on an example below and then the 'Extract text and redact document' button:"
1044
  )
 
1827
  )
1828
 
1829
  # Examples for duplicate page detection
1830
+ if SHOW_EXAMPLES:
1831
  gr.Markdown(
1832
  "### Try an example - Click on an example below and then the 'Identify duplicate pages/subdocuments' button:"
1833
  )
 
1982
  )
1983
 
1984
  # Examples for Word/Excel/csv redaction and tabular duplicate detection
1985
+ if SHOW_EXAMPLES:
1986
  gr.Markdown(
1987
  "### Try an example - Click on an example below and then the 'Redact text/data files' button for redaction, or the 'Find duplicate cells/rows' button for duplicate detection:"
1988
  )
 
6571
  "extract_layout": DIRECT_MODE_EXTRACT_LAYOUT,
6572
  "extract_signatures": DIRECT_MODE_EXTRACT_SIGNATURES,
6573
  "match_fuzzy_whole_phrase_bool": DIRECT_MODE_MATCH_FUZZY_WHOLE_PHRASE_BOOL,
 
6574
  # Word/Tabular Anonymisation Arguments
 
6575
  "anon_strategy": DIRECT_MODE_ANON_STRATEGY,
6576
  "text_columns": DEFAULT_TEXT_COLUMNS,
6577
  "excel_sheets": DEFAULT_EXCEL_SHEETS,
6578
  "fuzzy_mistakes": DIRECT_MODE_FUZZY_MISTAKES,
 
6579
  # Duplicate Detection Arguments
6580
  "duplicate_type": DIRECT_MODE_DUPLICATE_TYPE,
6581
  "similarity_threshold": DIRECT_MODE_SIMILARITY_THRESHOLD,
 
6584
  "greedy_match": DIRECT_MODE_GREEDY_MATCH,
6585
  "combine_pages": DIRECT_MODE_COMBINE_PAGES,
6586
  "remove_duplicate_rows": DIRECT_MODE_REMOVE_DUPLICATE_ROWS,
 
6587
  # Textract Batch Operations Arguments
6588
  "textract_action": DIRECT_MODE_TEXTRACT_ACTION,
6589
+ "job_id": DIRECT_MODE_JOB_ID,
6590
  "textract_bucket": TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET,
6591
  "textract_input_prefix": TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER,
6592
  "textract_output_prefix": TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_OUTPUT_SUBFOLDER,
lambda_entrypoint.py CHANGED
@@ -15,11 +15,11 @@ from tools.config import (
15
  DEFAULT_PAGE_MAX,
16
  DEFAULT_PAGE_MIN,
17
  IMAGES_DPI,
18
- LAMBDA_POLL_INTERVAL,
 
19
  LAMBDA_MAX_POLL_ATTEMPTS,
 
20
  LAMBDA_PREPARE_IMAGES,
21
- LAMBDA_EXTRACT_SIGNATURES,
22
- LAMBDA_DEFAULT_USERNAME,
23
  )
24
 
25
 
@@ -532,7 +532,9 @@ def lambda_handler(event, context):
532
  os.getenv("TEXTRACT_JOBS_LOCAL_LOC", ""),
533
  ),
534
  "poll_interval": int(arguments.get("poll_interval", LAMBDA_POLL_INTERVAL)),
535
- "max_poll_attempts": int(arguments.get("max_poll_attempts", LAMBDA_MAX_POLL_ATTEMPTS)),
 
 
536
  # Additional arguments that were missing
537
  "search_query": arguments.get(
538
  "search_query", os.getenv("DEFAULT_SEARCH_QUERY", "")
 
15
  DEFAULT_PAGE_MAX,
16
  DEFAULT_PAGE_MIN,
17
  IMAGES_DPI,
18
+ LAMBDA_DEFAULT_USERNAME,
19
+ LAMBDA_EXTRACT_SIGNATURES,
20
  LAMBDA_MAX_POLL_ATTEMPTS,
21
+ LAMBDA_POLL_INTERVAL,
22
  LAMBDA_PREPARE_IMAGES,
 
 
23
  )
24
 
25
 
 
532
  os.getenv("TEXTRACT_JOBS_LOCAL_LOC", ""),
533
  ),
534
  "poll_interval": int(arguments.get("poll_interval", LAMBDA_POLL_INTERVAL)),
535
+ "max_poll_attempts": int(
536
+ arguments.get("max_poll_attempts", LAMBDA_MAX_POLL_ATTEMPTS)
537
+ ),
538
  # Additional arguments that were missing
539
  "search_query": arguments.get(
540
  "search_query", os.getenv("DEFAULT_SEARCH_QUERY", "")
pyproject.toml CHANGED
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
4
 
5
  [project]
6
  name = "doc_redaction"
7
- version = "1.4.0"
8
  description = "Redact PDF/image-based documents, Word, or CSV/XLSX files using a Gradio-based GUI interface"
9
  readme = "README.md"
10
  requires-python = ">=3.10"
 
4
 
5
  [project]
6
  name = "doc_redaction"
7
+ version = "1.4.1"
8
  description = "Redact PDF/image-based documents, Word, or CSV/XLSX files using a Gradio-based GUI interface"
9
  readme = "README.md"
10
  requires-python = ">=3.10"
src/app_settings.qmd CHANGED
@@ -2,529 +2,573 @@
2
  title: "App settings management guide"
3
  format:
4
  html:
5
- toc: true # Enable the table of contents
6
- toc-depth: 3 # Include headings up to level 2 (##)
7
- toc-title: "On this page" # Optional: Title for your TOC
8
  ---
9
 
10
- Settings for the redaction app can be set from outside by changing values in the `config.env` file stored in your local config folder, or in S3 if running on AWS. This guide provides an overview of how to configure the application using environment variables. The application loads configurations using `os.environ.get()`. It first attempts to load variables from the file specified by `APP_CONFIG_PATH` (which defaults to `config/app_config.env`). If `AWS_CONFIG_PATH` is also set (e.g., to `config/aws_config.env`), variables are loaded from that file as well. Environment variables set directly in the system will always take precedence over those defined in these `.env` files.
11
 
12
- ## App Configuration File (config.env)
13
 
14
  This section details variables related to the main application configuration file.
15
 
16
- * **`APP_CONFIG_PATH`**
17
- * **Description:** Specifies the path to the application configuration `.env` file. This file contains various settings that control the application's behavior.
18
- * **Default Value:** `config/app_config.env`
19
- * **Configuration:** Set as an environment variable directly. This variable defines where to load other application configurations, so it cannot be set within `config/app_config.env` itself.
 
 
 
20
 
21
  ## AWS Options
22
 
23
  This section covers configurations related to AWS services used by the application.
24
 
25
- * **`AWS_CONFIG_PATH`**
26
- * **Description:** Specifies the path to the AWS configuration `.env` file. This file is intended to store AWS credentials and specific settings.
27
- * **Default Value:** `''` (empty string)
28
- * **Configuration:** Set as an environment variable directly. This variable defines an additional source for AWS-specific configurations.
29
-
30
- * **`RUN_AWS_FUNCTIONS`**
31
- * **Description:** Enables or disables AWS-specific functionalities within the application. Set to `"True"` to enable and `"False"` to disable.
32
- * **Default Value:** `"False"`
33
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
34
-
35
- * **`AWS_REGION`**
36
- * **Description:** Defines the AWS region where services like S3, Cognito, and Textract are located.
37
- * **Default Value:** `''`
38
- * **Configuration:** Set as an environment variable directly, or include in `config/aws_config.env` (if `AWS_CONFIG_PATH` is configured).
39
-
40
- * **`AWS_CLIENT_ID`**
41
- * **Description:** The client ID for AWS Cognito, used for user authentication.
42
- * **Default Value:** `''`
43
- * **Configuration:** Set as an environment variable directly, or include in `config/aws_config.env` (if `AWS_CONFIG_PATH` is configured).
44
-
45
- * **`AWS_CLIENT_SECRET`**
46
- * **Description:** The client secret for AWS Cognito, used in conjunction with the client ID for authentication.
47
- * **Default Value:** `''`
48
- * **Configuration:** Set as an environment variable directly, or include in `config/aws_config.env` (if `AWS_CONFIG_PATH` is configured).
49
-
50
- * **`AWS_USER_POOL_ID`**
51
- * **Description:** The user pool ID for AWS Cognito, identifying the user directory.
52
- * **Default Value:** `''`
53
- * **Configuration:** Set as an environment variable directly, or include in `config/aws_config.env` (if `AWS_CONFIG_PATH` is configured).
54
-
55
- * **`AWS_ACCESS_KEY`**
56
- * **Description:** The AWS access key ID for programmatic access to AWS services.
57
- * **Default Value:** `''` (Note: Often found in the environment or AWS credentials file.)
58
- * **Configuration:** Set as an environment variable directly, or include in `config/aws_config.env` (if `AWS_CONFIG_PATH` is configured). It's also commonly configured via shared AWS credentials files or IAM roles.
59
-
60
- * **`AWS_SECRET_KEY`**
61
- * **Description:** The AWS secret access key corresponding to the AWS access key ID.
62
- * **Default Value:** `''` (Note: Often found in the environment or AWS credentials file.)
63
- * **Configuration:** Set as an environment variable directly, or include in `config/aws_config.env` (if `AWS_CONFIG_PATH` is configured). It's also commonly configured via shared AWS credentials files or IAM roles.
64
-
65
- * **`DOCUMENT_REDACTION_BUCKET`**
66
- * **Description:** The name of the S3 bucket used for storing documents related to the redaction process.
67
- * **Default Value:** `''`
68
- * **Configuration:** Set as an environment variable directly, or include in `config/aws_config.env` (if `AWS_CONFIG_PATH` is configured).
69
-
70
- * **`CUSTOM_HEADER`**
71
- * **Description:** Specifies a custom header name to be included in requests, often used for services like AWS CloudFront.
72
- * **Default Value:** `''`
73
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
74
-
75
- * **`CUSTOM_HEADER_VALUE`**
76
- * **Description:** The value for the custom header specified by `CUSTOM_HEADER`.
77
- * **Default Value:** `''`
78
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
79
 
80
  ## Image Options
81
 
82
  Settings related to image processing within the application.
83
 
84
- * **`IMAGES_DPI`**
85
- * **Description:** Dots Per Inch (DPI) setting for image processing, affecting the resolution and quality of processed images.
86
- * **Default Value:** `'300.0'`
87
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
88
 
89
- * **`LOAD_TRUNCATED_IMAGES`**
90
- * **Description:** Controls whether the application attempts to load truncated images. Set to `'True'` to enable.
91
- * **Default Value:** `'True'`
92
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
93
 
94
- * **`MAX_IMAGE_PIXELS`**
95
- * **Description:** Sets the maximum number of pixels for an image that the application will process. Leave blank for no limit. This can help prevent issues with very large images.
96
- * **Default Value:** `''`
97
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
98
 
99
  ## File I/O Options
100
 
101
  Configuration for input and output file handling.
102
 
103
- * **`SESSION_OUTPUT_FOLDER`**
104
- * **Description:** If set to `'True'`, the application will save output and input files into session-specific subfolders, helping to organise files from different user sessions.
105
- * **Default Value:** `'False'`
106
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
107
 
108
- * **`GRADIO_OUTPUT_FOLDER`** (aliased as `OUTPUT_FOLDER`)
109
- * **Description:** Specifies the default output folder for files generated by Gradio components. Can be set to "TEMP" to use a temporary directory.
110
- * **Default Value:** `'output/'`
111
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
112
 
113
- * **`GRADIO_INPUT_FOLDER`** (aliased as `INPUT_FOLDER`)
114
- * **Description:** Specifies the default input folder for files used by Gradio components. Can be set to "TEMP" to use a temporary directory.
115
- * **Default Value:** `'input/'`
116
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
117
 
118
- * **`GRADIO_TEMP_DIR`**
119
- * **Description:** Defines the path for Gradio's temporary file storage.
120
- * **Default Value:** `'tmp/gradio_tmp/'`
121
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
122
 
123
- * **`MPLCONFIGDIR`**
124
- * **Description:** Specifies the cache directory for the Matplotlib library, which is used for plotting and image handling.
125
- * **Default Value:** `'tmp/matplotlib_cache/'`
126
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
127
 
128
  ## Logging Options
129
 
130
- Settings for configuring application logging, including log formats and storage locations.
131
-
132
- * **`SAVE_LOGS_TO_CSV`**
133
- * **Description:** Enables or disables saving logs to CSV files. Set to `'True'` to enable.
134
- * **Default Value:** `'True'`
135
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
136
-
137
- * **`USE_LOG_SUBFOLDERS`**
138
- * **Description:** If enabled (`'True'`), logs will be stored in subfolders based on date and hostname, aiding in log organisation.
139
- * **Default Value:** `'True'`
140
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
141
-
142
- * **`FEEDBACK_LOGS_FOLDER`**
143
- * **Description:** Specifies the base folder for storing feedback logs. If `USE_LOG_SUBFOLDERS` is true, date/hostname subfolders will be created within this folder.
144
- * **Default Value:** `'feedback/'`
145
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
146
-
147
- * **`ACCESS_LOGS_FOLDER`**
148
- * **Description:** Specifies the base folder for storing access logs. If `USE_LOG_SUBFOLDERS` is true, date/hostname subfolders will be created within this folder.
149
- * **Default Value:** `'logs/'`
150
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
151
-
152
- * **`USAGE_LOGS_FOLDER`**
153
- * **Description:** Specifies the base folder for storing usage logs. If `USE_LOG_SUBFOLDERS` is true, date/hostname subfolders will be created within this folder.
154
- * **Default Value:** `'usage/'`
155
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
156
-
157
- * **`DISPLAY_FILE_NAMES_IN_LOGS`**
158
- * **Description:** If set to `'True'`, file names will be included in the log entries.
159
- * **Default Value:** `'False'`
160
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
161
-
162
- * **`CSV_ACCESS_LOG_HEADERS`**
163
- * **Description:** Defines custom headers for CSV access logs. If left blank, component labels will be used as headers.
164
- * **Default Value:** `''`
165
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
166
-
167
- * **`CSV_FEEDBACK_LOG_HEADERS`**
168
- * **Description:** Defines custom headers for CSV feedback logs. If left blank, component labels will be used as headers.
169
- * **Default Value:** `''`
170
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
171
-
172
- * **`CSV_USAGE_LOG_HEADERS`**
173
- * **Description:** Defines custom headers for CSV usage logs.
174
- * **Default Value:** A predefined list of header names. Refer to `config.py` for the complete list.
175
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
176
-
177
- * **`SAVE_LOGS_TO_DYNAMODB`**
178
- * **Description:** Enables or disables saving logs to AWS DynamoDB. Set to `'True'` to enable. Requires appropriate AWS setup.
179
- * **Default Value:** `'False'`
180
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
181
-
182
- * **`ACCESS_LOG_DYNAMODB_TABLE_NAME`**
183
- * **Description:** The name of the DynamoDB table used for storing access logs.
184
- * **Default Value:** `'redaction_access_log'`
185
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
186
-
187
- * **`DYNAMODB_ACCESS_LOG_HEADERS`**
188
- * **Description:** Specifies the headers (attributes) for the DynamoDB access log table.
189
- * **Default Value:** `''`
190
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
191
-
192
- * **`FEEDBACK_LOG_DYNAMODB_TABLE_NAME`**
193
- * **Description:** The name of the DynamoDB table used for storing feedback logs.
194
- * **Default Value:** `'redaction_feedback'`
195
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
196
-
197
- * **`DYNAMODB_FEEDBACK_LOG_HEADERS`**
198
- * **Description:** Specifies the headers (attributes) for the DynamoDB feedback log table.
199
- * **Default Value:** `''`
200
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
201
-
202
- * **`USAGE_LOG_DYNAMODB_TABLE_NAME`**
203
- * **Description:** The name of the DynamoDB table used for storing usage logs.
204
- * **Default Value:** `'redaction_usage'`
205
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
206
-
207
- * **`DYNAMODB_USAGE_LOG_HEADERS`**
208
- * **Description:** Specifies the headers (attributes) for the DynamoDB usage log table.
209
- * **Default Value:** `''`
210
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
211
-
212
- * **`LOGGING`**
213
- * **Description:** Enables or disables general console logging. Set to `'True'` to enable.
214
- * **Default Value:** `'False'`
215
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
216
-
217
- * **`LOG_FILE_NAME`**
218
- * **Description:** Specifies the name for the CSV log file if `SAVE_LOGS_TO_CSV` is enabled.
219
- * **Default Value:** `'log.csv'`
220
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
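The interaction between `USE_LOG_SUBFOLDERS` and the three log folder settings can be illustrated with a minimal sketch. The helper name `resolve_log_folder` and the exact subfolder layout (`<base>/<YYYYMMDD>/<hostname>`) are assumptions made for illustration; the guide only states that subfolders are based on date and hostname, so check `config.py` for the real layout.

```python
import os
import socket
from datetime import date


def resolve_log_folder(base_folder: str) -> str:
    """Return the folder logs would be written to (hypothetical helper)."""
    # Settings arrive as strings, so compare against the literal "True".
    use_subfolders = os.environ.get("USE_LOG_SUBFOLDERS", "True") == "True"
    if not use_subfolders:
        return base_folder
    # Assumed layout: base folder, then date, then hostname.
    return os.path.join(
        base_folder, date.today().strftime("%Y%m%d"), socket.gethostname()
    )


# Defaults taken from the settings listed above.
usage_folder = resolve_log_folder(os.environ.get("USAGE_LOGS_FOLDER", "usage/"))
access_folder = resolve_log_folder(os.environ.get("ACCESS_LOGS_FOLDER", "logs/"))
```

With `USE_LOG_SUBFOLDERS` set to `'False'`, the base folder is returned unchanged, which keeps all logs of one type in a single directory.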
221
-
222
- ## Redaction Options
223
-
224
- Configurations related to the text redaction process, including PII detection models and external tool paths.
225
-
226
- * **`TESSERACT_FOLDER`**
227
- * **Description:** Path to the local Tesseract OCR installation folder. Only required if Tesseract is not in the system's PATH, or when running a packaged executable (e.g., via PyInstaller).
228
- * **Default Value:** `""` (empty string)
229
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
230
-
231
- * **`TESSERACT_DATA_FOLDER`**
232
- * **Description:** Path to the Tesseract trained data files (e.g., `tessdata`).
233
- * **Default Value:** `"/usr/share/tessdata"`
234
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
235
-
236
- * **`POPPLER_FOLDER`**
237
- * **Description:** Path to the local Poppler installation's `bin` folder. Poppler is used for PDF processing. Only required if Poppler is not in the system's PATH.
238
- * **Default Value:** `""` (empty string)
239
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
240
-
241
- * **`SELECTABLE_TEXT_EXTRACT_OPTION`**
242
- * **Description:** Display name in the UI for the text extraction method that processes selectable text directly from PDFs.
243
- * **Default Value:** `"Local model - selectable text"`
244
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
245
-
246
- * **`TESSERACT_TEXT_EXTRACT_OPTION`**
247
- * **Description:** Display name in the UI for the text extraction method using local Tesseract OCR (for PDFs without selectable text).
248
- * **Default Value:** `"Local OCR model - PDFs without selectable text"`
249
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
250
-
251
- * **`TEXTRACT_TEXT_EXTRACT_OPTION`**
252
- * **Description:** Display name in the UI for the text extraction method using AWS Textract service.
253
- * **Default Value:** `"AWS Textract service - all PDF types"`
254
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
255
-
256
- * **`NO_REDACTION_PII_OPTION`**
257
- * **Description:** Display name in the UI for the option to only extract text without performing any PII detection or redaction.
258
- * **Default Value:** `"Only extract text (no redaction)"`
259
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
260
-
261
- * **`LOCAL_PII_OPTION`**
262
- * **Description:** Display name in the UI for the PII detection method using a local model.
263
- * **Default Value:** `"Local"`
264
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
265
-
266
- * **`AWS_PII_OPTION`**
267
- * **Description:** Display name in the UI for the PII detection method using AWS Comprehend.
268
- * **Default Value:** `"AWS Comprehend"`
269
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
270
-
271
- * **`SHOW_LOCAL_TEXT_EXTRACTION_OPTIONS`**
272
- * **Description:** Controls whether local text extraction options (selectable text, Tesseract) are shown in the UI. Set to `'True'` to show.
273
- * **Default Value:** `'True'`
274
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
275
-
276
- * **`SHOW_AWS_TEXT_EXTRACTION_OPTIONS`**
277
- * **Description:** Controls whether AWS Textract text extraction option is shown in the UI. Set to `'True'` to show.
278
- * **Default Value:** `'True'`
279
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
280
-
281
- * **`DEFAULT_TEXT_EXTRACTION_MODEL`**
282
- * **Description:** Sets the default text extraction model selected in the UI. Defaults to `TEXTRACT_TEXT_EXTRACT_OPTION` if AWS options are shown; otherwise, defaults to `SELECTABLE_TEXT_EXTRACT_OPTION`.
283
- * **Default Value:** Value of `TEXTRACT_TEXT_EXTRACT_OPTION` if `SHOW_AWS_TEXT_EXTRACTION_OPTIONS` is True, else value of `SELECTABLE_TEXT_EXTRACT_OPTION`.
284
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`. Provide one of the text extraction option display names.
285
-
286
- * **`SHOW_LOCAL_PII_DETECTION_OPTIONS`**
287
- * **Description:** Controls whether the local PII detection option is shown in the UI. Set to `'True'` to show.
288
- * **Default Value:** `'True'`
289
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
290
-
291
- * **`SHOW_AWS_PII_DETECTION_OPTIONS`**
292
- * **Description:** Controls whether the AWS Comprehend PII detection option is shown in the UI. Set to `'True'` to show.
293
- * **Default Value:** `'True'`
294
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
295
-
296
- * **`DEFAULT_PII_DETECTION_MODEL`**
297
- * **Description:** Sets the default PII detection model selected in the UI. Defaults to `AWS_PII_OPTION` if AWS options are shown; otherwise, defaults to `LOCAL_PII_OPTION`.
298
- * **Default Value:** Value of `AWS_PII_OPTION` if `SHOW_AWS_PII_DETECTION_OPTIONS` is True, else value of `LOCAL_PII_OPTION`.
299
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`. Provide one of the PII detection option display names.
300
-
301
- * **`CHOSEN_LOCAL_OCR_MODEL`**
302
- * **Description:** Choose the engine for local OCR: `"tesseract"`, `"paddle"`, or `"hybrid"`. "paddle" is effective for line extraction but not word-level redaction. "hybrid" uses Tesseract first, then PaddleOCR for low-confidence words.
303
- * **Default Value:** `"tesseract"`
304
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
305
-
306
- * **`PREPROCESS_LOCAL_OCR_IMAGES`**
307
- * **Description:** If set to `"True"`, images will be preprocessed (e.g., deskewed, contrast adjusted) before being sent to the local OCR engine. This can sometimes yield worse results on clean scans.
308
- * **Default Value:** `"False"`
309
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
310
-
311
- * **`CHOSEN_COMPREHEND_ENTITIES`**
312
- * **Description:** A list of AWS Comprehend PII entity types to be redacted when using AWS Comprehend.
313
- * **Default Value:** A predefined list of entity types. Refer to `config.py` for the complete list.
314
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`. This should be a string representation of a Python list.
315
-
316
- * **`FULL_COMPREHEND_ENTITY_LIST`**
317
- * **Description:** The complete list of PII entity types supported by AWS Comprehend that can be selected for redaction.
318
- * **Default Value:** A predefined list of entity types. Refer to `config.py` for the complete list.
319
- * **Configuration:** This is typically an informational variable reflecting the capabilities of AWS Comprehend and is not meant to be changed by users directly affecting redaction behavior (use `CHOSEN_COMPREHEND_ENTITIES` for that). Set as an environment variable directly, or include in `config/app_config.env`.
320
-
321
- * **`CHOSEN_REDACT_ENTITIES`**
322
- * **Description:** A list of local PII entity types to be redacted when using the local PII detection model.
323
- * **Default Value:** A predefined list of entity types. Refer to `config.py` for the complete list.
324
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`. This should be a string representation of a Python list.
325
-
326
- * **`FULL_ENTITY_LIST`**
327
- * **Description:** The complete list of PII entity types supported by the local PII detection model that can be selected for redaction.
328
- * **Default Value:** A predefined list of entity types. Refer to `config.py` for the complete list.
329
- * **Configuration:** This is typically an informational variable reflecting the capabilities of the local model and is not meant to be changed by users directly affecting redaction behavior (use `CHOSEN_REDACT_ENTITIES` for that). Set as an environment variable directly, or include in `config/app_config.env`.
330
-
331
- * **`PAGE_BREAK_VALUE`**
332
- * **Description:** Defines a page count after which a function might restart. (Note: Currently not activated).
333
- * **Default Value:** `'99999'`
334
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
335
-
336
- * **`MAX_TIME_VALUE`**
337
- * **Description:** Specifies a maximum time value for long-running processes.
338
- * **Default Value:** `'999999'`
339
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
340
-
341
- * **`CUSTOM_BOX_COLOUR`**
342
- * **Description:** Allows specifying a custom color for the redaction boxes drawn on documents. Only `"grey"` is currently supported as a custom value. If empty, a default color is used.
343
- * **Default Value:** `""` (empty string)
344
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
345
-
346
- * **`RETURN_PDF_END_OF_REDACTION`**
347
- * **Description:** If set to `'True'`, the application will return a PDF document at the end of the redaction task.
348
- * **Default Value:** `"True"`
349
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
350
-
351
- * **`COMPRESS_REDACTED_PDF`**
352
- * **Description:** If set to `'True'`, the redacted PDF output will be compressed. This can reduce file size but may cause issues on systems with low memory.
353
- * **Default Value:** `"False"`
354
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
355
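Two patterns from the redaction options above can be sketched in Python: reading a list-valued setting that is stored as a string representation of a Python list, and resolving `DEFAULT_TEXT_EXTRACTION_MODEL` when it is not set explicitly. The helper names and the example entity names are hypothetical; the actual parsing in `config.py` may differ.

```python
import ast
import os


def get_list_setting(name: str, default: list[str]) -> list[str]:
    """Parse an environment variable holding a Python-list string (hypothetical helper)."""
    raw = os.environ.get(name, "")
    if not raw.strip():
        return list(default)
    # literal_eval only accepts Python literals, so arbitrary code is not executed.
    value = ast.literal_eval(raw)
    if not isinstance(value, (list, tuple)):
        raise ValueError(f"{name} must be a list, got {type(value).__name__}")
    return [str(item) for item in value]


def default_text_extraction_model() -> str:
    """Mirror the documented fallback: Textract if AWS options are shown, else selectable text."""
    textract = os.environ.get(
        "TEXTRACT_TEXT_EXTRACT_OPTION", "AWS Textract service - all PDF types"
    )
    selectable = os.environ.get(
        "SELECTABLE_TEXT_EXTRACT_OPTION", "Local model - selectable text"
    )
    show_aws = os.environ.get("SHOW_AWS_TEXT_EXTRACTION_OPTIONS", "True") == "True"
    fallback = textract if show_aws else selectable
    return os.environ.get("DEFAULT_TEXT_EXTRACTION_MODEL", fallback)


# Entity names here are illustrative values only, not the app's default list.
os.environ["CHOSEN_REDACT_ENTITIES"] = "['PERSON', 'EMAIL_ADDRESS']"
entities = get_list_setting("CHOSEN_REDACT_ENTITIES", default=[])
```

The same fallback shape applies to `DEFAULT_PII_DETECTION_MODEL`, with `AWS_PII_OPTION` and `LOCAL_PII_OPTION` in place of the text extraction names.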
 
356
  ## Language Options
357
 
358
- Settings for multi-language support in OCR and PII detection.
359
-
360
- * **`SHOW_LANGUAGE_SELECTION`**
361
- * **Description:** If set to `"True"`, a dropdown menu for language selection will be visible in the user interface.
362
- * **Default Value:** `"False"`
363
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
364
-
365
- * **`DEFAULT_LANGUAGE_FULL_NAME`**
366
- * **Description:** The default language's full name (e.g., "english") to be displayed in the UI.
367
- * **Default Value:** `"english"`
368
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
369
-
370
- * **`DEFAULT_LANGUAGE`**
371
- * **Description:** The default language's short code (e.g., "en") used by the backend engines. Ensure the corresponding Tesseract/PaddleOCR language packs are installed.
372
- * **Default Value:** `"en"`
373
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
374
-
375
- * **`MAPPED_LANGUAGE_CHOICES`**
376
- * **Description:** A string list of full language names (e.g., 'english', 'french') presented to the user in the language dropdown.
377
- * **Default Value:** A predefined list. See `config.py`.
378
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
379
-
380
- * **`LANGUAGE_CHOICES`**
381
- * **Description:** A string list of short language codes (e.g., 'en', 'fr') that correspond to `MAPPED_LANGUAGE_CHOICES`. This is what the backend uses.
382
- * **Default Value:** A predefined list. See `config.py`.
383
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
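Since `MAPPED_LANGUAGE_CHOICES` and `LANGUAGE_CHOICES` are described as corresponding lists, one way to picture their relationship is a lookup from the full name shown in the dropdown to the short code used by the backend. The sketch below assumes both variables hold Python-list strings of equal length; the function name `language_lookup` is hypothetical and not taken from the app.

```python
import ast
import os


def language_lookup() -> dict[str, str]:
    """Pair each full language name with its short code (hypothetical helper)."""
    names = ast.literal_eval(os.environ.get("MAPPED_LANGUAGE_CHOICES", "['english']"))
    codes = ast.literal_eval(os.environ.get("LANGUAGE_CHOICES", "['en']"))
    # The two lists are matched by position, so they must be the same length.
    if len(names) != len(codes):
        raise ValueError("MAPPED_LANGUAGE_CHOICES and LANGUAGE_CHOICES differ in length")
    return dict(zip(names, codes))


os.environ["MAPPED_LANGUAGE_CHOICES"] = "['english', 'french']"
os.environ["LANGUAGE_CHOICES"] = "['en', 'fr']"
lookup = language_lookup()
```

Because the lists are matched by position, reordering one without the other would silently map names to the wrong codes, so keep them edited together.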
384
-
385
- ## App Run Options
386
-
387
- General runtime configurations for the application.
388
-
389
- * **`TLDEXTRACT_CACHE`**
390
- * **Description:** Path to the cache directory used by the `tldextract` library, which helps in accurately extracting top-level domains (TLDs) from URLs.
391
- * **Default Value:** `'tmp/tld/'`
392
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
393
-
394
- * **`COGNITO_AUTH`**
395
- * **Description:** Enables or disables AWS Cognito authentication for the application. Set to `'True'` to enable.
396
- * **Default Value:** `'False'`
397
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
398
-
399
- * **`RUN_DIRECT_MODE`**
400
- * **Description:** If set to `'True'`, runs the application in a "direct mode", which might alter certain behaviors (e.g., UI elements, processing flow).
401
- * **Default Value:** `'False'`
402
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
403
-
404
- * **`MAX_QUEUE_SIZE`**
405
- * **Description:** The maximum number of requests that can be queued in the Gradio interface.
406
- * **Default Value:** `'5'` (integer)
407
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
408
-
409
- * **`MAX_FILE_SIZE`**
410
- * **Description:** Maximum file size allowed for uploads (e.g., "250mb", "1gb").
411
- * **Default Value:** `'250mb'`
412
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
413
-
414
- * **`GRADIO_SERVER_PORT`**
415
- * **Description:** The network port on which the Gradio server will listen.
416
- * **Default Value:** `'7860'` (integer)
417
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
418
-
419
- * **`ROOT_PATH`**
420
- * **Description:** The root path for the application, useful if running behind a reverse proxy (e.g., `/app`).
421
- * **Default Value:** `''` (empty string)
422
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
423
-
424
- * **`DEFAULT_CONCURRENCY_LIMIT`**
425
- * **Description:** The default concurrency limit for Gradio event handlers, controlling how many requests can be processed simultaneously.
426
- * **Default Value:** `'3'`
427
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
428
-
429
- * **`GET_DEFAULT_ALLOW_LIST`**
430
- * **Description:** If set, enables the use of a default allow list for user access or specific functionalities. The exact behavior depends on application logic.
431
- * **Default Value:** `''` (empty string)
432
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
433
-
434
- * **`ALLOW_LIST_PATH`**
435
- * **Description:** Path to a local CSV file containing an allow list (e.g., `config/default_allow_list.csv`).
436
- * **Default Value:** `''` (empty string)
437
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
438
-
439
- * **`S3_ALLOW_LIST_PATH`**
440
- * **Description:** Path to an allow list CSV file stored in an S3 bucket (e.g., `default_allow_list.csv`). Requires `DOCUMENT_REDACTION_BUCKET` to be set.
441
- * **Default Value:** `''` (empty string)
442
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
443
-
444
- * **`FILE_INPUT_HEIGHT`**
445
- * **Description:** Sets the height (in pixels or other CSS unit) of the file input component in the Gradio UI.
446
- * **Default Value:** `'200'`
447
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
448
 
449
  ## Cost Code Options
450
 
451
- Settings related to tracking and applying cost codes for application usage.
452
-
453
- * **`SHOW_COSTS`**
454
- * **Description:** If set to `'True'`, cost-related information will be displayed in the UI.
455
- * **Default Value:** `'False'`
456
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
457
-
458
- * **`GET_COST_CODES`**
459
- * **Description:** Enables fetching and using cost codes within the application. Set to `'True'` to enable.
460
- * **Default Value:** `'False'`
461
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
462
-
463
- * **`DEFAULT_COST_CODE`**
464
- * **Description:** Specifies a default cost code to be used if cost codes are enabled but none is selected by the user.
465
- * **Default Value:** `''` (empty string)
466
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
467
-
468
- * **`COST_CODES_PATH`**
469
- * **Description:** Path to a local CSV file containing available cost codes (e.g., `config/COST_CENTRES.csv`).
470
- * **Default Value:** `''` (empty string)
471
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
472
-
473
- * **`S3_COST_CODES_PATH`**
474
- * **Description:** Path to a cost codes CSV file stored in an S3 bucket (e.g., `COST_CENTRES.csv`). Requires `DOCUMENT_REDACTION_BUCKET` to be set.
475
- * **Default Value:** `''` (empty string)
476
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
477
-
478
- * **`ENFORCE_COST_CODES`**
479
- * **Description:** If set to `'True'` and `GET_COST_CODES` is also enabled, makes the selection of a cost code mandatory for users.
480
- * **Default Value:** `'False'`
481
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
482
-
483
- ## Whole Document API Options
484
-
485
- Configurations for features related to processing whole documents via APIs, particularly AWS Textract for large documents.
486
-
487
- * **`SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS`**
488
- * **Description:** Controls whether UI options for whole document Textract calls are displayed.
489
- * **Default Value:** `'False'`
490
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
491
-
492
- * **`TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET`**
493
- * **Description:** The S3 bucket used for input and output of whole document analysis with AWS Textract.
494
- * **Default Value:** `''` (empty string)
495
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
496
-
497
- * **`TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER`**
498
- * **Description:** The subfolder within `TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET` where input documents for Textract analysis are placed.
499
- * **Default Value:** `'input'`
500
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
501
-
502
- * **`TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_OUTPUT_SUBFOLDER`**
503
- * **Description:** The subfolder within `TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET` where output results from Textract analysis are stored.
504
- * **Default Value:** `'output'`
505
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
506
-
507
- * **`LOAD_PREVIOUS_TEXTRACT_JOBS_S3`**
508
- * **Description:** If set to `'True'`, the application will attempt to load data from previous Textract jobs stored in S3.
509
- * **Default Value:** `'False'`
510
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
511
-
512
- * **`TEXTRACT_JOBS_S3_LOC`**
513
- * **Description:** The S3 subfolder (within the main redaction bucket) where Textract job data (output) is stored.
514
- * **Default Value:** `'output'`
515
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
516
-
517
- * **`TEXTRACT_JOBS_S3_INPUT_LOC`**
518
- * **Description:** The S3 subfolder (within the main redaction bucket) where Textract job input is stored.
519
- * **Default Value:** `'input'`
520
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env` (or `config/aws_config.env` if `AWS_CONFIG_PATH` is configured).
521
-
522
- * **`TEXTRACT_JOBS_LOCAL_LOC`**
523
- * **Description:** The local subfolder where Textract job data is stored if not using S3 or as a cache.
524
- * **Default Value:** `'output'`
525
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
526
-
527
- * **`DAYS_TO_DISPLAY_WHOLE_DOCUMENT_JOBS`**
528
- * **Description:** Specifies the number of past days for which to display whole document Textract jobs in the UI.
529
- * **Default Value:** `'7'`
530
- * **Configuration:** Set as an environment variable directly, or include in `config/app_config.env`.
 
2
  title: "App settings management guide"
3
  format:
4
  html:
5
+ toc: true
6
+ toc-depth: 3
7
+ toc-title: "On this page"
8
  ---
9
 
10
+ Settings for the redaction app can be changed without modifying the code, by editing values in the `.env` file stored in your local config folder (or in S3 if running on AWS). This guide provides an overview of how to configure the application using environment variables, which the application reads with `os.environ.get()`. It first attempts to load variables from the file specified by `APP_CONFIG_PATH` (which defaults to `config/app_config.env`). If `AWS_CONFIG_PATH` is also set (e.g., to `config/aws_config.env`), variables are loaded from that file as well. Environment variables set directly in the system always take precedence over those defined in these `.env` files.
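The precedence rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `load_env_file` and `get_bool` are hypothetical helper names, not functions from the app, and the app's own loader may work differently. The sketch shows a `.env` value being ignored when the same variable is already set, and the `'True'`/`'False'` string convention being turned into a boolean.

```python
import os
import tempfile


def load_env_file(path, environ=os.environ):
    """Load KEY=VALUE pairs from a .env file without overriding existing variables.

    Hypothetical helper for illustration; not the app's actual loader.
    """
    if not path or not os.path.isfile(path):
        return
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without an assignment
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault keeps any value that is already present, so variables
            # set directly in the system win over the .env file
            environ.setdefault(key.strip(), value.strip().strip("'\""))


def get_bool(name, default="False", environ=os.environ):
    """Interpret the 'True'/'False' string convention used in this guide."""
    return environ.get(name, default).strip().lower() == "true"


# Demo: a plain dict stands in for os.environ so nothing real is modified
demo_env = {"RUN_AWS_FUNCTIONS": "True"}  # pretend this was set in the system
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as tmp:
    tmp.write("RUN_AWS_FUNCTIONS=False\nSESSION_OUTPUT_FOLDER='True'\n")
load_env_file(tmp.name, demo_env)
os.remove(tmp.name)
```

After the demo runs, `RUN_AWS_FUNCTIONS` keeps its directly set value, while `SESSION_OUTPUT_FOLDER` is picked up from the file because nothing had defined it earlier.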
11
 
12
+ ## App Configuration File (`app_config.env`)
13
 
14
  This section details variables related to the main application configuration file.
15
 
16
+ * **`CONFIG_FOLDER`**
17
+ * **Description:** The folder where configuration files are stored.
18
+ * **Default Value:** `config/`
19
+
20
+ * **`APP_CONFIG_PATH`**
21
+ * **Description:** Specifies the path to the application configuration `.env` file. This file contains various settings that control the application's behavior.
22
+ * **Default Value:** `config/app_config.env`
23
 
24
  ## AWS Options
25
 
26
  This section covers configurations related to AWS services used by the application.
27
 
28
+ * **`AWS_CONFIG_PATH`**
29
+ * **Description:** Specifies the path to the AWS configuration `.env` file. This file is intended to store AWS credentials and specific settings.
30
+ * **Default Value:** `''` (empty string)
31
+
32
+ * **`RUN_AWS_FUNCTIONS`**
33
+ * **Description:** Enables or disables AWS-specific functionalities within the application. Set to `"True"` to enable.
34
+ * **Default Value:** `"False"`
35
+
36
+ * **`AWS_REGION`**
37
+ * **Description:** Defines the AWS region where services like S3, Cognito, and Textract are located.
38
+ * **Default Value:** `''`
39
+
40
+ * **`AWS_CLIENT_ID`**
41
+ * **Description:** The client ID for AWS Cognito, used for user authentication.
42
+ * **Default Value:** `''`
43
+
44
+ * **`AWS_CLIENT_SECRET`**
45
+ * **Description:** The client secret for AWS Cognito, used in conjunction with the client ID for authentication.
46
+ * **Default Value:** `''`
47
+
48
+ * **`AWS_USER_POOL_ID`**
49
+ * **Description:** The user pool ID for AWS Cognito, identifying the user directory.
50
+ * **Default Value:** `''`
51
+
52
+ * **`AWS_ACCESS_KEY`**
53
+ * **Description:** The AWS access key ID for programmatic access to AWS services.
54
+ * **Default Value:** `''`
55
+
56
+ * **`AWS_SECRET_KEY`**
57
+ * **Description:** The AWS secret access key corresponding to the AWS access key ID.
58
+ * **Default Value:** `''`
59
+
60
+ * **`DOCUMENT_REDACTION_BUCKET`**
61
+ * **Description:** The name of the S3 bucket used for storing documents related to the redaction process.
62
+ * **Default Value:** `''`
63
+
64
+ * **`PRIORITISE_SSO_OVER_AWS_ENV_ACCESS_KEYS`**
65
+ * **Description:** If set to `"True"`, the app will prioritize using AWS SSO credentials over access keys stored in environment variables.
66
+ * **Default Value:** `"True"`
67
+
68
+ * **`CUSTOM_HEADER`**
69
+ * **Description:** Specifies a custom header name to be included in requests, often used for services like AWS CloudFront.
70
+ * **Default Value:** `''`
71
+
72
+ * **`CUSTOM_HEADER_VALUE`**
73
+ * **Description:** The value for the custom header specified by `CUSTOM_HEADER`.
74
+ * **Default Value:** `''`
75
 
76
  ## Image Options
77
 
78
  Settings related to image processing within the application.
79
 
80
+ * **`IMAGES_DPI`**
81
+ * **Description:** Dots Per Inch (DPI) setting for image processing, affecting the resolution and quality of processed images.
82
+ * **Default Value:** `'300.0'`
 
83
 
84
+ * **`LOAD_TRUNCATED_IMAGES`**
85
+ * **Description:** Controls whether the application attempts to load truncated images. Set to `'True'` to enable.
86
+ * **Default Value:** `'True'`
 
87
 
88
+ * **`MAX_IMAGE_PIXELS`**
89
+ * **Description:** Sets the maximum number of pixels for an image that the application will process. Leave blank for no limit. This can help prevent issues with very large images.
90
+ * **Default Value:** `''`
 
91
 
92
  ## File I/O Options
93
 
94
  Configuration for input and output file handling.
95
 
96
+ * **`SESSION_OUTPUT_FOLDER`**
97
+ * **Description:** If set to `'True'`, the application will save output and input files into session-specific subfolders.
98
+ * **Default Value:** `'False'`
 
99
 
100
+ * **`OUTPUT_FOLDER`**
101
+ * **Description:** Specifies the default output folder for generated files. Can be set to `"TEMP"` to use a temporary directory.
102
+ * **Default Value:** `'output/'`
 
103
 
104
+ * **`INPUT_FOLDER`**
105
+ * **Description:** Specifies the default input folder for files. Can be set to `"TEMP"` to use a temporary directory.
106
+ * **Default Value:** `'input/'`
 
107
 
108
+ * **`GRADIO_TEMP_DIR`**
109
+ * **Description:** Defines the path for Gradio's temporary file storage.
110
+ * **Default Value:** `''`
 
111
 
112
+ * **`MPLCONFIGDIR`**
113
+ * **Description:** Specifies the cache directory for the Matplotlib library.
114
+ * **Default Value:** `''`
 
115
 
116
  ## Logging Options
117
 
118
+ Settings for configuring application logging.
119
+
120
+ * **`SAVE_LOGS_TO_CSV`**
121
+ * **Description:** Enables or disables saving logs to CSV files. Set to `'True'` to enable.
122
+ * **Default Value:** `'True'`
123
+
124
+ * **`USE_LOG_SUBFOLDERS`**
125
+ * **Description:** If enabled (`'True'`), logs will be stored in subfolders based on date and hostname.
126
+ * **Default Value:** `'True'`
127
+
128
+ * **`FEEDBACK_LOGS_FOLDER`**, **`ACCESS_LOGS_FOLDER`**, **`USAGE_LOGS_FOLDER`**
129
+ * **Description:** Base folders for feedback, access, and usage logs respectively.
130
+ * **Default Values:** `'feedback/'`, `'logs/'`, `'usage/'`
131
+
132
+ * **`S3_FEEDBACK_LOGS_FOLDER`**, **`S3_ACCESS_LOGS_FOLDER`**, **`S3_USAGE_LOGS_FOLDER`**
133
+ * **Description:** S3 paths where feedback, access, and usage logs will be stored if `RUN_AWS_FUNCTIONS` is enabled.
134
+ * **Default Values:** Dynamically generated based on date and hostname, e.g., `'feedback/YYYYMMDD/hostname/'`.
135
+
136
+ * **`LOG_FILE_NAME`**, **`USAGE_LOG_FILE_NAME`**, **`FEEDBACK_LOG_FILE_NAME`**
137
+ * **Description:** Specifies the name for log files. `USAGE_LOG_FILE_NAME` and `FEEDBACK_LOG_FILE_NAME` default to the value of `LOG_FILE_NAME`.
138
+ * **Default Value:** `'log.csv'`
139
+
140
+ * **`DISPLAY_FILE_NAMES_IN_LOGS`**
141
+ * **Description:** If set to `'True'`, file names will be included in log entries.
142
+ * **Default Value:** `'False'`
143
+
144
+ * **`CSV_ACCESS_LOG_HEADERS`**, **`CSV_FEEDBACK_LOG_HEADERS`**, **`CSV_USAGE_LOG_HEADERS`**
145
+ * **Description:** Defines custom headers for the respective CSV logs as a string representation of a list. If blank, component labels are used.
146
+ * **Default Value:** Varies; see script for `CSV_USAGE_LOG_HEADERS` default.
147
+
148
+ * **`SAVE_LOGS_TO_DYNAMODB`**
149
+ * **Description:** Enables or disables saving logs to AWS DynamoDB. Set to `'True'` to enable.
150
+ * **Default Value:** `'False'`
151
+
152
+ * **`ACCESS_LOG_DYNAMODB_TABLE_NAME`**, **`FEEDBACK_LOG_DYNAMODB_TABLE_NAME`**, **`USAGE_LOG_DYNAMODB_TABLE_NAME`**
153
+ * **Description:** Names of the DynamoDB tables for storing access, feedback, and usage logs.
154
+ * **Default Values:** `'redaction_access_log'`, `'redaction_feedback'`, `'redaction_usage'`
155
+
156
+ * **`DYNAMODB_ACCESS_LOG_HEADERS`**, **`DYNAMODB_FEEDBACK_LOG_HEADERS`**, **`DYNAMODB_USAGE_LOG_HEADERS`**
157
+ * **Description:** Specifies the headers (attributes) for the respective DynamoDB log tables.
158
+ * **Default Value:** `''`
159
+
160
+ * **`LOGGING`**
161
+ * **Description:** Enables or disables general console logging. Set to `'True'` to enable.
162
+ * **Default Value:** `'False'`
163
+
164
+ ## Gradio & General App Options
165
+
166
+ Configurations for the Gradio UI, server behavior, and application limits.
167
+
168
+ * **`FAVICON_PATH`**
169
+ * **Description:** Path to the favicon icon file for the web interface.
170
+ * **Default Value:** `"favicon.png"`
171
+
172
+ * **`RUN_FASTAPI`**
173
+ * **Description:** If set to `"True"`, the application will be served via FastAPI, allowing for API endpoint integration.
174
+ * **Default Value:** `"False"`
175
+
176
+ * **`GRADIO_SERVER_NAME`**
177
+ * **Description:** The IP address the Gradio server will bind to. Use `"0.0.0.0"` to allow external access.
178
+ * **Default Value:** `"0.0.0.0"`
179
+
180
+ * **`GRADIO_SERVER_PORT`**
181
+ * **Description:** The network port on which the Gradio server will listen.
182
+ * **Default Value:** `7860`
183
+
184
+ * **`ALLOWED_ORIGINS`**
185
+ * **Description:** A comma-separated list of allowed origins for Cross-Origin Resource Sharing (CORS).
186
+ * **Default Value:** `''`
187
+
188
+ * **`ALLOWED_HOSTS`**
189
+ * **Description:** A comma-separated list of allowed hostnames.
190
+ * **Default Value:** `''`
191
+
192
+ * **`ROOT_PATH`**
193
+ * **Description:** The root path for the application, useful if running behind a reverse proxy (e.g., `/app`).
194
+ * **Default Value:** `''`
195
+
196
+ * **`FASTAPI_ROOT_PATH`**
197
+ * **Description:** The root path for the FastAPI application, used when `RUN_FASTAPI` is true.
198
+ * **Default Value:** `"/"`
199
+
200
+ * **`MAX_QUEUE_SIZE`**
201
+ * **Description:** The maximum number of requests that can be queued in the Gradio interface.
202
+ * **Default Value:** `5`
203
+
204
+ * **`MAX_FILE_SIZE`**
205
+ * **Description:** Maximum file size allowed for uploads (e.g., "250mb", "1gb").
206
+ * **Default Value:** `'250mb'`
207
+
208
+ * **`DEFAULT_CONCURRENCY_LIMIT`**
209
+ * **Description:** The default concurrency limit for Gradio event handlers, controlling how many requests can be processed simultaneously.
210
+ * **Default Value:** `3`
211
+
212
+ * **`MAX_SIMULTANEOUS_FILES`**
213
+ * **Description:** The maximum number of files that can be processed at once.
214
+ * **Default Value:** `10`
215
+
216
+ * **`MAX_DOC_PAGES`**
217
+ * **Description:** The maximum number of pages a document can have.
218
+ * **Default Value:** `3000`
219
+
220
+ * **`MAX_TABLE_ROWS`** / **`MAX_TABLE_COLUMNS`**
221
+ * **Description:** Maximum number of rows and columns for tabular data processing.
222
+ * **Default Values:** `250000` / `100`
223
+
224
+ * **`MAX_OPEN_TEXT_CHARACTERS`**
225
+ * **Description:** Maximum number of characters for open text input.
226
+ * **Default Value:** `50000`
227
+
228
+ * **`TLDEXTRACT_CACHE`**
229
+ * **Description:** Path to the cache directory used by the `tldextract` library.
230
+ * **Default Value:** `'tmp/tld/'`
231
+
232
+ * **`COGNITO_AUTH`**
233
+ * **Description:** Enables or disables AWS Cognito authentication. Set to `'True'` to enable.
234
+ * **Default Value:** `'False'`
235
+
236
+ * **`USER_GUIDE_URL`**
237
+ * **Description:** A safe URL pointing to the user guide. The URL is validated against a list of allowed domains.
238
+ * **Default Value:** `"https://seanpedrick-case.github.io/doc_redaction"`
239
+
240
+ * **`SHOW_EXAMPLES`**
241
+ * **Description:** If set to `"True"`, displays example files in the Gradio interface.
242
+ * **Default Value:** `"True"`
243
+
244
+ * **`SHOW_AWS_EXAMPLES`**
245
+ * **Description:** If set to `"True"`, includes AWS-specific examples.
246
+ * **Default Value:** `"False"`
247
+
248
+ * **`FILE_INPUT_HEIGHT`**
249
+ * **Description:** Sets the height (in pixels) of the file input component in the Gradio UI.
250
+ * **Default Value:** `200`
251
+
252
+ ## Redaction & PII Options
253
+
254
+ Configurations related to text extraction, PII detection, and the redaction process.
255
+
256
+ ### UI and Model Selection
257
+
258
+ * **`EXTRACTION_AND_PII_OPTIONS_OPEN_BY_DEFAULT`**
259
+ * **Description:** If set to `"True"`, the "Extraction and PII Options" accordion in the UI will be open by default.
260
+ * **Default Value:** `"True"`
261
+
262
+ * **`SHOW_LOCAL_TEXT_EXTRACTION_OPTIONS`** / **`SHOW_AWS_TEXT_EXTRACTION_OPTIONS`**
263
+ * **Description:** Controls whether local (Tesseract) or AWS (Textract) text extraction options are shown in the UI.
264
+ * **Default Value:** `"True"` for both.
265
+
266
+ * **`SHOW_LOCAL_PII_DETECTION_OPTIONS`** / **`SHOW_AWS_PII_DETECTION_OPTIONS`**
267
+ * **Description:** Controls whether local or AWS (Comprehend) PII detection options are shown in the UI.
268
+ * **Default Value:** `"True"` for both.
269
+
270
+ * **`DEFAULT_TEXT_EXTRACTION_MODEL`**
271
+ * **Description:** Sets the default text extraction model selected in the UI.
272
+ * **Default Value:** Defaults to AWS Textract if available, otherwise local selectable text.
273
+
274
+ * **`DEFAULT_PII_DETECTION_MODEL`**
275
+ * **Description:** Sets the default PII detection model selected in the UI.
276
+ * **Default Value:** Defaults to AWS Comprehend if available, otherwise the local model.
277
+
278
+ * **`LOAD_REDACTION_ANNOTATIONS_FROM_PDF`**
279
+ * **Description:** If set to `"True"`, the application will load existing redaction annotations from PDFs during the review step.
280
+ * **Default Value:** `"True"`
281
+
282
+ ### External Tool Paths
283
+
284
+ * **`TESSERACT_FOLDER`**
285
+ * **Description:** Path to the local Tesseract OCR installation folder.
286
+ * **Default Value:** `''`
287
+
288
+ * **`TESSERACT_DATA_FOLDER`**
289
+ * **Description:** Path to the Tesseract trained data files (`tessdata`).
290
+ * **Default Value:** `"/usr/share/tessdata"`
291
+
292
+ * **`POPPLER_FOLDER`**
293
+ * **Description:** Path to the local Poppler installation's `bin` folder.
294
+ * **Default Value:** `''`
295
+
296
+ * **`PADDLE_MODEL_PATH`** / **`SPACY_MODEL_PATH`**
297
+ * **Description:** Custom directory for PaddleOCR and spaCy model storage, useful for environments like AWS Lambda.
298
+ * **Default Value:** `''` (uses default location).
299
+
300
+ ### Local OCR (Tesseract & PaddleOCR)
301
+
302
+ * **`CHOSEN_LOCAL_OCR_MODEL`**
303
+ * **Description:** Choose the engine for local OCR: `"tesseract"`, `"paddle"`, or `"hybrid"`.
304
+ * **Default Value:** `"tesseract"`
305
+
306
+ * **`SHOW_LOCAL_OCR_MODEL_OPTIONS`**
307
+ * **Description:** If set to `"True"`, allows the user to select the local OCR model from the UI.
308
+ * **Default Value:** `"False"`
309
+
310
+ * **`HYBRID_OCR_CONFIDENCE_THRESHOLD`**
311
+ * **Description:** In "hybrid" mode, this is the Tesseract confidence score below which PaddleOCR will be used for re-extraction.
312
+ * **Default Value:** `65`
313
+
314
+ * **`HYBRID_OCR_PADDING`**
315
+ * **Description:** In "hybrid" mode, padding added to the word's bounding box before re-extraction.
316
+ * **Default Value:** `1`
317
+
318
+ * **`PADDLE_USE_TEXTLINE_ORIENTATION`**
319
+ * **Description:** Toggles textline orientation detection for PaddleOCR.
320
+ * **Default Value:** `"False"`
321
+
322
+ * **`PADDLE_DET_DB_UNCLIP_RATIO`**
323
+ * **Description:** Controls the expansion ratio of the detected text region in PaddleOCR.
324
+ * **Default Value:** `1.2`
325
+
326
+ * **`SAVE_EXAMPLE_TESSERACT_VS_PADDLE_IMAGES`**
327
+ * **Description:** Saves comparison images when using "hybrid" OCR mode.
328
+ * **Default Value:** `"False"`
329
+
330
+ * **`SAVE_PADDLE_VISUALISATIONS`**
331
+ * **Description:** Saves images with PaddleOCR's detected bounding boxes overlaid.
332
+ * **Default Value:** `"False"`
333
+
334
+ * **`PREPROCESS_LOCAL_OCR_IMAGES`**
335
+ * **Description:** If set to `"True"`, images will be preprocessed before local OCR. Can slow down processing.
336
+ * **Default Value:** `"False"`
337
+
338
+ ### Entity and Search Options
339
+
340
+ * **`CHOSEN_COMPREHEND_ENTITIES`** / **`FULL_COMPREHEND_ENTITY_LIST`**
341
+ * **Description:** The selected and available PII entity types for AWS Comprehend.
342
+ * **Default Value:** Predefined lists of entities (see script).
343
+
344
+ * **`CHOSEN_REDACT_ENTITIES`** / **`FULL_ENTITY_LIST`**
345
+ * **Description:** The selected and available PII entity types for the local model.
346
+ * **Default Value:** Predefined lists of entities (see script).
347
+
348
+ * **`CUSTOM_ENTITIES`**
349
+ * **Description:** A list of entities that are considered "custom" and may have special handling.
350
+ * **Default Value:** `['TITLES', 'UKPOSTCODE', 'STREETNAME', 'CUSTOM']`
351
+
352
+ * **`DEFAULT_SEARCH_QUERY`**
353
+ * **Description:** The default text for the custom search/redact input box.
354
+ * **Default Value:** `''`
355
+
356
+ * **`DEFAULT_FUZZY_SPELLING_MISTAKES_NUM`**
357
+ * **Description:** Default number of allowed spelling mistakes for fuzzy searches.
358
+ * **Default Value:** `1`
359
+
360
+ * **`DEFAULT_PAGE_MIN`** / **`DEFAULT_PAGE_MAX`**
361
+ * **Description:** Default start and end pages for processing. `0` for max means process all pages.
362
+ * **Default Value:** `0` for both.
363
+
364
+ ### Textract Feature Selection
365
+
366
+ * **`DEFAULT_HANDWRITE_SIGNATURE_CHECKBOX`**
367
+ * **Description:** The default options selected for Textract's handwriting and signature detection.
368
+ * **Default Value:** `['Extract handwriting']`
369
+
370
+ * **`INCLUDE_FORM_EXTRACTION_TEXTRACT_OPTION`**
371
+ * **`INCLUDE_LAYOUT_EXTRACTION_TEXTRACT_OPTION`**
372
+ * **`INCLUDE_TABLE_EXTRACTION_TEXTRACT_OPTION`**
373
+ * **Description:** Booleans (`"True"`/`"False"`) to include Forms, Layout, and Tables as selectable options for Textract analysis.
374
+ * **Default Value:** `"False"` for all.
375
+
376
+ ### Tabular Data Options
377
+
378
+ * **`DO_INITIAL_TABULAR_DATA_CLEAN`**
379
+ * **Description:** If `"True"`, performs an initial cleaning step on tabular data.
380
+ * **Default Value:** `"True"`
381
+
382
+ * **`DEFAULT_TEXT_COLUMNS`** / **`DEFAULT_EXCEL_SHEETS`**
383
+ * **Description:** Default values for specifying which columns or sheets to process in tabular files.
384
+ * **Default Value:** `[]` (empty list)
385
+
386
+ * **`DEFAULT_TABULAR_ANONYMISATION_STRATEGY`**
387
+ * **Description:** The default method for anonymizing tabular data (e.g., "redact completely").
388
+ * **Default Value:** `"redact completely"`
389
 
390
  ## Language Options
391
 
392
+ Settings for multi-language support.
393
+
394
+ * **`SHOW_LANGUAGE_SELECTION`**
395
+ * **Description:** If set to `"True"`, a language selection dropdown will be visible in the UI.
396
+ * **Default Value:** `"False"`
397
+
398
+ * **`DEFAULT_LANGUAGE_FULL_NAME`** / **`DEFAULT_LANGUAGE`**
399
+ * **Description:** The default language's full name (e.g., "english") and its short code (e.g., "en").
400
+ * **Default Values:** `"english"`, `"en"`
401
+
402
+ * **`textract_language_choices`** / **`aws_comprehend_language_choices`**
403
+ * **Description:** Lists of supported language codes for Textract and Comprehend.
404
+ * **Default Values:** `['en', 'es', 'fr', 'de', 'it', 'pt']` and `['en', 'es']`
405
+
406
+ * **`MAPPED_LANGUAGE_CHOICES`** / **`LANGUAGE_CHOICES`**
407
+ * **Description:** Paired lists of full language names and their corresponding short codes for the UI dropdown.
408
+ * **Default Value:** Predefined lists (see script).
409
+
410
+ ## Duplicate Detection Settings
411
+
412
+ * **`DEFAULT_DUPLICATE_DETECTION_THRESHOLD`**
413
+ * **Description:** The similarity score (0.0 to 1.0) above which documents/pages are considered duplicates.
414
+ * **Default Value:** `0.95`
415
+
416
+ * **`DEFAULT_MIN_CONSECUTIVE_PAGES`**
417
+ * **Description:** Minimum number of consecutive pages that must be duplicates to be flagged.
418
+ * **Default Value:** `1`
419
+
420
+ * **`USE_GREEDY_DUPLICATE_DETECTION`**
421
+ * **Description:** If `"True"`, uses a greedy algorithm that may find more duplicates but can be less precise.
422
+ * **Default Value:** `"True"`
423
+
424
+ * **`DEFAULT_COMBINE_PAGES`**
425
+ * **Description:** If `"True"`, text from the same page number across different files is combined before checking for duplicates.
426
+ * **Default Value:** `"True"`
427
+
428
+ * **`DEFAULT_MIN_WORD_COUNT`**
429
+ * **Description:** Pages with fewer words than this value will be ignored by the duplicate detector.
430
+ * **Default Value:** `10`
431
+
432
+ * **`REMOVE_DUPLICATE_ROWS`**
433
+ * **Description:** If `"True"`, enables duplicate row detection in tabular data.
434
+ * **Default Value:** `"False"`
435
+
436
+ ## File Output Options
437
+
438
+ * **`USE_GUI_BOX_COLOURS_FOR_OUTPUTS`**
439
+ * **Description:** If `"True"`, the final redacted PDF will use the same redaction box colors as shown in the review UI.
440
+ * **Default Value:** `"False"`
441
+
442
+ * **`CUSTOM_BOX_COLOUR`**
443
+ * **Description:** Specifies the color for redaction boxes as an RGB tuple string, e.g., `"(0, 0, 0)"` for black.
444
+ * **Default Value:** `"(0, 0, 0)"`
445
+
446
+ * **`APPLY_REDACTIONS_IMAGES`**, **`APPLY_REDACTIONS_GRAPHICS`**, **`APPLY_REDACTIONS_TEXT`**
447
+ * **Description:** Advanced control over how redactions are applied to underlying images, vector graphics, and text in the PDF, based on PyMuPDF options. `0` is the default for a standard redaction workflow.
448
+ * **Default Value:** `0` for all.
449
+
450
+ * **`RETURN_PDF_FOR_REVIEW`**
451
+ * **Description:** If set to `"True"`, a PDF with redaction boxes drawn on it (but text not removed) is generated for the "Review" tab.
452
+ * **Default Value:** `"True"`
453
+
454
+ * **`RETURN_REDACTED_PDF`**
455
+ * **Description:** If set to `'True'`, the application will return a fully redacted PDF at the end of the main task.
456
+ * **Default Value:** `"True"`
457
+
458
+ * **`COMPRESS_REDACTED_PDF`**
459
+ * **Description:** If set to `'True'`, the redacted PDF output will be compressed.
460
+ * **Default Value:** `"False"`
461
+
462
+ ## Direct Mode & Lambda Configuration
463
+
464
+ Settings for running the application from the command line (Direct Mode) or as an AWS Lambda function.
465
+
466
+ ### Direct Mode
467
+
468
+ * **`RUN_DIRECT_MODE`**
469
+ * **Description:** Set to `'True'` to enable direct command-line mode.
470
+ * **Default Value:** `'False'`
471
+
472
+ * **`DIRECT_MODE_DEFAULT_USER`**
473
+ * **Description:** Default username for CLI requests.
474
+ * **Default Value:** `''`
475
+
476
+ * **`DIRECT_MODE_TASK`**
477
+ * **Description:** The task to perform: `'redact'` or `'deduplicate'`.
478
+ * **Default Value:** `'redact'`
479
+
480
+ * **`DIRECT_MODE_INPUT_FILE`** / **`DIRECT_MODE_OUTPUT_DIR`**
481
+ * **Description:** Path to the input file and output directory for the task.
482
+ * **Default Values:** `''`, `output/`
483
+
484
+ * **Other `DIRECT_MODE_*` variables:**
485
+ * **Description:** These variables allow for setting nearly all application options (e.g., `DIRECT_MODE_PII_DETECTOR`, `DIRECT_MODE_SIMILARITY_THRESHOLD`) directly for a single CLI run, overriding other configurations.
486
+ * **Default Value:** Defaults are inherited from the main application settings (e.g., `LOCAL_PII_OPTION`, `DEFAULT_DUPLICATE_DETECTION_THRESHOLD`).
487
+
488
+ ### Lambda Configuration
489
+
490
+ * **`LAMBDA_POLL_INTERVAL`**
491
+ * **Description:** Polling interval in seconds for checking Textract job status.
492
+ * **Default Value:** `30`
493
+
494
+ * **`LAMBDA_MAX_POLL_ATTEMPTS`**
495
+ * **Description:** Maximum number of polling attempts before timeout.
496
+ * **Default Value:** `120`
497
+
498
+ * **`LAMBDA_PREPARE_IMAGES`**
499
+ * **Description:** If `"True"`, prepares images for OCR processing within the Lambda environment.
500
+ * **Default Value:** `"True"`
501
+
502
+ * **`LAMBDA_EXTRACT_SIGNATURES`**
503
+ * **Description:** Enables signature extraction during Textract analysis in Lambda.
504
+ * **Default Value:** `"False"`
505
+
506
+ * **`LAMBDA_DEFAULT_USERNAME`**
507
+ * **Description:** Default username for operations initiated by Lambda.
508
+ * **Default Value:** `"lambda_user"`
509
+
510
+ ## Allow, Deny, & Whole Page Redaction Lists
511
+
512
+ * **`GET_DEFAULT_ALLOW_LIST`**, **`GET_DEFAULT_DENY_LIST`**, **`GET_DEFAULT_WHOLE_PAGE_REDACTION_LIST`**
513
+ * **Description:** Booleans (`"True"`/`"False"`) to enable the use of allow, deny, or whole-page redaction lists.
514
+ * **Default Value:** `"False"`
515
+
516
+ * **`ALLOW_LIST_PATH`**, **`DENY_LIST_PATH`**, **`WHOLE_PAGE_REDACTION_LIST_PATH`**
517
+ * **Description:** Local paths to the respective CSV list files.
518
+ * **Default Value:** `''`
519
+
520
+ * **`S3_ALLOW_LIST_PATH`**, **`S3_DENY_LIST_PATH`**, **`S3_WHOLE_PAGE_REDACTION_LIST_PATH`**
521
+ * **Description:** Paths to the respective list files within the `DOCUMENT_REDACTION_BUCKET`.
522
+ * **Default Value:** `''`
523
 
524
  ## Cost Code Options
525
 
526
+ * **`SHOW_COSTS`**
527
+ * **Description:** If set to `'True'`, cost-related information will be displayed in the UI.
528
+ * **Default Value:** `'False'`
529
+
530
+ * **`GET_COST_CODES`**
531
+ * **Description:** Enables fetching and using cost codes. Set to `'True'` to enable.
532
+ * **Default Value:** `'False'`
533
+
534
+ * **`DEFAULT_COST_CODE`**
535
+ * **Description:** Specifies a default cost code.
536
+ * **Default Value:** `''`
537
+
538
+ * **`COST_CODES_PATH`** / **`S3_COST_CODES_PATH`**
539
+ * **Description:** Local or S3 path to a CSV file containing available cost codes.
540
+ * **Default Value:** `''`
541
+
542
+ * **`ENFORCE_COST_CODES`**
543
+ * **Description:** If set to `'True'`, makes the selection of a cost code mandatory.
544
+ * **Default Value:** `'False'`
545
+
546
+ ## Whole Document API Options (Textract Async)
547
+
548
+ * **`SHOW_WHOLE_DOCUMENT_TEXTRACT_CALL_OPTIONS`**
549
+ * **Description:** Controls whether UI options for asynchronous whole document Textract calls are displayed.
550
+ * **Default Value:** `'False'`
551
+
552
+ * **`TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_BUCKET`**
553
+ * **Description:** The S3 bucket used for asynchronous Textract analysis.
554
+ * **Default Value:** `''`
555
+
556
+ * **`TEXTRACT_WHOLE_DOCUMENT_ANALYSIS_INPUT_SUBFOLDER`** / **`..._OUTPUT_SUBFOLDER`**
557
+ * **Description:** Input and output subfolders within the analysis bucket.
558
+ * **Default Values:** `'input'`, `'output'`
559
+
560
+ * **`LOAD_PREVIOUS_TEXTRACT_JOBS_S3`**
561
+ * **Description:** If set to `'True'`, the application will load data from previous Textract jobs stored in S3.
562
+ * **Default Value:** `'False'`
563
+
564
+ * **`TEXTRACT_JOBS_S3_LOC`** / **`TEXTRACT_JOBS_S3_INPUT_LOC`**
565
+ * **Description:** S3 subfolders where Textract job output and input are stored.
566
+ * **Default Values:** `'output'`, `'input'`
567
+
568
+ * **`TEXTRACT_JOBS_LOCAL_LOC`**
569
+ * **Description:** The local subfolder for storing Textract job data.
570
+ * **Default Value:** `'output'`
571
+
572
+ * **`DAYS_TO_DISPLAY_WHOLE_DOCUMENT_JOBS`**
573
+ * **Description:** Specifies the number of past days for which to display whole document Textract jobs.
574
+ * **Default Value:** `7`

src/user_guide.qmd CHANGED
@@ -9,7 +9,8 @@ format:
9
 
10
  ## Table of contents
11
 
12
- - [Example data files](#example-data-files)
 
13
  - [Basic redaction](#basic-redaction)
14
  - [Customising redaction options](#customising-redaction-options)
15
  - [Custom allow, deny, and page redaction lists](#custom-allow-deny-and-page-redaction-lists)
@@ -21,21 +22,60 @@ format:
21
  - [Handwriting and signature redaction](#handwriting-and-signature-redaction)
22
  - [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)
23
  - [Redacting Word, tabular data files (CSV/XLSX) or copy and pasted text](#redacting-word-tabular-data-files-xlsxcsv-or-copy-and-pasted-text)
24
-
25
- See the [advanced user guide here](#advanced-user-guide):
26
- - [Merging redaction review files](#merging-redaction-review-files)
27
  - [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
 
 
 
28
  - [Fuzzy search and redaction](#fuzzy-search-and-redaction)
29
  - [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
 
30
  - [Exporting to Adobe Acrobat](#exporting-to-adobe-acrobat)
31
  - [Importing from Adobe Acrobat](#importing-from-adobe-acrobat)
32
  - [Using the AWS Textract document API](#using-the-aws-textract-document-api)
33
  - [Using AWS Textract and Comprehend when not running in an AWS environment](#using-aws-textract-and-comprehend-when-not-running-in-an-aws-environment)
34
  - [Modifying existing redaction review files](#modifying-existing-redaction-review-files)
35
 
36
- ## Example data files
37
 
38
- Please try these example files to follow along with this guide:
39
  - [Example of files sent to a professor before applying](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_of_emails_sent_to_a_professor_before_applying.pdf)
40
  - [Example complaint letter (jpg)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_complaint_letter.jpg)
41
  - [Partnership Agreement Toolkit (for signatures and more advanced usage)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/Partnership-Agreement-Toolkit_0_0.pdf)
@@ -55,16 +95,20 @@ The 'Redact PDFs/images tab' currently accepts PDFs and image files (JPG, PNG) f
55
 
56
  ### Text extraction
57
 
58
- First, select one of the three text extraction options:
 
 
59
  - **'Local model - selectable text'** - This will read text directly from PDFs that have selectable text to redact (using PikePDF). This is fine for most PDFs, but will find nothing if the PDF does not have selectable text, and it is not good for handwriting or signatures. If it encounters an image file, it will send it onto the second option below.
60
  - **'Local OCR model - PDFs without selectable text'** - This option will use a simple Optical Character Recognition (OCR) model (Tesseract) to pull out text from a PDF/image that it 'sees'. This can handle most typed text in PDFs/images without selectable text, but struggles with handwriting/signatures. If you are interested in the latter, then you should use the third option if available.
61
  - **'AWS Textract service - all PDF types'** - Only available for instances of the app running on AWS. AWS Textract is a service that performs OCR on documents within their secure service. This is a more advanced version of OCR compared to the local option, and carries a (relatively small) cost. Textract excels in complex documents based on images, or documents that contain a lot of handwriting and signatures.
62
 
63
- ### Optional - select signature extraction
64
  If you chose the AWS Textract service above, you can choose if you want handwriting and/or signatures redacted by default. Choosing signatures here will have a cost implication, as identifying signatures will cost ~£2.66 ($3.50) per 1,000 pages vs ~£1.14 ($1.50) per 1,000 pages without signature detection.
65
 
66
  ![AWS Textract handwriting and signature options](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/textract_handwriting_signatures.PNG)
67
 
 
 
68
  ### PII redaction method
69
 
70
  If you are running with the AWS service enabled, here you will also have a choice for PII redaction method:
@@ -98,6 +142,7 @@ Click 'Redact document'. After loading in the document, the app should be able t
98
  ![Redaction outputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/redaction_outputs.PNG)
99
 
100
  - **'...redacted.pdf'** files contain the original pdf with suggested redacted text deleted and replaced by a black box on top of the document.
 
101
  - **'...ocr_results.csv'** files contain the line-by-line text outputs from the entire document. This file can be useful for later searching through for any terms of interest in the document (e.g. using Excel or a similar program).
102
  - **'...review_file.csv'** files are the review files that contain details and locations of all of the suggested redactions in the document. This file is key to the [review process](#reviewing-and-modifying-suggested-redactions), and should be downloaded to use later for this.
103
 
@@ -166,8 +211,6 @@ If the table is empty, you can add a new entry, you can add a new row by clickin
166
 
167
  ![Manually modify allow or deny list filled](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/allow_list/manually_modify_filled.PNG)
168
 
169
- **Note:** As of version 0.7.0 you can now apply your whole page redaction list directly to the document file currently under review by clicking the 'Apply whole page redaction list to document currently under review' button that appears here.
170
-
171
  ### Redacting additional types of personal information
172
 
173
  You may want to redact additional types of information beyond the defaults, or you may not be interested in default suggested entity types. There are dates in the example complaint letter. Say we wanted to redact those dates also?
@@ -182,7 +225,7 @@ If you want to redact different files, I suggest you refresh your browser page t
182
 
183
  ## Redacting only specific pages
184
 
185
- Say also we are only interested in redacting page 1 of the loaded documents. On the Redaction settings tab, select 'Lowest page to redact' as 1, and 'Highest page to redact' also as 1. When you next redact your documents, only the first page will be modified.
186
 
187
  ![Selecting specific pages to redact](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/allow_list/select_pages.PNG)
188
 
@@ -419,39 +462,16 @@ You can also write open text into an input box and redact that using the same me
419
  ### Redaction log outputs
420
  A list of the suggested redaction outputs from the tabular data / open text data redaction is available on the Redaction settings page under 'Log file outputs'.
421
 
422
- # ADVANCED USER GUIDE
423
-
424
- This advanced user guide will go over some of the features recently added to the app, including: modifying and merging redaction review files, identifying and redacting duplicate pages across multiple PDFs, 'fuzzy' search and redact, and exporting redactions to Adobe Acrobat.
425
-
426
- ## Table of contents
427
-
428
- - [Merging redaction review files](#merging-redaction-review-files)
429
- - [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
430
- - [Fuzzy search and redaction](#fuzzy-search-and-redaction)
431
- - [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
432
- - [Exporting to Adobe Acrobat](#exporting-to-adobe-acrobat)
433
- - [Importing from Adobe Acrobat](#importing-from-adobe-acrobat)
434
- - [Using the AWS Textract document API](#using-the-aws-textract-document-api)
435
- - [Using AWS Textract and Comprehend when not running in an AWS environment](#using-aws-textract-and-comprehend-when-not-running-in-an-aws-environment)
436
- - [Modifying existing redaction review files](#modifying-existing-redaction-review-files)
437
-
438
-
439
- ## Merging redaction review files
440
-
441
- Say you have run multiple redaction tasks on the same document, and you want to merge all of these redactions together. You could do this in your spreadsheet editor, but this could be fiddly especially if dealing with multiple review files or large numbers of redactions. The app has a feature to combine multiple review files together to create a 'merged' review file.
442
-
443
- ![Merging review files in the user interface](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/merge_review_files_interface.PNG)
444
-
445
- You can find this option at the bottom of the 'Redaction Settings' tab. Upload multiple review files here to get a single output 'merged' review_file. In the examples file, merging the 'review_file_custom.csv' and 'review_file_local.csv' files give you an output containing redaction boxes from both. This combined review file can then be uploaded into the review tab following the usual procedure.
446
-
447
- ![Merging review files outputs in spreadsheet](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/merged_review_file_outputs_csv.PNG)
448
-
449
  ## Identifying and redacting duplicate pages
450
 
451
  The files for this section are stored [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/duplicate_page_find_in_app/).
452
 
453
  Some redaction tasks involve removing duplicate pages of text that may exist across multiple documents. This feature helps you find and remove duplicate content that may exist in single or multiple documents. It can identify everything from single identical pages to multi-page sections (subdocuments). The process involves three main steps: configuring the analysis, reviewing the results in the interactive interface, and then using the generated files to perform the redactions.
454
 
 
 
 
 
455
  ![Example duplicate page inputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/duplicate_page_find_in_app/img/duplicate_page_input_interface_new.PNG)
456
 
457
  **Step 1: Upload and Configure the Analysis**
@@ -496,11 +516,43 @@ The analysis also generates a set of downloadable files for your records and for
496
 
497
  If you want to combine the results from this redaction process with previous redaction tasks for the same PDF, you could merge review file outputs following the steps described in [Merging existing redaction review files](#merging-existing-redaction-review-files) above.
498
 
499
  ## Fuzzy search and redaction
500
 
501
  The files for this section are stored [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/fuzzy_search/).
502
 
503
- Sometimes you may be searching for terns that are slightly mispelled throughout a document, for example names. The document redaction app gives the option for searching for long phrases that may contain spelling mistakes, a method called 'fuzzy matching'.
504
 
505
  To do this, go to the Redaction Settings, and the 'Select entity types to redact' area. In the box below relevant to your chosen redaction method (local or AWS Comprehend), select 'CUSTOM_FUZZY' from the list. Next, we can select the maximum number of spelling mistakes allowed in the search (up to nine). Here, you can either type in a number or use the small arrows to the right of the box. Change this option to 3. This will allow for a maximum of three 'changes' in text needed to match to the desired search terms.
506
 
@@ -520,9 +572,20 @@ Using these deny list with spelling mistakes, the app fuzzy match these terms to
520
 
521
  Files for this section are stored [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/export_to_adobe/).
522
 
523
- ### Exporting to Adobe Acrobat
 
 
524
 
525
- The Document Redaction app has a feature to export suggested redactions to Adobe, and likewise to import Adobe comment files into the app. The file format used is the .xfdf Adobe comment file format - [you can find more information about how to use these files here](https://helpx.adobe.com/uk/acrobat/using/importing-exporting-comments.html).
 
 
 
 
 
 
 
 
 
526
 
527
  To convert suggested redactions to Adobe format, you need to have the original PDF and a review file csv in the input box at the top of the Review redactions page.
528
 
@@ -570,6 +633,46 @@ The '_textract.json' output can be used to speed up further redaction tasks as [
570
 
571
  You can now easily get the '..._ocr_output.csv' redaction output based on this '_textract.json' (described in [Redaction outputs](#redaction-outputs)) by clicking on the button 'Convert Textract job outputs to OCR results'. You can now use this file e.g. for [identifying duplicate pages](#identifying-and-redacting-duplicate-pages), or for redaction review.
572
 
573
  ## Using AWS Textract and Comprehend when not running in an AWS environment
574
 
575
  AWS Textract and Comprehend give much better results for text extraction and document redaction than the local model options in the app. The most secure way to access them in the Redaction app is to run the app in a secure AWS environment with relevant permissions. Alternatively, you could run the app on your own system while logged in to AWS SSO with relevant permissions.
@@ -591,26 +694,180 @@ The app should then pick up these keys when trying to access the AWS Textract an
591
 
592
  Again, a lot can potentially go wrong with AWS solutions that are insecure, so before trying the above please consult with your AWS and data security teams.
593
 
594
- ## Modifying existing redaction review files
595
 
596
- *Note:* As of version 0.7.0 you can now modify redaction review files directly in the app on the 'Review redactions' tab. Open the accordion 'View and edit review data' under the file input area. You can edit review file data cells here - press Enter to apply changes. You should see the effect on the current page if you click the 'Save changes on current page to file' button to the right.
597
 
598
- You can find the folder containing the files discussed in this section [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/).
599
 
600
- As well as serving as inputs to the document redaction app's review function, the 'review_file.csv' output can be modified outside of the app, and also merged with others from multiple redaction attempts on the same file. This gives you the flexibility to change redaction details outside of the app.
 
 
601
 
602
- If you open up a 'review_file' csv output using a spreadsheet software program such as Microsoft Excel you can easily modify redaction properties. Open the file '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local.csv)', and you should see a spreadshet with just four suggested redactions (see below). The following instructions are for using Excel.
603
 
604
- ![Review file before](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/review_file_before.PNG)
605
 
606
- The first thing we can do is remove the first row - 'et' is suggested as a person, but is obviously not a genuine instance of personal information. Right click on the row number and select delete on this menu. Next, let's imagine that what the app identified as a 'phone number' was in fact another type of number and so we wanted to change the label. Simply click on the relevant label cells, let's change it to 'SECURITY_NUMBER'. You could also use 'Find & Select' -> 'Replace' from the top ribbon menu if you wanted to change a number of labels simultaneously.
 
 
607
 
608
- How about we wanted to change the colour of the 'email address' entry on the redaction review tab of the redaction app? The colours in a review file are based on an RGB scale with three numbers ranging from 0-255. [You can find suitable colours here](https://rgbcolorpicker.com). Using this scale, if I wanted my review box to be pure blue, I can change the cell value to (0,0,255).
 
 
 
609
 
610
- Imagine that a redaction box was slightly too small, and I didn't want to use the in-app options to change the size. In the review file csv, we can modify e.g. the ymin and ymax values for any box to increase the extent of the redaction box. For the 'email address' entry, let's decrease ymin by 5, and increase ymax by 5.
611
 
612
- I have saved an output file following the above steps as '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local_mod.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/outputs/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local_mod.csv)' in the same folder that the original was found. Let's upload this file to the app along with the original pdf to see how the redactions look now.
613
 
614
- ![Review file after modification](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/partnership_redactions_after.PNG)
615
 
616
- We can see from the above that we have successfully removed a redaction box, changed labels, colours, and redaction box sizes.
9
 
10
  ## Table of contents
11
 
12
+ ### Getting Started
13
+ - [Built-in example data](#built-in-example-data)
14
  - [Basic redaction](#basic-redaction)
15
  - [Customising redaction options](#customising-redaction-options)
16
  - [Custom allow, deny, and page redaction lists](#custom-allow-deny-and-page-redaction-lists)
 
22
  - [Handwriting and signature redaction](#handwriting-and-signature-redaction)
23
  - [Reviewing and modifying suggested redactions](#reviewing-and-modifying-suggested-redactions)
24
  - [Redacting Word, tabular data files (CSV/XLSX) or copy and pasted text](#redacting-word-tabular-data-files-xlsxcsv-or-copy-and-pasted-text)
 
 
 
25
  - [Identifying and redacting duplicate pages](#identifying-and-redacting-duplicate-pages)
26
+
27
+ ### Advanced user guide
28
+ - [Advanced user guide](#advanced-user-guide)
29
  - [Fuzzy search and redaction](#fuzzy-search-and-redaction)
30
  - [Export redactions to and import from Adobe Acrobat](#export-to-and-import-from-adobe)
31
+ - [Using _for_review.pdf files with Adobe Acrobat](#using-_for_reviewpdf-files-with-adobe-acrobat)
32
  - [Exporting to Adobe Acrobat](#exporting-to-adobe-acrobat)
33
  - [Importing from Adobe Acrobat](#importing-from-adobe-acrobat)
34
  - [Using the AWS Textract document API](#using-the-aws-textract-document-api)
35
  - [Using AWS Textract and Comprehend when not running in an AWS environment](#using-aws-textract-and-comprehend-when-not-running-in-an-aws-environment)
36
  - [Modifying existing redaction review files](#modifying-existing-redaction-review-files)
37
+ - [Merging redaction review files](#merging-redaction-review-files)
38
+
39
+ ### Features for expert users/system administrators
40
+ - [Features for expert users/system administrators](#features-for-expert-userssystem-administrators)
41
+ - [Advanced OCR options (Hybrid OCR)](#advanced-ocr-options-hybrid-ocr)
42
+ - [Command Line Interface (CLI)](#command-line-interface-cli)
43
+
44
+ ## Built-in example data
45
+
46
+ The app now includes built-in example files that you can use to quickly test different features. These examples are automatically loaded and can be accessed directly from the interface without needing to download files separately.
47
+
48
+ ### Using built-in examples
49
+
50
+ **For PDF/image redaction:** On the 'Redact PDFs/images' tab, you'll see a section titled "Try an example - Click on an example below and then the 'Extract text and redact document' button". Simply click on any of the available examples to load them with pre-configured settings:
51
+
52
+ - **PDF with selectable text redaction** - Uses local text extraction with standard PII detection
53
+ - **Image redaction with local OCR** - Processes an image file using OCR
54
+ - **PDF redaction with custom entities** - Demonstrates custom entity selection (Titles, Person, Dates)
55
+ - **PDF redaction with AWS services and signature detection** - Shows AWS Textract with signature extraction (if AWS is enabled)
56
+ - **PDF redaction with custom deny list and whole page redaction** - Demonstrates advanced redaction features
57
+
58
+ Once you have clicked on an example, you can click the 'Extract text and redact document' button to load the example into the app and redact it.
59
+
60
+ **For tabular data:** On the 'Word or Excel/csv files' tab, you'll find examples for both redaction and duplicate detection:
61
+
62
+ - **CSV file redaction** - Shows how to redact specific columns in tabular data
63
+ - **Word document redaction** - Demonstrates Word document processing
64
+ - **Excel file duplicate detection** - Shows how to find duplicate rows in spreadsheet data
65
+
66
+ Once you have clicked on an example, you can click the 'Redact text/data files' button to load the example into the app and redact it. For the duplicate detection example, you can click the 'Find duplicate cells/rows' button to load the example into the app and find duplicates.
67
+
68
+ **For duplicate page detection:** On the 'Identify duplicate pages' tab, you'll find examples for finding duplicate content in documents:
69
+
70
+ - **Find duplicate pages of text in document OCR outputs** - Uses page-level analysis with a similarity threshold of 0.95 and minimum word count of 10
71
+ - **Find duplicate text lines in document OCR outputs** - Uses line-level analysis with a similarity threshold of 0.95 and minimum word count of 3
72
+
73
+ Once you have clicked on an example, you can click the 'Identify duplicate pages/subdocuments' button to load the example into the app and find duplicate content.
74
+
75
+ ### External example files (optional)
76
 
77
+ If you prefer to use your own example files or want to follow along with specific tutorials, you can still download these external example files:
78
 
 
79
  - [Example of files sent to a professor before applying](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_of_emails_sent_to_a_professor_before_applying.pdf)
80
  - [Example complaint letter (jpg)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/example_complaint_letter.jpg)
81
  - [Partnership Agreement Toolkit (for signatures and more advanced usage)](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/Partnership-Agreement-Toolkit_0_0.pdf)
 
95
 
96
  ### Text extraction
97
 
98
+ You can modify default text extraction methods by clicking on the 'Change default text extraction method...' box.
99
+
100
+ Here you can select one of the three text extraction options:
101
  - **'Local model - selectable text'** - This will read text directly from PDFs that have selectable text to redact (using PikePDF). This is fine for most PDFs, but will find nothing if the PDF does not have selectable text, and it is not good for handwriting or signatures. If it encounters an image file, it will send it onto the second option below.
102
  - **'Local OCR model - PDFs without selectable text'** - This option will use a simple Optical Character Recognition (OCR) model (Tesseract) to pull out text from a PDF/image that it 'sees'. This can handle most typed text in PDFs/images without selectable text, but struggles with handwriting/signatures. If you are interested in the latter, then you should use the third option if available.
103
  - **'AWS Textract service - all PDF types'** - Only available for instances of the app running on AWS. AWS Textract is a service that performs OCR on documents within their secure service. This is a more advanced version of OCR compared to the local option, and carries a (relatively small) cost. Textract excels in complex documents based on images, or documents that contain a lot of handwriting and signatures.
104
 
105
+ ### Enable AWS Textract signature extraction
106
  If you chose the AWS Textract service above, you can choose if you want handwriting and/or signatures redacted by default. Choosing signatures here will have a cost implication, as identifying signatures will cost ~£2.66 ($3.50) per 1,000 pages vs ~£1.14 ($1.50) per 1,000 pages without signature detection.
107
 
108
  ![AWS Textract handwriting and signature options](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/textract_handwriting_signatures.PNG)
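The per-page prices quoted above can be turned into a rough cost estimate. The following is a minimal sketch that uses only the two USD figures from the text; the function name and dictionary keys are hypothetical and not part of the app, and real AWS pricing may differ by region or change over time.

```python
# Prices per 1,000 pages in USD, as quoted in the paragraph above.
PRICE_PER_1000_PAGES_USD = {
    "text_only": 1.50,
    "with_signatures": 3.50,
}


def estimate_textract_cost(pages: int, detect_signatures: bool) -> float:
    """Return an approximate Textract cost in USD for a number of pages."""
    key = "with_signatures" if detect_signatures else "text_only"
    return pages / 1000 * PRICE_PER_1000_PAGES_USD[key]


# Compare both options for a hypothetical 2,500-page batch.
print(estimate_textract_cost(2500, detect_signatures=False))
print(estimate_textract_cost(2500, detect_signatures=True))
```

For large batches, the difference between the two options is worth checking before enabling signature detection by default.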
109
 
110
+ **NOTE:** It is also possible to enable form extraction, layout extraction, and table extraction with AWS Textract. These are not enabled by default, but your system admin can enable them in the config file.
111
+
112
  ### PII redaction method
113
 
114
  If you are running with the AWS service enabled, here you will also have a choice for PII redaction method:
 
142
  ![Redaction outputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/quick_start/redaction_outputs.PNG)
143
 
144
  - **'...redacted.pdf'** files contain the original pdf with suggested redacted text deleted and replaced by a black box on top of the document.
145
+ - **'...redactions_for_review.pdf'** files contain the original PDF with redaction boxes overlaid but the original text still visible underneath. This file is designed for use in Adobe Acrobat and other PDF viewers where you can see the suggested redactions without the text being permanently removed. This is particularly useful for reviewing redactions before finalising them.
146
  - **'...ocr_results.csv'** files contain the line-by-line text outputs from the entire document. This file can be useful for later searching through for any terms of interest in the document (e.g. using Excel or a similar program).
147
  - **'...review_file.csv'** files are the review files that contain details and locations of all of the suggested redactions in the document. This file is key to the [review process](#reviewing-and-modifying-suggested-redactions), and should be downloaded to use later for this.
148
 
 
211
 
212
  ![Manually modify allow or deny list filled](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/allow_list/manually_modify_filled.PNG)
213
 
 
 
214
  ### Redacting additional types of personal information
215
 
216
  You may want to redact additional types of information beyond the defaults, or you may not be interested in default suggested entity types. There are dates in the example complaint letter. Say we wanted to redact those dates also?
 
225
 
226
  ## Redacting only specific pages
227
 
228
+ Say also we are only interested in redacting page 1 of the loaded documents. On the Redaction settings tab, select 'Lowest page to redact' as 1, and 'Highest page to redact' also as 1. When you next redact your documents, only the first page will be modified. The output files should now have a suffix similar to '..._1_1.pdf', indicating the lowest and highest page numbers that were redacted.
229
 
230
  ![Selecting specific pages to redact](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/allow_list/select_pages.PNG)
231
 
 
462
  ### Redaction log outputs
463
  A list of the suggested redaction outputs from the tabular data / open text data redaction is available on the Redaction settings page under 'Log file outputs'.
464
 
465
  ## Identifying and redacting duplicate pages
466
 
467
  The files for this section are stored [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/duplicate_page_find_in_app/).
468
 
469
  Some redaction tasks involve removing duplicate pages of text that may exist across multiple documents. This feature helps you find and remove duplicate content that may exist in single or multiple documents. It can identify everything from single identical pages to multi-page sections (subdocuments). The process involves three main steps: configuring the analysis, reviewing the results in the interactive interface, and then using the generated files to perform the redactions.
470
 
471
+ ### Duplicate page detection in documents
472
+
473
+ This section covers finding duplicate pages across PDF documents using OCR output files.
474
+
475
  ![Example duplicate page inputs](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/duplicate_page_find_in_app/img/duplicate_page_input_interface_new.PNG)
476
 
477
  **Step 1: Upload and Configure the Analysis**
 
516
 
517
  If you want to combine the results from this redaction process with previous redaction tasks for the same PDF, you could merge review file outputs following the steps described in [Merging redaction review files](#merging-redaction-review-files).
518
 
519
+ ### Duplicate detection in tabular data
520
+
521
+ The app also includes functionality to find duplicate cells or rows in CSV, Excel, or Parquet files. This is particularly useful for cleaning datasets where you need to identify and remove duplicate entries.
522
+
523
+ **Step 1: Upload files and configure analysis**
524
+
525
+ Navigate to the 'Word or Excel/csv files' tab and scroll down to the "Find duplicate cells in tabular data" section. Upload your tabular files (CSV, Excel, or Parquet) and configure the analysis parameters:
526
+
527
+ - **Similarity threshold**: Score (0-1) to consider cells a match. 1 = perfect match
528
+ - **Minimum word count**: Cells with fewer words than this value are ignored
529
+ - **Do initial clean of text**: Remove URLs, HTML tags, and non-ASCII characters
530
+ - **Remove duplicate rows**: Automatically remove duplicate rows from deduplicated files
531
+ - **Select Excel sheet names**: Choose which sheets to analyze (for Excel files)
532
+ - **Select text columns**: Choose which columns contain text to analyze
533
+
534
+ **Step 2: Review results**
535
+
536
+ After clicking "Find duplicate cells/rows", the results will be displayed in a table showing:
537
+ - File1, Row1, File2, Row2
538
+ - Similarity_Score
539
+ - Text1, Text2 (the actual text content being compared)
540
+
541
+ Click on any row to see more details about the duplicate match in the preview boxes below.
542
+
543
+ **Step 3: Remove duplicates**
544
+
545
+ Select a file from the dropdown and click "Remove duplicate rows from selected file" to create a cleaned version with duplicates removed. The cleaned file will be available for download.
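To illustrate how the similarity threshold and minimum word count interact, here is a minimal sketch. It uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity score; the measure the app actually uses is not stated in this guide, and the function name and sample rows are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def find_duplicate_cells(cells, similarity_threshold=0.95, min_word_count=3):
    """Return (index_a, index_b, score) for pairs of cells that look like duplicates."""
    # Cells with fewer words than the minimum are ignored, as the setting describes.
    candidates = [
        (i, text) for i, text in enumerate(cells)
        if len(text.split()) >= min_word_count
    ]
    matches = []
    for (i, a), (j, b) in combinations(candidates, 2):
        # Stand-in similarity score between 0 and 1 (1 = identical after lowercasing).
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= similarity_threshold:
            matches.append((i, j, round(score, 3)))
    return matches


# Made-up example cells.
rows = [
    "Please contact the customer service team",
    "please contact the customer service team",
    "Thank you",
    "Thank you",
    "An unrelated sentence about invoices",
]
print(find_duplicate_cells(rows))
```

In this sketch the two 'Thank you' cells are skipped because they fall below the minimum word count, which shows why a very short minimum can flood the results with trivial matches.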
546
+
547
+ # Advanced user guide
548
+
549
+ This advanced user guide covers features that require system administration access or command-line usage. These features are typically used by system administrators or advanced users who need more control over the redaction process.
550
+
551
  ## Fuzzy search and redaction
552
 
553
  The files for this section are stored [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/fuzzy_search/).
554
 
555
+ Sometimes you may be searching for terms that are slightly misspelled throughout a document, for example names. The document redaction app gives the option of searching for long phrases that may contain spelling mistakes, a method called 'fuzzy matching'.
556
 
557
  To do this, go to the Redaction Settings, and the 'Select entity types to redact' area. In the box below relevant to your chosen redaction method (local or AWS Comprehend), select 'CUSTOM_FUZZY' from the list. Next, we can select the maximum number of spelling mistakes allowed in the search (up to nine). Here, you can either type in a number or use the small arrows to the right of the box. Change this option to 3. This will allow for a maximum of three 'changes' in text needed to match to the desired search terms.
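As an illustration of what a maximum number of 'changes' means, the sketch below assumes that each change is a single-character insertion, deletion, or substitution (the Levenshtein edit distance). This is an assumption for illustration only; the matching method the app uses internally is not described here, and the function names and example names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character edits from a to b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def is_fuzzy_match(candidate: str, search_term: str, max_mistakes: int = 3) -> bool:
    """True if candidate is within max_mistakes edits of search_term (case-insensitive)."""
    return edit_distance(candidate.lower(), search_term.lower()) <= max_mistakes


# Hypothetical deny list term and two candidate strings found in a document.
print(is_fuzzy_match("Jon Smyth", "John Smith"))
print(is_fuzzy_match("Jane Doe", "John Smith"))
```

Under this assumption, a higher limit catches more misspellings but also raises the chance of matching unrelated short phrases, so it is worth reviewing the suggested redactions afterwards.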
558
 
 
572
 
573
  Files for this section are stored [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/export_to_adobe/).
574
 
575
+ The Document Redaction app has enhanced features for working with Adobe Acrobat. You can now export suggested redactions to Adobe, import Adobe comment files into the app, and use the new `_for_review.pdf` files directly in Adobe Acrobat.
576
+
577
+ ### Using _for_review.pdf files with Adobe Acrobat
578
 
579
+ The app now generates `...redactions_for_review.pdf` files that contain the original PDF with redaction boxes overlaid but the original text still visible underneath. These files are specifically designed for use in Adobe Acrobat and other PDF viewers where you can:
580
+
581
+ - See the suggested redactions without the text being permanently removed
582
+ - Review redactions before finalising them
583
+ - Use Adobe Acrobat's built-in redaction tools to modify or apply the redactions
584
+ - Export the final redacted version directly from Adobe
585
+
586
+ Simply open the `...redactions_for_review.pdf` file in Adobe Acrobat to begin reviewing and modifying the suggested redactions.
587
+
588
+ ### Exporting to Adobe Acrobat
589
 
590
  To convert suggested redactions to Adobe format, you need to have the original PDF and a review file csv in the input box at the top of the Review redactions page.
591
 
 
633
 
634
  You can now easily get the '..._ocr_output.csv' redaction output based on this '_textract.json' (described in [Redaction outputs](#redaction-outputs)) by clicking on the button 'Convert Textract job outputs to OCR results'. You can now use this file e.g. for [identifying duplicate pages](#identifying-and-redacting-duplicate-pages), or for redaction review.
635
 
636
+
637
+
638
+ ## Modifying existing redaction review files
639
+ You can find the folder containing the files discussed in this section [here](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/).
640
+
641
+ As well as serving as inputs to the document redaction app's review function, the 'review_file.csv' output can be modified inside or outside of the app. This gives you the flexibility to change redaction details wherever is most convenient.
642
+
643
+ ### Inside the app
644
+ You can now modify redaction review files directly in the app on the 'Review redactions' tab. Open the accordion 'View and edit review data' under the file input area. You can edit review file data cells here - press Enter to apply changes. You should see the effect on the current page if you click the 'Save changes on current page to file' button to the right.
645
+
646
+ ### Outside the app
647
+ If you open a 'review_file' csv output in a spreadsheet program such as Microsoft Excel, you can easily modify redaction properties. Open the file '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local.csv)', and you should see a spreadsheet with just four suggested redactions (see below). The following instructions are for using Excel.
648
+
649
+ ![Review file before](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/review_file_before.PNG)
650
+
651
+ The first thing we can do is remove the first row: 'et' is suggested as a person, but is obviously not a genuine instance of personal information. Right-click the row number and select 'Delete' from the menu. Next, let's imagine that what the app identified as a 'phone number' is in fact another type of number, so we want to change the label. Simply click on the relevant label cell and change it to 'SECURITY_NUMBER'. You could also use 'Find & Select' -> 'Replace' from the top ribbon menu if you wanted to change a number of labels simultaneously.
652
+
653
+ What if we wanted to change the colour of the 'email address' entry on the redaction review tab of the redaction app? The colours in a review file are based on an RGB scale with three numbers ranging from 0-255. [You can find suitable colours here](https://rgbcolorpicker.com). Using this scale, to make the review box pure blue, change the cell value to (0,0,255).
654
+
655
+ Imagine that a redaction box was slightly too small, and I didn't want to use the in-app options to change the size. In the review file csv, we can modify e.g. the ymin and ymax values for any box to increase the extent of the redaction box. For the 'email address' entry, let's decrease ymin by 5, and increase ymax by 5.
656
+
657
+ I have saved an output file following the above steps as '[Partnership-Agreement-Toolkit_0_0_redacted.pdf_review_file_local_mod.csv](https://github.com/seanpedrick-case/document_redaction_examples/blob/main/merge_review_files/outputs/Partnership-Agreement-Toolkit_0_0.pdf_review_file_local_mod.csv)' in the same folder that the original was found. Let's upload this file to the app along with the original pdf to see how the redactions look now.
658
+
659
+ ![Review file after modification](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/partnership_redactions_after.PNG)
660
+
661
+ We can see from the above that we have successfully removed a redaction box, changed labels, colours, and redaction box sizes.
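The same four spreadsheet edits can also be scripted. The following is a minimal sketch rather than anything the app itself provides: the sample rows are made up, and the column names (`label`, `text`, `color`, `ymin`, `ymax`) and coordinate units are assumptions, so check them against the header of your own review file before adapting it.

```python
import csv
import io

# Made-up rows standing in for a review_file.csv. The column names and
# values here are assumptions for illustration, not real app output.
SAMPLE_REVIEW_CSV = """page,label,color,xmin,ymin,xmax,ymax,text
1,PERSON,"(0, 0, 0)",100,200,120,212,et
1,PHONE_NUMBER,"(0, 0, 0)",150,300,260,312,0123 456 789
2,EMAIL_ADDRESS,"(0, 0, 0)",90,400,280,412,someone@example.com
"""


def edit_review_rows(rows, pad=5):
    """Apply the four edits described in this section to a list of row dicts."""
    edited = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left untouched
        # 1. Drop the false positive: 'et' labelled as a person.
        if row["label"] == "PERSON" and row["text"] == "et":
            continue
        # 2. Relabel the phone number.
        if row["label"] == "PHONE_NUMBER":
            row["label"] = "SECURITY_NUMBER"
        if row["label"] == "EMAIL_ADDRESS":
            # 3. Make the review box pure blue.
            row["color"] = "(0,0,255)"
            # 4. Grow the box vertically by `pad` on each side.
            row["ymin"] = str(float(row["ymin"]) - pad)
            row["ymax"] = str(float(row["ymax"]) + pad)
        edited.append(row)
    return edited


rows = list(csv.DictReader(io.StringIO(SAMPLE_REVIEW_CSV)))
edited_rows = edit_review_rows(rows)

# Write the edited rows back out in CSV form (to a string here, not a file).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(edited_rows)
print(out.getvalue())
```

To use this on a real file, replace the `io.StringIO` objects with `open(...)` calls on your review file and a new output path, then upload the result alongside the original PDF as described above.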
662
+
663
+ ## Merging redaction review files
664
+
665
+ Say you have run multiple redaction tasks on the same document, and you want to merge all of these redactions together. You could do this in your spreadsheet editor, but this could be fiddly especially if dealing with multiple review files or large numbers of redactions. The app has a feature to combine multiple review files together to create a 'merged' review file.
666
+
667
+ ![Merging review files in the user interface](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/merge_review_files_interface.PNG)
668
+
669
+ You can find this option at the bottom of the 'Redaction Settings' tab. Upload multiple review files here to get a single output 'merged' review_file. In the example files, merging the 'review_file_custom.csv' and 'review_file_local.csv' files gives you an output containing redaction boxes from both. This combined review file can then be uploaded into the review tab following the usual procedure.
670
+
671
+ ![Merging review files outputs in spreadsheet](https://raw.githubusercontent.com/seanpedrick-case/document_redaction_examples/main/merge_review_files/img/merged_review_file_outputs_csv.PNG)
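Conceptually, merging amounts to stacking the rows of several review files into one table. The sketch below illustrates that idea on made-up rows. It is not the app's implementation, and both the column names and the rule of dropping only exact duplicate rows are assumptions.

```python
def merge_review_rows(*review_files):
    """Concatenate rows from several review files, skipping exact duplicates."""
    merged, seen = [], set()
    for rows in review_files:
        for row in rows:
            key = tuple(sorted(row.items()))  # hashable fingerprint of the row
            if key not in seen:
                seen.add(key)
                merged.append(row)
    # Order by page so the merged file reads naturally during review.
    return sorted(merged, key=lambda r: int(r["page"]))


# Hypothetical rows from a 'custom' run and a 'local' run on the same document.
custom_rows = [
    {"page": "2", "label": "CUSTOM", "text": "Sister City"},
    {"page": "1", "label": "PERSON", "text": "Jane Doe"},
]
local_rows = [
    {"page": "1", "label": "PERSON", "text": "Jane Doe"},  # same box in both runs
    {"page": "3", "label": "EMAIL_ADDRESS", "text": "someone@example.com"},
]

merged_rows = merge_review_rows(custom_rows, local_rows)
for row in merged_rows:
    print(row["page"], row["label"], row["text"])
```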
672
+
673
+ # Features for expert users/system administrators
674
+ This advanced user guide covers features that require system administration access or command-line usage. They are not enabled by default and are not available to users who only use the graphical user interface, but your system administrator can configure them. They are typically used by system administrators or advanced users who need more control over the redaction process.
675
+
676
  ## Using AWS Textract and Comprehend when not running in an AWS environment
677
 
678
  AWS Textract and Comprehend give much better results for text extraction and document redaction than the local model options in the app. The most secure way to access them in the Redaction app is to run the app in a secure AWS environment with relevant permissions. Alternatively, you could run the app on your own system while logged in to AWS SSO with relevant permissions.
 
694
 
695
  Again, a lot can potentially go wrong with AWS solutions that are insecure, so before trying the above please consult with your AWS and data security teams.
696
 
697
+ ## Advanced OCR options (Hybrid OCR)
698
 
699
+ The app supports advanced OCR options that combine multiple OCR engines for improved accuracy. These options are not enabled by default but can be configured by your system administrator.
700
 
701
+ ### Available OCR models
702
 
703
+ - **Tesseract** (default): The standard OCR engine that works well for most documents
704
+ - **PaddleOCR**: More accurate for whole line text extraction, but word-level bounding boxes may be less precise
705
+ - **Hybrid**: Combines Tesseract and PaddleOCR - uses Tesseract for initial extraction, then PaddleOCR for re-extraction of low-confidence text
706
 
707
+ ### Enabling advanced OCR options
708
 
709
+ To enable these options, your system administrator needs to modify the configuration file (`config.py`) and set:
710
 
711
+ ```
712
+ SHOW_LOCAL_OCR_MODEL_OPTIONS = "True"
713
+ ```
714
 
715
+ Once enabled, users will see a "Change default local OCR model" section in the redaction settings where they can choose between:
716
+ - tesseract
717
+ - hybrid
718
+ - paddle
719
 
720
+ ### Hybrid OCR configuration
721
 
722
+ The hybrid OCR mode uses several configurable parameters:
723
 
724
+ - **HYBRID_OCR_CONFIDENCE_THRESHOLD** (default: 65): Tesseract confidence score below which PaddleOCR will be used for re-extraction
725
+ - **HYBRID_OCR_PADDING** (default: 1): Padding added to word bounding boxes before re-extraction
726
+ - **SAVE_EXAMPLE_TESSERACT_VS_PADDLE_IMAGES** (default: False): Save comparison images when using hybrid mode
727
+ - **SAVE_PADDLE_VISUALISATIONS** (default: False): Save images with PaddleOCR bounding boxes overlaid
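The hybrid flow described above can be sketched as follows. This is a hedged illustration only: `Word`, `pad_box`, `hybrid_ocr` and the stubbed `fake_paddle` function are hypothetical names, and the real app may decide differently, for example by keeping whichever engine is more confident.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Word:
    text: str
    conf: float  # confidence score from the first-pass engine, 0-100
    box: Box


def pad_box(box: Box, padding: int) -> Box:
    """Expand a bounding box by `padding` pixels, clamping at the image origin."""
    left, top, right, bottom = box
    return (max(0, left - padding), max(0, top - padding),
            right + padding, bottom + padding)


def hybrid_ocr(words: List[Word],
               reread: Callable[[Box], Tuple[str, float]],
               threshold: float = 65,
               padding: int = 1) -> List[Word]:
    """Keep confident first-pass words; re-extract the rest with `reread`."""
    result = []
    for word in words:
        if word.conf < threshold:
            text, conf = reread(pad_box(word.box, padding))
            result.append(Word(text, conf, word.box))
        else:
            result.append(word)
    return result


def fake_paddle(box: Box) -> Tuple[str, float]:
    """Stand-in for a second OCR engine; always returns the same reading."""
    return ("Smith", 92.0)


first_pass = [Word("Dear", 96.0, (10, 10, 50, 24)),
              Word("Srnith", 41.0, (56, 10, 110, 24))]
for word in hybrid_ocr(first_pass, fake_paddle):
    print(word.text, word.conf)
```

The defaults mirror `HYBRID_OCR_CONFIDENCE_THRESHOLD` and `HYBRID_OCR_PADDING`: raising the threshold sends more words to the second engine, at the cost of speed.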
728
+
729
+ ### When to use different OCR models
730
+
731
+ - **Tesseract**: Best for general use, good balance of speed and accuracy
732
+ - **PaddleOCR**: Best for documents with clear, well-formatted text where line-level accuracy is more important than word-level precision
733
+ - **Hybrid**: Best for challenging documents where some text has low confidence scores, providing the benefits of both engines
734
+
735
+
736
+
737
+
738
+
739
+ ## Command Line Interface (CLI)
740
+
741
+ The app includes a comprehensive command-line interface (`cli_redact.py`) that allows you to perform redaction, deduplication, and AWS Textract operations directly from the terminal. This is particularly useful for batch processing, automation, and integration with other systems.
742
+
743
+ ### Getting started with the CLI
744
+
745
+ To use the CLI, you need to:
746
+
747
+ 1. Open a terminal window
748
+ 2. Navigate to the app folder containing `cli_redact.py`
749
+ 3. Activate your virtual environment (conda or venv)
750
+ 4. Run commands using `python cli_redact.py` followed by your options
751
+
752
+ ### Basic CLI syntax
753
+
754
+ ```bash
755
+ python cli_redact.py --task [redact|deduplicate|textract] --input_file [file_path] [additional_options]
756
+ ```
757
+
758
+ ### Redaction examples
759
+
760
+ **Basic PDF redaction with default settings:**
761
+ ```bash
762
+ python cli_redact.py --input_file example_data/example_of_emails_sent_to_a_professor_before_applying.pdf
763
+ ```
764
+
765
+ **Extract text only (no PII detection) with whole page redaction:**
766
+ ```bash
767
+ python cli_redact.py --input_file example_data/Partnership-Agreement-Toolkit_0_0.pdf --redact_whole_page_file example_data/partnership_toolkit_redact_some_pages.csv --pii_detector None
768
+ ```
769
+
770
+ **Redact with custom entities and allow list:**
771
+ ```bash
772
+ python cli_redact.py --input_file example_data/graduate-job-example-cover-letter.pdf --allow_list_file example_data/test_allow_list_graduate.csv --local_redact_entities TITLES PERSON DATE_TIME
773
+ ```
774
+
775
+ **Redact with fuzzy matching and custom deny list:**
776
+ ```bash
777
+ python cli_redact.py --input_file example_data/Partnership-Agreement-Toolkit_0_0.pdf --deny_list_file example_data/Partnership-Agreement-Toolkit_test_deny_list_para_single_spell.csv --local_redact_entities CUSTOM_FUZZY --page_min 1 --page_max 3 --fuzzy_mistakes 3
778
+ ```
779
+
780
+ **Redact with AWS services:**
781
+ ```bash
782
+ python cli_redact.py --input_file example_data/example_of_emails_sent_to_a_professor_before_applying.pdf --ocr_method "AWS Textract" --pii_detector "AWS Comprehend"
783
+ ```
784
+
785
+ **Redact specific pages with signature extraction:**
786
+ ```bash
787
+ python cli_redact.py --input_file example_data/Partnership-Agreement-Toolkit_0_0.pdf --page_min 6 --page_max 7 --ocr_method "AWS Textract" --handwrite_signature_extraction "Extract handwriting" "Extract signatures"
788
+ ```
789
+
790
+ ### Tabular data redaction
791
+
792
+ **Anonymize CSV file with specific columns:**
793
+ ```bash
794
+ python cli_redact.py --input_file example_data/combined_case_notes.csv --text_columns "Case Note" "Client" --anon_strategy replace_redacted
795
+ ```
796
+
797
+ **Anonymize Excel file:**
798
+ ```bash
799
+ python cli_redact.py --input_file example_data/combined_case_notes.xlsx --text_columns "Case Note" "Client" --excel_sheets combined_case_notes --anon_strategy redact
800
+ ```
801
+
802
+ **Anonymize Word document:**
803
+ ```bash
804
+ python cli_redact.py --input_file "example_data/Bold minimalist professional cover letter.docx" --anon_strategy replace_redacted
805
+ ```
806
+
807
+ ### Duplicate detection
808
+
809
+ **Find duplicate pages in OCR files:**
810
+ ```bash
811
+ python cli_redact.py --task deduplicate --input_file example_data/example_outputs/doubled_output_joined.pdf_ocr_output.csv --duplicate_type pages --similarity_threshold 0.95
812
+ ```
813
+
814
+ **Find duplicates at line level:**
815
+ ```bash
816
+ python cli_redact.py --task deduplicate --input_file example_data/example_outputs/doubled_output_joined.pdf_ocr_output.csv --duplicate_type pages --similarity_threshold 0.95 --combine_pages False --min_word_count 3
817
+ ```
818
+
819
+ **Find duplicate rows in tabular data:**
820
+ ```bash
821
+ python cli_redact.py --task deduplicate --input_file example_data/Lambeth_2030-Our_Future_Our_Lambeth.pdf.csv --duplicate_type tabular --text_columns "text" --similarity_threshold 0.95
822
+ ```
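To build intuition for what `--similarity_threshold` and `--min_word_count` control, here is a small sketch that compares pages with a bag-of-words cosine similarity. The app's own similarity measure is not described in this guide, so treat this purely as an illustration with a swapped-in technique; the function names are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def find_duplicate_pages(pages, threshold=0.95, min_word_count=3):
    """Return 1-based page number pairs whose similarity meets the threshold."""
    pairs = []
    for i in range(len(pages)):
        for j in range(i + 1, len(pages)):
            # Very short pages are skipped, mirroring a minimum word count.
            if min(len(pages[i].split()), len(pages[j].split())) < min_word_count:
                continue
            if cosine_similarity(pages[i], pages[j]) >= threshold:
                pairs.append((i + 1, j + 1))
    return pairs


pages = [
    "the quick brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the lazy dog",
    "an entirely different page of text",
    "ok",
]
print(find_duplicate_pages(pages))
```

A higher threshold flags only near-identical pages; lowering it catches looser matches but risks false positives.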
823
+
824
+ ### AWS Textract operations
825
+
826
+ **Submit document for analysis:**
827
+ ```bash
828
+ python cli_redact.py --task textract --textract_action submit --input_file example_data/example_of_emails_sent_to_a_professor_before_applying.pdf
829
+ ```
830
+
831
+ **Submit with signature extraction:**
832
+ ```bash
833
+ python cli_redact.py --task textract --textract_action submit --input_file example_data/Partnership-Agreement-Toolkit_0_0.pdf --extract_signatures
834
+ ```
835
 
836
+ **Retrieve results by job ID:**
837
+ ```bash
838
+ python cli_redact.py --task textract --textract_action retrieve --job_id 12345678-1234-1234-1234-123456789012
839
+ ```
840
+
841
+ **List recent jobs:**
842
+ ```bash
843
+ python cli_redact.py --task textract --textract_action list
844
+ ```
845
+
846
+ ### Common CLI options
847
+
848
+ - `--task`: Choose between "redact", "deduplicate", or "textract"
849
+ - `--input_file`: Path to input file(s)
850
+ - `--output_dir`: Directory for output files (default: output/)
851
+ - `--page_min` / `--page_max`: Process only a specific page range
852
+ - `--ocr_method`: Choose text extraction method
853
+ - `--pii_detector`: Choose PII detection method
854
+ - `--local_redact_entities`: Specify local entities to redact
855
+ - `--allow_list_file` / `--deny_list_file`: Custom lists
856
+ - `--redact_whole_page_file`: List of pages to redact completely
857
+ - `--fuzzy_mistakes`: Number of spelling mistakes allowed in fuzzy matching
858
+ - `--similarity_threshold`: Threshold for duplicate detection
859
+ - `--anon_strategy`: Anonymization strategy for tabular data
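For batch processing, the options above can be assembled programmatically. The sketch below only builds and prints the commands; `build_redact_command` is a hypothetical helper, and it uses only flags listed in this section.

```python
import shlex
from pathlib import Path


def build_redact_command(input_file, output_dir="output/", page_min=None,
                         page_max=None, entities=()):
    """Return an argument list for one `cli_redact.py` redaction run."""
    cmd = ["python", "cli_redact.py", "--task", "redact",
           "--input_file", Path(input_file).as_posix(),
           "--output_dir", output_dir]
    if page_min is not None:
        cmd += ["--page_min", str(page_min)]
    if page_max is not None:
        cmd += ["--page_max", str(page_max)]
    if entities:
        cmd += ["--local_redact_entities", *entities]
    return cmd


pdfs = ["example_data/Partnership-Agreement-Toolkit_0_0.pdf",
        "example_data/graduate-job-example-cover-letter.pdf"]
commands = [build_redact_command(p, page_min=1, page_max=3,
                                 entities=["PERSON", "TITLES"])
            for p in pdfs]
for cmd in commands:
    print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

To actually run a command, pass the list to `subprocess.run(cmd, check=True)` from the app folder with your virtual environment active.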
860
+
861
+ ### Output files
862
+
863
+ The CLI generates the same output files as the GUI:
864
+ - `...redacted.pdf`: Final redacted document
865
+ - `...redactions_for_review.pdf`: Document with redaction boxes for review
866
+ - `...review_file.csv`: Detailed redaction information
867
+ - `...ocr_output.csv`: Extracted text results
868
+ - `..._textract.json`: AWS Textract results (if applicable)
869
+
870
+ For more advanced options and configuration, refer to the help text by running:
871
+ ```bash
872
+ python cli_redact.py --help
873
+ ```
tools/config.py CHANGED
@@ -773,26 +773,32 @@ DIRECT_MODE_PII_DETECTOR = get_or_create_env_var(
773
  DIRECT_MODE_OCR_METHOD = get_or_create_env_var(
774
  "DIRECT_MODE_OCR_METHOD", "Local OCR"
775
  ) # OCR method for PDF/image processing
776
- DIRECT_MODE_PAGE_MIN = int(get_or_create_env_var(
777
- "DIRECT_MODE_PAGE_MIN", str(DEFAULT_PAGE_MIN)
778
- )) # First page to process
779
- DIRECT_MODE_PAGE_MAX = int(get_or_create_env_var(
780
- "DIRECT_MODE_PAGE_MAX", str(DEFAULT_PAGE_MAX)
781
- )) # Last page to process
782
- DIRECT_MODE_IMAGES_DPI = float(get_or_create_env_var(
783
- "DIRECT_MODE_IMAGES_DPI", str(IMAGES_DPI)
784
- )) # DPI for image processing
785
  DIRECT_MODE_CHOSEN_LOCAL_OCR_MODEL = get_or_create_env_var(
786
  "DIRECT_MODE_CHOSEN_LOCAL_OCR_MODEL", CHOSEN_LOCAL_OCR_MODEL
787
  ) # Local OCR model choice
788
  DIRECT_MODE_PREPROCESS_LOCAL_OCR_IMAGES = convert_string_to_boolean(
789
- get_or_create_env_var("DIRECT_MODE_PREPROCESS_LOCAL_OCR_IMAGES", str(PREPROCESS_LOCAL_OCR_IMAGES))
 
 
790
  ) # Preprocess images before OCR
791
  DIRECT_MODE_COMPRESS_REDACTED_PDF = convert_string_to_boolean(
792
- get_or_create_env_var("DIRECT_MODE_COMPRESS_REDACTED_PDF", str(COMPRESS_REDACTED_PDF))
 
 
793
  ) # Compress redacted PDF
794
  DIRECT_MODE_RETURN_PDF_END_OF_REDACTION = convert_string_to_boolean(
795
- get_or_create_env_var("DIRECT_MODE_RETURN_PDF_END_OF_REDACTION", str(RETURN_REDACTED_PDF))
 
 
796
  ) # Return PDF at end of redaction
797
  DIRECT_MODE_EXTRACT_FORMS = convert_string_to_boolean(
798
  get_or_create_env_var("DIRECT_MODE_EXTRACT_FORMS", "False")
@@ -812,26 +818,36 @@ DIRECT_MODE_MATCH_FUZZY_WHOLE_PHRASE_BOOL = convert_string_to_boolean(
812
  DIRECT_MODE_ANON_STRATEGY = get_or_create_env_var(
813
  "DIRECT_MODE_ANON_STRATEGY", DEFAULT_TABULAR_ANONYMISATION_STRATEGY
814
  ) # Anonymisation strategy for tabular data
815
- DIRECT_MODE_FUZZY_MISTAKES = int(get_or_create_env_var(
816
- "DIRECT_MODE_FUZZY_MISTAKES", str(DEFAULT_FUZZY_SPELLING_MISTAKES_NUM)
817
- )) # Number of fuzzy spelling mistakes allowed
818
- DIRECT_MODE_SIMILARITY_THRESHOLD = float(get_or_create_env_var(
819
- "DIRECT_MODE_SIMILARITY_THRESHOLD", str(DEFAULT_DUPLICATE_DETECTION_THRESHOLD)
820
- )) # Similarity threshold for duplicate detection
821
- DIRECT_MODE_MIN_WORD_COUNT = int(get_or_create_env_var(
822
- "DIRECT_MODE_MIN_WORD_COUNT", str(DEFAULT_MIN_WORD_COUNT)
823
- )) # Minimum word count for duplicate detection
824
- DIRECT_MODE_MIN_CONSECUTIVE_PAGES = int(get_or_create_env_var(
825
- "DIRECT_MODE_MIN_CONSECUTIVE_PAGES", str(DEFAULT_MIN_CONSECUTIVE_PAGES)
826
- )) # Minimum consecutive pages for duplicate detection
 
 
 
 
 
 
827
  DIRECT_MODE_GREEDY_MATCH = convert_string_to_boolean(
828
- get_or_create_env_var("DIRECT_MODE_GREEDY_MATCH", str(USE_GREEDY_DUPLICATE_DETECTION))
 
 
829
  ) # Use greedy matching for duplicate detection
830
  DIRECT_MODE_COMBINE_PAGES = convert_string_to_boolean(
831
  get_or_create_env_var("DIRECT_MODE_COMBINE_PAGES", str(DEFAULT_COMBINE_PAGES))
832
  ) # Combine pages for duplicate detection
833
  DIRECT_MODE_REMOVE_DUPLICATE_ROWS = convert_string_to_boolean(
834
- get_or_create_env_var("DIRECT_MODE_REMOVE_DUPLICATE_ROWS", str(REMOVE_DUPLICATE_ROWS))
 
 
835
  ) # Remove duplicate rows in tabular data
836
 
837
  # Textract Batch Operations Options
@@ -843,12 +859,12 @@ DIRECT_MODE_JOB_ID = get_or_create_env_var(
843
  ) # Job ID for Textract operations
844
 
845
  # Lambda-specific configuration options
846
- LAMBDA_POLL_INTERVAL = int(get_or_create_env_var(
847
- "LAMBDA_POLL_INTERVAL", "30"
848
- )) # Polling interval in seconds for Textract job status
849
- LAMBDA_MAX_POLL_ATTEMPTS = int(get_or_create_env_var(
850
- "LAMBDA_MAX_POLL_ATTEMPTS", "120"
851
- )) # Maximum number of polling attempts for Textract job completion
852
  LAMBDA_PREPARE_IMAGES = convert_string_to_boolean(
853
  get_or_create_env_var("LAMBDA_PREPARE_IMAGES", "True")
854
  ) # Prepare images for OCR processing
 
773
  DIRECT_MODE_OCR_METHOD = get_or_create_env_var(
774
  "DIRECT_MODE_OCR_METHOD", "Local OCR"
775
  ) # OCR method for PDF/image processing
776
+ DIRECT_MODE_PAGE_MIN = int(
777
+ get_or_create_env_var("DIRECT_MODE_PAGE_MIN", str(DEFAULT_PAGE_MIN))
778
+ ) # First page to process
779
+ DIRECT_MODE_PAGE_MAX = int(
780
+ get_or_create_env_var("DIRECT_MODE_PAGE_MAX", str(DEFAULT_PAGE_MAX))
781
+ ) # Last page to process
782
+ DIRECT_MODE_IMAGES_DPI = float(
783
+ get_or_create_env_var("DIRECT_MODE_IMAGES_DPI", str(IMAGES_DPI))
784
+ ) # DPI for image processing
785
  DIRECT_MODE_CHOSEN_LOCAL_OCR_MODEL = get_or_create_env_var(
786
  "DIRECT_MODE_CHOSEN_LOCAL_OCR_MODEL", CHOSEN_LOCAL_OCR_MODEL
787
  ) # Local OCR model choice
788
  DIRECT_MODE_PREPROCESS_LOCAL_OCR_IMAGES = convert_string_to_boolean(
789
+ get_or_create_env_var(
790
+ "DIRECT_MODE_PREPROCESS_LOCAL_OCR_IMAGES", str(PREPROCESS_LOCAL_OCR_IMAGES)
791
+ )
792
  ) # Preprocess images before OCR
793
  DIRECT_MODE_COMPRESS_REDACTED_PDF = convert_string_to_boolean(
794
+ get_or_create_env_var(
795
+ "DIRECT_MODE_COMPRESS_REDACTED_PDF", str(COMPRESS_REDACTED_PDF)
796
+ )
797
  ) # Compress redacted PDF
798
  DIRECT_MODE_RETURN_PDF_END_OF_REDACTION = convert_string_to_boolean(
799
+ get_or_create_env_var(
800
+ "DIRECT_MODE_RETURN_PDF_END_OF_REDACTION", str(RETURN_REDACTED_PDF)
801
+ )
802
  ) # Return PDF at end of redaction
803
  DIRECT_MODE_EXTRACT_FORMS = convert_string_to_boolean(
804
  get_or_create_env_var("DIRECT_MODE_EXTRACT_FORMS", "False")
 
818
  DIRECT_MODE_ANON_STRATEGY = get_or_create_env_var(
819
  "DIRECT_MODE_ANON_STRATEGY", DEFAULT_TABULAR_ANONYMISATION_STRATEGY
820
  ) # Anonymisation strategy for tabular data
821
+ DIRECT_MODE_FUZZY_MISTAKES = int(
822
+ get_or_create_env_var(
823
+ "DIRECT_MODE_FUZZY_MISTAKES", str(DEFAULT_FUZZY_SPELLING_MISTAKES_NUM)
824
+ )
825
+ ) # Number of fuzzy spelling mistakes allowed
826
+ DIRECT_MODE_SIMILARITY_THRESHOLD = float(
827
+ get_or_create_env_var(
828
+ "DIRECT_MODE_SIMILARITY_THRESHOLD", str(DEFAULT_DUPLICATE_DETECTION_THRESHOLD)
829
+ )
830
+ ) # Similarity threshold for duplicate detection
831
+ DIRECT_MODE_MIN_WORD_COUNT = int(
832
+ get_or_create_env_var("DIRECT_MODE_MIN_WORD_COUNT", str(DEFAULT_MIN_WORD_COUNT))
833
+ ) # Minimum word count for duplicate detection
834
+ DIRECT_MODE_MIN_CONSECUTIVE_PAGES = int(
835
+ get_or_create_env_var(
836
+ "DIRECT_MODE_MIN_CONSECUTIVE_PAGES", str(DEFAULT_MIN_CONSECUTIVE_PAGES)
837
+ )
838
+ ) # Minimum consecutive pages for duplicate detection
839
  DIRECT_MODE_GREEDY_MATCH = convert_string_to_boolean(
840
+ get_or_create_env_var(
841
+ "DIRECT_MODE_GREEDY_MATCH", str(USE_GREEDY_DUPLICATE_DETECTION)
842
+ )
843
  ) # Use greedy matching for duplicate detection
844
  DIRECT_MODE_COMBINE_PAGES = convert_string_to_boolean(
845
  get_or_create_env_var("DIRECT_MODE_COMBINE_PAGES", str(DEFAULT_COMBINE_PAGES))
846
  ) # Combine pages for duplicate detection
847
  DIRECT_MODE_REMOVE_DUPLICATE_ROWS = convert_string_to_boolean(
848
+ get_or_create_env_var(
849
+ "DIRECT_MODE_REMOVE_DUPLICATE_ROWS", str(REMOVE_DUPLICATE_ROWS)
850
+ )
851
  ) # Remove duplicate rows in tabular data
852
 
853
  # Textract Batch Operations Options
 
859
  ) # Job ID for Textract operations
860
 
861
  # Lambda-specific configuration options
862
+ LAMBDA_POLL_INTERVAL = int(
863
+ get_or_create_env_var("LAMBDA_POLL_INTERVAL", "30")
864
+ ) # Polling interval in seconds for Textract job status
865
+ LAMBDA_MAX_POLL_ATTEMPTS = int(
866
+ get_or_create_env_var("LAMBDA_MAX_POLL_ATTEMPTS", "120")
867
+ ) # Maximum number of polling attempts for Textract job completion
868
  LAMBDA_PREPARE_IMAGES = convert_string_to_boolean(
869
  get_or_create_env_var("LAMBDA_PREPARE_IMAGES", "True")
870
  ) # Prepare images for OCR processing