Spaces:

seanpedrickcase
/

document_redaction

Running

App Files Files Community

document_redaction / src /faq.qmd

seanpedrickcase

Added source files for quarto documentation website

c3d1c4c 5 days ago

raw

history blame contribute delete

24.2 kB

	---
	title: "User FAQ"
	format:
	html:
	toc: true # Enable the table of contents
	toc-depth: 3 # Include headings up to level 2 (##)
	toc-title: "On this page" # Optional: Title for your TOC
	---

	## General Advice:
	* Read the User Guide: Many common questions are addressed in the detailed User Guide sections.
	* Start Simple: If you're new, try redacting with default options first before customising extensively.
	* Human Review is Key: Always manually review the `...redacted.pdf` or use the 'Review redactions' tab. No automated system is perfect.
	* Save Incrementally: When working on the 'Review redactions' tab, use the 'Save changes on current page to file' button periodically, especially for large documents.

	## General questions

	#### What is document redaction and what does this app do?
	Document redaction is the process of removing sensitive or personally identifiable information (PII) from documents. This application is a tool that automates this process for various document types, including PDFs, images, open text, and tabular data (`XLSX`/`CSV`/`Parquet`). It identifies potential PII using different methods and allows users to review, modify, and export the suggested redactions.

	#### What types of documents and data can be redacted?
	The app can handle a variety of formats. For documents, it supports `PDF`s and images (`JPG`, `PNG`). For tabular data, it works with `XLSX`, `CSV`, and `Parquet` files. Additionally, it can redact open text that is copied and pasted directly into the application interface.

	#### How does the app identify text and PII for redaction?
	The app employs several methods for text extraction and PII identification. Text can be extracted directly from selectable `PDF` text, using a local Optical Character Recognition (OCR) model for image-based content, or through the AWS Textract service for more complex documents, handwriting, and signatures (if available). For PII identification, it can use a local model based on the `spacy` package or the AWS Comprehend service for more accurate results (if available).

	#### Can I customise what information is redacted?
	Yes, the app offers extensive customisation options. You can define terms that should never be redacted (an 'allow list'), terms that should always be redacted (a 'deny list'), and specify entire pages to be fully redacted using `CSV` files. You can also select specific types of entities to redact, such as dates, or remove default entity types that are not relevant to your needs.

	#### How can I review and modify the suggested redactions?
	The app provides a dedicated 'Review redactions' tab with a visual interface. You can upload the original document and the generated review file (`CSV`) to see the suggested redactions overlaid on the document. Here, you can move, resize, delete, and add new redaction boxes. You can also filter suggested redactions based on criteria and exclude them individually or in groups.

	#### Can I work with tabular data or copy and pasted text?
	Yes, the app has a dedicated tab for redacting tabular data files (`XLSX`/`CSV`) and open text. For tabular data, you can upload your file and select which columns to redact. For open text, you can simply paste the text into a box. You can then choose the redaction method and the desired output format for the anonymised data.

	#### What are the options for the anonymisation format of redacted text?
	When redacting tabular data or open text, you have several options for how the redacted information is replaced. The default is to replace the text with 'REDACTED'. Other options include replacing it with the entity type (e.g., 'PERSON'), redacting completely (removing the text), replacing it with a consistent hash value, or masking it with stars ('*').

	#### Can I export or import redactions to/from other software like Adobe Acrobat?
	Yes, the app supports exporting and importing redaction data using the Adobe Acrobat comment file format (`.xfdf`). You can export suggested redactions from the app to an `.xfdf` file that can be opened in Adobe. Conversely, you can import an `.xfdf` file created in Adobe into the app to generate a review file (`CSV`) for further work within the application.

	## Troubleshooting

	#### Q1: The app missed some personal information or redacted things it shouldn't have. Is it broken?
	A: Not necessarily. The app is not 100% accurate and is designed as an aid. The `README` explicitly states: "NOTE: The app is not 100% accurate, and it will miss some personal information. It is essential that all outputs are reviewed by a human before using the final outputs."
	* Solution: Always use the 'Review redactions' tab to manually inspect, add, remove, or modify redactions.

	#### Q2: I uploaded a `PDF`, but no text was found, or redactions are very poor using the 'Local model - selectable text' option.
	A: This option only works if your `PDF` has actual selectable text. If your `PDF` is an image scan (even if it looks like text), this method won't work well.
	* Solution:
	* Try the 'Local OCR model - PDFs without selectable text' option. This uses Tesseract OCR to "read" the text from images.
	* For best results, especially with complex documents, handwriting, or signatures, use the 'AWS Textract service - all PDF types' if available.

	#### Q3: Handwriting or signatures are not being redacted properly.
	A: The 'Local' text/OCR methods (selectable text or Tesseract) struggle with handwriting and signatures.
	* Solution:
	* Use the 'AWS Textract service' for text extraction.
	* Ensure that on the main 'Redact PDFs/images' tab, under "Optional - select signature extraction" (when AWS Textract is chosen), you have enabled handwriting and/or signature detection. Note that signature detection has higher cost implications.

	#### Q4: The options for 'AWS Textract service' or 'AWS Comprehend' are missing or greyed out.
	A: These services are typically only available when the app is running in an AWS environment or has been specifically configured by your system admin to access these services (e.g., via `API` keys).
	* Solution:
	* Check if your instance of the app is supposed to have AWS services enabled.
	* If running outside AWS, see the "Using AWS Textract and Comprehend when not running in an AWS environment" section in the advanced guide. This involves configuring AWS access keys, which should be done with IT and data security approval.

	#### Q5: I re-processed the same document, and it seems to be taking a long time and potentially costing more with AWS services. Can I avoid this?
	A: Yes. If you have previously processed a document with AWS Textract or the Local OCR model, the app generates a `.json` output file (`..._textract.json` or `..._ocr_results_with_words.json`).
	* Solution: When re-uploading your original document for redaction, also upload the corresponding `.json` file. The app should detect this (the "Existing Textract output file found" box may be checked), skipping the expensive text extraction step.

	#### Q6: My app crashed, or I reloaded the page. Are my output files lost?
	A: If you are logged in via AWS Cognito and the server hasn't been shut down, you might be able to recover them.
	* Solution: Go to the 'Redaction settings' tab, scroll to the bottom, and look for 'View all output files from this session'.

	#### Q7: My custom allow list (terms to never redact) or deny list (terms to always redact) isn't working.
	A: There are a few common reasons:
	* File Format: Ensure your list is a `.csv` file with terms in the first column only, with no column header.
	* Case Sensitivity: Terms in the allow/deny list are case sensitive.
	* Deny List & 'CUSTOM' Entity: For a deny list to work, you must select the 'CUSTOM' entity type in 'Redaction settings' under 'Entities to redact'.
	* Manual Additions: If you manually added terms in the app interface (under 'Manually modify custom allow...'), ensure you pressed `Enter` after typing each term in its cell.
	* Fuzzy Search for Deny List: If you intend to use fuzzy matching for your deny list, ensure 'CUSTOM_FUZZY' is selected as an entity type, and you've configured the "maximum number of spelling mistakes allowed."

	#### Q8: I'm trying to review redactions, but the `PDF` in the viewer looks like it's already redacted with black boxes.
	A: You likely uploaded the `...redacted.pdf` file instead of the original document.
	* Solution: On the 'Review redactions' tab, ensure you upload the original, unredacted `PDF` alongside the `..._review_file.csv`.

	#### Q9: I can't move or pan the document in the 'Review redactions' viewer when zoomed in.
	A: You are likely in "add redaction boxes" mode.
	* Solution: Scroll to the bottom of the document viewer pane and click the hand icon. This switches to "modify mode," allowing you to pan the document by clicking and dragging, and also to move/resize existing redaction boxes.

	#### Q10: I accidentally clicked "Exclude all items in table from redactions" on the 'Review redactions' tab without filtering, and now all my redactions are gone!
	A: This can happen if you don't apply a filter first.
	* Solution: Click the 'Undo last element removal' button immediately. This should restore the redactions. Always ensure you have clicked the blue tick icon next to the search box to apply your filter before using "Exclude all items...".

	#### Q11: Redaction of my `CSV` or `XLSX` file isn't working correctly.
	A: The app expects a specific format for tabular data.
	* Solution: Ensure your data file has a simple table format, with the table starting in the first cell (`A1`). There should be no other information or multiple tables within the sheet you intend to redact. For `XLSX` files, each sheet to be redacted must follow this format.

	#### Q12: The "Identify duplicate pages" feature isn't finding duplicates I expect, or it's flagging too many pages.
	A: This feature uses text similarity based on the `ocr_outputs.csv` files and has a default similarity threshold (e.g., 90%).
	* Solution:
	* Ensure you've uploaded the correct `ocr_outputs.csv` files for all documents you're comparing.
	* Review the `page_similarity_results.csv` output to see the similarity scores. The 90% threshold might be too high or too low for your specific documents. The current version of the app described doesn't seem to allow changing this threshold in the `UI`, so you'd mainly use the output to inform your manual review.

	#### Q13: I exported a review file to Adobe (`.xfdf`), but when I open it in Adobe Acrobat, it can't find the `PDF` or shows no redactions.
	A: When Adobe Acrobat prompts you, it needs to be pointed to the exact original `PDF`.
	* Solution: Ensure you select the original, unredacted `PDF` file that was used to generate the `..._review_file.csv` (and subsequently the `.xfdf` file) when Adobe Acrobat asks for the associated document.

	#### Q14: My AWS Textract API job (submitted via "Submit whole document to AWS Textract API...") is taking a long time, or I don't know if it's finished.
	A: Large documents can take time. The document estimates about five seconds per page as a rough guide.
	* Solution:
	* After submitting, a Job ID will appear.
	* Periodically click the 'Check status of Textract job and download' button. Processing continues in the background.
	* Once ready, the `_textract.json` output will appear in the output area.

	#### Q15: I'm trying to redact specific terms from my deny list, but they are not being picked up, even though the 'CUSTOM' entity is selected.
	A: The deny list matches whole words with exact spelling by default.
	* Solution:
	* Double-check the spelling and case in your deny list.
	* If you expect misspellings to be caught, you need to use the 'CUSTOM_FUZZY' entity type and configure the "maximum number of spelling mistakes allowed" under 'Redaction settings'. Then, upload your deny list.

	#### Q16: I set the "Lowest page to redact" and "Highest page to redact" in 'Redaction settings', but the app still seems to process or show redactions outside this range.
	A: The page range setting primarily controls which pages have redactions applied in the final `...redacted.pdf`. The underlying text extraction (especially with OCR/Textract) might still process the whole document to generate the `...ocr_results.csv` or `..._textract.json`. When reviewing, the `review_file.csv` might initially contain all potential redactions found across the document.
	* Solution:
	* Ensure the `...redacted.pdf` correctly reflects the page range.
	* When reviewing, use the page navigation and filters on the 'Review redactions' tab to focus on your desired page range. The final application of redactions from the review tab should also respect the range if it's still set, but primarily it works off the `review_file.csv`.

	#### Q17: My "Full page redaction list" isn't working. I uploaded a `CSV` with page numbers, but those pages aren't blacked out.
	A: Common issues include:
	* File Format: Ensure your list is a `.csv` file with page numbers in the first column only, with no column header. Each page number should be on a new row.
	* Redaction Task: Simply uploading the list doesn't automatically redact. You need to:
	1. Upload the `PDF` you want to redact.
	2. Upload the full page redaction `CSV` in 'Redaction settings'.
	3. It's often best to deselect all other entity types in 'Redaction settings' if you only want to redact these full pages.
	4. Run the 'Redact document' process. The output `...redacted.pdf` should show the full pages redacted, and the `...review_file.csv` will list these pages.

	#### Q18: I merged multiple `...review_file.csv` files, but the output seems to have duplicate redaction boxes or some are missing.
	A: The merge feature simply combines all rows from the input review files.
	* Solution:
	* Duplicates: If the same redaction (same location, text, label) was present in multiple input files, it will appear multiple times in the merged file. You'll need to manually remove these duplicates on the 'Review redactions' tab or by editing the merged `...review_file.csv` in a spreadsheet editor before review.
	* Missing: Double-check that all intended `...review_file.csv` files were correctly uploaded for the merge. Ensure the files themselves contained the expected redactions.

	#### Q19: I imported an `.xfdf` Adobe comment file, but the `review_file.csv` generated doesn't accurately reflect the highlights or comments I made in Adobe Acrobat.
	A: The app converts Adobe's comment/highlight information into its review_file format. Discrepancies can occur if:
	* Comment Types: The app primarily looks for highlight-style annotations that it can interpret as redaction areas. Other Adobe comment types (e.g., sticky notes without highlights, text strike-throughs not intended as redactions) might not translate.
	* Complexity: Very complex or unusually shaped Adobe annotations might not convert perfectly.
	* PDF Version: Ensure the `PDF` uploaded alongside the `.xfdf` is the exact same original, unredacted `PDF` that the comments were made on in Adobe.
	* Solution: After import, always open the generated `review_file.csv` (with the original `PDF`) on the 'Review redactions' tab to verify and adjust as needed.

	#### Q20: The Textract API job status table (under "Submit whole document to AWS Textract API...") only shows recent jobs, or I can't find an older Job ID I submitted.
	A: The table showing Textract job statuses might have a limit or only show jobs from the current session or within a certain timeframe (e.g., "up to seven days old" is mentioned).
	* Solution:
	* It's good practice to note down the Job ID immediately after submission if you plan to check it much later.
	* If the `_textract.json` file was successfully created from a previous job, you can re-upload that `.json` file with your original `PDF` to bypass the `API` call and proceed directly to redaction or OCR conversion.

	#### Q21: I edited a `...review_file.csv` in Excel (e.g., changed coordinates, labels, colors), but when I upload it to the 'Review redactions' tab, the boxes are misplaced, the wrong color, or it causes errors.
	A: The `review_file.csv` has specific columns and data formats (e.g., coordinates, `RGB` color tuples like `(0,0,255)`).
	* Solution:
	* Coordinates (xmin, ymin, xmax, ymax): Ensure these are numeric and make sense for `PDF` coordinates. Drastic incorrect changes can misplace boxes.
	* Colors: Ensure the color column uses the `(R,G,B)` format, e.g., `(0,0,255)` for blue, not hex codes or color names, unless the app specifically handles that (the guide mentions `RGB`).
	* CSV Integrity: Ensure you save the file strictly as a `CSV`. Excel sometimes adds extra formatting or changes delimiters if not saved carefully.
	* Column Order: Do not change the order of columns in the `review_file.csv`.
	* Test Small Changes: Modify one or two rows/values first to see the effect before making bulk changes.

	#### Q22: The cost and time estimation feature isn't showing up, or it's giving unexpected results.
	A: This feature depends on admin configuration and certain conditions.
	* Solution:
	* Admin Enabled: Confirm with your system admin that the cost/time estimation feature is enabled in the app's configuration.
	* AWS Services: Estimation is typically most relevant when using AWS Textract or Comprehend. If you're only using 'Local' models, the estimation might be simpler or not show AWS-related costs.
	* Existing Output: If "Existing Textract output file found" is checked (because you uploaded a pre-existing `_textract.json`), the estimated cost and time should be significantly lower for the Textract part of the process.

	#### Q23: I'm prompted for a "cost code," but I don't know what to enter, or my search isn't finding it.
	A: Cost code selection is an optional feature enabled by system admins for tracking AWS usage.
	* Solution:
	* Contact Admin/Team: If you're unsure which cost code to use, consult your team lead or the system administrator who manages the redaction app. They should provide the correct code or guidance.
	* Search Tips: Try searching by project name, department, or any known identifiers for your cost center. The search might be case-sensitive or require exact phrasing.

	#### Q24: I selected "hash" as the anonymisation output format for my tabular data, but the output still shows "REDACTED" or something else.
	A: Ensure the selection was correctly registered before redacting.
	* Solution:
	* Double-check on the 'Open text or Excel/csv files' tab, under 'Anonymisation output format,' that "hash" (or your desired format) is indeed selected.
	* Try re-selecting it and then click 'Redact text/data files' again.
	* If the issue persists, it might be a bug or a specific interaction with your data type that prevents hashing. Report this to your app administrator. "Hash" should replace PII with a consistent unique `ID` for each unique piece of PII.

	#### Q25: I'm using 'CUSTOM_FUZZY' for my deny list. I have "Should fuzzy search match on entire phrases in deny list" checked, but it's still matching individual words within my phrases or matching things I don't expect.
	A: Fuzzy matching on entire phrases can be complex. The "maximum number of spelling mistakes allowed" applies to the entire phrase.
	* Solution:
	* Mistake Count: If your phrase is long and the allowed mistakes are few, it might not find matches if the errors are distributed. Conversely, too many allowed mistakes on a short phrase can lead to over-matching. Experiment with the mistake count.
	* Specificity: If "match on entire phrases" is unchecked, it will fuzzy match each individual word (excluding stop words) in your deny list phrases. This can be very broad. Ensure this option is set according to your needs.
	* Test with Simple Phrases: Try a very simple phrase with a known, small number of errors to see if the core fuzzy logic is working as you expect, then build up complexity.

	#### Q26: I "locked in" a new redaction box format on the 'Review redactions' tab (label, colour), but now I want to change it or go back to the pop-up for each new box.
	A: When a format is locked, a new icon (described as looking like a "gift tag") appears at the bottom of the document viewer.
	* Solution:
	* Click the "gift tag" icon at the bottom of the document viewer pane.
	* This will allow you to change the default locked format.
	* To go back to the pop-up appearing for each new box, click the lock icon within that "gift tag" menu again to "unlock" it (it should turn from blue to its original state).

	#### Q27: I clicked "Redact document," processing seemed to complete (e.g., progress bar finished, "complete" message shown), but no output files (`...redacted.pdf`, `...review_file.csv`) appeared in the output area.
	A: This could be due to various reasons:
	* No PII Found: If absolutely no PII was detected according to your settings (entities, allow/deny lists), the app might not generate a `...redacted.pdf` if there's nothing to redact, though a `review_file.csv` (potentially empty) and `ocr_results.csv` should still ideally appear.
	* Error During File Generation: An unhandled error might have occurred silently during the final file creation step.
	* Browser/UI Issue: The `UI` might not have refreshed to show the files.
	* Permissions: In rare cases, if running locally, there might be file system permission issues preventing the app from writing outputs.
	* Solution:
	* Try refreshing the browser page (if feasible without losing input data, or after re-uploading).
	* Check the 'Redaction settings' tab for 'View all output files from this session' (if logged in via Cognito) – they might be listed there.
	* Try a very simple document with obvious PII and default settings to see if any output is generated.
	* Check browser developer console (`F12`) for any error messages.

	#### Q28: When reviewing, I click on a row in the 'Search suggested redactions' table. The page changes, but the specific redaction box isn't highlighted, or the view doesn't scroll to it.
	A: The highlighting feature ("should change the colour of redaction box to blue") is an aid.
	* Solution:
	* Ensure you are on the correct page. The table click should take you there.
	* The highlighting might be subtle or conflict with other `UI` elements. Manually scan the page for the text/label mentioned in the table row.
	* Scrolling to the exact box isn't explicitly guaranteed, especially on very dense pages. The main function is page navigation.

	#### Q29: I rotated a page in the 'Review redactions' document viewer, and now all subsequent pages are also rotated, or if I navigate away and back, the rotation is lost.
	A: The `README` states: "When you switch page, the viewer will stay in your selected orientation, so if it looks strange, just rotate the page again and hopefully it will look correct!"
	* Solution:
	* The rotation is a viewing aid for the current page session in the viewer. It does not permanently alter the original `PDF`.
	* If subsequent pages appear incorrectly rotated, use the rotation buttons again for that new page.
	* The rotation state might reset if you reload files or perform certain actions. Simply re-apply rotation as needed for viewing.