ssnshred: Redact SSNs from PDFs

Redact SSNs from tax PDFs. Everything runs in your browser. Your file never leaves your computer. Scanned or image-based PDFs are not supported.

Why do we need multiple types of redaction?

A primer on PDFs

A PDF is not a simple document like a .txt file. It's a structured bundle of many different types of objects (text, images, fonts, form data, metadata, and even full attached files), all wrapped together.

To redact text from a PDF, we need to find and remove it from every place it might appear, not just the visible page.

Five places SSNs can hide

1. Page content streams (the visible text)

This is what we see when we open the PDF. Text is stored as low-level drawing commands like "move to position (72, 130) and draw the string '078-05-1120'." This case is handled by finding the drawing commands and removing them.
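To make the idea of "drawing commands" concrete, here is a minimal sketch in Python. It is not this repo's implementation. It assumes the content stream is already decompressed, that the SSN is drawn by a single `Tj` (show text) operator with a simple literal string, and that SSNs use the dashed form. The names `SSN_RE`, `SHOW_TEXT_RE`, and `strip_ssn_draws` are made up for illustration. Real PDFs often split text across several operators or use font-specific encodings, which this sketch ignores.

```python
import re

# Dashed SSN pattern (an assumption; a real tool may match more formats).
SSN_RE = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")

# Matches a literal-string show-text command: (some text) Tj
# Simplified: ignores escaped parentheses, hex strings, and TJ arrays.
SHOW_TEXT_RE = re.compile(rb"\(([^()\\]*)\)\s*Tj")


def strip_ssn_draws(stream: bytes) -> bytes:
    """Remove every Tj command whose string operand contains an SSN."""
    def drop_if_ssn(match: re.Match) -> bytes:
        if SSN_RE.search(match.group(1)):
            return b""          # delete the whole drawing command
        return match.group(0)   # leave other text untouched
    return SHOW_TEXT_RE.sub(drop_if_ssn, stream)


# Toy decoded content stream: one line draws an SSN, the other does not.
sample = (
    b"BT /F1 12 Tf 72 130 Td (078-05-1120) Tj ET\n"
    b"BT /F1 12 Tf 72 100 Td (Form 1040) Tj ET"
)
cleaned = strip_ssn_draws(sample)
```

After this runs, `cleaned` no longer contains the SSN bytes, while the "Form 1040" command is left in place.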

2. Metadata (document properties)

Most PDF viewers have a properties panel (try File → Properties). It shows fields like Title, Author, Subject, and Keywords, which are stored in the PDF's document metadata. Tax software sometimes stuffs SSNs into these fields, for example by titling a document "Tax Return - John Doe SSN 078-05-1120" for internal tracking. This metadata is stored separately from page content and must be scrubbed independently.
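As a rough illustration, scrubbing metadata amounts to running the same pattern over every property value. The sketch below is in Python and models the metadata as a plain dictionary; how the fields are actually read from and written back to the PDF depends on the PDF library in use and is not shown. `SSN_RE`, `scrub_metadata`, and the `[REDACTED]` placeholder are illustrative assumptions, not this repo's code.

```python
import re

# Dashed SSN pattern (an assumption; a real tool may match more formats).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def scrub_metadata(info: dict[str, str]) -> dict[str, str]:
    """Return a copy of the metadata with SSN-shaped values replaced."""
    return {key: SSN_RE.sub("[REDACTED]", value) for key, value in info.items()}


# Toy metadata, using the example title from the text above.
info = {
    "Title": "Tax Return - John Doe SSN 078-05-1120",
    "Author": "Tax Software",
}
scrubbed = scrub_metadata(info)
print(scrubbed["Title"])  # prints: Tax Return - John Doe SSN [REDACTED]
```

Scrubbing every field, rather than only Title, matters because an SSN can land in Subject or Keywords just as easily.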

3. Form fields (fillable forms)

Many tax PDFs are interactive forms where the SSN box is a fillable text field rather than drawn text. Each form field stores its value in the PDF's form data, and a "widget" annotation draws that value on the page. Even if we visually redact the page, the field's stored value remains and is trivially readable. We must overwrite the stored value and regenerate the widget's visual appearance.
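The two-step fix (overwrite the value, then regenerate the appearance) can be sketched with a toy model in Python. In PDF terms the stored value is the field's `/V` entry and the cached rendering is its appearance stream (`/AP`). Here both are modeled as plain strings. `ToyField`, `render`, and `redact_field` are hypothetical names for illustration only, and `render` stands in for whatever a real PDF library does to rebuild the appearance.

```python
import re
from dataclasses import dataclass

# Dashed SSN pattern (an assumption; a real tool may match more formats).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass
class ToyField:
    name: str
    value: str       # stands in for the stored field value (/V)
    appearance: str  # stands in for the cached rendering (/AP)


def render(value: str) -> str:
    """Hypothetical stand-in for regenerating an appearance stream."""
    return f"({value}) Tj"


def redact_field(field: ToyField) -> ToyField:
    """Overwrite the stored value AND rebuild the appearance from it."""
    new_value = SSN_RE.sub("", field.value)
    # Rebuilding from new_value avoids keeping a stale rendering of the SSN.
    return ToyField(field.name, new_value, render(new_value))


ssn_box = ToyField("ssn", "078-05-1120", render("078-05-1120"))
redacted = redact_field(ssn_box)
```

Changing only `value` would leave the SSN in `appearance`, and changing only `appearance` would leave it in `value`, which is why both steps are needed.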

4. Embedded file attachments

A PDF can carry other files inside it, like email attachments. Tax software sometimes embeds supplementary data files, calculation worksheets, or XML tax data that may contain SSNs. These live in a completely separate part of the PDF structure and are invisible on any page.

5. Orphaned objects (garbage from editing)

When a PDF is edited (including by redaction tools), the old content is unlinked but not necessarily deleted. The old text can remain in the file as an "orphaned object": no page points to it anymore, but its bytes are still there. Anyone with a hex editor or PDF library can find and read it. Saving with garbage collection removes all unreferenced objects, truly deleting the old content.
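A quick way to see why orphaned objects matter is to scan the raw file bytes, ignoring page structure entirely. The Python sketch below does that on a toy fragment (not a complete, valid PDF). `SSN_RE` and `find_ssns_in_raw_bytes` are illustrative names. Note the limits of this check: streams in real PDFs are usually compressed, so a hit proves a leak but a clean scan does not prove the file is safe.

```python
import re

# Dashed SSN pattern over raw bytes (an assumption; real tools may match more).
SSN_RE = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")


def find_ssns_in_raw_bytes(data: bytes) -> list[bytes]:
    """Return every SSN-shaped byte sequence found anywhere in the data."""
    return SSN_RE.findall(data)


# Toy fragment: object 4 is the old, unlinked content; object 5 replaced it.
raw = (
    b"4 0 obj\n<< /Length 16 >>\nstream\n(078-05-1120) Tj\nendstream\nendobj\n"
    b"5 0 obj\n<< /Length 0 >>\nstream\n\nendstream\nendobj\n"
)
leaks = find_ssns_in_raw_bytes(raw)
```

Here `leaks` contains the SSN even though the replacement object is empty, which is the situation that saving with garbage collection is meant to eliminate.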

A note on security posture for PDFs

PDFs were designed for rendering fidelity, i.e., making documents look the same everywhere. They were not designed for security. Most redaction tools handle only case #1 (page content) and miss the rest. The script in this repo attempts to cover all five cases and to inform users clearly about its remaining limitations.