How to Redact Personal Data from Excel and CSV Files

In August 2023, the Police Service of Northern Ireland responded to a Freedom of Information request about staff numbers. They published a spreadsheet. The visible sheet contained the summary data requested.

What they missed: a hidden worksheet in the same workbook containing the surnames, first initials, ranks, locations, and departments of all 9,483 serving officers and staff. The file went public. The data went public with it.

Nobody had done anything malicious. Someone had created the spreadsheet with both sheets, shared the file with only the summary sheet showing, and assumed the hidden sheet was invisible. It is, until someone right-clicks a sheet tab and chooses Unhide.

This is the Excel redaction problem in one incident: the file looks clean. The data is still there.

Why Excel Is Not Like a PDF

When you redact a PDF incorrectly, the failure mode is usually a black box over recoverable text. When you redact an Excel file incorrectly, the failure modes are more numerous and less obvious.

An XLSX file is not a single document — it is a ZIP archive containing multiple XML files, each representing a different layer of the workbook. The visible cells are only one of those layers. Personal data can exist simultaneously in:

Visible cell values — the obvious one. Names, email addresses, phone numbers, ID numbers in spreadsheet cells.

Hidden rows and columns — rows or columns set to zero height or width, or explicitly hidden via the Format menu. The data stays in the file. A recipient selects all, unhides, and sees everything.

Hidden worksheets — sheets hidden via the tab right-click menu, invisible in the tab bar but present in the file. And then there is the “very hidden” state, which cannot be unhidden through the normal Excel UI at all — it requires going into the VBA editor to change the sheet’s Visible property. This state exists for developer purposes, but files occasionally circulate with very hidden sheets containing source data that was never meant to leave the organization.

PivotTable caches — when you create a PivotTable from a data range, Excel stores a copy of the source data in a cache embedded in the file. If you delete the source data and share the file, the PivotTable cache still contains it. By default, a recipient can double-click the grand total in the PivotTable (Show Details) and Excel will write the cached records out to a new sheet.

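Formula references — a cell showing a value may be displaying the result of a VLOOKUP or INDEX/MATCH against a range in another sheet. Tidying up the sheet you can see does not remove that source: the formula still points to wherever it was originally reading from, which may still contain personal data. Excel also stores the last calculated result next to each formula in the file, so the value can persist even when the formula can no longer be recalculated.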

Comments and notes — Excel supports two types of annotations: older-style notes (yellow sticky-note appearance) and newer threaded comments in Microsoft 365. Both are embedded in the file’s XML, survive most copy-paste operations, and can contain text that doesn’t appear anywhere in the cell values. “Check this SSN, doesn’t match the W-9” is a real category of comment that survives a redaction pass focused only on cell content.

Named ranges — the Name Manager in Excel stores named references to cell ranges. A name survives when the cells it points to are cleared, so it can still reveal the sheet and range where personal data lived, lead a recipient to a hidden sheet, or hold a constant value of its own that never appears in any cell.

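File metadata — author name, last-modified-by name, company, and creation and modification timestamps are stored at the workbook level, not in the cells. Legacy shared workbooks with change tracking can also carry a revision log. These survive all cell-level redaction approaches.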

Embedded objects — charts pulled from other workbooks, images, OLE objects. Each carries its own data layer and its own metadata.
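To make the layers above concrete, here is a minimal sketch that opens an XLSX as the ZIP archive it is and groups its parts by layer. The function name inventory_layers is hypothetical, and the path prefixes assume the part names Excel conventionally writes; files produced by other tools may use different names.

```python
import zipfile

# Conventional part-name prefixes in an Excel-written XLSX (an assumption:
# other producers may name parts differently).
LAYER_PREFIXES = {
    "xl/worksheets/": "worksheet cells (visible or hidden)",
    "xl/sharedStrings.xml": "shared cell text",
    "xl/pivotCache/": "PivotTable cache",
    "xl/comments": "notes (legacy comments)",
    "xl/threadedComments/": "threaded comments",
    "xl/externalLinks/": "cached values from linked workbooks",
    "xl/embeddings/": "embedded OLE objects",
    "xl/media/": "embedded images",
    "docProps/": "file metadata",
}


def inventory_layers(xlsx):
    """Group the archive's part names by the layer they belong to."""
    found = {}
    with zipfile.ZipFile(xlsx) as archive:
        for name in archive.namelist():
            for prefix, layer in LAYER_PREFIXES.items():
                if name.startswith(prefix):
                    found.setdefault(layer, []).append(name)
    return found
```

Calling inventory_layers("report.xlsx") on a workbook you are about to share gives a quick list of which layers exist at all, before any of them are opened.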

The Mistakes That Look Like Redaction But Aren’t

Black fill applied to cells. This is the Excel equivalent of the PDF black box problem. The cell background becomes black. The text becomes invisible. The data stays in the cell — copy it, paste it into another location, change the font color, and it returns.

White text on white background. Same failure mode. The cell appears empty. The text is there. Select the cell, change the font color, the data is visible again.

Hiding rows, columns, or sheets. Hidden is not redacted. One right-click, one menu selection: unhide. The PSNI incident is the canonical example.

Deleting visible content but leaving formulas. A cell showing a phone number that is actually the result of a formula will show an error or a blank when the cells it reads from are cleared, but the formula still references wherever it was pointing. If that location is in another sheet or a named range that was not cleared, the data is still accessible.

Running Document Inspector and assuming the job is done. Document Inspector is useful. It is not sufficient. More on this below.

What Document Inspector Actually Does — and Doesn’t Do

Document Inspector is Excel’s built-in tool for finding and removing hidden data. Access it via File → Info → Check for Issues → Inspect Document.

It will find and offer to remove: comments and notes, document properties and personal information, headers and footers, hidden rows and columns, hidden worksheets, and custom XML data.

What it cannot do: detect PII in cell values categorically. It does not know that a cell contains a Social Security Number versus an arbitrary nine-digit reference code. It does not flag names, email addresses, or phone numbers as personal data. It does not clean PivotTable caches of source data, and it does not address formula references pointing to cleared cells.

Document Inspector is a useful first step in a redaction workflow. It is not a substitute for one.
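The gap described above, recognizing personal data inside cell values, is a pattern-detection problem. The sketch below is illustrative only: the three patterns and the classify function are assumptions made for this example, not a complete detector, and the nine_digits rule shows why an SSN and a reference code cannot be told apart by shape alone.

```python
import re

# Illustrative patterns only; real detection needs broader rules and review.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # Nine bare digits could be an SSN or an arbitrary reference code.
    "nine_digits": re.compile(r"\b\d{9}\b"),
}


def classify(value):
    """Return the names of every pattern that matches the cell text."""
    return [name for name, rx in PATTERNS.items() if rx.search(value)]
```

A match on nine_digits is a prompt for human review rather than a verdict, which is the distinction Document Inspector has no way to make.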

CSV vs. XLSX: A Simpler Problem

CSV files are plain text. There are no hidden layers, no metadata, no formulas, no pivot caches. What you see in a text editor is what the file contains.

This makes CSV redaction conceptually simpler: identify the columns or values containing personal data, replace or remove them, save the file. The challenge is scale — a CSV with 100,000 rows and PII scattered across multiple columns, including free-text fields, is not a realistic target for manual editing.

The hidden-layer problems described above do not apply to CSV. But the detection problem does: identifying which values constitute personal data, particularly in free-text fields where names, addresses, and other identifiers appear in natural language rather than structured columns, requires more than column deletion.
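A minimal sketch of that two-part job, assuming a CSV with a header row: drop the columns that are direct identifiers, and mask email addresses inside free-text columns. The function name redact_csv, the [REDACTED] marker, and the single email pattern are all choices made for this example; free text in real files needs far more than one pattern.

```python
import csv
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_csv(src, dst, drop_columns, free_text_columns):
    """Copy CSV rows from src to dst, dropping some columns and masking
    email addresses found inside the free-text columns."""
    reader = csv.DictReader(src)
    kept = [c for c in reader.fieldnames if c not in drop_columns]
    # extrasaction="ignore" silently skips the dropped columns on write.
    writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        for col in free_text_columns:
            if row.get(col):
                row[col] = EMAIL.sub("[REDACTED]", row[col])
        writer.writerow(row)
```

Both src and dst are open text file objects, so the function streams row by row and does not need to hold a 100,000-row file in memory.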

If you are redacting a CSV for compliance purposes, the practical considerations are:

Which columns are direct identifiers that should be removed entirely
Which columns contain indirect identifiers that require assessment for re-identification risk
Whether any columns contain free text with embedded PII
Whether the combination of retained columns creates a re-identification risk for small subgroups
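The last consideration can be checked mechanically. As a rough sketch, count how many rows share each combination of retained indirect identifiers and flag combinations that fall below a threshold. The function name small_groups and the default threshold of 5 are assumptions for illustration; the right threshold depends on your own risk assessment.

```python
from collections import Counter


def small_groups(rows, quasi_identifiers, k=5):
    """Return combinations of quasi-identifier values shared by fewer than k rows."""
    counts = Counter(tuple(row[c] for c in quasi_identifiers) for row in rows)
    return {combo: n for combo, n in counts.items() if n < k}
```

Here rows is any iterable of dictionaries, for example a csv.DictReader over the redacted file. A combination that appears only once or twice describes a person who may be identifiable even with their name removed.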

A Practical Redaction Workflow for Excel Files

Step 1: Unhide everything before you start

Before reviewing anything, unhide all rows, columns, and sheets. In Excel: select all cells (Ctrl+A), then Home → Format → Hide & Unhide → Unhide Rows and Unhide Columns, and repeat on every sheet. For sheets: right-click any sheet tab → Unhide, and unhide every sheet listed. For very hidden sheets, open the VBA editor (Alt+F11), select each sheet in the Project Explorer, and check its Visible property in the Properties window (F4). Any sheet set to xlSheetVeryHidden needs to be changed to xlSheetVisible.
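Sheet visibility can also be read straight from the file, which catches very hidden sheets without opening the VBA editor. This is a sketch under assumptions: sheet_states is a hypothetical helper, and the namespace is the one used by standard (transitional) XLSX files, so workbooks saved as Strict Open XML would need a different value.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of standard (transitional) SpreadsheetML parts.
MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def sheet_states(xlsx):
    """Map each sheet name to 'visible', 'hidden', or 'veryHidden'."""
    with zipfile.ZipFile(xlsx) as archive:
        root = ET.fromstring(archive.read("xl/workbook.xml"))
    # A missing state attribute means the sheet is visible.
    return {
        sheet.get("name"): sheet.get("state", "visible")
        for sheet in root.iter(MAIN_NS + "sheet")
    }
```

Anything other than "visible" in the result is a sheet to unhide and review before going further.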

You cannot redact content you cannot see.

Step 2: Audit the Name Manager

Open the Name Manager (Formulas → Name Manager) and review every named range. Look for names that reference sheets or ranges containing personal data. Delete or redefine named ranges that point to data you are removing.
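Defined names can be marked hidden in the file, in which case the Name Manager does not list them, so it can be worth reading them from the workbook XML as well. A minimal sketch, assuming a standard XLSX; the function name defined_names is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def defined_names(xlsx):
    """Return (name, refers_to, is_hidden) for every defined name."""
    with zipfile.ZipFile(xlsx) as archive:
        root = ET.fromstring(archive.read("xl/workbook.xml"))
    return [
        (dn.get("name"), dn.text or "", dn.get("hidden") in ("1", "true"))
        for dn in root.iter(MAIN_NS + "definedName")
    ]
```

The refers_to value is either a reference such as Source!$A$1:$E$20 or a constant; both deserve a look.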

Step 3: Remove PivotTables or clear their caches

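If the workbook contains PivotTables, either remove them or stop Excel from saving their source data. The most reliable option is to copy the PivotTable, use Paste Special → Values to keep the summary as static cells, and then delete the PivotTable itself. If the PivotTable has to stay: right-click the PivotTable → PivotTable Options → Data tab → uncheck “Save source data with file,” then save the workbook. Do not rely on Refresh to clear anything, because refreshing reloads the cache from the source range. The summary values displayed in the PivotTable remain in the file either way, and the cache definition may still list the distinct values of each field, so check both for personal data.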
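Whether cached records were actually saved can be checked in the file itself. The sketch below counts the record rows in each pivot cache records part; pivot_cache_records is a hypothetical name, and the part path assumes Excel's conventional naming.

```python
import zipfile
import xml.etree.ElementTree as ET

MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def pivot_cache_records(xlsx):
    """Map each pivot cache records part to the number of source rows it stores."""
    counts = {}
    with zipfile.ZipFile(xlsx) as archive:
        for name in archive.namelist():
            if name.startswith("xl/pivotCache/pivotCacheRecords"):
                root = ET.fromstring(archive.read(name))
                # Each <r> child is one cached source record.
                counts[name] = len(root.findall(MAIN_NS + "r"))
    return counts
```

An empty result means no record parts were saved. It does not cover the pivotCacheDefinition parts, which are worth a separate text search.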

Step 4: Clear cell content and replace formulas with values

For cells containing PII: delete the values. For cells containing formulas that reference personal data: use Paste Special → Values to convert them to static text, then delete the static values. This breaks the formula linkage.
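To find the formulas that need this treatment, one rough approach is to scan each worksheet part for formulas that mention another sheet or workbook. In this sketch, cross_sheet_formulas is a hypothetical helper and the test for "!" is a heuristic: it will also match an exclamation mark inside a string literal, and it skips cells that only inherit a shared formula.

```python
import zipfile
import xml.etree.ElementTree as ET

MAIN_NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def cross_sheet_formulas(xlsx):
    """List (part, cell, formula) for formulas that mention another sheet."""
    hits = []
    with zipfile.ZipFile(xlsx) as archive:
        for name in archive.namelist():
            if not (name.startswith("xl/worksheets/") and name.endswith(".xml")):
                continue
            root = ET.fromstring(archive.read(name))
            for cell in root.iter(MAIN_NS + "c"):
                formula = cell.find(MAIN_NS + "f")
                # Sheet references look like Source!A1, hence the "!" check.
                if formula is not None and formula.text and "!" in formula.text:
                    hits.append((name, cell.get("r"), formula.text))
    return hits
```

Each hit names the cell to convert with Paste Special → Values and the sheet it was reading from.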

Step 5: Review comments and notes

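Right-click any cell with a comment or note indicator and review the text. To delete them in bulk, select all cells on a sheet (Ctrl+A) and use Home → Clear → Clear Comments and Notes (Clear Comments in older versions), then repeat on each sheet. In Microsoft 365, threaded comments and notes are separate features, so confirm that both are gone.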
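For a workbook with many sheets, the annotation text can be pulled out of the file in one pass instead of cell by cell. This sketch assumes Excel's conventional part names for notes and threaded comments and matches elements by local name so it does not depend on either namespace; annotation_texts is a hypothetical helper.

```python
import zipfile
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def annotation_texts(xlsx):
    """Return (part, cell ref, text) for every note and threaded comment."""
    found = []
    with zipfile.ZipFile(xlsx) as archive:
        for name in archive.namelist():
            is_annotation_part = name.startswith(("xl/comments", "xl/threadedComments/"))
            if not (is_annotation_part and name.endswith(".xml")):
                continue
            root = ET.fromstring(archive.read(name))
            for el in root.iter():
                if local_name(el.tag) in ("comment", "threadedComment"):
                    text = "".join(el.itertext()).strip()
                    found.append((name, el.get("ref"), text))
    return found
```

An empty list after the deletion step is the result to aim for.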

Step 6: Run Document Inspector

File → Info → Check for Issues → Inspect Document. Accept all removal options. This catches remaining hidden data, document properties, and author information that cell-level editing doesn’t address.
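To confirm that the author information is really gone afterwards, the core properties part can be read directly. A minimal sketch, assuming the standard docProps/core.xml location; core_properties is a hypothetical helper, and it does not cover the application properties in docProps/app.xml, where the company name is normally stored.

```python
import zipfile
import xml.etree.ElementTree as ET


def core_properties(xlsx):
    """Return the non-empty fields stored in docProps/core.xml."""
    with zipfile.ZipFile(xlsx) as archive:
        if "docProps/core.xml" not in archive.namelist():
            return {}
        root = ET.fromstring(archive.read("docProps/core.xml"))
    # Keys are element names with the namespace stripped, e.g. 'creator'.
    return {
        el.tag.rsplit("}", 1)[-1]: el.text
        for el in root
        if el.text and el.text.strip()
    }
```

Fields such as creator and lastModifiedBy should be absent or empty in a file that has been through Document Inspector.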

Step 7: Save as a new file

Save As → new filename. This leaves the unredacted original untouched for your records and has Excel write a fresh XLSX archive from the workbook’s current state, rather than carrying forward anything left over in the old file.

Step 8: Verify

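Open the saved file in Excel. On each sheet in turn, select all, copy, and paste into a plain text editor, then review the output for any personal data that survived the process. Copying only captures the active sheet’s cell values, so also confirm that all previously hidden sheets and rows are either gone or contain no personal data, and that no comments, named ranges, or PivotTables remain.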
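A complementary check works on the saved file rather than the Excel window: search every part of the archive for values you know should be gone. This is a sketch with stated limits. find_leaks is a hypothetical helper, it needs a list of known sensitive terms, it will miss values that XML escapes (for example an ampersand stored as &amp;amp;), and it cannot read inside images or embedded binary objects.

```python
import zipfile


def find_leaks(xlsx, terms):
    """Return (part, term) pairs where a known sensitive term still appears."""
    leaks = []
    with zipfile.ZipFile(xlsx) as archive:
        for name in archive.namelist():
            # Case-insensitive search of the raw part content.
            text = archive.read(name).decode("utf-8", errors="ignore").lower()
            for term in terms:
                if term.lower() in text:
                    leaks.append((name, term))
    return leaks
```

Running it with a handful of surnames or ID numbers from the original file gives a quick pass/fail signal, and the part name in each hit shows which layer was missed.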

When Manual Process Doesn’t Scale

The workflow above is appropriate for occasional, low-volume redaction of individual files. For teams processing spreadsheets regularly — payroll data before sharing with auditors, customer records before passing to analytics teams, HR files before external review — it is too slow and too error-prone.

A manually redacted file where one very hidden sheet was missed, one PivotTable cache was not cleared, or one comment containing a Social Security Number was overlooked produces the same compliance exposure as no redaction at all. The failure is invisible until someone looks.

Automated redaction of Excel files, where a tool reads the complete file structure including all hidden layers, detects PII across all data, and produces a clean output, handles volume without the number of manual steps growing with each file’s complexity.

The structural requirement is the same as for any other format: the tool must process files locally, without transmitting them to an external service, if the files contain personal data.

PII Redaction Pro processes XLSX and CSV files locally on your Windows machine — automated PII detection across cell values and free-text fields, with no data leaving your system. Try it free for 7 days.