Stripping metadata from a docx file

TLDR: Stripping metadata from a Microsoft Office file can be done with a single 7-Zip command:

7z d file.docx "docProps/*"

When you save a docx file, it can contain metadata like the author, title, and comments. On Stack Overflow, a user asked how to remove this metadata.

One of the answers was useful to me, but it didn’t strip all the metadata and replaced some fields with values of the wrong data type. It uses the python-docx library, so you need to pip install python-docx first. This library knows that a docx is a zip in a trenchcoat, and can modify the XML inside that zip in place. I’ve modified the answer to set every metadata field to an empty or null-ish value.

from datetime import datetime

from docx import Document

document_path = 'example.docx'

# Strip metadata
document = Document(document_path)
metadata_fields = {
    "author": str,
    "category": str,
    "comments": str,
    "content_status": str,
    "created": lambda: datetime(2000, 1, 1),
    "identifier": str,
    "keywords": str,
    "language": str,
    "last_modified_by": str,
    "last_printed": lambda: datetime(2000, 1, 1),
    "modified": lambda: datetime(2000, 1, 1),
    "revision": lambda: 1,  # python-docx only accepts a positive int here, so 0 would raise ValueError
    "subject": str,
    "title": str,
    "version": str,
}

for meta_field, factory in metadata_fields.items():
    setattr(document.core_properties, meta_field, factory())

# Save the document
document.save(document_path)

The snippet above was my preferred way of removing metadata for quite some time. However, when creating a document with WordPad I noticed that it didn’t create a docProps directory at all. So it seems this directory is optional in the specification, and a docx file is still valid without it.
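To see whether a given file has a docProps directory at all, you can list the members of the zip. This is a minimal sketch using only Python's standard zipfile module; the function name list_docprops is my own and not part of any library.

```python
import zipfile


def list_docprops(path):
    """Return the names of all archive members stored under docProps/."""
    with zipfile.ZipFile(path) as archive:
        return [name for name in archive.namelist() if name.startswith("docProps/")]
```

Calling list_docprops("example.docx") on a file without the directory should return an empty list.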

That means we don’t have to bother with changing the XML in the file at all. With a single command we can delete all metadata, including fields such as the document’s edit duration.

7z d file.docx "docProps/*"

On Windows, the 7-Zip command line tool may be named 7z.exe instead, and it might not be on your PATH.

This command also works for files from other Microsoft Office applications, such as Excel or PowerPoint:

7z d file.xlsx "docProps/*"
7z d file.pptx "docProps/*"
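If 7-Zip isn’t available, roughly the same deletion can be sketched with Python’s standard library. The zipfile module cannot delete members in place, so this sketch rewrites the archive to a temporary file without the docProps/ members and then replaces the original. The function name strip_docprops is hypothetical, and like the 7z command it overwrites the file without asking and leaves every other member untouched.

```python
import os
import tempfile
import zipfile


def strip_docprops(path):
    """Rewrite the Office file at `path` without any docProps/ members."""
    # Create the temporary file next to the original so os.replace stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(suffix=".zip", dir=directory)
    os.close(fd)
    try:
        with zipfile.ZipFile(path) as src, \
                zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as dst:
            for item in src.infolist():
                if item.filename.startswith("docProps/"):
                    continue  # skip the metadata members
                dst.writestr(item, src.read(item.filename))
        os.replace(tmp_path, path)
    finally:
        # Only present if something failed before the replace.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```

The same call should work for docx, xlsx, and pptx files, for example strip_docprops("file.xlsx").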

Excel files can contain edit history, which may hold data you want to delete too. This is especially important if you’re trying to commit academic fraud. So if you want to be certain that some information is no longer in your file, extract the file and grep around for it.

After you’ve done this, you might be tempted to open the file to check that you haven’t accidentally corrupted it. Note that if you do, and you accidentally save it, you will have re-introduced the docProps directory. As a final warning, know that 7-Zip will not ask whether you’re sure about the command, so you may lose work.
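The “extract and grep” check mentioned above can also be approximated without extracting anything. This is a rough sketch with a hypothetical helper, find_in_archive, that does a plain byte search in every member of the zip. It assumes the text you are looking for is stored as UTF-8 and in one piece; XML escaping, or text that Office split across several elements, can make it miss a match, so an empty result is not proof that the data is gone.

```python
import zipfile


def find_in_archive(path, needle):
    """Return the names of archive members whose raw bytes contain `needle`."""
    needle_bytes = needle.encode("utf-8")
    with zipfile.ZipFile(path) as archive:
        return [
            name
            for name in archive.namelist()
            if needle_bytes in archive.read(name)
        ]
```

For example, find_in_archive("file.xlsx", "Jane Doe") returns the list of members that still contain that name.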