Stripping metadata from a docx file
14 Nov 2024
TLDR: Stripping metadata from a Microsoft Office file can be done with a single 7-zip command:
7z d file.docx "docProps/*"
When you save a docx file, it can contain metadata like the author, title, and comments. On StackOverflow, a user asked how to remove this metadata.
One of the answers was useful to me, but it didn’t strip all the metadata and replaced some fields with the wrong data type.
It uses the python-docx library, so you would need to pip install that first.
This library knows that a docx is a zip in a trenchcoat, and can modify the XML inside this zip in place.
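To peek under the trenchcoat, here is a minimal sketch that lists the entries of the zip using only Python's standard zipfile module. The function name list_parts is just a placeholder I picked for illustration.

```python
import zipfile


def list_parts(path):
    """Return the names of all entries in an Office file's zip container."""
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

For a document saved by Word, calling list_parts('example.docx') typically returns entries such as [Content_Types].xml, word/document.xml, docProps/core.xml and docProps/app.xml.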
I’ve modified the answer to set all metadata fields to an empty or null-ish value.
from datetime import datetime

from docx import Document

document_path = 'example.docx'

# Strip metadata
document = Document(document_path)
# Map each core property to a factory that produces its replacement value
metadata_fields = {
    "author": str,
    "category": str,
    "comments": str,
    "content_status": str,
    "created": lambda: datetime(2000, 1, 1),
    "identifier": str,
    "keywords": str,
    "language": str,
    "last_modified_by": str,
    "last_printed": lambda: datetime(2000, 1, 1),
    "modified": lambda: datetime(2000, 1, 1),
    "revision": lambda: 1,  # python-docx rejects revision values below 1
    "subject": str,
    "title": str,
    "version": str,
}
for meta_field, factory in metadata_fields.items():
    setattr(document.core_properties, meta_field, factory())

# Save the document
document.save(document_path)
The snippet above was my preferred way of removing metadata for quite some time.
However, when creating a document with WordPad I noticed it didn’t even create a docProps directory at all.
So it seems this entire directory is optional under the specification, and a docx file is still valid without it.
That means we don’t have to bother with changing the XML in the file at all. With a single command we can delete all metadata (including fields like edit duration of the document).
7z d file.docx "docProps/*"
On Windows the 7-zip command line tool can be named 7z.exe instead, and it might not be on your PATH.
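If you want to check from a script whether the tool is reachable, a small sketch with shutil.which could look like this. The function name find_seven_zip is made up, and the candidate names 7zz and 7za are assumptions on my part about how other builds name the binary.

```python
import shutil


def find_seven_zip(candidates=("7z", "7z.exe", "7zz", "7za")):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        found = shutil.which(name)
        if found:
            return found
    return None
```

When this returns None on Windows, the executable may still be sitting in the installation folder (for a default install that is usually C:\Program Files\7-Zip), in which case you can call it by its full path.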
This command also works for files from other Microsoft Office tools like Excel or PowerPoint:
7z d file.xlsx "docProps/*"
7z d file.pptx "docProps/*"
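If 7-zip is not available, the same idea can be approximated with Python's standard library. This is a sketch under the assumption that dropping every entry below docProps/ is all the 7z command does for our purposes. The function name strip_doc_props is my own. Since zipfile cannot delete entries in place, it copies everything else into a temporary archive and swaps that in.

```python
import os
import shutil
import tempfile
import zipfile


def strip_doc_props(path):
    """Rewrite the zip at `path` without entries under docProps/.

    Returns the names of the entries that were dropped.
    """
    removed = []
    folder = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the original so the final swap
    # stays on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(suffix=".zip", dir=folder)
    os.close(fd)
    try:
        with zipfile.ZipFile(path) as src, zipfile.ZipFile(tmp_path, "w") as dst:
            for item in src.infolist():
                if item.filename.startswith("docProps/"):
                    removed.append(item.filename)
                    continue
                # Passing the ZipInfo keeps name, timestamp and compression
                dst.writestr(item, src.read(item.filename))
        shutil.copymode(path, tmp_path)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return removed
```

Like the 7z command, this overwrites the file without asking and touches nothing else in the archive, so work on a copy if you care about the original.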
Excel files can contain edit history, which may contain some data you might want to delete too.
This is especially important if you’re trying to commit academic fraud.
So if you want to be certain that some information is no longer in your file, extract your file and grep around for it.
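The extract-and-grep step can also be done without unpacking anything to disk. The sketch below reads every entry of the zip and reports which ones contain a given string. The name find_in_parts is a placeholder, and the search is a plain byte comparison on the UTF-8 encoding of the string.

```python
import zipfile


def find_in_parts(path, needle):
    """Return the names of zip entries whose raw bytes contain `needle`."""
    needle_bytes = needle.encode("utf-8")
    hits = []
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            if needle_bytes in archive.read(name):
                hits.append(name)
    return hits
```

A plain byte search will miss text that is XML-escaped or that the application has split across several XML elements, so treat an empty result as a hint rather than proof.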
After you’ve done this, you might be tempted to open the file to check that you have not accidentally corrupted it.
Note that if you do that and accidentally save it, you have re-introduced the docProps directory.
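One way to sanity-check the file without opening it in an Office application is to let Python verify the zip container. This is only a sketch: testzip() checks the CRCs and headers of the zip entries, which says nothing about whether Word will accept the document. The function name check_office_zip is made up for this example.

```python
import zipfile


def check_office_zip(path):
    """Return (first_bad_entry, has_doc_props) for the zip at `path`.

    first_bad_entry is None when every entry passes its CRC check.
    """
    with zipfile.ZipFile(path) as archive:
        first_bad_entry = archive.testzip()
        has_doc_props = any(
            name.startswith("docProps/") for name in archive.namelist()
        )
    return first_bad_entry, has_doc_props
```

A result of (None, False) means the container is intact and no docProps entries are present.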
As a final warning, know that 7-zip will not ask if you’re sure about the command, so you may lose work.