14 Nov 2024
TLDR: Stripping metadata from a Microsoft Office file can be done with a single 7-zip command:
7z d file.docx "docProps/*"
When you save a docx file, it can contain metadata like the author, title, and comments.
On Stack Overflow, a user asked how to remove this metadata.
One of the answers was useful to me, but it didn’t strip all the metadata and replaced some fields with the wrong data type.
It uses the python-docx library, so you would need to pip install that first.
This library knows that a docx is a zip in a trenchcoat, and can modify the XML in this zip in place.
I’ve modified the answer to set all metadata fields to an empty or null-ish value.
from datetime import datetime
from docx import Document
document_path = 'example.docx'
# Strip metadata
document = Document(document_path)
metadata_fields = {
"author": str,
"category": str,
"comments": str,
"content_status": str,
"created": lambda: datetime(2000, 1, 1),
"identifier": str,
"keywords": str,
"language": str,
"last_modified_by": str,
"last_printed": lambda: datetime(2000, 1, 1),
"modified": lambda: datetime(2000, 1, 1),
"revision": lambda: 1,  # python-docx expects a positive int here
"subject": str,
"title": str,
"version": str,
}
for meta_field, factory in metadata_fields.items():
    setattr(document.core_properties, meta_field, factory())
# Save the document
document.save(document_path)
The snippet above was my preferred way of removing metadata for quite some time.
However, when creating a document with WordPad I noticed it didn’t create a docProps directory at all.
So it seems this entire directory is optional in the specification, and a docx file is valid without it.
That means we don’t have to bother with changing the XML in the file at all.
With a single command we can delete all metadata (including fields like edit duration of the document).
7z d file.docx "docProps/*"
On Windows the 7-zip command line tool can be named 7z.exe instead, and it might not be on your PATH.
This command also works for files from other Microsoft Office tools like Excel or PowerPoint:
7z d file.xlsx "docProps/*"
7z d file.pptx "docProps/*"
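If 7-zip is not available, the same idea can be sketched with Python’s standard zipfile module. This is a minimal sketch rather than a drop-in replacement: the function name strip_doc_props is made up for this example, and it writes a new file instead of modifying the original in place.

```python
import zipfile


def strip_doc_props(source: str, target: str) -> list[str]:
    """Copy an Office file to `target`, skipping every entry under docProps/.

    Returns the names of the entries that were left out.
    """
    removed = []
    with zipfile.ZipFile(source) as zin, \
            zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            if item.filename.startswith("docProps/"):
                # This is the metadata directory, so do not copy it
                removed.append(item.filename)
                continue
            # Copy every other entry unchanged
            zout.writestr(item, zin.read(item.filename))
    return removed


# Example usage (file names are placeholders):
# strip_doc_props("file.docx", "file-stripped.docx")
```

Because the original file is left untouched, you can compare both files before throwing the old one away.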
Excel files can contain edit history, which may contain some data you might want to delete too.
This is especially important if you’re trying to commit academic fraud.
So if you want to be certain that some information is no longer in your file, extract your file and grep around for it.
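The extract-and-grep step can also be done without unpacking anything. The sketch below searches the raw bytes of every entry in the zip for a given string; find_in_office_file is a hypothetical helper name, and it assumes the text is stored as plain UTF-8.

```python
import zipfile


def find_in_office_file(path: str, needle: str) -> list[str]:
    """Return the names of the zip entries whose raw bytes contain `needle`."""
    needle_bytes = needle.encode("utf-8")
    with zipfile.ZipFile(path) as archive:
        return [
            name
            for name in archive.namelist()
            if needle_bytes in archive.read(name)
        ]


# Example usage (file name and search term are placeholders):
# find_in_office_file("file.xlsx", "Jane Doe")
```

An empty result is a good sign but not proof: text inside the XML can be escaped or split over several elements, so a plain substring search might miss it.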
After you’ve done this, you might be tempted to open the file to check that you have not accidentally corrupted it.
Note that if you do that and accidentally save it, you have re-introduced the docProps directory.
As a final warning, know that 7-zip will not ask for confirmation before changing the file, so you may lose work.
24 Jul 2024
If you’re used to the Dvorak keyboard layout, you might have a hard time using the GPD Pocket 3.
In a QWERTY layout the ; key is to the right of the L.
GPD moved this key to the right of the spacebar on the Pocket 3.
The location of the ; is used for the letter s in the Dvorak layout.
That means a very frequently used letter is no longer on the home row, defeating the purpose of the Dvorak layout quite a bit.
This was done on Ubuntu 24.04. Your experience on other Linux distributions might vary.
I found the following remapping quite easy to get used to:
- Enter becomes the letter S
- Backspace is the new Enter
- Delete is now Backspace
- Shift + Delete is the Delete action
(Although not related to the GPD Pocket 3 specifically, I also remapped Caps Lock to Control.)
Note that these changes are applied on top of simply switching to the Dvorak layout:
[Images: the original layout | the changes on top of setting the layout to Dvorak]
In practice this results in the following changes to the /usr/share/X11/xkb/symbols/pc config file:
--- pc.bak 2024-05-14 08:04:51.859190653 +0200
+++ pc 2024-05-14 08:54:16.902085341 +0200
@@ -4,11 +4,11 @@
key <ESC> {[ Escape ]};
key <TAB> {[ Tab, ISO_Left_Tab ]};
- key <CAPS> {[ Caps_Lock ]};
+ key <CAPS> {[ Control_R ]};
- key <BKSP> {[ BackSpace, BackSpace ]};
+ key <BKSP> {[ Return ]};
key <BKSL> {[ backslash, bar ]};
- key <RTRN> {[ Return ]};
+ key <RTRN> {[ s, S ]};
// The extra key on many European keyboards:
key <LSGT> {[ less, greater, bar, brokenbar ]};
@@ -59,7 +59,7 @@
key <PAUS> {[ Pause, Break ], type="PC_CONTROL_LEVEL2" };
key <INS> {[ Insert ]};
- key <DELE> {[ Delete ]};
+ key <DELE> {[ BackSpace, Delete ]};
key <HOME> {[ Home ]};
key <END> {[ End ]};
key <PGUP> {[ Prior ]};
Note that this configuration only takes effect in X.
The remap is not active before X starts, for example when unlocking the full disk encryption during boot.
The configuration change also does not survive some updates, so you might need to restore it from time to time.
It might be better to achieve the same results with xmodmap.
I have not yet done this successfully though.
If my current solution of modifying the XKB config gives problems, I might try xmodmap next.
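Since the XKB change can get lost after an update, re-applying it could be scripted. The sketch below is one way to do that, assuming the key lines in the pc file look like the ones in the diff above. The REMAP table and the apply_remap function are names made up for this example.

```python
import re

# Replacement symbols per XKB key name, taken from the diff above
REMAP = {
    "CAPS": "Control_R",
    "BKSP": "Return",
    "RTRN": "s, S",
    "DELE": "BackSpace, Delete",
}

# Matches lines such as: key <CAPS> {[ Caps_Lock ]};
KEY_LINE = re.compile(r"^(\s*key\s+<(\w+)>\s*\{\s*\[)([^\]]*)(\]\s*\};.*)$")


def apply_remap(text: str) -> str:
    """Return the config text with the symbols of the remapped keys replaced."""
    patched = []
    for line in text.splitlines():
        match = KEY_LINE.match(line)
        if match and match.group(2) in REMAP:
            line = f"{match.group(1)} {REMAP[match.group(2)]} {match.group(4)}"
        patched.append(line)
    return "\n".join(patched) + "\n"


# Example usage (writing to this path needs root, keep a backup first):
# path = "/usr/share/X11/xkb/symbols/pc"
# patched = apply_remap(open(path).read())
```

Running the function on an already patched file gives the same result, so it should be safe to run it after every update.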
27 May 2024
Most common implementations of the HDBSCAN clustering algorithm don’t have a predict method.
There are some fundamental reasons why many implementations lack it.
These reasons mostly boil down to the following: Introducing a new data point might change the clustering.
But depending on your use case, it can be valid to want or need a predict method anyway.
For example, you could be using scikit-learn pipelines, which expect a predict method on the final estimator.
Or you might be sure that the samples you want to cluster look a lot like your training data.
Maybe the samples you offer to the predict method are even drawn from the data you fitted the clusterer on.
For all those use cases the people maintaining the hdbscan package provide the approximate_predict utility.
We can expose the functionality of this utility as the predict method of the HDBSCAN class.
Because not all implementations have something like this method, here I’ll assume the hdbscan package is being used.
The code below shows how to monkey patch a predict method on an instance of HDBSCAN:
from hdbscan import HDBSCAN
from hdbscan.prediction import approximate_predict
from sklearn.datasets import make_blobs
# Generate some sample data to cluster
blobs, _ = make_blobs(
    n_samples=750,
    centers=[[1, 1], [-1, -1], [1.5, -1.5]],
    cluster_std=[0.4, 0.1, 0.75],
    random_state=0,
)
# Instantiate HDBSCAN with prediction_data=True so approximate_predict will work
clusterer = HDBSCAN(prediction_data=True)
# Monkey patch the approximate_predict method as predict method on the instance
clusterer.predict = lambda x: approximate_predict(clusterer, x)[0]
# Now the predict method is available
fitted = list(clusterer.fit_predict(blobs))
predicted = list(clusterer.predict(blobs))
assert fitted == predicted
Alternatively you could subclass the HDBSCAN class.
Since scikit-learn does not accept varargs for init methods of estimators, this gets verbose:
from hdbscan import HDBSCAN
from hdbscan.prediction import approximate_predict
from joblib import Memory
class HDBSCANWithPredict(HDBSCAN):
    def __init__(self,
                 min_cluster_size=5,
                 min_samples=None,
                 cluster_selection_epsilon=0.0,
                 max_cluster_size=0,
                 metric="euclidean",
                 alpha=1.0,
                 p=None,
                 algorithm="best",
                 leaf_size=40,
                 memory=Memory(None, verbose=0),
                 approx_min_span_tree=True,
                 gen_min_span_tree=False,
                 core_dist_n_jobs=4,
                 cluster_selection_method="eom",
                 allow_single_cluster=False,
                 prediction_data=True,  # changed from the reference implementation
                 match_reference_implementation=False,
                 **kwargs):
        super().__init__(min_cluster_size=min_cluster_size,
                         min_samples=min_samples,
                         cluster_selection_epsilon=cluster_selection_epsilon,
                         max_cluster_size=max_cluster_size,
                         metric=metric,
                         alpha=alpha,
                         p=p,
                         algorithm=algorithm,
                         leaf_size=leaf_size,
                         memory=memory,
                         approx_min_span_tree=approx_min_span_tree,
                         gen_min_span_tree=gen_min_span_tree,
                         core_dist_n_jobs=core_dist_n_jobs,
                         cluster_selection_method=cluster_selection_method,
                         allow_single_cluster=allow_single_cluster,
                         prediction_data=prediction_data,
                         match_reference_implementation=match_reference_implementation,
                         **kwargs)

    def predict(self, points_to_predict):
        # approximate_predict returns (labels, probabilities), keep only the labels
        return approximate_predict(self, points_to_predict=points_to_predict)[0]
Whether you choose the monkey-patching or the subclassing approach, you now have a predict method available.
Even though inference is now possible for new points, it’s best to keep monitoring the performance of this clusterer.
Out-of-cluster samples can be recognised by a label of -1.
A dead giveaway that your trained clusterer is no longer appropriate is when the fraction of out-of-cluster samples is:
- a lot higher than in the training set
- rising over time because of changing data
If the inference is no longer acceptable you should re-fit HDBSCAN.
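The monitoring described above could be sketched as follows. The function names and the max_increase threshold of 0.10 are assumptions for this example. What counts as “a lot higher” depends on your data.

```python
def noise_fraction(labels) -> float:
    """Fraction of samples that got the out-of-cluster label -1."""
    labels = list(labels)
    if not labels:
        return 0.0
    return sum(1 for label in labels if label == -1) / len(labels)


def should_refit(train_labels, new_labels, max_increase: float = 0.10) -> bool:
    """True if the noise fraction rose by more than `max_increase` since training."""
    increase = noise_fraction(new_labels) - noise_fraction(train_labels)
    return increase > max_increase
```

Here train_labels would be the output of fit_predict on the training data, and new_labels the output of predict on a recent batch of samples.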
02 Apr 2024
Streamlit has an API to add custom HTML to your document, but the added HTML will be nested in an iframe.
Since the policies on this iframe are quite relaxed, you can use window.parent
to escape to the parent document.
This can be useful if you want access to elements on the top-level DOM.
Using the following code, the HTML added will lift itself out of its containing iframe:
from streamlit.components.v1 import html
html_contents = """
<script id="extractorScript">
// Wait until the document is parsed, so the element after this script exists
document.addEventListener('DOMContentLoaded', () => {
  let currentScript = document.getElementById('extractorScript');
  // Move the element that follows this script to just after the iframe in the parent document
  window.parent.document.querySelector('iframe').insertAdjacentElement('afterend', currentScript.nextElementSibling);
  // window.parent.document.querySelector('iframe').remove();
});
</script>
<div>
<h1>Test contents</h1>
<p>The HTML contents that you want to move out of its iframe</p>
</div>
"""
html(html_contents)
Placing the script tag directly before your HTML element (a div in this case) allows it to find the content that needs to be lifted with the .nextElementSibling property.
If you’re planning to use this code snippet more than once, you will have multiple iframe elements.
In that case it’s wise to come up with a more specific CSS selector than 'iframe'.
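One way to avoid the selector problem altogether might be to let the script look up its own iframe through window.frameElement, a standard browser property. The sketch below only builds the HTML string. The helper name lifting_html and the default element id are made up for this example, and I’m assuming window.frameElement is accessible wherever window.parent.document is.

```python
def lifting_html(body: str, element_id: str = "liftedContent") -> str:
    """Wrap `body` in a div plus a script that moves the div next to its own iframe."""
    # The script comes after the div, so the div already exists when the script runs
    return f"""
<div id="{element_id}">{body}</div>
<script>
  // window.frameElement is the iframe that embeds this document
  const frame = window.frameElement;
  const content = document.getElementById("{element_id}");
  frame.insertAdjacentElement("afterend", content);
</script>
"""


# Example usage inside a Streamlit app:
# from streamlit.components.v1 import html
# html(lifting_html("<h1>Test contents</h1>"))
```

Because every iframe only moves its own content, the snippet can be used several times on one page without the selectors getting in each other’s way.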
27 Mar 2024
Update 2024-08-25: The code snippet and information below are now outdated if you’re using the latest version of Streamlit.
This can be fixed by checking bootstrap.py in the Streamlit source code and changing your if __name__ == '__main__': block accordingly.
Streamlit uses the Tornado web framework under the hood.
All traffic generated by Streamlit originates from Tornado handlers.
Streamlit doesn’t expose much of the Tornado API.
In this post I’ll show how you can use it anyway, to add custom handlers,
while still enjoying most of the conveniences provided by Streamlit.
The streamlit run startup is replaced by code that starts the Tornado server.
I build on top of my experiences in this previous post:
Adding “a main” to a streamlit dashboard
By subclassing Streamlit’s default Server class, we can modify the routes just before we start the Tornado application.
After running the typical setup (Server._create_app()), we add a new routing rule.
Since this rule is appended to the end, and the rule before it matches everything, we need to reverse the order in which the rules are checked.
First the newly added specific rule should be checked, and only after that the default Streamlit routes.
import asyncio
import streamlit.web.bootstrap
from streamlit import config
from streamlit.web.server import Server
from streamlit.web.server.media_file_handler import MediaFileHandler
from streamlit.web.server.server import start_listening
from streamlit.web.server.server_util import make_url_path_regex
streamlit.markdown("# Contents of the streamlit app go here as usual")
class CustomHandler(MediaFileHandler):
    def get_content(self, abspath, start=None, end=None):
        # Implement a custom handler here
        return b''


class CustomServer(Server):
    async def start(self):
        # Override the start of the Tornado server, so we can add custom handlers
        app = self._create_app()
        # Add a new handler
        app.default_router.add_rules([
            (
                make_url_path_regex(config.get_option("server.baseUrlPath"),
                                    "custom/(.*)"),
                CustomHandler,
                {"path": ""},
            ),
        ])
        # Our new rules go before the rule matching everything, reverse the list
        app.default_router.rules = list(reversed(app.default_router.rules))
        start_listening(app)
        await self._runtime.start()


if __name__ == '__main__':
    if '__streamlitmagic__' not in locals():
        # Code adapted from bootstrap.py in streamlit
        streamlit.web.bootstrap._fix_sys_path(__file__)
        streamlit.web.bootstrap._fix_tornado_crash()
        streamlit.web.bootstrap._fix_sys_argv(__file__, [])
        streamlit.web.bootstrap._fix_pydeck_mapbox_api_warning()
        streamlit.web.bootstrap._fix_pydantic_duplicate_validators_error()
        streamlit.web.bootstrap._install_pages_watcher(__file__)
        server = CustomServer(__file__, is_hello=False)

        async def run_server():
            await server.start()
            streamlit.web.bootstrap._on_server_start(server)
            streamlit.web.bootstrap._set_up_signal_handler(server)
            await server.stopped

        asyncio.run(run_server())
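Why the rule list needs to be reversed can be shown with a small stand-alone sketch. This is not Tornado’s actual routing code. It only assumes that the first matching rule wins, and the first_match function and the handler names are made up for this example.

```python
import re


def first_match(rules, path):
    """Return the handler of the first rule whose pattern matches the whole path."""
    for pattern, handler in rules:
        if re.fullmatch(pattern, path):
            return handler
    return None


rules = [
    (r"/(.*)", "catch_all"),      # stand-in for the rule that matches everything
    (r"/custom/(.*)", "custom"),  # stand-in for the newly appended rule
]

# With the appended rule last, the catch-all rule shadows it
shadowed = first_match(rules, "/custom/report")

# After reversing, the specific rule is checked first
reached = first_match(list(reversed(rules)), "/custom/report")
```

In this sketch shadowed ends up as "catch_all" and reached as "custom", which is the same effect the reversed rule list has in the server above.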
There’s also a way to replace the default Streamlit routes.
In a future post I’ll show how to do that, to prevent unauthorized access to the media assets served by your app.