Sneakily giving HDBSCAN a predict method

27 May 2024

Most common implementations of the HDBSCAN clustering algorithm don’t have a predict method.
There are some fundamental reasons that many implementations don’t have it.
These reasons mostly boil down to the following: introducing a new data point might change the clustering.
But depending on your use case, it can be valid to want or need a predict method anyway.
For example, you could be using scikit-learn pipelines, which expect a predict method on the final estimator.
Or you might be sure that the samples you want to cluster look a lot like your training data.
Maybe the samples you offer to the predict method are even drawn from the data you fitted the clusterer on.
For all those use cases, the people maintaining the hdbscan package provide the approximate_predict utility.
We could move the functionality of this utility into the predict method of the HDBSCAN class.
Because not all implementations have something like approximate_predict, I’ll assume here that the hdbscan package is being used.
The code below shows how to monkey patch a predict method onto an instance of HDBSCAN:
```python
from hdbscan import HDBSCAN
from hdbscan.prediction import approximate_predict
from sklearn.datasets import make_blobs

# Generate some sample data to cluster
blobs, _ = make_blobs(
    n_samples=750,
    centers=[[1, 1], [-1, -1], [1.5, -1.5]],
    cluster_std=[0.4, 0.1, 0.75],
    random_state=0,
)

# Instantiate HDBSCAN with prediction_data=True so approximate_predict will work
clusterer = HDBSCAN(prediction_data=True)

# Monkey patch the approximate_predict method as predict method on the instance
clusterer.predict = lambda x: approximate_predict(clusterer, x)[0]

# Now the predict method is available
fitted = list(clusterer.fit_predict(blobs))
predicted = list(clusterer.predict(blobs))
assert fitted == predicted
```
Alternatively, you could subclass the HDBSCAN class.
Since scikit-learn does not accept variadic arguments in the __init__ methods of estimators, this gets verbose:
```python
from hdbscan import HDBSCAN
from hdbscan.prediction import approximate_predict
from joblib import Memory


class HDBSCANWithPredict(HDBSCAN):

    def __init__(self,
                 min_cluster_size=5,
                 min_samples=None,
                 cluster_selection_epsilon=0.0,
                 max_cluster_size=0,
                 metric="euclidean",
                 alpha=1.0,
                 p=None,
                 algorithm="best",
                 leaf_size=40,
                 memory=Memory(None, verbose=0),
                 approx_min_span_tree=True,
                 gen_min_span_tree=False,
                 core_dist_n_jobs=4,
                 cluster_selection_method="eom",
                 allow_single_cluster=False,
                 prediction_data=True,  # changed from the reference implementation
                 match_reference_implementation=False,
                 **kwargs):
        super().__init__(min_cluster_size=min_cluster_size,
                         min_samples=min_samples,
                         cluster_selection_epsilon=cluster_selection_epsilon,
                         max_cluster_size=max_cluster_size,
                         metric=metric,
                         alpha=alpha,
                         p=p,
                         algorithm=algorithm,
                         leaf_size=leaf_size,
                         memory=memory,
                         approx_min_span_tree=approx_min_span_tree,
                         gen_min_span_tree=gen_min_span_tree,
                         core_dist_n_jobs=core_dist_n_jobs,
                         cluster_selection_method=cluster_selection_method,
                         allow_single_cluster=allow_single_cluster,
                         prediction_data=prediction_data,
                         match_reference_implementation=match_reference_implementation,
                         **kwargs)

    def predict(self, points_to_predict):
        return approximate_predict(self, points_to_predict=points_to_predict)[0]
```
Whether you choose the monkey-patching or the subclassing approach, you now have a predict method available.
Even though inference is now possible for new points, it’s best to keep monitoring the performance of this clusterer.
Out-of-cluster samples can be recognised by a label of -1.
A dead giveaway that your trained clusterer is no longer appropriate is when the fraction of out-of-cluster samples is:
- a lot higher than in the training set
- rising over time because of changing data
If the inference quality is no longer acceptable, you should re-fit HDBSCAN.
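Such a check can be automated. The sketch below uses made-up label arrays standing in for real clusterer output, and an arbitrary threshold of three times the baseline; both are assumptions for illustration:

```python
import numpy as np


def noise_fraction(labels):
    """Fraction of samples labelled -1 (out-of-cluster) by HDBSCAN."""
    labels = np.asarray(labels)
    return float((labels == -1).mean())


# Hypothetical label arrays standing in for clusterer output
train_labels = np.array([0, 0, 1, 1, 1, -1, 2, 2, 0, 1])
new_labels = np.array([0, -1, -1, 1, -1, -1, 2, -1, 0, -1])

baseline = noise_fraction(train_labels)  # 0.1
current = noise_fraction(new_labels)     # 0.6

# A simple drift check: flag when the out-of-cluster fraction at inference
# time greatly exceeds the training baseline
needs_refit = current > 3 * baseline
```

In a production setting you would track the noise fraction over time rather than at a single point, so that a gradual rise due to changing data also gets caught.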