Sneakily giving HDBSCAN a predict method

Most common implementations of the HDBSCAN clustering algorithm don’t have a predict method, and for a fundamental reason: introducing a new data point might change the clustering itself.

But depending on your use case, it can be valid to want or need a predict method anyway. For example, you could be using scikit-learn pipelines, which can only predict if their final step has a predict method. Or you might be sure that the samples you want to cluster look a lot like your training data. Maybe the samples you pass to the predict method are even drawn from the data you fitted the clusterer on.

For those use cases, the maintainers of the hdbscan package provide the approximate_predict utility function. We can expose its functionality as a predict method on the HDBSCAN class. Because not all implementations offer something like this function, I’ll assume here that the hdbscan package is being used.

The code below shows how to monkey patch a predict method on an instance of HDBSCAN:

from hdbscan import HDBSCAN
from hdbscan.prediction import approximate_predict
from sklearn.datasets import make_blobs

# Generate some sample data to cluster
blobs, _ = make_blobs(
    n_samples=750,
    centers=[[1, 1], [-1, -1], [1.5, -1.5]],
    cluster_std=[0.4, 0.1, 0.75],
    random_state=0,
)

# Instantiate HDBSCAN with prediction_data=True so approximate_predict will work
clusterer = HDBSCAN(prediction_data=True)

# approximate_predict returns (labels, membership strengths); keep only the labels
# and monkey patch that onto the instance as its predict method
clusterer.predict = lambda x: approximate_predict(clusterer, x)[0]

# Now the predict method is available
fitted = list(clusterer.fit_predict(blobs))
predicted = list(clusterer.predict(blobs))
assert fitted == predicted
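
One practical thing to keep in mind with the monkey-patching approach: a lambda stored as an instance attribute generally cannot be pickled, which matters if you want to persist the fitted clusterer. The sketch below illustrates this with a SimpleNamespace standing in for the clusterer (an assumption made to keep the example free of dependencies; it is not an hdbscan object), and the survives_pickling helper is a hypothetical name introduced for illustration only.

```python
import pickle
from types import SimpleNamespace


def survives_pickling(obj):
    """Return True if obj can be round-tripped through pickle."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except (pickle.PicklingError, AttributeError):
        return False


# Stand-ins for a fitted clusterer (assumption: NOT real hdbscan objects)
plain = SimpleNamespace(labels_=[0, 0, 1, -1])
patched = SimpleNamespace(labels_=[0, 0, 1, -1])

# Same pattern as above: a lambda stored as an instance attribute
patched.predict = lambda x: patched.labels_

print(survives_pickling(plain))    # True
print(survives_pickling(patched))  # False
```

If persistence matters, a predict method defined on a class does not run into this particular problem, because the method lives on the class rather than in the instance's attributes.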

Alternatively, you could subclass the HDBSCAN class. Since scikit-learn does not accept varargs in the init methods of estimators, this gets verbose:

from hdbscan import HDBSCAN
from hdbscan.prediction import approximate_predict
from joblib import Memory


class HDBSCANWithPredict(HDBSCAN):

    def __init__(self,
                 min_cluster_size=5,
                 min_samples=None,
                 cluster_selection_epsilon=0.0,
                 max_cluster_size=0,
                 metric="euclidean",
                 alpha=1.0,
                 p=None,
                 algorithm="best",
                 leaf_size=40,
                 memory=Memory(None, verbose=0),
                 approx_min_span_tree=True,
                 gen_min_span_tree=False,
                 core_dist_n_jobs=4,
                 cluster_selection_method="eom",
                 allow_single_cluster=False,
                 prediction_data=True,  # default changed to True; HDBSCAN itself defaults to False
                 match_reference_implementation=False,
                 **kwargs):
        super().__init__(min_cluster_size=min_cluster_size,
                         min_samples=min_samples,
                         cluster_selection_epsilon=cluster_selection_epsilon,
                         max_cluster_size=max_cluster_size,
                         metric=metric,
                         alpha=alpha,
                         p=p,
                         algorithm=algorithm,
                         leaf_size=leaf_size,
                         memory=memory,
                         approx_min_span_tree=approx_min_span_tree,
                         gen_min_span_tree=gen_min_span_tree,
                         core_dist_n_jobs=core_dist_n_jobs,
                         cluster_selection_method=cluster_selection_method,
                         allow_single_cluster=allow_single_cluster,
                         prediction_data=prediction_data,
                         match_reference_implementation=match_reference_implementation,
                         **kwargs)

    def predict(self, points_to_predict):
        return approximate_predict(self, points_to_predict=points_to_predict)[0]

Whether you choose the monkey-patching or the subclassing approach, you now have a predict method available.

Even though inference is now possible for new points, it’s best to keep monitoring the performance of this clusterer. Out-of-cluster samples can be recognised by a label of -1. A dead giveaway that your trained clusterer is no longer appropriate is when the fraction of out-of-cluster samples is:

If the inference is no longer acceptable, you should re-fit HDBSCAN.
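
One way to keep an eye on this is to track the fraction of -1 labels on incoming data and compare it with the fraction seen at fit time. The following is a minimal sketch under stated assumptions: the helper names noise_fraction and needs_refit, the tolerance of 0.10, and the comparison rule itself are illustrative choices, not something prescribed by hdbscan. The hand-written label lists stand in for the labels from fitting and from the predict method.

```python
def noise_fraction(labels):
    """Return the share of samples labelled -1 (out-of-cluster)."""
    labels = list(labels)
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(1 for label in labels if label == -1) / len(labels)


def needs_refit(new_labels, baseline_fraction, tolerance=0.10):
    """Flag a re-fit when the noise fraction on new data exceeds the
    baseline by more than `tolerance`.

    Both the tolerance and the rule are illustrative assumptions.
    """
    return noise_fraction(new_labels) > baseline_fraction + tolerance


# Hypothetical labels standing in for the fit-time and predicted labels
training_labels = [0, 0, 1, 1, 2, 2, 2, 0, 1, -1]       # 1 of 10 is noise
incoming_labels = [0, -1, -1, 1, -1, 2, -1, 0, -1, -1]  # 6 of 10 are noise

baseline = noise_fraction(training_labels)
print(needs_refit(training_labels, baseline))  # False
print(needs_refit(incoming_labels, baseline))  # True
```

What counts as too much noise depends on your data, so the tolerance is something to tune rather than a recommended value.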