# Sneakily giving HDBSCAN a predict method

27 May 2024

Most common implementations of the HDBSCAN clustering algorithm don't have a `predict` method. There are some fundamental reasons for this, which mostly boil down to the following: introducing a new data point might change the clustering.

But depending on your use case, it can be valid to want or need a `predict` method anyway. For example, you could be using scikit-learn pipelines, which expect a `predict` method on the last step. Or you might be sure that the samples you want to cluster look a lot like your training data. Maybe the samples you offer to the `predict` method are even drawn from the data you fitted the clusterer on.

For all those use cases, the people maintaining the `hdbscan` package have invented the `approximate_predict` utility. We could move the functionality of this utility into the `predict` method of the `HDBSCAN` class. Because not all implementations have something like this, I'll assume here that the `hdbscan` package is being used.

The code below shows how to monkey patch a `predict` method onto an instance of `HDBSCAN`:

```python
from hdbscan import HDBSCAN
from hdbscan.prediction import approximate_predict
from sklearn.datasets import make_blobs

# Generate some sample data to cluster
blobs, _ = make_blobs(
    n_samples=750,
    centers=[[1, 1], [-1, -1], [1.5, -1.5]],
    cluster_std=[0.4, 0.1, 0.75],
    random_state=0,
)

# Instantiate HDBSCAN with prediction_data=True so approximate_predict will work
clusterer = HDBSCAN(prediction_data=True)

# Monkey patch the approximate_predict method as predict method on the instance
clusterer.predict = lambda x: approximate_predict(clusterer, x)[0]

# Now the predict method is available
fitted = list(clusterer.fit_predict(blobs))
predicted = list(clusterer.predict(blobs))
assert fitted == predicted
```

Alternatively, you could subclass the `HDBSCAN` class. Since scikit-learn does not accept varargs in the `__init__` methods of estimators, this gets verbose:

```python
from hdbscan import HDBSCAN
from hdbscan.prediction import approximate_predict
from joblib import Memory


class HDBSCANWithPredict(HDBSCAN):

    def __init__(self,
                 min_cluster_size=5,
                 min_samples=None,
                 cluster_selection_epsilon=0.0,
                 max_cluster_size=0,
                 metric="euclidean",
                 alpha=1.0,
                 p=None,
                 algorithm="best",
                 leaf_size=40,
                 memory=Memory(None, verbose=0),
                 approx_min_span_tree=True,
                 gen_min_span_tree=False,
                 core_dist_n_jobs=4,
                 cluster_selection_method="eom",
                 allow_single_cluster=False,
                 prediction_data=True,  # changed from the reference implementation
                 match_reference_implementation=False,
                 **kwargs):
        super().__init__(min_cluster_size=min_cluster_size,
                         min_samples=min_samples,
                         cluster_selection_epsilon=cluster_selection_epsilon,
                         max_cluster_size=max_cluster_size,
                         metric=metric,
                         alpha=alpha,
                         p=p,
                         algorithm=algorithm,
                         leaf_size=leaf_size,
                         memory=memory,
                         approx_min_span_tree=approx_min_span_tree,
                         gen_min_span_tree=gen_min_span_tree,
                         core_dist_n_jobs=core_dist_n_jobs,
                         cluster_selection_method=cluster_selection_method,
                         allow_single_cluster=allow_single_cluster,
                         prediction_data=prediction_data,
                         match_reference_implementation=match_reference_implementation,
                         **kwargs)

    def predict(self, points_to_predict):
        return approximate_predict(self, points_to_predict=points_to_predict)[0]
```

Whether you choose the monkey-patching or the subclassing approach, you now have a `predict` method available.

Even though inference is now possible for new points, it's best to keep monitoring the performance of this clusterer. Out-of-cluster samples can be recognised by a label of `-1`. A dead giveaway that your trained clusterer is no longer appropriate is when the fraction of out-of-cluster samples is:

- a lot higher than in the training set
- rising over time because of changing data
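The first check above can be sketched with plain NumPy; the label arrays and the doubling threshold here are arbitrary examples:

```python
import numpy as np


def out_of_cluster_fraction(labels):
    """Fraction of samples HDBSCAN labelled as noise (-1)."""
    labels = np.asarray(labels)
    return float(np.mean(labels == -1))


# Hypothetical label arrays for illustration
training_labels = np.array([0, 0, 1, 1, 2, 2, -1, -1])    # 25% noise
incoming_labels = np.array([0, -1, -1, 1, -1, -1, 2, -1])  # 62.5% noise

# Example threshold: flag the clusterer when the noise fraction doubles
needs_refit = (out_of_cluster_fraction(incoming_labels)
               > 2 * out_of_cluster_fraction(training_labels))
```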

If the inference is no longer acceptable, you should re-fit HDBSCAN.