Hopkins test for cluster tendency

While working on a recent project I ran into the question of whether or not a given data set contains clusters. There are many ideas on how to test this, but for this project we decided to use the Hopkins statistic. Interestingly enough, there are not many implementations of it around, so in the end I had to develop my own version (the original only worked with data values between 0 and 1).

The idea is simple: the statistic compares the nearest-neighbour distances of points drawn uniformly from the data's bounding box with the nearest-neighbour distances of points sampled from the data itself. The closer the resulting value is to 1, the higher the likelihood of clusters.
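
For reference, the quantity the code below computes is (to my understanding, the plain-distance variant of the statistic; some formulations raise the distances to the power d):

H = \frac{\sum_{j=1}^{m} u_j}{\sum_{j=1}^{m} u_j + \sum_{j=1}^{m} w_j}

where u_j is the distance from the j-th uniformly generated point to its nearest neighbour in the data, and w_j is the distance from the j-th randomly chosen data point to its nearest neighbour among the other data points.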

from sklearn.neighbors import NearestNeighbors
from random import sample
from numpy.random import uniform
import numpy as np
from math import isnan

def hopkins(X):
    # X is expected to be a pandas DataFrame with n rows and d columns
    d = X.shape[1] # dimensions
    n = len(X) # rows
    m = int(0.1 * n) # sample size, heuristic from article [1]
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)

    # indices of m randomly chosen data points
    rand_X = sample(range(n), m)

    ujd = [] # nearest-neighbour distances of the uniform points
    wjd = [] # nearest-neighbour distances of the sampled data points
    for j in range(m):
        # a uniform point from the bounding box of X is not itself in X,
        # so its nearest neighbour is the first hit
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X, axis=0), np.amax(X, axis=0), d).reshape(1, -1), 1, return_distance=True)
        ujd.append(u_dist[0][0])
        # a sampled data point is its own nearest neighbour (distance 0),
        # so take the second hit
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])

    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0

    return H
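
As a quick sanity check (my own example, not from the original article), clustered data should score well above 0.5 while uniform noise should land near 0.5:

import pandas as pd
from sklearn.datasets import make_blobs

# three well-separated blobs in 3D: strong cluster tendency expected
blobs, _ = make_blobs(n_samples=500, centers=3, n_features=3, random_state=0)
print(hopkins(pd.DataFrame(blobs)))  # typically close to 1

# uniform noise over a 3D box: no cluster tendency expected
noise = uniform(0, 10, (500, 3))
print(hopkins(pd.DataFrame(noise)))  # typically around 0.5

Note that the function expects a pandas DataFrame, since it uses X.values and X.iloc internally.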

The original implementation can be found here, but as I said it only works in one dimension and with values in the range 0 to 1. My version works in any number of dimensions (tested in 3D) and with any range of values.

Comments

  1. Robert says:

    Nice blog. I am the original author. I made a repo here: https://github.com/romusters/hopkins/tree/master
