[mlpack-git] [mlpack] Mean shift clustering (#388)

Shangtong Zhang notifications at github.com
Sat Apr 11 12:16:34 EDT 2015


I applied the range searcher and added a function that generates seeds from the data set to use as initial centroids, like scikit does. Before, I used all the points as initial centroids; the generated seeds are far fewer than that.
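For reference, here is a minimal sketch of that grid-based seeding idea, in the spirit of scikit-learn's get_bin_seeds(); the function name GenerateSeeds, its signature, and the minBinFreq parameter are my assumptions for illustration, not the actual PR code:
```
#include <armadillo>
#include <cmath>
#include <map>
#include <vector>

// Snap every point (a matrix column, as in mlpack) to the nearest multiple
// of binSize; every grid cell holding at least minBinFreq points yields one
// seed placed at that cell's grid position.
arma::mat GenerateSeeds(const arma::mat& data,
                        const double binSize,
                        const size_t minBinFreq = 1)
{
  // Count how many points fall into each grid cell.
  std::map<std::vector<long>, size_t> bins;
  for (size_t i = 0; i < data.n_cols; ++i)
  {
    std::vector<long> bin(data.n_rows);
    for (size_t d = 0; d < data.n_rows; ++d)
      bin[d] = std::lround(data(d, i) / binSize);
    ++bins[bin];
  }

  // Keep the sufficiently populated cells as seeds.
  std::vector<std::vector<long>> kept;
  for (const auto& b : bins)
    if (b.second >= minBinFreq)
      kept.push_back(b.first);

  arma::mat seeds(data.n_rows, kept.size());
  for (size_t j = 0; j < kept.size(); ++j)
    for (size_t d = 0; d < data.n_rows; ++d)
      seeds(d, j) = kept[j][d] * binSize;
  return seeds;
}
```
Fewer seeds means fewer range searches and mean shift iterations, which is where the speedup comes from.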
I used the following test script for scikit:
```
import numpy
import time
from sklearn.cluster import MeanShift, estimate_bandwidth

d = numpy.genfromtxt('iris.csv', delimiter=',')
bw = estimate_bandwidth(d, quantile=0.2, n_samples=500)

print(bw)

ms = MeanShift(bandwidth=bw, bin_seeding=True)
t1 = time.time()
ms.fit(d)
t2 = time.time()
print(t2 - t1)

print(ms.cluster_centers_)
print(len(numpy.unique(ms.labels_)))
```
The clustering took 0.038 s; the script printed:
0.912643298082
0.0381479263306
[[ 6.28301887  2.88679245  4.90754717  1.7       ]
 [ 4.97391304  3.39130435  1.47391304  0.24130435]]
2
Then I call the mean shift program with
-v -i iris.csv -o assignments.csv -C centroids.csv -r 0.912643298082
and get
clustering: 0.048295s
The mean shift program is much faster than before, but my implementation is still slower than scikit because of the algorithm used in the iteration step.
```
          distances[0][j] /= radius;
          double weight = kernel.Gradient(distances[0][j]) / distances[0][j];
          sumWeight += weight;
          newCentroid += weight * data.unsafe_col(neighbors[0][j]);
```
I use this to iterate, and it can use a user-defined kernel; a rough sketch of the surrounding loop is below.
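This is my reading of how the snippet fits into the per-seed update (the loop frame, the variable declarations, and the final division by sumWeight are assumptions, not verbatim PR code):
```
arma::colvec newCentroid(data.n_rows, arma::fill::zeros);
double sumWeight = 0;

// Accumulate a kernel-weighted sum over the neighbors that the range
// search returned within the bandwidth (radius).
for (size_t j = 0; j < neighbors[0].size(); ++j)
{
  distances[0][j] /= radius;  // normalize the distance by the bandwidth
  const double weight = kernel.Gradient(distances[0][j]) / distances[0][j];
  sumWeight += weight;
  newCentroid += weight * data.unsafe_col(neighbors[0][j]);
}

// The seed moves to the kernel-weighted mean of its in-range neighbors.
newCentroid /= sumWeight;
```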
But scikit just uses the plain average:
```
my_mean = np.mean(points_within, axis=0)
```
I tried the same simple algorithm as scikit, and then mlpack and scikit take about the same time.
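In mlpack terms that flat-kernel update would look roughly like this (a sketch using the same assumed variable names as above, not the actual patch):
```
arma::colvec newCentroid(data.n_rows, arma::fill::zeros);

// Plain average of the in-range neighbors, equivalent to scikit's
// np.mean(points_within, axis=0).
for (size_t j = 0; j < neighbors[0].size(); ++j)
  newCentroid += data.unsafe_col(neighbors[0][j]);
newCentroid /= (double) neighbors[0].size();
```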

---
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/388#issuecomment-91872693