[mlpack-git] [mlpack] Mean shift clustering (#388)
Shangtong Zhang
notifications at github.com
Sat Apr 11 12:16:34 EDT 2015
I now use a range searcher, and I added a function that generates seeds from the data set to serve as initial centroids, like scikit does. Before this I used all the points as initial centroids. The generated seeds are far fewer than the points.
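To make the seeding idea concrete, here is a minimal sketch of grid-based seeding in plain Python. It is an illustration only, not the code from this pull request: the function name `bin_seeds` and its parameters are made up for the example, and using the bandwidth as the grid width is an assumption modelled on what scikit's `bin_seeding=True` option does.

```python
from collections import Counter


def bin_seeds(points, bin_size, min_bin_freq=1):
    """Snap every point to a grid of width bin_size and return one seed
    per grid cell that holds at least min_bin_freq points.

    Hypothetical helper for illustration; not part of mlpack or scikit.
    """
    # Count how many points fall into each grid cell.
    counts = Counter(
        tuple(round(x / bin_size) for x in p) for p in points
    )
    # One seed per sufficiently populated cell, mapped back to data coordinates.
    return [
        tuple(c * bin_size for c in cell)
        for cell, n in counts.items()
        if n >= min_bin_freq
    ]


points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (5.0, 5.1), (5.1, 4.9)]
seeds = bin_seeds(points, bin_size=1.0)
print(len(points), "points ->", len(seeds), "seeds")
```

Each seed then starts one mean shift iteration, so the number of iterations to run scales with the number of occupied grid cells rather than with the number of points.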
I used the following test script for scikit:
```
import numpy
import time
from sklearn.cluster import MeanShift, estimate_bandwidth
d = numpy.genfromtxt('iris.csv', delimiter=',')
bw = estimate_bandwidth(d, quantile=0.2, n_samples=500)
print(bw)
ms = MeanShift(bandwidth=bw, bin_seeding=True)
t1 = time.time()
ms.fit(d)
t2 = time.time()
print(t2 - t1)
print(ms.cluster_centers_)
print(len(numpy.unique(ms.labels_)))
```
The fit takes 0.038s. The full output (bandwidth, fit time in seconds, cluster centers, number of clusters) is:
0.912643298082
0.0381479263306
[[ 6.28301887 2.88679245 4.90754717 1.7 ]
[ 4.97391304 3.39130435 1.47391304 0.24130435]]
2
Then I call the MS program with
-v -i iris.csv -o assignments.csv -C centroids.csv -r 0.912643298082
and I get
clustering: 0.048295s
The MS program is much faster than before.
But my implementation is still slower than scikit, because the update step in my algorithm does more work:
```
distances[0][j] /= radius;
double weight = kernel.Gradient(distances[0][j]) / distances[0][j];
sumWeight += weight;
newCentroid += weight * data.unsafe_col(neighbors[0][j]);
```
I use this update in each iteration, so it works with a user-defined kernel.
But scikit just uses the average of the points within the bandwidth:
```
my_mean = np.mean(points_within, axis=0)
```
I tried the same simple averaging that scikit uses, and then MLPACK and scikit take the same time.
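The difference between the two update rules can be sketched in plain Python as below. This is a rough illustration under stated assumptions, not the mlpack or scikit code: the function names are invented for the example, and `gaussian_weight` assumes a Gaussian kernel, for which the gradient divided by the distance is proportional to exp(-d^2 / 2). Passing a different `weight_fn` stands in for a user-defined kernel.

```python
import math


def flat_update(neighbors):
    """Unweighted mean of the neighbors (the simple averaging step)."""
    dim = len(neighbors[0])
    return [sum(p[k] for p in neighbors) / len(neighbors) for k in range(dim)]


def gaussian_weight(d):
    # Assumed kernel: Gaussian, so gradient / distance is proportional
    # to exp(-d^2 / 2). The constant factor cancels in the normalization.
    return math.exp(-0.5 * d * d)


def weighted_update(centroid, neighbors, radius, weight_fn=gaussian_weight):
    """Kernel-weighted mean: one weight evaluation per neighbor."""
    dim = len(centroid)
    sum_weight = 0.0
    new_centroid = [0.0] * dim
    for p in neighbors:
        # Scale the distance by the radius before evaluating the kernel.
        d = math.sqrt(sum((c - x) ** 2 for c, x in zip(centroid, p))) / radius
        w = weight_fn(d)
        sum_weight += w
        for k in range(dim):
            new_centroid[k] += w * p[k]
    return [v / sum_weight for v in new_centroid]


centroid = [0.0, 0.0]
neighbors = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.0)]
flat = flat_update(neighbors)
weighted = weighted_update(centroid, neighbors, radius=1.0)
print(flat, weighted)
```

With a constant weight function the weighted update reduces to the flat mean, so the extra cost comes from the per-neighbor distance scaling and kernel evaluation.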
---
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/388#issuecomment-91872693