[mlpack-git] [mlpack] Mean shift clustering (#388)

Ryan Curtin notifications at github.com
Thu Apr 23 10:33:41 EDT 2015


Are you sure you compiled mlpack without debugging symbols?  Here is what I get when compiling mlpack with `-DDEBUG=OFF` and `-DPROFILE=OFF`.  I used this test program for scikit:

```
#!/usr/bin/python
import numpy
from sklearn.cluster import MeanShift
from sklearn.cluster import estimate_bandwidth
import time

d = numpy.genfromtxt('/home/ryan/datasets/corel.csv', delimiter=',')
bw = estimate_bandwidth(d, quantile=0.2, n_samples=500)

print(bw)

ms = MeanShift(bandwidth=bw, bin_seeding=True)
t1 = time.time()
ms.fit(d)
t2 = time.time()

print(t2 - t1)

print(len(numpy.unique(ms.labels_)))
```

This gave me the following output:

```
0.430335887828
7.03606009483
1
```

So: a bandwidth of 0.430336, it took 7.036 seconds, and we got 1 cluster as a result.  Then I ran your mlpack implementation:

```
$ mean_shift -i ~/datasets/corel.csv -r 0.430335887828 -v -C centers.csv
[INFO ] Loading '/home/ryan/datasets/corel.csv' as CSV data.  Size is 32 x 37749.
[INFO ] Performing mean shift clustering...
[INFO ] 46511 node combinations were scored.
[INFO ] 37749 base cases were calculated.
[INFO ] Found 1 centroids.
[WARN ] No extension given with filename ''; type unknown.  Save failed.
[INFO ] Saving CSV data to 'centers.csv'.
[INFO ] 
[INFO ] Execution parameters:
[INFO ]   bandwidth: (Unknown data type - )
[INFO ]   centroid_file: centers.csv
[INFO ]   help: false
[INFO ]   in_place: false
[INFO ]   info: ""
[INFO ]   inputFile: /home/ryan/datasets/corel.csv
[INFO ]   max_iterations: 1000
[INFO ]   output_file: ""
[INFO ]   radius: 0.430336
[INFO ]   verbose: true
[INFO ]   version: false
[INFO ] 
[INFO ] Program timers:
[INFO ]   clustering: 3.681358s
[INFO ]   computing_neighbors: 0.009845s
[INFO ]   loading_data: 0.459559s
[INFO ]   range_search/computing_neighbors: 2.075936s
[INFO ]   range_search/tree_building: 0.440392s
[INFO ]   saving_data: 0.000118s
[INFO ]   total_time: 4.143638s
[INFO ]   tree_building: 0.487500s
```

So the mlpack implementation appears to be roughly twice as fast as the scikit implementation.  (I'm using Python 2.7.9 with Debian's `python-sklearn` 0.15.2-3 package.)  I wouldn't be surprised if newer versions of scikit are faster, but either way, the timings I'm getting are drastically different from yours, so maybe there is a configuration issue on your end?
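If it is a build configuration problem, it's worth double-checking that mlpack was built in release mode.  A minimal sketch of such a build, using the `-DDEBUG=OFF` and `-DPROFILE=OFF` flags mentioned above (the source/build paths and make target are illustrative, not prescriptive):

```shell
# Sketch of a release-mode mlpack build.  The -DDEBUG/-DPROFILE flags are the
# ones mentioned earlier in this thread; paths are illustrative.
cd mlpack
mkdir -p build && cd build
cmake -DDEBUG=OFF -DPROFILE=OFF ..
make mean_shift    # target name assumed from the binary name above; `make` builds everything
```

A debug build (`-DDEBUG=ON`) disables optimization and enables assertions, which can easily account for a several-fold slowdown in benchmarks like this.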

With the covertype dataset and a bandwidth of 1524.6535, scikit takes 157.3325s while mlpack takes 42.159s.
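For reference, the speedups implied by the timings quoted above (scikit's `fit()` wall-clock time vs. mlpack's `clustering` timer on corel, and the covertype totals just given) work out as follows; the numbers are copied from this thread, not re-measured:

```python
# Speedup ratios from the timings reported in this thread (not re-measured).
scikit_corel = 7.03606009483      # scikit fit() time on corel, seconds
mlpack_corel = 3.681358           # mlpack "clustering" timer on corel, seconds
scikit_covertype = 157.3325      # scikit time on covertype, seconds
mlpack_covertype = 42.159         # mlpack time on covertype, seconds

print(round(scikit_corel / mlpack_corel, 2))          # roughly 1.9x on corel
print(round(scikit_covertype / mlpack_covertype, 2))  # roughly 3.7x on covertype
```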

---
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/388#issuecomment-95606901