[mlpack] Question about KMeans benchmark
Ryan Curtin
gth671b at mail.gatech.edu
Thu Jul 17 11:57:30 EDT 2014
On Thu, Jul 17, 2014 at 11:27:14AM -0400, Liu Liu wrote:
> On Thu, Jul 17, 2014 at 11:14 AM, Liu Liu <lliu at stern.nyu.edu> wrote:
> > Hi Ryan,
> >
> > I am interested in using KMeans in MLPACK for my research purpose. I have
> > several questions about the benchmark of Kmeans in your website.
Hello Liu,
I will do my best to answer.
> > 1) What are the datasets? How large (# of items, # of features)?
>
> Ah, I should have read the benchmark results more carefully. I find the
> dataset names now.
Ok, good to hear you got that figured out.
> > 2) Is the result based on a single run or multiple run? Matlab has a
> > parameter to run Kmeans multiple times and choose the best one as final
> > result.
> > 3) Do you use Bradley-Fayyad "refined start" when test KMeans for
> > benchmark?
The result is based on a single run, but each run is supposed to start
from the same initial cluster centroids. As a result, the algorithm is
deterministic and will proceed to the same results for every library.
I have successfully determined that the benchmarking system has support
for starting from the same initial centroids for each library, but I
have not yet confirmed if that support is in use. I'll respond further
when I have figured it out (or alternately, maybe Marcus, the system's
author, can give some insight).
> > 4) How do you select other parameters for each dataset? The result only
> > showed # of clusters.
Once the initial cluster centroids are selected, the only other
parameter is the number of clusters. These are selected based on
properties of the dataset.
> > Regarding how to select a good initial start, you mentioned in the website
> > that there are multiple strategies for choosing initial points effectively
> > and MLPACK implements some of these, notably the Bradley-Fayyad algorithm.
> > Have you tried other initialization methods, e.g., KMeans++
> > <http://en.wikipedia.org/wiki/K-means%2B%2B> or XMeans
> > <http://www.cs.cmu.edu/~dpelleg/download/xmeans.pdf>, or compared their
> > performance?
Both of those algorithms are good choices, but mlpack has neither
implemented. Thanks to the niceness of C++ templates, it is easy to
implement any other initialization strategy. See this section in the
tutorial for more information:
http://www.mlpack.org/doxygen.php?doc=kmtutorial.html#kmeans_initial_partition_kmtut
Once a new partitioning policy is implemented, one simply needs to make
a new KMeans object using the relevant partitioning policy:
KMeans<MetricType, YourNewPartitioningPolicy> kmeans;
kmeans.Cluster(...)
and it will use YourNewPartitioningPolicy to make the initial
assignments.
> > btw, I real like the project, the coding style and the nice documentation.
> > Thank you for making it available to us!!
Thanks! Feel free to contribute to the project -- anyone is welcome to
join the community.
--
Ryan Curtin | "Are you or are you not the Black Angel of Death?"
ryan at ratml.org | - Steve
More information about the mlpack
mailing list