# [mlpack-svn] r14957 - mlpack/trunk/doc/tutorials/kmeans

fastlab-svn at coffeetalk-1.cc.gatech.edu fastlab-svn at coffeetalk-1.cc.gatech.edu
Thu Apr 25 01:26:04 EDT 2013

Author: rcurtin
Date: 2013-04-25 01:26:03 -0400 (Thu, 25 Apr 2013)
New Revision: 14957

mlpack/trunk/doc/tutorials/kmeans/kmeans.txt
Log:
Wow, I spent a lot of time writing that.  It's mostly checked for errors.

===================================================================
--- mlpack/trunk/doc/tutorials/kmeans/kmeans.txt	                        (rev 0)
+++ mlpack/trunk/doc/tutorials/kmeans/kmeans.txt	2013-04-25 05:26:03 UTC (rev 14957)
@@ -0,0 +1,596 @@
+/*!
+
+ at file kmeans.txt
+ at author Ryan Curtin
+ at brief Tutorial for how to use k-means in mlpack.
+
+ at page kmtutorial K-Means tutorial (kmeans)
+
+ at section intro_kmtut Introduction
+
+The popular k-means algorithm for clustering has been around since the late
+1950s, and the standard algorithm was proposed by Stuart Lloyd in 1957.  Given a
+set of points \f$X \f$, k-means clustering aims to partition each point \f$x_i +\f$ into a cluster \f$c_j \f$ (where \f$j \le k \f$ and \f$k \f$, the number
+of clusters, is a parameter).  The partitioning is done to minimize the
+objective function
+
+\f[
+\sum_{j = 1}^{k} \sum_{x_i \in c_j} \| x_i - \mu_j \|^2
+\f]
+
+where \f$\mu_j\f$ is the centroid of cluster \f$c_j\f$.  The standard algorithm
+is a two-step algorithm:
+
+ - \b Assignment \b step.  Each point \f$x_i\f$ in \f$X\f$ is assigned to the
+   cluster whose centroid it is closest to.
+
+ - \b Update \b step.  Using the new cluster assignments, the centroids of each
+   cluster are recalculated.
+
+The algorithm has converged when no more assignment changes are happening with
+each iteration.  However, this algorithm can get stuck in local minima of the
+objective function and is particularly sensitive to the initial cluster
+assignments.  Also, situations can arise where the algorithm will never converge
+but reaches steady state -- for instance, one point may be changing between two
+cluster assignments.
+
+There is vast literature on the k-means algorithm and its uses, as well as
+strategies for choosing initial points effectively and keeping the algorithm
+from converging in local minima.  \b mlpack does implement some of these,
+refined initial points.  Importantly, the C++ \c KMeans class makes it very easy
+to improve the k-means algorithm in a modular way.
+
+ at code
+  title={Refining initial points for k-means clustering},
+  booktitle={Proceedings of the Fifteenth International Conference on Machine
+      Learning (ICML 1998)},
+  volume={66},
+  year={1998}
+}
+ at endcode
+
+\b mlpack provides:
+
+ - a \ref cli_kmtut "simple command-line executable" to run k-means
+ - a \ref kmeans_kmtut "simple C++ interface" to run k-means
+ - a \ref kmeans_template_kmtut "generic, extensible, and powerful C++ class"
+   for complex usage
+
+
+A list of all the sections this tutorial contains.
+
+ - \ref intro_kmtut
+ - \ref toc_kmtut
+ - \ref cli_kmtut
+   - \ref cli_ex1_kmtut
+   - \ref cli_ex2_kmtut
+   - \ref cli_ex3_kmtut
+   - \ref cli_ex4_kmtut
+   - \ref cli_ex5_kmtut
+   - \ref cli_ex6_kmtut
+ - \ref kmeans_kmtut
+   - \ref kmeans_ex1_kmtut
+   - \ref kmeans_ex2_kmtut
+   - \ref kmeans_ex3_kmtut
+   - \ref kmeans_ex4_kmtut
+   - \ref kmeans_ex5_kmtut
+   - \ref kmeans_ex6_kmtut
+   - \ref kmeans_ex7_kmtut
+ - \ref kmeans_template_kmtut
+   - \ref kmeans_metric_kmtut
+   - \ref kmeans_initial_partition_kmtut
+   - \ref kmeans_empty_cluster_kmtut
+ - \ref further_doc_kmtut
+
+ at section cli_kmtut Command-Line 'kmeans'
+
+\b mlpack provides a command-line executable, \c kmeans, to allow easy execution
+of the k-means algorithm on data.  Complete documentation of the executable can
+be found by typing
+
+ at code
+$kmeans --help + at endcode + +Below are several examples demonstrating simple use of the \c kmeans executable. + + at subsection cli_ex1_kmtut Simple k-means clustering + +We want to find 5 clusters using the points in the file dataset.csv. By +default, if any of the clusters end up empty, that cluster will be reinitialized +to contain the point furthest from the cluster with maximum variance. The +cluster assignments of each point will be stored in assignments.csv. Each row +in assignments.csv will correspond to the row in dataset.csv. + + at code +$ kmeans -c 5 -i dataset.csv -v -o assignments.csv
+ at endcode
+
+ at subsection cli_ex2_kmtut Saving the resulting centroids
+
+Sometimes it is useful to save the centroids of the clusters found by k-means;
+one example might be for plotting the points.  The \c -C (\c --centroid_file)
+option allows specification of a file into which the centroids will be saved
+(one centroid per line, if it is a CSV or other text format).
+
+ at code
+$kmeans -c 5 -i dataset.csv -v -o assignments.csv -C centroids.csv + at endcode + + at subsection cli_ex3_kmtut Allowing empty clusters + +If you would like to allow empty clusters to exist, instead of reinitializing +them, simply specify the \c -e (\c --allow_empty_clusters) option. Note that +when you save your clusters, some of the clusters may be filled with NaNs. This +is expected behavior -- if a cluster has no points, the concept of a centroid +makes no sense. + + at code +$ kmeans -c 5 -i dataset.csv -v -o assignments.csv -C centroids.csv
+ at endcode
+
+ at subsection cli_ex4_kmtut Limiting the maximum number of iterations
+
+As mentioned earlier, the k-means algorithm can often fail to converge.  In such
+a situation, it may be useful to stop the algorithm by way of limiting the
+maximum number of iterations.  This can be done with the \c -m (\c
+--max_iterations) parameter, which is set to 1000 by default.  If the maximum
+number of iterations is 0, the algorithm will run until convergence -- or
+potentially forever.  The example below sets a maximum of 250 iterations.
+
+ at code
+$kmeans -c 5 -i dataset.csv -v -o assignments.csv -m 250 + at endcode + + at subsection cli_ex5_kmtut Setting the overclustering factor + +The \b mlpack k-means implementation allows "overclustering", which is when the +k-means algorithm is run with more than the requested number of clusters. Upon +convergence, the clusters with the nearest centroids are merged until only the +requested number of centroids remain. This can provide better clustering +results. The overclustering factor, specified with \c -O or \c +--overclustering, determines how many more clusters are found than were +requested. For instance, with \c k set to 5 and an overclustering factor of 2, +10 clusters will be found. Note that the overclustering factor does not need to +be an integer. + +The following code snippet finds 5 clusters, but with an overclustering factor +of 2.4 (so 12 clusters are found and then merged together to produce 5 final +clusters). + + at code +$ kmeans -c 5 -O 2.4 -i dataset.csv -v -o assignments.csv
+ at endcode
+
+
+The method proposed by Bradley and Fayyad in their paper "Refining initial
+points for k-means clustering" is implemented in \b mlpack.  This strategy
+samples points from the dataset and runs k-means clustering on those points
+multiple times, saving the resulting clusters.  Then, k-means clustering is run
+on those clusters, yielding the original number of clusters.  The centroids of
+those resulting clusters are used as initial centroids for k-means clustering on
+the entire dataset.
+
+This technique generally gives better initial points than the default random
+partitioning, but depending on the parameters, it can take much longer.  This
+initialization technique is enabled with the \c -r (\c --refined_start) option.
+The \c -S (\c --samplings) parameter controls how many samplings of the dataset
+are performed, and the \c -p (\c --percentage) parameter controls how much of
+the dataset is randomly sampled for each sampling (it must be between 0.0 and
+1.0).  For more information on the refined start technique, see the paper
+referenced in the introduction of this tutorial.
+
+The example below performs k-means clustering, giving 5 clusters, using the
+refined start technique, sampling 10% of the dataset 25 times to produce the
+initial centroids.
+
+ at code
+\$ kmeans -c 5 -i dataset.csv -v -o assignments.csv -r -S 25 -p 0.2
+ at endcode
+
+ at section kmeans_kmtut The 'KMeans' class
+
+The \c KMeans<> class (with default template parameters) provides a simple way
+to run k-means clustering using \b mlpack in C++.  The default template
+parameters for \c KMeans<> will initialize cluster assignments randomly and
+disallow empty clusters.  When an empty cluster is encountered, the point
+furthest from the cluster with maximum variance is set to the centroid of the
+empty cluster.
+
+ at subsection kmeans_ex1_kmtut Running k-means and getting cluster assignments
+
+The simplest way to use the \c KMeans<> class is to pass in a dataset and a
+number of clusters, and receive the cluster assignments in return.  Note that
+the dataset must be column-major -- that is, one column corresponds to one
+
+ at code
+#include <mlpack/methods/kmeans/kmeans.hpp>
+
+using namespace mlpack::kmeans;
+
+// The dataset we are clustering.
+extern arma::mat data;
+// The number of clusters we are getting.
+extern size_t clusters;
+
+// The assignments will be stored in this vector.
+arma::Col<size_t> assignments;
+
+// Initialize with the default arguments.
+KMeans<> k;
+k.Cluster(data, clusters, assignments);
+ at endcode
+
+Now, the vector \c assignments holds the cluster assignments of each point in
+the dataset.
+
+ at subsection kmeans_ex2_kmtut Running k-means and getting centroids of clusters
+
+Often it is useful to not only have the cluster assignments, but the centroids
+of each cluster.  Another overload of \c Cluster() makes this easily possible:
+
+ at code
+#include <mlpack/methods/kmeans/kmeans.hpp>
+
+using namespace mlpack::kmeans;
+
+// The dataset we are clustering.
+extern arma::mat data;
+// The number of clusters we are getting.
+extern size_t clusters;
+
+// The assignments will be stored in this vector.
+arma::Col<size_t> assignments;
+// The centroids will be stored in this matrix.
+arma::mat centroids;
+
+// Initialize with the default arguments.
+KMeans<> k;
+k.Cluster(data, clusters, assignments, centroids);
+ at endcode
+
+Note that the centroids matrix has columns equal to the number of clusters and
+rows equal to the dimensionality of the dataset.  Each column represents the
+centroid of the according cluster -- \c centroids.col(0) represents the
+centroid of the first cluster.
+
+ at subsection kmeans_ex3_kmtut Limiting the maximum number of iterations
+
+The first argument to the constructor allows specification of the maximum number
+of iterations.  This is useful because often, the k-means algorithm does not
+converge, and is terminated after a number of iterations.  Setting this
+parameter to 0 indicates that the algorithm will run until convergence -- note
+that in some cases, convergence may never happen.  The default maximum number of
+iterations is 1000.
+
+ at code
+// The first argument is the maximum number of iterations.  Here we set it to
+// 500 iterations.
+KMeans<> k(500);
+ at endcode
+
+Then you can run \c Cluster() as normal.
+
+ at subsection kmeans_ex4_kmtut Setting the overclustering factor
+
+For a description of what overclustering is, see
+ at ref cli_ex5_kmtut "the command-line interface tutorial about overclustering".
+
+The overclustering factor, which by default is 1.0 (this indicates that no
+overclustering is happening), is specified in the second argument to the
+constructor.
+
+ at code
+// We will keep the default maximum iterations of 1000, but set the
+// overclustering factor to 2.5.
+KMeans<> k(1000, 2.5);
+ at endcode
+
+Then you can run \c Cluster() as normal.
+
+ at subsection kmeans_ex5_kmtut Setting initial cluster assignments
+
+If you have an initial guess for the cluster assignments for each point, you can
+fill the assignments vector with the guess and then pass an extra boolean
+(initialAssignmentGuess) as true to the \c Cluster() method.  Below are examples
+for either overload of \c Cluster().
+
+ at code
+#include <mlpack/methods/kmeans/kmeans.hpp>
+
+using namespace mlpack::kmeans;
+
+// The dataset we are clustering on.
+extern arma::mat dataset;
+// The number of clusters we are obtaining.
+extern size_t clusters;
+
+// A vector pre-filled with initial assignment guesses.
+extern arma::Col<size_t> assignments;
+
+KMeans<> k;
+
+// The boolean set to true indicates that our assignments vector is filled with
+// initial guesses.
+k.Cluster(dataset, clusters, assignments, true);
+ at endcode
+
+ at code
+#include <mlpack/methods/kmeans/kmeans.hpp>
+
+using namespace mlpack::kmeans;
+
+// The dataset we are clustering on.
+extern arma::mat dataset;
+// The number of clusters we are obtaining.
+extern size_t clusters;
+
+// A vector pre-filled with initial assignment guesses.
+extern arma::Col<size_t> assignments;
+
+// This will hold the centroids of the finished clusters.
+arma::mat centroids;
+
+KMeans<> k;
+
+// The boolean set to true indicates that our assignments vector is filled with
+// initial guesses.
+k.Cluster(dataset, clusters, assignments, centroids, true);
+ at endcode
+
+ at note
+If you have a heuristic or algorithm which makes initial guesses, a
+more elegant solution is to create a new class fulfilling the
+InitialPartitionPolicy template policy.  See \ref kmeans_initial_partition_kmtut
+"the section about changing the initial partitioning strategy" for more details.
+
+ at par
+
+ at note
+If you set the InitialPartitionPolicy parameter to something other than the
+default but give an initial cluster assignment guess, the InitialPartitionPolicy
+will not be used to initialize the algorithm.  See \ref kmeans_initial_partition_kmtut
+"the section about changing the initial partitioning strategy"
+for more details.
+
+ at subsection kmeans_ex6_kmtut Setting initial cluster centroids
+
+An equally important option to being able to make initial cluster assignment
+guesses is to make initial cluster centroid guesses without having to assign
+each point in the dataset to an initial cluster.  This is similar to the
+previous section, but now you must pass two extra booleans -- the first
+(initialAssignmentGuess) as false, indicating that there are not initial cluster
+assignment guesses, and the second (initialCentroidGuess) as true, indicating
+that the centroids matrix is filled with initial centroid guesses.
+
+This, of course, only works with the overload of \c Cluster() that takes a
+matrix to put the resulting centroids in.  Below is an example.
+
+ at code
+#include <mlpack/methods/kmeans/kmeans.hpp>
+
+using namespace mlpack::kmeans;
+
+// The dataset we are clustering on.
+extern arma::mat dataset;
+// The number of clusters we are obtaining.
+extern size_t clusters;
+
+// A matrix pre-filled with guesses for the initial cluster centroids.
+extern arma::mat centroids;
+
+// This will be filled with the final cluster assignments for each point.
+arma::Col<size_t> assignments;
+
+KMeans<> k;
+
+// Remember, the first boolean indicates that we are not giving initial
+// assignment guesses, and the second boolean indicates that we are giving
+// initial centroid guesses.
+k.Cluster(dataset, clusters, assignments, centroids, false, true);
+ at endcode
+
+ at note
+If you have a heuristic or algorithm which makes initial guesses, a
+more elegant solution is to create a new class fulfilling the
+InitialPartitionPolicy template policy.  See \ref kmeans_initial_partition_kmtut
+"the section about changing the initial partitioning strategy" for more details.
+
+ at par
+
+ at note
+If you set the InitialPartitionPolicy parameter to something other than the
+default but give an initial cluster centroid guess, the InitialPartitionPolicy
+will not be used to initialize the algorithm.  See \ref kmeans_initial_partition_kmtut
+"the section about changing the initial partitioning strategy" for more details.
+
+ at subsection kmeans_ex7_kmtut Running sparse k-means
+
+The \c Cluster() function can work on both sparse and dense matrices, so all of
+the above examples can be used with sparse matrices instead.  Below is a simple
+example.  Note that the centroids are returned as a sparse matrix also.
+
+ at code
+// The sparse dataset.
+extern arma::sp_mat sparseDataset;
+// The number of clusters.
+extern size_t clusters;
+
+// The assignments will be stored in this vector.
+arma::Col<size_t> assignments;
+// The centroids of each cluster will be stored in this sparse matrix.
+arma::sp_mat sparseCentroids;
+
+// No template parameter modification is necessary.
+KMeans<> k;
+k.Cluster(sparseDataset, clusters, assignments, sparseCentroids);
+ at endcode
+
+ at section kmeans_template_kmtut Template parameters for the 'KMeans' class
+
+The \c KMeans<> class also takes three template parameters, which can be
+modified to change the behavior of the k-means algorithm.  There are three
+template parameters:
+
+ - \c MetricType: controls the distance metric used for clustering (by
+   default, the squared Euclidean distance is used)
+ - \c InitialPartitionPolicy: the method by which initial clusters are set; by
+   default, \ref mlpack::kmeans::RandomPartition "RandomPartition" is used
+ - \c EmptyClusterPolicy: the action taken when an empty cluster is encountered;
+   by default, \ref mlpack::kmeans::MaxVarianceNewCluster "MaxVarianceNewCluster"
+   is used
+
+The class is defined like below:
+
+ at code
+template<
+  typename DistanceMetric = mlpack::metric::SquaredEuclideanDistance,
+  typename InitialPartitionPolicy = RandomPartition,
+  typename EmptyClusterPolicy = MaxVarianceNewCluster
+>
+class KMeans;
+ at endcode
+
+In the following sections, each policy is described further, with examples of
+how to modify them.
+
+ at subsection kmeans_metric_kmtut Changing the distance metric used for k-means
+
+Most machine learning algorithms in \b mlpack support modifying the distance
+metric, and \c KMeans<> is no exception.  Similar to \ref
+neighbor::NeighborSearch NeighborSearch (see \ref metric_type_doc_nstut
+"the section in the NeighborSearch tutorial"), any class in mlpack::metric can
+be given as an argument.  The mlpack::metric::LMetric class is a good example
+implementation.
+
+A class fulfilling the MetricType policy must provide the following two
+functions:
+
+ at code
+// Empty constructor is required.
+MetricType();
+
+// Computer the distance between two points.
+template<typename VecType>
+double Evaluate(const VecType& a, const VecType& b);
+ at endcode
+
+Most of the standard metrics that could be used are stateless and therefore the
+\c Evaluate() method is implemented statically.  However, there are metrics,
+such as the Mahalanobis distance (mlpack::metric::MahalanobisDistance), that
+store state.  To this end, an instantiated MetricType object is stored within the
+\c KMeans class.  The example below shows how to pass an instantiated
+MahalanobisDistance in the constructor.
+
+ at code
+// The initialized Mahalanobis distance.
+extern mlpack::metric::MahalanobisDistance distance;
+
+// We keep the default arguments for the maximum number of iterations and
+// overclustering factor, but pass our instantiated metric.
+KMeans<mlpack::metric::MahalanobisDistance> k(1000, 1.0, distance);
+ at endcode
+
+ at note
+While the MetricType policy only requires two methods, one of which is an empty
+constructor, more can always be added.  mlpack::metric::MahalanobisDistance also
+has constructors with parameters, because it is a stateful metric.
+
+ at subsection kmeans_initial_partition_kmtut Changing the initial partitioning strategy used for k-means
+
+There have been many initial cluster strategies for k-means proposed in the
+literature.  Fortunately, the \c KMeans<> class makes it very easy to implement
+one of these methods and plug it in without needing to modify the existing
+algorithm code at all.
+
+By default, the \c KMeans<> class uses mlpack::kmeans::RandomPartition, which
+randomly partitions points into clusters.  However, writing a new policy is
+simple; it needs to only implement the following functions:
+
+ at code
+// Empty constructor is required.
+InitialPartitionPolicy();
+
+// This function is called to initialize the clusters.
+template<typename MatType>
+void Cluster(MatType& data,
+             const size_t clusters,
+             arma::Col<size_t> assignments);
+ at endcode
+
+The templatization of the \c Cluster() function allows both dense and sparse
+matrices to be passed in.  If the desired policy does not work with sparse (or
+dense) matrices, then the method can be written specifically for one type of
+matrix -- however, be warned that if you try to use \c KMeans with that policy
+and the wrong type of matrix, you will get many ugly compilation errors!
+
+ at code
+// The Cluster() function specialized for dense matrices.
+void Cluster(arma::mat& data,
+             const size_t clusters,
+             arma::Col<size_t> assignments);
+ at endcode
+
+One alternate to the default RandomPartition policy is the RefinedStart policy,
+which is an implementation of the Bradley and Fayyad approach for finding
+initial points detailed in "Refined initial points for k-means clustering" and
+other places in this document.  Also see the documentation for
+
+The \c Cluster() method must return valid initial assignments for every point in
+the dataset.
+
+As with the MetricType template parameter, an initialized InitialPartitionPolicy
+can be passed to the constructor of \c KMeans as a fourth argument.
+
+ at subsection kmeans_empty_cluster_kmtut Changing the action taken when an empty cluster is encountered
+
+Sometimes, during clustering, a situation will arise where a cluster has no
+points in it.  The \c KMeans class allows easy customization of the action to be
+taken when this occurs.  By default, the point furthest from the centroid of the
+cluster with maximum variance is taken as the centroid of the empty cluster;
+this is implemented in the mlpack::kmeans::MaxVarianceNewCluster class.  Another
+alternate choice is the mlpack::kmeans::AllowEmptyClusters class, which simply
+allows empty clusters to persist.
+
+A custom policy can be written and it must implement the following methods:
+
+ at code
+// Empty constructor is required.
+EmptyClusterPolicy();
+
+// This function is called when an empty cluster is encountered.  emptyCluster
+// indicates the cluster which is empty, and then the clusterCounts and
+// assignments are meant to be modified by the function.  The function should
+// return the number of modified points.
+template<typename MatType>
+size_t EmptyCluster(const MatType& data,
+                    const size_t emptyCluster,
+                    const MatType& centroids,
+                    arma::Col<size_t>& clusterCounts,
+                    arma::Col<size_t>& assignments);
+ at endcode
+
+The \c EmptyCluster() function is called for each cluster that is empty at each
+iteration of the algorithm.  As with InitialPartitionPolicy, the \c
+EmptyCluster() function does not need to be generalized to support both dense
+and sparse matrices -- but usage with the wrong type of matrix will cause
+compilation errors.
+
+Like the other template parameters to \c KMeans, EmptyClusterPolicy
+implementations that have state can be passed to the constructor of \c KMeans as
+a fifth argument.  See the kmeans::KMeans documentation for further details.
+
+ at section further_doc_kmtut Further documentation
+
+For further documentation on the KMeans class, consult the \ref
+mlpack::kmeans::KMeans "complete API documentation".
+
+*/