[mlpack-svn] r14983 - mlpack/trunk/doc/tutorials/fastmks

fastlab-svn at coffeetalk-1.cc.gatech.edu fastlab-svn at coffeetalk-1.cc.gatech.edu
Wed May 1 18:45:56 EDT 2013


Author: rcurtin
Date: 2013-05-01 18:45:54 -0400 (Wed, 01 May 2013)
New Revision: 14983

Added:
   mlpack/trunk/doc/tutorials/fastmks/fastmks.txt
Log:
I don't have a reliable connection, so check this in so I can work on it from my
local system.


Added: mlpack/trunk/doc/tutorials/fastmks/fastmks.txt
===================================================================
--- mlpack/trunk/doc/tutorials/fastmks/fastmks.txt	                        (rev 0)
+++ mlpack/trunk/doc/tutorials/fastmks/fastmks.txt	2013-05-01 22:45:54 UTC (rev 14983)
@@ -0,0 +1,492 @@
+/*!
+
+ at file fastmks.txt
+ at author Ryan Curtin
+ at brief Tutorial for how to use FastMKS in mlpack.
+
+ at page fmkstutorial Fast max-kernel search tutorial (fastmks)
+
+ at section intro_fmkstut Introduction
+
+The FastMKS algorithm (fast exact max-kernel search) is a recent algorithm
+proposed in the following paper:
+
+ at code
+ at inproceedings{curtin2013fast,
+  title={Fast Exact Max-Kernel Search},
+  author={Curtin, Ryan R. and Ram, Parikshit and Gray, Alexander G.},
+  booktitle={Proceedings of the 2013 SIAM International Conference on Data
+      Mining (SDM 13)},
+  year={2013}
+}
+ at endcode
+
+Given a set of query points \f$Q\f$ and a set of reference points \f$R\f$, the
+FastMKS algorithm is a fast algorithm which finds
+
+\f[
+\argmax_{p_r \in R} K(p_q, p_r)
+\f]
+
+for all points \f$p_q \in Q\f$ and for some Mercer kernel \f$K(\cdot, \cdot)\f$.
+A Mercer kernel is a kernel that is positive semidefinite; these are the classes
+of kernels that can be used with the kernel trick.  In short, the positive
+semidefiniteness of a Mercer kernel means that any kernel matrix (or Gram
+matrix) created on a dataset must be positive semidefinite.
+
+The FastMKS algorithm builds trees on the datasets \f$Q\f$ and \f$R\f$ in such a
+way that explicit representation of the points in the kernel space is
+unnecessary, by using cover trees (\ref mlpack::tree::CoverTree).  This allows
+the algorithm to be run, for instance, on string kernels, where there is no
+sensible explicit representation.  For more details, see the paper.
+
+At the time of this writing there is no other fast algorithm for exact
+max-kernel search.  Also, \b mlpack implements dual-tree FastMKS, while the
+paper referenced above only explains single-tree search.
+
+\b mlpack provides:
+
+ - a \ref cli_fmkstut "simple command-line executable" to run FastMKS
+ - a \ref fastmks_fmkstut "C++ interface" to run FastMKS
+
+ at section toc_fmkstut Table of Contents
+
+A list of all the sections this tutorial contains.
+
+ - \ref intro_fmkstut
+ - \ref toc_fmkstut
+ - \ref cli_fmkstut
+   - \ref cli_ex1_fmkstut
+   - \ref cli_ex2_fmkstut
+   - \ref cli_ex3_fmkstut
+   - \ref cli_ex4_fmkstut
+   - \ref cli_ex5_fmkstut
+ - \ref fastmks_fmkstut
+   - \ref fastmks_ex1_fmkstut
+   - \ref fastmks_ex2_fmkstut
+   - \ref fastmks_ex3_fmkstut
+   - \ref fastmks_ex4_fmkstut
+ - \ref writing_kernel_fmkstut
+ - \ref custom_trees_fmkstut
+ - \ref objects_fmkstut
+ - \ref further_doc_fmkstut
+
+ at section cli_fmkstut Command-line FastMKS (fastmks)
+
+\b mlpack provides a command-line program, \c fastmks, which is used to perform
+FastMKS on a given query and reference dataset.  It supports numerous different
+types of kernels:
+
+ - \ref mlpack::kernel::LinearKernel "linear kernel"
+ - \ref mlpack::kernel::PolynomialKernel "polynomial kernel"
+ - \ref mlpack::kernel::CosineDistance "cosine distance"
+ - \ref mlpack::kernel::GaussianKernel "Gaussian kernel"
+ - \ref mlpack::kernel::EpanechnikovKernel "Epanechnikov kernel"
+ - \ref mlpack::kernel::TriangularKernel "triangular kernel"
+ - \ref mlpack::kernel::HyperbolicTangentKernel "hyperbolic tangent kernel"
+
+The following examples detail usage of the \c fastmks program.  Note that you
+can get documentation on all the possible parameters by typing:
+
+ at code
+$ fastmks --help
+ at endcode
+
+ at subsection cli_ex1_fmkstut FastMKS with a linear kernel on one dataset
+
+If only one dataset is specified (with \c -r or \c --reference_file), the
+reference dataset is taken to be both the query and reference datasets.  The
+example below finds the 5 maximum kernels of each point in dataset.csv, using
+the default linear kernel.
+
+ at code
+$ fastmks -r dataset.csv -k 5 -v -p products.csv -i indices.csv
+ at endcode
+
+When the operation completes, the values of the kernels are saved in
+products.csv and the indices of the points which give the maximum kernels are
+saved in indices.csv.
+
+ at code
+$ head indices.csv
+762,910,863,890,614
+762,910,426,568,863
+910,762,863,426,249
+762,910,863,426,890
+863,910,614,762,159
+762,863,910,614,890
+762,910,488,568,426
+762,910,863,426,614
+910,762,863,426,249
+863,762,910,614,890
+ at endcode
+
+ at code
+$ head products.csv
+1.6221652894e+00,1.5998743443e+00,1.5898890769e+00,1.5406789753e+00,1.5399969285e+00
+1.3387953449e+00,1.3317349486e+00,1.2966613184e+00,1.2774493620e+00,1.2702400994e+00
+1.6386110476e+00,1.6332029753e+00,1.5952629124e+00,1.5887195330e+00,1.5564789777e+00
+1.0917545803e+00,1.0820878726e+00,1.0668992636e+00,1.0419838050e+00,1.0339546654e+00
+1.2272441028e+00,1.2169643942e+00,1.2104597963e+00,1.2067780154e+00,1.1966583848e+00
+1.5720962456e+00,1.5618504956e+00,1.5609069923e+00,1.5235605095e+00,1.5106847348e+00
+1.3655478674e+00,1.3548593212e+00,1.3311547298e+00,1.3250728881e+00,1.3230827266e+00
+2.0119149744e+00,2.0043668067e+00,1.9847289214e+00,1.9298280046e+00,1.9262610223e+00
+1.1586923205e+00,1.1494586097e+00,1.1274872962e+00,1.1248172766e+00,1.1025268196e+00
+4.4789820372e-01,4.4618539778e-01,4.4200024852e-01,4.3989721792e-01,4.3277728840e-01
+ at endcode
+
+We can see in this example that for point 0, the point with maximum kernel value
+is point 762, with a kernel value of 1.622165.  For point 3, the point with
+fifth largest kernel value is point 890, with a kernel value of 1.033954.
+
+ at subsection cli_ex2_fmkstut FastMKS on a reference and query dataset
+
+The query points may be different than the reference points.  To specify a
+different query set, the \c -q (or \c --query_file) option is used, as in the
+example below.
+
+ at code
+$ fastmks -q query_set.csv -r reference_set.csv -k 5 -i indices.csv -p products.csv
+ at endcode
+
+ at subsection cli_ex3_fmkstut FastMKS with a different kernel
+
+The \c fastmks program offers more than just the linear kernel.  Valid options
+are \c 'linear', \c 'polynomial', \c 'cosine', \c 'gaussian', \c 'epanechnikov',
+\c 'triangular', and \c 'hyptan' (the hyperbolic tangent kernel).  Note that the
+hyperbolic tangent kernel is provably not a Mercer kernel but is positive
+semidefinite on most datasets and is commonly used as a kernel.  Note also that
+the Gaussian kernel gives the same results as nearest neighbor search (\see
+nstutorial).
+
+The kernel to use is specified with the \c -K (or \c --kernel) option.  The
+example below uses the cosine similarity as a kernel.
+
+ at code
+$ fastmks -r dataset.csv -k 5 -K cosine -i indices.csv -p products.csv -v
+ at endcode
+
+ at subsection cli_ex4_fmkstut Using single-tree search or naive search
+
+In some cases, it may be useful to not use the dual-tree FastMKS algorithm.
+Instead you can specify the \c --single option, indicating that a tree should be
+built only on the reference set, and then the queries should be processed in a
+linear scan (instead of in a tree).  Alternately, the \c -N (or \c --naive)
+option makes the program not build trees at all and instead use brute-force
+search to find the solutions.
+
+The example below uses single-tree search on two datasets.
+
+ at code
+$ fastmks -q query_set.csv -r reference_set.csv --single -k 5 -p products.csv \
+> -i indices.csv
+ at endcode
+
+The example below uses naive search on one dataset.
+
+ at code
+$ fastmks -r reference_set.csv -k 5 -N -p products.csv -i indices.csv
+ at endcode
+
+ at subsection cli_ex5_fmkstut Paramters for alternate kernels
+
+Many of the alternate kernel choices have parameters which can be chosen; these
+are detailed in this section.
+
+ - \b \c -w (\c --bandwidth): this sets the bandwidth of the kernel, and is
+   applicable to the \c 'gaussian', \c 'epanechnikov', and \c 'triangular'
+   kernels.  This is the "spread" of the kernel.
+
+ - \b \c -d (\c --degree): this sets the degree of the polynomial kernel (the
+   power to which the result is raised).  It is only applicable to the \c
+   'polynomial' kernel.
+
+ - \b \c -o (\c --offset): this sets the offset of the kernel, for the \c
+   'polynomial' and \c 'hyptan' kernel.  See \ref
+   mlpack::kernel::PolynomialKernel "the polynomial kernel documentation" and
+   \ref mlpack::kernel::HyperbolicTangentKernel
+   "the hyperbolic tangent kernel documentation" for more information.
+
+ - \b \c -s (\c --scale): this sets the scale of the kernel, and is only
+   applicable to the \c 'hyptan' kernel.  See \ref
+   mlpack::kernel::HyperbolicTangentKernel
+   "the hyperbolic tangent kernel documentation" for more information.
+
+ at section fastmks_fmkstut The 'FastMKS' class
+
+The \c FastMKS<> class offers a simple API for use within C++ applications, and
+allows further flexibility in kernel choice and tree type choice.  However,
+\c FastMKS<> has no default template parameter for the kernel type -- that must
+be manually specified.  Choices that \b mlpack provides include:
+
+ - \ref mlpack::metric::LinearKernel
+ - \ref mlpack::metric::PolynomialKernel
+ - \ref mlpack::kernel::CosineDistance
+ - \ref mlpack::kernel::GaussianKernel
+ - \ref mlpack::kernel::EpanechnikovKernel
+ - \ref mlpack::kernel::TriangularKernel
+ - \ref mlpack::kernel::HyperbolicTangentKernel
+
+The following examples use kernels from that list.  Writing your own kernel is
+detailed in \ref writing_kernel_fmkstut "the next section".  Remember that when
+you are using the C++ interface, the data matrices must be column-major.  See
+\ref matrices for more information.
+
+ at subsection fastmks_ex1_fmkstut FastMKS on one dataset
+
+Given only a reference dataset, the following code will run FastMKS with k set
+to 5.
+
+ at code
+#include <mlpack/methods/fastmks/fastmks.hpp>
+#include <mlpack/core/kernels/linear_kernel.hpp>
+
+using namespace mlpack::fastmks;
+
+// The reference dataset, which is column-major.
+extern arma::mat data;
+
+// This will initialize the FastMKS object with the linear kernel with default
+// options: K(x, y) = x^T y.  The tree is built in the constructor.
+FastMKS<LinearKernel> f(data);
+
+// The results will be stored in these matrices.
+arma::Mat<size_t> indices;
+arma::mat products;
+
+// Run FastMKS.
+f.Search(5, indices, products);
+ at endcode
+
+ at subsection fastmks_ex2_fmkstut FastMKS with a query and reference dataset
+
+In this setting we have both a query and reference dataset.  We search for 10
+maximum kernels.
+
+ at code
+#include <mlpack/methods/fastmks/fastmks.hpp>
+#include <mlpack/core/kernels/triangular_kernel.hpp>
+
+using namespace mlpack::fastmks;
+using namespace mlpack::kernel;
+
+// The reference and query datasets, which are column-major.
+extern arma::mat referenceData;
+extern arma::mat queryData;
+
+// This will initialize the FastMKS object with the triangular kernel with
+// default options (bandwidth of 1).  The trees are built in the constructor.
+FastMKS<TriangularKernel> f(queryData, referenceData);
+
+// The results will be stored in these matrices.
+arma::Mat<size_t> indices;
+arma::mat products;
+
+// Run FastMKS.
+f.Search(10, indices, products);
+ at endcode
+
+ at subsection fastmks_ex3_fmkstut FastMKS with an initialized kernel
+
+Often, kernels have parameters which need to be specified.  \c FastMKS<> has
+constructors which take initialized kernels.  Note that temporary kernels cannot
+be passed as an argument.  The example below initializes a \c PolynomialKernel
+object and then runs FastMKS with a query and reference dataset.
+
+ at code
+#include <mlpack/methods/fastmks/fastmks.hpp>
+#include <mlpack/core/kernels/polynomial_kernel.hpp>
+
+using namespace mlpack::fastmks;
+using namespace mlpack::kernel;
+
+// The reference and query datasets, which are column-major.
+extern arma::mat referenceData;
+extern arma::mat queryData;
+
+// Initialize the polynomial kernel with degree of 3 and offset of 2.5.
+PolynomialKernel pk(3.0, 2.5);
+
+// Create the FastMKS object with the initialized kernel.
+FastMKS<PolynomialKernel> f(referenceData, queryData, pk);
+
+// The results will be stored in these matrices.
+arma::Mat<size_t> indices;
+arma::mat products;
+
+// Run FastMKS.
+f.Search(10, indices, products);
+ at endcode
+
+The syntax for running FastMKS with one dataset and an initialized kernel is
+very similar:
+
+ at code
+FastMKS<PolynomialKernel> f(referenceData, pk);
+ at endcode
+
+ at subsection fastmks_ex4_fmkstut FastMKS with an already-created tree
+
+By default, \c FastMKS<> uses the cover tree datastructure (see \ref
+mlpack::tree::CoverTree).  Sometimes, it is useful to modify the parameters of
+the cover tree.  In this scenario, a tree must be built outside of the
+constructor, and then passed to the appropriate \c FastMKS<> constructor.  An
+example on just a reference dataset is shown below, where the base of the cover
+tree is modified.
+
+We also use an instantiated kernel, but because we are building our own tree, we
+must use \c IPMetric<> so that our tree is built on the metric induced by our
+kernel function.
+
+ at code
+#include <mlpack/methods/fastmks/fastmks.hpp>
+#include <mlpack/core/kernels/polynomial_kernel.hpp>
+
+// The reference dataset, which is column-major.
+extern arma::mat data;
+
+// Initialize the polynomial kernel with a degree of 4 and offset of 2.0.
+PolynomialKernel pk(4.0, 2.0);
+
+// Create the metric induced by this kernel (because a kernel is not a metric
+// and we can't build a tree on a kernel alone).
+IPMetric<PolynomialKernel> metric(pk);
+
+// Now build a tree on the reference dataset using the instantiated metric and
+// the custom base of 1.5 (default is 1.3).  We have to be sure to use the right
+// type here -- FastMKS needs the FastMKSStat object as the tree's
+// StatisticType.
+typedef tree::CoverTree<IPMetric<PolynomialKernel>, tree::FirstPointIsRoot,
+    FastMKSStat> TreeType; // Convenience typedef.
+TreeType* tree = new TreeType(data, metric, 1.5);
+
+// Now initialize FastMKS with that statistic.  We don't need to specify the
+// TreeType template parameter since we are still using the default.  We don't
+// need to pass the kernel because that is contained in the tree.
+FastMKS<PolynomialKernel> f(data, tree);
+
+// The results will be stored in these matrices.
+arma::Mat<size_t> indices;
+arma::mat products;
+
+// Run FastMKS.
+f.Search(10, indices, products);
+ at endcode
+
+The syntax is similar for the case where different query and reference datasets
+are given; but trees for both need to be built in the manner specified above.
+Be sure to build both trees using the same metric (or at least a metric with the
+exact same parameters).
+
+ at code
+FastMKS<PolynomialKernel> f(referenceData, referenceTree, queryData, queryTree);
+ at endcode
+
+ at section writing_kernel_fmkstut Writing a custom kernel for FastMKS
+
+While \b mlpack provides some number of kernels in the mlpack::kernel namespace,
+it is easy to create a custom kernel.  To satisfy the KernelType policy, a class
+must implement the following methods:
+
+ at code
+// Empty constructor is required.
+KernelType();
+
+// Evaluate the kernel between two points.
+template<typename VecType>
+double Evaluate(const VecType& a, const VecType& b);
+ at endcode
+
+The template parameter \c VecType is helpful (but not necessary) so that the
+kernel can be used with both sparse and dense matrices (\c arma::sp_mat and \c
+arma::mat).
+
+ at section custom_tree_fmkstut Using other tree types for FastMKS
+
+The use of the cover tree (\see tree::CoverTree) is not necessary for FastMKS,
+although it is the default tree type.  A different type of tree can be specified
+with the TreeType template parameter.  However, the tree type is required to
+have \ref fastmks::FastMKSStat FastMKSStat as the StatisticType.
+
+Below is an example of the \ref mlpack::tree::BinarySpaceTree BinarySpaceTree
+being used as the tree type for FastMKS.  In this example FastMKS is only run on
+one dataset.
+
+ at code
+#include <mlpack/methods/fastmks/fastmks.hpp>
+#include <mlpack/core/tree/binary_space_tree/binary_space_tree.hpp>
+
+using namespace mlpack::fastmks;
+using namespace mlpack::tree;
+
+// The dataset that FastMKS will be run on.
+extern arma::mat data;
+
+// The tree used for FastMKS.  We use a convenience typedef to make things a
+// little prettier.
+typedef BinarySpaceTree<bound::HRectBound<2, true>, FastMKSStat> TreeType;
+
+// The FastMKS constructor will create the tree.
+FastMKS<LinearKernel, TreeType> f(data);
+
+// These will hold the results.
+arma::Mat<size_t> indices;
+arma::mat products;
+
+// Run FastMKS.
+f.Search(5, indices, products);
+ at endcode
+
+Note that the \c BinarySpaceTree class used here is not the only possible option
+for the \c TreeType parameter; custom trees can be written, or, alternately,
+different template parameters can be used to configure either the \c
+BinarySpaceTree or the \c CoverTree.
+
+ at section objects_fmkstut Running FastMKS on objects
+
+FastMKS has a lot of utility on objects which are not representable in some sort
+of metric space.  These objects might be strings, graphs, models, or other
+objects.  For these types of objects, questions based on distance don't really
+make sense.  One good example is with strings.  The question "how far is 'dog'
+from 'Taki Inoue'?" simply doesn't make sense.  We can't have a centroid of the
+terms 'Fritz', 'E28', and 'popsicle'.
+
+However, what we can do is define some sort of kernel on these objects.  These
+kernels generally correspond to some similarity measure, with one example being
+the p-spectrum string kernel (see \ref mlpack::kernel::PSpectrumStringKernel).
+Using that, we can say "how similar is 'dog' to 'Taki Inoue'?" and get an actual
+numerical result by evaluating K('dog', 'Taki Inoue') (where K is our p-spectrum
+string kernel).
+
+The only requirement on these kernels is that they are positive definite kernels
+(or Mercer kernels).  For more information on those details, refer to the
+FastMKS paper ("Fast exact max-kernel search" by R.R. Curtin, P. Ram, and A.G.
+Gray).
+
+Remember that FastMKS is a tree-based method.  But trees like the binary space
+tree require centroids -- and as we said earlier, centroids often don't make
+sense with these types of objects.  Therefore, we need a type of tree which is
+built \i exclusively on points in the dataset -- those are points which we can
+evaluate our kernel function on.  The cover tree is one example of a type of
+tree satisfying this condition; its construction will only call the kernel
+function on two points that are in the dataset.
+
+But, we have one more problem.  The \c CoverTree class is built on \c arma::mat
+objects (dense matrices).  Our objects, however, are not necessarily
+representable in a column of a matrix.  To use the example we have been using,
+strings cannot be represented easily in a matrix because they may all have
+different lengths.
+
+The way to work around this problem is to create a "fake" data matrix which
+simply holds indices to objects.  A good example of how to do this is detailed
+in the documentation for the \ref mlpack::kernel::PSpectrumStringKernel
+"PSpectrumStringKernel".
+
+In short, the trick
+
+ at section further_doc_fmkstut Further documentation
+
+For further documentation on the FastMKS class, consult the \ref
+mlpack::fastmks::FastMKS "complete API documentation".
+
+*/




More information about the mlpack-svn mailing list