[mlpack-git] master: Document the state of loading sparse matrices. (e36eec5)

Tue May 31 09:30:55 EDT 2016

Repository : https://github.com/mlpack/mlpack
On branch  : master
Link       : https://github.com/mlpack/mlpack/compare/1dad2b662d595097d77f4f0608e22aaa5546bd67...e36eec5cb250d8c36a49aba5cc1bae6a68723d29

>---------------------------------------------------------------

commit e36eec5cb250d8c36a49aba5cc1bae6a68723d29
Author: Ryan Curtin <ryan at ratml.org>
Date:   Tue May 31 09:30:55 2016 -0400

    Document the state of loading sparse matrices.


>---------------------------------------------------------------

e36eec5cb250d8c36a49aba5cc1bae6a68723d29
 doc/guide/formats.hpp | 43 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 40 insertions(+), 3 deletions(-)

diff --git a/doc/guide/formats.hpp b/doc/guide/formats.hpp
index 4b2d737..ef834af 100644
--- a/doc/guide/formats.hpp
+++ b/doc/guide/formats.hpp
@@ -16,9 +16,10 @@ suitable numeric representation.  Therefore, in general, datasets on disk should
 contain only numeric features in order to be loaded successfully by mlpack.
 
 The types of datasets that mlpack can load are roughly the same as the types of
-matrices that Armadillo can load.  When datasets are loaded by mlpack, \b the
-\b "file's type is detected using the file's extension".  mlpack supports the
-following file types:
+matrices that Armadillo can load.  However, the load functionality that mlpack
+provides \b "only supports loading dense datasets".  When datasets are loaded by
+mlpack, \b the \b "file's type is detected using the file's extension".  mlpack
+supports the following file types:
 
  - csv (comma-separated values), denoted by .csv or .txt
  - tsv (tab-separated values), denoted by .tsv, .csv, or .txt
@@ -101,6 +102,42 @@ As with the command-line programs, the type of data to be loaded is
 automatically detected from the filename extension.  For more details, see the
 mlpack::data::Load() and mlpack::data::Save() documentation.
 
+@section sparseload Dealing with sparse matrices
+
+As mentioned earlier, support for loading sparse matrices in mlpack is not
+available at this time.  To use a sparse matrix with mlpack code, you will have
+to write a C++ program instead of using any of the command-line tools, because
+the command-line tools all use dense datasets internally.  (There is one
+exception: the \c mlpack_cf program, for collaborative filtering, loads sparse
+coordinate lists.)
+
+In addition, the \c mlpack::data::Load() function does not support loading any
+sparse format; so the best idea is to use undocumented Armadillo functionality
+to load coordinate lists.  Suppose you have a coordinate list file like the one
+below:
+
+\code
+$ cat cl.csv
+0 0 0.332
+1 3 3.126
+4 4 1.333
+\endcode
+
+This represents a 5x5 matrix with three nonzero elements.  We can load this
+using Armadillo:
+
+\code
+arma::sp_mat matrix;
+matrix.load("cl.csv", arma::coord_ascii);
+matrix = matrix.t(); // We must transpose after load!
+\endcode
+
+The transposition after loading is necessary if the coordinate list is in
+row-major format (that is, if each row in the matrix represents a point and each
+column represents a feature).  Be sure that the matrix you use with mlpack
+methods has points as columns and features as rows!  See \ref matrices for more
+information.
+
 @section formatcat Categorical features and command line programs
 
 In some situations it is useful to represent data not just as a numeric matrix