[mlpack-git] master: Add a tutorial for the formats. (6ccf865)

gitdub at mlpack.org gitdub at mlpack.org
Tue Apr 12 10:43:52 EDT 2016

Repository : https://github.com/mlpack/mlpack
On branch  : master
Link       : https://github.com/mlpack/mlpack/compare/eeba6bdc50ad4d785cb6880edbaba78173036ca6...8d77f4231046703d5c0c05ed4795458f98267968


commit 6ccf8654454db88c11b26d0a4b65ac898d7fab53
Author: Ryan Curtin <ryan at ratml.org>
Date:   Fri Apr 8 18:51:55 2016 +0000

    Add a tutorial for the formats.


 doc/guide/formats.hpp | 304 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 304 insertions(+)

diff --git a/doc/guide/formats.hpp b/doc/guide/formats.hpp
new file mode 100644
index 0000000..0ce8702
--- /dev/null
+++ b/doc/guide/formats.hpp
@@ -0,0 +1,304 @@
+/*! @page formatdoc File formats in mlpack
+@section formatintro Introduction
+mlpack supports a wide variety of data and model formats for use in both its
+command-line programs and in C++ programs using mlpack via the
+mlpack::data::Load() function.  This tutorial discusses the formats that are
+supported and how to use them.
+@section formattypes Supported dataset types
+Datasets in mlpack are represented internally as sparse or dense numeric
+matrices (specifically, as \c arma::mat or \c arma::sp_mat or similar).  This
+means that when datasets are loaded from file, they must be converted to a
+suitable numeric representation.  Therefore, in general, datasets on disk should
+contain only numeric features in order to be loaded successfully by mlpack.
+The types of datasets that mlpack can load are roughly the same as the types of
+matrices that Armadillo can load.  When datasets are loaded by mlpack, \b the
+\b "file's type is detected using the file's extension".  mlpack supports the
+following file types:
+ - csv (comma-separated values), denoted by .csv or .txt
+ - tsv (tab-separated values), denoted by .tsv, .csv, or .txt
+ - ASCII (raw ASCII, with space-separated values), denoted by .txt
+ - Armadillo ASCII (Armadillo's text format with a header), denoted by .txt
+ - PGM, denoted by .pgm
+ - PPM, denoted by .ppm
+ - Armadillo binary, denoted by .bin
+ - Raw binary, denoted by .bin \b "(note: this will be loaded as"
+   \b "one-dimensional data, which is likely not what is desired.)"
+ - HDF5, denoted by .hdf, .hdf5, .h5, or .he5 \b "(note: HDF5 must be enabled"
+   \b "in the Armadillo configuration)"
+ - ARFF, denoted by .arff \b "(note: this is not supported by all mlpack"
+   \b "command-line programs"; see \ref formatinfo )
+Datasets that are loaded by mlpack should be stored with \b "one row for "
+\b "one point" and \b "one column for one dimension".  Therefore, a dataset with
+three two-dimensional points \f$(0, 1)\f$, \f$(3, 1)\f$, and \f$(5, -5)\f$ would
+be stored in a csv file as:
+0, 1
+3, 1
+5, -5
+As noted earlier, the format is automatically detected at load time.  Therefore,
+a dataset can be loaded in many ways:
+$ mlpack_logistic_regression -t dataset.csv -v
+[INFO ] Loading 'dataset.csv' as CSV data.  Size is 32 x 37749.
+$ mlpack_logistic_regression -t dataset.txt -v
+[INFO ] Loading 'dataset.txt' as raw ASCII formatted data.  Size is 32 x 37749.
+$ mlpack_logistic_regression -t dataset.h5 -v
+[INFO ] Loading 'dataset.h5' as HDF5 data.  Size is 32 x 37749.
+Similarly, the format to save to is detected by the extension of the given
+filename.
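The extension-based detection described above can be pictured with a small standalone sketch. This is not mlpack's implementation: `Extension()` and `GuessFormat()` are hypothetical helpers, and the returned labels simply restate the list of supported file types given in this tutorial. Note that several extensions (.csv, .txt, .bin) are shared by more than one format, so the extension alone only narrows down the candidates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of mlpack): return the text after the last
// '.' in the final path component, or "" if there is no extension.
std::string Extension(const std::string& filename)
{
  const std::size_t dot = filename.rfind('.');
  const std::size_t slash = filename.rfind('/');
  if (dot == std::string::npos)
    return "";
  if (slash != std::string::npos && dot < slash)
    return ""; // The only '.' belongs to a directory name.
  return filename.substr(dot + 1);
}

// Hypothetical helper (not part of mlpack): map an extension to the candidate
// formats listed in the tutorial.
std::string GuessFormat(const std::string& filename)
{
  const std::string ext = Extension(filename);
  if (ext == "csv") return "csv or tsv";
  if (ext == "tsv") return "tsv";
  if (ext == "txt") return "csv, tsv, raw ASCII, or Armadillo ASCII";
  if (ext == "pgm") return "pgm";
  if (ext == "ppm") return "ppm";
  if (ext == "bin") return "Armadillo binary or raw binary";
  if (ext == "hdf" || ext == "hdf5" || ext == "h5" || ext == "he5")
    return "hdf5";
  if (ext == "arff") return "arff";
  return "unknown";
}
```

With these helpers, `GuessFormat("dataset.h5")` yields `"hdf5"`, while `GuessFormat("dataset.txt")` yields a list of candidates that would have to be told apart by inspecting the file contents.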
+@section formatcpp Loading simple matrices in C++
+When C++ is being written, the mlpack::data::Load() and mlpack::data::Save()
+functions are used to load and save datasets, respectively.  These functions
+should be preferred over the built-in Armadillo \c .load() and \c .save()
+functions.
+Matrices in mlpack are column-major, meaning that each column should correspond
+to a point in the dataset and each row should correspond to a dimension; for
+more information, see \ref matrices .  This is at odds with how the data is
+stored in files; therefore, a transposition is required during load and save.
+The mlpack::data::Load() and mlpack::data::Save() functions do this
+automatically (unless otherwise specified), which is why they are preferred over
+the Armadillo functions.
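To make the transposition concrete, here is a minimal standalone sketch that uses only the standard library. `ColMajorMatrix` and `LoadTransposed()` are hypothetical names, not mlpack or Armadillo types. The sketch assumes a well-formed comma-separated input in which every line has the same number of numeric fields. Because each line of the file is one point, appending its values contiguously produces a column-major matrix in which each column is one point.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a dense column-major matrix (not arma::mat).
struct ColMajorMatrix
{
  std::size_t rows = 0; // Number of dimensions.
  std::size_t cols = 0; // Number of points.
  std::vector<double> values; // Element (r, c) lives at values[c * rows + r].

  double At(const std::size_t r, const std::size_t c) const
  {
    return values[c * rows + r];
  }
};

// Read one point per line and store each point as one column.
// Assumes every line has the same number of comma-separated numeric fields.
ColMajorMatrix LoadTransposed(std::istream& in)
{
  ColMajorMatrix m;
  std::string line;
  while (std::getline(in, line))
  {
    if (line.empty())
      continue;

    std::istringstream fields(line);
    std::string field;
    std::size_t count = 0;
    while (std::getline(fields, field, ','))
    {
      m.values.push_back(std::stod(field)); // stod skips leading spaces.
      ++count;
    }

    if (m.cols == 0)
      m.rows = count; // The first line fixes the dimensionality.
    ++m.cols;
  }
  return m;
}
```

Applied to the three-point csv example shown earlier, this gives a matrix with 2 rows and 3 columns, where `At(1, 2)` is the second dimension of the third point, namely -5.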
+To load a matrix from file, the call is straightforward.  After creating a
+matrix object, the data can be loaded:
+arma::mat dataset; // The data will be loaded into this matrix.
+mlpack::data::Load("dataset.csv", dataset);
+Saving matrices is equally straightforward.  The code below generates a random
+matrix with 10 points in 3 dimensions and saves it to a file as HDF5.
+// 3 dimensions (rows), with 10 points (columns).
+arma::mat dataset = arma::randu<arma::mat>(3, 10);
+mlpack::data::Save("dataset.h5", dataset);
+As with the command-line programs, the type of data to be loaded is
+automatically detected from the filename extension.  For more details, see the
+mlpack::data::Load() and mlpack::data::Save() documentation.
+@section formatcat Categorical features and command line programs
+In some situations it is useful to represent data not just as a numeric matrix
+but also as categorical data (i.e., with categories that are encoded as numbers
+but have no ordering).  This support is useful for, e.g., decision trees and
+other models that support categorical features.  Categorical data might look
+like this (in CSV format):
+0, 1, "true", 3
+5, -2, "false", 5
+2, 2, "true", 4
+3, -1, "true", 3
+4, 4, "not sure", 0
+0, 7, "false", 6
+In the example above, the third dimension (which takes values "true", "false",
+and "not sure") is categorical.  mlpack can load and work with this data, but
+the strings must be mapped to numbers, because all datasets in mlpack are
+represented by Armadillo matrix objects.
+From the perspective of an mlpack command-line program, this support is
+transparent; mlpack will attempt to load the data file, and if it detects
+entries in the file that are not numeric, it will map them to numbers and then
+print, for each dimension, the number of mappings.  For instance, if we run the
+\c mlpack_hoeffding_tree program (which supports categorical data) on the
+dataset above (stored as dataset.csv), we receive this output during loading:
+$ mlpack_hoeffding_tree -t dataset.csv -l dataset.labels.csv -v
+[INFO ] Loading 'dataset.csv' as CSV data.  Size is 6 x 4.
+[INFO ] 0 mappings in dimension 0.
+[INFO ] 0 mappings in dimension 1.
+[INFO ] 3 mappings in dimension 2.
+[INFO ] 0 mappings in dimension 3.
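The per-dimension mapping counts in the output above can be reproduced with a short standalone sketch. This is not how mlpack implements loading: `Trim()`, `IsNumeric()`, and `CountMappings()` are hypothetical helpers, and the sketch assumes a simple comma-separated file with no commas inside quoted fields. It counts, for each dimension, the distinct entries that do not parse as numbers.

```cpp
#include <cstddef>
#include <cstdlib>
#include <istream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: strip leading and trailing whitespace.
std::string Trim(const std::string& s)
{
  const char* whitespace = " \t\r\n";
  const std::size_t begin = s.find_first_not_of(whitespace);
  if (begin == std::string::npos)
    return "";
  const std::size_t end = s.find_last_not_of(whitespace);
  return s.substr(begin, end - begin + 1);
}

// Hypothetical helper: true if the whole token parses as a number.
bool IsNumeric(const std::string& token)
{
  if (token.empty())
    return false;
  char* end = nullptr;
  std::strtod(token.c_str(), &end);
  return end == token.c_str() + token.size();
}

// Count the distinct non-numeric entries seen in each dimension.
// Assumes no commas appear inside quoted fields.
std::vector<std::size_t> CountMappings(std::istream& in)
{
  std::vector<std::set<std::string>> seen;
  std::string line;
  while (std::getline(in, line))
  {
    std::istringstream fields(line);
    std::string field;
    std::size_t dim = 0;
    while (std::getline(fields, field, ','))
    {
      if (seen.size() <= dim)
        seen.resize(dim + 1);
      const std::string token = Trim(field);
      if (!IsNumeric(token))
        seen[dim].insert(token);
      ++dim;
    }
  }

  std::vector<std::size_t> counts;
  for (const std::set<std::string>& s : seen)
    counts.push_back(s.size());
  return counts;
}
```

For the six-point example dataset above, the counts are 0, 0, 3, and 0, matching the number of mappings reported for each dimension.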
+@section formatcatcpp Categorical features and C++
+When writing C++, loading categorical data is slightly more tricky: the mappings
+from strings to integers must be preserved.  This is the purpose of the
+mlpack::data::DatasetInfo class, which stores these mappings and can be used at
+load and save time to apply and de-apply the mappings.
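The idea of keeping a two-way mapping can be sketched with the standard library alone. The `CategoryMap` class below is hypothetical and is not mlpack::data::DatasetInfo. It handles a single dimension, and it assigns integers in first-seen order, which is an assumption of this sketch rather than a statement about mlpack's behavior.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical illustration only; this is not mlpack::data::DatasetInfo.
class CategoryMap
{
 public:
  // Return the integer for a string, creating a new mapping if needed.
  std::size_t Map(const std::string& value)
  {
    const auto it = forward.find(value);
    if (it != forward.end())
      return it->second;

    const std::size_t index = reverse.size();
    forward.emplace(value, index);
    reverse.push_back(value);
    return index;
  }

  // Return the string that an integer was created from (throws
  // std::out_of_range if the integer was never assigned).
  const std::string& Unmap(const std::size_t index) const
  {
    return reverse.at(index);
  }

  // Number of distinct strings mapped so far.
  std::size_t NumMappings() const { return reverse.size(); }

 private:
  std::unordered_map<std::string, std::size_t> forward; // string -> integer
  std::vector<std::string> reverse;                     // integer -> string
};
```

Because the object remembers earlier assignments, mapping "true" while reading a second file returns the same integer that it received in the first file. That shared state is what makes training and test sets comparable, and `Unmap()` is what allows the original strings to be written back out at save time.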
+When loading a dataset with categorical data, the overload of
+mlpack::data::Load() that takes an mlpack::data::DatasetInfo object should be
+used.  An example is below:
+arma::mat dataset; // Load into this matrix.
+mlpack::data::DatasetInfo info; // Store information about dataset in this.
+// Load the ARFF dataset.
+mlpack::data::Load("dataset.arff", dataset, info);
+After this load completes, the \c info object will hold the information about
+the mappings necessary to load the dataset.  It is possible to re-use the
+\c DatasetInfo object to load another dataset with the same mappings.  This is
+useful when, for instance, both a training and test set are being loaded, and it
+is necessary that the mappings from strings to integers for categorical features
+are identical.  An example is given below.
+arma::mat trainingData; // Load training data into this matrix.
+mlpack::data::DatasetInfo info; // This will store the mappings.
+// Load the training data, and create the mappings in the 'info' object.
+mlpack::data::Load("training_data.arff", trainingData, info);
+// Load the test data, but re-use the 'info' object with the already initialized
+// mappings.  This means that the same mappings will be applied to the test set.
+arma::mat testData; // Load test data into this matrix.
+mlpack::data::Load("test_data.arff", testData, info);
+When saving data, pass the same DatasetInfo object it was loaded with in order
+to unmap the categorical features correctly.  The example below demonstrates
+this functionality: it loads the dataset, increments all non-categorical
+features by 1, and then saves the dataset with the same DatasetInfo it was
+loaded with.
+arma::mat dataset; // Load data into this matrix.
+mlpack::data::DatasetInfo info; // This will store the mappings.
+// Load the dataset.
+mlpack::data::Load("dataset.tsv", dataset, info);
+// Loop over all features, and add 1 to all non-categorical features.
+for (size_t i = 0; i < info.Dimensionality(); ++i)
+  // The Type() function returns whether or not the data is numeric or
+  // categorical.
+  if (info.Type(i) != mlpack::data::Datatype::categorical)
+    dataset.row(i) += 1.0;
+// Save the modified dataset using the same DatasetInfo.
+mlpack::data::Save("dataset-new.tsv", dataset, info);
+There is more functionality to the DatasetInfo class; for more information, see
+the mlpack::data::DatasetInfo documentation.
+@section formatmodels Loading and saving models
+Using \c boost::serialization, mlpack is able to load and save machine learning
+models with ease.  These models can currently be saved in three formats:
+ - binary (.bin); this is not human-readable, but it is small
+ - text (.txt); this is sort of human-readable and relatively small
+ - xml (.xml); this is human-readable but very verbose and large
+The type of file to save is determined by the given file extension, as with the
+other loading and saving functionality in mlpack.  Below is an example where a
+dataset stored as TSV and labels stored as ASCII text are used to train a
+logistic regression model, which is then saved to model.xml.
+$ mlpack_logistic_regression -t training_dataset.tsv -l training_labels.txt \
+> -M model.xml
+Many mlpack command-line programs have support for loading and saving models
+through the \c --input_model_file (\c -m) and \c --output_model_file (\c -M)
+options; for more information, see the documentation for each program
+(accessible by passing \c --help as a parameter).
+@section formatmodelscpp Loading and saving models in C++
+mlpack uses the \c boost::serialization library internally to perform loading
+and saving of models, and provides convenience overloads of mlpack::data::Load()
+and mlpack::data::Save() to load and save these models.
+To be serializable, a class must implement the method
+template<typename Archive>
+void Serialize(Archive& ar, const unsigned int version);
+For more information on this method and how it works, see the
+boost::serialization documentation (TODO: add link).  Note that mlpack uses a
+\c Serialize() method and not a \c serialize() method, and also mlpack uses the
+mlpack::data::CreateNVP() method instead of \c BOOST_SERIALIZATION_NVP() ; this
+is for coherence with the mlpack style guidelines, and is done via a
+particularly complex bit of template metaprogramming in
+src/mlpack/core/data/serialization_shim.hpp (read that file if you want your
+head to hurt!).
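The reason a single templated method can serve both loading and saving is easier to see in a stripped-down sketch. The archives below are hypothetical stand-ins written with the standard library. They are not boost::serialization archives, `Interval` is not mlpack::math::Range, and the sketch omits name-value pairs entirely. The only thing carried over from the text above is the method signature.

```cpp
#include <istream>
#include <ostream>
#include <sstream>

// Hypothetical minimal output archive: operator& writes a value.
struct TextOutArchive
{
  std::ostream& out;

  template<typename T>
  TextOutArchive& operator&(const T& value)
  {
    out << value << ' ';
    return *this;
  }
};

// Hypothetical minimal input archive: operator& reads a value.
struct TextInArchive
{
  std::istream& in;

  template<typename T>
  TextInArchive& operator&(T& value)
  {
    in >> value;
    return *this;
  }
};

// Hypothetical serializable class with the method signature shown above.
struct Interval
{
  double lo = 0.0;
  double hi = 0.0;

  template<typename Archive>
  void Serialize(Archive& ar, const unsigned int /* version */)
  {
    // The same two lines save or load, depending on the archive type.
    ar & lo;
    ar & hi;
  }
};
```

Calling `Serialize()` with a `TextOutArchive` writes `lo` and `hi` to the stream, and calling it with a `TextInArchive` on the same stream reads them back in the same order.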
+Examples of Serialize() methods can be found in most classes; one fairly
+straightforward example is the mlpack::math::Range class (TODO: add link).  A
+more complex example is the mlpack::tree::BinarySpaceTree class (TODO: add
+link).
+Using the mlpack::data::Load() and mlpack::data::Save() functions is easy if the
+type being saved has a \c Serialize() method implemented: simply call either
+function with a filename, a name for the object to save, and the object itself.
+The example below, for instance, creates an mlpack::math::Range object and saves
+it as range.txt.  Then, that range is loaded from file into another
+mlpack::math::Range object.
+// Create range and save it.
+mlpack::math::Range r(0.0, 5.0);
+mlpack::data::Save("range.txt", "range", r);
+// Load into new range.
+mlpack::math::Range newRange;
+mlpack::data::Load("range.txt", "range", newRange);
+It is important to be sure that you load the appropriate type; if you save, for
+instance, an mlpack::regression::LogisticRegression object and attempt to load
+it as an mlpack::math::Range object, the load will fail and an exception will be
+thrown.  (When the object is saved as binary (.bin), it is possible that the
+load will not fail, but instead load with mangled data, which is perhaps even
+worse.)
+@section formatfinal Final notes
+If the examples here are unclear, it would be worth looking into the ways that
+mlpack::data::Load() and mlpack::data::Save() are used in the code.  Some
+example files that may be useful to this end:
+ - src/mlpack/methods/logistic_regression/logistic_regression_main.cpp
+ - src/mlpack/methods/hoeffding_trees/hoeffding_tree_main.cpp
+ - src/mlpack/methods/neighbor_search/allknn_main.cpp
+If you are interested in adding support for more data types to mlpack, it is
+preferable to add that support upstream in Armadillo first; then very little
+code modification in mlpack will be necessary.
