[mlpack-git] master: The beginning of a document about the TreeType API. (02c5995)
gitdub at big.cc.gt.atl.ga.us
gitdub at big.cc.gt.atl.ga.us
Thu Jul 16 12:16:39 EDT 2015
Repository : https://github.com/mlpack/mlpack
On branch : master
Link : https://github.com/mlpack/mlpack/compare/b4659b668021db631b3c8a48e3d735b513706fdc...015d79104286231ef70ea0ba1d99a6c397025b3e
>---------------------------------------------------------------
commit 02c5995f392243a8b50a6299984577cee6e60e27
Author: Ryan Curtin <ryan at ratml.org>
Date: Wed Jul 15 14:26:44 2015 +0000
The beginning of a document about the TreeType API.
>---------------------------------------------------------------
02c5995f392243a8b50a6299984577cee6e60e27
doc/policies/trees.hpp | 796 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 796 insertions(+)
diff --git a/doc/policies/trees.hpp b/doc/policies/trees.hpp
new file mode 100644
index 0000000..4a06de9
--- /dev/null
+++ b/doc/policies/trees.hpp
@@ -0,0 +1,796 @@
+/*! @page trees The TreeType policy in mlpack
+
+ at section treeintro Introduction
+
+Trees are an important data structure in mlpack and are used in a number of the
+machine learning algorithms that mlpack implements. Often, the use of trees can
+allow significant acceleration of an algorithm; this is generally done by
+pruning away large parts of the tree during computation.
+
+Most mlpack algorithms that use trees are not tied to a specific tree but
+instead allow the user to choose a tree via the \c TreeType template parameter.
+Any tree passed as a \c TreeType template parameter will need to implement a
+certain set of functions.
+
+This document aims to clarify the abstractions underlying mlpack trees, list and
+describe the required functionality of the \c TreeType policy, and point users
+towards existing types of trees. A table of contents is below:
+
+ - \ref treeintro
+ - \ref whatistree
+ - \ref treetype_template_params
+ - \ref treetype_api
+ - \ref treetype_rigorous
+ - \ref treetype_rigorous_template
+ - \ref treetype_rigorous_constructor
+ - \ref treetype_rigorous_basic
+ - \ref treetype_rigorous_complex
+ - \ref treetype_rigorous_serialization
+ - \ref treetype_more
+
+Although this document is long, there may still be errors and unclear areas. If
+you are having trouble understanding anything, please get in touch on Github or
+on the mailing list and someone will help you (and possibly update the
+documentation afterwards).
+
+ at section whatistree What is a tree?
+
+In mlpack, we assume that we have some sort of data matrix, which might be
+sparse or dense (that is, it could be of type \c arma::mat or \c arma::sp_mat,
+or any variant that implements the Armadillo API). This data matrix corresponds
+to a collection of points in some space (usually a Euclidean space). A tree is
+a way of organizing this data matrix in a hierarchical manner---so, points that
+are nearby should lie in similar nodes.
+
+We can rigorously define what a tree is, using the definition of \b "space tree"
+introduced in the following paper:
+
+ at quote
+R.R. Curtin, W.B. March, P. Ram, D.V. Anderson, A.G. Gray, and C.L. Isbell Jr.,
+"Tree-independent dual-tree algorithms," in Proceedings of the 30th
+International Conference on Machine Learning (ICML '13), pp. 1435--1443, 2013.
+ at endquote
+
+The definition is:
+
+ at defn
+A \b space \b tree on a dataset \f$ S \in \mathcal{R}^{N \times d} \f$ is an
+undirected, connected, acyclic, rooted simple graph with the following
+properties:
+
+ - Each node (or vertex) holds a number of points (possibly zero) and is
+connected to one parent node and a number of child nodes (possibly zero).
+
+ - There is one node in every space tree with no parent; this is the root node
+of the tree.
+
+ - Each point in \f$S\f$ is contained in at least one node.
+
+ - Each node corresponds to some subset of \f$\mathcal{R}^d\f$ that contains
+each point in the node and also the subsets that correspond to each child of the
+node.
+ at enddefn
+
+This is really a quite straightforward definition: a tree is hierarchical, and
+each node corresponds to some region of the input space. Each node may have
+some number of children, and may hold some number of points. However, there is
+an important terminology distinction to make: the term
+\b "points held by a node" has a different meaning than the term
+\b "descendant points held by a node". The points held in a node are just
+that---points held only in the node. The descendant points of a node are the
+combination of the points held in a node with the points held in the node's
+children and the points held in the node's children's children (and so forth).
+For the purposes of clarity in all discussions about trees, care is taken to
+differentiate the terms "descendant point" and "point".
+
+Now, it's also important to note that a point does not \e need to hold any
+children, and that a node \e can hold the same points as its children (or its
+parent). Some types of trees do this. For instance, each node in the cover
+tree holds only one point, and may have a child that holds the same point. As
+another example, the \f$kd\f$-tree holds its points only in the leaves (at the
+bottom of the tree). More information on space trees can be found in either the
+"Tree-independent dual-tree algorithms" paper or any of the related literature.
+
+So there is a huge amount of possible variety in the types of trees that can
+fall into the class of \e "space trees". Therefore, it's important to treat
+them abstractly, and the \c TreeType policy allows us to do just that. All we
+need to remember is that a node in a tree can be represented as the combination
+of some points held in the node, some child nodes, and some geometric structure
+that represents the space that all of the descendant points fall into (this is a
+restatement of the fourth part of the definition).
+
+ at section treetype_template_params Template parameters required by the TreeType policy
+
+Most everything in mlpack is decomposed into a series of configurable template
+parameters, and trees are no exception. In order to ease usage of high-level
+mlpack algorithms, each \c TreeType itself must be a template class taking three
+parameters:
+
+ - \c MetricType -- the underlying metric that the tree will be built on
+ - \c StatisticType -- holds any auxiliary information that individual
+algorithms may need
+ - \c MatType -- the type of the matrix used to represent the data
+
+The reason that these three template parameters are necessary is so that each
+\c TreeType can be used as a template template parameter, which can radically
+simplify the required syntax for instantiating mlpack algorithms. By using
+template template parameters, a user needs only to write
+
+ at code
+// The RangeSearch class takes a MetricType and a TreeType template parameter.
+
+// This code instantiates RangeSearch with the ManhattanDistance and a
+// QuadTree. Note that the QuadTree itself is a template, and takes a
+// MetricType, StatisticType, and MatType, just like the policy requires.
+
+// This example ignores the constructor parameters, for the sake of simplicity.
+RangeSearch<ManhattanDistance, QuadTree> rs(...);
+ at endcode
+
+as opposed to the far more complicated alternative, where the user must specify
+the values of each template parameter of the tree type:
+
+ at code
+// This is a much worse alternative, where the user must specify the template
+// arguments of their tree.
+RangeSearch<ManhattanDistance,
+ QuadTree<ManhattanDistance, EmptyStatistic, arma::mat>> rs(...);
+ at endcode
+
+Unfortunately, the price to pay for this user convenience is that \e every
+\c TreeType must have three template parameters, and they must be in exactly
+that order. Fortunately, there is an additional benefit: we are guaranteed that
+the tree is built using the same metric as the method (that is, a user can't
+specify different metric types to the algorithm and to the tree, which they can
+without template template parameters).
+
+There are two important notes about this:
+
+ - Not every possible input of MetricType, StatisticType, and/or MatType
+necessarily need to be valid or work correctly for each type of tree. For
+instance, the QuadTree is limited to Euclidean metrics and will not work
+otherwise. Either compile-time static checks or detailed documentation can help
+keep users from using invalid combinations of template arguments.
+
+ - Some types of trees have more template parameters than just these three. One
+example is the generalized binary space tree, where the bounding shape of each
+node is easily made into a fourth template parameter (the \c BinarySpaceTree
+class calls this the \c BoundType parameter), and the procedure used to split a
+node is easily made into a fifth template parameter (the \c BinarySpaceTree
+class calls this the \c SplitType parameter). However, the syntax of template
+template parameters \e requires that the class only has the correct number of
+template parameters---no more, no less. Fortunately, C++11 allows template
+typedefs, which can be used to provide partial specialization of template
+classes:
+
+ at code
+// This is the definition of the BinarySpaceTree class, which has five template
+// parameters.
+template<typename MetricType,
+ typename StatisticType,
+ typename MatType,
+ typename BoundType,
+ typename SplitType>
+class BinarySpaceTree;
+
+// The 'using' keyword gives us a template typedef, so we can define the
+// MeanSplitKDTree template class, which has three parameters and is a valid
+// TreeType policy class.
+template<typename MetricType, typename StatisticType, typename MatType>
+using MeanSplitKDTree = BinarySpaceTree<MetricType,
+ StatisticType,
+ MatType,
+ HRectBound<MetricType>
+ MeanSplit<BoundType, MetricType>>;
+ at endcode
+
+Now, the \c MeanSplitKDTree class has only three template parameters and can be
+used as a \c TreeType policy class in various mlpack algorithms. Many types of
+trees in mlpack have more than three template parameters and rely on template
+typedefs to provide simplified \c TreeType interfaces.
+
+ at section treetype_api The TreeType API
+
+As a result of the definition of \e "space tree" in the previous section, a
+simplified API presents itself quite easily. However, more complex
+functionality is often necessary in mlpack, so this leads to more functions
+being necessary for a class to satisfy the \c TreeType policy. Combining this
+with the template parameters required for trees given in the previous section
+gives us the complete API required for a class implementing the \c TreeType
+policy. Below is the minimal set of functions required with minor
+documentation for each function. (More extensive documentation and explanation
+is given afterwards.)
+
+ at code
+// The three template parameters will be supplied by the user, and are detailed
+// in the previous section.
+template<typename MetricType,
+ typename StatisticType,
+ typename MatType>
+class ExampleTree
+{
+ public:
+ //////////////////////
+ //// Constructors ////
+ //////////////////////
+
+ // This batch constructor does not modify the dataset, and builds the entire
+ // tree using a default-constructed MetricType.
+ ExampleTree(const MatType& data);
+
+ // This batch constructor does not modify the dataset, and builds the entire
+ // tree using the given MetricType.
+ ExampleTree(const MatType& data, MetricType& metric);
+
+ // Initialize the tree from a given boost::serialization archive. SFINAE (the
+ // second argument) is necessary to ensure that the archive is loading, not
+ // saving.
+ template<typename Archive>
+ ExampleTree(
+ Archive& ar,
+ const typename boost::enable_if<typename Archive::is_loading>::type* = 0);
+
+ // Release any resources held by the tree.
+ ~ExampleTree();
+
+ // ///////////////////////// //
+ // // Basic functionality // //
+ // ///////////////////////// //
+
+ // Get the dataset that the tree is built on.
+ const MatType& Dataset();
+
+ // Get the metric that the tree is built with.
+ MetricType& Metric();
+
+ // Get/modify the StatisticType for this node.
+ StatisticType& Stat();
+
+ // Return the parent of the node, or NULL if this is the root.
+ ExampleTree* Parent();
+
+ // Return the number of children held by the node.
+ size_t NumChildren();
+ // Return the i'th child held by the node.
+ ExampleTree& Child(const size_t i);
+
+ // Return the number of points held in the node.
+ size_t NumPoints();
+ // Return the index of the i'th point held in the node.
+ size_t Point(const size_t i);
+
+ // Return the number of descendant nodes of this node.
+ size_t NumDescendantNodes();
+ // Return the i'th descendant node of this node.
+ ExampleTree& DescendantNode(const size_t i);
+
+ // Return the number of descendant points of this node.
+ size_t NumDescendants();
+ // Return the index of the i'th descendant point of this node.
+ size_t Descendant(const size_t i);
+
+ // ///////////////////////////////////////////////// //
+ // // More complex distance-related functionality // //
+ // ///////////////////////////////////////////////// //
+
+ // Return the distance between the center of this node and the center of
+ // its parent.
+ double ParentDistance();
+
+ // Return an upper bound on the furthest possible distance between the
+ // center of the node and any point held in the node.
+ double FurthestPointDistance();
+
+ // Return an upper bound on the furthest possible distance between the
+ // center of the node and any descendant point of the node.
+ double FurthestDescendantDistance();
+
+ // Return a lower bound on the minimum distance between the center and any
+ // edge of the node's bounding shape.
+ double MinimumBoundDistance();
+
+ // Return a lower bound on the minimum distance between the given point and
+ // the node.
+ template<typename VecType>
+ double MinDistance(VecType& point);
+
+ // Return a lower bound on the minimum distance between the given node and
+ // this node.
+ double MinDistance(ExampleTree& otherNode);
+
+ // Return an upper bound on the maximum distance between the given point and
+ // the node.
+ template<typename VecType>
+ double MaxDistance(VecType& point);
+
+ // Return an upper bound on the maximum distance between the given node and
+ // this node.
+ double MaxDistance(ExampleTree& otherNode);
+
+ // Return the combined results of MinDistance() and MaxDistance().
+ template<typename VecType>
+ math::Range RangeDistance(VecType& point);
+
+ // Return the combined results of MinDistance() and MaxDistance().
+ math::Range RangeDistance(ExampleTree& otherNode);
+
+ // //////////////////////////////////// //
+ // // Serialization (loading/saving) // //
+ // //////////////////////////////////// //
+
+ // Return a string representation of the tree.
+ std::string ToString() const;
+
+ // Serialize the tree (load from the given archive / save to the given
+ // archive, depending on its type).
+ template<typename Archive>
+ void Serialize(Archive& ar, const unsigned int version);
+
+ protected:
+ // A default constructor; only meant to be used by boost::serialization. This
+ // must be protected so that boost::serialization will work; it does not need
+ // to return a valid tree.
+ ExampleTree();
+
+ // Friend access must be given for the default constructor.
+ friend class boost::serialization::access;
+};
+ at endcode
+
+Although this is significantly more complex than the four-item definition of
+\e "space tree" might suggest, it turns out many of these methods are not
+difficult to implement for most reasonable tree types. It is also important to
+realize that this is a \e minimum API; you may implement more complex tree types
+at your leisure (and you may include more template parameters too, though you
+will have to use template typedefs to provide versions with three parameters;
+see \ref treetype_template_params "the previous section").
+
+Before diving into the detailed documentation for each function, let us consider
+a few important points about the implications of this API:
+
+ - \b "Trees are not default-constructible" and should not (in general) provide
+a default constructor. This helps prevent invalid trees. In general, any
+instantiated mlpack object should be valid and ready to use---and a tree built
+on no points is not valid or ready to use.
+
+ - \b "Trees only need to provide batch constructors." Although many tree types
+do have algorithms for incremental insertions, in mlpack this is not required
+because the tree-based algorithms that mlpack implements generally assume
+fully-built, non-modifiable trees. For this purpose, batch construction is
+perfectly sufficient. (It's also worth pointing out that for some types of
+trees, like kd-trees, the cost of a handful of insertions often outweighs the
+cost of completely rebuilding the tree.)
+
+ - \b "Trees must provide a number of distance bounding functions." The utility
+of trees generally stems from the ability to place quick bounds on
+distance-related quantities. For instance, if all the descendant points of a
+node are bounded by a ball of radius \f$\lambda\f$ and the center of the node
+is a point \f$c\f$, then the minimum distance between some point \f$p\f$ and any
+descendant point of the node is equal to the distance between \f$p\f$ and
+\f$c\f$ minus the radius \f$\lambda\f$: \f$d(p, c) - \lambda\f$. This is a fast
+calculation, and (usually) provides a decent bound on the minimum distance
+between \f$p\f$ and any descendant point of the node.
+
+ - \b "Trees need to be able to be serialized." mlpack uses the
+boost::serialization library for saving and loading objects. Trees---which can
+be a part of machine learning models---therefore must have the ability to be
+saved and loaded. Making this all work requires a protected constructor (part
+of the API) and generally makes it impossible to hold references instead of
+pointers internally, because if a tree is loaded from a file then it must own
+the dataset it is built on and the metric it uses (this also means that a
+destructor must exist for freeing these resources).
+
+Now, we can consider each part of the API more rigorously.
+
+ at section treetype_rigorous Rigorous API documentation
+
+This section is divided into five parts:
+
+ - \ref treetype_rigorous_template
+ - \ref treetype_rigorous_constructor
+ - \ref treetype_rigorous_basic
+ - \ref treetype_rigorous_complex
+ - \ref treetype_rigorous_serialization
+
+ at subsection treetype_rigorous_template Template parameters
+
+\ref treetype_template_param "An earlier section" discussed the three different
+template parameters that are required by the \c TreeType policy.
+
+The \c MetricType policy provides one method that will be useful for tree
+building and other operations:
+
+ at code
+// This function is required by the MetricType policy.
+// Evaluate the metric between two points (which may be of different types).
+template<typename VecTypeA, typename VecTypeB>
+void Evaluate(VecTypeA& a, VecTypeB& b);
+ at endcode
+
+Note that this method is not necessarily static, so a \c MetricType object
+should be held internally and its \c Evaluate() method should be called whenever
+the distance between two points is required. \b "It is generally a bad idea to
+hardcode any distance calculation in your tree." This will make the tree unable
+to generalize to arbitrary metrics. If your tree must depend on certain
+assumptions holding about the metric (i.e. the metric is a Euclidean metric),
+then make that clear in the documentation of the tree, so users do not try to
+use the tree with an inappropriate metric.
+
+The second template parameter, \c StatisticType, is for auxiliary information
+that is required by certain algorithms. For instance, consider an algorithm
+which repeatedly uses the variance of the descendant points of a node. It might
+be tempting to add a \c Variance() method to the required \c TreeType API, but
+this quickly leads to code bloat (after all, the API already has quite enough
+functions as it is). Instead, it is better to create a \c StatisticType class
+which provides the \c Variance() method, and then call \c Stat().Variance() when
+the variance is required. This also holds true for cached data members.
+
+Each node should have its own instance of a \c StatisticType class. The
+\c StatisticType must provide the following constructor:
+
+ at code
+// This constructor is required by the StatisticType policy.
+template<typename TreeType>
+StatisticType(TreeType& node);
+ at endcode
+
+This constructor should be called with \c (*this) after the node is constructed
+(usually, this ends up being the last line in the constructor of a node).
+
+The last template parameter is the \c MatType parameter. This is generally
+\c arma::mat or \c arma::sp_mat, but could be any Armadillo type, including
+matrices that hold data points of different precisions (such as \c float or even
+\c int). It generally suffices to write \c MatType assuming that \c arma::mat
+will be used, since the vast majority of the time this will be what is used.
+
+ at subsection treetype_rigorous_constructor Constructors and destructors
+
+The \c TreeType API requires at least three constructors. Technically, it does
+not \e require a destructor, but almost certainly your tree class will be doing
+some memory management internally and should have one (though not always).
+
+The first two constructors are variations of the same idea:
+
+ at code
+// This batch constructor does not modify the dataset, and builds the entire
+// tree using a default-constructed MetricType.
+ExampleTree(const MatType& data);
+
+// This batch constructor does not modify the dataset, and builds the entire
+// tree using the given MetricType.
+ExampleTree(const MatType& data, MetricType& metric);
+ at endcode
+
+All that is required here is that a constructor is available that takes a
+dataset and optionally an instantiated metric. If no metric is provided, then
+it should be assumed that the \c MetricType class has a default constructor and
+a default-constructed metric should be used. The constructor \e must return a
+valid, fully-constructed, ready-to-use tree that satisfies the definition of
+\e "space tree" that was \ref whatistree "given earlier".
+
+It is possible to implement both these constructors as one by using \c
+boost::optional.
+
+The third constructor requires the tree to be initializable from a \c
+boost::serialization archive:
+
+ at code
+// Initialize the tree from a given boost::serialization archive. SFINAE (the
+// second argument) is necessary to ensure that the archive is loading, not
+// saving.
+template<typename Archive>
+ExampleTree(
+ Archive& ar,
+ const typename boost::enable_if<typename Archive::is_loading>::type* = 0);
+ at endcode
+
+This has implications on how the tree must be stored. In this case, the dataset
+is \e "not yet loaded" and therefore the tree \b "may be required to have"
+\b "ownership of the data matrix". This means that realistically the most
+reasonable way to represent the data matrix internally in a tree class is not
+with a reference but instead with a pointer. If this is true, then a destructor
+will be required:
+
+ at code
+// Release any resources held by the tree.
+~ExampleTree();
+ at endcode
+
+and, if the data matrix is represented internally with a pointer, this
+destructor will need to release the memory for the data matrix (in the case that
+the tree was created via \c boost::serialization ).
+
+Note that these constructors are not necessarily the only constructors that a
+\c TreeType implementation can provide. One important example of when more
+constructors are useful is when the tree rearranges points internally; this
+might be desired for the sake of speed or memory optimization. But to do this
+with the required constructors would necessarily incur a copy of the data
+matrix, because the user will pass a \c "const MatType&". One alternate
+solution is to provide a constructor which takes an rvalue reference to a
+\c MatType:
+
+ at code
+template<typename Archive>
+ExampleTree(MatType&& data);
+ at endcode
+
+(and another overload that takes an instantiated metric), and then the user can
+use \c std::move() to build the tree without copying the data matrix, although
+the data matrix will be modified:
+
+ at code
+ExampleTree exTree(std::move(dataset));
+ at endcode
+
+It is, of course, possible to add even more constructors if desired.
+
+ at subsection treetype_rigorous_basic Basic tree functionality
+
+The basic functionality of a class implementing the \c TreeType API is quite
+straightforward and intuitive.
+
+ at code
+// Get the dataset that the tree is built on.
+const MatType& Dataset();
+ at endcode
+
+This should return a \c const reference to the dataset the tree is built on.
+The fact that this function is required essentially means that each node in the
+tree must store a pointer to the dataset (this is not the only option, but it is
+the most obvious option).
+
+ at code
+// Get the metric that the tree is built with.
+MetricType& Metric();
+ at endcode
+
+Each node must also store an instantiated metric or a pointer to one (note that
+this is required even for metrics that have no state and have a \c static \c
+Evaluate() function).
+
+ at code
+// Get/modify the StatisticType for this node.
+StatisticType& Stat();
+ at endcode
+
+As discussed earlier, each node must hold a \c StatisticType; this is accessible
+through the \c Stat() function.
+
+ at code
+// Return the parent of the node, or NULL if this is the root.
+ExampleTree* Parent();
+
+// Return the number of children held by the node.
+size_t NumChildren();
+// Return the i'th child held by the node.
+ExampleTree& Child(const size_t i);
+
+// Return the number of points held in the node.
+size_t NumPoints();
+// Return the index of the i'th point held in the node.
+size_t Point(const size_t i);
+
+// Return the number of descendant nodes of this node.
+size_t NumDescendantNodes();
+// Return the i'th descendant node of this node.
+ExampleTree& DescendantNode(const size_t i);
+
+// Return the number of descendant points of this node.
+size_t NumDescendants();
+// Return the index of the i'th descendant point of this node.
+size_t Descendant(const size_t i);
+ at endcode
+
+These functions are all fairly self-explanatory. Most algorithms will use the
+\c Parent(), \c Children(), \c NumChildren(), \c Point(), and \c NumPoints()
+functions, so care should be taken when implementing those functions to ensure
+they will be efficient. Note that \c Point() and \c Descendant() should return
+indices of points, so the actual points can be accessed by calling
+\c "Dataset().col(Point(i))" for some index \c i (or something similar).
+
+An important note about the \c Descendant() function is that each descendant
+point should be unique. So if a node holds the point with index 6 and it has
+one child that holds the points with indices 6 and 7, then \c NumDescendants()
+should return 2, not 3. The ordering in which the descendants are returned can
+be arbitrary; so, \c Descendant(0) can return 6 \b or 7, and \c Descendant(1)
+should return the other index.
+
+ at subsection treetype_rigorous_complex Complex tree functionality and bounds
+
+A node in a tree should also be able to calculate various distance-related
+bounds; these are particularly useful in tree-based algorithms. Note that any
+of these bounds does not necessarily need to be maximally tight; generally it is
+more important that each bound can be easily calculated.
+
+Details on each bounding function that the \c TreeType API requires are given
+below.
+
+ at code
+// Return the distance between the center of this node and the center of
+// its parent.
+double ParentDistance();
+ at endcode
+
+Remember that each node corresponds to some region in the space that the dataset
+lies in. For most tree types this shape is often something geometrically
+simple: a ball, a cone, a hyperrectangle, a slice, or something similar. The
+\c ParentDistance() function should return the distance between the center of
+this node's region and the center of the parent node's region.
+
+In practice this bound is often used in dual-tree (or single-tree) algorithms to
+place an easy \c MinDistance() (or \c MaxDistance() ) bound for a child node;
+the parent's \c MinDistance() (or \c MaxDistance() ) function is called and then
+adjusted with \c ParentDistance() to provide a possibly loose but efficient
+bound on what the result of \c MinDistance() (or \c MaxDistance() ) would be
+with the child.
+
+ at code
+// Return an upper bound on the furthest possible distance between the
+// center of the node and any point held in the node.
+double FurthestPointDistance();
+
+// Return an upper bound on the furthest possible distance between the
+// center of the node and any descendant point of the node.
+double FurthestDescendantDistance();
+ at endcode
+
+It is often very useful to be able to bound the radius of a node, which is
+effectively what \c FurthestDescendantDistance() does. Often it is easiest to
+simply calculate and cache the furthest descendant distance at tree construction
+time. Some trees, such as the cover tree, are able to give guarantees that the
+points held in the node will necessarily be closer than the descendant points;
+therefore, the \c FurthestPointDistance() function is also useful.
+
+It is permissible to simply have \c FurthestPointDistance() return the result of
+\c FurthestDescendantDistance(), and that will still be a valid bound, but
+depending on the type of tree it may be possible to have \c
+FurthestPointDistance() return a tighter bound.
+
+ at code
+// Return a lower bound on the minimum distance between the center and any
+// edge of the node's bounding shape.
+double MinimumBoundDistance();
+ at endcode
+
+This is, admittedly, a somewhat complex and weird quantity. It is one of the
+less important bounding functions, so it is valid to simply return 0...
+
+The bound is a bound on the minimum distance between the center of the node and
+any edge of the shape that bounds all of the descendants of the node. So, if
+the bounding shape is a ball (as in a ball tree or a cover tree), then
+\c MinimumBoundDistance() should just return the radius of the ball. If the
+bounding shape is a hypercube (as in a generalized octree), then
+\c MinimumBoundDistance() should return the side length divided by two. If the
+bounding shape is a hyperrectangle (as in a kd-tree or a spill tree), then
+\c MinimumBoundDistance() should return half the side length of the
+hyperrectangle's smallest side.
+
+ at code
+// Return a lower bound on the minimum distance between the given point and
+// the node.
+template<typename VecType>
+double MinDistance(VecType& point);
+
+// Return a lower bound on the minimum distance between the given node and
+// this node.
+double MinDistance(ExampleTree& otherNode);
+
+// Return an upper bound on the maximum distance between the given point and
+// the node.
+template<typename VecType>
+double MaxDistance(VecType& point);
+
+// Return an upper bound on the maximum distance between the given node and
+// this node.
+double MaxDistance(ExampleTree& otherNode);
+
+// Return the combined results of MinDistance() and MaxDistance().
+template<typename VecType>
+math::Range RangeDistance(VecType& point);
+
+// Return the combined results of MinDistance() and MaxDistance().
+math::Range RangeDistance(ExampleTree& otherNode);
+ at endcode
+
+These six functions are almost without a doubt the most important functionality
+of a tree. Therefore, it is preferable that these methods be implemented as
+efficiently as possible, as they may potentially be called many millions of
+times in a tree-based algorithm. It is also preferable that these bounds be as
+tight as possible. In tree-based algorithms, these are used for pruning away
+work, and tighter bounds mean that more pruning is possible.
+
+Of these six functions, there are only really two bounds that are desired here:
+the \e "minimum distance" between a node and an object, and the
+\e "maximum distance" between a node and an object. The object may be either a
+vector (usually \c arma::vec ) or another tree node.
+
+Consider the first case, where the object is a vector. The result of
+\c MinDistance() needs to be less than or equal to the true minimum distance,
+which could be calculated as below:
+
+ at code
+// We assume that we have a vector 'vec', and a tree node 'node'.
+double trueMinDist = DBL_MAX;
+for (size_t i = 0; i < node.NumDescendants(); ++i)
+{
+ const double dist = node.Metric().Evaluate(vec,
+ node.Dataset().col(node.Descendant(i)));
+ if (dist < trueMinDist)
+ trueMinDist = dist;
+}
+// At the end of the loop, trueMinDist will hold the true minimum distance
+// between 'vec' and any descendant point of 'node'.
+ at endcode
+
+Often the bounding shape of a node will allow a quick calculation that will make
+a reasonable bound. For instance, if the node's bounding shape is a ball with
+radius \c r and center \c ctr, the calculation is simply
+\c "(node.Metric().Evaluate(vec, ctr) - r)". Usually a good \c MinDistance() or
+\c MaxDistance() function will make only one call to the \c Evaluate() function
+of the metric.
+
+The \c RangeDistance() function allows a way for both bounds to be calculated at
+once. It is possible to implement this as a call to \c MinDistance() followed
+by a call to \c MaxDistance(), but this may incur more metric \c Evaluate()
+calls than necessary. Often calculating both bounds at once can be more
+efficent and can be done with fewer \c Evaluate() calls than calling both
+\c MinDistance() and \c MaxDistance().
+
+ at subsection treetype_rigorous_serialization Serialization
+
+The last two public functions that the \c TreeType API requires are related to
+serialization and printing.
+
+ at code
+// Return a string representation of the tree.
+std::string ToString() const;
+ at endcode
+
+There are few restrictions on the precise way that the \c ToString() function
+should operate, but generally it should behave similarly to the \c ToString()
+function in other mlpack methods. Generally, a user will call \c ToString()
+when they want to inspect the object and see what it looks like. For a tree,
+printing the entire tree may be way more information than the user was
+expecting, so it may be a better option to print either only the node itself or
+the node plus one or two levels of children.
+
+ at code
+// Serialize the tree (load from the given archive / save to the given
+// archive, depending on its type).
+template<typename Archive>
+void Serialize(Archive& ar, const unsigned int version);
+
+protected:
+// A default constructor; only meant to be used by boost::serialization. This
+// must be protected so that boost::serialization will work; it does not need
+// to return a valid tree.
+ExampleTree();
+
+// Friend access must be given for the default constructor.
+friend class boost::serialization::access;
+ at endcode
+
+On the other hand, the specifics of the functionality required for the
+\c Serialize() function are somewhat more difficult. The \c Serialize()
+function will be called either when a tree is being saved to disk or loaded from
+disk. The \c boost::serialization documentation is fairly comprehensive, but
+when writing a \c Serialize() method for mlpack trees you should use
+\c data::CreateNVP() instead of \c BOOST_SERIALIZATION_NVP(). This is because
+mlpack classes implement \c Serialize() instead of \c serialize() in order to
+conform to the mlpack style guidelines, and making this work requires some
+interesting shim code, which is hidden inside of \c data::CreateNVP(). It may
+be useful to look at other \c Serialize() methods contained in other mlpack
+classes as an example.
+
+An important note is that it is very difficult to use references with
+\c boost::serialization, because \c Serialize() may be called at any time during
+the object's lifetime, and references cannot be re-seated. In general this will
+require the use of pointers, which then require manual memory management.
+Therefore, be careful that \c Serialize() (and the tree's destructor) properly
+handle memory management!
+
+ at section treetype_more A list of trees in mlpack and more information
+
+
+
+*/
More information about the mlpack-git
mailing list