[mlpack-git] (blog) master: Add final post. (ba5ffd0)
gitdub at mlpack.org
gitdub at mlpack.org
Tue Aug 23 02:01:10 EDT 2016
Repository : https://github.com/mlpack/blog
On branch : master
Link : https://github.com/mlpack/blog/compare/54bf885436a4faafd0027e4356d1208fc65aea85...ba5ffd0e6fda547d9f193c730e4900a9fc1de081
Author: MarcosPividori <marcos.pividori at gmail.com>
Date: Tue Aug 23 03:01:10 2016 -0300
Add final post.
content/blog/MarcosFinal.md | 227 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 227 insertions(+)
diff --git a/content/blog/MarcosFinal.md b/content/blog/MarcosFinal.md
new file mode 100644
@@ -0,0 +1,227 @@
+Title: Approximate Nearest Neighbor Search - Conclusion
+Date: 2016-08-23 3:00:00
+Tags: gsoc, knn, kfn, spill-tree
+Author: Marcos Pividori
+I have been working this summer on the GSoC Project:
+"Approximate Nearest Neighbor Search" (Project site: []).
+The main contribution can be summarized in these pull requests:
+ + [[pull/768]][pull/768] (Api details)
+ + [[pull/766]][pull/766] (Fix furthest neighbor sort policy)
+ + [[pull/765]][pull/765] (Api improvement - move semantics)
+ + [[pull/764]][pull/764] (Fix tests)
+ + [[pull/762]][pull/762] (Greedy Traverser)
+ + [[pull/747]][pull/747] (Spill trees Implementation)
+ + [[pull/732]][pull/732] (Heaps for mlpack)
+ + [[pull/731]][pull/731] (Fix memory leak)
+ + [[pull/727]][pull/727] (Fix TreeTrait for BallTree)
+ + [[pull/693]][pull/693] (Modify NSModel to use boost variant)
+ + [[pull/684]][pull/684] (Approx KNN for Dual tree algorithms)
+ + [[pull/682]][pull/682] (Properly resetting auxBound)
+ + [[pull/668]][pull/668] (Add B_aux according to issue #642)
+ + [[pull/655]][pull/655] (Fix erroneous ball tree definition)
+ + [[pull/646]][pull/646] (Remove duplicated code for traversal info)
+ + [[pull/645]][pull/645] (Properly use Enum type)
+ + [[pull/637]][pull/637] (Fix a mistake in metric policy)
+ + [[pull/19]][pull/19] (Improve programs for ANN and FLANN)
+ + [[pull/17]][pull/17] (Approx KNN for the benchmarking system)
+ + [[pull/14]][pull/14] (Add Num of BaseCases as a metric)
+(List of commits: [[link]][commitlist])
+## Approximate nearest neighbor search:
+I modified mlpack's code to include an *epsilon* value, which represents the
+maximum relative error. It is considered by the prune rules when deciding if a
+given node combination should be analyzed, (as suggested at the end of the paper
+[]), with a general implementation that works for both *KFN* and *KNN*.
+The command line tools: *mlpack_knn* and *mlpack_kfn*, were updated to include
+an extra option *"-e"*, to specify the maximum relative error
+(default value: 0).
+The main code developed was included in the [[pull/684]][pull/684].
+Take into into account that epsilon represents the maximum relative error.
+So, the average relative error (effective error) will be much smaller than the
+epsilon value provided.
+As expected, higher values of the epsilon parameter implies that more nodes are
+pruned nodes and, therefore, we have a faster algorithm, as can be seen in the next graphic for the dataset *isolet*:
+## Spill Trees:
+I have implemented Spill Trees, as defined in: "An Investigation of Practial
+Approximate Nearest Neighbor Algorithms" [] ([[pull/747]][pull/747]).
+It is a variant of binary space trees in which the children of a node
+can "spill over" each other, and contain shared datapoints.
+One problem with Spill Trees is that their depth varies considerably depending
+on the overlapping size $\tau$.
+For that reason, I have implemented Hybrid Spill Trees [],
+which provide better guarantee in the logarithmic depth of the tree.
+The extension was incorporated to existing *mlpack_knn*.
+You can specify *"-t spill"* to consider spill trees, and the command line
+paramenters *"--tau"* to set different values for the overlapping size (default
+value is tau=0), and *"--rho"* to set different values for the balance threshold
+(default value is rho=0.7).
+### Spill Trees's decision boundary:
+We have considered many different approaches for the implementation of Spill
+Trees, see discussions in:
+[[issues/728]][issues/728] and [[spilltrees.pdf]][spilltree.pdf].
+Finally, we decided to implement a similar approach to the one mentioned in Ting
+### Splitting Hyperplanes
+Actual implementation of Spill Trees can be configured to choose between different
+kind of splitting Hyperplanes:
++) Providing the template parameter `HyperplaneType` between two candidate classes:
+`AxisOrthogonalHyperplane` and `Hyperplane` (not necessarily Axis-Orthogonal).
+(In both cases the hyperplane with the widest range of projection values will be
++) Providing the template parameteter `SplitType`, between two candidate classes:
+`MidpointSpaceSplit`, `MeanSpaceSplit`, which determines the splitting value
+to be considered.
+By default, *mlpack_knn* considers *Axis-Orthogonal Hyperplanes* and *Midpoint
+Splits*, because it is the most efficient option (we can project any vector in
+### Defeatist Traversers
+I have implemented *Hybrid SP-Tree Search* as defined in [].
+We can control the hybrid by varying $\tau$. If $\tau$ is zero, we have a pure
+spill tree with defeatist search, very efficient but not accurate enough.
+If $\tau$ is a very large number, then every node is a non-overlapping node and
+we get back to the traditional metric tree, with prunning rules, perfectly
+accurate but not very efficient.
+By setting different values for $\tau$, we have a trade-off between efficiency
+Also, I implemented a Defeatist Dual Tree Traverser, where the *query tree* is
+built without overlapping.
+The `DefeatistDualTreeTraverser` is faster than the
+`DefeatistSingleTreeTraverser`, specially when the value of tau increases.
+Some results can be seen in the next graphic for the dataset *isolet*.
+## General Greedy Search Traverser:
+Also, I implemented a *general greedy single tree traverser* to perform greedy
+search, that always chooses the child with the best score when traversing the
+tree, and doesn't consider backtracking: `GreedySingleTreeTraverser`.
+It is a tree independent implementation (based on the general TreeType API).
+As expected, the greedy traverser performs considerably faster than other
+approachs, at the cost of some relative error in the results.
+(PR: [[pull/762]][pull/762]). Further disccusion in: [[issues/761]][issues/761].
+We can simply reduce the relative error by increasing the leaf size of the tree,
+as is shown in the next graphic for the dataset *isolet*.
+## Other improvements
+Also, I have been improving some parts of the existing code:
+### Bound Issues:
+I found some issues in the definition of the B2 bound. We were discussing about
+it in [[issues/642]][issues/642] and, after thinking about the different
+special cases, we found some examples where actual definition could fail.
+I fixed existing code to consider slighty different bounds.
+### Improve NSModel implementation:
+I have been improving existing code for `NSModel`, as was suggested
+in [[issues/674]][issues/674], using boost variant to manage different
+options for tree types. (PR: [[pull/694]][pull/693]).
+### Heaps for the list of candidates:
+I realized in many parts of the code, we were keeping track of the best k
+candidates visited through a sorted list. Instead of maintaining a
+sorted list, a better approach was to use a priority queue. I implemented
+this in [[pull/732]][pull/732].
+### Benchmarking system:
+I have been updating the benchmarking system to include approximate neighbor
+search not only with mlpack, but also with other libraries like [[ANN]][ANN] and
+[[FLANN]][FLANN] (PR: [[pull/14]][pull/14] and [[pull/19]][pull/19] ).
+Also, I created a new view to plot the progress of a specific metric for
+different values of a method parameter. For example, for knn, it is possible
+to analyze the number of base cases and runtime for different values of
+approximation error (epsilon), with different libraries/configurations.
+I want to thank the mlpack community for giving me the opportunity to work with
+them this summer, it has been a fascinating experience!
+They were very responsive, I always found someone to talk in the IRC channel,
+willing to offer their time to discuss ideas or help me with my project! :)
+For further information see my previous [[blog posts]][blog-posts].
More information about the mlpack-git