[mlpack-git] [mlpack/mlpack] Spill Trees implementation. (#728)

Ryan Curtin notifications at github.com
Fri Jul 22 10:47:53 EDT 2016


I took a look through the document; thanks for writing it up.  I agree with your proof and the results.  (Also as a side note if you don't want `leftbound` and `rightbound` rendered as math text you can use `\mbox{leftbound}` or possibly `\operatorname{leftbound}`, and if you want numbered equations then enter math mode with `\begin{equation}` instead of `$$`.)

You are right that the tighter bounds and `MinDistance()`/`MaxDistance()` are used to enhance pruning for backtracking, but with defeatist search, backtracking is not necessary.  One big issue is that `MinDistance()` and `MaxDistance()` each take `O(d)` time to calculate, whereas determining which side of a splitting hyperplane a point is on is an `O(1)` calculation.  This can be a big difference in higher-dimensional spaces.

I also think that it's not necessarily reasonable to use `NeighborSearchRules`, because there is a lot of overhead there in calculating bounds, and even during the recursion.  For `BinarySpaceTree`, the rules will calculate the minimum distance to both children (two `O(d)` calculations) and then decide which is best to go into first.  There we need both minimum distance values to inform pruning decisions.  But for defeatist search, we only need to solve the decision version of the problem: which node is better to recurse into?  We can do this in `O(d)` if we don't use splitting hyperplanes and in `O(1)` if we do use splitting hyperplanes.  That's a significant acceleration for such an integral part of the search.

But this would mean that

 * We can't use the `SingleTreeTraverser` or `DualTreeTraverser`; we have to write a generic defeatist traverser.  I think that's fairly straightforward.
 * We need to make the splitting dimension accessible for quick calculation.  I think this could be done by a custom `StatisticType`.
 * We need a function that can calculate which child is best.  I think this too could be done in `StatisticType`, as some function like `size_t GetBestChild(VecType& v)` or `size_t GetBestChild(TreeType& n)`.  Alternatively we could add that to the TreeType API, but I am not so enthusiastic about that, since I want to keep that API as small as is feasible.

Still, I think that approach (which maps to the second option in your paper) is going to be the option that produces faster search with only slightly worse accuracy.

Some comments from IRC:

```
16:56 < sumedhghaisas> second... I am assuming spill trees are used widely ....
16:57 < sumedhghaisas> So it won;t be a good decision to implement an improvised version of spill trees but not the original...
```

I agree with this.  If we say we are implementing spill trees, and there is only one well-known way to do that, we should try to stick with the original as much as is reasonably possible.  For something like the kd-trees themselves, we already deviate from Bentley's original formulation by having only leaves hold points, but there are so many variations of kd-trees that I think it is okay.  I think the case is different with spill trees, though.

```
17:03 < sumedhghaisas> Its difficult to gauge their effectiveness... like which one is better... the only way to prove that hrect bounds are better is to compare it to the actual implementation...
```

I agree.  I thought about this for a while.  Although the proof you gave shows that we inspect every reference point within tau/2 of the query, whereas with the original spill tree we inspect every reference point within tau of the query, we have no guarantees on what happens when the nearest neighbor is further than tau from the query.  (For instance, if the query point is very far from any of the reference points.)  It might be interesting to investigate that point, but I am not sure if it is worth the time, because I think any difference is probably going to be marginal at best.

```
17:08 < marcosirc> Ok, yes I agree that the difference in the implementations is a problem if we want to benchmark and make assertions about spill trees in general.
```

That's true, but we must keep in mind that if we are benchmarking and comparing algorithms, it's not possible for us to say "spill trees are better than kd-trees for approximate nearest neighbor search", even on a single dataset.  All we can say is "this implementation of spill trees is better than this implementation of kd-trees for approximate nearest neighbor search"; we can't separate implementation quality from our results.  The best we can do is try to make both as fast as possible. :)

---
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/issues/728#issuecomment-234564149