[mlpack-git] [mlpack/mlpack] Vantage point tree (#708)
Ryan Curtin
notifications at github.com
Thu Jul 21 20:09:50 EDT 2016
Ok, it looks pretty good but I think there are a lot of optimizations we can make. Aside from the comments I've made in the diffs, it looks like there is a test failure for `RSModelTest`, probably for the VP tree. I know why this is, but it is a subtle detail that I am not sure is documented:
The mlpack dual-tree algorithms all assume that all of the query points are in the leaf nodes of the query tree. This is true for the kd tree, ball tree, cover tree, and rectangle tree (... I think? Correct me if I am wrong), but not the VP tree as implemented here, since we are not duplicating points inside the tree. As a result, what happens is that if I have two two-level binary trees like this:
```
N_q0
| |
N_q1 N_q2
```
```
N_r0
| |
N_r1 N_r2
```
then we will visit node combinations like this (if nothing is pruned):
```
(N_q0, N_r0)
(N_q1, N_r1)
(N_q1, N_r2)
(N_q2, N_r1)
(N_q2, N_r2)
```
and at each visit, we will perform base cases between the points held in each node. But note that we never visit `(N_q0, N_r1)` or `(N_q0, N_r2)`, so base cases will not be performed between any points in `N_q0` and `N_r1` or `N_r2`, and the results end up being wrong.
Therefore, we need to, in our traverser, ensure that base cases between high-level query points and low-level reference points still happen. We could do this if we forced a single-tree traversal with a vantage point. So if we say that `N_q0` contains a single point `p_q0` that neither of its children do, then our results would be correct if we performed a single-tree traversal with the point-node combinations `(p_q0, N_r1)` and `(p_q0, N_r2)`. (This generalizes to the case where there are more than three nodes in the tree.)
Similarly, the converse case is true---`N_q1` and `N_q2` will not have BaseCase() called with their points and whatever point is held by `N_r0`. So for any given query/reference node combination, we need to
* fire off single-tree recursions for the query vantage point and the reference nodes
* perform `BaseCase()` between all points in the query node and all of the parent vantage points of the reference nodes
Hopefully this makes sense. We should do the single-tree recursions and the extra base cases before we score the child nodes, so that some of the more interesting terms in the bounding functions will be used (like the `NeighborSearchRules` bounds that @MarcosPividori worked on).
So I guess what I am saying is, the dual-tree recursion structure should look like this:
```
Traverse(VPTree& q, VPTree& r):
- Perform base cases as usual.
- Fire off single-tree reference recursions if this is a left child: Traverse(q.Point(0), r.Left()), Traverse(q.Point(0), r.Right());
- Do extra base cases: BaseCase(q.Point(i), r.Parent()->Point(0)), BaseCase(q.Point(i), r.Parent()->Parent()->Point(0)), etc.
- Do usual traversal into (q.Left(), r.Left()), (q.Left(), r.Right()), (q.Right(), r.Left()), (q.Right(), r.Right())
```
Let me know if I can clarify anything about what I have written here. I think maybe it is a bit confusing.
My other concern is that the implementation seems to not prune very well in either single-tree or dual-tree mode. If I run with the corel dataset (37749x32, I can supply it if you like) with the kd-tree, I get these statistics:
```
$ bin/mlpack_knn -q ~/datasets/corel.csv -r ~/datasets/corel.csv -v -k 3 -t kd --single | grep "were"
[INFO ] 16220040 node combinations were scored.
[INFO ] 13861878 base cases were calculated.
```
```
$ bin/mlpack_knn -q ~/datasets/corel.csv -r ~/datasets/corel.csv -v -k 3 -t vp --single | grep "were"
[INFO ] 194931712 node combinations were scored.
[INFO ] 1214753327 base cases were calculated.
```
Maybe there is an error somewhere, where the bounds are way looser than they need to be?
---
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/708#issuecomment-234420857
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mailman.cc.gatech.edu/pipermail/mlpack-git/attachments/20160721/5f7899cc/attachment-0001.html>
More information about the mlpack-git
mailing list