[mlpack-git] [mlpack/mlpack] Spill trees (#747)
MarcosPividori
notifications at github.com
Wed Aug 10 09:34:55 EDT 2016
> + }
> +
> + std::vector<size_t> leftPoints, rightPoints;
> + // Split the node.
> + overlappingNode = SplitPoints(tau, rho, points, leftPoints, rightPoints);
> +
> + // We don't need the information in points, so lets clean it.
> + std::vector<size_t>().swap(points);
> +
> + // Now we will recursively split the children by calling their constructors
> + // (which perform this splitting process).
> + left = new SpillTree(this, leftPoints, tau, maxLeafSize, rho);
> + right = new SpillTree(this, rightPoints, tau, maxLeafSize, rho);
> +
> + // Update count number, to represent the number of descendant points.
> + count = left->NumDescendants() + right->NumDescendants();
Hi,
I think this is a complicated problem. I have been thinking about it for a while.
The problem is this:
Suppose we have a overlapping node N with 2 children: N<sub>1</sub> and N<sub>2</sub>.
As N<sub>1</sub> and N<sub>2</sub> share some data points, we know that:
N<sub>1</sub>.NumDescendants() + N<sub>2</sub>.NumDescendants() > N.NumDescendants()
We want to avoid keeping a list of indexes for each non-leaf node. So, we want to access the:
`[0, N.NumDescendants())` set of descendants points of the node N, based on the methods: N<sub>1</sub>.Descendant() and N<sub>2</sub>.Descendant(), setting proper indexes.
For example, we could sort the points so shared points appear at the end of the list (with the greatest indexes), and we record the index `overlapIndex` where the set of duplicated points starts.
Let's call `Shared1` the set of points shared by N<sub>1</sub> and N<sub>2</sub>:
N<sub>1</sub>: | ................ | ..... `Shared1` ..... |
N<sub>2</sub>: | ................ | ..... `Shared1` ..... |
So, if we want to access to `N.Descendant(i)`, we will do:
```
size_t Descendant(const size_t index) const
{
if (IsLeaf())
return (*pointsIndex)[index];
size_t num = left->overlapIndex;
if (index < num)
return left->Descendant(index);
if (right)
return right->Descendant(index - num);
}
```
The problem is that this can not be maintained recursively.
Suppose we split N<sub>1</sub> according to another splitting hyperplane, resulting in: N<sub>11</sub> and N<sub>12</sub>.
Let's call `Shared2` the set of points shared by N<sub>11</sub> and N<sub>12</sub>:
N<sub>11</sub>: | ................ | ..... `Shared2` ..... | ..... A subset of `Shared1`: S<sub>1</sub>..... |
N<sub>12</sub>: | ................ | ..... `Shared2` ..... | ..... A subset of `Shared1`: S<sub>2</sub>..... |
Which value do we record as `overlapIndex`? the index where S<sub>1</sub> starts or the index where `Shared2` starts?
If we consider the index where S<sub>1</sub> starts. We will manage to properly access the descendants points of the node N. But, we won't be able to access to all the descendants points of the node N<sub>1</sub>, because points in S<sub>1</sub> won't be considered.
If we consider the index where `Shared2` starts, we will manage to properly access the descendants points of the node N. But, we won't be able to access to all the descendants points of the node N<sub>1</sub>, because points in S<sub>1</sub> won't be considered.
So, I don't think this can be easily implemented.
I can think of 2 alternative solutions:
+ Don't change anything, and add a note that `NumDescendants()` includes duplicated points. Also we should mencion that Spill Tree shouldn't be considered for `RangeSearch`.
+ Hold a list of indexes in each node, including non-leaf nodes. This would require O(n logn) space, comparing to current implementation which requires O(n) space.
---
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/747/files/a71b57caa90311f5542180bc0553449c3691395d#r74244401
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mailman.cc.gatech.edu/pipermail/mlpack-git/attachments/20160810/94046b3e/attachment.html>
More information about the mlpack-git
mailing list