# [mlpack-git] [mlpack/mlpack] Spill trees (#747)

Wed Aug 10 09:34:55 EDT 2016

```> +  }
> +
> +  std::vector<size_t> leftPoints, rightPoints;
> +  // Split the node.
> +  overlappingNode = SplitPoints(tau, rho, points, leftPoints, rightPoints);
> +
> +  // We don't need the information in points, so lets clean it.
> +  std::vector<size_t>().swap(points);
> +
> +  // Now we will recursively split the children by calling their constructors
> +  // (which perform this splitting process).
> +  left = new SpillTree(this, leftPoints, tau, maxLeafSize, rho);
> +  right = new SpillTree(this, rightPoints, tau, maxLeafSize, rho);
> +
> +  // Update count number, to represent the number of descendant points.
> +  count = left->NumDescendants() + right->NumDescendants();

Hi,
I think this is a complicated problem. I have been thinking about it for a while.
The problem is this:
Suppose we have a overlapping node N with 2 children: N<sub>1</sub> and N<sub>2</sub>.
As N<sub>1</sub> and N<sub>2</sub> share some data points, we know that:

N<sub>1</sub>.NumDescendants() + N<sub>2</sub>.NumDescendants() > N.NumDescendants()

We want to avoid keeping a list of indexes for each non-leaf node. So, we want to access the:
`[0, N.NumDescendants())` set of descendants points of the node N, based on the methods: N<sub>1</sub>.Descendant() and N<sub>2</sub>.Descendant(), setting proper indexes.

For example, we could sort the points so shared points appear at the end of the list (with the greatest indexes), and we record the index `overlapIndex` where the set of duplicated points starts.
Let's call `Shared1` the set of points shared by N<sub>1</sub> and N<sub>2</sub>:

N<sub>1</sub>:   | ................ | ..... `Shared1` ..... |

N<sub>2</sub>:   | ................ | ..... `Shared1` ..... |

```
size_t Descendant(const size_t index) const
{
if (IsLeaf())
return (*pointsIndex)[index];
size_t num = left->overlapIndex;
if (index < num)
return left->Descendant(index);
if (right)
return right->Descendant(index - num);
}
```

The problem is that this can not be maintained recursively.
Suppose we split N<sub>1</sub> according to another splitting hyperplane, resulting in: N<sub>11</sub> and N<sub>12</sub>.
Let's call `Shared2` the set of points shared by N<sub>11</sub> and N<sub>12</sub>:

N<sub>11</sub>:   | ................ | ..... `Shared2` ..... | ..... A subset of `Shared1`: S<sub>1</sub>..... |

N<sub>12</sub>:   | ................ | ..... `Shared2` ..... | ..... A subset of `Shared1`: S<sub>2</sub>..... |

Which value do we record as `overlapIndex`? the index where S<sub>1</sub> starts or the index where `Shared2` starts?

If we consider the index where S<sub>1</sub> starts. We will manage to properly access the descendants points of the node N. But, we won't be able to access to all the descendants points of the node N<sub>1</sub>, because points in S<sub>1</sub> won't be considered.

If we consider the index where `Shared2` starts, we will manage to properly access the descendants points of the node N. But, we won't be able to access to all the descendants points of the node N<sub>1</sub>, because points in S<sub>1</sub> won't be considered.

So, I don't think this can be easily implemented.

I can think of 2 alternative solutions:
+ Don't change anything, and add a note that `NumDescendants()` includes duplicated points. Also we should mencion that Spill Tree shouldn't be considered for `RangeSearch`.
+ Hold a list of indexes in each node, including non-leaf nodes. This would require O(n  logn) space, comparing to current implementation which requires O(n) space.

---
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/747/files/a71b57caa90311f5542180bc0553449c3691395d#r74244401
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mailman.cc.gatech.edu/pipermail/mlpack-git/attachments/20160810/94046b3e/attachment.html>
```