[mlpack-git] [mlpack/mlpack] Spill trees (#747)

Wed Aug 10 09:34:55 EDT 2016

> +  }
> +
> +  std::vector<size_t> leftPoints, rightPoints;
> +  // Split the node.
> +  overlappingNode = SplitPoints(tau, rho, points, leftPoints, rightPoints);
> +
> +  // We don't need the information in points, so lets clean it.
> +  std::vector<size_t>().swap(points);
> +
> +  // Now we will recursively split the children by calling their constructors
> +  // (which perform this splitting process).
> +  left = new SpillTree(this, leftPoints, tau, maxLeafSize, rho);
> +  right = new SpillTree(this, rightPoints, tau, maxLeafSize, rho);
> +
> +  // Update count number, to represent the number of descendant points.
> +  count = left->NumDescendants() + right->NumDescendants();

Hi,
I think this is a complicated problem. I have been thinking about it for a while.
The problem is this:
Suppose we have a overlapping node N with 2 children: N1 and N2.
As N1 and N2 share some data points, we know that:

   N<sub>1</sub>.NumDescendants() + N<sub>2</sub>.NumDescendants() > N.NumDescendants()

We want to avoid keeping a list of indexes for each non-leaf node. So, we want to access the:
`[0, N.NumDescendants())` set of descendants points of the node N, based on the methods: N1.Descendant() and N2.Descendant(), setting proper indexes.

For example, we could sort the points so shared points appear at the end of the list (with the greatest indexes), and we record the index `overlapIndex` where the set of duplicated points starts.
Let's call `Shared1` the set of points shared by N1 and N2:

  N<sub>1</sub>:   | ................ | ..... `Shared1` ..... |

  N<sub>2</sub>:   | ................ | ..... `Shared1` ..... |

So, if we want to access to `N.Descendant(i)`, we will do:

```
size_t Descendant(const size_t index) const
{
  if (IsLeaf())
    return (*pointsIndex)[index];
  size_t num = left->overlapIndex;
  if (index < num)
    return left->Descendant(index);
  if (right)
    return right->Descendant(index - num);
}
```

The problem is that this can not be maintained recursively.
Suppose we split N1 according to another splitting hyperplane, resulting in: N11 and N12.
Let's call `Shared2` the set of points shared by N11 and N12:

  N<sub>11</sub>:   | ................ | ..... `Shared2` ..... | ..... A subset of `Shared1`: S<sub>1</sub>..... |

  N<sub>12</sub>:   | ................ | ..... `Shared2` ..... | ..... A subset of `Shared1`: S<sub>2</sub>..... |

Which value do we record as `overlapIndex`? the index where S1 starts or the index where `Shared2` starts?

If we consider the index where S1 starts. We will manage to properly access the descendants points of the node N. But, we won't be able to access to all the descendants points of the node N1, because points in S1 won't be considered.

If we consider the index where `Shared2` starts, we will manage to properly access the descendants points of the node N. But, we won't be able to access to all the descendants points of the node N1, because points in S1 won't be considered.

So, I don't think this can be easily implemented.

I can think of 2 alternative solutions:
+ Don't change anything, and add a note that `NumDescendants()` includes duplicated points. Also we should mencion that Spill Tree shouldn't be considered for `RangeSearch`.
+ Hold a list of indexes in each node, including non-leaf nodes. This would require O(n  logn) space, comparing to current implementation which requires O(n) space.

---
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/747/files/a71b57caa90311f5542180bc0553449c3691395d#r74244401
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mailman.cc.gatech.edu/pipermail/mlpack-git/attachments/20160810/94046b3e/attachment.html>