[mlpack-git] [mlpack/mlpack] Spill trees (#747)

Fri Aug 5 12:28:20 EDT 2016

> +  left->ParentDistance() = leftParentDistance;
> +  right->ParentDistance() = rightParentDistance;
> +}
> +
> +template<typename MetricType,
> +         typename StatisticType,
> +         typename MatType,
> +         template<typename HyperplaneMetricType> class HyperplaneType,
> +         template<typename SplitMetricType, typename SplitMatType>
> +             class SplitType>
> +bool SpillTree<MetricType, StatisticType, MatType, HyperplaneType, SplitType>::
> +    SplitPoints(const double tau,
> +                const double rho,
> +                const std::vector<size_t>& points,
> +                std::vector<size_t>& leftPoints,
> +                std::vector<size_t>& rightPoints)

I have a comment about this function.  I think that this function can be pretty slow because of the large number of allocations that may be incurred as `std::vector` resizes itself when `push_back()` is called.  Preventing many allocations was a big concern when I implemented sparse matrices for Armadillo, and the typical strategy looked like this:

 - take a first pass over the data, and determine how large the results need to be
 - take a second pass over the data and fill the results

So in this case, in the first pass you could calculate `leftSize`, `leftFrontierSize`, `rightSize`, and `rightFrontierSize`, and cache the values of the projections for each point.  In the second pass, you could just use those cached values to assign points to the correct side.  Let me know what you think.  I think for big datasets, it will make a difference.

---
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/747/files/a71b57caa90311f5542180bc0553449c3691395d#r73720820
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mailman.cc.gatech.edu/pipermail/mlpack-git/attachments/20160805/e3358dc8/attachment-0001.html>