[mlpack-git] [mlpack/mlpack] Refactor for faster assembly of secondHashTable. (#675)

Yannis Mentekidis notifications at github.com
Sun Jun 5 14:17:35 EDT 2016


> the major change of your code is that bucketSize is ignored (and std::vector<size_t> is used to avoid allocating more space than necessary).

Not exactly, though ignoring it is a side effect. The major change is that buckets can now have variable lengths: one bucket can be very large without forcing every other bucket to grow with it, as happened when bucketSize was set to (for example) 5000-6000. I don't think we can do that with a monolithic matrix, because a matrix has to be rectangular.

> Do you think we should remove the bucketSize parameter? I think that it is helpful to keep un-tuned LSH from running for too long, and the user can always specify bucketSize to be equal to the size of the reference set. (We could change behavior so that if the user specifies bucketSize = 0 then we take bucketSize = referenceSet.n_cols, that would not be an issue.)

I agree, we shouldn't take away the user's ability to set a maximum capacity for the buckets. My implementation ignores bucketSize only because I wanted to demonstrate the idea quickly; the final code should definitely work as you say: keep a default bucketSize, and if the user specifies 0, simply store all points.

> The other concern is that if we want to compress secondHashTable after we build it, the best way is to take one pass over the data and calculate the size (like in my changes here), then allocate the right amount of space. If we are doing that, then we can just use arma::Mat<size_t> instead of std::vector<size_t>* in order to allocate space only once, instead of possibly many times with std::vector, so it should be faster and also use less or the same amount of memory. What do you think?

I can't see your changes any more because something is wrong with the commits (it looks like there are 100 commits and 74 files changed, but I can't find the changes you made in lsh_search_impl.hpp). It is true that even an empty std::vector costs some memory and some time to allocate, but I believe the memory we save by not wasting it on empty buckets makes up for it. Of course, if we can avoid both it would be great, but I'm not sure how :/
Another concern with std::vector is memory locality: if each vector's buffer ends up scattered around RAM, cache misses add overhead. On the other hand, we don't access the buckets sequentially, so there is already some jumping around being done.

---
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/675#issuecomment-223828300
