[mlpack-git] [mlpack/mlpack] Refactor for faster assembly of secondHashTable. (#675)

Ryan Curtin notifications at github.com
Sun Jun 5 13:37:22 EDT 2016


> I think there's still a drawback in this though - if only a few hash codes are present then we'll have few buckets, all of them full. That way we discard a lot of points due to capacity and still allocate the same size since maxBucketSize = bucketSize.

I agree; we can definitely end up with a case where we discard a lot of points.  I wasn't trying to fix that situation here, because I'm not 100% sure what the best option is: keeping all of the points in every bucket can lead to very long runtimes (because of very large buckets, like you saw with the Corel data).

> A third consideration is armadillo uses compressed sparse column (not row) representation, so we'll need to transpose anything that has to do with secondHashTable in order to be more efficient.

I actually think we should transpose how `secondHashTable` is stored; I think it is being accessed by rows.  To me it would make the most sense to have the points in each bin stored in a column, not in a row as it is now.
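
For illustration, here is a minimal sketch of the difference (the sizes and names are made up, not taken from the actual code):

```c++
#include <armadillo>

int main()
{
  // Made-up sizes, just for illustration: 500 bins, up to 32 points per bin.
  const size_t numBins = 500, bucketSize = 32;

  // Current layout (roughly): one row per bin.  Gathering the points of a bin
  // strides across memory, since Armadillo stores matrices column-major.
  arma::Mat<size_t> rowWise(numBins, bucketSize, arma::fill::zeros);
  arma::Row<size_t> binFromRow = rowWise.row(10); // non-contiguous accesses

  // Transposed layout: one column per bin.  The points of a bin are then
  // contiguous in memory, and the column can even be used in place.
  arma::Mat<size_t> colWise(bucketSize, numBins, arma::fill::zeros);
  arma::Col<size_t> binFromCol = colWise.col(10); // contiguous accesses

  return 0;
}
```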

> Here's what I have in mind regarding an std::vector<size_t> array. 

I took a look through the code and the benchmarks that you sent.  I noticed that the recall was much higher in the `std::vector<size_t>` implementation, but the runtime tended to be much higher for the larger datasets (though it was lower for small datasets).  This makes sense: the existing approach spends extra time calculating the size of the minimal second hash table before filling it, whereas your approach skips that step but can leave many empty `std::vector<size_t>`s lying around.

It seems like our changes are actually a bit orthogonal: my idea with this PR was to take the existing approach (with the existing bucket size limitation enforced by `bucketSize`) and produce the same results without changing any behavior.  But I think that your changes address a different problem: specifically, that there are some situations where we discard massive numbers of points that fall into a single bucket.  If I have not misunderstood, the major change of your code is that `bucketSize` is ignored (and `std::vector<size_t>` is used to avoid allocating more space than necessary).
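
If I have the right picture, the core of it is roughly this (a simplified sketch with made-up names, not your actual code):

```c++
#include <cstddef>
#include <vector>

// Simplified sketch (names are made up): one std::vector<size_t> per possible
// second-level hash code.  Nothing is discarded because there is no bucketSize
// cap, but many of these vectors may remain empty.
void FillSecondHashVectors(const std::vector<size_t>& secondHashCodes,
                           const size_t secondHashSize,
                           std::vector<std::vector<size_t>>& secondHashVectors)
{
  secondHashVectors.clear();
  secondHashVectors.resize(secondHashSize);
  for (size_t point = 0; point < secondHashCodes.size(); ++point)
    secondHashVectors[secondHashCodes[point]].push_back(point);
}
```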

Do you think we should remove the `bucketSize` parameter?  I think that it is helpful for keeping un-tuned LSH from running for too long, and the user can always specify `bucketSize` to be equal to the size of the reference set.  (We could change the behavior so that if the user specifies `bucketSize = 0`, then we take `bucketSize = referenceSet.n_cols`; that would not be an issue.)
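
Something as simple as this is what I have in mind (just a sketch of the proposed convention, not existing code):

```c++
#include <armadillo>

// Proposed convention (not implemented): bucketSize == 0 means "no limit",
// i.e., the effective cap is the number of reference points.
inline size_t EffectiveBucketSize(const size_t bucketSize,
                                  const arma::mat& referenceSet)
{
  return (bucketSize == 0) ? referenceSet.n_cols : bucketSize;
}
```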

The other concern is that if we want to compress `secondHashTable` after we build it, the best way is to take one pass over the data to calculate the sizes (like in my changes here) and then allocate the right amount of space.  If we are doing that, then we can just use `arma::Mat<size_t>` instead of `std::vector<size_t>*`, so that we allocate space only once instead of possibly many times with `std::vector`; it should be faster and use the same amount of memory or less.  What do you think?
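
Roughly, I mean something like this (a rough sketch, ignoring the `bucketSize` cap and other details of the actual implementation; the names are made up):

```c++
#include <armadillo>

// Count-then-allocate sketch: one pass to find each bucket's size, a single
// allocation, then one pass to fill the buckets.
void BuildSecondHashTable(const arma::Col<size_t>& secondHashCodes,
                          const size_t secondHashSize,
                          arma::Mat<size_t>& secondHashTable,
                          arma::Col<size_t>& bucketContentSize)
{
  // First pass: count how many points fall into each bucket.
  bucketContentSize.zeros(secondHashSize);
  for (size_t point = 0; point < secondHashCodes.n_elem; ++point)
    ++bucketContentSize[secondHashCodes[point]];

  // Allocate exactly once: one column per bucket (see the transposition
  // discussion above), sized to the largest bucket.
  const size_t maxBucketSize = bucketContentSize.max();
  secondHashTable.set_size(maxBucketSize, secondHashSize);

  // Second pass: fill the buckets.
  arma::Col<size_t> filled(secondHashSize, arma::fill::zeros);
  for (size_t point = 0; point < secondHashCodes.n_elem; ++point)
  {
    const size_t code = secondHashCodes[point];
    secondHashTable(filled[code]++, code) = point;
  }
}
```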

---
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/675#issuecomment-223826259

