[mlpack-git] [mlpack/mlpack] Refactor for faster assembly of secondHashTable. (#675)

Yannis Mentekidis notifications at github.com
Sat Jun 4 04:34:20 EDT 2016


If I understand correctly, you essentially reshape secondHashTable from secondHashSize x bucketSize to secondHashSize x maxBucketSize.

I'll run some tests to see if that improves search time.

I think there's still a drawback, though: if only a few hash codes are present, we'll have few buckets, all of them full. In that case we discard a lot of points due to capacity and still allocate the same amount of memory, since maxBucketSize = bucketSize.
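
To make the capacity drawback concrete, here is a minimal sketch of a dense, fixed-capacity table. `DenseBuckets` and its members are hypothetical names used only for illustration, not mlpack's actual implementation; the only details taken from the discussion are the fixed number of slots per bucket and the use of N as the empty-slot marker.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a dense secondHashTable (not mlpack's class).
struct DenseBuckets
{
  std::size_t capacity;   // Slots per bucket (plays the role of bucketSize).
  std::size_t sentinel;   // N: the value that marks an empty slot.
  std::vector<std::vector<std::size_t>> rows;  // rows[b][j]: point index or N.
  std::vector<std::size_t> used;               // Filled slots per bucket.
  std::size_t discarded;                       // Points dropped so far.

  DenseBuckets(std::size_t numBuckets, std::size_t slots, std::size_t n) :
      capacity(slots),
      sentinel(n),
      rows(numBuckets, std::vector<std::size_t>(slots, n)),
      used(numBuckets, 0),
      discarded(0)
  { }

  // Returns false (and counts a discard) when the bucket is already full.
  bool Insert(std::size_t bucket, std::size_t point)
  {
    assert(bucket < rows.size() && point < sentinel);
    if (used[bucket] == capacity)
    {
      ++discarded;
      return false;
    }
    rows[bucket][used[bucket]++] = point;
    return true;
  }
};
```

With 10 points that hash to only 2 codes and 3 slots per bucket, this sketch stores 6 points and discards 4, while the allocation stays at numBuckets x capacity regardless.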

One idea would be to change secondHashTable from arma::Mat to arma::SpMat. The problem is that we currently denote an empty bucket position by setting it to N, not 0, because 0 corresponds to point 0 in the reference set.
So if we switch to SpMat, we will need to change that notation everywhere, which will probably cause compatibility issues when reading saved models.
Another drawback is that we would still set a maximum number of points that secondHashTable stores per bucket, although that number could be set arbitrarily high because we would be more memory-efficient.
A third consideration is that Armadillo uses a compressed sparse column (not row) representation, so we would need to transpose everything that deals with secondHashTable in order to be efficient.
I started trying to do this, but I've spent 2 hours on it and I'm not done yet. It seemed like a quick and dirty trick, but it's quite complicated :(
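
The notation problem can be illustrated with a small sketch. A sparse matrix treats 0 as "nothing stored", so one possible convention (an assumption here, not something decided in this thread) is to store point i as i + 1 and let 0 mean empty. The helper names below are hypothetical, and `ConvertOldSlots` only illustrates the kind of translation that reading a model saved with the old convention would need.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Old convention (from the discussion): N marks an empty slot, 0 is point 0.
// Assumed new convention: 0 marks an empty slot, point i is stored as i + 1.
inline std::size_t EncodeSlot(std::size_t point) { return point + 1; }

inline bool IsEmptySlot(std::size_t stored) { return stored == 0; }

inline std::size_t DecodeSlot(std::size_t stored)
{
  assert(!IsEmptySlot(stored));
  return stored - 1;
}

// Translate slots written with the old convention (n = number of reference
// points) into the assumed new one.
std::vector<std::size_t> ConvertOldSlots(
    const std::vector<std::size_t>& oldSlots, std::size_t n)
{
  std::vector<std::size_t> converted;
  converted.reserve(oldSlots.size());
  for (std::size_t value : oldSlots)
  {
    assert(value <= n);
    converted.push_back(value == n ? 0 : EncodeSlot(value));
  }
  return converted;
}
```

Every read of the table would have to go through something like `DecodeSlot`, which is why the change touches so much code.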

An alternative would be to have a C++ array of std::vectors, where each vector holds the contents (point indices) of the corresponding bucket. This might require a little more refactoring, but since vectors use only the memory they need (plus whatever spare capacity they keep to amortize growth), I think we would be better off than with SpMat.
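
A minimal sketch of that alternative follows. `VectorBuckets` is a hypothetical name, and the sketch uses a std::vector of std::vectors rather than a raw array for simplicity. Note that no sentinel value and no per-bucket capacity is needed, because each bucket's size is simply the size of its vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bucket store: one growable vector of point indices per bucket.
class VectorBuckets
{
 public:
  explicit VectorBuckets(std::size_t numBuckets) : buckets(numBuckets) { }

  // Append a point to a bucket; nothing is ever discarded for capacity.
  void Insert(std::size_t bucket, std::size_t point)
  {
    assert(bucket < buckets.size());
    buckets[bucket].push_back(point);
  }

  // Read-only access to the indices stored in one bucket.
  const std::vector<std::size_t>& Bucket(std::size_t bucket) const
  {
    assert(bucket < buckets.size());
    return buckets[bucket];
  }

  // Total number of stored points across all buckets.
  std::size_t StoredPoints() const
  {
    std::size_t total = 0;
    for (const std::vector<std::size_t>& b : buckets)
      total += b.size();
    return total;
  }

 private:
  std::vector<std::vector<std::size_t>> buckets;
};
```

Empty buckets hold no elements at all, and a crowded bucket grows as needed, so the few-hash-codes case no longer forces points to be dropped.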

Let me know what you think.
