[mlpack-git] [mlpack/mlpack] Refactor for faster assembly of secondHashTable. (#675)

Ryan Curtin notifications at github.com
Sun Jun 5 14:26:05 EDT 2016


Oh, right, I did not think about the fact that different buckets have different numbers of points in them!  Now that I think about it, perhaps `std::vector<size_t>*` is the right way to go (or actually maybe `std::vector<arma::Col<size_t>>`).

I think that we can have the best of both worlds if we do it like this:

 * Use `std::vector<arma::Col<size_t>>` for representing `secondHashTable` (this also avoids manual memory management, which is good: I am pretty sure your code had a subtle bug where the user could initialize the `LSHSearch` object without training, and then the destructor would still try to delete the `std::vector<size_t>*` object, which would cause a crash).

 * Before filling `secondHashTable`, calculate the size of each bin (the code I wrote does this), truncating each size to `bucketSize`.  Then we can allocate the exact correct size for each `arma::Col<size_t>` (and also allocate exactly the right number of `arma::Col<size_t>`s), and then fill them like your code does.

 * When the object is constructed, if `bucketSize = 0`, set `bucketSize = referenceSet.n_cols`.
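
The count-then-fill idea in the list above could look roughly like the sketch below. This is not mlpack's code: `BuildSecondHashTable` and `bucketOfPoint` are hypothetical names, `std::vector<std::size_t>` stands in for `arma::Col<size_t>` so the example compiles without Armadillo, and it assumes a bucket index has already been computed for every reference point. For simplicity it also keeps one entry per bucket, including empty ones.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of mlpack): builds a ragged second-level
// table from one precomputed bucket index per reference point.
// Assumes every value in bucketOfPoint is < numBuckets.
std::vector<std::vector<std::size_t>> BuildSecondHashTable(
    const std::vector<std::size_t>& bucketOfPoint,
    const std::size_t numBuckets,
    std::size_t bucketSize)
{
  // bucketSize == 0 is treated as "no limit": cap at the number of points.
  if (bucketSize == 0)
    bucketSize = bucketOfPoint.size();

  // Pass 1: count the points in each bucket, truncating at bucketSize.
  std::vector<std::size_t> counts(numBuckets, 0);
  for (const std::size_t b : bucketOfPoint)
    if (counts[b] < bucketSize)
      ++counts[b];

  // Allocate every bucket at exactly its final length.
  std::vector<std::vector<std::size_t>> table(numBuckets);
  for (std::size_t b = 0; b < numBuckets; ++b)
    table[b].resize(counts[b]);

  // Pass 2: write point indices into the preallocated buckets, skipping
  // points that arrive after their bucket is already full.
  std::vector<std::size_t> filled(numBuckets, 0);
  for (std::size_t i = 0; i < bucketOfPoint.size(); ++i)
  {
    const std::size_t b = bucketOfPoint[i];
    if (filled[b] < counts[b])
      table[b][filled[b]++] = i;
  }

  return table;
}
```

Because the table is held by value, an object that is never trained just owns an empty vector, so there is nothing for the destructor to delete.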

What do you think?  Would this work?  We would have to modify the serialization again, but I don't think we need to increment the version from 1 to 2, because we did not release mlpack with the previous serialization change (the change from `std::vector<arma::mat>` to `arma::cube`).  I was going to try to release mlpack 2.0.2 today, but if we are going to change serialization again I will wait; otherwise we will end up with more-complex-than-necessary legacy code to handle. :)

> I can't see your changes any more because there's something wrong with the commits

Yes, a force push reset the repository to the state it was in about 20 days ago, but I restored the current state earlier today.  It seems like the PR interface has not been updated though, so it still shows far more commits than it should.

---
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/675#issuecomment-223828754