[mlpack-git] [mlpack/mlpack] Hilbert R tree (#664)

Ryan Curtin notifications at github.com
Mon Jun 13 10:15:12 EDT 2016


I think I should have responded to this earlier, sorry for the delay.  Here is what I understand of the most recent set of changes: all of the dual-tree algorithms are refactored to handle trees that do not hold just one dataset matrix (this is represented by `TreeTraits<TreeType>::HasDataset`).  But the primary difficulty here is that it is no longer clear what indices should be returned by the dual-tree algorithms, since there are many different matrices where the datasets can be stored.  That means that we must create a mapping from descendant ID to actual vector, which is done for the `RectangleTree`.  All of this refactoring means that now the required `BaseCase()` and `Score()` functions must take extra parameters, some of which (like the query map) are not used when `HasDataset` is false.

I think I had said previously, that the idea of a local dataset would be nice, because it would allow users to insert and delete points without too much trouble.  The crux of that argument was that `Mat<eT>::insert_cols()` could take a very long time for a large dataset, especially if one point was being inserted.  But if instead we stored a large number of small matrices, then `insert_cols()` would be short.  From that perspective, the idea of local datasets makes sense.

But if we now consider all of the other changes that have to be made to make the idea work, like you did in your refactoring, I have to say, my opinion is changed---this makes it a lot harder to write dual-tree algorithms since there is much more to consider.  This is a big problem, because dual-tree algorithms are already very complex to write and it's important that we keep the `BaseCase()` and `Score()` functions as simple as possible to implement (and even then, right now they are not very simple!)  I think that we can avoid the original problem (`insert_cols()` takes a long time) by providing an `Insert()` method that can take many points at once, so `insert_cols()` only needs to be called once.  This still means `Insert()` is slower than possible, but I think that is okay.  When we insert a new point into the matrix, its index can be the largest index in the matrix (i.e. we insert a point at the end), and this is no problem because the RectangleTree does not rearrange the point
 s in the matrix during construction (maybe that would be an interesting optimization for later, but it makes it less clear how we add new points and index them, so I am not sure it is worth considering now).

I know that you have spent a lot of time on this, so I am sorry to disagree with your refactoring and wish I had developed my opinions before you did this.  I apologize for that.  What do you think, design-wise?  Do you agree with my thoughts, or have a better idea?

---
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/664#issuecomment-225593575
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mailman.cc.gatech.edu/pipermail/mlpack-git/attachments/20160613/ec492b0c/attachment.html>


More information about the mlpack-git mailing list