[mlpack] GSoC MLPack - Approximate Nearest Neighbor Search
Ryan Curtin
ryan at ratml.org
Mon Mar 21 10:33:28 EDT 2016
On Mon, Mar 21, 2016 at 08:21:27AM -0300, Marcos Pividori wrote:
> Dear all,
>
> I am Marcos Pividori, a Computer Science student from the National
> University of Rosario, Argentina ([1]). I am currently finishing my degree.
> (5 year long degree)
> I am writing because I am really interested in working this GSoC for mlpack.
Hi Marcos,
Thanks for getting in touch.
> I have been reading the papers recommended for this topic. I think I have a
> general idea of the main methods for the Approximate Nearest Neighbor
> problem.
> Also, I have installed the mlpack library and looked at the main code
> implementation.
> So, I should split the project with these main goals:
> + Implementation of approximate nearest neighbor search.
> + Implementation of a command line tool (simple task).
> + Comparisons between mlpack implementation and similar libraries, and
> writing of proper documentation.
>
> About the implemenation of ann: do you have a preference on which method to
> use? Should we implement many of these different methods and compare their
> results?
> I have found many different options in the literature:
> + "An optimal algorithm for ANN Searching in Fixed dimensions" proposes
> to use BBD-Trees for the computation, and it is implemented in the ANN
> library. But I couldn't find more information on BBD trees. ANN library
> also uses KD-Trees.
> + "An Investigation of Practical Approximate Nearest Neighbor Algorithms"
> proposes to use Spill-Trees (maybe Hybrid) with overlapping in points
> separation.
> + "Fast approximate nearest neighbors with automatic configuration"
> reviews and compares the state of the art in ann. It mentioned the
> randomized kd-trees and k-means clustering trees as the most promising
> methods according to their results. FLANN library implements many of these
> methods.
> + scikit-learn library implements Locality-Sensitive Hashing for ann.
mlpack has an LSH implementation already (see src/mlpack/methods/lsh/).
The easiest (and best, in my opinion) route to a good approximate
nearest neighbor search algorithm is to extend the existing dual-tree
algorithm to be an approximate algorithm instead of an exact algorithm.
If you read the paper "Tree-independent dual-tree algorithms" you will
see that the pruning rule for dual-tree nearest neighbor search is of
the form
"prune if d_min(N_q, N_r) > B(N_q)"
but this can easily be made approximate by a change to something like
"if d_min(N_q, N_r) > B(N_q) + epsilon". This would yield a pretty
simple change (or generalization) of the algorithm that's implemented.
But I don't think that's enough work for the whole summer, so one of the
papers you referenced might be worth implementing.
I might consider implementing spill trees and then the defeatist search
used by that algorithm.
Then you could spend the rest of the time comparing and benchmarking
implementations.
I hope this is helpful.
Thanks,
Ryan
--
Ryan Curtin | "He takes such wonderful pictures with
ryan at ratml.org | his paws." - Radio man
More information about the mlpack
mailing list