<div dir="ltr"><div>Hello,<br><br>Thank you for the clarifications; I managed to run specific tests after that.<br><br></div><div>I'll look at Datar's paper and your comment, and hopefully I'll be able to have a draft idea about a testing procedure soon. Should I get back to you with it to discuss it further, or should I just start coding away and see how it goes?<br><br></div><div>I believe the dataset used for this shouldn't matter that much, so we probably will not need to include a different one. I'll probably toy around with the iris data for now.<br><br></div><div>Thank you very much for the input!<br><br></div><div>Yannis<br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Mar 8, 2016 at 2:41 PM, Ryan Curtin <span dir="ltr"><<a href="mailto:ryan@ratml.org" target="_blank">ryan@ratml.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Mon, Mar 07, 2016 at 01:03:15PM +0000, Yannis Mentekidis wrote:<br>
> Hello again,<br>
><br>
> sorry for the late reply<br>
><br>
> Yes, definitely other forms of LSH or improvement to the existing LSH<br>
> > implementation would be a good way to go. Dong's paper is where LSHKIT<br>
> > came from, right? If so it might be interesting to implement that, but<br>
> > only if we are pretty sure we can outperform LSHKIT (i.e. if we can make<br>
> > a faster implementation).<br>
><br>
> From what I can recall, LSHKIT is quite well optimized, so it might be a<br>
> hard task to outperform it. I'm not sure if it has been parallelized,<br>
> though, so that might be a way to achieve that (if you don't consider that<br>
> cheating :P).<br>
<br>
</span>Parallelization is fine; I don't think LSHKIT is parallel. OpenMP could<br>
be a good way to do that, since there will probably be a lot of sections<br>
that could be embarrassingly parallel. I see that LSHKIT uses floats<br>
internally, so if we test the two implementations against each other<br>
we'll probably want to use floats for mlpack's test (instead of the<br>
default double type that's used for most matrices).<br>
<span class=""><br>
> > One thing I found with other LSH implementations was that most of them<br>
> > are a little difficult to work with... a user is most likely going to<br>
> > want to say "I want k approximate nearest neighbors with a desired<br>
> > maximum of e% relative error", so they specify k and e and then the<br>
> > algorithm returns results with the desired error bound. But some LSH<br>
> > packages do not offer this kind of interface and can be very hard to<br>
> > work with (E2LSH is one example).<br>
><br>
> Yes, that is true, most have a large number of parameters that people not<br>
> familiar with the algorithm might not understand at first. That's why I<br>
> proposed implementing something similar to Dong's paper, since they propose<br>
> a way to identify sets of parameters that will yield results near some<br>
> specified recall/selectivity.<br>
> Of course this is probably going to be slower than the naive kNN<br>
> implementation, but it is reasonable to run the tuning program once for a<br>
> given dataset, produce favorable sets of parameters, and store these for<br>
> further runs of LSH.<br>
> An alternative would be developing some heuristic that maps neighbor error<br>
> rates to algorithm parameters. This might be tricky though, since error<br>
> depends highly on the dataset. It would be an interesting endeavour though.<br>
<br>
</span>Do you know if anyone has tried to map neighbor error rates to algorithm<br>
parameters? It would be interesting to do.<br>
<br>
I think that Dong's paper would definitely be nice to implement and add<br>
to mlpack, specifically because of the large number of parameters and the<br>
auto-tuning.<br>
<span class=""><br>
> > Here are some interesting runtime comparisons to LSHKIT that I did last<br>
> > year (on page 9 and 10); I compared some tree-based algorithms and then<br>
> > also multiprobe LSH:<br>
> ><br>
> > <a href="http://ratml.org/pub/pdf/2015faster.pdf" rel="noreferrer" target="_blank">http://ratml.org/pub/pdf/2015faster.pdf</a><br>
> ><br>
> > I think that probably LSH would be faster in higher dimensions, but that<br>
> > wasn't the region of interest for the study I linked to. Also that<br>
> > paper might be a decent place to start reading about dual-tree nearest<br>
> > neighbor search (and the papers that paper references might be<br>
> > interesting as well).<br>
><br>
> I've put this paper on my reading list, as well as "An Optimal Algorithm<br>
> for Approximate Nearest Neighbor Searching in Fixed Dimensions" which you<br>
> linked to in another mail. I believe I will be able to read them by the<br>
> time I start writing my proposal.<br>
<br>
</span>Great! I think the Arya paper is really nice; I enjoyed reading it.<br>
Let me know if there's anything in the "Faster dual-tree traversal for<br>
nearest neighbor search" paper that I can clarify.<br>
<span class=""><br>
> > You're more than welcome to submit two proposals, but often it's better<br>
> > to focus on just one, since we can only accept one proposal per student.<br>
> > :)<br>
> ><br>
> Then I will probably stick with one proposal and try to make it as good as<br>
> possible :)<br>
><br>
> ** Regarding LSH Tests:<br>
> The "impossibly hard" label makes this look scary, admittedly :P<br>
> I'll look at the test and current LSH code and see what I can think of, in<br>
> the meantime could you provide some help regarding where I can find more<br>
> documentation for the test function? I haven't figured out how to run only<br>
> specific tests instead of all 616 of them.<br>
<br>
</span>Ah, yeah, it's the Boost Unit Test Framework, so their command-line<br>
options are documented here:<br>
<a href="http://www.boost.org/doc/libs/1_46_1/libs/test/doc/html/utf/user-guide/runtime-config/reference.html" rel="noreferrer" target="_blank">http://www.boost.org/doc/libs/1_46_1/libs/test/doc/html/utf/user-guide/runtime-config/reference.html</a><br>
<br>
You could run just the LSHTest suite like this:<br>
<br>
$ bin/mlpack_test -t LSHTest<br>
<br>
or you could run just one single test case from the LSHTest suite like<br>
this:<br>
<br>
$ bin/mlpack_test -t LSHTest/LSHSearchTest<br>
<span class=""><br>
> At the moment, an idea I have is this:<br>
> A typical way to test how "good" approximate algorithms are is against some<br>
> base case, so for example I've often used the SIFT dataset (Evaluation of<br>
> Approximate nearest neighbors: large datasets<br>
</span>> <<a href="http://corpus-texmex.irisa.fr/" rel="noreferrer" target="_blank">http://corpus-texmex.irisa.fr/</a>>) to calculate metrics like selectivity and<br>
<span class="">> recall.<br>
> Of course this doesn't check if the implementation is correct, but it<br>
> produces results we can evaluate as reasonable or unreasonable.<br>
><br>
> So maybe an approach would be, have our LSH implementation run a few times<br>
> with a few different parameters we've selected, and see how close these<br>
> results get to some benchmark we know to be correct? Problems with this<br>
> idea:<br>
> a) That requires packing a benchmark file and the SIFT dataset along with the<br>
> rest of the software.<br>
> b) We need benchmarks, which we could get by either using some<br>
> implementation we know to be good - for example, LSHKit - or by running the<br>
> LSH implementation now and then re-run it every time we change the code.<br>
><br>
> Is this what you had in mind when you mentioned probabilistic tests?<br>
<br>
</span>Yeah, I think approach (a) is probably the right way to do this. I<br>
don't think it's realistic to do an exact correctness check like the<br>
test that Pari originally wrote, because that depends on the random seed<br>
and the specific implementation, and we can't guarantee (and shouldn't<br>
guarantee) that that will never change. So maybe the better approach<br>
here is to find a dataset, find some LSH parameters, calculate the<br>
theoretical approximation guarantee for how correct the results should<br>
be, and then ensure that the implementation satisfies that guarantee.<br>
<br>
I haven't thought too much about that, but I think maybe the theoretical<br>
guarantee would be something like the parameters to Theorem 1 in Datar's<br>
paper:<br>
<br>
<a href="http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/p253-datar.pdf" rel="noreferrer" target="_blank">http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/p253-datar.pdf</a><br>
<br>
The weird thing is that the mlpack implementation performs actual<br>
approximate nearest neighbor search, but their theory is for solving<br>
(R, c)-NN, which I think (correct me if I'm wrong) is just the task of<br>
reporting a point within distance cR of the query point. So with<br>
properly set c, R, p_1, and p_2, I think maybe we can just run LSH and<br>
then check to see that the nearest neighbors returned are within<br>
distance cR.<br>
<br>
We probably don't want to package a very large dataset though; maybe<br>
there is already a dataset in src/mlpack/tests/data/ that is suitable?<br>
<br>
<a href="https://github.com/mlpack/mlpack/tree/master/src/mlpack/tests" rel="noreferrer" target="_blank">https://github.com/mlpack/mlpack/tree/master/src/mlpack/tests</a><br>
<br>
For approach (b), I think actually a way to do that would be with the<br>
automatic benchmarking system found at<br>
<br>
<a href="https://github.com/zoq/benchmarks/" rel="noreferrer" target="_blank">https://github.com/zoq/benchmarks/</a><br>
<br>
which displays some results you can look through at<br>
<br>
<a href="http://www.mlpack.org/benchmark.html" rel="noreferrer" target="_blank">http://www.mlpack.org/benchmark.html</a><br>
<br>
The system can display metrics like recall, precision, etc.; try<br>
selecting "Metric plots for an algorithm/parameter/dataset combination",<br>
then "PERCEPTRON" for the method, "-i 1000" for the parameters, and the<br>
"iris" dataset to see what I mean. There is a lot more information<br>
there too, but it can be hard to find a compelling visualization in all<br>
of the noise. I think the perceptron iris graph looks good. :)<br>
<br>
Anyway, it would be really great to have this issue solved; I've never<br>
had the time to look into it enough.<br>
<br>
Thanks!<br>
<span class="HOEnZb"><font color="#888888"><br>
Ryan<br>
<br>
--<br>
Ryan Curtin | "Somebody dropped a bag on the sidewalk."<br>
<a href="mailto:ryan@ratml.org">ryan@ratml.org</a> | - Kit<br>
</font></span></blockquote></div><br></div>