<div dir="ltr"><div>Hello,<br><br>Thank you for the clarifications, I managed to run specific tests after that.<br><br></div><div>I&#39;ll look at Datar&#39;s paper and your comment, and hopefully I&#39;ll be able to have a draft idea about a testing procedure soon. Should I get back at you with it to discuss it further, or should I just start coding away and see how it goes?<br><br></div><div>I believe the dataset used for this shouldn&#39;t matter that much, so we probably will not need to include a different one. I&#39;ll probably toy around with the iris data for now.<br><br></div><div>Thank you very much for the input!<br><br></div><div>Yannis<br></div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Mar 8, 2016 at 2:41 PM, Ryan Curtin <span dir="ltr">&lt;<a href="mailto:ryan@ratml.org" target="_blank">ryan@ratml.org</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Mon, Mar 07, 2016 at 01:03:15PM +0000, Yannis Mentekidis wrote:<br>

&gt; Hello again,<br>

&gt;<br>

&gt; sorry for the late reply<br>

&gt;<br>

&gt; Yes, definitely other forms of LSH or improvement to the existing LSH<br>

&gt; &gt; implementation would be a good way to go.  Dong&#39;s paper is where LSHKIT<br>

&gt; &gt; came from, right?  If so it might be interesting to implement that, but<br>

&gt; &gt; only if we are pretty sure we can outperform LSHKIT (i.e. if we can make<br>

&gt; &gt; a faster implementation).<br>

&gt;<br>

&gt; From what I can recall, LSHKIT is quite well optimized, so it might be a<br>

&gt; hard task to outperform it. I&#39;m not sure if it has been parallelized,<br>

&gt; though, so that might be a way to achieve that (if you don&#39;t consider that<br>

&gt; cheating :P).<br>

<br>

</span>Parallelization is fine, I don&#39;t think LSHKIT is parallel.  OpenMP could<br>

be a good way to do that, since there will probably be a lot of sections<br>

that could be embarrassingly parallel.  I see that LSHKIT uses floats<br>

internally, so if we test the two implementations against each other<br>

we&#39;ll probably want to use floats for mlpack&#39;s test (instead of the<br>

default double type that&#39;s used for most matrices).<br>

<span class=""><br>

&gt; &gt; One thing I found with other LSH implementations was that most of them<br>

&gt; &gt; are a little difficult to work with... a user is most likely going to<br>

&gt; &gt; want to say &quot;I want k approximate nearest neighbors with a desired<br>

&gt; &gt; maximum of e% relative error&quot;, so they specify k and e and then the<br>

&gt; &gt; algorithm returns results with the desired error bound.  But some LSH<br>

&gt; &gt; packages do not offer this kind of interface and can be very hard to<br>

&gt; &gt; work with (E2LSH is one example).<br>

&gt;<br>

&gt; Yes, that is true, most have a big number of parameters that people not<br>

&gt; familiar with the algorithm might not understand at first. That&#39;s why I<br>

&gt; proposed implementing something similar to Dong&#39;s paper, since they propose<br>

&gt; a way to identify sets of parameters that will yield results near some<br>

&gt; specified recall/selectivity.<br>

&gt; Of course this is probably going to be slower than the naive kNN<br>

&gt; implementation, but it is reasonable to run the tuning program once for a<br>

&gt; given dataset, produce favorable sets of parameters, and store these for<br>

&gt; further runs of LSH.<br>

&gt; An alternative would be developing some heuristic that maps neighbor error<br>

&gt; rates to algorithm parameters. This might be tricky though, since error<br>

&gt; depends highly on the dataset. It would be an interesting endeavour though.<br>

<br>

</span>Do you know if anyone has tried to map neighbor error rates to algorithm<br>

parameters?  It would interesting to do.<br>

<br>

I think that Dong&#39;s paper would definitely be nice to implement and add<br>

to mlpack, specifically because of the big number of parameters and the<br>

auto-tuning.<br>

<span class=""><br>

&gt; &gt; Here are some interesting runtime comparisons to LSHKIT that I did last<br>

&gt; &gt; year (on page 9 and 10); I compared some tree-based algorithms and then<br>

&gt; &gt; also multiprobe LSH:<br>

&gt; &gt;<br>

&gt; &gt; <a href="http://ratml.org/pub/pdf/2015faster.pdf" rel="noreferrer" target="_blank">http://ratml.org/pub/pdf/2015faster.pdf</a><br>

&gt; &gt;<br>

&gt; &gt; I think that probably LSH would be faster in higher dimensions, but that<br>

&gt; &gt; wasn&#39;t the region of interest for the study I linked to.  Also that<br>

&gt; &gt; paper might be a decent place to start reading about dual-tree nearest<br>

&gt; &gt; neighbor search (and the papers that paper references might be<br>

&gt; &gt; interesting as well).<br>

&gt;<br>

&gt; I&#39;ve put this paper on my reading list,  as well as &quot;An Optimal Algorithm<br>

&gt; for Approximate Nearest Neighbor Searching in Fixed Dimensions&quot; which you<br>

&gt; linked to in another mail. I believe I will be able to read them by the<br>

&gt; time I start writing my proposal.<br>

<br>

</span>Great!  I think the Arya paper is really nice; I enjoyed reading it.<br>

Let me know if there&#39;s anything in the &quot;Faster dual-tree traversal for<br>

nearest neighbor search&quot; paper that I can clarify.<br>

<span class=""><br>

&gt; &gt; You&#39;re more than welcome to submit two proposals, but often it&#39;s better<br>

&gt; &gt; to focus on just one, since we can only accept one proposal per student.<br>

&gt; &gt; :)<br>

&gt; &gt;<br>

&gt; Then I will probably stick with one proposal and try to make it as good as<br>

&gt; possible :)<br>

&gt;<br>

&gt; ** Regarding LSH Tests:<br>

&gt; The &quot;impossibly hard&quot; label makes this look scary, admittedly :P<br>

&gt; I&#39;ll look at the test and current LSH code and see what I can think of, in<br>

&gt; the meantime could you provide some help regarding where I can find more<br>

&gt; documentation for the test function? I haven&#39;t figured out how to run only<br>

&gt; specific tests instead of all 616 of them.<br>

<br>

</span>Ah, yeah, it&#39;s the Boost Unit Test Framework, so their command-line<br>

options are documented here:<br>

<a href="http://www.boost.org/doc/libs/1_46_1/libs/test/doc/html/utf/user-guide/runtime-config/reference.html" rel="noreferrer" target="_blank">http://www.boost.org/doc/libs/1_46_1/libs/test/doc/html/utf/user-guide/runtime-config/reference.html</a><br>

<br>

You could run just the LSHTest suite like this:<br>

<br>

$ bin/mlpack_test -t LSHTest<br>

<br>

or you could run just one single test case from the LSHTest suite like<br>

this:<br>

<br>

$ bin/mlpack_test -t LSHTest/LSHSearchTest<br>

<span class=""><br>

&gt; At the moment, an idea I have is this:<br>

&gt; A typical way to test how &quot;good&quot; approximate algorithms are is against some<br>

&gt; base case, so for example I&#39;ve often used the SIFT dataset (Evaluation of<br>

&gt; Approximate nearest neighbors: large datasets<br>

</span>&gt; &lt;<a href="http://corpus-texmex.irisa.fr/" rel="noreferrer" target="_blank">http://corpus-texmex.irisa.fr/</a>&gt;) to calculate metrics like selectivity and<br>

<span class="">&gt; recall.<br>

&gt; Of course this doesn&#39;t check if the implementation is correct, but it<br>

&gt; produces results we can evaluate as reasonable or unreasonable.<br>

&gt;<br>

&gt; So maybe an approach would be, have our LSH implementation run a few times<br>

&gt; with a few different parameters we&#39;ve selected, and see how close these<br>

&gt; results get to some benchmark we know to be correct? Problems with this<br>

&gt; idea:<br>

&gt; a) That requires packing a benchmarks file and SIFT dataset along with the<br>

&gt; rest of the software.<br>

&gt; b) We need benchmarks, which we could get by either using some<br>

&gt; implementation we know to be good - for example, LSHKit - or by running the<br>

&gt; LSH implementation now and then re-run it every time we change the code.<br>

&gt;<br>

&gt; Is this what you had in mind when you mentioned probabilistic tests?<br>

<br>

</span>Yeah, I think approach (a) is probably the right way to do this.  I<br>

don&#39;t think it&#39;s realistic to do an exact correctness check like the<br>

test that Pari originally wrote, because that depends on the random seed<br>

and the specific implementation, and we can&#39;t guarantee (and shouldn&#39;t<br>

guarantee) that that will never change.  So maybe the better approach<br>

here is to find a dataset, find some LSH parameters, calculate the<br>

theoretical approximation guarantee for how correct the results should<br>

be, and then ensure that the implementation satisfies that guarantee.<br>

<br>

I haven&#39;t though too much about that, but I think maybe the theoretical<br>

guarantee would be something like the parameters to Theorem 1 in Datar&#39;s<br>

paper:<br>

<br>

<a href="http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/p253-datar.pdf" rel="noreferrer" target="_blank">http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/p253-datar.pdf</a><br>

<br>

The weird thing is that the mlpack implementation performs actual<br>

approximate nearest neighbor search, but their theory is for solving<br>

(R, c)-NN, which I think (correct me if I&#39;m wrong) is just the task of<br>

repoting a point within the distance cR from the query point.  So with<br>

properly set c, R, p_1, and p_2, I think maybe we can just run LSH and<br>

then check to see that the nearest neighbors returned are withing<br>

distance cR.<br>

<br>

We probably don&#39;t want to package a very large dataset though; maybe<br>

there is already a dataset in src/mlpack/tests/data/ that is suitable?<br>

<br>

<a href="https://github.com/mlpack/mlpack/tree/master/src/mlpack/tests" rel="noreferrer" target="_blank">https://github.com/mlpack/mlpack/tree/master/src/mlpack/tests</a><br>

<br>

For approach (b), I think actually a way to do that would be with the<br>

automatic benchmarking system found at<br>

<br>

<a href="https://github.com/zoq/benchmarks/" rel="noreferrer" target="_blank">https://github.com/zoq/benchmarks/</a><br>

<br>

which displays some results you can look through at<br>

<br>

<a href="http://www.mlpack.org/benchmark.html" rel="noreferrer" target="_blank">http://www.mlpack.org/benchmark.html</a><br>

<br>

The system can display metrics like recall, precision, etc.; try<br>

selecting &quot;Metric plots for an algorithm/parameter/dataset combination&quot;,<br>

then &quot;PERCEPTRON&quot; for the method, &quot;-i 1000&quot; for the parameters, and the<br>

&quot;iris&quot; dataset to see what I mean.  There is a lot more information<br>

there too, but it can be hard to find a compelling visualization in all<br>

of the noise.  I think the perceptron iris graph looks good. :)<br>

<br>

Anyway, it would be really great to have this issue solved; I&#39;ve never<br>

had the time to look into it enough.<br>

<br>

Thanks!<br>

<span class="HOEnZb"><font color="#888888"><br>

Ryan<br>

<br>

--<br>

Ryan Curtin    | &quot;Somebody dropped a bag on the sidewalk.&quot;<br>

<a href="mailto:ryan@ratml.org">ryan@ratml.org</a> |   - Kit<br>

</font></span></blockquote></div><br></div>