<div dir="ltr">Hi Ryan,<div><br><div>Thanks for clarifying things for me. It&#39;s good to know there are techniques like FastMKS.</div></div><div><br></div><div>The main idea of that paper I linked is to reduce the cost of computing similarity. It does not build any kind of index structures. I think it might be used with other algorithms that utilize index structures.</div><div><br></div><div>Anyway, I will get started with mlpack first and then run the benchmarks. I&#39;ll also read through the paper on FastMKS.<br></div><div><br></div><div>Since I&#39;ll not be able to work full time during May~July, I should contribute on github. Hope to discuss with you when I encounter problems :)</div><div><br></div><div>Regards</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Mar 1, 2016 at 10:30 PM, Ryan Curtin <span dir="ltr">&lt;<a href="mailto:ryan@ratml.org" target="_blank">ryan@ratml.org</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Tue, Mar 01, 2016 at 03:12:07PM +0800, Shaoxiang Chen wrote:<br>

&gt; Hi everyone,<br>

&gt;<br>

&gt; I am Shaoxiang Chen, a CS student(junior) of Fudan University, China. I&#39;m<br>

&gt; currently working with a professor, studying in the areas of multimedia<br>

&gt; content analysis and machine(deep) learning.<br>

&gt;<br>

&gt; I&#39;ve read a research paper about accelerating approximate nearest neighbour<br>

&gt; search when the distance metric of cosine similarity is used. It achieved<br>

&gt; decent speed up by quantizing vectors into binary form, making the<br>

&gt; computation of distance orders of magnitude faster.<br>

&gt;<br>

&gt; I see that cosine similarity is not included in neither flann nor mlpack&#39;s<br>

&gt; distance metrics. For one thing, it is not directly applicable to the space<br>

&gt; partitioning tree structures(am I right?). And second, if all the vectors<br>

&gt; are l2-normed, cosine similarity directly corresponds to l2 distance.<br>

&gt;<br>

&gt; I&#39;ve coded the proposed method myself and got decent speed up. The higher<br>

&gt; dimension, the larger speed up. While the method itself limits its use<br>

&gt; cases to high dimensional data(otherwise precision drops), I think it might<br>

&gt; serve as a supplement for tree structures when dealing with high<br>

&gt; dimensional data.<br>

&gt;<br>

&gt; I haven&#39;t run a benchmark of this my code against other ann search<br>

&gt; algorithms, but the paper includes some benchmarks.<br>

&gt;<br>

&gt; If after you&#39;ve read the paper you think it&#39;s ok to include this algorithm<br>

&gt; in mlpack&#39;s neighbour-search, I&#39;d like to contribute and discuss :)<br>

&gt;<br>

&gt; link to the paper:<br>

&gt; <a href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40572.pdf" rel="noreferrer" target="_blank">http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40572.pdf</a><br>

<br>

</span>Hi Shaoxiang,<br>

<br>

Technically the cosine similarity is not a metric, so it can&#39;t be used<br>

out of the box.  I think there are ways you can modify the cosine<br>

similarity so that it can be used with trees, like for instance the<br>

angular cosine similarity.<br>

<br>

Another technique that can be used for fast cosine similarity search is<br>

&quot;fast max-kernel search&quot;, or FastMKS, a technique that mlpack provides<br>

(though its implementation gives exact results instead of approximate<br>

results).<br>

<br>

I think it might be interesting to compare your implementation with the<br>

FastMKS implementation in mlpack, and maybe some other implementations,<br>

to see where it gives speedup.  And if it does give speedup for a good<br>

variety of datasets and situations, then we should include it if you are<br>

willing to take the time to incorporate it (once it&#39;s tested,<br>

documented, optimized, etc.). :)<br>

<br>

Another thought is that the Hamming distance is a metric, so that&#39;s an<br>

easier comparison: you could implement the Hamming distance in<br>

src/mlpack/core/metrics/, then use it with the existing nearest neighbor<br>

search techniques, and compare.  (Again the existing techniques are<br>

exact, so that may make the comparison a little bit difficult, but it<br>

would still be interesting, I think.)<br>

<br>

If you are interested, here&#39;s a paper on FastMKS:<br>

<br>

<a href="http://www.ratml.org/pub/pdf/2013fastmks.pdf" rel="noreferrer" target="_blank">http://www.ratml.org/pub/pdf/2013fastmks.pdf</a><br>

<br>

When I have a chance, I&#39;ll try and read through the paper you<br>

linked---it looks to be interesting.<br>

<br>

This could be a good project for GSoC: implement the algorithm, test it,<br>

run benchmarks, and so forth.  So please do feel free to consider this<br>

work as a summer-long project.<br>

<br>

Let me know if I can clarify anything.<br>

<br>

Thanks!<br>

<span class="HOEnZb"><font color="#888888"><br>

Ryan<br>

<br>

--<br>

Ryan Curtin    | &quot;Hi Doggy.&quot;<br>

<a href="mailto:ryan@ratml.org">ryan@ratml.org</a> |   - Johnny<br>

</font></span></blockquote></div><br></div>