<div dir="ltr">Hi Ryan,<div><br><div>Thanks for clarifying things for me. It's good to know there are techniques like FastMKS.</div></div><div><br></div><div>The main idea of the paper I linked is to reduce the cost of computing similarity; it does not build any kind of index structure. I think it could be combined with other algorithms that do use index structures.</div><div><br></div><div>Anyway, I will get started with mlpack first and then run the benchmarks. I'll also read through the paper on FastMKS.<br></div><div><br></div><div>Since I won't be able to work full time from May to July, I'll contribute on GitHub. I hope to discuss with you when I run into problems :)</div><div><br></div><div>Regards</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Mar 1, 2016 at 10:30 PM, Ryan Curtin <span dir="ltr"><<a href="mailto:ryan@ratml.org" target="_blank">ryan@ratml.org</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span class="">On Tue, Mar 01, 2016 at 03:12:07PM +0800, Shaoxiang Chen wrote:<br>
> Hi everyone,<br>
><br>
> I am Shaoxiang Chen, a junior CS student at Fudan University, China. I'm<br>
> currently working with a professor, studying multimedia<br>
> content analysis and machine (deep) learning.<br>
><br>
> I've read a research paper about accelerating approximate nearest neighbour<br>
> search when cosine similarity is used as the distance measure. It achieved<br>
> a decent speedup by quantizing vectors into binary form, making the<br>
> computation of distances orders of magnitude faster.<br>
><br>
> I see that cosine similarity is included in neither FLANN's nor mlpack's<br>
> distance metrics. For one thing, it is not directly applicable to the<br>
> space-partitioning tree structures (am I right?). And second, if all the<br>
> vectors are l2-normalized, cosine similarity directly corresponds to l2<br>
> distance.<br>
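[A quick numerical check of that equivalence; this is an illustrative sketch in plain Python, not mlpack or FLANN code:]

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(v):
    """Scale v to unit l2 norm."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

a = l2_normalize([3.0, 1.0, 2.0])
b = l2_normalize([1.0, 4.0, 1.0])

cos_sim = dot(a, b)  # cosine similarity of unit vectors is just the dot product
l2_sq = sum((x - y) ** 2 for x, y in zip(a, b))  # squared l2 distance

# For unit vectors, ||a - b||^2 = 2 - 2*cos(a, b), so ranking neighbours by
# l2 distance is identical to ranking them by cosine similarity.
assert abs(l2_sq - (2.0 - 2.0 * cos_sim)) < 1e-12
```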
><br>
> I've coded the proposed method myself and got a decent speedup: the higher<br>
> the dimension, the larger the speedup. While the method itself limits its<br>
> use cases to high-dimensional data (otherwise precision drops), I think it<br>
> might serve as a supplement to tree structures when dealing with<br>
> high-dimensional data.<br>
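[One common way to quantize vectors into binary codes for approximate cosine search is sign random projection; this sketch is a generic illustration of that idea, not the specific quantizer from the linked paper:]

```python
import math
import random

random.seed(0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sign_bits(v, planes):
    """Quantize v to one bit per random hyperplane: 1 if v lies on its positive side."""
    bits = 0
    for p in planes:
        bits = (bits << 1) | (1 if dot(v, p) >= 0 else 0)
    return bits

def hamming(x, y):
    """XOR plus popcount: orders of magnitude cheaper than a float dot product."""
    return bin(x ^ y).count("1")

dim, n_bits = 16, 256
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
a = [random.gauss(0, 1) for _ in range(dim)]
b = [random.gauss(0, 1) for _ in range(dim)]

# The expected fraction of differing bits equals angle(a, b) / pi, so the
# Hamming distance between codes approximates the cosine/angular similarity.
angle = math.acos(dot(a, b) / math.sqrt(dot(a, a) * dot(b, b)))
estimate = hamming(sign_bits(a, planes), sign_bits(b, planes)) / n_bits
assert abs(estimate - angle / math.pi) < 0.15  # rough agreement with 256 bits
```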
><br>
> I haven't benchmarked my code against other ANN search<br>
> algorithms, but the paper includes some benchmarks.<br>
><br>
> If, after you've read the paper, you think it's OK to include this<br>
> algorithm in mlpack's neighbour search, I'd like to contribute and discuss :)<br>
><br>
> link to the paper:<br>
> <a href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40572.pdf" rel="noreferrer" target="_blank">http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/40572.pdf</a><br>
<br>
</span>Hi Shaoxiang,<br>
<br>
Technically the cosine similarity is not a metric, so it can't be used<br>
out of the box. I think there are ways you can modify the cosine<br>
similarity so that it can be used with trees; for instance, the<br>
angular distance derived from it.<br>
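[To make that concrete: the angular distance, arccos of the cosine similarity, does satisfy the triangle inequality, unlike 1 - cosine. This is an illustrative sketch, not mlpack's API:]

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def angular_distance(a, b):
    """arccos of the cosine similarity; a proper metric on nonzero vectors."""
    cos = dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp for floating-point safety

a, b, c = [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]
# Triangle inequality holds (here with equality, since b bisects a and c):
assert angular_distance(a, c) <= angular_distance(a, b) + angular_distance(b, c) + 1e-12
```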
<br>
Another technique that can be used for fast cosine similarity search is<br>
"fast max-kernel search", or FastMKS, a technique that mlpack provides<br>
(though its implementation gives exact results instead of approximate<br>
results).<br>
<br>
I think it might be interesting to compare your implementation with the<br>
FastMKS implementation in mlpack, and maybe some other implementations,<br>
to see where it gives speedup. And if it does give speedup for a good<br>
variety of datasets and situations, then we should include it if you are<br>
willing to take the time to incorporate it (once it's tested,<br>
documented, optimized, etc.). :)<br>
<br>
Another thought is that the Hamming distance is a metric, so that's an<br>
easier comparison: you could implement the Hamming distance in<br>
src/mlpack/core/metrics/, then use it with the existing nearest neighbor<br>
search techniques, and compare. (Again, the existing techniques are<br>
exact, so that may make the comparison a little difficult, but it<br>
would still be interesting, I think.)<br>
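[As a sanity check of that idea, sketched in Python rather than the C++ an actual metric class in src/mlpack/core/metrics/ would need: the Hamming distance over bit-packed codes is just XOR plus popcount, and it satisfies all the metric axioms.]

```python
def hamming(x, y):
    """Hamming distance between two bit-packed codes: count of differing bits."""
    return bin(x ^ y).count("1")

a, b, c = 0b1011, 0b1110, 0b0110
assert hamming(a, a) == 0                               # identity
assert hamming(a, b) == hamming(b, a)                   # symmetry
assert hamming(a, c) <= hamming(a, b) + hamming(b, c)   # triangle inequality
```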
<br>
If you are interested, here's a paper on FastMKS:<br>
<br>
<a href="http://www.ratml.org/pub/pdf/2013fastmks.pdf" rel="noreferrer" target="_blank">http://www.ratml.org/pub/pdf/2013fastmks.pdf</a><br>
<br>
When I have a chance, I'll try to read through the paper you<br>
linked---it looks interesting.<br>
<br>
This could be a good project for GSoC: implement the algorithm, test it,<br>
run benchmarks, and so forth. So please do feel free to consider this<br>
work as a summer-long project.<br>
<br>
Let me know if I can clarify anything.<br>
<br>
Thanks!<br>
<span class="HOEnZb"><font color="#888888"><br>
Ryan<br>
<br>
--<br>
Ryan Curtin | "Hi Doggy."<br>
<a href="mailto:ryan@ratml.org">ryan@ratml.org</a> | - Johnny<br>
</font></span></blockquote></div><br></div>