[mlpack-git] (blog) master: Adds Yannis Week Eight (960cec3)

Tue Jul 19 10:26:35 EDT 2016

Repository : https://github.com/mlpack/blog
On branch  : master
Link       : https://github.com/mlpack/blog/compare/471f931e36ef68961685ce1fa8a3f7591da15bc6...960cec39f75bd612d437914a2bc395aaa2434f8f

>---------------------------------------------------------------

commit 960cec39f75bd612d437914a2bc395aaa2434f8f
Author: Yannis Mentekidis <mentekid at gmail.com>
Date:   Tue Jul 19 15:26:35 2016 +0100

    Adds Yannis Week Eight


>---------------------------------------------------------------

960cec39f75bd612d437914a2bc395aaa2434f8f
 content/blog/YannisWeekEight.md | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/content/blog/YannisWeekEight.md b/content/blog/YannisWeekEight.md
new file mode 100644
index 0000000..d7874b9
--- /dev/null
+++ b/content/blog/YannisWeekEight.md
@@ -0,0 +1,25 @@
+Title: Some thoughts on the LSH Model
+Date: 2016-07-19 20:20:20
+Tags: gsoc, lsh, modeling
+Author: Yannis Mentekidis
+
+As I've discussed before, LSH has one major problem - a large number of parameters which are difficult to tune. Other approximate nearest neighbors algorithms accept a relative error $\epsilon$ and, through theoretical guarantees, return points that are less than (1+$\epsilon$) times further from the nearest neighbors.
+
+As opposed to that, LSH asks the user to specify a number of hash tables (L), a code length (M), a window size (W) and now a number of additional probes (T). Of course, literature provides explanations towards how these affect performance and time cost, but still they are not as clear.
+
+The idea behind the LSH model is to, eventually, produce two functions:
+
+Recall(Dataset, L, M, W, T)
+
+and
+
+Selectivity(Dataset, L, M, W, T)
+
+which will (hopefully in an efficient manner) estimate the Recall and Selectivity of an LSH run with parameters (L, M, W, T) on the dataset.
+
+To do that, the authors define a statistical expression for Recall(.) and Selectivity(.) and model the data distribution and especially the distribution of distances between points.
+
+If (hopefully) the functions are convex, then we can use binary search to get to their minima efficiently, thus producing a favorable parameter set for LSH.
+
+The statistical models that come from the authors' analysis are quite complex and require a lot of stuff to be done. I have begun by implementing an algorithm that fits the parameters of a gamma distribution to a dataset, and creating a checklist of what needs to be done next.
+