[mlpack-git] [mlpack/mlpack] Modeling LSH For Performance Tuning (#749)

Wed Aug 24 11:40:11 EDT 2016

> +
> +    // Reference set for kNN
> +    arma::mat refMat = sampleSet.cols(refSetStart, refSetEnd);
> +    referenceSizes(i) = refMat.n_cols;
> +
> +    arma::Mat<size_t> neighbors; // Not going to be used but required.
> +    arma::mat kNNDistances; // What we need.
> +    KNN naive(refMat, true); // true: train and use naive kNN.
> +    naive.Search(queryMat, k, neighbors, kNNDistances);
> +
> +    // Store the squared distances (what we need).
> +    kNNDistances = arma::pow(kNNDistances, 2);
> +
> +    // Compute Arithmetic and Geometric mean of the distances.
> +    Ek.row(i) = arma::mean(kNNDistances.t());
> +    Gk.row(i) = arma::exp(arma::mean(arma::log(kNNDistances.t()), 0));

Here's the cause of the L_BFGS -NaN values:
I compute the logarithm of the kNN distances, always assuming that no point has a neighbor at distance 0. With duplicate points that assumption does not hold, and log(0) evaluates to -inf.
The iris.csv dataset that's included in mlpack has some duplicates:
```
5.8,2.7,5.1,1.9 # repeated twice
4.9,3.1,1.5,0.1 # repeated three times
```
Running `$ sort iris.csv | uniq -c | awk '{print $1}' | sort | uniq` prints `1 2 3` (one count per line), meaning every row occurs once, twice, or three times and nothing is repeated more often than that.

I think the correct approach here is to simply disregard 0-distances completely, by resizing the kNNDistances matrix to only hold positive entries.
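
To make the idea concrete, here is a minimal sketch of the filtering step. It is not the code from the PR. It uses plain `std::vector` instead of Armadillo so that it stands alone, and `Means`, `PositiveMeans`, and `NaiveGeometricMean` are hypothetical names introduced only for illustration. Returning NaN when no positive distance is left is also an assumption; the real code would need to decide how to handle that case.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Arithmetic and geometric mean of the strictly positive entries only.
struct Means
{
  double arithmetic;
  double geometric;
  std::size_t used; // Number of positive distances that contributed.
};

// Hypothetical helper (not part of mlpack): skips zero distances, which
// come from duplicate points, so that log() never sees 0.
Means PositiveMeans(const std::vector<double>& squaredDistances)
{
  double sum = 0.0, logSum = 0.0;
  std::size_t used = 0;
  for (const double d : squaredDistances)
  {
    if (d > 0.0)
    {
      sum += d;
      logSum += std::log(d);
      ++used;
    }
  }

  // Assumption: signal "no usable distances" with NaN.
  if (used == 0)
  {
    const double nan = std::numeric_limits<double>::quiet_NaN();
    return { nan, nan, 0 };
  }
  return { sum / used, std::exp(logSum / used), used };
}

// Geometric mean without filtering, mirroring exp(mean(log(x))).
// A single zero makes the log sum -inf, so the result is 0.
double NaiveGeometricMean(const std::vector<double>& squaredDistances)
{
  double logSum = 0.0;
  for (const double d : squaredDistances)
    logSum += std::log(d);
  return std::exp(logSum / squaredDistances.size());
}
```

For squared distances `{0, 1, 4}`, the unfiltered geometric mean is 0, while the filtered version uses two entries and gives an arithmetic mean of 2.5 and a geometric mean of 2. Note that filtering changes how many distances each mean is taken over, so anything downstream that assumes exactly k neighbors per point would have to account for that.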

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/749/files/57c9d5e634d7d3d7e2ca1618353fe37d9e23b34a#r76081866