[mlpack-git] [mlpack/mlpack] Modeling LSH For Performance Tuning (#749)
Yannis Mentekidis
notifications at github.com
Wed Aug 24 11:40:11 EDT 2016
> +
> + // Reference set for kNN
> + arma::mat refMat = sampleSet.cols(refSetStart, refSetEnd);
> + referenceSizes(i) = refMat.n_cols;
> +
> + arma::Mat<size_t> neighbors; // Not going to be used but required.
> + arma::mat kNNDistances; // What we need.
> + KNN naive(refMat, true); // true: train and use naive kNN.
> + naive.Search(queryMat, k, neighbors, kNNDistances);
> +
> + // Store the squared distances (what we need).
> + kNNDistances = arma::pow(kNNDistances, 2);
> +
> + // Compute Arithmetic and Geometric mean of the distances.
> + Ek.row(i) = arma::mean(kNNDistances.t());
> + Gk.row(i) = arma::exp(arma::mean(arma::log(kNNDistances.t()), 0));
Here's the cause of the L_BFGS -NaN values:
I compute the logarithm of the kNN distances, always assuming that no point has a distance of 0. When the dataset contains duplicate points, that assumption does not hold.
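To illustrate the failure mode, here is a minimal standalone sketch. It uses `std::vector` instead of Armadillo so that it compiles on its own, and the helper name `GeometricMeanViaLog` is hypothetical, not part of mlpack. It mirrors the `exp(mean(log(x)))` expression in the diff. A single zero distance drives the log-sum to negative infinity, so the geometric mean collapses to 0. Exactly which later step of the optimization turns that into NaN is an assumption here; the product `0 * log(0)` is just one example of such a step.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper: geometric mean computed as exp(mean(log(x))),
// the same form as the Gk expression in the diff.
double GeometricMeanViaLog(const std::vector<double>& values)
{
  double logSum = 0.0;
  for (const double v : values)
    logSum += std::log(v); // std::log(0.0) evaluates to -infinity.

  // If any entry was 0, logSum is -infinity and std::exp returns 0.
  return std::exp(logSum / values.size());
}
```

For example, `GeometricMeanViaLog({0.0, 1.0, 4.0})` evaluates to 0, and multiplying that result by its own logarithm gives NaN under IEEE 754 arithmetic.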
The iris.csv dataset that's included in mlpack has some duplicates:
```
5.8,2.7,5.1,1.9 # repeated twice
4.9,3.1,1.5,0.1 # repeated three times
```
Running `sort iris.csv | uniq -c | awk '{print $1}' | sort | uniq` prints `1 2 3`, meaning no row is repeated more than three times.
I think the correct approach here is to disregard zero distances completely, by resizing the kNNDistances matrix so that it holds only positive entries.
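A minimal sketch of that idea follows, again with `std::vector` rather than Armadillo so it stands alone. The names `DistanceMeans` and `MeansOfPositiveDistances` are hypothetical, and the sketch handles one set of squared distances at a time (for instance, the distances to the k-th neighbor across all queries). Returning zeros when no positive entry remains is an assumption of this sketch, not something the PR specifies.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type: both means plus the number of entries used.
struct DistanceMeans
{
  double arithmetic;
  double geometric;
  std::size_t used;
};

// Hypothetical helper: compute the arithmetic and geometric means over the
// strictly positive entries only, skipping the zero distances that
// duplicate points produce.
DistanceMeans MeansOfPositiveDistances(
    const std::vector<double>& squaredDistances)
{
  double sum = 0.0;
  double logSum = 0.0;
  std::size_t count = 0;

  for (const double d : squaredDistances)
  {
    if (d > 0.0)
    {
      sum += d;
      logSum += std::log(d); // Safe: d is strictly positive.
      ++count;
    }
  }

  // Assumption: with no positive entries, report zeros instead of dividing
  // by zero.
  if (count == 0)
    return {0.0, 0.0, 0};

  return {sum / count, std::exp(logSum / count), count};
}
```

Note that filtering changes how many entries contribute to each mean, so the two means are taken over the same reduced set. Whether the arithmetic mean should also ignore the zeros is a design choice; this sketch follows the suggestion above and drops them from both.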
--
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/749/files/57c9d5e634d7d3d7e2ca1618353fe37d9e23b34a#r76081866