[mlpack-git] [mlpack/mlpack] Modeling LSH For Performance Tuning (#749)

Fri Aug 26 16:53:33 EDT 2016

> +  Timer::Start("pairwise_distances");
> +  arma::vec distances(numSamples * (numSamples - 1) / 2);
> +  size_t d = 0; // Index of where to store next.
> +  for (size_t i = 0; i < numSamples; ++i)
> +    for (size_t j = i + 1; j < numSamples; ++j)
> +      distances(d++) = metric::EuclideanDistance::Evaluate(
> +          sampleSet.unsafe_col(i), sampleSet.unsafe_col(j));
> +  Log::Info << "Computed " << d << " pointwise distances." << std::endl;
> +  Timer::Stop("pairwise_distances");
> +
> +  // Step 3. Estimate statistics of these distances: log(mean(d)), mean(log(d)),
> +  // mean(d).
> +  distances = arma::pow(distances, 2);
> +  this->meanDist = arma::mean(distances);
> +  this->logMeanDist = std::log(meanDist);
> +  this->meanLogDist = arma::mean(arma::log(distances));

Even if you replace the geometric mean calculation below, this still causes problems in `Predict()`, because any zero values passed to the Gamma distribution training will cause it to fail, which makes sense because the PDF of a Gamma distribution (with any parameters) is always 0 at x = 0.

So basically what we have come upon here is a flaw in the paper: this entire strategy of modeling the squared distances as a gamma distribution is only applicable if there are no duplicate points in the dataset.  If we wanted this LSH tuning model to be able to handle duplicate points, we would need to pick a new distribution that handled zero distances and rederive everything in the paper.  But... that's not really what I want to do (if you want to feel free)...

So instead it seems like our best option is just to find a hack.  Here is the idea I went with to make my little test program work: take the smallest nonzero distance from `distances`, and use some small factor of that to replace all the 0s:

```
// Find the smallest nonzero distance.
double smallest = DBL_MAX;
for (size_t i = 0; i < distances.n_elem; ++i)
  if (distances[i] > 0.0 && distances[i] < DBL_MAX)
    smallest = distances[i];

// Replace any zero distances, since they break the gamma distribution.
for (size_t i = 0; i < distances.n_elem; ++i)
  if (distances[i] == 0.0)
    distances[i] = 1e-5 * smallest; // 1e-5 selected arbitrarily.
```

Now the distances are all nonzero, though we have probably affected the trained distribution somewhat.  It might be useful to output a warning if there are any duplicate points in the distances, noting that this affects the correctness of the predictions.  (I'm not sure how it affects the correctness.)

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/749/files/a0626a8c41fedc60ec255ef2939a519dfac5b83a#r76486391
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mailman.cc.gatech.edu/pipermail/mlpack-git/attachments/20160826/f2330b89/attachment.html>