[mlpack-git] [mlpack/mlpack] Modeling LSH For Performance Tuning (#749)

Yannis Mentekidis notifications at github.com
Wed Aug 24 11:18:18 EDT 2016


> +  maxKValue = k;
> +
> +  // Save pointer to training set.
> +  this->referenceSet = &referenceSet;
> +
> +  // Step 1. Select a random sample of the dataset. We will work with only that
> +  // sample.
> +  arma::vec sampleHelper(referenceSet.n_cols, arma::fill::randu);
> +
> +  // Keep a sample of the dataset: We have uniformly random numbers in [0, 1],
> +  // so we expect about N*sampleRate of them to be in [0, sampleRate).
> +  arma::mat sampleSet = referenceSet.cols(
> +        arma::find(sampleHelper < sampleRate));
> +  // Shuffle to be impartial (in case dataset is sorted in some way).
> +  sampleSet = arma::shuffle(sampleSet);
> +  const size_t numSamples = sampleSet.n_cols; // Points in sampled set.

I think it's without replacement: I generate one uniform random number in [0, 1] per point and threshold at the sample rate, which gives a vector of booleans. I keep only the columns (i.e., points) whose entry in that vector is "true", so no point can be selected more than once.
In MATLAB-style pseudocode it would be:
```MATLAB
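sampleRate = 0.3;
referenceSet = [
1 3 5 7;
2 4 6 8;
];
sampleHelper = [0.1 0.3 0.7 0.05];
% Keep the points whose random number falls below the sample rate
% (same comparison as in the C++ code: sampleHelper < sampleRate).
sampleHelper = sampleHelper < sampleRate;
% So here sampleHelper = [1 0 0 1]
sampleSet = referenceSet(:, sampleHelper);
% and therefore sampleSet = [1 7; 2 8] - only columns 1 and 4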
```
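
For reference, here is a minimal C++ sketch of the same thresholding idea, using only the standard library rather than Armadillo. The helper name `ThresholdSample` and its parameters are made up for illustration; they are not part of the PR or of mlpack. It returns the kept column indices instead of copying columns, and it shows why no index can appear twice.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not part of mlpack): returns the indices of the points
// kept by thresholding one uniform draw per point at sampleRate.
std::vector<std::size_t> ThresholdSample(const std::size_t numPoints,
                                         const double sampleRate,
                                         std::mt19937& rng)
{
  assert(sampleRate >= 0.0 && sampleRate <= 1.0);
  std::uniform_real_distribution<double> uniform(0.0, 1.0);

  std::vector<std::size_t> kept;
  for (std::size_t i = 0; i < numPoints; ++i)
  {
    // Each index is visited exactly once, so it can be kept at most once:
    // this is sampling without replacement.
    if (uniform(rng) < sampleRate)
      kept.push_back(i);
  }

  // Indices come out in increasing order; the number kept is random, with
  // expected value numPoints * sampleRate.
  return kept;
}
```

Note that the size of the sample is itself random with this approach; only its expectation is fixed by the sample rate.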

Is there something I don't see here?

I didn't know about `ObtainDistinctSamples()`. I think it will make the code cleaner, so I'll refactor to use it instead.
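
To illustrate the difference from thresholding, the sketch below draws a fixed number of distinct indices by shuffling and truncating. It is a standard-library stand-in only: `DistinctIndices` is a hypothetical name, and this is not mlpack's `ObtainDistinctSamples()` nor a claim about how that function is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical stand-in (not mlpack's ObtainDistinctSamples()): draws exactly
// numSamples distinct indices from the range [0, numPoints).
std::vector<std::size_t> DistinctIndices(const std::size_t numPoints,
                                         const std::size_t numSamples,
                                         std::mt19937& rng)
{
  assert(numSamples <= numPoints);

  // Fill with 0, 1, ..., numPoints - 1, then shuffle uniformly at random.
  std::vector<std::size_t> indices(numPoints);
  std::iota(indices.begin(), indices.end(), std::size_t(0));
  std::shuffle(indices.begin(), indices.end(), rng);

  // The first numSamples entries of a permutation are distinct by
  // construction, and they are already in random order.
  indices.resize(numSamples);
  return indices;
}
```

Unlike the threshold approach, the sample size here is exact rather than approximately N*sampleRate, and no separate shuffle of the sampled points is needed.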

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/749/files/57c9d5e634d7d3d7e2ca1618353fe37d9e23b34a#r76077253