<p>In <a href="https://github.com/mlpack/mlpack/pull/749#discussion_r76077253">src/mlpack/methods/lsh/lshmodel_impl.hpp</a>:</p>
<pre style='color:#555'>&gt; +  maxKValue = k;
&gt; +
&gt; +  // Save pointer to training set.
&gt; +  this-&gt;referenceSet = &amp;referenceSet;
&gt; +
&gt; +  // Step 1. Select a random sample of the dataset. We will work with only that
&gt; +  // sample.
&gt; +  arma::vec sampleHelper(referenceSet.n_cols, arma::fill::randu);
&gt; +
&gt; +  // Keep a sample of the dataset: We have uniformly random numbers in [0, 1],
&gt; +  // so we expect about N*sampleRate of them to be in [0, sampleRate).
&gt; +  arma::mat sampleSet = referenceSet.cols(
&gt; +        arma::find(sampleHelper &lt; sampleRate));
&gt; +  // Shuffle to be impartial (in case dataset is sorted in some way).
&gt; +  sampleSet = arma::shuffle(sampleSet);
&gt; +  const size_t numSamples = sampleSet.n_cols; // Points in sampled set.
</pre>
<p>I think it's without replacement: I generate one uniform random number in [0, 1] per column and then threshold at the sample rate, which gives a vector of booleans. I keep only the columns (that is, points) that have "true" in the corresponding vector position, so each point can be selected at most once:<br>
In MATLAB-style pseudocode it would be:</p>

<div class="highlight highlight-source-matlab"><pre>sampleRate = <span class="pl-c1">0.3</span>;
referenceSet = [
<span class="pl-c1">1</span> <span class="pl-c1">3</span> <span class="pl-c1">5</span> <span class="pl-c1">7</span>;
<span class="pl-c1">2</span> <span class="pl-c1">4</span> <span class="pl-c1">6</span> <span class="pl-c1">8</span>;
]
sampleHelper = [<span class="pl-c1">0.1</span> <span class="pl-c1">0.3</span> <span class="pl-c1">0.7</span> <span class="pl-c1">0.05</span>];
sampleHelper = sampleHelper <span class="pl-k">&lt;</span> sampleRate;
<span class="pl-c">% So here sampleHelper = [1 0 0 1] (0.3 is not strictly below 0.3)</span>
sampleSet = referenceSet(:, sampleHelper);
<span class="pl-c">% and therefore sampleSet = [1 7; 2 8] - only columns 1 and 4</span></pre></div>
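
<p>For reference, here is a minimal standalone C++ sketch of the same thresholding idea, using only the standard library instead of Armadillo. The helper name <code>BernoulliSample</code> is hypothetical and not part of the PR or of mlpack. It is only meant to show that each column index gets exactly one draw, so no index can appear twice, while the number of kept indices is itself random (about N*sampleRate on average).</p>

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not part of mlpack): keep index i when its own uniform
// draw falls below sampleRate. Every index is tested exactly once, so the
// result contains no duplicates and is already in increasing order.
std::vector<std::size_t> BernoulliSample(const std::size_t n,
                                         const double sampleRate,
                                         std::mt19937& rng)
{
  assert(sampleRate >= 0.0 && sampleRate <= 1.0);

  std::uniform_real_distribution<double> uniform(0.0, 1.0);
  std::vector<std::size_t> kept;
  for (std::size_t i = 0; i < n; ++i)
  {
    // One draw per index: this is the "sampleHelper < sampleRate" test.
    if (uniform(rng) < sampleRate)
      kept.push_back(i);
  }
  return kept;
}
```

<p>The returned indices would then select the columns of the reference set. Note that the sample size is not fixed with this approach, which is one difference from drawing an exact number of distinct indices.</p>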

<p>Is there something I don't see here?</p>

<p>I didn't know about <code>ObtainDistinctSamples()</code>. I think it will make the code cleaner, so I'll refactor to use it instead.</p>
