<p>I think my issue was that I had to specify <code>HAS_OPENMP=YES</code>.  So now I have it running in parallel.</p>
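<p>(For reference, a minimal sketch of the configuration step, assuming a fresh build directory with the mlpack source one level up; <code>mlpack_lsh</code> as a make target is my assumption based on the binary name:)</p>

<pre><code># Enable OpenMP at configure time, then build the LSH program.
$ cmake -DHAS_OPENMP=YES ../
$ make mlpack_lsh
</code></pre>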

<p>Some scaling tests on my machine (i7-3770; 4 cores, 8 hardware threads), for a few datasets.  Here I tested with the corel, phy, and miniboone datasets, looking only at the <code>computing_neighbors</code> timer.  The outside-of-mlpack load average of the system was about 1.8, so you can assume that roughly two cores were already busy.  I tested with a couple of underlying LAPACK/BLAS variants.</p>
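<p>(For readability, here is the one-liner used in each run below, unrolled; it runs <code>mlpack_lsh</code> with 1 through 8 OpenMP threads and extracts the <code>computing_neighbors</code> timer from the verbose output:)</p>

<pre><code># Time k=3 LSH nearest neighbor search at each thread count.
for t in 1 2 3 4 5 6 7 8; do
  OMP_NUM_THREADS=$t bin/mlpack_lsh \
      -q ~/datasets/corel.csv -r ~/datasets/corel.csv \
      -v -k 3 -d d.csv -n n.csv \
    | grep computing_neighbors \
    | awk -F':' '{ print $2 }' \
    | sed "s/^/$t threads:/"
done
</code></pre>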

<p>With ATLAS (3.10.2-9+b1):</p>

<pre><code>-&lt; ryan@zax &gt;&lt; ~/src/mlpack-mentekid/build-nodebug &gt;&lt; 24 &gt;-
-&lt; 14:16:51 &gt;- $ for t in 1 2 3 4 5 6 7 8; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/corel.csv -r ~/datasets/corel.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 7.910973s
2 threads: 4.195096s
3 threads: 3.297027s
4 threads: 2.180582s
5 threads: 2.102062s
6 threads: 1.936919s
7 threads: 1.958664s
8 threads: 1.713584s
</code></pre>

<pre><code>-&lt; ryan@zax &gt;&lt; ~/src/mlpack-mentekid/build-nodebug &gt;&lt; 26 &gt;-
-&lt; 14:23:39 &gt;- $ for t in 1 2 3 4 5 6 7 8; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/phy.csv -r ~/datasets/phy.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 83.253656s (1 mins, 23.2 secs)
2 threads: 42.035172s
3 threads: 29.575169s
4 threads: 23.218118s
5 threads: 20.773536s
6 threads: 18.055156s
7 threads: 17.681238s
8 threads: 17.888020s
</code></pre>

<pre><code>-&lt; ryan@zax &gt;&lt; ~/src/mlpack-mentekid/build-nodebug &gt;&lt; 27 &gt;-
-&lt; 14:29:09 &gt;- $ for t in 1 2 3 4 5 6 7 8; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/miniboone.csv -r ~/datasets/miniboone.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 19.044967s
2 threads: 9.594015s
3 threads: 6.481856s
4 threads: 3.683142s
5 threads: 4.161889s
6 threads: 3.129131s
7 threads: 4.141623s
8 threads: 4.190556s
</code></pre>

<p>With OpenBLAS (0.2.18-1):</p>

<pre><code>-&lt; ryan@zax &gt;&lt; ~/src/mlpack-mentekid/build-nodebug &gt;&lt; 45 &gt;-
-&lt; 14:38:21 &gt;- $ for t in 1 2 3 4 5 6 7 8; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/corel.csv -r ~/datasets/corel.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 7.706135s
2 threads: 4.041237s
3 threads: 2.777471s
4 threads: 2.023099s
5 threads: 2.031080s
6 threads: 2.088373s
7 threads: 1.623153s
8 threads: 1.656306s
</code></pre>

<pre><code>-&lt; ryan@zax &gt;&lt; ~/src/mlpack-mentekid/build-nodebug &gt;&lt; 46 &gt;-
-&lt; 14:43:01 &gt;- $ for t in 1 2 3 4 5 6 7 8; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/phy.csv -r ~/datasets/phy.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 85.958403s (1 mins, 25.9 secs)
2 threads: 43.510783s
3 threads: 27.753276s
4 threads: 22.183104s
5 threads: 19.071099s
6 threads: 17.832296s
7 threads: 16.625094s
8 threads: 15.502252s
</code></pre>

<pre><code>-&lt; ryan@zax &gt;&lt; ~/src/mlpack-mentekid/build-nodebug &gt;&lt; 47 &gt;-
-&lt; 14:50:13 &gt;- $ for t in 1 2 3 4 5 6 7 8; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/miniboone.csv -r ~/datasets/miniboone.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 16.803330s
2 threads: 9.012268s
3 threads: 6.607606s
4 threads: 4.308089s
5 threads: 4.446321s
6 threads: 3.457773s
7 threads: 3.839490s
8 threads: 3.354377s
</code></pre>

<p>Maybe I could have made a nicer graph, but I did not want to put in the effort. :)</p>
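<p>(If anyone does want a plot, here is a rough sketch with awk and gnuplot; <code>timings.txt</code> is a hypothetical file holding the "N threads: X.XXs" lines from one of the runs above:)</p>

<pre><code># Strip the trailing 's' from the times, then plot threads vs. runtime.
awk '{ gsub(/s$/, "", $3); print $1, $3 }' timings.txt &gt; scaling.dat
gnuplot -e "set terminal png size 640,480; set output 'scaling.png'; \
            set xlabel 'threads'; set ylabel 'time (s)'; \
            plot 'scaling.dat' using 1:2 with linespoints title 'runtime'"
</code></pre>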

<p>Next I ran on a much more powerful system, with Xeon E5-2630 v3 processors (32 hardware threads).</p>

<p>With standard LAPACK/BLAS:</p>

<pre><code>◈ ryan@humungus ☃ build-nodebug ◈ $ for t in `seq 1 32`; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/corel.csv -r ~/datasets/corel.csv -k 3 -v -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done            
1 threads: 8.404026s
2 threads: 5.168623s
3 threads: 3.672489s
4 threads: 2.822260s
5 threads: 2.223130s
6 threads: 2.099296s
7 threads: 1.787155s
8 threads: 1.557677s
9 threads: 1.445799s
10 threads: 1.218698s
11 threads: 1.283903s
12 threads: 1.261723s
13 threads: 1.354944s
14 threads: 1.013850s
15 threads: 1.053046s
16 threads: 1.122099s
17 threads: 0.957182s
18 threads: 0.888229s
19 threads: 0.911108s
20 threads: 0.924035s
21 threads: 0.920874s
22 threads: 0.859121s
23 threads: 0.836497s
24 threads: 0.823132s
25 threads: 0.809634s
26 threads: 0.737277s
27 threads: 0.804975s
28 threads: 0.805146s
29 threads: 0.762401s
30 threads: 0.729818s
31 threads: 0.724836s
32 threads: 0.799443s
</code></pre>

<pre><code>◈ ryan@humungus ☃ build-nodebug ◈ $ for t in `seq 1 32`; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/phy.csv -r ~/datasets/phy.csv -k 3 -v -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 101.918782s (1 mins, 41.9 secs)
2 threads: 53.510828s
3 threads: 36.325209s
4 threads: 30.153590s
5 threads: 24.189268s
6 threads: 19.784561s
7 threads: 17.123877s
8 threads: 15.772134s
9 threads: 14.109838s
10 threads: 14.002757s
11 threads: 12.633819s
12 threads: 11.876447s
13 threads: 11.657881s
14 threads: 11.784745s
15 threads: 10.936825s
16 threads: 9.407911s
17 threads: 10.028609s
18 threads: 9.399953s
19 threads: 9.154050s
20 threads: 8.479986s
21 threads: 7.621993s
22 threads: 8.136546s
23 threads: 7.710549s
24 threads: 7.581741s
25 threads: 7.403005s
26 threads: 6.827410s
27 threads: 6.997940s
28 threads: 7.297680s
29 threads: 6.643068s
30 threads: 6.553058s
31 threads: 6.937021s
32 threads: 6.724141s
</code></pre>

<pre><code>◈ ryan@humungus ☃ build-nodebug ◈ $ for t in `seq 1 32`; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/miniboone.csv -r ~/datasets/miniboone.csv -k 3 -v -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 24.463162s               
2 threads: 14.399175s               
3 threads: 6.809879s                
4 threads: 6.064711s                
5 threads: 5.320080s                
6 threads: 3.913631s                
7 threads: 3.571118s                
8 threads: 2.730636s                
9 threads: 2.855679s                
10 threads: 2.648417s
11 threads: 3.071749s
12 threads: 2.562618s
13 threads: 2.517803s
14 threads: 2.085122s
15 threads: 2.079082s
16 threads: 2.138712s
17 threads: 2.142987s
18 threads: 1.836003s
19 threads: 1.576602s
20 threads: 1.795865s
21 threads: 1.637288s
22 threads: 1.889029s
23 threads: 1.258768s
24 threads: 1.474051s
25 threads: 1.658719s
26 threads: 1.444587s
27 threads: 1.327272s
28 threads: 1.342775s
29 threads: 1.756671s
30 threads: 1.317495s
31 threads: 1.431359s
32 threads: 1.595325s
</code></pre>

<p>With OpenBLAS (0.2.14-1ubuntu1):</p>

<pre><code>◈ ryan@humungus ☃ build-nodebug ◈ $ for t in `seq 1 32`; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/corel.csv -r ~/datasets/corel.csv -k 3 -v -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 8.336585s
2 threads: 4.911874s
3 threads: 3.842410s
4 threads: 2.971360s
5 threads: 2.358011s
6 threads: 1.924199s
7 threads: 1.716306s
8 threads: 1.568955s
9 threads: 1.541698s
10 threads: 1.256109s
11 threads: 1.356592s
12 threads: 1.159481s
13 threads: 1.290556s
14 threads: 1.227934s
15 threads: 1.227318s
16 threads: 1.109251s
17 threads: 1.082635s
18 threads: 0.902164s
19 threads: 0.908723s
20 threads: 0.905903s
21 threads: 0.905672s
22 threads: 0.887319s
23 threads: 0.877363s
24 threads: 0.802047s
25 threads: 0.762360s
26 threads: 0.835936s
27 threads: 0.823067s
28 threads: 0.748453s
29 threads: 0.758463s
30 threads: 0.834105s
31 threads: 0.810029s
32 threads: 0.830186s
</code></pre>

<pre><code>◈ ryan@humungus ☃ build-nodebug ◈ $ for t in `seq 1 32`; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/phy.csv -r ~/datasets/phy.csv -k 3 -v -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 101.596847s (1 mins, 41.5 secs)
2 threads: 53.533078s
3 threads: 36.128960s
4 threads: 27.706339s
5 threads: 23.167973s
6 threads: 19.631714s
7 threads: 16.814206s
8 threads: 15.843670s
9 threads: 15.085720s
10 threads: 13.145210s
11 threads: 13.119659s
12 threads: 11.100898s
13 threads: 11.431071s
14 threads: 11.277082s
15 threads: 10.915975s
16 threads: 9.818397s
17 threads: 9.682370s
18 threads: 9.183385s
19 threads: 8.878544s
20 threads: 8.670723s
21 threads: 8.163627s
22 threads: 8.209054s
23 threads: 7.823726s
24 threads: 7.691504s
25 threads: 7.547275s
26 threads: 7.398752s
27 threads: 7.768681s
28 threads: 6.944279s
29 threads: 7.044016s
30 threads: 7.291548s
31 threads: 6.655275s
32 threads: 6.863990s
</code></pre>

<pre><code>◈ ryan@humungus ☃ build-nodebug ◈ $ for t in `seq 1 32`; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/miniboone.csv -r ~/datasets/miniboone.csv -k 3 -v -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 22.185537s
2 threads: 10.286940s
3 threads: 8.115816s
4 threads: 5.231896s
5 threads: 4.176525s
6 threads: 3.875466s
7 threads: 3.264623s
8 threads: 3.284492s
9 threads: 3.308781s
10 threads: 2.977220s
11 threads: 2.649113s
12 threads: 2.250442s
13 threads: 2.180048s
14 threads: 1.871922s
15 threads: 1.915900s
16 threads: 2.084260s
17 threads: 1.952864s
18 threads: 1.906784s
19 threads: 2.026871s
20 threads: 1.933004s
21 threads: 1.643694s
22 threads: 1.599656s
23 threads: 1.547806s
24 threads: 1.600161s
25 threads: 1.788019s
26 threads: 1.630448s
27 threads: 1.785991s
28 threads: 1.744910s
29 threads: 1.849506s
30 threads: 2.061815s
31 threads: 2.133709s
32 threads: 1.773207s
</code></pre>

<p>Lastly, I ran on a humble i5 650 (2 cores, 4 hardware threads).  This system was completely idle, at essentially zero load, when I ran these simulations.  (So it differs from your typical desktop/laptop, in which one or two cores will probably be saturated at any given time because someone is actively using the system.)</p>

<p>With ATLAS (3.10.2-6):</p>

<pre><code>(( ryan @ dambala )) ~/src/mlpack-mentekid/build-nodebug $ for t in 1 2 3 4; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/corel.csv -r ~/datasets/corel.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 16.780850s
2 threads: 10.141014s
3 threads: 8.087287s
4 threads: 7.063538s
</code></pre>

<pre><code>(( ryan @ dambala )) ~/src/mlpack-mentekid/build-nodebug $ for t in 1 2 3 4; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/phy.csv -r ~/datasets/phy.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 132.373939s (2 mins, 12.3 secs)
2 threads: 71.849087s (1 mins, 11.8 secs)
3 threads: 54.420773s
4 threads: 46.805077s
</code></pre>

<pre><code>(( ryan @ dambala )) ~/src/mlpack-mentekid/build-nodebug $ for t in 1 2 3 4; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/miniboone.csv -r ~/datasets/miniboone.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 21.639796s
2 threads: 12.144827s
3 threads: 13.050360s
4 threads: 9.987645s
</code></pre>

<p>With OpenBLAS (0.2.12-1):</p>

<pre><code>(( ryan @ dambala )) ~/src/mlpack-mentekid/build-nodebug $ for t in 1 2 3 4; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/corel.csv -r ~/datasets/corel.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 17.990595s
2 threads: 9.546705s
3 threads: 6.832160s
4 threads: 7.639326s
</code></pre>

<pre><code>(( ryan @ dambala )) ~/src/mlpack-mentekid/build-nodebug $ for t in 1 2 3 4; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/phy.csv -r ~/datasets/phy.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 127.936811s (2 mins, 7.9 secs)
2 threads: 72.647918s (1 mins, 12.6 secs)
3 threads: 53.592957s
4 threads: 45.845096s
</code></pre>

<pre><code>(( ryan @ dambala )) ~/src/mlpack-mentekid/build-nodebug $ for t in 1 2 3 4; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/miniboone.csv -r ~/datasets/miniboone.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 28.859252s
2 threads: 17.077041s
3 threads: 12.146553s
4 threads: 8.597622s
</code></pre>

<p>So overall I am definitely seeing some non-negligible speedup, although with only a few cores the speedup is limited.  I guess I am a bit confused, though: I thought you were saying that you were seeing no useful speedup at all?</p>
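<p>(To put a rough number on it: taking the 1-thread and 8-thread <code>computing_neighbors</code> times from the phy.csv OpenBLAS run on my i7 above, the speedup works out to about 5.5x:)</p>

<pre><code># speedup = T(1) / T(8); efficiency = speedup / 8
echo "85.958403 15.502252" | \
  awk '{ s = $1 / $2; printf "speedup: %.2fx, efficiency: %.0f%%\n", s, 100 * s / 8 }'
# speedup: 5.54x, efficiency: 69%
</code></pre>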
