[mlpack-git] [mlpack/mlpack] LSHSearch Parallelization (#700)

Ryan Curtin notifications at github.com
Tue Jul 5 16:28:56 EDT 2016


I think my issue was that I had to specify `HAS_OPENMP=YES`.  So now I have it running in parallel.
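
For anyone else trying this out, the configuration I'd assume from that flag name is roughly the following (a hedged sketch; the exact CMake option spellings may differ on this branch):

```
# Rough sketch of the configuration implied by the flag name above; exact
# option spelling may differ on this branch.
mkdir -p build-nodebug && cd build-nodebug
cmake -D HAS_OPENMP=YES -D DEBUG=OFF ../
make
# The number of OpenMP threads is then chosen per run, as in the timings below:
OMP_NUM_THREADS=8 bin/mlpack_lsh -q query.csv -r reference.csv -k 3 -v -d d.csv -n n.csv
```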

Here are some scaling tests on my machine (an i7-3770: 4 cores / 8 threads) for a few datasets.  I tested with the corel, phy, and miniboone datasets, looking only at the `computing_neighbors` timer.  The outside-of-mlpack load average of the system was about 1.8, so you can assume roughly two cores were already busy.  I tested with a couple of underlying LAPACK/BLAS variants.

With ATLAS (3.10.2-9+b1):

```
-< ryan at zax >< ~/src/mlpack-mentekid/build-nodebug >< 24 >-
-< 14:16:51 >- $ for t in 1 2 3 4 5 6 7 8; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/corel.csv -r ~/datasets/corel.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 7.910973s
2 threads: 4.195096s
3 threads: 3.297027s
4 threads: 2.180582s
5 threads: 2.102062s
6 threads: 1.936919s
7 threads: 1.958664s
8 threads: 1.713584s
```

```
-< ryan at zax >< ~/src/mlpack-mentekid/build-nodebug >< 26 >-
-< 14:23:39 >- $ for t in 1 2 3 4 5 6 7 8; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/phy.csv -r ~/datasets/phy.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 83.253656s (1 mins, 23.2 secs)
2 threads: 42.035172s
3 threads: 29.575169s
4 threads: 23.218118s
5 threads: 20.773536s
6 threads: 18.055156s
7 threads: 17.681238s
8 threads: 17.888020s
```

```
-< ryan at zax >< ~/src/mlpack-mentekid/build-nodebug >< 27 >-
-< 14:29:09 >- $ for t in 1 2 3 4 5 6 7 8; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/miniboone.csv -r ~/datasets/miniboone.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 19.044967s
2 threads: 9.594015s
3 threads: 6.481856s
4 threads: 3.683142s
5 threads: 4.161889s
6 threads: 3.129131s
7 threads: 4.141623s
8 threads: 4.190556s
```

With OpenBLAS (0.2.18-1):

```
-< ryan at zax >< ~/src/mlpack-mentekid/build-nodebug >< 45 >-
-< 14:38:21 >- $ for t in 1 2 3 4 5 6 7 8; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/corel.csv -r ~/datasets/corel.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 7.706135s
2 threads: 4.041237s
3 threads: 2.777471s
4 threads: 2.023099s
5 threads: 2.031080s
6 threads: 2.088373s
7 threads: 1.623153s
8 threads: 1.656306s
```

```
-< ryan at zax >< ~/src/mlpack-mentekid/build-nodebug >< 46 >-
-< 14:43:01 >- $ for t in 1 2 3 4 5 6 7 8; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/phy.csv -r ~/datasets/phy.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 85.958403s (1 mins, 25.9 secs)
2 threads: 43.510783s
3 threads: 27.753276s
4 threads: 22.183104s
5 threads: 19.071099s
6 threads: 17.832296s
7 threads: 16.625094s
8 threads: 15.502252s
```

```
-< ryan at zax >< ~/src/mlpack-mentekid/build-nodebug >< 47 >-
-< 14:50:13 >- $ for t in 1 2 3 4 5 6 7 8; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/miniboone.csv -r ~/datasets/miniboone.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 16.803330s
2 threads: 9.012268s
3 threads: 6.607606s
4 threads: 4.308089s
5 threads: 4.446321s
6 threads: 3.457773s
7 threads: 3.839490s
8 threads: 3.354377s
```

Maybe I could have made a nicer graph, but I didn't want to put in the effort. :)
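
Not a graph, but here is an untested tweak of the same loop that would print the speedup over the single-thread run next to each timing (same binary and flags as above):

```
# Same benchmark loop as above, but also printing speedup relative to the
# single-thread run.  Untested sketch.
base=
for t in 1 2 3 4 5 6 7 8; do
  s=$(OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/corel.csv -r ~/datasets/corel.csv \
        -v -k 3 -d d.csv -n n.csv \
      | grep computing_neighbors | awk -F':' '{ print $2 }' | awk '{ print $1 }' | tr -d 's')
  [ -z "$base" ] && base="$s"
  speedup=$(awk -v b="$base" -v c="$s" 'BEGIN { printf "%.2f", b / c }')
  echo "$t threads: ${s}s (${speedup}x speedup)"
done
```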

Next I ran on a much more powerful system with a Xeon E5-2630 v3 (32 hardware threads).

With standard LAPACK/BLAS:

```
◈ ryan at humungus ☃ build-nodebug ◈ $ for t in `seq 1 32`; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/corel.csv -r ~/datasets/corel.csv -k 3 -v -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done            
1 threads: 8.404026s
2 threads: 5.168623s
3 threads: 3.672489s
4 threads: 2.822260s
5 threads: 2.223130s
6 threads: 2.099296s
7 threads: 1.787155s
8 threads: 1.557677s
9 threads: 1.445799s
10 threads: 1.218698s
11 threads: 1.283903s
12 threads: 1.261723s
13 threads: 1.354944s
14 threads: 1.013850s
15 threads: 1.053046s
16 threads: 1.122099s
17 threads: 0.957182s
18 threads: 0.888229s
19 threads: 0.911108s
20 threads: 0.924035s
21 threads: 0.920874s
22 threads: 0.859121s
23 threads: 0.836497s
24 threads: 0.823132s
25 threads: 0.809634s
26 threads: 0.737277s
27 threads: 0.804975s
28 threads: 0.805146s
29 threads: 0.762401s
30 threads: 0.729818s
31 threads: 0.724836s
32 threads: 0.799443s
```

```
◈ ryan at humungus ☃ build-nodebug ◈ $ for t in `seq 1 32`; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/phy.csv -r ~/datasets/phy.csv -k 3 -v -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 101.918782s (1 mins, 41.9 secs)
2 threads: 53.510828s
3 threads: 36.325209s
4 threads: 30.153590s
5 threads: 24.189268s
6 threads: 19.784561s
7 threads: 17.123877s
8 threads: 15.772134s
9 threads: 14.109838s
10 threads: 14.002757s
11 threads: 12.633819s
12 threads: 11.876447s
13 threads: 11.657881s
14 threads: 11.784745s
15 threads: 10.936825s
16 threads: 9.407911s
17 threads: 10.028609s
18 threads: 9.399953s
19 threads: 9.154050s
20 threads: 8.479986s
21 threads: 7.621993s
22 threads: 8.136546s
23 threads: 7.710549s
24 threads: 7.581741s
25 threads: 7.403005s
26 threads: 6.827410s
27 threads: 6.997940s
28 threads: 7.297680s
29 threads: 6.643068s
30 threads: 6.553058s
31 threads: 6.937021s
32 threads: 6.724141s
```

```
◈ ryan at humungus ☃ build-nodebug ◈ $ for t in `seq 1 32`; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/miniboone.csv -r ~/datasets/miniboone.csv -k 3 -v -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 24.463162s               
2 threads: 14.399175s               
3 threads: 6.809879s                
4 threads: 6.064711s                
5 threads: 5.320080s                
6 threads: 3.913631s                
7 threads: 3.571118s                
8 threads: 2.730636s                
9 threads: 2.855679s                
10 threads: 2.648417s
11 threads: 3.071749s
12 threads: 2.562618s
13 threads: 2.517803s
14 threads: 2.085122s
15 threads: 2.079082s
16 threads: 2.138712s
17 threads: 2.142987s
18 threads: 1.836003s
19 threads: 1.576602s
20 threads: 1.795865s
21 threads: 1.637288s
22 threads: 1.889029s
23 threads: 1.258768s
24 threads: 1.474051s
25 threads: 1.658719s
26 threads: 1.444587s
27 threads: 1.327272s
28 threads: 1.342775s
29 threads: 1.756671s
30 threads: 1.317495s
31 threads: 1.431359s
32 threads: 1.595325s
```

With OpenBLAS (0.2.14-1ubuntu1):

```
◈ ryan at humungus ☃ build-nodebug ◈ $ for t in `seq 1 32`; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/corel.csv -r ~/datasets/corel.csv -k 3 -v -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 8.336585s
2 threads: 4.911874s
3 threads: 3.842410s
4 threads: 2.971360s
5 threads: 2.358011s
6 threads: 1.924199s
7 threads: 1.716306s
8 threads: 1.568955s
9 threads: 1.541698s
10 threads: 1.256109s
11 threads: 1.356592s
12 threads: 1.159481s
13 threads: 1.290556s
14 threads: 1.227934s
15 threads: 1.227318s
16 threads: 1.109251s
17 threads: 1.082635s
18 threads: 0.902164s
19 threads: 0.908723s
20 threads: 0.905903s
21 threads: 0.905672s
22 threads: 0.887319s
23 threads: 0.877363s
24 threads: 0.802047s
25 threads: 0.762360s
26 threads: 0.835936s
27 threads: 0.823067s
28 threads: 0.748453s
29 threads: 0.758463s
30 threads: 0.834105s
31 threads: 0.810029s
32 threads: 0.830186s
```

```
◈ ryan at humungus ☃ build-nodebug ◈ $ for t in `seq 1 32`; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/phy.csv -r ~/datasets/phy.csv -k 3 -v -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 101.596847s (1 mins, 41.5 secs)
2 threads: 53.533078s
3 threads: 36.128960s
4 threads: 27.706339s
5 threads: 23.167973s
6 threads: 19.631714s
7 threads: 16.814206s
8 threads: 15.843670s
9 threads: 15.085720s
10 threads: 13.145210s
11 threads: 13.119659s
12 threads: 11.100898s
13 threads: 11.431071s
14 threads: 11.277082s
15 threads: 10.915975s
16 threads: 9.818397s
17 threads: 9.682370s
18 threads: 9.183385s
19 threads: 8.878544s
20 threads: 8.670723s
21 threads: 8.163627s
22 threads: 8.209054s
23 threads: 7.823726s
24 threads: 7.691504s
25 threads: 7.547275s
26 threads: 7.398752s
27 threads: 7.768681s
28 threads: 6.944279s
29 threads: 7.044016s
30 threads: 7.291548s
31 threads: 6.655275s
32 threads: 6.863990s
```

```
◈ ryan at humungus ☃ build-nodebug ◈ $ for t in `seq 1 32`; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/miniboone.csv -r ~/datasets/miniboone.csv -k 3 -v -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 22.185537s
2 threads: 10.286940s
3 threads: 8.115816s
4 threads: 5.231896s
5 threads: 4.176525s
6 threads: 3.875466s
7 threads: 3.264623s
8 threads: 3.284492s
9 threads: 3.308781s
10 threads: 2.977220s
11 threads: 2.649113s
12 threads: 2.250442s
13 threads: 2.180048s
14 threads: 1.871922s
15 threads: 1.915900s
16 threads: 2.084260s
17 threads: 1.952864s
18 threads: 1.906784s
19 threads: 2.026871s
20 threads: 1.933004s
21 threads: 1.643694s
22 threads: 1.599656s
23 threads: 1.547806s
24 threads: 1.600161s
25 threads: 1.788019s
26 threads: 1.630448s
27 threads: 1.785991s
28 threads: 1.744910s
29 threads: 1.849506s
30 threads: 2.061815s
31 threads: 2.133709s
32 threads: 1.773207s
```

Lastly, I ran on a humble i5 650 (2 cores / 4 threads).  This system was completely idle, under essentially zero load, when I ran these simulations.  (So it differs from your typical 4-core desktop/laptop, where one or two cores are probably saturated at any given time because someone is actively using the system.)

With ATLAS (3.10.2-6):

```
(( ryan @ dambala )) ~/src/mlpack-mentekid/build-nodebug $ for t in 1 2 3 4; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/corel.csv -r ~/datasets/corel.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 16.780850s
2 threads: 10.141014s
3 threads: 8.087287s
4 threads: 7.063538s
```

```
(( ryan @ dambala )) ~/src/mlpack-mentekid/build-nodebug $ for t in 1 2 3 4; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/phy.csv -r ~/datasets/phy.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 132.373939s (2 mins, 12.3 secs)
2 threads: 71.849087s (1 mins, 11.8 secs)
3 threads: 54.420773s
4 threads: 46.805077s
```

```
(( ryan @ dambala )) ~/src/mlpack-mentekid/build-nodebug $ for t in 1 2 3 4; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/miniboone.csv -r ~/datasets/miniboone.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 21.639796s
2 threads: 12.144827s
3 threads: 13.050360s
4 threads: 9.987645s
```

With OpenBLAS (0.2.12-1):

```
(( ryan @ dambala )) ~/src/mlpack-mentekid/build-nodebug $ for t in 1 2 3 4; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/corel.csv -r ~/datasets/corel.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 17.990595s
2 threads: 9.546705s
3 threads: 6.832160s
4 threads: 7.639326s
```

```
(( ryan @ dambala )) ~/src/mlpack-mentekid/build-nodebug $ for t in 1 2 3 4; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/phy.csv -r ~/datasets/phy.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 127.936811s (2 mins, 7.9 secs)
2 threads: 72.647918s (1 mins, 12.6 secs)
3 threads: 53.592957s
4 threads: 45.845096s
```

```
(( ryan @ dambala )) ~/src/mlpack-mentekid/build-nodebug $ for t in 1 2 3 4; do OMP_NUM_THREADS=$t bin/mlpack_lsh -q ~/datasets/miniboone.csv -r ~/datasets/miniboone.csv -v -k 3 -d d.csv -n n.csv | grep computing_neighbors | awk -F':' '{ print $2 }' | sed "s/^/$t threads:/" ; done
1 threads: 28.859252s
2 threads: 17.077041s
3 threads: 12.146553s
4 threads: 8.597622s
```

So overall I am definitely seeing a non-negligible speedup, although with only four cores it is limited.  I am a bit confused, though: I thought you were saying that you were seeing no useful speedup at all?
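
For reference, a quick back-of-the-envelope calculation of the speedups implied by the phy timings pasted above (single-thread time divided by the best multi-threaded time, using the ATLAS / reference LAPACK runs):

```
# Back-of-the-envelope speedups from the phy runs pasted above:
# single-thread time divided by the best multi-threaded time on each machine.
awk 'BEGIN {
  printf "i7-3770, best of 8 threads:  %.1fx\n", 83.253656 / 17.681238
  printf "Xeon, best of 32 threads:    %.1fx\n", 101.918782 / 6.553058
  printf "i5 650, best of 4 threads:   %.1fx\n", 132.373939 / 46.805077
}'
# Prints roughly 4.7x, 15.6x, and 2.8x respectively.
```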
