[mlpack-svn] [MLPACK] #179: Design parallelization framework

MLPACK Trac trac at coffeetalk-1.cc.gatech.edu
Wed Dec 21 00:42:12 EST 2011


#179: Design parallelization framework
------------------------------+---------------------------------------------
  Reporter:  rcurtin          |        Owner:              
      Type:  wishlist         |       Status:  new         
  Priority:  major            |    Milestone:  MLPACK 2.0.0
 Component:  MLPACK           |   Resolution:              
  Keywords:  parallelization  |     Blocking:              
Blocked By:                   |  
------------------------------+---------------------------------------------

Comment (by jcline3):

 It's probably not possible to have developers just drop in a line or two
 of code to parallelize a method.

 For example, consider parallelizing this:

 {{{
 int i = 0;

 for(int c = 0; c < 100; ++c)
   i += c;
 }}}

 You can't do this and get correct results:

 {{{
 #include <thread>

 // Both threads modify i through the pointer with no synchronization; the
 // read-add-write sequences can interleave, so updates can be lost.
 void sum(int* i, int c, int j) {
   for (; c < j; ++c)
     *i += c;
 }

 int main() {
   int i = 0;

   std::thread t1(sum, &i, 0, 50);
   std::thread t2(sum, &i, 50, 100);

   t1.join();
   t2.join();
 }
 }}}

 The reason is that i gets clobbered by the two threads: each thread loads
 the value of i from memory, adds to it, and writes it back, and those
 steps can interleave between the two threads, ie:

 {{{
 thread 1 read
 thread 2 read
 thread 1 add
 thread 1 write
 thread 2 add
 thread 2 write
 }}}

 In this case, i ends up with the value written by thread 2, and the
 addition made by thread 1 is lost.

 This is a simple example in which you can solve the problem by not passing
 a pointer to i: each thread returns its own accumulated value from sum,
 and those values are added together at the end. This simple approach of
 separating the workload works for many different problems and gives a very
 good speedup.

 {{{
 #include <future>

 // Each call accumulates into its own local variable; no shared state.
 int sum(int c, int j) {
   int acc = 0;

   for (; c < j; ++c)
     acc += c;

   return acc;
 }

 int main() {
   // Run the two halves concurrently, then combine the partial results.
   std::future<int> lo = std::async(std::launch::async, sum, 0, 50);
   std::future<int> hi = std::async(std::launch::async, sum, 50, 100);

   int i = lo.get() + hi.get();
 }
 }}}

 However, if you want multiple threads to perform some operation on shared
 state at the same time, such as modifying a tree or doing an in-place
 matrix operation, then you may have to use other constructs to prevent
 clobbering. This is where mutexes and semaphores come in. The ideal
 solution is to separate all of our operations so that there are no
 dependencies between threads, since that is both the simplest approach and
 potentially the fastest.
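
 For illustration, here is a minimal sketch of what the mutex approach
 looks like for the same toy sum, using C++11's std::mutex (the names are
 just for the example):

 {{{
 #include <mutex>
 #include <thread>

 int total = 0;
 std::mutex total_mutex;  // Protects total.

 // Each thread locks the mutex before touching the shared accumulator, so
 // the read-add-write sequence can no longer interleave between threads.
 void sum(int c, int j) {
   for (; c < j; ++c) {
     std::lock_guard<std::mutex> lock(total_mutex);
     total += c;
   }
 }

 int main() {
   std::thread t1(sum, 0, 50);
   std::thread t2(sum, 50, 100);

   t1.join();
   t2.join();
 }
 }}}

 Of course, for this toy problem the lock serializes nearly all of the
 work, which is exactly why the dependency-free version above is preferable
 whenever it's possible.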

 OpenMP makes this particular problem very simple to solve:

 {{{
 int sum = 0;

 #pragma omp parallel for reduction(+:sum)
 for (int i = 0; i < 100; ++i)
 {
   sum += i;
 }
 }}}

 A simple description of how OpenMP works: you define the parallelism as
 above using the #pragmas, then you compile your program with OpenMP
 support enabled, then you set up whatever environment variables you need
 (ie, the number of workers in the worker pool). When you run your program,
 it creates a master thread and some number of workers and executes its
 parallel regions in those separate threads.
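
 As a quick sanity check of that workflow, a snippet like the following
 (compiled with OpenMP enabled, e.g. -fopenmp with GCC) shows the worker
 pool; the pool size is usually controlled by the OMP_NUM_THREADS
 environment variable:

 {{{
 #include <cstdio>
 #include <omp.h>

 int main() {
   // Every worker in the pool executes this region and reports its id.
   #pragma omp parallel
   {
     printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
   }
 }
 }}}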

 The downside to this and similar approaches (C++11 threads, pthreads) is
 that they don't transfer immediately to distributed computing. The MPI
 standard, however, would allow us to parallelize on local and remote
 machines without changing anything in the code, IIRC. Open MPI would
 probably be a reasonable choice if we went this route, though I believe
 I've heard that Intel's MPI implementation is faster. If we want to tackle
 distributed computing with this ticket as well, then I'll do some more
 research on MPI implementations as well as some of the other approaches to
 distributed computing that may be appropriate for what we're doing
 (MapReduce, for example).
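
 For reference, here is roughly what the same toy sum looks like against
 the MPI C API. This is just a sketch (launch details such as mpirun are
 omitted), but it shows the basic idea: each process owns a slice of the
 work and the partial results are combined explicitly:

 {{{
 #include <mpi.h>

 int main(int argc, char** argv) {
   MPI_Init(&argc, &argv);

   int rank, size;
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);

   // Each process sums its own slice of [0, 100).
   int begin = rank * 100 / size;
   int end = (rank + 1) * 100 / size;
   int local = 0;
   for (int c = begin; c < end; ++c)
     local += c;

   // Combine the partial sums on process 0.
   int total = 0;
   MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

   MPI_Finalize();
   return 0;
 }
 }}}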

 Doing something like kNN classification once the model has been built on n
 data points is trivially parallelizable as I described above using OpenMP,
 and not much harder using C++11 threads or pthreads, since you don't have
 to worry about data dependencies.
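
 A hypothetical sketch of what that looks like with OpenMP, assuming some
 per-query routine ClassifyPoint() that only reads the already-built model
 (the names here are made up for the example):

 {{{
 #include <vector>

 // Stand-in for whatever per-query work the method does against the
 // already-built model; it reads shared data but writes nothing shared.
 int ClassifyPoint(long q) { return (int) (q % 2); }

 int main() {
   std::vector<int> predictions(1000);

   // Every query is independent, so one pragma is enough.
   #pragma omp parallel for
   for (long q = 0; q < (long) predictions.size(); ++q)
     predictions[q] = ClassifyPoint(q);
 }
 }}}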

 However, we will also want to parallelize building a model, which is all
 the "real" work we do for some methods (kNN, linear regression). Making
 linear algebra operations parallel will probably come down to either
 adding it to Armadillo or getting Armadillo to compile against some
 parallel BLAS/LAPACK library, so there may not be a good way for us to
 make linear regression parallel ourselves. As before, if we can write the
 model generation algorithm in such a way that the threads do not depend on
 each other, then we should have a fairly easy time making it possible for
 developers to make their methods parallel.
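
 To make the Armadillo point concrete, here is a toy least-squares solve.
 Nothing in this code would change either way; whether it runs in parallel
 depends entirely on whether the BLAS/LAPACK that Armadillo is linked
 against is threaded:

 {{{
 #include <armadillo>

 int main() {
   // Toy linear regression via the normal equations; the matrix products
   // and the solve are delegated to the underlying BLAS/LAPACK.
   arma::mat X = arma::randu<arma::mat>(1000, 10);
   arma::vec y = arma::randu<arma::vec>(1000);

   arma::vec beta = arma::solve(X.t() * X, X.t() * y);

   return 0;
 }
 }}}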

 Any method where that's not possible is where we start running into more
 challenging issues if we want to keep things fast.

 I hope this gives you a better idea of what's involved.

-- 
Ticket URL: <https://trac.research.cc.gatech.edu/fastlab/ticket/179#comment:1>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed by the FASTLAB at Georgia Tech under Dr. Alex Gray.

