[mlpack-svn] [MLPACK] #179: Design parallelization framework
MLPACK Trac
trac at coffeetalk-1.cc.gatech.edu
Wed Dec 21 00:42:12 EST 2011
#179: Design parallelization framework
------------------------------+---------------------------------------------
Reporter: rcurtin | Owner:
Type: wishlist | Status: new
Priority: major | Milestone: MLPACK 2.0.0
Component: MLPACK | Resolution:
Keywords: parallelization | Blocking:
Blocked By: |
------------------------------+---------------------------------------------
Comment (by jcline3):
It's probably not possible to have developers just drop in a line or two of
code to parallelize a method.
For example, consider parallelizing this:
{{{
int i = 0;
for (int c = 0; c < 100; ++c)
  i += c;
}}}
You can't do this and get correct results:
{{{
#include <thread>

// Both threads do an unsynchronized read-modify-write on *i: a data race.
void sum(int* i, int c, int j) {
  for (; c < j; ++c)
    *i += c;
}

int main() {
  int i = 0;
  std::thread t1(sum, &i, 0, 50);   // sums 0..49
  std::thread t2(sum, &i, 50, 100); // sums 50..99
  t1.join();
  t2.join();
}
}}}
The reason is that i gets clobbered by the two threads: each thread loads
the value of i from memory, adds to it, and writes it back, and those steps
can interleave between the two threads, e.g.:
{{{
thread 1 read
thread 2 read
thread 1 add
thread 1 write
thread 2 add
thread 2 write
}}}
In this case, i ends up with the value written by thread 2, and thread 1's
update is lost.
In this simple example you can solve the problem by not passing a pointer
to i, instead returning the accumulated value from sum in each thread and
adding those together at the end. This simple approach of separating the
workload applies to many different problems and results in a very good
speedup.
{{{
#include <future>

// Each call accumulates into its own local variable; no shared state.
int sum(int c, int j) {
  int acc = 0;
  for (; c < j; ++c)
    acc += c;
  return acc;
}

int main() {
  auto a = std::async(std::launch::async, sum, 0, 50);
  auto b = std::async(std::launch::async, sum, 50, 100);
  int i = a.get() + b.get(); // combine the partial sums
}
}}}
However, if you want to do an operation like modifying a tree, or an
in-place matrix operation that multiple threads perform at the same time,
then you may have to use other constructs to prevent clobbering. This is
where mutexes and semaphores come in. The ideal solution is to separate all
of our operations so that there are no dependencies between threads, since
that is the simplest and potentially fastest approach.
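As a minimal sketch (not mlpack code), here is the racy example made
correct with a C++11 mutex; each thread accumulates locally and takes the
lock only for the final shared update:
{{{
#include <mutex>
#include <thread>

int total = 0;
std::mutex m;

void sum(int c, int j) {
  int acc = 0;
  for (; c < j; ++c)
    acc += c;
  // The lock serializes the only access to shared state, so no update is
  // lost.
  std::lock_guard<std::mutex> lock(m);
  total += acc;
}

int main() {
  std::thread t1(sum, 0, 50);
  std::thread t2(sum, 50, 100);
  t1.join();
  t2.join();
  // total == 4950
}
}}}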
OpenMP makes this particular problem very simple to solve:
{{{
int sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < 100; ++i)
  sum += i;
}}}
A simple description of how OpenMP works: you define the parallelism as
above using the #pragmas, then you compile your program, then you set up
whatever environment variables you need (e.g., the number of workers in the
worker pool), and then you run your program; it creates a master thread and
some number of workers and executes its parallel regions in those separate
threads.
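To make that workflow concrete, here is the example above as a complete
program, with one possible compile and run sequence in the comments
(assuming g++; other compilers use different flags):
{{{
// Build: g++ -fopenmp sum.cpp -o sum
// Run:   OMP_NUM_THREADS=4 ./sum
#include <cstdio>

int main() {
  int sum = 0;
  // Each worker sums a chunk of the range; the reduction clause combines
  // the per-thread partial sums at the end.
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < 100; ++i)
    sum += i;
  std::printf("%d\n", sum); // prints 4950
}
}}}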
The downside to this and similar approaches (C++11 threads, pthreads) is
that they don't transfer immediately to distributed computing. The MPI
standard, however, would allow us to parallelize on local and remote
machines without changing anything in the code, IIRC. Open MPI would
probably be a reasonable choice if we went this route, though I believe
I've heard that Intel's MPI implementation is faster. If we want to tackle
distributed computing with this ticket as well, then I'll do some more
research on MPI implementations as well as some of the other approaches to
distributed computing that may be appropriate for what we're doing
(MapReduce, for example).
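For a sense of what MPI code looks like, here is a minimal sketch of the
same sum (not tied to any particular implementation; typically compiled
with mpic++ and launched with mpirun):
{{{
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Each process sums its own strided slice of [0, 100).
  int local = 0;
  for (int c = rank; c < 100; c += size)
    local += c;

  // Combine all partial sums onto rank 0.
  int total = 0;
  MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0)
    std::printf("%d\n", total); // prints 4950

  MPI_Finalize();
  return 0;
}
}}}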
Doing something like kNN classification once the model has been built on n
data points is trivially parallelizable using OpenMP, as I described above,
and not much harder using C++11 threads or pthreads, since you don't have
to worry about data dependencies.
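As a sketch of why that's trivial (brute force, 1-D, k = 1 just to keep it
short; none of this is mlpack API): each query writes only its own output
slot, so the query loop parallelizes with a single pragma:
{{{
#include <cmath>
#include <cstddef>
#include <vector>

// Return the index of the nearest reference point for each query.
std::vector<std::size_t> NearestNeighbors(
    const std::vector<double>& refs,
    const std::vector<double>& queries)
{
  std::vector<std::size_t> result(queries.size());
  #pragma omp parallel for
  for (std::ptrdiff_t q = 0; q < (std::ptrdiff_t) queries.size(); ++q)
  {
    std::size_t best = 0;
    for (std::size_t r = 1; r < refs.size(); ++r)
      if (std::abs(queries[q] - refs[r]) < std::abs(queries[q] - refs[best]))
        best = r;
    result[q] = best; // each thread writes only its own slot
  }
  return result;
}
}}}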
However, we will also want to parallelize building a model, which is all of
the "real" work we do for some methods (kNN, linear regression). Making
linear algebra operations parallel will probably come down to finding a way
to either add it to Armadillo or get Armadillo to compile against some
parallel BLAS/LAPACK library, so otherwise there may not be a good way to
make linear regression parallel. As before, if we can write the model
generation algorithm in such a way that the threads do not depend on each
other, then we should have a fairly easy time making it possible for
developers to make their methods parallel.
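To illustrate the parallel-BLAS route (a hedged sketch, not a
recommendation): code like the following doesn't change at all, and if the
BLAS/LAPACK that Armadillo is linked against is multithreaded, the matrix
product and the solve run in parallel automatically.
{{{
#include <armadillo>

int main() {
  // Least-squares fit via the normal equations; the heavy lifting (the
  // X'X product and the solve) happens inside BLAS/LAPACK, so a threaded
  // BLAS parallelizes it with no source changes.
  arma::mat X = arma::randu<arma::mat>(1000, 50);
  arma::vec y = arma::randu<arma::vec>(1000);
  arma::vec beta = arma::solve(X.t() * X, X.t() * y);
  return 0;
}
}}}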
For any method where that's not possible, we'll start running into more
challenging issues if we want to keep things fast.
I hope this gives you a better idea of what's involved.
--
Ticket URL: <https://trac.research.cc.gatech.edu/fastlab/ticket/179#comment:1>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed by the FASTLAB at Georgia Tech under Dr. Alex Gray.