[mlpack] Google summer of code 2013

Ryan Curtin gth671b at mail.gatech.edu
Mon Apr 15 11:16:18 EDT 2013


On Sun, Apr 14, 2013 at 11:22:19PM +0400, Марат Байдасов wrote:
> Hi
> I'm interested in your "Profiling for further optimization" Google Summer
> of Code 2013 project, and I'd like to ask a few clarifying questions.
> 
> 1. First of all, could you please tell me which particular programs from
> mlpack should be run with profiling information?
> 2. Could I read anything specific in order to understand how the profiling
> information is used for the speedup?
> 3. Is it possible to compile mlpack using the profile information from
> several datasets?
> 4. Have you got any standard datasets to download and play with?

Hello Marat,

The idea behind the "Profiling for further optimization" project is that
we use profile-guided optimization (PGO) to speed up mlpack.  I think
the GCC project has a nice page on PGO, but I can't find it right now,
so this blog post will have to do:

http://dom.as/2009/07/27/profile-guided-optimization-with-gcc/

Basically, to do PGO, you compile with -fprofile-generate, run the
program on representative input so it writes out profile data, and then
recompile with -fprofile-use; the compiler uses the recorded execution
counts to make better optimization decisions (inlining, branch layout,
and so on), so the rebuilt binary is typically faster.
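As a rough sketch of the workflow with plain gcc on a standalone file
(for mlpack itself these flags would be passed through the CMake C++
flags instead, e.g. CMAKE_CXX_FLAGS; the file and dataset names below
are just placeholders):

    # 1. Build with instrumentation; this adds profiling hooks.
    g++ -O2 -fprofile-generate example.cpp -o example
    # 2. Run on representative input; this writes .gcda profile files.
    ./example typical_dataset.csv
    # 3. Rebuild using the collected profile data.
    g++ -O2 -fprofile-use example.cpp -o example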

To do PGO well, we will have to run each mlpack method to get profiling
information for it; that is, we have to use all of the code which mlpack
provides, and we also have to use it in a "typical" scenario.  For
instance, if we only run k-nearest-neighbors on a dataset with
particularly odd characteristics (such as the covertype dataset), then
GCC may compile mlpack's trees in a way that makes trees built on
covertype-like datasets faster while leaving other trees no better off.
So running each method on a wide variety of datasets is probably the
way to go.
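To make that concrete, collecting a profile for the nearest-neighbor
code would mean running the instrumented allknn program on several
datasets, something like the following (from memory; check
'allknn --help' for the exact option names, and the file names are just
placeholders):

    # Run instrumented k-NN search to exercise the tree-building code.
    allknn -r dataset1.csv -k 5 -n neighbors.csv -d distances.csv
    allknn -r dataset2.csv -k 5 -n neighbors.csv -d distances.csv

and similarly for each of the other mlpack command-line programs.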

There are other posts on GCC PGO, easy to find by searching for that
term, that explain how PGO works in more detail.

To find datasets to run mlpack methods on, the UCI Machine Learning
Repository (http://archive.ics.uci.edu/ml/datasets.html) is a good
place to go.  For very large datasets (a million points or more) we can
randomly sample them down to a smaller size (50000 points or
thereabouts) to keep the PGO runs from taking too long.
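For a plain CSV file with no header row, a quick way to draw such a
sample (just a sketch; the file names are placeholders):

    # Randomly pick 50000 lines from a large dataset.
    shuf -n 50000 covertype.csv > covertype-50k.csv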

Let me know if there are more questions that I can answer for you.  I
think this project is really quite exciting because it will provide a
speedup for everything in mlpack.

Ryan

-- 
Ryan Curtin       | "And the last thing I would ever do is lie to you."
ryan at igglybob.com |   - Marlon

