[mlpack-svn] [MLPACK] #126: Implement simple PCA

Fri Nov 11 16:49:16 EST 2011

#126: Implement simple PCA
-----------------------------------------------+----------------------------
  Reporter:  rcurtin                           |        Owner:  ajinkya   
      Type:  wishlist                          |       Status:  assigned  
  Priority:  major                             |    Milestone:  MLPACK 1.0
 Component:  MLPACK                            |   Resolution:            
  Keywords:  pca kernel_pca covariance method  |     Blocking:  47        
Blocked By:                                    |  
-----------------------------------------------+----------------------------

Comment (by rcurtin):

 Excellent, looks great.  I did have some thoughts though; I think you can
 use the Timer class to help figure this out.

 The Armadillo `princomp(..., trans(data))` function is going to do this:

  * Transpose the data into a temporary matrix before `princomp()` is
 called.
  * Transpose the data again (into another temporary) to calculate the
 covariance with `X^T * X`.

 A while back I wrote a function that I hoped would be accepted into
 Armadillo, but it hasn't been yet.  It's called `arma::ccov()` and it can
 be found in `src/mlpack/core/arma_extend/fn_ccov.hpp` (it is automatically
 included by `mlpack/core.h`).  It computes the covariance of the
 transposed data matrix without actually forming the transpose.

 If I call `arma::cov(trans(data))`, it does the two levels of
 transposition, just like `princomp()`.  However, if I call
 `arma::ccov(data)`, it calculates `X * X^T` directly, not
 `(X^T)^T * X^T`.

 So what I'm getting at in the end is, could we see a speedup by calling
 this:

 {{{
 arma::mat cov = arma::ccov(data);
 // call eig() manually and do all that stuff
 }}}

 instead of just

 {{{
 princomp(..., trans(data))
 }}}
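
 For a feel of what the manual eigendecomposition step involves, here is a
 dependency-free sketch that pulls out only the first principal component
 from a covariance matrix.  Note that it swaps in power iteration for a
 full `eig()` call, so it is not what Armadillo does internally;
 `DominantEigenpair()` and the iteration count are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;
using Vector = std::vector<double>;

// Power iteration on a symmetric positive semidefinite matrix (such as a
// covariance matrix).  Returns an estimate of the largest eigenvalue and
// stores the matching unit eigenvector in 'direction'.
//
// Caveat: the start vector is all ones; if that happens to be orthogonal to
// the dominant eigenvector, this will not find it.
double DominantEigenpair(const Matrix& cov,
                         Vector& direction,
                         const std::size_t iterations = 200)
{
  const std::size_t d = cov.size();
  direction.assign(d, 1.0 / std::sqrt(static_cast<double>(d)));
  double eigenvalue = 0.0;

  for (std::size_t it = 0; it < iterations; ++it)
  {
    // next = cov * direction
    Vector next(d, 0.0);
    for (std::size_t a = 0; a < d; ++a)
      for (std::size_t b = 0; b < d; ++b)
        next[a] += cov[a][b] * direction[b];

    double norm = 0.0;
    for (const double v : next)
      norm += v * v;
    norm = std::sqrt(norm);
    if (norm == 0.0)
      break; // direction lies in the null space; give up.

    for (std::size_t a = 0; a < d; ++a)
      direction[a] = next[a] / norm;

    // For a unit-length input vector, ||cov * v|| tends to the top
    // eigenvalue as v converges.
    eigenvalue = norm;
  }
  return eigenvalue;
}
```

 A real implementation would want all eigenpairs, sorted by eigenvalue,
 which is what a proper symmetric eigensolver gives you.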

 You can generate a random big dataset with MATLAB or Octave by just
 calling `randn(big number, big number)` and we can get timing comparisons.
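
 If you would rather generate and time everything from C++, something like
 the following would do.  It is only a sketch: it uses `std::chrono`
 instead of the mlpack Timer class (whose interface isn't shown here), and
 `RandomNormal()` and `TimeSeconds()` are hypothetical helper names.

```cpp
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Fill a rows x cols matrix with standard normal samples, in the spirit of
// randn(rows, cols).  A fixed seed makes timing runs repeatable.
Matrix RandomNormal(const std::size_t rows,
                    const std::size_t cols,
                    const unsigned int seed)
{
  std::mt19937 generator(seed);
  std::normal_distribution<double> distribution(0.0, 1.0);
  Matrix m(rows, std::vector<double>(cols));
  for (auto& row : m)
    for (double& value : row)
      value = distribution(generator);
  return m;
}

// Run f() once and return the elapsed wall-clock time in seconds.
template<typename Function>
double TimeSeconds(Function&& f)
{
  const auto start = std::chrono::steady_clock::now();
  f();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}
```

 Each candidate can then be wrapped in a lambda and passed to
 `TimeSeconds()` on the same matrix, ideally several times over, since a
 single run is noisy.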

 As for the API, I did have a couple thoughts; I don't like Armadillo's API
 for `princomp()`.  We'll have users who mainly want to use PCA in one of
 two ways: dimensionality reduction and actual PCA.  So basically, they'll
 want reduced-dimension data, or, they'll want all the principal components
 back.  I think this is a good API to give (I haven't included good
 comments, just enough to get the idea across):

 {{{
 class PCA
 {
  public:
   // Constructors and such

   // For someone who wants dimensionality reduction.  We modify the dataset
   // directly.
   void Apply(arma::mat& data, const size_t newDimension);

   // For someone who wants more information.
   void Apply(const arma::mat& data,
              arma::mat& transformedData,
              arma::vec& eigenvalues);

   // And for someone who wants even more.
   void Apply(const arma::mat& data,
              arma::mat& transformedData,
              arma::vec& eigenvalues,
              arma::mat& coeffs);
 };
 }}}

 I think that's reasonable, but if you have other opinions or ideas, paste
 away.

 Thanks!

-- 
Ticket URL: <http://trac.research.cc.gatech.edu/fastlab/ticket/126#comment:7>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed by the FASTLAB at Georgia Tech under Dr. Alex Gray.