[mlpack-svn] [MLPACK] #126: Implement simple PCA

MLPACK Trac trac at coffeetalk-1.cc.gatech.edu
Fri Nov 11 16:49:16 EST 2011

#126: Implement simple PCA
  Reporter:  rcurtin                           |        Owner:  ajinkya   
      Type:  wishlist                          |       Status:  assigned  
  Priority:  major                             |    Milestone:  MLPACK 1.0
 Component:  MLPACK                            |   Resolution:            
  Keywords:  pca kernel_pca covariance method  |     Blocking:  47        
Blocked By:                                    |  

Comment (by rcurtin):

 Excellent, looks great.  I did have some thoughts though; I think you can
 use the Timer class to help figure this out.

 Calling Armadillo's `princomp(..., trans(data))` is going to do this:

  * Transpose the data into a temporary before `princomp()` is called
  * Transpose the data again (into another temporary) to calculate the
 covariance with `X^T * X`

 A long time back I wrote a function that I wanted to see accepted into
 Armadillo, but it hasn't been yet.  It's called `arma::ccov()` and it can
 be found in `src/mlpack/core/arma_extend/fn_ccov.hpp` (it is automatically
 included by `mlpack/core.h`).  It computes the covariance of the transposed
 data matrix without actually forming the transpose.

 If I called `arma::cov(trans(data))`, it would do the two levels of
 transposition, just like `princomp()`.  However, if I call
 `arma::ccov(data)`, it calculates `X * X^T`, not `X^T^T * X^T`.
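
 To make the difference concrete, here is a minimal sketch in plain C++
 (no Armadillo, so it stays self-contained) of a covariance computed
 directly on data whose columns are the points.  `Matrix` and
 `ColumnCovariance` are hypothetical names for illustration; this is not
 the actual `arma::ccov()` code, and the centering and `n - 1`
 normalization are assumptions about what a covariance routine would do.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense matrix type: data[d][i] is dimension d of observation i,
// so each column is one point.
using Matrix = std::vector<std::vector<double>>;

// Covariance between dimensions (rows) when observations are stored as columns.
// This evaluates (Xc * Xc^T) / (n - 1), with Xc the row-centered data, by
// reading the rows of the input directly; no transposed copy is ever built.
Matrix ColumnCovariance(const Matrix& data)
{
  const std::size_t dims = data.size();
  assert(dims > 0);
  const std::size_t n = data[0].size();
  assert(n > 0);

  // Mean of each dimension across all observations.
  std::vector<double> mean(dims, 0.0);
  for (std::size_t d = 0; d < dims; ++d)
  {
    assert(data[d].size() == n);  // every row needs one entry per point
    for (std::size_t i = 0; i < n; ++i)
      mean[d] += data[d][i];
    mean[d] /= static_cast<double>(n);
  }

  // Unbiased normalization (n - 1), falling back to 1 for a single point.
  const double norm = (n > 1) ? static_cast<double>(n - 1) : 1.0;

  // The result is symmetric, so only the upper triangle is computed.
  Matrix cov(dims, std::vector<double>(dims, 0.0));
  for (std::size_t a = 0; a < dims; ++a)
  {
    for (std::size_t b = a; b < dims; ++b)
    {
      double sum = 0.0;
      for (std::size_t i = 0; i < n; ++i)
        sum += (data[a][i] - mean[a]) * (data[b][i] - mean[b]);
      cov[a][b] = cov[b][a] = sum / norm;
    }
  }
  return cov;
}
```

 The only point of the sketch is that the inner loop walks along rows of
 the original matrix, which is why no temporary transpose is needed.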

 So what I'm getting at in the end is, could we see a speedup by calling

 arma::mat cov = ccov(data);
 // call eig() manually and do all that stuff

 instead of just

 princomp(..., trans(data))

 You can generate a big random dataset with MATLAB or Octave by just
 calling `randn(big number, big number)`, and then we can get timing
 comparisons.
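
 A timing comparison along those lines could be sketched as below.  This
 is only an illustration: it swaps in `std::chrono` instead of the MLPACK
 Timer class and `std::normal_distribution` instead of `randn()` so that
 it stays self-contained, and `Matrix`, `RandomNormal` and `TimeSeconds`
 are hypothetical names.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical dense matrix type, stored as a vector of rows.
using Matrix = std::vector<std::vector<double>>;

// Fill a rows x cols matrix with standard normal samples, similar in spirit
// to randn(rows, cols).  A fixed seed keeps the runs reproducible.
Matrix RandomNormal(std::size_t rows, std::size_t cols, unsigned seed)
{
  assert(rows > 0 && cols > 0);  // an empty matrix gives nothing to time
  std::mt19937 gen(seed);
  std::normal_distribution<double> dist(0.0, 1.0);
  Matrix m(rows, std::vector<double>(cols));
  for (auto& row : m)
    for (double& x : row)
      x = dist(gen);
  return m;
}

// Run f() once and return the wall-clock time it took, in seconds.
template<typename F>
double TimeSeconds(F&& f)
{
  const auto start = std::chrono::steady_clock::now();
  f();
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}
```

 Wrapping each candidate (the `princomp()` call and the manual
 covariance-plus-`eig()` path) in `TimeSeconds` on the same matrix would
 give directly comparable numbers.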

 As for the API, I did have a couple thoughts; I don't like Armadillo's API
 for `princomp()`.  We'll have users who mainly want to use PCA in one of
 two ways: dimensionality reduction and actual PCA.  So basically, they'll
 want reduced-dimension data, or, they'll want all the principal components
 back.  I think this is a good API to give (I haven't included good
 comments, just enough to get the idea across):

 class PCA
 {
  public:
   // Constructors and such

   // For someone who wants dimensionality reduction.  We modify the
   // dataset directly.
   void Apply(arma::mat& data, const size_t newDimension);

   // For someone who wants more information.
   void Apply(const arma::mat& data, arma::mat& transformedData,
              arma::vec& eigenvalues);

   // And for someone who wants even more.
   void Apply(const arma::mat& data, arma::mat& transformedData,
              arma::vec& eigenvalues, arma::mat& coeffs);
 };
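
 For the dimensionality-reduction case, the work left after the
 eigendecomposition is a projection onto the leading components.  Here is
 a minimal sketch in plain C++ (no Armadillo), assuming the components
 are already available as columns sorted by decreasing eigenvalue and
 that the data is mean-centered first.  `Matrix` and `ReduceDimension`
 are hypothetical names, not part of the proposed class.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dense matrix type: m[r][c] is row r, column c.
using Matrix = std::vector<std::vector<double>>;

// data:   dims x n, one observation per column.
// coeffs: dims x dims, column k is the k-th principal component
//         (assumed to be sorted by decreasing eigenvalue already).
// Returns the newDimension x n matrix of projections of the centered data
// onto the first newDimension components.
Matrix ReduceDimension(const Matrix& data,
                       const Matrix& coeffs,
                       std::size_t newDimension)
{
  const std::size_t dims = data.size();
  assert(dims > 0 && coeffs.size() == dims && newDimension <= dims);
  const std::size_t n = data[0].size();
  assert(n > 0);

  // Center each dimension (assumption: PCA on mean-subtracted data).
  std::vector<double> mean(dims, 0.0);
  for (std::size_t d = 0; d < dims; ++d)
  {
    for (std::size_t i = 0; i < n; ++i)
      mean[d] += data[d][i];
    mean[d] /= static_cast<double>(n);
  }

  // reduced[k][i] = dot(component k, centered observation i).
  Matrix reduced(newDimension, std::vector<double>(n, 0.0));
  for (std::size_t k = 0; k < newDimension; ++k)
    for (std::size_t d = 0; d < dims; ++d)
      for (std::size_t i = 0; i < n; ++i)
        reduced[k][i] += coeffs[d][k] * (data[d][i] - mean[d]);
  return reduced;
}
```

 With a helper like this, the two-argument `Apply()` could be layered on
 top of the fuller overloads: compute everything once, then keep only the
 first `newDimension` rows.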

 I think that's reasonable, but if you have other opinions or ideas, paste


Ticket URL: <http://trac.research.cc.gatech.edu/fastlab/ticket/126#comment:7>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed by the FASTLAB at Georgia Tech under Dr. Alex Gray.
