[mlpack-svn] [MLPACK] #126: Implement simple PCA
MLPACK Trac
trac at coffeetalk-1.cc.gatech.edu
Fri Nov 11 16:49:16 EST 2011
#126: Implement simple PCA
-----------------------------------------------+----------------------------
Reporter: rcurtin | Owner: ajinkya
Type: wishlist | Status: assigned
Priority: major | Milestone: MLPACK 1.0
Component: MLPACK | Resolution:
Keywords: pca kernel_pca covariance method | Blocking: 47
Blocked By: |
-----------------------------------------------+----------------------------
Comment (by rcurtin):
Excellent, looks great. I did have some thoughts on performance, though,
and I think you can use the Timer class to help measure it.
A call to Armadillo's `princomp(..., trans(data))` is going to do this:
 * Transpose the data into a temporary matrix before `princomp()` is
 called.
 * Transpose the data again (into another temporary) to calculate the
 covariance with `X^T * X`.
A long time back I wrote a function that I wanted to see accepted into
Armadillo, but it hasn't been yet. It's called `arma::ccov()` and it can
be found in `src/mlpack/core/arma_extend/fn_ccov.hpp` (it is automatically
included by `mlpack/core.h`). The function computes the covariance of the
transposed data matrix without actually forming the transpose.
If I call `arma::cov(trans(data))`, it does the two levels of
transposition, just like `princomp()`. However, if I call
`arma::ccov(data)`, it calculates `X * X^T` directly, not `X^T^T * X^T`.
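To make the difference concrete, here is a minimal sketch in plain C++ (no
Armadillo, so it compiles on its own) of computing the covariance directly
from a matrix whose columns are observations, without forming any
transposed copy. The `Matrix` struct and `ColumnCovariance()` are
hypothetical names for illustration only; this is not the code in
`fn_ccov.hpp`, and it assumes the usual unbiased normalization by n - 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical column-major matrix: element (r, c) is values[c * rows + r].
// Each column is one observation; each row is one dimension.
struct Matrix
{
  std::size_t rows;
  std::size_t cols;
  std::vector<double> values;

  double& operator()(std::size_t r, std::size_t c)
  {
    return values[c * rows + r];
  }
  double operator()(std::size_t r, std::size_t c) const
  {
    return values[c * rows + r];
  }
};

// Hypothetical helper (not the Armadillo or MLPACK implementation):
// covariance between the rows of `data`, treating columns as observations.
// The input is read in place; no transposed copy is created.
Matrix ColumnCovariance(const Matrix& data)
{
  assert(data.cols > 1);
  const std::size_t d = data.rows;
  const std::size_t n = data.cols;

  // Mean of each dimension across all observations.
  std::vector<double> mean(d, 0.0);
  for (std::size_t c = 0; c < n; ++c)
    for (std::size_t r = 0; r < d; ++r)
      mean[r] += data(r, c);
  for (std::size_t r = 0; r < d; ++r)
    mean[r] /= static_cast<double>(n);

  // Accumulate (x - mean) * (x - mean)^T one observation at a time.
  Matrix cov{d, d, std::vector<double>(d * d, 0.0)};
  for (std::size_t c = 0; c < n; ++c)
    for (std::size_t j = 0; j < d; ++j)
      for (std::size_t i = 0; i < d; ++i)
        cov(i, j) += (data(i, c) - mean[i]) * (data(j, c) - mean[j]);

  // Unbiased normalization (assumed here): divide by n - 1.
  for (double& v : cov.values)
    v /= static_cast<double>(n - 1);
  return cov;
}
```

The result is a d x d matrix for d-dimensional data, which is the shape an
eigendecomposition for PCA would need when observations are stored as
columns.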
So what I'm getting at in the end is, could we see a speedup by calling
this:
{{{
arma::mat covariance = arma::ccov(data);
// call arma::eig_sym() manually and do all that stuff
}}}
instead of just
{{{
princomp(..., trans(data))
}}}
You can generate a random big dataset with MATLAB or Octave by just
calling `randn(big number, big number)` and we can get timing comparisons.
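If it is easier to stay in C++ for the comparison, a small harness along
these lines could work. This is only a sketch using the standard library:
`RandomNormal()` and `BestTimeSeconds()` are hypothetical helpers made up
for this example, and `std::chrono` stands in for the MLPACK Timer class
because that class's interface is not shown in this ticket.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper: fill a rows x cols buffer with standard normal
// samples, similar in spirit to randn(rows, cols).
std::vector<double> RandomNormal(std::size_t rows,
                                 std::size_t cols,
                                 unsigned seed)
{
  assert(rows > 0 && cols > 0);
  std::mt19937 gen(seed);
  std::normal_distribution<double> dist(0.0, 1.0);
  std::vector<double> values(rows * cols);
  for (double& v : values)
    v = dist(gen);
  return values;
}

// Hypothetical helper: run `work` `trials` times and return the fastest
// wall-clock time in seconds.  Taking the minimum reduces noise from
// other processes on the machine.
template<typename Function>
double BestTimeSeconds(Function work, int trials)
{
  assert(trials > 0);
  double best = -1.0;
  for (int t = 0; t < trials; ++t)
  {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    const double elapsed =
        std::chrono::duration<double>(stop - start).count();
    if (best < 0.0 || elapsed < best)
      best = elapsed;
  }
  return best;
}
```

Each candidate (the `princomp(..., trans(data))` call and the `ccov()`
plus manual eigendecomposition path) would be wrapped in its own lambda
and passed to `BestTimeSeconds()` on the same dataset, so the two numbers
are directly comparable.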
As for the API, I did have a couple thoughts; I don't like Armadillo's API
for `princomp()`. We'll have users who mainly want to use PCA in one of
two ways: dimensionality reduction and actual PCA. So basically, they'll
want reduced-dimension data, or, they'll want all the principal components
back. I think this is a good API to give (I haven't included good
comments, just enough to get the idea across):
{{{
class PCA
{
 public:
  // Constructors and such

  // For someone who wants dimensionality reduction.  We modify the dataset
  // directly.
  void Apply(arma::mat& data, const size_t newDimension);

  // For someone who wants more information.
  void Apply(const arma::mat& data,
             arma::mat& transformedData,
             arma::vec& eigenvalues);

  // And for someone who wants even more.
  void Apply(const arma::mat& data,
             arma::mat& transformedData,
             arma::vec& eigenvalues,
             arma::mat& coeffs);
};
}}}
I think that's reasonable, but if you have other opinions or ideas, paste
away.
Thanks!
--
Ticket URL: <http://trac.research.cc.gatech.edu/fastlab/ticket/126#comment:7>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed by the FASTLAB at Georgia Tech under Dr. Alex Gray.