[mlpack-svn] [MLPACK] #361: kernel pca unexpected behaviour
MLPACK Trac
trac at coffeetalk-1.cc.gatech.edu
Fri Aug 15 17:29:35 EDT 2014
#361: kernel pca unexpected behaviour
-----------------------+----------------------------------------------------
Reporter: ftrovato | Owner:
Type: defect | Status: new
Priority: major | Milestone:
Component: mlpack | Resolution:
Keywords: | Blocking:
Blocked By: |
-----------------------+----------------------------------------------------
Comment (by marcus):
Hello Fabio,
[[BR]]
> I plotted the first two columns of outfile.txt and compared with the
first two in the case of the regular pca. However the results are
different: (i) the data are not centered on the x-axis and (ii) the
overall shape is different.
There was an error in the way we transformed the data; it is fixed in
r17032. If you plot the components, be aware that the sign is not
deterministic. However, the absolute values should be the same.
Note that a negated principal component is still the same principal
component: it describes the same one-dimensional subspace.
You can fix the sign by adjusting the columns of u and the rows of v such
that the largest element in each column has a positive sign. If you
like, I can send you the necessary lines.
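The sign fix described above can be sketched as follows. This is a minimal NumPy illustration of the general technique (not mlpack's actual code), assuming `u` holds the left singular vectors as columns and `v` the right singular vectors as rows:

```python
import numpy as np

def fix_svd_signs(u, v):
    """Flip signs so the largest-magnitude element in each column of u is positive.

    u: (m, k) left singular vectors (columns); v: (k, n) right singular vectors (rows).
    The corresponding row of v is flipped too, so u @ diag(s) @ v is unchanged.
    """
    for j in range(u.shape[1]):
        i = np.argmax(np.abs(u[:, j]))  # index of the largest-magnitude entry
        if u[i, j] < 0:
            u[:, j] *= -1
            v[j, :] *= -1
    return u, v
```

Because each column of `u` and the matching row of `v` are flipped together, the reconstruction `u @ diag(s) @ v` is left exactly as it was; only the (arbitrary) signs of the components become deterministic.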
I've attached a plot that shows the results of the different methods;
except for the kernel PCA using the Nyström method, all results are the
same. I used the keller4 (2 x 5100) dataset with a linear kernel.
Regarding the centering: there is a big difference between centering
across samples and centering across features. The MLPACK PCA
implementation centers across samples, which is the standard as far as I
know.
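As a tiny illustration of what centering across samples means (a hypothetical NumPy sketch with made-up numbers, using one data point per column as in MLPACK's convention):

```python
import numpy as np

# Hypothetical example: x holds one data point per column (3 samples, 2 features).
x = np.array([[1.0, 2.0, 3.0],
              [4.0, 6.0, 8.0]])

# Centering across samples: subtract each feature's mean over all points,
# so every row (feature) of the centered data has zero mean.
centered = x - x.mean(axis=1, keepdims=True)
```

Centering across features would instead subtract a per-sample mean (`axis=0`), which is a different operation and generally not what PCA expects.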
Regarding the slowdown: you are dealing with high-dimensional data using
the naive (standard) method, which means you need to calculate the
complete kernel matrix (48000 x 48000 doubles, roughly 18.4 GB). I guess
you don't have enough physical memory / swap space to hold the complete
matrix. Linux starts swapping before the RAM is filled up, which results
in a slowdown. The Nyström method uses a subset of the data as a basis to
reconstruct an approximation of the kernel matrix, so you don't need to
calculate the complete kernel matrix (48000 x 48000).
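The Nyström idea can be sketched like this for a linear kernel. This is a rough NumPy illustration of the general technique (K is approximated by C W⁺ Cᵀ, where C is the kernel between all points and a landmark subset and W is the kernel among the landmarks), not mlpack's implementation:

```python
import numpy as np

def nystroem_approx(x, m, rng=None):
    """Approximate the linear-kernel matrix K = x.T @ x using m random landmarks.

    x: (d, n) data with one point per column. Returns the (n, n) approximation
    C @ pinv(W) @ C.T without ever forming the full kernel matrix directly.
    """
    rng = np.random.default_rng(rng)
    n = x.shape[1]
    idx = rng.choice(n, size=m, replace=False)    # random landmark subset
    c = x.T @ x[:, idx]                           # (n, m) kernel: all points vs landmarks
    w = x[:, idx].T @ x[:, idx]                   # (m, m) kernel among landmarks
    return c @ np.linalg.pinv(w) @ c.T
```

With m landmarks you only ever compute an n x m and an m x m kernel block, so memory grows with n*m instead of n², which is the whole point for a 48000-sample dataset.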
[[BR]]
>
> 3) I tried to perform kernel PCA with other types of kernels. Still,
the data are never centered. The different kernels give different
results, but within each kernel, the calculation is rather insensitive to
the parameters of the specific kernel. I expected a bit of variation; why
do I not observe it, even though I use parameter values that are orders
of magnitude different?
>
Can you test this again with the fix introduced in r17032?
I hope this is helpful; if instead I've just made things more confusing,
let us know.
Thanks,
Marcus
--
Ticket URL: <http://trac.research.cc.gatech.edu/fastlab/ticket/361#comment:2>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed at Georgia Tech.