[mlpack-svn] [MLPACK] #361: kernel pca unexpected behaviour

MLPACK Trac trac at coffeetalk-1.cc.gatech.edu
Fri Aug 15 17:29:35 EDT 2014


#361: kernel pca unexpected behaviour
-----------------------+----------------------------------------------------
  Reporter:  ftrovato  |        Owner:     
      Type:  defect    |       Status:  new
  Priority:  major     |    Milestone:     
 Component:  mlpack    |   Resolution:     
  Keywords:            |     Blocking:     
Blocked By:            |  
-----------------------+----------------------------------------------------

Comment (by marcus):

 Hello Fabio,

 [[BR]]

 > I plotted the first two columns of outfile.txt and compared with the
 first two in the case of the regular pca. However the results are
 different: (i) the data are not centered on the x-axis and (ii) the
 overall shape is different.


 There was an error in the way we transformed the data. It is fixed in
 r17032. If you plot the components, be aware that the sign is not
 deterministic; the absolute values, however, should be the same.
 Note that a negated principal component is still the same principal
 component: it describes the same one-dimensional subspace.
 You can fix the sign by adjusting the columns of u and the rows of v such
 that the largest element in each column has a positive sign. If you
 like, I can send you the necessary lines.
 I've attached a plot that shows the results of the different methods;
 except for the KPCA using the Nystroem method, all results are the same.
 I've used the keller4 (2 x 5100) dataset with a linear kernel.

 Regarding the centering, there is a big difference between centering
 across samples and centering across features. MLPACK's PCA implementation
 centers across samples, and that is the standard as far as I know.

 Regarding the slowdown: you are dealing with high-dimensional data using
 the naive (standard) method, which means you need to compute the complete
 kernel matrix (48000 x 48000), more than 18 GB. I guess you don't have
 enough physical memory / swap space to hold the complete matrix. Linux
 starts swapping before the RAM is filled up, which results in a slowdown.
 The Nystroem method uses a subset of the data as a basis to reconstruct
 an approximation of the kernel matrix, so you don't need to calculate the
 complete kernel matrix (48000 x 48000).
 [[BR]]
 >
 > 3) I tried to perform kernel pca with other types of kernels. Still,
 data are never centered. The different kernels give different results,
 but within each kernel, the calculation is rather insensitive to the
 parameters of the specific kernel. I expected a bit of variation; why do
 I not observe it even though I use parameter values that are orders of
 magnitude different?
 >

 Can you test this again with the fix introduced in r17032?

 I hope this is helpful; if instead I've just made things more confusing,
 let us know.

 Thanks,
 Marcus

-- 
Ticket URL: <http://trac.research.cc.gatech.edu/fastlab/ticket/361#comment:2>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed at Georgia Tech.


More information about the mlpack-svn mailing list