[mlpack-svn] [MLPACK] #361: kernel pca unexpected behaviour

Thu Aug 14 18:59:15 EDT 2014

#361: kernel pca unexpected behaviour
----------------------+-----------------------------------------------------
 Reporter:  ftrovato  |        Owner:     
     Type:  defect    |       Status:  new
 Priority:  major     |    Milestone:     
Component:  mlpack    |     Keywords:     
 Blocking:            |   Blocked By:     
----------------------+-----------------------------------------------------
 Dear mlpack developers,
 I have tried some tests using kernel_pca on a 48000 (rows) x 16 (cols)
 data matrix.

 1) In the first test I tried to reproduce the results obtained with
 "regular" PCA:
 kernel_pca -i inpfile.txt --center -k linear -n -o outfile.txt

 This is the output:

 [INFO ] Loading 'inpfile.txt' as raw ASCII formatted data.  Size is 16 x
 48000.
 [INFO ] Saving raw ASCII formatted data to 'outfile.txt'.
 [INFO ]
 [INFO ] Execution parameters:
 [INFO ]   bandwidth: 1
 [INFO ]   center: true
 [INFO ]   degree: 1
 [INFO ]   help: false
 [INFO ]   info: ""
 [INFO ]   input_file: inpfile.txt
 [INFO ]   kernel: linear
 [INFO ]   kernel_scale: 1
 [INFO ]   new_dimensionality: 0
 [INFO ]   nystroem_method: true
 [INFO ]   offset: 0
 [INFO ]   output_file: outfile.txt
 [INFO ]   sampling: kmeans
 [INFO ]   verbose: true
 [INFO ]   version: false
 [INFO ]
 [INFO ] Program timers:
 [INFO ]   loading_data: 0.458363s
 [INFO ]   saving_data: 0.524596s
 [INFO ]   total_time: 1.295479s

 I plotted the first two columns of outfile.txt and compared them with the
 first two columns produced by regular PCA. The results differ: (i) the
 data are not centered on the x-axis and (ii) the overall shape is
 different.
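
 To make the expectation in (1) concrete: with a linear kernel,
 double-centering the kernel matrix gives the same matrix as applying the
 linear kernel to mean-centered data, so -k linear with --center would be
 expected to match regular PCA up to the sign and scaling of each
 component. The following is a minimal Python sketch on made-up toy points,
 not mlpack code; the helper names gram, center_points and double_center
 are hypothetical. It also checks that every row of the centered matrix
 sums to zero, which is the property that makes the output of exact
 (non-Nystroem) kernel PCA zero-mean.

```python
# Sketch only: toy data and helper names are illustrative assumptions.

def gram(points):
    """Linear-kernel Gram matrix: K[i][j] = <x_i, x_j>."""
    return [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]

def center_points(points):
    """Subtract the per-dimension mean from every point."""
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    return [[p[d] - mean[d] for d in range(dim)] for p in points]

def double_center(K):
    """K_c[i][j] = K[i][j] - row_mean[i] - col_mean[j] + total_mean."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - col_mean[j] + total_mean
             for j in range(n)] for i in range(n)]

# Four made-up 2-D points (one point per inner list).
points = [[1.0, 2.0], [3.0, -1.0], [0.5, 4.0], [-2.0, 0.0]]

K_centered = double_center(gram(points))
K_of_centered = gram(center_points(points))

# Largest entry-wise difference between the two matrices.
max_diff = max(abs(a - b)
               for row_a, row_b in zip(K_centered, K_of_centered)
               for a, b in zip(row_a, row_b))

# Largest absolute row sum of the centered kernel matrix.
max_row_sum = max(abs(sum(row)) for row in K_centered)

print(max_diff, max_row_sum)  # both should be zero up to rounding error
```

 If the output of kernel_pca with these options is visibly off-center,
 either the centering is not applied as above or the Nystroem
 approximation changes it; the sketch cannot tell which.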

 Is this the expected behaviour for the options I used? I tried the same
 run without the -n option, but the calculation is too slow to finish.
 While it is running without -n, I can barely move the mouse cursor. I am
 working on a workstation (Ubuntu 14.04). Is there a reason for such a
 heavy slowdown of the whole machine? At least in my case, running without
 -n seems impractical, although I assume it would be more accurate.
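
 A rough estimate of why the run without -n may be so heavy, assuming (not
 verified against the mlpack source) that the non-Nystroem path forms the
 full dense n x n kernel matrix in double precision. The variable names are
 illustrative only.

```python
# Back-of-the-envelope memory estimate for a dense kernel matrix.
# Assumption: one 8-byte double per entry, n x n entries, no sparsity.

n_points = 48000          # number of data points reported in this ticket
bytes_per_double = 8

kernel_bytes = n_points * n_points * bytes_per_double
kernel_gb = kernel_bytes / 1e9   # decimal gigabytes

print(f"{kernel_gb:.1f} GB")  # prints "18.4 GB"
```

 If that assumption holds, the kernel matrix alone may exceed the RAM of a
 typical workstation and push the system into swap, which would be
 consistent with the unresponsive desktop.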

 2) Starting from (1), I varied the sampling scheme with --sampling. I see
 some differences in shape, but in no case do the data resemble those of
 regular PCA.

 3) I tried kernel PCA with other types of kernels. The data are still
 never centered. Different kernels give different results, but within each
 kernel the output is rather insensitive to that kernel's parameters. I
 expected some variation; why do I not observe it even though I use
 parameter values that differ by orders of magnitude?

 For example, with the polynomial kernel the result is always the same no
 matter what degree is used: the files are identical whether I use
 --degree 1000, 0.00001, or the default.
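
 For reference, a minimal Python sketch of why identical output files are
 surprising. It assumes the polynomial kernel has the usual form
 K(x, y) = (<x, y> + offset)^degree; this is my reading of the --degree and
 --offset options, not something checked against the mlpack source, and the
 function name polynomial_kernel and the two toy vectors are hypothetical.

```python
# Sketch only: assumed kernel form K(x, y) = (<x, y> + offset) ** degree.

def polynomial_kernel(x, y, degree=1.0, offset=0.0):
    """Polynomial kernel value for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + offset) ** degree

# Two made-up vectors; their dot product is 1*0.5 + 2*1.5 = 3.5.
x = [1.0, 2.0]
y = [0.5, 1.5]

linear_value = sum(a * b for a, b in zip(x, y))

# Kernel value for a few degrees, with offset left at 0.
values = {d: polynomial_kernel(x, y, degree=d) for d in (1, 2, 3)}

for d, v in values.items():
    print(d, v)
```

 With degree 1 and offset 0 the value equals the plain dot product, so the
 polynomial kernel coincides with the linear kernel; for any other degree
 the kernel entries change, and the projected data would normally change
 with them. Byte-identical files therefore suggest that the degree given on
 the command line is not reaching the kernel.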

 Here is an example:

 kernel_pca -i inpfile.txt --center -k polynomial --sampling kmeans -n -o
 outfile.txt --verbose

 [INFO ] Loading 'inpfile.txt' as raw ASCII formatted data.  Size is 16 x
 48000.
 [INFO ] Saving raw ASCII formatted data to 'outfile.txt'.
 [INFO ]
 [INFO ] Execution parameters:
 [INFO ]   bandwidth: 1
 [INFO ]   center: true
 [INFO ]   degree: 1
 [INFO ]   help: false
 [INFO ]   info: ""
 [INFO ]   input_file: inpfile.txt
 [INFO ]   kernel: polynomial
 [INFO ]   kernel_scale: 1
 [INFO ]   new_dimensionality: 0
 [INFO ]   nystroem_method: true
 [INFO ]   offset: 0
 [INFO ]   output_file: outfile.txt
 [INFO ]   sampling: kmeans
 [INFO ]   verbose: true
 [INFO ]   version: false
 [INFO ]
 [INFO ] Program timers:
 [INFO ]   loading_data: 0.410814s
 [INFO ]   saving_data: 0.457318s
 [INFO ]   total_time: 0.981326s

 I might be missing something, so I would appreciate any feedback you can
 give me.

 Thank you for your help,
 Fabio

-- 
Ticket URL: <http://trac.research.cc.gatech.edu/fastlab/ticket/361>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed at Georgia Tech.