[mlpack-svn] [MLPACK] #361: kernel pca unexpected behaviour
MLPACK Trac
trac at coffeetalk-1.cc.gatech.edu
Thu Aug 14 18:59:15 EDT 2014
#361: kernel pca unexpected behaviour
----------------------+-----------------------------------------------------
Reporter: ftrovato | Owner:
Type: defect | Status: new
Priority: major | Milestone:
Component: mlpack | Keywords:
Blocking: | Blocked By:
----------------------+-----------------------------------------------------
Dear mlpack developers,
I have tried some tests using kernel_pca on a 48000 (rows) x 16 (cols)
data matrix.
1) In the first test I tried to reproduce the results obtained with
"regular" PCA:
kernel_pca -i inpfile.txt --center -k linear -n -o outfile.txt
This is the output:
[INFO ] Loading 'inpfile.txt' as raw ASCII formatted data. Size is 16 x
48000.
[INFO ] Saving raw ASCII formatted data to 'outfile.txt'.
[INFO ]
[INFO ] Execution parameters:
[INFO ] bandwidth: 1
[INFO ] center: true
[INFO ] degree: 1
[INFO ] help: false
[INFO ] info: ""
[INFO ] input_file: inpfile.txt
[INFO ] kernel: linear
[INFO ] kernel_scale: 1
[INFO ] new_dimensionality: 0
[INFO ] nystroem_method: true
[INFO ] offset: 0
[INFO ] output_file: outfile.txt
[INFO ] sampling: kmeans
[INFO ] verbose: true
[INFO ] version: false
[INFO ]
[INFO ] Program timers:
[INFO ] loading_data: 0.458363s
[INFO ] saving_data: 0.524596s
[INFO ] total_time: 1.295479s
I plotted the first two columns of outfile.txt and compared them with the
first two columns of the regular PCA output. The results differ: (i) the
data are not centered on the x-axis and (ii) the overall shape is
different.
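As a rough check on the expectation behind test (1), the sketch below (plain Python, made-up data, hypothetical helper names) shows that double-centering a linear kernel matrix gives the same matrix as evaluating the linear kernel on mean-centered data. That identity is why exact kernel PCA with a centered linear kernel would be expected to match regular PCA, up to sign and possibly scaling. It does not model the Nystroem approximation enabled by -n, and nothing here is taken from mlpack's code.

```python
# Toy check: double-centering a linear kernel matrix equals the Gram matrix
# of mean-centered data, which is the matrix regular PCA effectively uses.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def linear_gram(points):
    # Linear kernel matrix: K[i][j] = <x_i, x_j>.
    return [[dot(p, q) for q in points] for p in points]

def center_points(points):
    # Subtract the per-dimension mean from every point.
    n, dim = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    return [[p[j] - mean[j] for j in range(dim)] for p in points]

def double_center(K):
    # K_c[i][j] = K[i][j] - row_mean[i] - col_mean[j] + total_mean.
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - col_mean[j] + total_mean
             for j in range(n)] for i in range(n)]

# Made-up 2-D points, one point per row (not the data from this ticket).
points = [[1.0, 2.0], [3.0, 0.5], [-2.0, 4.0], [0.0, -1.0]]

centered_kernel = double_center(linear_gram(points))
gram_of_centered = linear_gram(center_points(points))

# Largest element-wise difference; only floating-point round-off remains.
max_diff = max(abs(a - b)
               for row_a, row_b in zip(centered_kernel, gram_of_centered)
               for a, b in zip(row_a, row_b))
print(max_diff)
```

If the two matrices agree but the projected output is still off-center, the difference would have to come from somewhere else, for example from the approximation step.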
Is this the expected behaviour given the options I used? I tried to run
the same test without the -n option, but the calculation is too slow to
finish. In addition, while the calculation without -n is running, I can
barely move the mouse cursor. I am working on a workstation (Ubuntu
14.04). Is there a reason for such a heavy slowdown of the whole machine?
At least in my case, running without -n seems very difficult, although I
guess it would be more accurate.
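One plausible contributor to the slowdown is the size of the full kernel matrix, which has one entry per pair of points. The back-of-the-envelope sketch below assumes dense double-precision storage (an assumption, not something confirmed from mlpack's source) and uses a hypothetical sample count for the Nystroem case purely for comparison.

```python
# Rough memory estimate for a dense kernel matrix over all points.
n_points = 48000         # number of data points reported in this ticket
bytes_per_value = 8      # assumption: double-precision entries

full_kernel_bytes = n_points * n_points * bytes_per_value
full_kernel_gib = full_kernel_bytes / 2**30
print(f"full kernel matrix: {full_kernel_gib:.1f} GiB")  # 17.2 GiB

# Nystroem-style approximation: only n_points x n_samples kernel values.
n_samples = 100          # hypothetical value, not a known mlpack default
approx_bytes = n_points * n_samples * bytes_per_value
approx_mib = approx_bytes / 2**20
print(f"n x m kernel block: {approx_mib:.1f} MiB")
```

A matrix of that size would exceed the RAM of many workstations and push the system into swap, which would be consistent with an unresponsive desktop.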
2) Relative to (1), I varied the sampling scheme by specifying --sampling.
The shape changes somewhat, but in no case do the data resemble those of
regular PCA.
3) I tried kernel PCA with other types of kernels. The data are still
never centered. Different kernels give different results, but within each
kernel the result is rather insensitive to that kernel's parameters. I
expected some variation; why do I not observe any, even though I use
parameter values that differ by orders of magnitude?
For example, with the polynomial kernel the result is always the same no
matter what degree is used: the output files are identical whether I use
--degree 1000, --degree 0.00001, or the default.
Here is an example:
kernel_pca -i inpfile.txt --center -k polynomial --sampling kmeans -n -o
outfile.txt --verbose
[INFO ] Loading 'inpfile.txt' as raw ASCII formatted data. Size is 16 x
48000.
[INFO ] Saving raw ASCII formatted data to 'outfile.txt'.
[INFO ]
[INFO ] Execution parameters:
[INFO ] bandwidth: 1
[INFO ] center: true
[INFO ] degree: 1
[INFO ] help: false
[INFO ] info: ""
[INFO ] input_file: inpfile.txt
[INFO ] kernel: polynomial
[INFO ] kernel_scale: 1
[INFO ] new_dimensionality: 0
[INFO ] nystroem_method: true
[INFO ] offset: 0
[INFO ] output_file: outfile.txt
[INFO ] sampling: kmeans
[INFO ] verbose: true
[INFO ] version: false
[INFO ]
[INFO ] Program timers:
[INFO ] loading_data: 0.410814s
[INFO ] saving_data: 0.457318s
[INFO ] total_time: 0.981326s
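To illustrate why some sensitivity to the degree would be expected, here is a minimal sketch of the textbook polynomial kernel, (x . y + offset)^degree. The function name and the exact parameterization are assumptions for illustration and are not taken from mlpack. Note that with degree 1 and offset 0, the values reported in the run above, this form reduces to the linear kernel.

```python
def polynomial_kernel(x, y, degree=1.0, offset=0.0):
    # Textbook form (x . y + offset) ** degree; assumed, not mlpack's code.
    return (sum(a * b for a, b in zip(x, y)) + offset) ** degree

# Made-up vectors with x . y = 1.0 * 0.5 + 2.0 * 1.5 = 3.5.
x = [1.0, 2.0]
y = [0.5, 1.5]

linear_value = sum(a * b for a, b in zip(x, y))
default_value = polynomial_kernel(x, y)             # degree 1, offset 0
cubic_value = polynomial_kernel(x, y, degree=3.0)   # 3.5 ** 3

print(linear_value, default_value, cubic_value)
```

Under this form, changing the degree changes every kernel value, so identical output files for very different degrees would suggest that the parameter is not reaching the kernel evaluation.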
I might be missing something, but I would appreciate any feedback you
could give me.
Thank you for your help,
Fabio
--
Ticket URL: <http://trac.research.cc.gatech.edu/fastlab/ticket/361>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed at Georgia Tech.