[mlpack] sparse coding test examples in mlpack

Jianyu Huang hjyahead at gmail.com
Fri Jun 5 20:42:12 EDT 2015


Hi Ryan,

Thanks so much for the reply! It really helps!

1.
The output I am getting is like the following:
[DEBUG] RA Search  [0x7ffe1d9e5740]
[DEBUG]   Reference Set: 40x40
[DEBUG]   Metric:
[DEBUG]     LMetric [0x7ffe1d9e58a4]
[DEBUG]       Power: 2
[DEBUG]       TakeRoot: false
[DEBUG] Sparse Coding  [0x7ffe1d9e5780]
[DEBUG]   Data: 40x40
[DEBUG]   Atoms: 3
[DEBUG]   Lambda 1: 0.1
[DEBUG]   Lambda 2: 0

This appears near the end of the DEBUG output log, in what looks like
a summary section. I can see that the actual Sparse Coding log for the
input data "mnist_first250_training_4s_and_9s.arm" appears earlier,
before that summary part:
-------------------------------------------------------------------------------------
[DEBUG] Optimization at point 0.
[DEBUG] Optimization at point 100.
[DEBUG] Optimization at point 200.
[DEBUG] Optimization at point 300.
[DEBUG] Optimization at point 400.
[DEBUG] Optimization at point 0.
[DEBUG] Optimization at point 100.
[DEBUG] Optimization at point 200.
[DEBUG] Optimization at point 300.
[DEBUG] Optimization at point 400.
[DEBUG] Optimization at point 0.
[DEBUG] Optimization at point 100.
[DEBUG] Optimization at point 200.
[DEBUG] Optimization at point 300.
[DEBUG] Optimization at point 400.
[DEBUG] Solving Dual via Newton's Method.
[DEBUG] Newton Method iteration 1:
[DEBUG]   Gradient norm: 4.80046.
...
...
[DEBUG] Newton Method iteration 49:
[DEBUG]   Gradient norm: 3.11503e-10.
[DEBUG]   Improvement: 0.
------------------------------------------------------------------------------------

But just out of curiosity, what is the "40x40" input data shown in the
summary part?

2.
Thanks for pointing out that mistake! Sorry, I am not familiar with
Armadillo.

3.
Thanks! But just out of curiosity: if I set the data to a small random
matrix like
1,0,0,0
0,3,0,0
3,0,1,0
0,4,0,0
0,0,5,0
0,0,3,7

and I run "./sparse_coding -i data_bak2.csv -k 6 -l 1 -d dict.csv -c
codes.csv -n 10 -v" multiple times.
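
(For reference, here is roughly what I believe the C++ API equivalent
of that command looks like -- a minimal sketch based on my reading of
the sparse_coding headers; the exact Encode() signature is my
assumption.)

----
#include <mlpack/core.hpp>
#include <mlpack/methods/sparse_coding/sparse_coding.hpp>

using namespace arma;
using namespace mlpack;
using namespace mlpack::sparse_coding;

int main()
{
  // data::Load() transposes CSV input, so each column becomes a point.
  mat X;
  data::Load("data_bak2.csv", X, true);

  // -k 6 -l 1: 6 atoms, lambda1 = 1.
  SparseCoding<> sc(X, 6, 1.0);

  // -n 10: at most 10 alternating iterations (I assume Encode() takes
  // the maximum iteration count).
  sc.Encode(10);

  data::Save("dict.csv", sc.Dictionary());
  data::Save("codes.csv", sc.Codes());
}
----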

Sometimes I can get output smoothly, but sometimes I get the following
error:

-------------------------------------------------------------------------------------------------------
[DEBUG] Newton Method iteration 49:
[DEBUG]   Gradient norm: 1.94598.
[DEBUG]   Improvement: 0.
[INFO ]   Objective value: 27.9256.
[INFO ] Performing coding step...
[DEBUG] Optimization at point 0.
[INFO ]   Sparsity level: 22.2222%.
[INFO ]   Objective value: 20.6886 (improvement 1.79769e+308).
[INFO ] Iteration 2 of 10.
[INFO ] Performing dictionary step...
[WARN ] There are 1 inactive atoms. They will be re-initialized randomly.
[DEBUG] Solving Dual via Newton's Method.

error: solve(): solution not found

terminate called after throwing an instance of 'std::runtime_error'
  what():  solve(): solution not found
Aborted (core dumped)

 ------------------------------------------------------------------------------------------------------

Do you have any insights about what is wrong here?
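
In case it is useful, the workaround I am experimenting with is to
catch the exception and retry with a fresh random dictionary. This is
just a sketch; it assumes the std::runtime_error from solve()
propagates out of Encode(), and that re-constructing SparseCoding<>
redraws the random initial dictionary:

----
#include <mlpack/core.hpp>
#include <mlpack/methods/sparse_coding/sparse_coding.hpp>
#include <stdexcept>

using namespace arma;
using namespace mlpack::sparse_coding;

// Try the full sparse coding optimization up to five times, retrying
// whenever the Newton step's solve() throws.
bool TryEncode(const mat& X, const size_t atoms, const double lambda1,
               const size_t maxIterations, mat& D, mat& Z)
{
  for (size_t attempt = 0; attempt < 5; ++attempt)
  {
    try
    {
      SparseCoding<> sc(X, atoms, lambda1);
      sc.Encode(maxIterations);
      D = sc.Dictionary();
      Z = sc.Codes();
      return true;
    }
    catch (const std::runtime_error& /* e */)
    {
      // "solve(): solution not found" -- retry with a new random
      // initial dictionary.
    }
  }
  return false;
}
----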

4.
It looks like mlpack only implements a naive approach to sparse
coding, i.e. a Cholesky-based implementation of the LARS-Lasso
algorithm for the sparse coding step, and Newton's method for the
Lagrangian dual in the dictionary step. So mlpack doesn't actually
implement the feature-sign search algorithm from Honglak Lee's
"Efficient sparse coding algorithms" (NIPS 2006) paper. Am I wrong
here? Also, it looks like the online algorithm from Julien Mairal's
"Online Dictionary Learning for Sparse Coding" (ICML 2009) paper,
which scikit-learn adopts, is more efficient. Do you have plans to add
those sparse coding approaches?
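
(For context -- if I read src/mlpack/methods/sparse_coding correctly,
the objective being alternated over is roughly the elastic-net
regularized problem below, in LaTeX notation; the 1/2 scaling factors
are my assumption:)

----
\min_{D,Z} \; \tfrac{1}{2}\|X - DZ\|_F^2
  + \lambda_1 \sum_i \|z_i\|_1
  + \tfrac{\lambda_2}{2} \sum_i \|z_i\|_2^2
\qquad \text{s.t.}\quad \|d_j\|_2 \le 1 \;\;\forall j
----

The LARS-Lasso step minimizes this over Z with D fixed, and the Newton
step solves the Lagrangian dual of the minimization over D under the
unit-norm constraints.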

5.
I also have a question about the parallel performance of Sparse Coding
in mlpack. When I run the command-line interface ("./sparse_coding
..."), it looks like only one core is utilized, but when I run the
same thing through the C++ API, all four cores of my CPU are utilized.
Searching the whole package, I didn't find any "openmp" or "pthread"
keywords, so my guess is that the speedup comes from a parallel
MKL/BLAS. Am I wrong here? Do you have any idea why I get different
parallel behavior from the CLI and the API?
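
(My plan to check this: run "ldd" on both binaries to see which BLAS
they actually link against, and set MKL_NUM_THREADS=1 or
OPENBLAS_NUM_THREADS=1 to see whether the multi-core behavior
disappears. I haven't tried this yet, so it is only a guess.)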

Thank you very much!
Jianyu


On Thu, Jun 4, 2015 at 8:21 AM, Ryan Curtin <ryan at ratml.org> wrote:

> On Wed, Jun 03, 2015 at 08:52:11AM -0700, Jianyu Huang wrote:
> > Hi all,
> >
> > I am new to sparse coding, and I am trying to use mlpack to test
> > sparse coding. I have successfully installed mlpack on an Ubuntu
> > 14.04 64-bit machine.
>
> Hi Jianyu,
>
> I'll do my best to answer the questions. :)
>
> > 1.
> >
> > I run "make test". So if I understand correctly, I am running
> > "bin/mlpack_test" actually. I check the DEBUG/WARN log output for the
> test.
> > In the beginning, it shows "Running 402 test cases...". In almost the
> end,
> > I find "Sparse Coding" is examined in this test. It shows the Data is
> > 40x40, Atoms is 3, Lambda 1 is 0.1 and Lambda 2 is 0.
> >
> > So what data set is tested here? Where can I get this “40x40” data?
>
> You are correct -- 'make test' runs the 'bin/mlpack_test' program.
> However, I'm not sure where the output you are referring to is.  Can you
> paste exactly the output you are getting?  Then maybe I will be able to
> figure it out.  As far as I know, though, the sparse coding tests use
> the mnist_first250_training_4s_and_9s.arm dataset.
>
> > 2.
> >
> > I copied src/mlpack/tests/sparse_coding_test.cpp to a separate cpp
> > file, and tried to remove all the macros depending on
> > boost/test/unit_test. So the new test cpp file looks like:
> >
> > #include <mlpack/core.hpp>
> > #include <mlpack/methods/sparse_coding/sparse_coding.hpp>
> >
> > using namespace arma;
> > using namespace mlpack::sparse_coding;
> >
> > int main() {
> >   double lambda1 = 0.1;
> >   uword nAtoms = 25;
> >
> >   mat X;
> >   X.load("mnist_first250_training_4s_and_9s.arm");
> >   uword nPoints = X.n_cols;
> >
> >   // Normalize each point since these are images.
> >   for (uword i = 0; i < nPoints; ++i) {
> >     X.col(i) /= norm(X.col(i), 2);
> >   }
> >
> >   SparseCoding<> sc(X, nAtoms, lambda1);
> >   sc.OptimizeCode();
> >
> >   mat D = sc.Dictionary();
> >   mat Z = sc.Codes();
> >
> >   for (uword i = 0; i < nPoints; ++i)
> >   {
> >     vec errCorr = trans(D) * (D * Z.unsafe_col(i) - X.unsafe_col(i));
> >     // SCVerifyCorrectness() is the checking helper copied from
> >     // sparse_coding_test.cpp.
> >     SCVerifyCorrectness(Z.unsafe_col(i), errCorr, lambda1);
> >   }
> > }
> >
> > However, it shows the following error:
> >
> > -------------------------------------------------------------------------------------------
> > Mat::load(): couldn't read mnist_first250_training_4s_and_9s.arm
> > error: Mat::col(): index out of bounds
> > terminate called after throwing an instance of 'std::logic_error'
> >   what():  Mat::col(): index out of bounds
> > Aborted (core dumped)
> >
> > -------------------------------------------------------------------------------------------
> >
> > I checked that the data "mnist_first250_training_4s_and_9s.arm" is
> > only 3 MB, so it should not exceed the "4 billion elements"
> > restriction for Armadillo built without the ARMA_64BIT_WORD
> > configuration. Do you have any idea why this error happens?
>
> You probably don't have the dataset file in the working directory of
> your program.  Based on the error output, that's what it looks like.
>
> > Also, how can I visualize/show the data in
> > "mnist_first250_training_4s_and_9s.arm"? I don't know what the data
> > looks like.
>
> This file is in Armadillo binary format to save space.  You could
> convert it to CSV with the following simple program, and then inspect
> it with whatever tools you like:
>
> ----
> #include <mlpack/core.hpp>
>
> using namespace arma;
> using namespace mlpack;
>
> int main() {
>   mat X;
>   X.load("mnist_first250_training_4s_and_9s.arm");
>
>   data::Save("mnist_first250_training_4s_and_9s.csv", X);
> }
> ----
>
> > 3.
> >
> > The command line interface for sparse_coding is as the following,
> >
> > $ sparse_coding -i data.csv -k 200 -l 0.1 -d dict.csv -c codes.csv
> >
> > Could you give me an example of data.csv? Sorry, I don't know what
> > the input of sparse coding should look like.
>
> data.csv should be a comma-separated values file where each row
> represents one observation/point and each column represents one
> feature/dimension.  As an example, here are the first ten lines of
> LCDM_q.csv, which is a dataset containing 3-dimensional objects
> collected from the Sloan Digital Sky Survey:
>
> $ head ~/datasets/LCDM_q.csv
> 73.1708,100.713,8.93208
> 66.1034,33.8976,66.7139
> 73.5393,130.28,55.328
> 99.751,43.9025,99.2587
> 98.783,79.4761,78.0526
> 23.808,81.3255,12.287
> 89.1525,90.8523,37.9072
> 74.1535,68.5934,9.56997
> 44.0054,79.9218,0.937222
> 29.7863,132.141,59.5095
>
> I hope this is helpful; if there's anything I've written that's unclear,
> I'm happy to elaborate.
>
> Thanks,
>
> Ryan
>
> --
> Ryan Curtin    | "Hungry."
> ryan at ratml.org |   - Sphinx
>