[mlpack-svn] [MLPACK] #300: allknn fails for mnist8m dataset

MLPACK Trac trac at coffeetalk-1.cc.gatech.edu
Thu Aug 15 17:04:05 EDT 2013


#300: allknn fails for mnist8m dataset
----------------------+-----------------------------------------------------
  Reporter:  rozyang  |        Owner:  rcurtin 
      Type:  defect   |       Status:  accepted
  Priority:  major    |    Milestone:          
 Component:  mlpack   |   Resolution:          
  Keywords:           |     Blocking:          
Blocked By:           |  
----------------------+-----------------------------------------------------

Comment (by rcurtin):

 Ok, I can reproduce this on a system with 16GB of RAM.  The debug output
 is slightly more helpful in this case:

 {{{
 [DEBUG] Compiled with debugging symbols.
 [INFO ] Loading '/home/ryan/mnist8m.csv' as CSV data.
 error: Mat::init(): requested size is too large

 terminate called after throwing an instance of 'std::logic_error'
   what():

 Program received signal SIGABRT, Aborted.
 }}}

 This output does not appear in the non-debug version; the debugging checks
 are compiled out when Armadillo is built with -DNDEBUG (as mlpack releases
 are by default).

 As a side note, mnist8m.csv as you gave it has 784 lines (points) and 8.1M
 columns (features).  I think that the dataset should have 8.1M points each
 with 784 features (dimensions), so I changed the csvwrite() command to
 transpose X before saving.  But that shouldn't affect the problem you've
 reported.

 So, there are a few workarounds.

  * Buy more RAM.  This might be infeasible; realistically, I think you'll
 need 32GB or 48GB if you want allknn to run without problems.

  * Use sparse matrices, assuming that the dataset is sparse.  mlpack does
 not have loaders for sparse data at this time, so you'd have to write one
 or wait for me to write the code to do it.  The easiest way would be with
 HDF5.  This may be slower than dense-matrix kd-tree searching, depending
 on about a million different factors (mostly the sparsity of the input
 data).  I don't know how sparse the MNIST featureset is.  A rough sketch
 of one way to do the loading is given after this list.

  * Use mmap() to avoid actually loading the matrix into memory, and use
 one of the advanced arma::mat constructors to force the matrix to use the
 mmap()'ed memory.  This is going to be much slower because of the disk
 accesses.  A rough sketch follows this list; I can give more detail if you
 actually want to go this route.

  * Run on a smaller set of data.  This is probably not an option; if I had
 to guess, you chose mlpack because it supports large datasets.  It does,
 but only datasets that fit in RAM, unless you want to do some of the
 hackery I've detailed above.

 I am thinking about a way to potentially handle the out-of-memory
 situation without a segfault and without compromising the speed of the
 internal Armadillo code.  I'm not entirely sure it's possible.

 Let me know which of those four options you want to go with, and I can
 provide more advice and potentially some relevant code.  Unfortunately,
 for any of those options except the last, we'll have to deal with C++, so
 the nice allknn executable won't really be able to help us here.

-- 
Ticket URL: <http://trac.research.cc.gatech.edu/fastlab/ticket/300#comment:3>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed at Georgia Tech.