[mlpack-svn] [MLPACK] #300: allknn fails for mnist8m dataset
MLPACK Trac
trac at coffeetalk-1.cc.gatech.edu
Thu Aug 15 17:04:05 EDT 2013
#300: allknn fails for mnist8m dataset
----------------------+-----------------------------------------------------
Reporter: rozyang | Owner: rcurtin
Type: defect | Status: accepted
Priority: major | Milestone:
Component: mlpack | Resolution:
Keywords: | Blocking:
Blocked By: |
----------------------+-----------------------------------------------------
Comment (by rcurtin):
Ok, I can reproduce this on a system with 16GB of RAM. The debug output
is slightly more helpful in this case:
{{{
[DEBUG] Compiled with debugging symbols.
[INFO ] Loading '/home/ryan/mnist8m.csv' as CSV data.
error: Mat::init(): requested size is too large
terminate called after throwing an instance of 'std::logic_error'
what():
Program received signal SIGABRT, Aborted.
}}}
This output does not appear in the non-debug version; the debugging
checks are not compiled in when Armadillo is built with -DNDEBUG (as
mlpack releases are by default).
As a side note, mnist8m.csv as you gave it has 784 lines (points) and
8.1M columns (features). I think the dataset should instead have 8.1M
points, each with 784 features (dimensions), so I changed the
csvwrite() command to transpose X before saving. But that shouldn't
affect the problem you've reported.
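In case it's useful, the same transpose-and-save step can be done from
C++ with Armadillo directly. Here's a minimal sketch; the small
stand-in matrix and the filenames are placeholders, not the real
mnist8m.csv:
{{{
#include <armadillo>

int main()
{
  // Stand-in for the original file: features as rows, points as
  // columns (4 x 10 instead of 784 x 8.1M).
  arma::mat X(4, 10, arma::fill::randu);
  X.save("data.csv", arma::csv_ascii);

  // Load, transpose so each line is one point, and save back.
  arma::mat Y;
  Y.load("data.csv", arma::csv_ascii);
  arma::mat Yt = Y.t();
  Yt.save("data_transposed.csv", arma::csv_ascii);

  return 0;
}
}}}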
So, there are a few workarounds.
* Buy more RAM. This might be infeasible; realistically I think you'll
need 32GB or 48GB if you want allknn to run without problems.
* Use sparse matrices, assuming that the dataset is sparse. mlpack has
no loaders for sparse data at this time, so you'd have to write one or
wait for me to write the code to do it. The easiest way would be with
HDF5. This may be slower than dense-matrix kd-tree searching,
depending on about a million different factors---mostly the sparsity of
the input data. I don't know how sparse the MNIST feature set is.
* Use mmap() to avoid actually loading the matrix into memory, and use
one of the advanced arma::mat() constructors to force the matrix to use
the mmap()'ed memory. This is going to be much slower because of the
disk accesses; I won't go into full detail unless you actually want to
go this route.
* Run on a smaller set of data. This is probably not an option, because
if I had to guess, you probably chose mlpack because it supports large
datasets. It does---but only ones that fit in RAM, unless you want to do
some of the hackery I've detailed above.
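To give an idea of what a hand-rolled sparse loader would look like,
here's a rough sketch that reads a made-up "row col value" coordinate
format and uses Armadillo's batch-insertion sp_mat constructor. The
file format, the function name, and passing the dimensions as
arguments are all my assumptions here---this isn't anything mlpack
provides:
{{{
#include <armadillo>
#include <fstream>
#include <string>
#include <vector>

// Sketch of a coordinate-format loader: each line of the input file
// is "row col value" for one nonzero element.
arma::sp_mat LoadCoordFile(const std::string& filename,
                           const arma::uword nRows,
                           const arma::uword nCols)
{
  std::ifstream f(filename);
  std::vector<arma::uword> rows, cols;
  std::vector<double> vals;

  arma::uword r, c;
  double v;
  while (f >> r >> c >> v)
  {
    rows.push_back(r);
    cols.push_back(c);
    vals.push_back(v);
  }

  // Batch insertion: 'locations' is a 2 x nnz matrix of (row, col)
  // pairs; this is far faster than inserting elements one at a time.
  arma::umat locations(2, rows.size());
  for (size_t i = 0; i < rows.size(); ++i)
  {
    locations(0, i) = rows[i];
    locations(1, i) = cols[i];
  }

  return arma::sp_mat(locations, arma::vec(vals), nRows, nCols);
}
}}}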
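And for the mmap() option, here's a bare-bones sketch of the shape of
it, so you can judge whether it's worth pursuing. This assumes you've
already converted the CSV to a raw binary file of column-major
doubles; the filename and sizes are placeholders:
{{{
#include <armadillo>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

int main()
{
  const arma::uword nRows = 784;
  const arma::uword nCols = 8100000;

  const int fd = open("mnist8m.bin", O_RDONLY);
  if (fd == -1) return 1;

  const size_t bytes = nRows * nCols * sizeof(double);
  double* mem = (double*) mmap(NULL, bytes, PROT_READ, MAP_SHARED,
                               fd, 0);
  if (mem == MAP_FAILED) return 1;

  // Advanced constructor: copy_aux_mem = false makes Armadillo use
  // the mapped memory directly; strict = true keeps it from ever
  // reallocating.  Pages are faulted in from disk on first access.
  const arma::mat X(mem, nRows, nCols, false, true);

  // ... run the search on X here ...

  munmap(mem, bytes);
  close(fd);
  return 0;
}
}}}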
I am thinking about a way to potentially handle the out-of-memory
situation without a segfault and without compromising the speed of the
internal Armadillo code. I'm not entirely sure it's possible.
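As a rough illustration of the direction I mean: even with the size
checks compiled out under -DNDEBUG, an allocation failure still
surfaces as std::bad_alloc from operator new, which can be caught. The
catch is that on Linux, overcommit means a too-large-but-addressable
allocation can appear to succeed and only crash when the memory is
touched, which is part of why I'm not sure a clean solution exists. A
sketch, using a deliberately impossible size:
{{{
#include <armadillo>
#include <iostream>

int main()
{
  arma::mat X;
  try
  {
    // ~8 petabytes of doubles: operator new will throw, so this can
    // be caught instead of segfaulting.
    X.set_size(1000000, 1000000000);
  }
  catch (std::exception& e)
  {
    std::cerr << "allocation failed: " << e.what() << std::endl;
    return 1;
  }
  return 0;
}
}}}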
Let me know which of those four ways you want to go and I can provide
more advice and potentially some relevant code. Unfortunately, for any
of those options except the last, we'll have to deal with C++, so the
nice allknn executable won't really be able to help us here.
--
Ticket URL: <http://trac.research.cc.gatech.edu/fastlab/ticket/300#comment:3>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed at Georgia Tech.