[mlpack-git] [mlpack/mlpack] Density Estimation Tree made sparse-enabled (#802)

Tue Nov 1 12:09:57 EDT 2016

Ok, great---I am merging a few other things that needed manual intervention today, but I will get to this one when I am done with those.

> I wish I could agree on the ExtractSort issue, but there are sparse-specific problems, in short - row_col_iterator will not report the zero values, but the split decision algorithm does need them - to make a split attempt between a zero and the adjacent non-zero element. The all-universal implementation with sparse enabled arma::sort and then index-based iteration will be correct, but terribly slow because of all the zeroes, that will be reported.

Ah, you are right, I had overlooked that.  I think it is possible to get _pretty_ close to completely generic but I agree now, it is not completely possible.

> My implementation on arma::sort is as part of MLPack's arma extensions. It is a less-than-a-page source code, which I can even paste into a comment, I guess :-) However, if it's really better to go the PR to Armadillo way - I can do it, I just don't know exactly when.

Hm, here is an idea, what do you think?  You can submit a PR to add it to `arma_extend.hpp` and then add tests for it in `src/mlpack/tests/`, and then I'll extract the code and submit it upstream to Armadillo and CC you on the email (since Armadillo is not on Github).  Then when that gets accepted I'll put `#if` guards around the `arma_extend` implementation to make sure that the method is only included when the Armadillo version doesn't already have it.

> Regarding the coordinate list sparse matrix implementation - isn't the CSC storage faster when it comes to column-based iteration? Anything that is fast on both rows and columns directed iteration will be fine.

For iteration, it's a toss-up.  For CSC you have to maintain a variable holding which column you are in as well as which index you are at in the row indices and values vectors, and when you switch columns it can take a few iterations to get the correct new column ID.  For coordinate lists you always have the column available and only need to maintain the index that you are at.

As for the coordinate list implementation, it may be many weeks before I actually have that ready to go.  My current priorities aren't totally pointed towards this, but it is definitely a priority in the long run to get better, faster, and more usable sparse matrix support for mlpack (which means contributions to Armadillo).

> Regarding the format loading - I'd say that the format to be supported is Matrix Market, which is a coordinate list with the header. Such format will also save one swipe through the input file, which is happening now - for obtaining the matrix size, and which technically is not even quite correct, because if you have zeroes at the last columns/rows you will not have entries for these in the coordinate list and will deduce a false matrix size.

Ah, that should not be hard.  You can take `diskio::load_coord_ascii()` and `diskio::save_coord_ascii()` and adapt them easily to produce methods like `diskio::load_mm()` and `diskio::save_mm()` and we could take the same approach I outlined for the `sort()` method to include it in mlpack and then also push it upstream to Armadillo.  Let me know what you think.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/802#issuecomment-257609265
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mailman.cc.gatech.edu/pipermail/mlpack-git/attachments/20161101/a2bad10f/attachment.html>