[mlpack-git] [mlpack/mlpack] DatasetMapper & Imputer (#694)

Wed Jul 13 10:33:12 EDT 2016

> I am not sure which result rcurtin is referring to here either. I guess he means that we should keep the DatasetMapper mapping results when we load the new data.

We can handle this particular issue outside of this PR.  What I will do is write a test to ensure that the functionality is what I am hoping for, and I'll let you know when that is done.  I think that the current code will fail that test but I am not sure.  In essence, the issue is that I might want to load a training set, followed by loading a test set.  I need to be assured that when I load the test set, the mappings will be the same as for the training set, otherwise when I run my machine learning algorithm on the test set, the results will be garbage since the mappings are different than what the algorithm was trained with.
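To make the concern concrete, here is a minimal sketch of the property such a test would check. `ToyMapper` and `MapColumn` are hypothetical stand-ins written only for this illustration; they are not mlpack's `DatasetMapper` and say nothing about how it behaves today.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a category-to-index mapper (not DatasetMapper).
class ToyMapper
{
 public:
  // Return the index for `value`, assigning the next free index the first
  // time the value is seen.
  std::size_t Map(const std::string& value)
  {
    const auto it = forward.find(value);
    if (it != forward.end())
      return it->second;
    const std::size_t index = reverse.size();
    forward.emplace(value, index);
    reverse.push_back(value);
    return index;
  }

  // Return the string that was assigned to `index`.
  const std::string& Unmap(const std::size_t index) const
  {
    assert(index < reverse.size());
    return reverse[index];
  }

 private:
  std::unordered_map<std::string, std::size_t> forward;
  std::vector<std::string> reverse;
};

// Map every string in `column` using the given mapper, which is updated
// whenever a previously unseen value appears.
std::vector<std::size_t> MapColumn(ToyMapper& mapper,
                                   const std::vector<std::string>& column)
{
  std::vector<std::size_t> out;
  out.reserve(column.size());
  for (const std::string& s : column)
    out.push_back(mapper.Map(s));
  return out;
}
```

If the training column `{"red", "green", "red", "blue"}` and the test column `{"blue", "red"}` are mapped with the same `ToyMapper`, "blue" and "red" get the same indices in both sets. Mapping the test column with a fresh mapper assigns indices in order of first appearance in the test set, so "blue" no longer matches its training index.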

> (I guess this is fundamentally inevitable?)

It's certainly inevitable that we need to iterate over the matrix; if the matrix is column-major and we wish to impute only in one dimension, then yes, there is no way to avoid that.  But what I am focusing on is how we can reduce iterations over the matrix.  Consider the example where we want to median-impute over all dimensions, the matrix is column-major, and we want a separate output matrix:

```
extern arma::mat data; // Let's say this is very large.
arma::mat imputedData;
MedianImputation im;
// Impute dimension 0 first; this overload also creates the output matrix.
im.Impute(data, imputedData, 0 /* map 0s */, 0);
// Impute the remaining dimensions in place, one call per dimension.
for (size_t i = 1; i < imputedData.n_rows; ++i)
  im.Impute(imputedData, 0 /* map 0s */, i);
```

Theoretically I should be able to do this entire imputation in only two passes over the matrix: once to calculate the median without missing elements (and collect the missing element indices), and once to apply the median to the missing elements.
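A rough sketch of that two-pass idea follows, using a plain `std::vector<double>` as a column-major matrix so that it does not depend on Armadillo or on the PR's classes. `MedianImputeAll` and its parameters are hypothetical names for illustration only, and the missing marker is assumed to be an ordinary value that can be compared with `==` (so not NaN).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not mlpack's API.  `data` is a column-major matrix
// with `rows` dimensions and `cols` points.  Every entry equal to `missing`
// is replaced by the median of the non-missing entries in its dimension.
void MedianImputeAll(std::vector<double>& data,
                     const std::size_t rows,
                     const std::size_t cols,
                     const double missing)
{
  assert(data.size() == rows * cols);

  std::vector<std::vector<double>> present(rows);
  std::vector<std::size_t> targets;  // Linear indices of missing elements.

  // Pass 1: walk the matrix in storage order, gathering the non-missing
  // values of each dimension and the positions of the missing elements.
  for (std::size_t i = 0; i < data.size(); ++i)
  {
    if (data[i] == missing)
      targets.push_back(i);
    else
      present[i % rows].push_back(data[i]);  // Row index of element i.
  }

  // Compute one median per dimension from the gathered copies; the matrix
  // itself is not read again here.
  std::vector<double> medians(rows, missing);
  for (std::size_t r = 0; r < rows; ++r)
  {
    std::vector<double>& v = present[r];
    if (v.empty())
      continue;  // No observed values; the marker is left in place.
    const std::size_t mid = v.size() / 2;
    std::nth_element(v.begin(), v.begin() + mid, v.end());
    double median = v[mid];
    if (v.size() % 2 == 0)
      median = (median + *std::max_element(v.begin(), v.begin() + mid)) / 2.0;
    medians[r] = median;
  }

  // Pass 2: only the recorded missing positions are written.
  for (const std::size_t i : targets)
    data[i] = medians[i % rows];
}
```

The price of this version is extra memory: the gathered values are a second copy of the non-missing data. A mean imputation would only need running sums and counts per dimension.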

But the code above takes (1 + 2d) passes over the matrix, where d is the number of dimensions, which is a lot more: the first pass copies the input matrix to the output matrix, and then two passes are made for each dimension.  It would be much better to make two changes:

 * The overload of `Impute()` that gives a separate output matrix should not copy the input matrix to the output matrix, but instead impute directly into the output matrix, and copy elements as needed.  I guess in that case there is no need to store the `targets` because you will have to make a second pass over all of the elements of the matrix.  (The overload that does not take a separate output matrix should remain.)

 * `Impute()` should allow imputation in all dimensions, and then it could apply imputation for all dimensions at once, instead of needing to take many passes over the matrix.  Maybe it makes sense to make this a new overload of `Impute()`; I am not sure.
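Taken together, the two changes might look something like the sketch below. It is only an illustration under assumptions: `MedianOf` and `MedianImputeInto` are hypothetical names, the matrix is a plain column-major `std::vector<double>` rather than `arma::mat`, and the missing marker is assumed to be comparable with `==`. The output is never pre-copied, and no list of targets is kept because the second pass visits every element anyway.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: median of a non-empty vector (reorders its argument).
double MedianOf(std::vector<double>& v)
{
  assert(!v.empty());
  const std::size_t mid = v.size() / 2;
  std::nth_element(v.begin(), v.begin() + mid, v.end());
  if (v.size() % 2 == 1)
    return v[mid];
  return (v[mid] + *std::max_element(v.begin(), v.begin() + mid)) / 2.0;
}

// Hypothetical separate-output variant that imputes all dimensions at once.
// `input` is column-major with `rows` dimensions and `cols` points.
void MedianImputeInto(const std::vector<double>& input,
                      std::vector<double>& output,
                      const std::size_t rows,
                      const std::size_t cols,
                      const double missing)
{
  assert(input.size() == rows * cols);

  // Pass 1: read the input once, gathering non-missing values per dimension.
  std::vector<std::vector<double>> present(rows);
  for (std::size_t i = 0; i < input.size(); ++i)
    if (input[i] != missing)
      present[i % rows].push_back(input[i]);

  // One median per dimension; a dimension with no observed values keeps
  // the missing marker.
  std::vector<double> medians(rows, missing);
  for (std::size_t r = 0; r < rows; ++r)
    if (!present[r].empty())
      medians[r] = MedianOf(present[r]);

  // Pass 2: write each output element exactly once, either copying the
  // input value or substituting the median of its dimension.
  output.resize(input.size());
  for (std::size_t i = 0; i < input.size(); ++i)
    output[i] = (input[i] == missing) ? medians[i % rows] : input[i];
}
```

With this shape the pass count no longer grows with the number of dimensions: one read of the input and one write of the output, plus the per-dimension median work on the gathered values.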

I'm sorry to be so picky about this, but we are essentially implementing the same functionality as scikit-learn here, and it's hard to construct a good justification to use mlpack's implementation if it is slower than scikit's, so we need to make sure it is fast.

---
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/694#issuecomment-232374029