[mlpack-git] [mlpack/mlpack] DatasetMapper & Imputer (#694)

Ryan Curtin notifications at github.com
Fri Jul 22 11:26:51 EDT 2016


I hate to be so picky on this point, but I think we can do better than the speedup given by `dirtyDimensions` by providing an overload of `Impute()` that looks in all dimensions.  This way only one pass over the input matrix is necessary:

```
// Version for only one matrix that we operate on in-place.
void Impute(arma::Mat<T>& matrix, ...)
{
  // Get list of dimensions to check.
  arma::Col<size_t> dirtyDimensions = ...; // just like in the main.cpp now

  // Iterate over all columns in the matrix.
  for (size_t c = 0; c < matrix.n_cols; ++c)
  {
    // Check each dirty column.
    for (size_t d = 0; d < dirtyDimensions.n_elem; ++d)
    {
      if (matrix(dirtyDimensions[d], c) == missingValue)
        matrix(dirtyDimensions[d], c) = newValue;
    }
  }
}
```

And the other overload:

```
// Version for one input and one output matrix.
void Impute(const arma::Mat<T>& input, arma::Mat<T>& output, ...)
{
  output.set_size(input.n_rows, input.n_cols);

  for (size_t c = 0; c < input.n_cols; ++c)
  {
    // No need for dirtyDimensions since we must loop over all dimensions.
    for (size_t r = 0; r < input.n_rows; ++r)
    {
      if (input(r, c) == missingValue)
        output(r, c) = newValue;
      else
        output(r, c) = input(r, c);
    }
  }
}
```

If we do things like that, it should be faster than the current approach.  I'm focusing so much on speed here, because other libraries like scikit-learn have this functionality already, and if we are not able to perform the same operations more quickly, there is no compelling reason to use this implementation vs. another one.

Otherwise everything looks fine to me, although I think to fix this issue there will have to be some refactoring.

---
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/694#issuecomment-234574772
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://mailman.cc.gatech.edu/pipermail/mlpack-git/attachments/20160722/866a7759/attachment-0001.html>


More information about the mlpack-git mailing list