[mlpack-git] [mlpack/mlpack] DatasetMapper & Imputer (#694)

Keon Kim notifications at github.com
Mon Jul 18 02:34:58 EDT 2016


@rcurtin @stereomatchingkiss 
I think the overloads that produce an output matrix are now a little more optimized.
The previous method went through the whole matrix again and again when imputing each dimension.

Now the matrix is copied in the same pass that calculates the mean (or median, or any other statistic). The target vector is still kept so that the dimension does not have to be scanned a second time.
So the cost is now (1m + 1t) (copy and calculate + replace) instead of the previous (1m + 1d + 1t) (copy + calculate + replace), where m is the whole matrix, d is one dimension, and t is the target vector. This showed a slight improvement in performance.
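To make the (1m + 1t) idea concrete, here is a minimal sketch of a fused pass. It is not the code from this PR: it uses plain `std::vector` instead of Armadillo, assumes missing values are stored as NaN, and `ImputeMeanFused` and the `Matrix` alias are hypothetical names made up for the illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical layout for this sketch: one inner vector per dimension.
using Matrix = std::vector<std::vector<double>>;

// Copies `input` into a new matrix and, during that same pass, accumulates the
// mean of dimension `dim` and records the positions of its missing (NaN)
// entries in a target vector. Only the targets are visited afterwards.
Matrix ImputeMeanFused(const Matrix& input, std::size_t dim)
{
  Matrix output(input.size());
  std::vector<std::size_t> targets;  // positions to overwrite (the "t" term)
  double sum = 0.0;
  std::size_t count = 0;

  // Single pass over the whole matrix (the "m" term): copy + calculate.
  for (std::size_t r = 0; r < input.size(); ++r)
  {
    output[r].reserve(input[r].size());
    for (std::size_t c = 0; c < input[r].size(); ++c)
    {
      const double v = input[r][c];
      output[r].push_back(v);
      if (r == dim)
      {
        if (std::isnan(v))
          targets.push_back(c);
        else
        {
          sum += v;
          ++count;
        }
      }
    }
  }

  // Replace step: touch only the recorded targets.
  if (count > 0)
  {
    const double mean = sum / static_cast<double>(count);
    for (std::size_t c : targets)
      output[dim][c] = mean;
  }
  return output;
}
```

The input is left untouched, which matches the role of the overload that produces an output matrix; the separate (1d) scan disappears because the statistic is gathered while copying.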

For the executable, however, I made it first check each dimension for mappings, collect the dimensions that have any into a list, `dirtyDimensions`, and apply the imputation method only to those dimensions. When applying the changes with `Impute()`, the executable uses the overload that does not produce an output matrix. This costs (1d + 1t) for every dimension that has missing-value mappings.
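The executable's strategy could look roughly like the sketch below. Again this is only an illustration under assumptions, not the PR's implementation: missing values are assumed to be NaN, the per-dimension mapping counts are passed in as a plain vector rather than queried from `DatasetMapper`, and `DirtyDimensions`, `ImputeMeanInPlace`, and `ImputeDirtyDimensions` are hypothetical helper names.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical layout for this sketch: one inner vector per dimension.
using Matrix = std::vector<std::vector<double>>;

// Returns the indices of dimensions that have at least one mapping.
// `mappingCounts[d]` stands in for "number of mappings in dimension d".
std::vector<std::size_t> DirtyDimensions(
    const std::vector<std::size_t>& mappingCounts)
{
  std::vector<std::size_t> dirty;
  for (std::size_t d = 0; d < mappingCounts.size(); ++d)
    if (mappingCounts[d] > 0)
      dirty.push_back(d);
  return dirty;
}

// In-place mean imputation of one dimension: one scan of the dimension (1d)
// to compute the mean and collect targets, then one pass over targets (1t).
void ImputeMeanInPlace(Matrix& data, std::size_t dim)
{
  std::vector<std::size_t> targets;
  double sum = 0.0;
  std::size_t count = 0;
  for (std::size_t c = 0; c < data[dim].size(); ++c)
  {
    if (std::isnan(data[dim][c]))
      targets.push_back(c);
    else
    {
      sum += data[dim][c];
      ++count;
    }
  }
  if (count == 0)
    return;  // nothing to average; leave the dimension as it is
  const double mean = sum / static_cast<double>(count);
  for (std::size_t c : targets)
    data[dim][c] = mean;
}

// Imputes only the dirty dimensions; clean dimensions are never scanned.
void ImputeDirtyDimensions(Matrix& data,
                           const std::vector<std::size_t>& mappingCounts)
{
  for (std::size_t d : DirtyDimensions(mappingCounts))
    ImputeMeanInPlace(data, d);
}
```

Because no output matrix is produced, the whole-matrix copy (the m term) is skipped entirely, and dimensions without mappings cost nothing.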

Benchmarks:
data: 'imputer.csv' as CSV data.  Size is 400850 x 4.
```
[INFO ] 15970 mappings in dimension 0.
[INFO ] 2646 mappings in dimension 1.
[INFO ] 2646 mappings in dimension 2.
[INFO ] 2661 mappings in dimension 3.
mlpack_preprocess_imputer -i imputer.csv -d 0 -m a -s mean -v 
```

Impute one dimension:
- overload producing output, (1m + 1d + 1t) per dimension: `0.058182s`
- overload producing output, (1m + 1t) per dimension: `0.056293s`
- overload without output, (1d + 1t) per dimension: `0.047528s`

And FYI, imputing all dimensions on the same data:
`mlpack_preprocess_imputer -i imputer.csv -m a -s mean -v `
- overload without output, (1d + 1t) per dimension.
```
[INFO ]   imputation: 0.197194s
[INFO ]   loading_data: 18.417980s
[INFO ]   total_time: 18.616683s
```
I know this is being fixed, but most of the overhead currently comes from loading_data.



---
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/694#issuecomment-233244428