[mlpack-git] [mlpack/mlpack] DatasetMapper & Imputer (#694)

Tham notifications at github.com
Tue Jul 19 16:00:27 EDT 2016


I will try to fix the loading issue this weekend and open a pull
request.

2016-07-18 14:34 GMT+08:00 Keon Kim <notifications at github.com>:

> @rcurtin <https://github.com/rcurtin> @stereomatchingkiss
> <https://github.com/stereomatchingkiss>
> I think the overloads that produce an output matrix are now a little more
> optimized.
> The previous method went through the whole matrix again and again when
> imputing each dimension.
>
> Now the matrix is copied at the same time the mean (or median or anything)
> is calculated, and the target vector is still kept to avoid going through
> the dimension again.
> So the cost becomes (1m + 1t) (copy and calculate + replace) instead of the
> previous (1m + 1d + 1t) (copy + calculate + replace), where m is the whole
> matrix, d is the dimension, and t is the target vector. This showed a slight
> improvement in performance.
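
A minimal sketch of the fused (1m + 1t) idea described above, written against plain `std::vector` rather than the real mlpack or Armadillo types. The names `Matrix` and `ImputeMeanFused` are hypothetical, and the sketch assumes missing values are stored as NaN and that rows are dimensions; it is not the code from the pull request.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a dense matrix: rows are dimensions,
// columns are points.
using Matrix = std::vector<std::vector<double>>;

// Copy `input` into `output` and, during the same pass (1m), accumulate
// the sum of `dimension` and record which columns hold a missing value.
// A second pass over only the recorded targets (1t) writes the mean.
void ImputeMeanFused(const Matrix& input, Matrix& output,
                     std::size_t dimension)
{
  output.assign(input.size(), std::vector<double>());
  std::vector<std::size_t> targets;  // columns to overwrite later
  double sum = 0.0;
  std::size_t count = 0;

  for (std::size_t row = 0; row < input.size(); ++row)
  {
    output[row].reserve(input[row].size());
    for (std::size_t col = 0; col < input[row].size(); ++col)
    {
      const double value = input[row][col];
      output[row].push_back(value);
      if (row != dimension)
        continue;
      if (std::isnan(value))  // assumption: NaN marks a missing value
        targets.push_back(col);
      else
      {
        sum += value;
        ++count;
      }
    }
  }

  if (count == 0)
    return;  // no observed values, so there is no mean to impute

  const double mean = sum / static_cast<double>(count);
  for (std::size_t col : targets)
    output[dimension][col] = mean;
}
```

Keeping the list of target columns is what removes the separate (1d) pass: the replace step touches only the entries that were missing.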
>
> For the executable, however, I made it first check each dimension for
> mappings, put the dimensions that have any into a list of dirtyDimensions,
> and apply the imputation method only to those dimensions. When applying
> the changes using Impute(), the executable uses the overload that does not
> produce an output matrix. This results in (1d + 1t) for every dimension
> that has missing-value mappings.
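
The executable's strategy above could be sketched roughly as follows. This is an illustration only: `FindDirtyDimensions`, `ImputeMeanInPlace`, and `ImputeDirtyDimensions` are hypothetical names, the per-dimension mapping counts are passed in as a plain vector, and missing values are assumed to be stored as NaN.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a dense matrix: rows are dimensions,
// columns are points.
using Matrix = std::vector<std::vector<double>>;

// Collect the dimensions that have at least one missing-value mapping.
// `mappingCounts[d]` is assumed to hold the number of mappings in
// dimension d.
std::vector<std::size_t> FindDirtyDimensions(
    const std::vector<std::size_t>& mappingCounts)
{
  std::vector<std::size_t> dirty;
  for (std::size_t d = 0; d < mappingCounts.size(); ++d)
    if (mappingCounts[d] > 0)
      dirty.push_back(d);
  return dirty;
}

// In-place mean imputation of one dimension: one pass over the
// dimension (1d) to compute the mean and collect targets, then one pass
// over the targets (1t). No output matrix is produced.
void ImputeMeanInPlace(Matrix& data, std::size_t dimension)
{
  std::vector<double>& row = data[dimension];
  std::vector<std::size_t> targets;
  double sum = 0.0;
  std::size_t count = 0;

  for (std::size_t col = 0; col < row.size(); ++col)
  {
    if (std::isnan(row[col]))  // assumption: NaN marks a missing value
      targets.push_back(col);
    else
    {
      sum += row[col];
      ++count;
    }
  }

  if (count == 0)
    return;  // nothing observed in this dimension

  const double mean = sum / static_cast<double>(count);
  for (std::size_t col : targets)
    row[col] = mean;
}

// Impute only the dirty dimensions and report which ones were touched.
std::vector<std::size_t> ImputeDirtyDimensions(
    Matrix& data, const std::vector<std::size_t>& mappingCounts)
{
  const std::vector<std::size_t> dirty = FindDirtyDimensions(mappingCounts);
  for (std::size_t d : dirty)
    ImputeMeanInPlace(data, d);
  return dirty;
}
```

Dimensions without mappings are never traversed, so the total work scales with the number of dirty dimensions rather than with the whole matrix.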
>
> Benchmarks:
> data: 'imputer.csv' as CSV data. Size is 400850 x 4.
>
> [INFO ] 15970 mappings in dimension 0.
> [INFO ] 2646 mappings in dimension 1.
> [INFO ] 2646 mappings in dimension 2.
> [INFO ] 2661 mappings in dimension 3.
> mlpack_preprocess_imputer -i imputer.csv -d 0 -m a -s mean -v
>
> Impute one dimension
>
>    - overload producing output (1m + 1d + 1t) for every dimension:
>    0.058182s
>    - overload producing output (1m + 1t) for every dimension: 0.056293s
>    - overload without producing output (1d + 1t) for every dimension:
>    0.047528s
>
> And FYI, imputing all dimensions on the same data:
> mlpack_preprocess_imputer -i imputer.csv -m a -s mean -v
>
>    - overload without producing output (1d + 1t) for every dimension.
>
> [INFO ]   imputation: 0.197194s
> [INFO ]   loading_data: 18.417980s
> [INFO ]   total_time: 18.616683s
>
> I know this is being fixed, but most of the overhead comes from
> loading_data right now.
>
>> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> <https://github.com/mlpack/mlpack/pull/694#issuecomment-233244428>, or mute
> the thread
> <https://github.com/notifications/unsubscribe-auth/ABt-unjAQz3afitJLBrdVouI7fYzUHAlks5qWx6QgaJpZM4I07W->
> .
>


---
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/694#issuecomment-233748087