[mlpack-git] [mlpack/mlpack] data::load may become an I/O bottleneck? (#707)

cypro666 notifications at github.com
Mon Jun 27 06:00:45 EDT 2016


When I use the mlpack_kmeans command-line tool on a big dataset (a CSV file of about 300 MB):

[INFO ] Loading 'train.csv' as CSV data.  Size is **11 x 10000000.**
[INFO ] Program timers:
[INFO ]   **clustering: 21.957208s**
[INFO ]   computing_neighbors: 0.000669s
[INFO ]   knn: 0.000710s
[INFO ]   **loading_data: 28.429786s**
[INFO ]   saving_data: 0.577348s
[INFO ]   total_time: 51.004174s

As you can see, loading the data takes a long time, even longer than the clustering itself. So I used a simple implementation of my own to read and split the CSV file and initialize the Armadillo matrix; in fact, this should take less than 5 seconds.

The source code of core/data/load_impl.hpp has a lot of room for optimization. You know, sometimes this routine needs to be executed many times, so it would help if loading became faster... :)

Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/issues/707
