[mlpack-git] [mlpack/mlpack] data::load may becomes an io bottleneck? (#707)

Tue Jun 28 22:50:10 EDT 2016

Data mapping is a good idea, but it seems to be a different scheme that needs a prepared DatasetInfo variable.
data::Load has several overloads; the one I am referring to is this:
```cpp
template<typename eT>
bool Load(const std::string& filename,
          arma::Mat<eT>& matrix,
          const bool fatal,
          const bool transpose)
```
This function is a powerful implementation, but when the file's loadType != arma::hdf5_binary, it falls back to arma::Mat::load..., and that is where the trouble starts. Armadillo's load method is not heavily optimized for csv_ascii or raw_ascii files: it is built on C++ std iostream, which is typically slower than C-style stdio.
Calling std::ios::sync_with_stdio(false) may make this a little faster.

For my project, I used a crude workaround as a contingency plan:
```cpp
    bool success;
    switch(loadType) {
    case arma::hdf5_binary:
        success = matrix.load(filename, loadType);
        break;
    case arma::csv_ascii:
    case arma::raw_ascii:
        success = stream_to_matrix(stream, matrix);
        break;
    default:
        success = matrix.load(stream, loadType);
        break;
    }
```
And here is the rough temporary solution behind it:
```cpp
template<typename IStream, typename eT>
inline bool stream_to_matrix(IStream& stream, arma::Mat<eT>& matrix) {
    stream.clear();
    stream.seekg(0, std::ios::beg);
    if (!stream.good()) {  // good() is already false when eofbit or failbit is set
        return false;
    }

    std::string line;
    arma::uword ncol = 0, nrow = 0;

    std::getline(stream, line);
    stream.clear();
    stream.seekg(0, std::ios::beg);

    if (line.empty()) {
        return false;
    }

    boost::trim(line);
    if (boost::ends_with(line, ",")) {
        line.pop_back();
    }

    char delim = ',';
    ncol = std::count(line.begin(), line.end(), ','); // csv
    if (0 == ncol) {
        ncol = std::count(line.begin(), line.end(), '\t'); // tsv
        delim = '\t';
        if (0 == ncol) {
            ncol = std::count(line.begin(), line.end(), ' '); // txt
            delim = ' ';
        }
    }

    if (0 == ncol) {
        ncol = 1;
    } else {
        ncol += 1;
    }

    while (!stream.eof() && stream.good()) {
        std::getline(stream, line);
        if (line.empty()) {
            break;
        }
        ++nrow;
    }
    stream.clear();
    stream.seekg(0, std::ios::beg);

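    // set_size() avoids the element-preserving copy that resize() performs;
    // every element is overwritten below anyway.
    matrix.set_size(nrow, ncol);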

    std::vector<const char*> seps;
    arma::uword i = 0;
    std::cout << "..." << std::endl;

    while (!stream.eof() && stream.good()) {
        std::getline(stream, line);
        boost::trim(line);
        if (line.empty()) {
            break;
        }
        if (cstyle_tokenize(line, delim, seps) != ncol) {
            Log::Warn << "Error line " << i << ": " << line << std::endl;
            return false;
        }
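        for (arma::uword j = 0; j < ncol; ++j) {
            assert(seps[j] && *seps[j]);
            char* end = nullptr;
            matrix(i, j) = std::strtod(seps[j], &end);
            // Validate every token, not only the last one: strtod must stop
            // at the end of the token (string terminator or delimiter).
            if (end == seps[j] || (*end && *end != delim)) {
                Log::Warn << "Error line " << i << ": " << line << std::endl;
                return false;
            }
        }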
        ++i;
        if (!(i % 1000000)) {
            Log::Warn << i << " lines parsed" << std::endl;
        }
    }

    if (i != nrow) {
        return false;
    }

    return true;
}
```
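
The helper `cstyle_tokenize` is not shown above. Purely as an assumption about its behavior, one way it could look is an in-place split that overwrites each delimiter with `'\0'` and returns the token count. The sketch below pairs it with a hypothetical `parse_line` helper that checks every token with `std::strtod`; neither function is taken from mlpack or Armadillo.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// ASSUMED behavior (the original helper is not shown): split `line` in place
// by overwriting each delimiter with '\0', store a pointer to the start of
// every token in `seps`, and return the number of tokens.
inline std::size_t cstyle_tokenize(std::string& line, char delim,
                                   std::vector<const char*>& seps) {
    seps.clear();
    if (line.empty()) {
        return 0;
    }
    char* p = &line[0];
    char* const last = p + line.size();  // points at the string's terminator
    seps.push_back(p);
    for (; p != last; ++p) {
        if (*p == delim) {
            *p = '\0';              // terminate the current token
            seps.push_back(p + 1);  // the next token starts after the delimiter
        }
    }
    return seps.size();
}

// Hypothetical helper: convert one delimited line into doubles.
// Returns false on an empty line or if any token is not fully numeric.
inline bool parse_line(std::string line, char delim, std::vector<double>& out) {
    std::vector<const char*> seps;
    const std::size_t n = cstyle_tokenize(line, delim, seps);
    out.clear();
    out.reserve(n);
    for (const char* tok : seps) {
        char* end = nullptr;
        const double v = std::strtod(tok, &end);
        if (end == tok || *end != '\0') {
            return false;  // empty token or trailing garbage
        }
        out.push_back(v);
    }
    return n > 0;
}
```

Because the tokens are only pointers into the line's own buffer, no per-field string is allocated, which is the main reason this style tends to be cheaper than extracting fields through a `std::istringstream`.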
I will try the proposal in #681 and the other dev branches, thanks :)

---
https://github.com/mlpack/mlpack/issues/707#issuecomment-229242682