[mlpack-git] [mlpack/mlpack] data::load may becomes an io bottleneck? (#707)
cypro666
notifications at github.com
Tue Jun 28 22:50:10 EDT 2016
Data mapping is a good idea, but it seems to be a different code path, one that needs a DatasetInfo variable prepared.
data::Load has several overloads; the one I am referring to is this:
```cpp
template<typename eT>
bool Load(const std::string& filename,
          arma::Mat<eT>& matrix,
          const bool fatal,
          const bool transpose);
```
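This function is a powerful implementation, but whenever the file's loadType != arma::hdf5_binary it falls back to arma::Mat::load, and that is where the trouble starts. Armadillo's load method is not heavily optimized for the plain-text formats (arma::csv_ascii and arma::raw_ascii): it is built on C++ std iostream, and in practice iostream parsing is often slower than C-style stdio.
Calling std::ios::sync_with_stdio(false) may make this a little faster, although formally that setting only affects the standard streams (std::cin, std::cout, ...) and not file streams.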
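To make the iostream versus C-style contrast concrete, here is a minimal sketch of the two parsing styles applied to one comma-separated line. The function names `parse_with_iostream` and `parse_with_strtod` are made up for illustration and are not part of Armadillo or mlpack; no timing is claimed here, the sketch only shows that both styles yield the same values.

```cpp
#include <cassert>   // for the usage checks
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parse comma-separated doubles with formatted
// iostream extraction (one istringstream per field).
std::vector<double> parse_with_iostream(const std::string& line) {
  std::vector<double> values;
  std::istringstream in(line);
  std::string field;
  while (std::getline(in, field, ',')) {
    std::istringstream cell(field);
    double v = 0.0;
    cell >> v;
    values.push_back(v);
  }
  return values;
}

// Hypothetical helper: parse the same line with C-style std::strtod,
// walking the character buffer in place without temporary streams.
std::vector<double> parse_with_strtod(const std::string& line) {
  std::vector<double> values;
  const char* p = line.c_str();
  while (*p != '\0') {
    char* end = nullptr;
    const double v = std::strtod(p, &end);
    if (end == p) {
      break;  // nothing numeric could be parsed at this position
    }
    values.push_back(v);
    p = end;
    if (*p == ',') {
      ++p;  // skip the delimiter
    }
  }
  return values;
}
```

Timing both functions on a large file would be the next step to check whether the difference matters for a given platform and standard library.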
For my project, I used a crude workaround as a contingency plan:
```cpp
bool success;
switch (loadType) {
  case arma::hdf5_binary:
    success = matrix.load(filename, loadType);
    break;
  case arma::csv_ascii:
  case arma::raw_ascii:
    // Plain-text formats go through the hand-written parser.
    success = stream_to_matrix(stream, matrix);
    break;
  default:
    success = matrix.load(stream, loadType);
    break;
}
```
And here is the rough, temporary implementation of stream_to_matrix:
```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>
#include <mlpack/core.hpp>  // arma::Mat and Log (code assumed to live in namespace mlpack)

// Note: cstyle_tokenize() is my own helper and is not shown in this post.
// It splits `line` on `delim`, stores a pointer to the start of each field
// in `seps`, and returns the number of fields.
template<typename IStream, typename eT>
inline bool stream_to_matrix(IStream& stream, arma::Mat<eT>& matrix) {
  // Rewind so the whole stream is parsed.
  stream.clear();
  stream.seekg(0, std::ios::beg);
  if (!stream.good()) {
    return false;
  }

  // Read the first line to detect the delimiter and the column count.
  std::string line;
  std::getline(stream, line);
  stream.clear();
  stream.seekg(0, std::ios::beg);
  if (line.empty()) {
    return false;
  }
  boost::trim(line);
  if (boost::ends_with(line, ",")) {
    line.pop_back();  // tolerate a trailing comma
  }

  char delim = ',';
  arma::uword ncol = std::count(line.begin(), line.end(), ',');  // csv
  if (0 == ncol) {
    delim = '\t';
    ncol = std::count(line.begin(), line.end(), '\t');  // tsv
    if (0 == ncol) {
      delim = ' ';
      ncol = std::count(line.begin(), line.end(), ' ');  // txt
    }
  }
  ncol += 1;  // N delimiters separate N + 1 fields

  // First pass: count rows up to the first empty line.
  arma::uword nrow = 0;
  while (std::getline(stream, line)) {
    if (line.empty()) {
      break;
    }
    ++nrow;
  }
  stream.clear();
  stream.seekg(0, std::ios::beg);

  // Every element is overwritten below, so the old contents need not be kept.
  matrix.set_size(nrow, ncol);

  // Second pass: tokenize each line and convert the fields with strtod.
  std::vector<const char*> seps;
  arma::uword i = 0;
  while (std::getline(stream, line)) {
    boost::trim(line);
    if (line.empty()) {
      break;
    }
    if (cstyle_tokenize(line, delim, seps) != ncol) {
      Log::Warn << "Error line " << i << ": " << line << std::endl;
      return false;
    }
    for (arma::uword j = 0; j < ncol; ++j) {
      assert(seps[j] && *seps[j]);
      char* end = nullptr;
      matrix(i, j) = static_cast<eT>(std::strtod(seps[j], &end));
      // Check every field (not only the last one) for unparsed characters.
      if (end == seps[j] || (*end != '\0' && *end != delim)) {
        Log::Warn << "Error line " << i << ": " << line << std::endl;
        return false;
      }
    }
    ++i;
    if (!(i % 1000000)) {
      Log::Info << i << " lines parsed" << std::endl;
    }
  }

  // Fail if the second pass stopped before the counted number of rows.
  return i == nrow;
}
```
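The post calls `cstyle_tokenize` without showing it. Purely as an assumption about what such a helper might look like, here is a minimal sketch with a signature inferred from the call site: it splits the line in place by overwriting each delimiter with `'\0'`, in the spirit of `strtok`, and returns the number of fields. The author's real helper may differ.

```cpp
#include <cassert>   // for the usage checks
#include <cstddef>
#include <string>
#include <vector>

// HYPOTHETICAL sketch: the original helper is not shown in the post.
// Splits `line` in place on `delim`. Each delimiter is overwritten with
// '\0' so that every field becomes a NUL-terminated C string, and a pointer
// to the start of each field is stored in `seps`.
// Returns the number of fields (0 for an empty line).
std::size_t cstyle_tokenize(std::string& line, char delim,
                            std::vector<const char*>& seps) {
  seps.clear();
  if (line.empty()) {
    return 0;
  }
  char* p = &line[0];  // contiguous, NUL-terminated storage since C++11
  seps.push_back(p);
  for (std::size_t k = 0; k < line.size(); ++k) {
    if (p[k] == delim) {
      p[k] = '\0';
      seps.push_back(p + k + 1);  // next field starts after the delimiter
    }
  }
  return seps.size();
}
```

Because the pointers in `seps` refer into `line`, they are only valid until `line` is modified or destroyed, which fits the per-line loop above. An empty field shows up as a pointer to an empty string, which the `assert` in `stream_to_matrix` would reject.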
I will try the proposal in #681 and the other dev branches, thanks :)
---
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/issues/707#issuecomment-229242682