[mlpack-git] [mlpack/mlpack] Fix mapping issue (#660)

Thu Jun 2 10:01:52 EDT 2016

It's worth noting that with the strategy I proposed, re-reading the file will either happen very early in the process or only rarely.  The idea is: read the file until we encounter something we can't cast to a double, then go back, re-read everything, and convert that feature to categorical.  In the vast majority of datasets, this will happen very early on: I don't think there are many datasets that have 1M+ lines of valid numbers and then suddenly a "hello".  So I think the performance will be fine, except in crazy corner cases.
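To make the restart strategy concrete, here is a rough sketch, not the actual mlpack loader.  It assumes a seekable, comma-separated stream; `ParsesAsDouble`, `ScanResult`, and `DetectWithRestart` are hypothetical names made up for this example, and a real loader would also store the parsed values.

```cpp
#include <cstddef>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: true if the whole token can be read as a double.
// Tokens with trailing characters (including trailing spaces) are rejected.
inline bool ParsesAsDouble(const std::string& token)
{
  if (token.empty())
    return false;
  char* end = nullptr;
  std::strtod(token.c_str(), &end);
  return end != token.c_str() && *end == '\0';
}

// Hypothetical result type: which columns ended up categorical, and how many
// times we had to go back to the start of the input.
struct ScanResult
{
  std::vector<bool> categorical;
  std::size_t restarts = 0;
};

// Read until a token in a (so far) numeric column fails to parse, mark that
// column categorical, and start over from the beginning.
inline ScanResult DetectWithRestart(std::istream& in)
{
  ScanResult result;
  bool restart = true;
  while (restart)
  {
    restart = false;
    in.clear();   // Reset EOF/fail bits left over from the previous pass.
    in.seekg(0);  // Go back to the beginning of the input.

    std::string line;
    while (!restart && std::getline(in, line))
    {
      std::istringstream fields(line);
      std::string token;
      for (std::size_t col = 0; std::getline(fields, token, ','); ++col)
      {
        if (col >= result.categorical.size())
          result.categorical.resize(col + 1, false);

        if (!result.categorical[col] && !ParsesAsDouble(token))
        {
          // First non-numeric value seen in this column: convert the
          // feature to categorical and re-read everything.
          result.categorical[col] = true;
          ++result.restarts;
          restart = true;
          break;
        }
        // A real loader would store the numeric or mapped value here.
      }
    }
  }
  return result;
}
```

The number of restarts is bounded by the number of categorical columns, and each one costs only as many lines as were read before the offending token, which is why the cost stays small when non-numeric values show up near the top of the file.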

Another idea might be to scan through the entire file once, in order to determine which features are categorical, then go back to the beginning and re-scan the file.  But for large files I think the first approach might be faster.
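For comparison, the two-pass alternative could look roughly like the sketch below.  Again this is only an illustration under assumptions: the input is a seekable, comma-separated stream positioned at its start, every row has the same number of fields, and `IsNumeric`, `FindCategoricalColumns`, and `LoadTwoPass` are hypothetical names.  Mapping each categorical string to an integer index in order of first appearance is an assumption made for the example.

```cpp
#include <cstddef>
#include <cstdlib>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: true if the whole token can be read as a double.
inline bool IsNumeric(const std::string& token)
{
  if (token.empty())
    return false;
  char* end = nullptr;
  std::strtod(token.c_str(), &end);
  return end != token.c_str() && *end == '\0';
}

// Pass 1: scan the entire input once and flag every column that contains at
// least one non-numeric token.
inline std::vector<bool> FindCategoricalColumns(std::istream& in)
{
  std::vector<bool> categorical;
  std::string line;
  while (std::getline(in, line))
  {
    std::istringstream fields(line);
    std::string token;
    for (std::size_t col = 0; std::getline(fields, token, ','); ++col)
    {
      if (col >= categorical.size())
        categorical.resize(col + 1, false);
      if (!IsNumeric(token))
        categorical[col] = true;
    }
  }
  return categorical;
}

// Pass 2: rewind and load the data.  Values in categorical columns are
// replaced by an index assigned in order of first appearance in that column.
inline std::vector<std::vector<double>> LoadTwoPass(std::istream& in)
{
  const std::vector<bool> categorical = FindCategoricalColumns(in);
  std::vector<std::unordered_map<std::string, std::size_t>> maps(
      categorical.size());

  in.clear();   // Reset the EOF state reached at the end of pass 1.
  in.seekg(0);  // Go back to the beginning and re-scan.

  std::vector<std::vector<double>> rows;
  std::string line;
  while (std::getline(in, line))
  {
    std::istringstream fields(line);
    std::string token;
    std::vector<double> row;
    for (std::size_t col = 0; std::getline(fields, token, ','); ++col)
    {
      if (categorical[col])
      {
        // emplace() leaves an existing mapping untouched and returns it.
        auto it = maps[col].emplace(token, maps[col].size()).first;
        row.push_back(static_cast<double>(it->second));
      }
      else
      {
        row.push_back(std::strtod(token.c_str(), nullptr));
      }
    }
    rows.push_back(row);
  }
  return rows;
}
```

This version always reads the file exactly twice, whereas the restart version reads it once plus whatever prefix had been consumed before each non-numeric token was found.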

I'm not familiar with boost::spirit, so I'll go read a little bit about it later today.  For what it's worth, the mlpack compile-time error messages can already be pretty crazy (especially with serialization!). :)

---
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/pull/660#issuecomment-223300776