[mlpack-git] [mlpack] Dataset and Experimentation Tool Ideas (#582)

Sun Mar 20 09:54:34 EDT 2016

The Dataset and Experimentation tool project starts from a few initial ideas:
•	checking a dataset for loading problems and printing errors
•	imputation strategies for missing variables
•	splitting a dataset into a training and test set
•	converting categorical features into binary features (or numeric features)
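
As a rough illustration of the splitting and categorical-conversion ideas above, here is a minimal sketch that uses only the C++ standard library. The names `SplitIndices` and `OneHotEncode` are hypothetical helpers made up for this example, not existing mlpack functions, and the fixed-seed shuffle is just one assumed way to make a split reproducible.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <numeric>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not an mlpack API): shuffle the row indices with a
// fixed seed, then return {training indices, test indices}.
std::pair<std::vector<std::size_t>, std::vector<std::size_t>>
SplitIndices(std::size_t numRows, double testRatio, unsigned seed)
{
  assert(testRatio >= 0.0 && testRatio <= 1.0);

  std::vector<std::size_t> indices(numRows);
  std::iota(indices.begin(), indices.end(), 0);
  std::mt19937 rng(seed);
  std::shuffle(indices.begin(), indices.end(), rng);

  // The first numTest shuffled indices form the test set (rounded down).
  const std::size_t numTest = static_cast<std::size_t>(numRows * testRatio);
  std::vector<std::size_t> test(indices.begin(), indices.begin() + numTest);
  std::vector<std::size_t> train(indices.begin() + numTest, indices.end());
  return {train, test};
}

// Hypothetical helper (not an mlpack API): turn one categorical feature into
// binary columns, one per distinct category, ordered by sorted category name.
std::vector<std::vector<int>>
OneHotEncode(const std::vector<std::string>& values)
{
  // Collect the distinct categories; std::map keeps them sorted.
  std::map<std::string, std::size_t> columnOf;
  for (const std::string& v : values)
    columnOf.emplace(v, 0);

  // Assign consecutive column numbers in sorted order.
  std::size_t next = 0;
  for (auto& entry : columnOf)
    entry.second = next++;

  // Each row gets a 1 in the column of its category and 0 elsewhere.
  std::vector<std::vector<int>> encoded(
      values.size(), std::vector<int>(columnOf.size(), 0));
  for (std::size_t i = 0; i < values.size(); ++i)
    encoded[i][columnOf[values[i]]] = 1;
  return encoded;
}
```

Returning index sets rather than copied rows is a design choice of this sketch: the same split can then be applied to both the feature matrix and the labels.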

After analyzing these needs, I think a simple, intuitive console application that evaluates a dataset and solves some commonly faced problems would be a good fit for this project.
I have divided the prospective application into four major modules.

* Data IO - Convert to CSV or ARFF, or save into another chosen format. (External libraries might give better results here.)
* Data Transformation - Join/split, edit metadata (feature type detection and transformation), remove target leaks, clean missing data (custom value, replace with mean, mode, or median, or remove the entire row), fix scaling issues, etc.
* Statistical Analytics - Descriptive statistics (row count, unique value count, missing value count, min, max, mean, median, mode, 1st and 3rd quartiles, etc.), t-test, etc.
* Mathematical Operators - Rounding, applying math operations, extracting hours from timestamps, applying a time zone, etc.
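
To make the statistics and missing-data items more concrete, here is a minimal sketch of a per-column summary and mean imputation, again using only the C++ standard library. `ColumnSummary`, `Summarize`, and `ImputeWithMean` are hypothetical names for this example, not mlpack APIs, and it assumes missing values are stored as NaN.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical summary of one numeric column (not an mlpack type).
struct ColumnSummary
{
  std::size_t rowCount = 0;
  std::size_t missingCount = 0;
  double min = 0.0;
  double max = 0.0;
  double mean = 0.0;
  double median = 0.0;
};

// Compute descriptive statistics, treating NaN as "missing" (an assumption
// of this sketch). Statistics are computed over the present values only.
ColumnSummary Summarize(const std::vector<double>& column)
{
  ColumnSummary s;
  s.rowCount = column.size();

  std::vector<double> present;
  for (double v : column)
  {
    if (std::isnan(v))
      ++s.missingCount;
    else
      present.push_back(v);
  }
  if (present.empty())
    return s;  // Nothing to summarize; numeric fields stay at 0.

  std::sort(present.begin(), present.end());
  s.min = present.front();
  s.max = present.back();

  double sum = 0.0;
  for (double v : present)
    sum += v;
  s.mean = sum / present.size();

  // Median: middle value, or the average of the two middle values.
  const std::size_t mid = present.size() / 2;
  s.median = (present.size() % 2 == 1)
      ? present[mid]
      : (present[mid - 1] + present[mid]) / 2.0;
  return s;
}

// Return a copy of the column with every NaN replaced by the column mean.
// If every value is missing, the column is returned unchanged.
std::vector<double> ImputeWithMean(const std::vector<double>& column)
{
  const ColumnSummary s = Summarize(column);
  std::vector<double> filled(column);
  if (s.missingCount == s.rowCount)
    return filled;
  std::replace_if(filled.begin(), filled.end(),
                  [](double v) { return std::isnan(v); }, s.mean);
  return filled;
}
```

Median or mode imputation could follow the same pattern by swapping the replacement value; which strategy is the right default is an open question for the tool.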

Please let me know if I am heading in the right direction.
Also, let me know if there is another idea or resource that might help with this project.

---
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/issues/582