[mlpack] Suggestions needed on basic outline
Ryan Curtin
ryan at ratml.org
Mon Mar 14 09:58:11 EDT 2016
On Mon, Mar 14, 2016 at 05:53:15AM +0530, nirmal singhania wrote:
> Hello,
Hi Nirmal,
There is no need to send your email multiple times. Everyone on the
list received it the first time.
> Preprocessing modules can include:
> 1) Checking a dataset for loading problems and printing errors
> 2) Standardization module (mean removal and variance scaling) using z-scores
> 3) Scaling features to a range (min-max)
> 4) Handling missing values/NA
> This can be done by removing the entire rows/columns containing
> missing values,
> or by imputing the missing values using the given data
> 5) Scaling data with outliers
> 6) Converting categorical features into binary features
>
> 7) Normalization of data (not required for every ML algorithm, but it doesn't
> hurt if applied)
> 8) Splitting a dataset into a training and test set
>
> Other features we can consider adding:
> 1) Handling class imbalance (SMOTE (Synthetic Minority Over-sampling
> Technique), oversampling, and undersampling)
> 2) Quantization of numerical attributes
Do you mean quantization of categorical attributes here?
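To make items 2) and 3) of the quoted list concrete, here is a minimal sketch of z-score standardization and min-max scaling for a single feature. It uses only the C++ standard library; the function names `Standardize` and `MinMaxScale` are hypothetical and are not part of mlpack or of any API proposed in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: z-score standardization of one feature.
// Subtracts the mean and divides by the population standard deviation.
// A constant feature (zero variance) is mapped to all zeros.
std::vector<double> Standardize(const std::vector<double>& x)
{
  assert(!x.empty());
  double mean = 0.0;
  for (double v : x) mean += v;
  mean /= x.size();

  double var = 0.0;
  for (double v : x) var += (v - mean) * (v - mean);
  var /= x.size();
  const double sd = std::sqrt(var);

  std::vector<double> out(x.size(), 0.0);
  if (sd > 0.0)
    for (std::size_t i = 0; i < x.size(); ++i)
      out[i] = (x[i] - mean) / sd;
  return out;
}

// Hypothetical helper: min-max scaling of one feature to [lo, hi].
// A constant feature is mapped to lo.
std::vector<double> MinMaxScale(const std::vector<double>& x,
                                double lo = 0.0, double hi = 1.0)
{
  assert(!x.empty());
  const auto mm = std::minmax_element(x.begin(), x.end());
  const double mn = *mm.first;
  const double mx = *mm.second;

  std::vector<double> out(x.size(), lo);
  if (mx > mn)
    for (std::size_t i = 0; i < x.size(); ++i)
      out[i] = lo + (x[i] - mn) * (hi - lo) / (mx - mn);
  return out;
}
```

A real module would presumably operate on whole matrices and store the fitted mean, deviation, minimum, and maximum so the same transform can be applied to a test set; this sketch leaves that out.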
> A C++ API will be developed which will serve the purpose of
> pre-processing data before using any mlpack algorithm on it.
> A command-line interface will also be developed through which the user can
> check for problems and apply pre-processing methods to a data set.
> The command line and C++ API will initially support CSV and ARFF files, and
> support for other formats may be added later.
> There will be an option to save the pre-processed data set.
> Optional: one extra feature which can be added is converting pre-processed ARFF
> to CSV and vice versa.
>
> Since data handling and pre-processing will be a crucial and common
> step, extensive documentation will be created using Doxygen on:
> 1) How to use the various methods present in the C++ API
> 2) How to handle and pre-process data using the command line
>
> Sample programs and tutorials on various data handling steps will also be
> created using some open datasets.
>
>
> I want to ask how much information about each of the above steps I should
> give in my proposal to make it a good proposal.
I like the ideas you've proposed here. When you put your proposal
together, though, please spend some time detailing what the proposed C++
API will be (and we can go back and forth on this if necessary). I
think maybe the design guidelines would be helpful here:
https://github.com/mlpack/mlpack/wiki/DesignGuidelines
A couple other thoughts:
* Don't worry about writing an imputer. A colleague of mine and I are
planning on adding this support in the next few months. Detecting
NaNs and missing values in a dataset is a good idea though.
* We should try and support all of the file formats that Armadillo
supports, instead of just CSV and ARFF. It would be good to provide
a tool that can work with any dataset a user might otherwise use with
mlpack.
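As an illustration of the NaN and missing-value detection mentioned in the first bullet, here is a minimal sketch using only the C++ standard library. The names `ParseField` and `FindMissing` are hypothetical, and the set of tokens treated as missing (an empty field, "NA", and "?", the last being the ARFF marker for a missing value) is an assumption for this example, not a description of mlpack's loaders.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: convert one text field to a double.
// Empty fields, "NA", and "?" are treated as missing and become NaN.
// Any other field is parsed with std::stod, which throws on invalid input.
double ParseField(const std::string& field)
{
  if (field.empty() || field == "NA" || field == "?")
    return std::numeric_limits<double>::quiet_NaN();
  return std::stod(field);
}

// Hypothetical helper: return the (row, column) position of every value
// that is NaN or infinite, in row-major order.
std::vector<std::pair<std::size_t, std::size_t>> FindMissing(
    const std::vector<std::vector<double>>& data)
{
  std::vector<std::pair<std::size_t, std::size_t>> positions;
  for (std::size_t r = 0; r < data.size(); ++r)
    for (std::size_t c = 0; c < data[r].size(); ++c)
      if (!std::isfinite(data[r][c]))
        positions.emplace_back(r, c);
  return positions;
}
```

Reporting positions rather than silently dropping rows leaves the decision about what to do with the affected points (remove them, or hand them to an imputer later) to the caller.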
I hope this is helpful. Please let me know if I can clarify anything.
> 2) Implementing decision trees and other algorithms in mlpack
> I have understood the decision stump implementation done by Udit Saxena
> for AdaBoost and would like to add more "weak learners" for AdaBoost, some of
> which are already implemented in mlpack and some of which will be implemented
> by me.
> Since decision stumps are basically 1-level decision trees, I would like to
> continue Udit Saxena's work and implement full-fledged decision trees
> like ID3, C4.5, C5.0, and CART.
> I also looked at the code for DET (Density Estimation Trees) and would like
> to borrow tree construction ideas from it.
>
> I will also try to implement NB-Tree (Naive Bayes Tree) and
> CI-Tree (Conditional Inference Tree), which are very useful in some tasks.
> I have some knowledge of the above-mentioned methods and am currently going
> through the literature for more information and implementation details.
>
> All the above points about documentation and tutorials also apply here.
> As we are adding new algorithms to the mlpack library,
> testing of the implemented algorithms will be an important phase of this
> project.
> Also, as everyone knows, mlpack is known for its speed and scalability.
> We will benchmark it against similar methods available in
> scikit-learn, Weka, R, and the Shogun machine learning toolkit,
> and the results will be provided via interactive charts.
> The automatic benchmarking system built by Marcus Edel and Anand Soni during GSoC
> will be used for benchmarking: https://github.com/zoq/benchmarks
I think that you should focus on just one of these two ideas; it's hard
to write two good proposals. Again, the same advice applies for this
proposal: make sure to spend some time designing the API and mentioning
what it will be in your proposal.
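
For reference, the "decision stump is a 1-level decision tree" idea in the quoted text can be sketched as follows. This is a simplified illustration that picks the threshold with the fewest training misclassifications on one numeric feature with 0/1 labels; it is not taken from mlpack's decision stump code, and the names `Stump` and `TrainStump` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical one-level tree: points with value < threshold get leftLabel,
// all others get rightLabel.
struct Stump
{
  double threshold;
  int leftLabel;
  int rightLabel;

  int Predict(double v) const { return v < threshold ? leftLabel : rightLabel; }
};

// Train on one numeric feature x with labels y in {0, 1}.
// Candidate thresholds are midpoints between consecutive distinct values;
// the one with the fewest training errors wins (first one on ties).
Stump TrainStump(const std::vector<double>& x, const std::vector<int>& y)
{
  assert(!x.empty() && x.size() == y.size());
  const std::size_t n = x.size();

  // Indices of the points, sorted by feature value.
  std::vector<std::size_t> order(n);
  std::iota(order.begin(), order.end(), 0);
  std::sort(order.begin(), order.end(),
            [&](std::size_t a, std::size_t b) { return x[a] < x[b]; });

  std::size_t total1 = 0;
  for (int label : y)
  {
    assert(label == 0 || label == 1);
    total1 += label;
  }
  const std::size_t total0 = n - total1;

  // Baseline: no useful split, predict the majority class everywhere.
  const int majority = total1 > total0 ? 1 : 0;
  Stump best{x[order[0]], majority, majority};
  std::size_t bestErrors = std::min(total0, total1);

  std::size_t left0 = 0, left1 = 0;
  for (std::size_t i = 0; i + 1 < n; ++i)
  {
    (y[order[i]] == 1 ? left1 : left0) += 1;
    const double a = x[order[i]];
    const double b = x[order[i + 1]];
    if (a == b) continue;  // Cannot split between equal values.

    const std::size_t right0 = total0 - left0;
    const std::size_t right1 = total1 - left1;
    const std::size_t errors =
        std::min(left0, left1) + std::min(right0, right1);
    if (errors < bestErrors)
    {
      bestErrors = errors;
      best = Stump{a + (b - a) / 2.0,
                   left1 > left0 ? 1 : 0,
                   right1 > right0 ? 1 : 0};
    }
  }
  return best;
}
```

A full decision tree applies the same kind of split search recursively to each side, usually with an impurity measure such as information gain or the Gini index instead of raw error counts.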
Thanks,
Ryan
--
Ryan Curtin | "I just ran out of it, you see."
ryan at ratml.org | - Howard Beale