[mlpack] [GSoC 2016] Some attempts on refactoring decision_stump

Cloud Han heliumhgy at gmail.com
Sat Apr 16 23:22:25 EDT 2016


Hello, everyone,

My name is Han Guangyun (Cloud Han); I'm a master's student in
China. I haven't introduced myself yet, and I'm not going to do it
right now ;)

I've only exchanged a few private emails with Ryan about GSoC issues,
and I finally decided to do some experiments on the original decision
stump implementation, which is a building block for the decision tree
(my proposal).

So what is wrong with the original implementation? First, if you have
read the code, you'll find the splitting criterion is hard-coded as
entropy gain (information gain), and there is a bunch of redundant
code. Second, it only supports numeric features. Some other
functionality also needs to be added if you want to develop a
full-blown decision tree.

Also, I inspected Ryan's implementation of the Hoeffding tree, which
is basically a decision tree that supports streaming data. Because of
that, it has to remember statistics (some state) about the data, which
is not needed in the vanilla version. It also brings the ability to
handle categorical features.

So in this experiment, I just removed the unneeded features and
combined what was left with the stump. After inspecting the code, here
are some conclusions:

1. Fitness Function (Impurity Gain Function)

   In Ryan's implementation, these calculations are handled by a
   FitnessFunction policy class: you prepare `counts`, a numChildren
   x numClasses matrix holding the number of times each class occurs
   in each child, and the gain is calculated from that. But we still
   need to add weighting support: sample reweighting is required by
   AdaBoost, and class weighting is commonly needed for skewed
   data. So in this case, instead of letting `counts` hold raw
   occurrences, we let it hold weighted occurrences. This change
   should work fine for classification tasks; information gain,
   information gain ratio, and Gini impurity all only need the
   probability of each label occurring in each child.

   However, for regression trees, I haven't figured out how to use
   these statistics to calculate MSE, which needs a mean value for
   each child. For now, I think a better design is: instead of
   letting the FitnessFunction class compute the gain, it only
   computes the impurity of a (proposed) node, and combining those
   impurities into the impurity gain is done in the NumericSplit or
   CategoricalSplit class; see the sketch below.
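
   To make that concrete, here is a minimal sketch (hypothetical
   names, not mlpack's actual API) assuming Armadillo: the fitness
   function only computes the impurity of one node from weighted
   class counts, and the split class combines child impurities into a
   gain. An MSE criterion for regression would slot in as another
   Impurity() over different per-child statistics.

   #include <armadillo>

   class GiniImpurity
   {
    public:
     // classCounts(j) holds the total *weight* of class j in the node.
     static double Impurity(const arma::rowvec& classCounts)
     {
       const double total = arma::accu(classCounts);
       if (total == 0.0)
         return 0.0;
       const arma::rowvec p = classCounts / total; // weighted probabilities
       return 1.0 - arma::accu(p % p);
     }
   };

   // Inside NumericSplit/CategoricalSplit: gain of a proposed split,
   // given a numChildren x numClasses matrix of weighted occurrences.
   template<typename FitnessFunction>
   double ImpurityGain(const arma::mat& counts)
   {
     const double total = arma::accu(counts);
     double childImpurity = 0.0;
     for (size_t i = 0; i < counts.n_rows; ++i)
     {
       const double childWeight = arma::accu(counts.row(i));
       childImpurity += (childWeight / total) *
           FitnessFunction::Impurity(counts.row(i));
     }
     // The parent's counts are the column sums of the children's counts.
     return FitnessFunction::Impurity(arma::sum(counts, 0)) - childImpurity;
   }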

2. Numeric vs. Categorical Features

   It is common to see categorical features, e.g. `male` vs `female`,
   in data, and decision trees have some advantages over other
   algorithms in handling them.

   Ryan uses two classes, HoeffdingNumericSplit and
   HoeffdingCategoricalSplit, to handle heterogeneous data; combined
   with a proper FitnessFunction, this is really flexible. I think
   the same design should be fine here, and I just mimicked a
   NumericSplit in my experiment.
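
   Roughly, the Hoeffding tree composes its policies like this
   (paraphrased from memory; see hoeffding_tree.hpp for the exact
   declaration), and a vanilla stump or tree could reuse the same
   pattern with stateless split classes:

   template<typename FitnessFunction = GiniImpurity,
            template<typename> class NumericSplitType =
                HoeffdingDoubleNumericSplit,
            template<typename> class CategoricalSplitType =
                HoeffdingCategoricalSplit>
   class HoeffdingTree { /* ... */ };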

3. Missing Values

   Neither the original DecisionStump nor (I'm not quite sure) the
   Hoeffding tree implementation in mlpack supports handling missing
   values, which are quite common in the wild. For training, it is
   easy to just use the non-missing data in a particular dimension to
   train a stump.

   For prediction, there are two ways to solve it. One is to use
   another, non-missing value to perform the split, i.e. we add some
   fallback mechanism. The other is to handle missing values
   probabilistically, which is what C4.5 does, I think. I prefer the
   latter due to its simplicity.
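
   Here is a toy sketch (hypothetical, not existing mlpack code) of
   the probabilistic approach at prediction time for a binary numeric
   stump: a missing value falls down *both* children, weighted by the
   fraction of training weight each child received.

   #include <armadillo>
   #include <cmath>

   arma::vec ClassProbabilities(const double value,
                                const double splitValue,
                                const arma::vec& leftProbs,
                                const arma::vec& rightProbs,
                                const double leftFraction) // share of
                                                           // training weight
   {
     if (!std::isnan(value))
       return (value < splitValue) ? leftProbs : rightProbs;

     // Missing value: blend the children's class distributions.
     return leftFraction * leftProbs + (1.0 - leftFraction) * rightProbs;
   }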

Combining the above three features and adding weighting support should
be sufficient to build a generic decision stump or decision tree.

My experiments are stored here; so far I've just added weighting
support and handled numeric features via a NumericSplit class. Just a
little progress; further testing and consideration are needed. No
serious tests have been performed, it's just a proof of concept :)
https://github.com/cloudhan/mlpack/tree/weak_learner/src/mlpack/methods/weak_learner

I hope the above sheds some light for whoever ends up implementing a
decision tree.

Good luck.
(๑•̀ㅂ•́)و✧