Hello, everyone,

My name is Han Guangyun (Cloud Han), a master's student in China. I
haven't introduced myself yet, and I'm not going to do it right now ;)

I have exchanged a few private emails with Ryan about GSoC, and I finally
decided to run some experiments on the original decision stump
implementation, which is a building block for decision trees (my
proposal).

So what is wrong with the original implementation? First, if you have
read the code, you'll find that the splitting criterion is hard-coded as
entropy gain (information gain), and there is a bunch of redundant code.
Second, it only supports numeric features. Some other functionality also
needs to be added if you want to develop a full-blown decision tree.

I also inspected Ryan's implementation of the Hoeffding tree, which is
basically a decision tree that supports streaming data. Because of that,
it has to remember statistics (some state) about the data, which the
vanilla tree does not need. It also brings the ability to handle
categorical features.

So in this experiment, I just removed the unneeded features and combined
the rest with the stump. After inspecting the code, here are some
conclusions:

1. Fitness Function (Impurity Gain Function)

   In Ryan's implementation, these calculations are represented by a
   FitnessFunction policy: for the gain calculation you just prepare
   `counts`, a numChildren x numClasses matrix holding the number of
   times each class occurs in each child. But we still need to add
   weighting support: sample reweighting is required by AdaBoost, and
   class weighting is commonly needed for skewed data. So in this case,
   instead of letting `counts` hold raw occurrences, we let it hold
   weighted occurrences. This change should work fine for classification
   tasks, since information gain, information gain ratio, and Gini
   impurity all only need the probability of each label occurring in
   each child.

   However, for regression trees, I haven't figured out how to use these
   statistics to calculate MSE, which needs a mean value for each child.
   For now, I think a better design is: instead of letting the
   FitnessFunction class compute the gain, it only computes the impurity
   of a (proposed) node, and combining the impurities into the impurity
   gain is done in the NumericSplit or CategoricalSplit class, as in the
   sketch below.
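   To make this concrete, here is a minimal sketch of that split of
   responsibilities, assuming the weighted `counts` matrix described
   above. The names (GiniImpurity, Evaluate, Gain) and the exact
   interface are my own invention for illustration, not mlpack's actual
   FitnessFunction API:

      #include <armadillo>

      // NOTE: illustrative sketch only, not mlpack's real interface.
      struct GiniImpurity
      {
        // weightedCounts[c] = sum of the sample weights of class c that
        // fall into this (proposed) node.
        static double Evaluate(const arma::rowvec& weightedCounts)
        {
          const double total = arma::accu(weightedCounts);
          if (total <= 0.0)
            return 0.0; // An empty node is trivially pure.

          double impurity = 1.0;
          for (const double wc : weightedCounts)
          {
            const double p = wc / total; // Weighted class probability.
            impurity -= p * p;
          }
          return impurity;
        }
      };

      // What the split class would then do: counts is
      // numChildren x numClasses (row i holds the weighted class counts
      // of proposed child i); the gain is the parent impurity minus the
      // weighted average of the child impurities.
      template<typename FitnessFunction>
      double Gain(const arma::mat& counts)
      {
        const arma::rowvec parentCounts = arma::sum(counts, 0);
        const double total = arma::accu(parentCounts);
        if (total <= 0.0)
          return 0.0;

        double childImpurity = 0.0;
        for (arma::uword i = 0; i < counts.n_rows; ++i)
        {
          childImpurity += (arma::accu(counts.row(i)) / total) *
              FitnessFunction::Evaluate(counts.row(i));
        }
        return FitnessFunction::Evaluate(parentCounts) - childImpurity;
      }

   One possible benefit of this design: a regression criterion such as
   MSE could then keep whatever per-child statistics it needs (e.g. the
   weighted sums and squared sums of the responses, from which the
   per-child mean and variance follow) without the split classes having
   to change.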
2. Numeric vs. Categorical Features

   Categorical features, e.g. `male` vs. `female`, are common in data,
   and decision trees have some advantages over other algorithms in
   handling them.

   Ryan uses two classes, HoeffdingNumericSplit and
   HoeffdingCategoricalSplit, to handle heterogeneous data; combined
   with a proper FitnessFunction, this is really flexible. I think this
   design should be fine, and I just mimicked a NumericSplit in my
   experiment.

3. Missing Values

   Neither the original DecisionStump nor (I'm not quite sure) the
   Hoeffding tree implementation in mlpack supports missing values,
   which are quite common in the wild. During training, it is easy to
   just use the non-missing data in a particular dimension to train a
   stump.

   During prediction, there are two ways to solve it. One is to use
   another non-missing value to perform the split, i.e. we add some
   fallback mechanism. The other is to handle missing values
   probabilistically; I think C4.5 uses this approach. I also prefer the
   latter due to its simplicity.

Combining the above three features and adding weighting support should
be sufficient to build a generic decision stump or decision tree.

My attempt is stored here; it just adds weighting support and handles
numeric features through a NumericSplit class. Just a little progress;
further testing and consideration are needed. No serious testing has
been performed, it is just a proof of concept :)
https://github.com/cloudhan/mlpack/tree/weak_learner/src/mlpack/methods/weak_learner

I hope the above sheds some light for whoever is going to implement a
decision tree.

Good luck.
(๑•̀ㅂ•́)و✧