<div dir="ltr">Hello!<div>I'm sorry for not answering so long, I had a very busy end of the week.</div><div><br></div><div>I'd like to discuss the API of decision trees. Decision tree class should be parametrized by FitnessFunction template parameter (like Hoeffding trees). This class should have several constructors (with training and without as mentioned in design guidelines), Train methods (with weight support, so decision trees can be used as weak learners for adaboost), Classify methods (to predict class labels or probabilities). Constructor should have parameters listed below: maxDepth (decision stump would have maxDepth=1, besides this parameter can be used to prevent overfitting), minSamplesLeaf (minimum number of samples required to form a leaf node), maxFeatures (number of features to consider when looking for the best split). Maybe it would be a good idea to implement pruning.</div><div><br></div><div>I have several questions to discuss:</div><div>1) Should I treat ordinal and nominal variables different when splitting? (I suggest not to do so, because ordinal variable with n distinct values has (n-1) possible splits whereas nominal variable has about 2^(n-1) possible splits at the same time, so it would be faster to split the dimension if both types of variables are treated as ordinal).</div><div>2) I've noticed that decision stump code in mlpack does non-binary splits. It constructs several bins and then try to merge them, if bins have common most frequent class. I'd like to implement binary decision tree, so I'm not sure decision stump code can be refactored in this case.</div><div>3) I'd like to use decision trees for regression tasks too. The only difference between classification and regression trees is information, stored in leaf nodes (in classification task leaf node contains class probabilities, in regression - mean of samples in this node). I don't know how to deal with these two different cases better. Maybe I can implement a class parametrized by FitnessFunction that performs splitting (Splitter class). Then I can use this class in implementation of two different classes for classification and regression trees. Is it a good idea to prevent duplicating of code in such way?</div><div>4) Recently I've looked through scikit-learn code for decision trees. Splitting and impurity evaluation is separated in its code. Splitter uses Criterion class to count impurity reduction at different possible levels to find the best split. I like this idea a lot.</div><div>That's all for decision trees.</div><div><br></div><div>> As for ticket #356, extending<br>> the EMST code to compute single-linkage clustering should not be<br>> particularly hard, and in low dimensions the clustering should be fast<br>> because it is a dual-tree algorithm. If you did implement<br>> single-linkage clustering (or other clustering algorithms) with good<br>> tests, I'd be happy to merge them in.<br></div><div>I'd like to try to implement single-linkage clustering soon.</div><div><br></div><div>Regards, Natalia.</div></div>