[mlpack-svn] [MLPACK] #237: Use log-space for DTree::ComputeVariableImportance()

MLPACK Trac trac at coffeetalk-1.cc.gatech.edu
Wed Aug 1 20:58:56 EDT 2012

#237: Use log-space for DTree::ComputeVariableImportance()
 Reporter:  rcurtin  |        Owner:                                
     Type:  defect   |       Status:  new                           
 Priority:  trivial  |    Milestone:                                
Component:  mlpack   |     Keywords:  dtree, det, logspace, overflow
 Blocking:           |   Blocked By:                                
 It used to be that the density estimation tree would use long doubles,
 because a lot of the calculations depended on the volume of hypercubes,
 which in very high dimensions become very, very large.  The solution to
 this is to use the logarithm of the volume, but then we end up with
 situations where we need

 log(a + b)

 but we only have ```log(a)``` and ```log(b)```.  Fortunately, algebraic
 tricks can be used for the actual tree construction algorithms by
 exploiting the definitions of ```a``` and ```b``` (or, as they actually are
 in the code, the definition of the error function R(t)).  So the tree
 construction is done nearly entirely in log-space, with one exception: the
 volume of one node divided by the volume of a leaf ```(V_t / V_i)``` can't
 be done in log-space (or I could not figure out how).  In that case it is
 not a huge problem, because the quantity V_t / V_i turns out to be dependent
 mostly on the depth of the tree, and will probably only overflow for very,
 very deep trees (and other things will probably fail before that).
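
 For reference, the generic way to get log(a + b) from ```log(a)``` and
 ```log(b)``` is the log-sum-exp identity
 log(a + b) = max + log(1 + exp(min - max)).  The sketch below is only an
 illustration of that identity; ```LogAdd``` is a hypothetical helper name
 and is not something that exists in mlpack.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical helper (not part of mlpack): returns log(a + b) given
// logA = log(a) and logB = log(b), without ever forming a or b.
inline double LogAdd(const double logA, const double logB)
{
  const double negInf = -std::numeric_limits<double>::infinity();

  // log(0) is -infinity; adding zero leaves the other term unchanged.
  if (logA == negInf)
    return logB;
  if (logB == negInf)
    return logA;

  // Factor out the larger term so the exponent is always <= 0 and
  // std::exp() cannot overflow.
  const double hi = std::max(logA, logB);
  const double lo = std::min(logA, logB);
  return hi + std::log1p(std::exp(lo - hi));
}
```

 With this, ```LogAdd(1000.0, 1000.0)``` gives 1000 + log(2), whereas
 ```std::log(std::exp(1000.0) + std::exp(1000.0))``` overflows to infinity
 in double precision.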

 However, there is still one function which needs to be addressed,
 DTree::ComputeVariableImportance(), which currently does this:

 // The way to do this entirely in log-space is (at this time) somewhat
 // unclear.  So this risks overflow.
 importances[curNode.SplitDim()] += (-std::exp(curNode.LogNegError()) -
     (-std::exp(curNode.Left()->LogNegError()) +
      -std::exp(curNode.Right()->LogNegError())));

 There should be a way to exploit the definition of the variable importance
 as described in Pari's paper such that we can calculate the quantity (R(t)
 - R(t_l) - R(t_r)) in log-space.
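
 One possible direction, sketched under assumptions: since
 R(x) = -exp(LogNegError(x)), the quantity R(t) - R(t_l) - R(t_r) equals
 exp(l) + exp(r) - exp(t), where l, r and t are the log negative errors of
 the left child, right child and parent.  If that difference is positive
 (which it should be when a split reduces the error), its logarithm can be
 computed by factoring out the largest of the three terms.
 ```LogErrorReduction``` below is a hypothetical name, not an mlpack
 function, and this is not necessarily the fix the ticket ends up with.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical helper (not part of mlpack).  Given the log negative errors
// of a node and its two children, returns
//   log(R(t) - R(t_l) - R(t_r)),  where R(x) = -exp(logNegError(x)),
// or -infinity if the difference is not positive.
inline double LogErrorReduction(const double logNegErr,
                                const double logNegErrLeft,
                                const double logNegErrRight)
{
  const double negInf = -std::numeric_limits<double>::infinity();

  // Largest of the three logs; factoring it out keeps every exponent <= 0.
  const double m = std::max({ logNegErr, logNegErrLeft, logNegErrRight });
  if (m == negInf)
    return negInf; // All three errors are zero.

  // exp(l - m) + exp(r - m) - exp(t - m) lies in [-1, 2], so no overflow.
  const double scaled = std::exp(logNegErrLeft - m) +
                        std::exp(logNegErrRight - m) -
                        std::exp(logNegErr - m);
  if (scaled <= 0.0)
    return negInf; // No error reduction from this split.

  return m + std::log(scaled);
}
```

 The result is the log of one node's contribution, so the per-dimension
 importances would then also have to be accumulated as logs (using a
 log-sum-exp style addition) for the whole computation to stay in
 log-space.  Note that the subtraction can still lose precision when the
 children's combined error is very close to the parent's.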

Ticket URL: <http://trac.research.cc.gatech.edu/fastlab/ticket/237>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed by the FASTLAB at Georgia Tech under Dr. Alex Gray.
