# [mlpack-svn] [MLPACK] #237: Use log-space for DTree::ComputeVariableImportance()

MLPACK Trac trac at coffeetalk-1.cc.gatech.edu
Wed Aug 1 20:58:56 EDT 2012

```#237: Use log-space for DTree::ComputeVariableImportance()
---------------------+------------------------------------------------------
Reporter:  rcurtin  |        Owner:
Type:  defect   |       Status:  new
Priority:  trivial  |    Milestone:
Component:  mlpack   |     Keywords:  dtree, det, logspace, overflow
Blocking:           |   Blocked By:
---------------------+------------------------------------------------------
It used to be that the density estimation tree would use long doubles,
because a lot of the calculations depended on the volume of hypercubes,
which in very high dimensions become very, very large.  The solution to
this is to use the logarithm of the volume, but then we end up with
situations where we need

{{{
log(a + b)
}}}

but we only have ```log(a)``` and ```log(b)```.  Fortunately algebraic
tricks can be used for the actual tree construction algorithms by
exploiting the definitions of ```a``` and ```b``` (or as they actually are
in the code, the definition of the error function R(t)).  So the tree
construction is done nearly entirely in log-space (with one exception; the
volume of one node divided by a leaf ```(V_t / V_i)``` can't be done in
log-space (or I could not figure out how), but in that case it is not a
huge problem because the quantity V_t / V_i turns out to be dependent
mostly on the depth of the tree, and will probably only overflow for very,
very deep trees (and other things will probably fail before that).

However there is still one function which needs to be addressed, which is
```DTree::ComputeVariableImportance()```.

{{{
// The way to do this entirely in log-space is (at this time) somewhat
// unclear.  So this risks overflow.
importances[curNode.SplitDim()] += (-std::exp(curNode.LogNegError()) -
(-std::exp(curNode.Left()->LogNegError()) +
-std::exp(curNode.Right()->LogNegError())));
}}}

There should be a way to exploit the definition of the variable importance
as described in Pari's paper such that we can calculate the quantity (R(t)
- R(t_l) - R(t_r)) in log-space.

--
Ticket URL: <http://trac.research.cc.gatech.edu/fastlab/ticket/237>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed by the FASTLAB at Georgia Tech under Dr. Alex Gray.
```