[mlpack-svn] [MLPACK] #356: RANN with ball trees fails to make theoretical guarantees

MLPACK Trac trac at coffeetalk-1.cc.gatech.edu
Fri Jul 25 11:57:29 EDT 2014


#356: RANN with ball trees fails to make theoretical guarantees
---------------------+------------------------------------------------------
 Reporter:  rcurtin  |        Owner:                           
     Type:  defect   |       Status:  new                      
 Priority:  minor    |    Milestone:  mlpack 1.1.0             
Component:  mlpack   |     Keywords:  ballbound, allkrann, rann
 Blocking:           |   Blocked By:  320                      
---------------------+------------------------------------------------------
 The recent solution of #320 introduced a new issue in r16855, in
 allkrann_search_test.cpp: the SingleBallTreeTest and DualBallTreeTest
 tests fail to meet the theoretical guarantee on the success probability
 of RANN.

 After some digging, I think I have isolated the problem for the
 single-tree case (though I am not certain).  The dual-tree problem
 appears to be the same, so we can restrict attention to the simpler
 single-tree case.

 `Score(queryIndex, referenceNode)` has the following basic structure:

 {{{
   if (SortPolicy::IsBetter(distance, bestDistance)
       && (numSamplesMade[queryIndex] < numSamplesReqd))
   {
     // Do either real approximation or brute-force search.
   }
   else
   {
     // Do "fake" approximation.  We don't need to actually sample.
   }
 }}}

 This makes reasonable sense; if the node is too far from the point or the
 required number of samples has already been made, then the node can be
 pruned, and no more sampling is necessary.  Otherwise, we have to either
 sample the node, or recurse deeper for brute-force search.

 However, consider the case where the node is close enough to the point
 that it cannot be pruned on distance alone (i.e. dist_to_node(q, R) <
 ub(q)), but numSamplesMade[queryIndex] >= numSamplesReqd.  In this
 situation we have already sampled enough points from the dataset.  But
 now suppose that R contains, as its descendants, all of the top-ranked
 true nearest neighbors of q.  Because R is pruned without being sampled
 (which is what will happen in this case), the final rank-approximate
 nearest neighbor result for the query q will not be within the desired
 rank of the true nearest neighbors (we can call this a "failure").
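 The problematic branch can be boiled down to a toy predicate (a sketch
 only; the names loosely mirror the ticket's pseudocode, not the actual
 mlpack `Score()` implementation): a node is pruned whenever it cannot
 improve the bound *or* the sample budget is already exhausted, even if
 the node is close enough to hold the true nearest neighbors.

 {{{
 #include <cassert>
 #include <cstddef>

 // Toy model of the prune decision described above.  Returns true when the
 // node would fall into the "fake" approximation branch (i.e. be pruned).
 bool IsPruned(const double distToNode,
               const double bestDistance,
               const size_t numSamplesMade,
               const size_t numSamplesReqd)
 {
   // The "real" branch runs only when the node is closer than the current
   // bound AND more samples are still needed; otherwise we prune.
   const bool canImprove = (distToNode < bestDistance);
   const bool needSamples = (numSamplesMade < numSamplesReqd);
   return !(canImprove && needSamples);
 }
 }}}

 With this model, a node at distance 0.5 against a bound of 1.0 is still
 pruned once the budget is met, which is exactly the failure case above.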

 Basically, what appears to have happened is that we have drawn the
 required number of samples, but not uniformly at random from the whole
 reference set: some nodes that could contain better points are pruned
 without ever being sampled, so the success-probability analysis no
 longer applies.
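 To see why uniformity matters, here is my reading of the standard
 rank-approximate sample-count bound (a hedged sketch, not mlpack's
 actual computation of numSamplesReqd): to return a neighbor in the top
 tau-fraction of the reference set with probability at least alpha, when
 points are sampled *uniformly* at random, one needs m samples with
 1 - (1 - tau)^m >= alpha, i.e. m >= log(1 - alpha) / log(1 - tau).
 The moment sampling is biased away from the node holding the best
 points, this derivation no longer holds, no matter how large m is.

 {{{
 #include <cassert>
 #include <cmath>
 #include <cstddef>

 // Smallest m with 1 - (1 - tau)^m >= alpha, assuming uniform sampling.
 // tau: fraction of the reference set that counts as "success";
 // alpha: desired success probability.
 size_t NumSamplesRequired(const double tau, const double alpha)
 {
   return (size_t) std::ceil(std::log(1.0 - alpha) / std::log(1.0 - tau));
 }
 }}}

 For example, a top-1% neighbor with 95% success probability needs 299
 uniform samples; the guarantee our tests check is of exactly this form.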

 I've sent an email to Pari about this to get his input on the best way
 to solve the issue, or on whether I have overlooked something simpler.

-- 
Ticket URL: <http://trac.research.cc.gatech.edu/fastlab/ticket/356>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed at Georgia Tech.

