[mlpack-svn] [MLPACK] #356: RANN with ball trees fails to make theoretical guarantees
MLPACK Trac
trac at coffeetalk-1.cc.gatech.edu
Fri Jul 25 11:57:29 EDT 2014
#356: RANN with ball trees fails to make theoretical guarantees
---------------------+------------------------------------------------------
Reporter: rcurtin | Owner:
Type: defect | Status: new
Priority: minor | Milestone: mlpack 1.1.0
Component: mlpack | Keywords: ballbound, allkrann, rann
Blocking: | Blocked By: 320
---------------------+------------------------------------------------------
The recent solution of #320 introduced a new issue in r16855, in
allkrann_search_test.cpp; the SingleBallTreeTest and DualBallTreeTest
tests fail to make the theoretical guarantees given for the success
probability of RANN.
After some digging, I think I have isolated the problem for the
single-tree case (though I am not certain). The dual-tree problem
appears to be the same, so we can consider the simpler single-tree
case.
`Score(queryIndex, referenceNode)` has the following basic structure:
{{{
if ((SortPolicy::IsBetter(distance, bestDistance))
&& numSamplesMade[queryIndex] < numSamplesReqd)
{
// Do either real approximation or brute-force search.
}
else
{
// Do "fake" approximation. We don't need to actually sample.
}
}}}
This makes reasonable sense; if the node is too far from the point or the
required number of samples has already been made, then the node can be
pruned, and no more sampling is necessary. Otherwise, we have to either
sample the node, or recurse deeper for brute-force search.
However, consider the case where the node is sufficiently close to the
point that it can't be pruned (i.e. dist_to_node(q, R) < ub(q)) but
numSamplesMade[queryIndex] >= numSamplesReqd. In this situation we have
already sampled enough points from the dataset. But now suppose that R
contains, as its descendants, all of the top-ranked true nearest neighbors
of q; this means that by pruning R and not sampling it (which is what will
happen in this case), the final rank-approximate nearest neighbor result
for the query q will not be among the top-ranked true nearest neighbors
(we can call this a "failure").
Basically, what appears to have happened here is that we have sampled the
necessary number of points from the dataset, but not uniformly, because
we are pruning nodes that could contain better points instead of sampling
them.
I've sent an email to Pari about this to get his input on the best way
to solve the issue, or on whether I have overlooked something simpler.
--
Ticket URL: <http://trac.research.cc.gatech.edu/fastlab/ticket/356>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed at Georgia Tech.