[mlpack-svn] [MLPACK] #344: Using Welford method to calculate variance in Naive Bayes Classifier
MLPACK Trac
trac at coffeetalk-1.cc.gatech.edu
Wed Apr 16 10:59:05 EDT 2014
#344: Using Welford method to calculate variance in Naive Bayes Classifier
-----------------------------------+----------------------------------------
Reporter: akvah | Owner: rcurtin
Type: enhancement | Status: closed
Priority: minor | Milestone: mlpack 1.0.9
Component: mlpack | Resolution: fixed
Keywords: variance calculation | Blocking:
Blocked By: |
-----------------------------------+----------------------------------------
Comment (by akvah):
Hi,
I checked the new code you committed, and it does not completely fits with
my suggestions. I guess I was not very clear on my explanations, so here
is a more elaborated one:
There are three (well known) methods to calculate the variance (the
explanations of each on is available at:
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance):
1. The Naive algorithm, which was originally implemented in the library.
2. The two pass algorithm, which I suggested to be implemented in my
previous post.
3. The incremental algorithm, which I suggested in my initial post.
So in my previous post, I argued that the naive approach is the least
stable (but the fastest), the incremental approach is the most accurate
(but the slowest), and the two pass method's accuracy is in the middle
(although it is closer to the incremental approach in terms of accuracy,
and closer to the Naive approach in terms of speed).
The two pass method is the one implemented in the R. John Cook
(http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-
computing-standard-deviation/) performed several experiments to compare
the accuracy of each one and concluded that either the incremental or the
two pass method should be the default method for variance calculation.
In terms of the computational demand I performed some experiments to
calculate the mean and variance of an array with 10000000 elements with
each mentioned algorithm. The naive, two pass, and the incremental
algorithms produced the result in 6, 7, 11 seconds respectively. So I
don't think it would be a great sacrifice (in terms of speed) to set the
default algorithm as the two pass algorithm (I attached a patch with my
new suggestion).
BTW, my name is Vahab Akbarzadeh (v(dot)akbarzadeh(at)gmail) and I would
be glad to be listed as one of the contributors.
Vahab
--
Ticket URL: <http://trac.research.cc.gatech.edu/fastlab/ticket/344#comment:5>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed at Georgia Tech.
More information about the mlpack-svn
mailing list