[mlpack-svn] [MLPACK] #344: Using Welford method to calculate variance in Naive Bayes Classifier

Tue Apr 15 09:51:34 EDT 2014

#344: Using Welford method to calculate variance in Naive Bayes Classifier
-----------------------------------+----------------------------------------
  Reporter:  akvah                 |        Owner:  rcurtin 
      Type:  enhancement           |       Status:  accepted
  Priority:  minor                 |    Milestone:          
 Component:  mlpack                |   Resolution:          
  Keywords:  variance calculation  |     Blocking:          
Blocked By:                        |  
-----------------------------------+----------------------------------------

Comment (by akvah):

 Hi Ryan,

 Yes, the new code seems ok.
 You are right about the running time: because we do a division at each
 iteration, it takes much longer.
 So I looked around and found that the standard (two-pass) approach is the
 one usually used. Although it makes two passes over the data, it performs
 essentially the same operations as the squared method (only split across
 two passes), so its running time should be close to that of the squared
 method.
 For example, I looked at the R source code
 (src/library/stats/src/cov.c) and saw that it also uses the standard
 algorithm.
 Detecting when the algorithm is going to fail is not that easy. The point
 is that when the mean is large relative to the variance, both the standard
 and the squared method lose accuracy (the larger the gap, the more error
 they accumulate). The standard method fails much less often than the
 squared method, though
 (http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-computing-standard-deviation/).
 So I suggest making the method a parameter of the function, with the
 default being either the standard or the Welford method.

 Vahab

-- 
Ticket URL: <https://trac.research.cc.gatech.edu/fastlab/ticket/344#comment:2>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed at Georgia Tech.