[mlpack-svn] [MLPACK] #344: Using Welford method to calculate variance in Naive Bayes Classifier

Wed Apr 16 10:59:05 EDT 2014

#344: Using Welford method to calculate variance in Naive Bayes Classifier
-----------------------------------+----------------------------------------
  Reporter:  akvah                 |        Owner:  rcurtin     
      Type:  enhancement           |       Status:  closed      
  Priority:  minor                 |    Milestone:  mlpack 1.0.9
 Component:  mlpack                |   Resolution:  fixed       
  Keywords:  variance calculation  |     Blocking:              
Blocked By:                        |  
-----------------------------------+----------------------------------------

Comment (by akvah):

 Hi,

 I checked the new code you committed, and it does not completely fits with
 my suggestions. I guess I was not very clear on my explanations, so here
 is a more elaborated one:
 There are three (well known) methods to calculate the variance (the
 explanations of each on is available at:
 http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance):
 1. The Naive algorithm, which was originally implemented in the library.
 2. The two pass algorithm, which I suggested to be implemented in my
 previous post.
 3. The incremental algorithm, which I suggested in my initial post.

 So in my previous post, I argued that the naive approach is the least
 stable (but the fastest), the incremental approach is the most accurate
 (but the slowest), and the two pass method's accuracy is in the middle
 (although it is closer to the incremental approach in terms of accuracy,
 and closer to the Naive approach in terms of speed).

 The two pass method is the one implemented in the R. John Cook
 (http://www.johndcook.com/blog/2008/09/26/comparing-three-methods-of-
 computing-standard-deviation/) performed several experiments to compare
 the accuracy of each one and concluded that either the incremental or the
 two pass method should be the default method for variance calculation.

 In terms of the computational demand I performed some experiments to
 calculate the mean and variance of an array with 10000000 elements with
 each mentioned algorithm. The naive, two pass, and the incremental
 algorithms produced the result in 6, 7, 11 seconds respectively. So I
 don't think it would be a great sacrifice (in terms of speed) to set the
 default algorithm as the two pass algorithm (I attached a patch with my
 new suggestion).

 BTW, my name is Vahab Akbarzadeh (v(dot)akbarzadeh(at)gmail) and I would
 be glad to be listed as one of the contributors.

 Vahab

-- 
Ticket URL: <http://trac.research.cc.gatech.edu/fastlab/ticket/344#comment:5>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed at Georgia Tech.