[mlpack-svn] [MLPACK] #212: Cluster a data set following estimating a GMM

MLPACK Trac trac at coffeetalk-1.cc.gatech.edu
Mon Mar 26 15:36:33 EDT 2012


#212: Cluster a data set following estimating a GMM
----------------------------+-----------------------------------------------
  Reporter:  Adam           |        Owner:     
      Type:  task           |       Status:  new
  Priority:  major          |    Milestone:     
 Component:  mlpack         |   Resolution:     
  Keywords:  GMM, Cluster,  |     Blocking:     
Blocked By:                 |  
----------------------------+-----------------------------------------------

Comment (by rcurtin):

 I haven't responded to this one yet because I'm not sure quite how to
 handle it.  It seems like I could add a helpful `double
 GMM::Probability(const arma::vec& observation, const size_t component)`
 which would return the probability of a point being from one Gaussian
 component in the mixture.  However, I haven't yet seen a fast way to do
 what you're describing.  Given that function, you could do something
 resembling this:

 {{{
 arma::Col<size_t> assignments(dataset.n_cols);
 for (size_t i = 0; i < dataset.n_cols; ++i)
 {
   // Find maximum probability component.
   double probability = 0;
   assignments[i] = 0; // Default assignment in case every probability is 0.
   for (size_t j = 0; j < gmm.Gaussians(); ++j)
   {
     double newProb = gmm.Probability(dataset.col(i), j);
     if (newProb > probability)
     {
       probability = newProb;
       assignments[i] = j;
     }
   }
 }
 }}}
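
 For reference, here is a minimal sketch of what that per-component
 `Probability()` overload might look like.  This assumes the existing `phi()`
 Gaussian density helper and `weights`, `means`, and `covariances` members;
 the names are illustrative, not a final API:

 {{{
 // Hypothetical member function: probability of `observation` under a single
 // mixture component, weighted by that component's mixing weight.
 double GMM::Probability(const arma::vec& observation,
                         const size_t component) const
 {
   // phi() evaluates the multivariate Gaussian density N(x; mean, covariance).
   return weights[component] *
       phi(observation, means[component], covariances[component]);
 }
 }}}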

 But even this is O(n * m), where n is the number of points and m is the
 number of mixture components.  I'm not sure I could really do any better
 than that unless we brought trees into the calculation, and I don't think
 I have the time to devote to that.  It could probably give good
 performance, though -- the pruning step of a dual-tree algorithm could
 discard unlikely mixture components early.  The runtime of that would be
 more along the lines of O(m log n) or O(n log m) or something like that (I
 haven't thought too hard about it).

 Perhaps for now a good, simple solution would be to provide that utility
 `GMM::Probability()` function I mentioned earlier, along with a function
 like `GMM::ComponentAssignments()`, which would basically be an
 implementation of the loop above (or the kd-tree version, if you're
 feeling ambitious).  Let me know what you think.  I can't think of a
 better name than `ComponentAssignments()` for now, but I'm sure if I sleep
 on it I'll find something more descriptive and meaningful.
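
 If it helps the discussion, a rough sketch of how `ComponentAssignments()`
 could wrap the loop above (the name and signature are just placeholders,
 and it assumes the per-component `Probability()` overload exists):

 {{{
 // Hypothetical member function: fill `assignments` with the index of the
 // most probable component for each column (point) of `observations`.
 void GMM::ComponentAssignments(const arma::mat& observations,
                                arma::Col<size_t>& assignments) const
 {
   assignments.set_size(observations.n_cols);
   for (size_t i = 0; i < observations.n_cols; ++i)
   {
     double maxProb = 0;
     assignments[i] = 0;
     for (size_t j = 0; j < Gaussians(); ++j)
     {
       const double prob = Probability(observations.col(i), j);
       if (prob > maxProb)
       {
         maxProb = prob;
         assignments[i] = j;
       }
     }
   }
 }
 }}}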

-- 
Ticket URL: <http://trac.research.cc.gatech.edu/fastlab/ticket/212#comment:1>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed by the FASTLAB at Georgia Tech under Dr. Alex Gray.

