[mlpack-svn] [MLPACK] #212: Cluster a data set following estimating a GMM
MLPACK Trac
trac at coffeetalk-1.cc.gatech.edu
Mon Mar 26 15:36:33 EDT 2012
#212: Cluster a data set following estimating a GMM
----------------------------+-----------------------------------------------
Reporter: Adam | Owner:
Type: task | Status: new
Priority: major | Milestone:
Component: mlpack | Resolution:
Keywords: GMM, Cluster, | Blocking:
Blocked By: |
----------------------------+-----------------------------------------------
Comment (by rcurtin):
I haven't responded to this one yet because I'm not sure quite how to
handle it. It seems like I could add a helpful `double
GMM::Probability(const arma::vec& observation, const size_t component)`
which would return the probability of a point being from one Gaussian
component in the mixture. However, I haven't yet seen a way to do what
you're saying in a fast way. Given that function you could do something
resembling
{{{
arma::Col<size_t> assignments(dataset.n_cols);
for (size_t i = 0; i < dataset.n_cols; ++i)
{
// Find maximum probability component.
double probability = 0;
for (size_t j = 0; j < gmm.Gaussian(); ++j)
{
double newProb = GMM.Probability(dataset.col(i), j);
if (newProb > probability)
{
probability = newProb;
assignments[i] = j;
}
}
}
}}}
But even this is O(n * m) where n is the number of points and m is the
number of mixture components. I'm not sure I could really do any better
than that though, unless we brought trees into the calculation and I don't
think I have the time to devote to that. It could probably produce good
performance though -- using the pruning step of the dual-tree algorithm to
discard unlikely mixtures. The runtime of that would be more along the
lines of O(m log n) or O(n log m) or something like that (I haven't
thought too hard about it).
Perhaps for now a good, simple solution would be to provide that utility
`GMM::Probability()` function I mentioned earlier, along with a function
like `GMM::ComponentAssignments()` which would basically be an
implementation of the above function (or the kd-tree version if you're
feeling ambitious)? Let me know what you think. I can't think of a
better name than `ComponentAssignments()` for now but I'm sure if I sleep
on it I may find a more descriptive and meaningful name.
--
Ticket URL: <http://trac.research.cc.gatech.edu/fastlab/ticket/212#comment:1>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed by the FASTLAB at Georgia Tech under Dr. Alex Gray.
More information about the mlpack-svn
mailing list