[mlpack-svn] [MLPACK] #163: Decide on "basic type" of observation for MLPACK

MLPACK Trac trac at coffeetalk-1.cc.gatech.edu
Wed Nov 23 14:34:38 EST 2011


#163: Decide on "basic type" of observation for MLPACK
--------------------------------------+-------------------------------------
  Reporter:  rcurtin                  |        Owner:            
      Type:  wishlist                 |       Status:  new       
  Priority:  blocker                  |    Milestone:  MLPACK 1.0
 Component:  MLPACK                   |   Resolution:            
  Keywords:  mlpack observation type  |     Blocking:  132       
Blocked By:                           |  
--------------------------------------+-------------------------------------

Comment (by rcurtin):

 > I suppose we might send arma::Col<struct myMLObservationType> into the
 > technique, provided Armadillo can handle this.

 Nope, Armadillo won't handle anything other than basic numeric types.
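
 Just to make that concrete, here's roughly what a user would hit (a quick
 untested sketch; `MyObservation` is just a made-up struct):

 {{{
 #include <armadillo>

 struct MyObservation { double value; size_t label; };

 int main()
 {
   arma::Col<double> ok(10);  // fine: double is a supported numeric type

   // arma::Col<MyObservation> bad(10);
   // fails to compile: Armadillo only accepts basic numeric element types
   // (integers, float, double, and their complex variants)

   return 0;
 }
 }}}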

 > That being the case, let the machine learning technique decide whether
 > it prefers arma::vec or arma::Col<size_t>. Utility calls might require an
 > additional layer of templating (pass arma::Col<T> instead of arma::vec).

 Yeah, but that incurs some code overhead and could make different machine
 learning models and methods incompatible.  For instance, suppose we want
 to use HMMs on a discrete sequence of `size_t`s.  Then, we want to run
 nearest-neighbors on that data sequence.  But trees don't take `size_t`,
 they take `arma::vec`, so we have to convert all the data.
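
 Roughly, that conversion step would look like this (sketch; the data is
 made up):

 {{{
 #include <armadillo>

 int main()
 {
   // Hypothetical discrete observation sequence for the HMM.
   arma::uvec sequence("0 2 1 1 3 0 2");

   // Trees only take floating-point data, so the whole sequence has to be
   // copied and converted element-by-element before a tree can be built.
   arma::vec converted = arma::conv_to<arma::vec>::from(sequence);

   converted.print("converted sequence:");
   return 0;
 }
 }}}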

 Alternatively, we have to make trees support `size_t` also.  And if we're
 going to do that, we need to make sure our load/save functions are okay
 with arbitrary types too.  So basically, we add another layer of
 complexity to our code, meaning that every class now has a template
 parameter for the type of data it's taking.
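
 To be clear about what that extra layer would look like (hypothetical
 sketch; none of these templates exist right now):

 {{{
 #include <string>
 #include <armadillo>

 // Every class grows an element type parameter...
 template<typename eT> class BinarySpaceTree { /* ... */ };
 template<typename eT> class HMM             { /* ... */ };

 // ...and so does load/save and every other free function that touches data.
 template<typename eT>
 bool Load(const std::string& filename, arma::Mat<eT>& matrix);
 }}}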

 I don't think it's worth it to add another template parameter to
 everything and increase the complexity of the code by a huge amount, just
 so we can support the few users who want to run on a weird type of data.

 In addition, a `size_t` can be represented by a double (exactly, for values
 up to 2^53, so minus some corner cases) and we should be able to represent
 every other type as a double too.
 Therefore, everyone should be able to fit their problem inside of a
 double-only framework; maybe sometimes with a little fighting.
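
 For instance (quick sketch), discrete symbols round-trip through a double
 matrix without any loss:

 {{{
 #include <cassert>
 #include <armadillo>

 int main()
 {
   // Discrete symbols stored directly in a matrix of doubles; integers up
   // to 2^53 are represented exactly, which covers any realistic label set.
   arma::mat data("0 2 1 1 3 0 2");

   size_t symbol = (size_t) data(0, 3);  // read a symbol back out
   assert(symbol == 1);

   return 0;
 }
 }}}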

 One particular manifestation of the difficulties of multi-type support is
 that for an arbitrary observation type, it makes sense to give this
 function signature to the HMM `Train()` function:

 {{{
 void Train(const std::vector<Observation>& observation,
            const std::vector<size_t>& states);
 }}}

 but if `Observation = arma::vec`, now we are passing
 `std::vector<arma::vec>`.  It would make so much more sense as just an
 `arma::mat`, because then we'd be able to use Armadillo's built-in
 covariance functions and whatnot; without that, we have to implement them
 ourselves for `std::vector<arma::vec>`.  But on the other hand, we can't
 pass `arma::Col<Observation>` either, because Armadillo doesn't support
 element types it doesn't know how to do arithmetic on.
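
 Rough sketch of what the `arma::mat` version buys us (assuming one
 observation per column, which is what we do elsewhere):

 {{{
 #include <armadillo>

 int main()
 {
   // 100 observations of dimension 3, one observation per column.
   arma::mat observations = arma::randu<arma::mat>(3, 100);

   // arma::cov() expects one observation per row, so transpose first;
   // either way, Armadillo does the covariance computation for us.
   arma::mat covariance = arma::cov(observations.t());

   // With std::vector<arma::vec> we'd have to accumulate the mean and the
   // outer products ourselves in a hand-written loop instead.
   covariance.print("covariance:");
   return 0;
 }
 }}}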

 And on top of all that, I don't think it's a good idea to have some
 methods that take `std::vector<Observation>` and others that take
 `arma::mat`.  It's time-consuming to convert between the two and makes the
 API inconsistent.

 What do you think?  I'm not sure I've done the best job of describing the
 problem that led me to this question (with HMM function signatures); maybe
 I should have done that in the original ticket description.

-- 
Ticket URL: <http://trac.research.cc.gatech.edu/fastlab/ticket/163#comment:6>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed by the FASTLAB at Georgia Tech under Dr. Alex Gray.

