[mlpack-svn] [MLPACK] #163: Decide on "basic type" of observation for MLPACK

MLPACK Trac trac at coffeetalk-1.cc.gatech.edu
Wed Nov 23 15:16:23 EST 2011

#163: Decide on "basic type" of observation for MLPACK
  Reporter:  rcurtin                  |        Owner:            
      Type:  wishlist                 |       Status:  new       
  Priority:  blocker                  |    Milestone:  MLPACK 1.0
 Component:  MLPACK                   |   Resolution:            
  Keywords:  mlpack observation type  |     Blocking:  132       
Blocked By:                           |  

Comment (by nslagle):

 >Yeah, but that incurs some code overhead and could make different machine
 learning models and methods incompatible. For instance, suppose we want to
 use HMMs on a discrete sequence of size_ts. Then, we want to run nearest-
 neighbors on that data sequence. But trees don't take size_t, they take
 arma::vec, so we have to convert all the data.

 I suppose for now we should leave everything with arma::mat.  The lack of
 generality is something we'll probably have to revisit in the future, but
 templates can help us avoid breaking existing code.

 >Alternately, we have to make trees support size_t also. And if we're
 going to do that, we need to make sure our load/save functions are okay
 with arbitrary types too. So basically, we add another layer of complexity
 to our code, meaning that every class now has a template parameter for the
 type of data it's taking.

 You raise some good points; if we ensure that each observation class
 contains a representation function (essentially __repr__ in Python), we
 can avoid save/load issues.  The trees we build are likely applicable only
 when the feature classes are ordered sets.  (An enumeration of colors
 isn't necessarily ordered, for example.)
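
 A rough sketch of what such a representation function might look like for
 an unordered observation type follows.  The names Color, ToString() and
 FromString() are hypothetical and are not part of MLPACK; the only point is
 that a text round trip sidesteps type-specific save/load code.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical unordered observation type (not an MLPACK class).
enum class Color { Red, Green, Blue };

// Text representation, playing the role of Python's __repr__.
std::string ToString(const Color c)
{
  switch (c)
  {
    case Color::Red:   return "red";
    case Color::Green: return "green";
    case Color::Blue:  return "blue";
  }
  throw std::logic_error("unknown Color value");
}

// Inverse of ToString(), so a loader can rebuild the observation.
Color FromString(const std::string& s)
{
  if (s == "red")   return Color::Red;
  if (s == "green") return Color::Green;
  if (s == "blue")  return Color::Blue;
  throw std::invalid_argument("not a Color: " + s);
}
```

 Nothing here gives Color an ordering, so a tree that splits on "less than"
 would still have no meaningful way to use it.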

 >In addition, a size_t can be represented by a double (minus corner cases)
 and we should be able to represent every other type as a double too.
 Therefore, everyone should be able to fit their problem inside of a
 double-only framework; maybe sometimes with a little fighting.

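 It's possible that long double might cover the full range of size_t, at
 least where its mantissa is at least as wide as size_t (x86 extended
 precision with a 64-bit size_t, for example); that isn't guaranteed on
 every platform.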
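
 One way to check this on a given platform, rather than guess, is to compare
 mantissa width against integer width with std::numeric_limits.  The helper
 name RepresentsExactly() below is made up for illustration, and the results
 for long double depend on the compiler and architecture.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// True when Float has at least as many binary mantissa digits as Integer has
// value bits, i.e. every Integer value converts to Float without rounding.
template<typename Float, typename Integer>
constexpr bool RepresentsExactly()
{
  return std::numeric_limits<Float>::radix == 2 &&
      std::numeric_limits<Float>::digits >=
      std::numeric_limits<Integer>::digits;
}

// Compile-time answers for the types discussed in this ticket.  The size_t
// results are platform dependent; the 32-bit one holds for IEEE 754 double.
constexpr bool doubleCoversSizeT =
    RepresentsExactly<double, std::size_t>();
constexpr bool longDoubleCoversSizeT =
    RepresentsExactly<long double, std::size_t>();
constexpr bool doubleCoversUint32 =
    RepresentsExactly<double, std::uint32_t>();
```

 With an IEEE 754 double (53 mantissa bits) and a 64-bit size_t,
 doubleCoversSizeT is false, which is the "corner cases" caveat above.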

 >One particular manifestation of the difficulties of multi-type support is
 that for an arbitrary observation type, it makes sense to give this
 function signature to the HMM Train() function

 >but if Observation = arma::vec, now we are passing
 std::vector<arma::vec>. It would make so much more sense as just an
 arma::mat, because then we'd be able to use Armadillo's built-in
 covariance functions and whatnot; without that, we have to implement them
 ourselves for std::vector<arma::vec>. But on the other hand, we can't pass
 arma::Col<Observation> because Armadillo doesn't support element types
 that it doesn't know how to do calculations on.

 >And on top of all that, I don't think it's a good idea to have some
 methods that take std::vector<Observation> and others that take arma::mat.
 It's time-consuming to convert between the two and makes the API

 >What do you think? I'm not sure I've done the best job of describing the
 problem that led me to this question (with HMM function signatures); maybe
 I should have done that in the original ticket description.

 I think that a general observation type is a (though not necessarily the)
 solution, though the abstraction admittedly further obfuscates the code.
 Like I said above, the lack of generality with arma::mat likely will be a
 problem later, though the existing ML techniques in the library seem to be
 okay with it for now.

 In my opinion, cramming all possible observation types into arma::mat is a
 design flaw, but we cannot address this adequately in the time remaining.
 (I regret that it didn't occur to me until now.)

 So, for now, we should stick to arma::mat and kludge techniques expecting
 sequences of size_t so that they accept doubles, as you suggested earlier.
 If we find later that several new methods require a more general
 observation type, we can pass a template of observation type, and simply
 model our new observation types' interface after Armadillo's (n_cols, n_rows,
 etc.).  As you've pointed out previously, the compiler will demand
 conformity among templated types, so if a user forgets to define
 Observation.n_cols, he'll get a compile error.  Furthermore, if Armadillo
 doesn't provide support for matrices of arbitrary types, maybe we can add
 the support, and provide our own overridden methods.
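
 As a sketch of the templated approach, assuming we mirror Armadillo's public
 n_rows / n_cols members and its (row, col) element access: DiscreteSequence
 and SumElements() below are hypothetical names, not existing MLPACK code.
 An arma::mat exposes the same members, so the same template should accept
 it as well.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical container for a sequence of discrete observations, exposing
// the Armadillo-style members a templated method would rely on.
struct DiscreteSequence
{
  explicit DiscreteSequence(std::vector<std::size_t> observations) :
      n_rows(1),
      n_cols(observations.size()),
      values(std::move(observations)) { }

  const std::size_t n_rows; // One dimension per observation.
  const std::size_t n_cols; // One column per observation, as in arma::mat.

  // Element access in (row, col) order; the row is ignored since n_rows == 1.
  std::size_t operator()(const std::size_t /* row */,
                         const std::size_t col) const
  {
    return values[col];
  }

 private:
  std::vector<std::size_t> values;
};

// Works for any MatType providing n_rows, n_cols and operator()(row, col).
// A type missing one of these fails to compile at the call site.
template<typename MatType>
double SumElements(const MatType& data)
{
  double sum = 0.0;
  for (std::size_t c = 0; c < data.n_cols; ++c)
    for (std::size_t r = 0; r < data.n_rows; ++r)
      sum += static_cast<double>(data(r, c));
  return sum;
}
```

 Existing callers that pass arma::mat would not need to change; only methods
 that want a new observation type would have to supply one that conforms.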

Ticket URL: <https://trac.research.cc.gatech.edu/fastlab/ticket/163#comment:7>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed by the FASTLAB at Georgia Tech under Dr. Alex Gray.
