[mlpack-svn] [MLPACK] #163: Decide on "basic type" of observation for MLPACK
MLPACK Trac
trac at coffeetalk-1.cc.gatech.edu
Wed Nov 23 15:16:23 EST 2011
#163: Decide on "basic type" of observation for MLPACK
--------------------------------------+-------------------------------------
Reporter: rcurtin | Owner:
Type: wishlist | Status: new
Priority: blocker | Milestone: MLPACK 1.0
Component: MLPACK | Resolution:
Keywords: mlpack observation type | Blocking: 132
Blocked By: |
--------------------------------------+-------------------------------------
Comment (by nslagle):
>Yeah, but that incurs some code overhead and could make different machine
>learning models and methods incompatible. For instance, suppose we want to
>use HMMs on a discrete sequence of size_ts. Then, we want to run
>nearest-neighbors on that data sequence. But trees don't take size_t, they
>take arma::vec, so we have to convert all the data.
I suppose for now we should leave everything with arma::mat. The lack of
generality is something we'll probably have to revisit in the future, but
templates can help us avoid breaking existing code.
>Alternately, we have to make trees support size_t also. And if we're
>going to do that, we need to make sure our load/save functions are okay
>with arbitrary types too. So basically, we add another layer of complexity
>to our code, meaning that every class now has a template parameter for the
>type of data it's taking.
You raise some good points; if we ensure that each observation class
contains a representation function (essentially __repr__ in Python), we
can avoid save/load issues. The trees we build are likely applicable only
when the feature classes are ordered sets. (An enumeration of colors
isn't necessarily ordered, for example.)
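As a rough illustration of the representation-function idea, here is a minimal sketch. The Color type and the Repr()/FromRepr() names are hypothetical and not part of MLPACK; the only point is that a type which can round-trip through a string can be saved and loaded without the loader knowing anything else about it. Note that nothing here defines an ordering on Color.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical unordered observation type (not part of MLPACK).
enum class Color { Red, Green, Blue };

// Representation function, in the spirit of Python's __repr__.
inline std::string Repr(const Color c)
{
  switch (c)
  {
    case Color::Red:   return "Red";
    case Color::Green: return "Green";
    case Color::Blue:  return "Blue";
  }
  throw std::logic_error("unknown Color value");
}

// Inverse of Repr(), used when loading.
inline Color FromRepr(const std::string& s)
{
  if (s == "Red")   return Color::Red;
  if (s == "Green") return Color::Green;
  if (s == "Blue")  return Color::Blue;
  throw std::invalid_argument("not a Color: " + s);
}

// Saving and then loading gives back the same observation.
inline void RoundTripExample()
{
  assert(FromRepr(Repr(Color::Green)) == Color::Green);
}
```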
>In addition, a size_t can be represented by a double (minus corner cases)
>and we should be able to represent every other type as a double too.
>Therefore, everyone should be able to fit their problem inside of a
>double-only framework; maybe sometimes with a little fighting.
It's possible that long double might cover the full range of size_t.
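The "corner cases" can be checked directly. The sketch below assumes IEEE 754 doubles with a 53-bit significand; the helper names CoversSizeT() and SurvivesDouble() are made up for illustration. It compares significand widths with std::numeric_limits and shows that integers above 2^53 do not all survive a round trip through double.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// True if the significand of Float has at least as many bits as std::size_t,
// so every std::size_t value is exactly representable.
template <typename Float>
constexpr bool CoversSizeT()
{
  return std::numeric_limits<Float>::digits >=
      std::numeric_limits<std::size_t>::digits;
}

// True if n survives a round trip through double unchanged.
// Restricted to n <= 2^62 so the conversion back is always in range.
inline bool SurvivesDouble(const std::uint64_t n)
{
  assert(n <= (std::uint64_t(1) << 62));
  const double d = static_cast<double>(n);
  return static_cast<std::uint64_t>(d) == n;
}
```

With a 64-bit size_t, CoversSizeT<double>() is false. Whether CoversSizeT<long double>() is true depends on the platform: the x87 80-bit extended format has a 64-bit significand, but some compilers make long double identical to double.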
>One particular manifestation of the difficulties of multi-type support is
>that for an arbitrary observation type, it makes sense to give this
>function signature to the HMM Train() function
>but if Observation = arma::vec, now we are passing
>std::vector<arma::vec>. It would make so much more sense as just an
>arma::mat, because then we'd be able to use Armadillo's built-in
>covariance functions and whatnot; without that, we have to implement them
>ourselves for std::vector<arma::vec>. But on the other hand, we can't pass
>arma::Col<Observation> because Armadillo doesn't support element types
>that it doesn't know how to do calculations on.
>And on top of all that, I don't think it's a good idea to have some
>methods that take std::vector<Observation> and others that take arma::mat.
>It's time-consuming to convert between the two and makes the API
>inconsistent.
>What do you think? I'm not sure I've done the best job of describing the
>problem that led me to this question (with HMM function signatures); maybe
>I should have done that in the original ticket description.
I think that a general observation type is a (though not necessarily the)
solution, though the abstraction admittedly further obfuscates the code.
Like I said above, the lack of generality with arma::mat likely will be a
problem later, though the existing ML techniques in the library seem to be
okay with it for now.
In my opinion, cramming all possible observation types into arma::mat is a
design flaw, but we cannot address this adequately in the time remaining.
(I regret that it didn't occur to me until now.)
So, for now, we should stick to arma::mat and kludge techniques expecting
sequences of size_t so that they accept doubles, as you suggested earlier.
If we find later that several new methods require a more general
observation type, we can pass the observation type as a template parameter
and simply model our new observation types' members after Armadillo
(n_cols, n_rows, etc.). As you've pointed out previously, the compiler will
demand conformity among templated types, so if a user forgets to define
Observation.n_cols, he'll receive fair warning at compile time. Furthermore,
if Armadillo doesn't provide support for matrices of arbitrary types, maybe
we can add the support, and provide our own overridden methods.
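A minimal sketch of what that could look like follows. SymbolSequence and CountObservations() are hypothetical names, not part of MLPACK or Armadillo; the sketch only shows a templated function relying on an Armadillo-style n_cols member, so that a type lacking the member fails to compile when the template is instantiated.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical container of discrete observations (not part of MLPACK or
// Armadillo) exposing Armadillo-style n_rows / n_cols members.
struct SymbolSequence
{
  std::vector<std::size_t> symbols;
  std::size_t n_rows; // Dimensionality of each observation (always 1 here).
  std::size_t n_cols; // Number of observations.

  explicit SymbolSequence(std::vector<std::size_t> s) :
      symbols(std::move(s)), n_rows(1), n_cols(symbols.size()) { }
};

// Works with any type that has an n_cols member; a type without one is
// rejected at compile time when this template is instantiated with it.
template <typename DataType>
std::size_t CountObservations(const DataType& data)
{
  return data.n_cols;
}

// Usage: a sequence of four symbols counts as four observations.
inline void CountExample()
{
  const SymbolSequence seq(std::vector<std::size_t>{0, 2, 1, 1});
  assert(CountObservations(seq) == 4);
}
```

An arma::mat would satisfy the same template, since it also exposes n_cols, so existing arma::mat-based code would not have to change.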
--
Ticket URL: <https://trac.research.cc.gatech.edu/fastlab/ticket/163#comment:7>
MLPACK <www.fast-lab.org>
MLPACK is an intuitive, fast, and scalable C++ machine learning library developed by the FASTLAB at Georgia Tech under Dr. Alex Gray.