[mlpack] [GSoc2013] Interested in developing collaborative filtering package

Ryan Curtin gth671b at mail.gatech.edu
Mon Apr 22 14:28:43 EDT 2013


On Sat, Apr 20, 2013 at 02:31:01PM +0530, Nilesh Chakraborty wrote:
> Hi Ajinkya,
> 
> Yes, it's at matchfreak.com. The frontend is in a very early stage and some
> stuff are breaking on the site. I'm currently working on fixing it and
> improving the load time. I'm afraid that since it's not open source, I
> won't be able to share its exact code with you, but I can describe how it
> works.
> 
> In short, I fetch a user's status messages and Likes, and those of his/her
> friends and store it in a DB. The database has hundreds of thousands of
> data like this, from many users. Now, the Likes and status messages are
> scanned against a modified wikipedia data dump and a lot of article IDs are
> extracted. Those are stored in the DB as user-item combinations. These are
> fed to Myrrix, and call an API for finding similar items. We used to use
> Mahout before this, with calculating user similarity using log-likelihood
> metric. Myrrix fared better, in terms of speed and usability.
> 
> The best algorithm options to venture into first would be Alternating Least
> Squares, and maybe SVD. But ALS provides a lot of added perks that we
> normally wouldn't have from SVD, like handling incomplete user-item pairs
> etc. This paper is a good description of a parallel ALS algorithm -
> http://www.hpl.hp.com/personal/Robert_Schreiber/papers/2008%20AAIM%20Netflix/netflix_aaim08(submitted).pdf
> 
> No, I have not implemented CF algorithms from scratch. I have mostly used
> APIs and studyied research papers on different kinds of algorithm
> implementations.

Hello Nilesh,

I'm going to do my usual thing here and link a bunch of threads that are
worth reading (be sure to read the responses too):

https://mailman.cc.gatech.edu/pipermail/mlpack/2013-April/000034.html
https://mailman.cc.gatech.edu/pipermail/mlpack/2013-April/000049.html
https://mailman.cc.gatech.edu/pipermail/mlpack/2013-April/000051.html

Note that somewhere in there, the topic of QUIC-SVD comes up.  That's a
tree-based fast SVD that (a long time ago) used to be implemented in
mlpack but didn't have a maintainer and eventually got deleted (it also
didn't have any useful tests).  So it would be nice to re-implement
something like that.

> Suppose we want some kind of an implementation of SVD. Do you think it is a
> good idea to add a dependency on a mature library like
> Eigen<http://eigen.tuxfamily.org/index.php?title=Main_Page> (all
> templates, only compile-time dependency) or
> redsvd<https://code.google.com/p/redsvd/>,
> or is it better to implement it ourselves reusing already existing code
> from mlpack and armadillo if possible, and why?

Armadillo already has svd():

http://arma.sourceforge.net/docs.html#svd

That doesn't mean that QUIC-SVD wouldn't be an improvement.  Anyway,
keeping the number of mlpack dependencies minimal is a very important
goal.  Each new dependency means another API which we don't control that
is subject to (potentially haphazard) changes which means suddenly
mlpack doesn't compile or work right when they release new versions of
their libraries.  So, preferably any implementations of CF algorithms
will avoid introducing any new dependencies.

-- 
Ryan Curtin       | "Long live the new flesh."
ryan at igglybob.com |   - Max


More information about the mlpack mailing list