[mlpack] [GSoC2013] Collaborative Filtering Package

Ryan Curtin gth671b at mail.gatech.edu
Mon Apr 22 14:07:34 EDT 2013


On Sat, Apr 20, 2013 at 01:44:42PM +0530, Sarthak Kukreti wrote:
> Hi all
> 
> I am a final year undergrad, pursuing my Bachelors in Engineering at NSIT,
> Delhi, majoring in Computer Engineering. My main area of research is social
> discovery in large scale graphs; I have worked on link prediction in social
> network graphs and I mostly use Python or C++ for implementations.
> 
> As a part of my undergraduate thesis on recommendation systems, I have been
> working on implementing matrix factorization models. I have already
> implemented the base model for matrix factorization using stochastic
> gradient descent in Python as proposed in Yehuda Koren's paper [1]. It's
> quite slow, but it achieves an RMSE of 0.98 on the MovieLens dataset [5].
> Besides the approaches mentioned in [1], my final thesis involves
> implementing Probabilistic Matrix Factorization [2], Bayesian Probabilistic
> Tensor Factorization [3], and distributed stochastic gradient descent for
> matrix factorization [4] (almost implemented).
> 
> I am interested in developing the collaborative engine package for mlpack
> and I think quite a lot of my work on my thesis can be subsequently
> deployed as a part of it. From my current vantage point, the collaborative
> engine package would have a group of such models, sample data for testing,
> and supporting functions for them like parameter selection, RMSE, plots for
> convergence rate, and comparing different models. I would like to discuss
> how you would ideally want me to proceed, and how you view the package as a
> whole.

Hello Sarthak,

There has been a lot of discussion about the collaborative filtering
project.  Take a look at these threads (and be sure to read the
responses too):

https://mailman.cc.gatech.edu/pipermail/mlpack/2013-April/000034.html
https://mailman.cc.gatech.edu/pipermail/mlpack/2013-April/000049.html
https://mailman.cc.gatech.edu/pipermail/mlpack/2013-April/000051.html

You can also search the mlpack archives for more:

https://mailman.cc.gatech.edu/pipermail/mlpack/2013-April/thread.html

Also, mlpack already has SGD implemented.  You can find it in
src/mlpack/core/optimizers/sgd/.  Distributed SGD would be easy, at
least in the shared-memory case -- add a couple of lines for OpenMP to
the existing SGD code.

> I am also attaching the baseline code. Although it's in Python, the final
> work I am planning will have a similar structure. I would like your views on
> the structure and quality of code.

There are no comments in this code, horizontal whitespace is nonexistent
(a * b is much more readable than a*b), and there's no vertical
whitespace.  I haven't looked through the actual implementation
thoroughly because without comments it's basically write-only code and
can't realistically be maintained.  The mlpack style guidelines can be
found here:

http://www.mlpack.org/trac/wiki/NewStyleGuidelines
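
To make the whitespace and commenting points concrete, here is a
minimal sketch of one SGD step for matrix factorization.  It is not
taken from your attachment; the function and parameter names are made
up for illustration, and it is in Python only because your baseline is
(the guidelines above are written for C++).

```python
def sgd_update(user_vec, item_vec, rating, step_size=0.01, reg=0.05):
    """One SGD step for a single observed (user, item, rating) triple.

    Both vectors are plain lists of floats of equal length.  New lists
    are returned; the inputs are left untouched.  All names here are
    hypothetical.
    """
    # The predicted rating is the dot product of the two factor vectors.
    predicted = sum(u * v for u, v in zip(user_vec, item_vec))

    # Error on this one rating.
    error = rating - predicted

    # Step each vector against the gradient of the regularized squared
    # error, using the old values of both vectors.
    new_user = [u + step_size * (error * v - reg * u)
                for u, v in zip(user_vec, item_vec)]
    new_item = [v + step_size * (error * u - reg * v)
                for u, v in zip(user_vec, item_vec)]

    return new_user, new_item
```

The spaces around operators, the blank lines between steps, and the
one-line comments are what make this something another person can
pick up later.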

Take a look at the SGD implementation in src/mlpack/core/optimizers/sgd
and you'll see it's not restricted to a specific objective function.
That kind of modularity is a goal in mlpack code; because of it, SGD can
be plugged into numerous different problems (see src/mlpack/methods/nca/
for an example -- that's Neighborhood Components Analysis).
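
As a rough sketch of that separation (in Python, since that's what
your baseline uses; this is not mlpack's actual interface, and the
method names num_functions(), evaluate() and gradient() are
assumptions for illustration), the optimizer only needs the objective
to expose a few methods:

```python
import random


class LeastSquaresFunction:
    """Hypothetical objective: sum over i of (slope * x_i - y_i)^2."""

    def __init__(self, xs, ys):
        self.xs = xs
        self.ys = ys

    def num_functions(self):
        # Number of separable terms in the objective.
        return len(self.xs)

    def evaluate(self, coords, i):
        # Value of the i-th term at the given coordinates.
        residual = coords[0] * self.xs[i] - self.ys[i]
        return residual * residual

    def gradient(self, coords, i):
        # Gradient of the i-th term with respect to the coordinates.
        residual = coords[0] * self.xs[i] - self.ys[i]
        return [2.0 * residual * self.xs[i]]


def sgd(function, coords, step_size=0.01, max_iterations=10000, seed=0):
    """Plain SGD that knows nothing about the objective it minimizes."""
    rng = random.Random(seed)
    coords = list(coords)

    for _ in range(max_iterations):
        # Pick one term at random and step against its gradient.
        i = rng.randrange(function.num_functions())
        grad = function.gradient(coords, i)
        coords = [c - step_size * g for c, g in zip(coords, grad)]

    return coords
```

Calling sgd(LeastSquaresFunction(xs, ys), [0.0]) fits a slope; a matrix
factorization objective with the same three methods could be passed to
the same sgd() without touching the optimizer.  For the real interface,
look at the headers in the sgd directory.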

-- 
Ryan Curtin       | "If you understood everything I said, you'd be me."
ryan at igglybob.com |   - Miles Davis

