[mlpack-git] [mlpack] CF cannot properly handle 0s in the input matrix (#379)

Wed Jan 14 15:22:28 EST 2015

After further thinking, I agree with your comments about the modified `sp_mat`.  I had a feeling it was a half-baked idea, but I hadn't yet figured out exactly how.

A `uvec`/`vec` pair isn't necessarily a problem, and I don't see any serious performance issues with using one.  Sparse matrices are a compelling abstraction for collaborative filtering, where some entries are missing, but they're not necessarily the best solution.  Ideally I'd like a unified API, so that we can give the user documentation that says "Okay, you're doing collaborative filtering and using the CF class.  Pass in your data like this." instead of "Well, depending on which factorizer you use, your data may come in differently...".
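To make the zero-versus-missing problem concrete, here is a minimal sketch using only the standard library.  `CoordinateList`, `ToDense()`, `CountNonZero()`, and `ExampleRatings()` are hypothetical names made up for illustration (they are not mlpack or Armadillo API); the struct just stands in for a `uvec`/`vec` style coordinate list.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a uvec/vec coordinate list: one (user, item)
// location per observed rating, plus the rating value itself.
struct CoordinateList
{
  std::vector<std::size_t> users;
  std::vector<std::size_t> items;
  std::vector<double> ratings;

  void Add(std::size_t user, std::size_t item, double rating)
  {
    users.push_back(user);
    items.push_back(item);
    ratings.push_back(rating);
  }

  // Every stored entry counts as observed, even if its value is 0.
  std::size_t NumObserved() const { return ratings.size(); }
};

// Convert to a dense users-by-items matrix, writing 0 wherever no rating was
// observed.  After this step an explicit 0 rating and a missing rating are
// indistinguishable.
std::vector<std::vector<double>> ToDense(const CoordinateList& list,
                                         std::size_t numUsers,
                                         std::size_t numItems)
{
  assert(list.users.size() == list.ratings.size());
  assert(list.items.size() == list.ratings.size());

  std::vector<std::vector<double>> dense(
      numUsers, std::vector<double>(numItems, 0.0));
  for (std::size_t i = 0; i < list.ratings.size(); ++i)
    dense[list.users[i]][list.items[i]] = list.ratings[i];
  return dense;
}

// Count the entries that a "zero means missing" rule would treat as observed.
std::size_t CountNonZero(const std::vector<std::vector<double>>& dense)
{
  std::size_t count = 0;
  for (const auto& row : dense)
    for (double value : row)
      if (value != 0.0)
        ++count;
  return count;
}

// Three observed ratings, one of which is an explicit 0.
CoordinateList ExampleRatings()
{
  CoordinateList list;
  list.Add(0, 0, 5.0);
  list.Add(0, 1, 0.0);  // Observed, but rated 0.
  list.Add(1, 1, 3.0);
  return list;
}
```

With `ExampleRatings()`, the coordinate list reports three observed ratings, but after `ToDense()` only two entries survive a nonzero check: the explicit 0 at (0, 1) now looks exactly like the missing entry at (1, 0).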

Right now we have these factorizers:

* RegularizedSVD -- takes a `mat` coordinate list (should probably be split into `uvec`/`vec`)
* AMF<> (this is a whole class of factorizers) -- may take `sp_mat` or `mat` depending; in either case, entries that are zero are assumed to be missing, except for the NMF update rules (I think)
* QUIC_SVD -- takes `mat`, does not consider missing values
* MatrixCompletion (needs some glue, but could work as one) -- takes a `uvec`/`vec` coordinate list

It would be possible but difficult to refactor the AMF rules to use `uvec`/`vec` (especially the NMF rules, which take advantage of linear algebra expressions).  NMF does appear to be used for collaborative filtering without any modification, as suggested by this paper: http://arxiv.org/pdf/1205.3193.pdf , so I don't think we can say "okay, we'll just drop support for NMF as a factorizer for CF".

NMF is a good case where the `mat`/`sp_mat` duality is really nice; see `src/mlpack/methods/amf/update_rules/nmf_mult_dist.hpp` and the fact that `HUpdate()` and `WUpdate()` work the same with `mat` or `sp_mat`.  In some of the other AMF update rules, the `row_col_iterator` is used to give the same type of sparse/dense genericity, although those rules make the assumption that a zero value means a missing value.
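As a rough illustration of that kind of sparse/dense genericity (and of the zero-means-missing assumption that comes with it), here is a sketch using only the standard library.  The `Dense` and `Sparse` aliases, `ForEachNonZero()`, and `SquaredError()` are hypothetical names for this example; this is not how `row_col_iterator` or the AMF update rules are actually implemented.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-ins for a dense and a sparse matrix type.
using Dense = std::vector<std::vector<double>>;
using Sparse = std::map<std::pair<std::size_t, std::size_t>, double>;

// Visit every nonzero entry of a dense matrix as (row, col, value).
template<typename Visitor>
void ForEachNonZero(const Dense& m, Visitor visit)
{
  for (std::size_t r = 0; r < m.size(); ++r)
    for (std::size_t c = 0; c < m[r].size(); ++c)
      if (m[r][c] != 0.0)
        visit(r, c, m[r][c]);
}

// Visit every nonzero entry of a sparse matrix as (row, col, value).
// Explicitly stored zeros are skipped, so they are treated as missing too.
template<typename Visitor>
void ForEachNonZero(const Sparse& m, Visitor visit)
{
  for (const auto& entry : m)
    if (entry.second != 0.0)
      visit(entry.first.first, entry.first.second, entry.second);
}

// Written once for both matrix types: the sum of squared residuals against a
// constant prediction, taken only over entries considered observed.
template<typename MatType>
double SquaredError(const MatType& data, double prediction)
{
  double total = 0.0;
  ForEachNonZero(data, [&](std::size_t, std::size_t, double value)
  {
    const double diff = value - prediction;
    total += diff * diff;
  });
  return total;
}
```

The generic `SquaredError()` never needs to know which representation it was given, which is the appeal; the cost is that a genuine 0 rating is silently dropped in both cases.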

In my mind, what's necessary here is some kind of API consistency, which is really what I'm striving for in the end.  (Otherwise, as more factorizers and CF code get added, it gets even uglier...)  I'd be happy with switching all factorizers to `uvec`/`vec` as input, but this doesn't make it clear what we should do with NMF, which has matrix expressions where missing values no longer make any sense.  That solution also doesn't consider the case where things like AMF, QUIC_SVD, and RegularizedSVD would be used for tasks where there are no missing entries, either with `mat` or `sp_mat` as input.

Okay, I hope this long rambling essay makes some amount of sense.  Good API design is hard because there are so many things to balance, but it's very helpful to have someone to bounce ideas off of in the process of convergence.  And if it doesn't make any sense, don't be afraid to say so. :)

---
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/issues/379#issuecomment-69984520