[mlpack-git] [mlpack] evaluate various changes to L-BFGS optimizer (#370)

Stephen Tu notifications at github.com
Wed Jan 14 18:46:46 EST 2015


It's actually a bit subtle, because this diff tries to fix *two* problems. Let me try to explain.

The optimization proceeds as follows for Lovász theta. The augmented Lagrangian method starts at small values of `sigma` (the penalty parameter). At these small values (say `sigma < 100`), L-BFGS guides the current iterate towards the dreaded saddle point X = 0.
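To make the penalty schedule concrete, here's a minimal, self-contained sketch of an augmented Lagrangian outer loop on a toy problem (minimize x^2 subject to x = 1). The function name, constants, and closed-form inner solve are all illustrative, not mlpack's actual code; the point is just that `sigma` starts small and is grown each round before the inner solve is rerun.

```cpp
#include <cmath>

// Toy augmented Lagrangian sketch (hypothetical, not mlpack's API):
// minimize x^2 subject to x = 1, penalizing the constraint with a
// growing sigma. The inner minimizer of
//   x^2 - lambda*(x - 1) + (sigma/2)*(x - 1)^2
// has the closed form used below, so no inner L-BFGS is needed here.
double AugLagrangianToy(int outerIters)
{
  double lambda = 0.0;  // Lagrange multiplier estimate.
  double sigma = 1.0;   // Penalty parameter; starts small, grows each round.
  double x = 0.0;
  for (int i = 0; i < outerIters; ++i)
  {
    // Inner solve at the current sigma.
    x = (lambda + sigma) / (2.0 + sigma);
    // Standard multiplier update, then grow the penalty.
    lambda -= sigma * (x - 1.0);
    sigma *= 10.0;
  }
  return x;  // Approaches the constrained optimum x = 1.
}
```

At small `sigma` the iterate sits far from the constrained optimum; only as the penalty grows does it get pulled in, which mirrors the saddle-point behavior described above.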

Here's where the current mlpack (pre-merge) implementation differs from the scipy one. In mlpack, since X is approximately 0, grad F is also approximately 0 at the starting point of L-BFGS, *even as we increase sigma*. However, the implementation was written so that if grad F is approximately zero, it won't run even a single iteration of L-BFGS; it just exits immediately. Hence, we now check `itNum > 0` before allowing termination. Why? Because I found (empirically) that after stepping one iteration with an increased `sigma` value, the optimizer was able to escape this saddle point.
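The guard described above can be sketched as follows; `ShouldTerminate`, its parameters, and the exact test are illustrative stand-ins, not mlpack's actual signatures:

```cpp
#include <cstddef>

// Sketch of the termination change described above (hypothetical names):
// even if the gradient norm is tiny at the starting point (as happens
// near the X = 0 saddle), require at least one completed L-BFGS
// iteration before declaring convergence.
bool ShouldTerminate(double gradNorm, double minGradientNorm, size_t itNum)
{
  // The gradient-norm test may only stop us after iteration 0 has run.
  return (itNum > 0) && (gradNorm < minGradientNorm);
}
```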

Now, I also added a relative function value termination criterion to address your other concern: that the optimization was indeed slow. It turns out that as `sigma` gets larger and larger, it becomes harder for the gradient norm to get really small (since really large penalties bring increased numerical instability). As you rightfully observed, the function value itself doesn't change that much. Hence, this criterion lets us short-circuit the issue: the termination test is an OR, not an AND, so only *one* of the conditions needs to hold.
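As a sketch of what such a relative function value test looks like, following the formula given in the scipy `fmin_l_bfgs_b` documentation (the function name here is hypothetical, not mlpack's actual code):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Relative function value stopping test, scipy/L-BFGS-B style:
// iteration stops when
//   (f_old - f_new) / max(|f_old|, |f_new|, 1) <= factr * eps,
// where eps is machine precision. The name is illustrative.
bool RelativeValueConverged(double fOld, double fNew, double factr)
{
  const double eps = std::numeric_limits<double>::epsilon();
  const double denom = std::max({ std::fabs(fOld), std::fabs(fNew), 1.0 });
  return (fOld - fNew) / denom <= factr * eps;
}
```

With the scipy default `factr = 1e7`, the effective tolerance is about 1e7 * 2.2e-16 ≈ 2e-9 of relative decrease per iteration.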

`factr` is terminology borrowed from scipy (http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.fmin_l_bfgs_b.html), which in turn borrowed it from the Fortran L-BFGS-B code.

To address your observation that LRSDP does not converge generically and needs a lot of hand-holding: YES! Even in the original paper, the experimental section is full of text explaining the hacks the authors had to pull with initialization points and penalty schedules to get the SDPs to converge. In fact, if you email Sam Burer asking about these heuristics, he will tell you he can't quite piece them together himself unless he spends quite a bit of time looking at his code.

I think this is just an unfortunate property of the algorithm, and we don't know how to make it generically robust (if we did, that would be paper-worthy in itself). Hence, I think the best we can do is offer a few choices of solvers (which is why I'm slowly working on an interior point solver, which will help for small/medium-sized problem instances).

Another thing we might want to think about: LRSDP doesn't provide any dual certificates, so we can't say anything about optimality. One cool thing we could do towards this, I think, is run a dual ascent solver in parallel, so that at least we could certify that the duality gap is below a specific tolerance. Then I could grid-search LRSDP over different starting points and penalty parameters until the duality gap is small enough.
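The certification check itself would be trivial once a dual value is available; a hypothetical sketch (all names are illustrative, and it assumes the dual ascent solver returns a value for a dual-feasible point):

```cpp
#include <cmath>

// Hypothetical sketch of the certification idea above: given a primal
// objective value from LRSDP and a dual objective value from a dual
// ascent solver run alongside it, the duality gap bounds how far the
// primal value can be from the true optimum.
bool CertifiedOptimal(double primalValue, double dualValue, double tol)
{
  // For a feasible primal/dual pair of a minimization SDP,
  // primalValue >= dualValue, and the gap bounds the suboptimality.
  return std::fabs(primalValue - dualValue) <= tol;
}
```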


---
Reply to this email directly or view it on GitHub:
https://github.com/mlpack/mlpack/issues/370#issuecomment-70015812

