principleMinor
BFGS vs L-BFGS -- how different are they really?
Viewed 0 times
reallyarebfgsdifferenthowthey
Problem
I am trying to implement an optimization procedure in Python using BFGS and L-BFGS in Python, and I am getting surprisingly different results in the two cases. L-BFGS converges to the proper minimum super fast, whereas BFGS converges very slowly, and that too to a nonsensical minimum.
QUESTION: From my readings, it seems to me that BFGS and L-BFGS are basically the algorithm (quasi-Newton methods), except that the latter uses less memory, and hence is faster. Is that true? Otherwise, if they are more different, then how so?
Ultimately, I want to figure out if the difference in performance is due to some differences in the actual algorithms, or due to their implementation in the python SciPy modules.
EDIT: I am adding some data to support my claims of divergent behaviour from the two algorithms.
QUESTION: From my readings, it seems to me that BFGS and L-BFGS are basically the algorithm (quasi-Newton methods), except that the latter uses less memory, and hence is faster. Is that true? Otherwise, if they are more different, then how so?
Ultimately, I want to figure out if the difference in performance is due to some differences in the actual algorithms, or due to their implementation in the python SciPy modules.
EDIT: I am adding some data to support my claims of divergent behaviour from the two algorithms.
RUNNING THE L-BFGS-B CODE
* * *
Machine precision = 2.220D-16
N = 147 M = 10
This problem is unconstrained.
At X0 0 variables are exactly at the bounds
At iterate 0 f= 2.56421D+04 |proj g|= 1.19078D+03
At iterate 1 f= 2.12904D+04 |proj g|= 1.04402D+03
At iterate 2 f= 1.49651D+03 |proj g|= 2.13394D+02
At iterate 3 f= 6.08288D+02 |proj g|= 9.85720D+01
At iterate 4 f= 2.91810D+02 |proj g|= 6.23062D+01
...
At iterate 142 f= 3.27609D+00 |proj g|= 8.80170D-04
Time taken for minimisation: 36.3749790192
*** BFGS code ***
At iterate 1, f= 21249.561722
At iterate 2, f= 15710.435098
At iterate 3, f= 15443.836262
At iterate 4, f= 15386.035398
At iterate 5, f= 15311.242917
At iterate 6, f= 15211.986938
At iterate 7, f= 15022.632266
...
At iterate 524, f= 67.898495
...
Warning: Desired error not necessarily achieved due to precision loss.
Iterations: 1239
Time taken: 340.728140116Solution
No, they're not the same. In some sense, L-BFGS is an approximation to BFGS, one which requires a lot less memory. BFGS and L-BFGS are explained in great detail in many standard resources.
Very crudely, you can think of the difference like this. BFGS computes and stores the full Hessian $H$ at each step; this requires $\Theta(n^2)$ space, where $n$ counts the number of variables (dimensions) that you're optimizing over. L-BFGS computes and stores an approximation to the Hessian, chosen so that the approximation can be stored in $\Theta(n)$ space. Effectively, L-BFGS uses the approximation $H \approx M^\top M$ for some $k \times n$ matrix $M$ (I think).
Each step of L-BFGS is an attempt at approximating/guessing what the corresponding step of BFGS would do. However, a single step of L-BFGS takes a lot less space and time than a single step of BFGS. Consequently, you can do many more steps of L-BFGS within a particular time bound than BFGS. Therefore, you might find that L-BFGS converges faster, because it can do so many more iterations within a given amount of time than BFGS can.
I don't know what a nonsensical minimum means, or why BFGS would converge to something worse than L-BFGS if both were allowed to run for an unbounded amount of time.
Very crudely, you can think of the difference like this. BFGS computes and stores the full Hessian $H$ at each step; this requires $\Theta(n^2)$ space, where $n$ counts the number of variables (dimensions) that you're optimizing over. L-BFGS computes and stores an approximation to the Hessian, chosen so that the approximation can be stored in $\Theta(n)$ space. Effectively, L-BFGS uses the approximation $H \approx M^\top M$ for some $k \times n$ matrix $M$ (I think).
Each step of L-BFGS is an attempt at approximating/guessing what the corresponding step of BFGS would do. However, a single step of L-BFGS takes a lot less space and time than a single step of BFGS. Consequently, you can do many more steps of L-BFGS within a particular time bound than BFGS. Therefore, you might find that L-BFGS converges faster, because it can do so many more iterations within a given amount of time than BFGS can.
I don't know what a nonsensical minimum means, or why BFGS would converge to something worse than L-BFGS if both were allowed to run for an unbounded amount of time.
Context
StackExchange Computer Science Q#70921, answer score: 4
Revisions (0)
No revisions yet.