principleMinor

BFGS vs L-BFGS -- how different are they really?

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

Viewed 0 times

reallyarebfgsdifferenthowthey

Problem

I am trying to implement an optimization procedure in Python using BFGS and L-BFGS in Python, and I am getting surprisingly different results in the two cases. L-BFGS converges to the proper minimum super fast, whereas BFGS converges very slowly, and that too to a nonsensical minimum.

QUESTION: From my readings, it seems to me that BFGS and L-BFGS are basically the algorithm (quasi-Newton methods), except that the latter uses less memory, and hence is faster. Is that true? Otherwise, if they are more different, then how so?

Ultimately, I want to figure out if the difference in performance is due to some differences in the actual algorithms, or due to their implementation in the python SciPy modules.

EDIT: I am adding some data to support my claims of divergent behaviour from the two algorithms.

RUNNING THE L-BFGS-B CODE

       * * *

Machine precision = 2.220D-16
N =          147     M =           10
This problem is unconstrained.

At X0         0 variables are exactly at the bounds
At iterate    0    f=  2.56421D+04    |proj g|=  1.19078D+03
At iterate    1    f=  2.12904D+04    |proj g|=  1.04402D+03
At iterate    2    f=  1.49651D+03    |proj g|=  2.13394D+02
At iterate    3    f=  6.08288D+02    |proj g|=  9.85720D+01
At iterate    4    f=  2.91810D+02    |proj g|=  6.23062D+01
...
At iterate  142    f=  3.27609D+00    |proj g|=  8.80170D-04
Time taken for minimisation: 36.3749790192

*** BFGS code ***

At iterate    1,  f= 21249.561722 
At iterate    2,  f= 15710.435098 
At iterate    3,  f= 15443.836262 
At iterate    4,  f= 15386.035398 
At iterate    5,  f= 15311.242917 
At iterate    6,  f= 15211.986938 
At iterate    7,  f= 15022.632266
...
At iterate  524,  f= 67.898495
...
Warning: Desired error not necessarily achieved due to precision loss.
Iterations: 1239
Time taken: 340.728140116

Solution

No, they're not the same. In some sense, L-BFGS is an approximation to BFGS, one which requires a lot less memory. BFGS and L-BFGS are explained in great detail in many standard resources.

Very crudely, you can think of the difference like this. BFGS computes and stores the full Hessian $H$ at each step; this requires $\Theta(n^2)$ space, where $n$ counts the number of variables (dimensions) that you're optimizing over. L-BFGS computes and stores an approximation to the Hessian, chosen so that the approximation can be stored in $\Theta(n)$ space. Effectively, L-BFGS uses the approximation $H \approx M^\top M$ for some $k \times n$ matrix $M$ (I think).

Each step of L-BFGS is an attempt at approximating/guessing what the corresponding step of BFGS would do. However, a single step of L-BFGS takes a lot less space and time than a single step of BFGS. Consequently, you can do many more steps of L-BFGS within a particular time bound than BFGS. Therefore, you might find that L-BFGS converges faster, because it can do so many more iterations within a given amount of time than BFGS can.

I don't know what a nonsensical minimum means, or why BFGS would converge to something worse than L-BFGS if both were allowed to run for an unbounded amount of time.

Context

StackExchange Computer Science Q#70921, answer score: 4

Revisions (0)

No revisions yet.