HiveBrain v1.2.0
Get Started
← Back to all entries
patternMinor

Understanding the Broyden–Fletcher–Goldfarb–Shanno Algorithm to Select Weights for Neural Nets

Submitted by: @import:stackexchange-cs··
0
Viewed 0 times
understandingtheneuralgoldfarbweightsbroydenalgorithmfletchernetsfor

Problem

I am trying to train and implement a Neural Network. I was reading a few articles, learning about their principles and the math that goes behind them. However, while I was trying to understand the math, specifically the optimization part (for selecting weights), I ran into a problem: I couldn’t understand the Broyden–Fletcher–Goldfarb–Shanno algorithm.
I understand that it (basically) uses newton's method, expect that it finds the zeroes of the derivative of the original function (to find the local minima of the original function).

However I was having trouble following the notation and math that is written on Wikipedia about it:

First of, I do understand that Bk is a Hessian matrix that holds the partial second derivatives and f(x) is the original function that we are finding the local minima of. Moving on to step 1, I do not understand why the search direction needs to be solved. (Is it because we need to figure out whether the zero of the derivative is a local maximum or minimum of f(x) aka perform the first derivative test). In step 2, I do not understand why line search needs to be performed. In step 3, I do not understand what the product of ak and pk means. In step four, (correct me if I am wrong), yk equals the change in y or f(x). Finally, I know step 5 is trying to approximate Bk+1, which is what newton's method does, but I do not understand why its trying to estimate another Hessian matrix, shouldn't it be estimating a point? Or is it estimating a matrix because the algorithm is supposed to work with a multidimensional space, but then how does the partial second derivative matrix lead to finding the local minima of the original function? Also, what does the right side of the equation mean with all the vector/matrix multiplication and division. (I know this is a quasi-netown method but do not really understand it).

I tried to search the internet for my answers, but I couldn't really find anything. I am only in high school and don't have anyone who c

Solution

The purpose of the "search direction" is that we want to move closer to a minimum of $f(x)$. The gradient tells us the direction of steepest descent (as in gradient descent).

The reason we do the line search is to figure out how far to travel in that direction. The gradient tells us the direction; but it doesn't tell us how far to go in that direction.

The product $\alpha_k \mathbf{p}_k$ is an ordinary product of a scalar and a vector; i.e., component-wise product. You probably need to study a bit of linear algebra and multivariate calculus to get familiar with this sort of thing -- that's a vitally important foundation.

Step 5 is the heart of BFGS. A full Newton method would calculate the Hessian (matrix of second partial derivatives) at each point; i.e., it would re-calculate the Hessian in step 5. BFGS doesn't re-calculate it from scratch, as re-calculating it is expensive. Instead, it tries to make an estimate/guess of how the Hessian has changed. That's what step 5 is doing. This estimate is less accurate than simply recomputing the full Hessian matrix, but it's faster.

You're not going to be able to understand what's going on in BFGS from Wikipedia. Wikipedia's description is not very good, and there is no explanation of where the method came from. Look for a better, more detailed explanation that explains where these steps came from, the intuition of what's going on, and how these particular formulas were derived. Wikipedia is a great resource, but it's not a very good source for learning BFGS.

Context

StackExchange Computer Science Q#60736, answer score: 3

Revisions (0)

No revisions yet.