MAP estimation (for stationary iid gaussian environment)

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

machine-learning mathematical-foundations cs stackoverflow

Problem

This is my first post. I have been self-studying Haykin's Neural Networks and Learning Machines, and I've been stuck on a statement on page 75. I'm not sure whether it is a typo or whether I'm doing something wrong.

The context is finding a maximum a posteriori (MAP) estimate for a parameter vector, assuming a stationary, iid, Gaussian environment.

Given:
$d_i$ is a scalar, $i$ indexes time from 1 to $N$, and $\mathbf{w}, \mathbf{x}_i$ are $M$-dimensional vectors.

$\hat{w}_{MAP} = \arg\max_{\mathbf{w}}\left[-\frac{1}{2}\sum_{i=1}^{N}(d_i - \mathbf{w}^T\mathbf{x}_i)^2 - \frac{\lambda}{2}\|\mathbf{w}\|^2\right]$

we should get the same result if instead we find the minimum of the quadratic function

$E(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{N}(d_i - \mathbf{w}^T\mathbf{x}_i)^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2$

Haykin says you should arrive at the result (by differentiation wrt $\mathbf{w}$ and setting the result to 0)

$\hat{w}_{MAP} = (R_{xx} + \lambda I)^{-1} r_{dx}$

where $R_{xx} = -\sum_{i=1}^N \sum_{j=1}^N \mathbf{x}_i \mathbf{x}_j^T$ is the $M \times M$ correlation matrix of $\mathbf{x}$ and $r_{dx} = -\sum_{j=1}^N \mathbf{x}_i d_i$

My question is: why isn't $R_{xx} = -\sum_{i=1}^N \mathbf{x}_i \mathbf{x}_i^T$?
Haykin spends about a paragraph explaining that the correlation matrix is time-averaged, and I think this is an important point.
Thanks for any help!

Solution

$\newcommand{\bd}{\mathbf{d}}\newcommand{\bw}{\mathbf{w}}\newcommand{\bX}{\mathbf{X}}$Your observation is correct: the expressions quoted from the book appear to contain several typos. To see why, let's rewrite the quadratic objective from the question in matrix/vector notation, which is easier to differentiate: $$\text{NLL}(\bw) = \frac{1}{2}\|\bd - \bX\bw\|^2_2 + \frac{\lambda}{2}\|\bw\|^2_2,$$

where $\bX$ is the $N \times M$ matrix of observations whose $i$-th row is $\mathbf{x}_i^t$, $\bd$ is the vector of labels $d_i$, and $\bw$ is the parameter vector of interest. Taking the gradient with respect to $\bw$ yields
$$\frac{\partial \text{NLL}}{\partial \bw} = -\bX^t(\bd - \bX\bw) + \lambda\bw.$$

Setting this equal to 0 and solving gives,
$$ \begin{align*}\lambda \bw &= \bX^t(\bd - \bX\bw) \\
&= \bX^t\bd - \bX^t\bX\bw \iff\\
\lambda\bw + \bX^t\bX\bw &= \bX^t\bd \iff \\
(\lambda\mathbf{I} + \bX^t\bX)\bw &= \bX^t\bd \iff \\
\hat{\bw}_{\text{MAP}} &= (\lambda\mathbf{I} + \bX^t\bX)^{-1}(\bX^t\bd).
\end{align*}$$
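
As a numerical sanity check on the closed form above, here is a minimal NumPy sketch. The data are synthetic, and the function names `map_estimate` and `gradient` are illustrative choices, not anything from Haykin's text. It solves the linear system $(\lambda\mathbf{I} + \bX^t\bX)\bw = \bX^t\bd$ and confirms that the gradient vanishes at the solution.

```python
import numpy as np

def map_estimate(X, d, lam):
    """Closed-form solution of (lam*I + X^T X) w = X^T d."""
    M = X.shape[1]
    # Solve the linear system instead of forming an explicit inverse.
    return np.linalg.solve(lam * np.eye(M) + X.T @ X, X.T @ d)

def gradient(X, d, w, lam):
    """Gradient of 0.5*||d - Xw||^2 + 0.5*lam*||w||^2 with respect to w."""
    return -X.T @ (d - X @ w) + lam * w

# Synthetic data, purely for illustration (assumed sizes and lambda).
rng = np.random.default_rng(0)
N, M, lam = 50, 3, 0.1
X = rng.normal(size=(N, M))   # row i holds x_i^T
d = rng.normal(size=N)        # scalar targets d_i

w_hat = map_estimate(X, d, lam)
grad_norm = np.linalg.norm(gradient(X, d, w_hat, lam))
print("gradient norm at w_hat:", grad_norm)  # should be close to zero
```

Using `np.linalg.solve` rather than an explicit inverse is a common numerical choice; for $\lambda > 0$ the matrix $\lambda\mathbf{I} + \bX^t\bX$ is positive definite, so the system always has a unique solution.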

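Notice that $$\bX^t\bX = \sum_i \mathbf{x}_i \mathbf{x}_i^t = R_{xx}$$ and $$\bX^t\bd = \sum_i \mathbf{x}_i d_i = r_{dx}.$$ Both are single sums over $i$ with no leading minus sign. A double sum over $i$ and $j$ would instead give $\left(\sum_i \mathbf{x}_i\right)\left(\sum_j \mathbf{x}_j\right)^t$, a rank-one matrix, which is not what the derivation produces.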
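
The two identities above can also be checked numerically. The sketch below uses synthetic data and illustrative variable names (assumptions, not from the book). It compares the single sums with $\bX^t\bX$ and $\bX^t\bd$, and also evaluates the double sum over $i$ and $j$ from the question to show that it collapses to an outer product of the summed inputs.

```python
import numpy as np

# Synthetic data, purely for illustration.
rng = np.random.default_rng(1)
N, M = 20, 4
X = rng.normal(size=(N, M))   # row i holds x_i^T
d = rng.normal(size=N)

# Single sums, as in the derivation.
R_single = sum(np.outer(X[i], X[i]) for i in range(N))
r_single = sum(X[i] * d[i] for i in range(N))

# Double sum over i and j, as quoted in the question (sign omitted).
R_double = sum(np.outer(X[i], X[j]) for i in range(N) for j in range(N))

single_matches = np.allclose(R_single, X.T @ X)
r_matches = np.allclose(r_single, X.T @ d)

# The double sum factors into (sum_i x_i)(sum_j x_j)^T, a rank-one matrix.
s = X.sum(axis=0)
double_is_outer = np.allclose(R_double, np.outer(s, s))
double_rank = np.linalg.matrix_rank(R_double, tol=1e-8)

print(single_matches, r_matches, double_is_outer, double_rank)
```

Because the double sum has rank one, it cannot serve as a correlation matrix for $M > 1$, which supports reading it as a typo.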

Context

StackExchange Computer Science Q#41153, answer score: 2
