
# Least mean squares filter

Least mean squares (LMS) algorithms are used in adaptive filters to find the filter coefficients that produce the least mean square of the error signal (the difference between the desired and the actual signal). LMS is a stochastic gradient descent method, in that the filter is adapted based only on the error at the current time. It was invented in 1960 by Stanford University professor Bernard Widrow and his first Ph.D. student, Ted Hoff.

## Problem Formulation

Most linear adaptive filtering problems can be formulated using the block diagram above. That is, an unknown system $\mathbf{h}(n)$ is to be identified, and the adaptive filter attempts to adapt the filter $\hat{\mathbf{h}}(n)$ to make it as close as possible to $\mathbf{h}(n)$, while using only the observable signals $x(n)$, $d(n)$ and $e(n)$; the signals $y(n)$, $v(n)$ and $\mathbf{h}(n)$ are not directly observable. Its solution is closely related to the Wiener filter.
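As a concrete illustration of these signals, one can simulate a system-identification setup in which only $x(n)$ and $d(n)$ would be handed to the adaptive filter; the tap count, noise level, and seed below are arbitrary choices for this sketch, not part of the formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

p = 4                                        # length of the unknown system (illustrative)
h = rng.standard_normal(p)                   # unknown impulse response h (not observable)
n_samples = 5000

x = rng.standard_normal(n_samples)           # observable input x(n)
y = np.convolve(x, h)[:n_samples]            # system output y(n) (not observable)
v = 0.01 * rng.standard_normal(n_samples)    # interference v(n) (not observable)
d = y + v                                    # observable desired signal d(n) = y(n) + v(n)
```

An adaptive filter would then form $e(n) = d(n) - \hat{y}(n)$ from the observable signals alone.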

## Idea

The idea behind LMS filters is to use the method of steepest descent to find a filter coefficient vector $\hat{\mathbf{h}}(n)$ which minimizes a cost function. We start the discussion by defining the cost function as

$C(n) = E\left\{ |e(n)|^2 \right\} = E\left\{ e(n) \, e^{*}(n) \right\}$

where $e(n)$ is defined in the block diagram section of the general adaptive filter and $E\{\cdot\}$ denotes the expected value. Applying the steepest descent method means taking the partial derivatives with respect to the individual entries of the filter coefficient vector:

$\nabla C(n) = \nabla E\left\{ e(n) \, e^{*}(n) \right\} = 2 E\left\{ \nabla e(n) \, e^{*}(n) \right\}$

where $\nabla$ is the gradient operator. With $\mathbf{x}(n) = \left[x(n), x(n-1), \dots, x(n-p+1)\right]^T$ and $\nabla e(n) = -\mathbf{x}(n)$ it follows that

$\nabla C(n) = -2 E\left\{ \mathbf{x}(n) \, e^{*}(n) \right\}$

Now, $\nabla C(n)$ is a vector which points towards the steepest ascent of the cost function. To find the minimum of the cost function, we need to take a step in the opposite direction of $\nabla C(n)$. Expressed in mathematical terms:

$\hat{\mathbf{h}}(n+1) = \hat{\mathbf{h}}(n) - \frac{\mu}{2} \nabla C(n) = \hat{\mathbf{h}}(n) + \mu \, E\left\{ \mathbf{x}(n) \, e^{*}(n) \right\}$

where $\frac{\mu}{2}$ is the step size. That means we have found a sequential update algorithm which minimizes the cost function. Unfortunately, this algorithm is not realizable as long as $E\left\{ \mathbf{x}(n) \, e^{*}(n) \right\}$ is unknown.

## Simplifications

For most systems the expectation $E\left\{ \mathbf{x}(n) \, e^{*}(n) \right\}$ must be approximated. This can be done with the following unbiased estimator:

$\hat{E}\left\{ \mathbf{x}(n) \, e^{*}(n) \right\} = \frac{1}{N} \sum_{i=0}^{N-1} \mathbf{x}(n-i) \, e^{*}(n-i)$

where $N$ indicates the number of samples used for that estimate. The simplest case is $N = 1$:

$\hat{E}\left\{ \mathbf{x}(n) \, e^{*}(n) \right\} = \mathbf{x}(n) \, e^{*}(n)$

For that simple case the update algorithm follows as

$\hat{\mathbf{h}}(n+1) = \hat{\mathbf{h}}(n) + \mu \, \mathbf{x}(n) \, e^{*}(n)$

Indeed, this constitutes the update algorithm for the LMS filter.

## LMS algorithm summary

The LMS algorithm for a $p$th-order filter can be summarized as

Parameters: $p =$ filter order, $\mu =$ step size

Initialisation: $\hat{\mathbf{h}}(0) = \mathbf{0}$

Computation: for $n = 0, 1, 2, \dots$

$\mathbf{x}(n) = \left[x(n), x(n-1), \dots, x(n-p+1)\right]^T$

$e(n) = d(n) - \hat{\mathbf{h}}^{H}(n) \, \mathbf{x}(n)$

$\hat{\mathbf{h}}(n+1) = \hat{\mathbf{h}}(n) + \mu \, e^{*}(n) \, \mathbf{x}(n)$

where $\hat{\mathbf{h}}^{H}(n)$ denotes the Hermitian transpose of $\hat{\mathbf{h}}(n)$.
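For real-valued signals (where the Hermitian transpose reduces to an ordinary transpose), this summary can be transcribed directly; the test system, step size, and noise level below are illustrative assumptions, not part of the algorithm:

```python
import numpy as np

def lms(x, d, p, mu):
    """LMS filter for real signals: h(n+1) = h(n) + mu * e(n) * x(n)."""
    h_hat = np.zeros(p)                      # initialisation: h(0) = 0
    e = np.zeros(len(x))
    for n in range(p - 1, len(x)):
        x_n = x[n - p + 1:n + 1][::-1]       # x(n) = [x(n), x(n-1), ..., x(n-p+1)]^T
        e[n] = d[n] - h_hat @ x_n            # error against the desired signal
        h_hat = h_hat + mu * e[n] * x_n      # stochastic gradient step
    return h_hat, e

# Identify a short FIR system from noisy observations (hypothetical example).
rng = np.random.default_rng(1)
h_true = np.array([0.5, -0.3, 0.2])
x = rng.standard_normal(20000)
d = np.convolve(x, h_true)[:len(x)] + 0.001 * rng.standard_normal(len(x))
h_hat, e = lms(x, d, p=3, mu=0.01)           # h_hat approaches h_true
```

With a sufficiently small step size the estimated coefficients settle near the true impulse response, up to a residual set by the noise and the misadjustment.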

## Normalised least mean squares filter (NLMS)

The main drawback of the "pure" LMS algorithm is that it is sensitive to the scaling of its input $x(n)$. This makes it very hard (if not impossible) to choose a learning rate $\mu$ that guarantees stability of the algorithm. The normalised least mean squares (NLMS) filter is a variant of the LMS algorithm that solves this problem by normalising the update with the power of the input. The NLMS algorithm can be summarised as:

Parameters: $p =$ filter order, $\mu =$ step size

Initialisation: $\hat{\mathbf{h}}(0) = \mathbf{0}$

Computation: for $n = 0, 1, 2, \dots$

$\mathbf{x}(n) = \left[x(n), x(n-1), \dots, x(n-p+1)\right]^T$

$e(n) = d(n) - \hat{\mathbf{h}}^{H}(n) \, \mathbf{x}(n)$

$\hat{\mathbf{h}}(n+1) = \hat{\mathbf{h}}(n) + \frac{\mu \, e^{*}(n) \, \mathbf{x}(n)}{\mathbf{x}^{H}(n) \, \mathbf{x}(n)}$
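A sketch of the same loop with the normalised update, again for real-valued signals; the small regulariser `eps` is a common practical addition (an assumption here, not part of the summary above) that avoids division by zero for an all-zero input window:

```python
import numpy as np

def nlms(x, d, p, mu, eps=1e-8):
    """NLMS filter: h(n+1) = h(n) + mu * e(n) * x(n) / (x(n)^T x(n) + eps)."""
    h_hat = np.zeros(p)                      # initialisation: h(0) = 0
    e = np.zeros(len(x))
    for n in range(p - 1, len(x)):
        x_n = x[n - p + 1:n + 1][::-1]       # x(n) = [x(n), x(n-1), ..., x(n-p+1)]^T
        e[n] = d[n] - h_hat @ x_n
        h_hat = h_hat + mu * e[n] * x_n / (x_n @ x_n + eps)
    return h_hat, e

# The same step size works even for a strongly scaled input (hypothetical system).
rng = np.random.default_rng(2)
h_true = np.array([0.5, -0.3, 0.2])
x = 100.0 * rng.standard_normal(20000)       # input scaled by 100
d = np.convolve(x, h_true)[:len(x)]          # no interference in this run
h_hat, _ = nlms(x, d, p=3, mu=0.5)
```

Because the step is divided by the instantaneous input power, rescaling $x(n)$ leaves the convergence behaviour essentially unchanged, which is exactly the sensitivity that plain LMS suffers from.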

### Optimal learning rate

It can be shown that if there is no interference ($v(n) = 0$), then the optimal learning rate for the NLMS algorithm is

$\mu_{opt} = 1$

and is independent of the input $x(n)$ and of the real (unknown) impulse response $\mathbf{h}(n)$. In the general case with interference ($v(n) \ne 0$), the optimal learning rate is

$\mu_{opt} = \frac{E\left[\left|y(n) - \hat{y}(n)\right|^2\right]}{E\left[|e(n)|^2\right]}$

The results above assume that the signals $v(n)$ and $x(n)$ are uncorrelated with each other, which is generally the case in practice.
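The interference-free claim can be checked numerically; the following sketch (filter length, signal length, and seed are arbitrary choices) compares the misalignment of NLMS at $\mu = 1$ against a smaller step size when $v(n) = 0$:

```python
import numpy as np

rng = np.random.default_rng(3)
p = 8
h_true = rng.standard_normal(p)              # unknown system
x = rng.standard_normal(2000)
d = np.convolve(x, h_true)[:len(x)]          # no interference: v(n) = 0

def misalignment_history(mu):
    """Run NLMS and record the misalignment ||h - h_hat||^2 after each step."""
    h_hat = np.zeros(p)
    hist = []
    for n in range(p - 1, len(x)):
        x_n = x[n - p + 1:n + 1][::-1]
        e = d[n] - h_hat @ x_n
        h_hat = h_hat + mu * e * x_n / (x_n @ x_n)
        hist.append(np.sum((h_true - h_hat) ** 2))
    return np.array(hist)

mis_opt = misalignment_history(1.0)          # the optimal rate when v(n) = 0
mis_slow = misalignment_history(0.1)
# mu = 1 drives the misalignment down far faster than the smaller step.
```

In a run of this kind the $\mu = 1$ curve decays to numerical noise within a few hundred samples, while the small-step curve is still far from converged at the same point.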

### Proof

Let the filter misalignment be defined as $\Lambda(n) = \left\| \mathbf{h}(n) - \hat{\mathbf{h}}(n) \right\|^2$. We can derive the expected misalignment for the next sample as:

$E\left[\Lambda(n+1)\right] = E\left[\left\| \hat{\mathbf{h}}(n) + \frac{\mu \, e^{*}(n) \, \mathbf{x}(n)}{\mathbf{x}^{H}(n) \, \mathbf{x}(n)} - \mathbf{h}(n) \right\|^2\right]$
$E\left[\Lambda(n+1)\right] = E\left[\left\| \hat{\mathbf{h}}(n) + \frac{\mu \left( v^{*}(n) + y^{*}(n) - \hat{y}^{*}(n) \right) \mathbf{x}(n)}{\mathbf{x}^{H}(n) \, \mathbf{x}(n)} - \mathbf{h}(n) \right\|^2\right]$

Let $\boldsymbol{\delta}(n) = \hat{\mathbf{h}}(n) - \mathbf{h}(n)$ and $r(n) = \hat{y}(n) - y(n)$, so that $e(n) = v(n) - r(n)$. Then:

$E\left[\Lambda(n+1)\right] = E\left[\left\| \boldsymbol{\delta}(n) + \frac{\mu \left( v(n) - r(n) \right)^{*} \mathbf{x}(n)}{\mathbf{x}^{H}(n) \, \mathbf{x}(n)} \right\|^2\right]$
$E\left[\Lambda(n+1)\right] = E\left[\left( \boldsymbol{\delta}(n) + \frac{\mu \left( v(n) - r(n) \right)^{*} \mathbf{x}(n)}{\mathbf{x}^{H}(n) \, \mathbf{x}(n)} \right)^{H} \left( \boldsymbol{\delta}(n) + \frac{\mu \left( v(n) - r(n) \right)^{*} \mathbf{x}(n)}{\mathbf{x}^{H}(n) \, \mathbf{x}(n)} \right)\right]$

Assuming that the interference $v(n)$ is independent of $\mathbf{x}(n)$ and of the misalignment, the cross terms contribute only through $r(n)$ and we have:

$E\left[\Lambda(n+1)\right] = \Lambda(n) + E\left[\left( \frac{\mu \left( v(n) - r(n) \right)^{*} \mathbf{x}(n)}{\mathbf{x}^{H}(n) \, \mathbf{x}(n)} \right)^{H} \left( \frac{\mu \left( v(n) - r(n) \right)^{*} \mathbf{x}(n)}{\mathbf{x}^{H}(n) \, \mathbf{x}(n)} \right)\right] - 2 \, E\left[\frac{\mu \, |r(n)|^2}{\mathbf{x}^{H}(n) \, \mathbf{x}(n)}\right]$
$E\left[\Lambda(n+1)\right] = \Lambda(n) + \frac{\mu^2 \, E\left[|e(n)|^2\right]}{\mathbf{x}^{H}(n) \, \mathbf{x}(n)} - \frac{2 \mu \, E\left[|r(n)|^2\right]}{\mathbf{x}^{H}(n) \, \mathbf{x}(n)}$

The optimal learning rate is found by setting $\frac{d E\left[\Lambda(n+1)\right]}{d\mu} = 0$, which leads to:

$2 \mu \, E\left[|e(n)|^2\right] - 2 \, E\left[|r(n)|^2\right] = 0$
$\mu_{opt} = \frac{E\left[|r(n)|^2\right]}{E\left[|e(n)|^2\right]}$
