<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Nlms on Corebaseit — POS · EMV · Payments · AI</title><link>https://corebaseit.com/tags/nlms/</link><description>Recent content in Nlms on Corebaseit — POS · EMV · Payments · AI</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><managingEditor>contact@corebaseit.com (Vincent Bevia)</managingEditor><webMaster>contact@corebaseit.com (Vincent Bevia)</webMaster><lastBuildDate>Wed, 29 Apr 2026 10:00:00 +0200</lastBuildDate><atom:link href="https://corebaseit.com/tags/nlms/index.xml" rel="self" type="application/rss+xml"/><item><title>Adaptive Filters and Stochastic Gradient Descent: One Update Rule, Two Vocabularies</title><link>https://corebaseit.com/corebaseit_posts/lms_new_for_corebaseit/</link><pubDate>Wed, 29 Apr 2026 10:00:00 +0200</pubDate><author>contact@corebaseit.com (Vincent Bevia)</author><guid>https://corebaseit.com/corebaseit_posts/lms_new_for_corebaseit/</guid><description>&lt;p>Modern AI is often framed as a clean break from classical engineering. For anyone who has worked in adaptive signal processing, that framing is misleading. The mathematical spine of stochastic gradient descent (SGD) is the same spine that has driven adaptive filters in telecommunications since 1960, and the engineering trade-offs that made LMS-based equalizers and echo cancellers reliable in production map almost directly onto the ones that govern training a deep network today.&lt;/p>
&lt;p>This post lays out that mapping carefully. The intent is not poetry. It is to give engineers who came up through DSP a precise translation table into modern ML, and engineers who came up through ML a sense of what fifty years of adaptive filtering literature already settled.&lt;/p>
&lt;h2 id="a-continuous-lineage-not-a-new-beginning">A Continuous Lineage, Not a New Beginning
&lt;/h2>&lt;p>The historical sequence is short and well-documented. Widrow and Hoff formalized the Least Mean Squares (LMS) algorithm in 1960 for the Adaline (Adaptive Linear Neuron). Rumelhart, Hinton, and Williams scaled the same family of error-driven update rules to multi-layer networks with backpropagation in 1986. The vocabulary moved from &lt;em>adaptive filtering&lt;/em> to &lt;em>deep learning&lt;/em>, but the underlying move — iterative parameter adjustment proportional to error — was continuous across both worlds.&lt;/p>
&lt;p>The claim of this post is narrower than &amp;ldquo;AI came from DSP&amp;rdquo; and more useful: &lt;strong>for a linear model trained on mean squared error, the LMS update is exactly the SGD update&lt;/strong>. Multi-layer training adds the chain rule on top of that local move; it does not replace it.&lt;/p>
&lt;h2 id="lms-as-stochastic-gradient-descent">LMS as Stochastic Gradient Descent
&lt;/h2>&lt;p>The LMS update for a linear combiner (FIR filter or single Adaline) is:&lt;/p>
$$
\mathbf{w}(n+1) = \mathbf{w}(n) + \mu \, e(n) \, \mathbf{x}(n)
$$&lt;p>with \(\mathbf{w}(n)\) the weight vector at step \(n\), \(\mathbf{x}(n)\) the input vector (tap-delay line in DSP, feature vector in ML), \(e(n) = d(n) - y(n)\) the error between the desired response \(d(n)\) and the output \(y(n) = \mathbf{w}^\top(n)\,\mathbf{x}(n)\), and \(\mu\) the step size.&lt;/p>
&lt;p style="text-align: center;">
&lt;img src="https://corebaseit.com/diagrams/LMS_SGD_structural_equivalence_diagram.png" alt="LMS = SGD structural equivalence diagram" style="max-width: 900px; width: 100%;" />
&lt;/p>
&lt;p>The instantaneous squared error is \(\xi(n) = \tfrac{1}{2}e^2(n)\), and its gradient with respect to \(\mathbf{w}\) is \(\nabla_{\mathbf{w}}\,\xi(n) = -e(n)\,\mathbf{x}(n)\). Substituting that gradient into the standard SGD form, \(\mathbf{w}(n+1) = \mathbf{w}(n) - \mu\,\nabla_{\mathbf{w}}\,\xi(n)\), recovers the LMS update term for term.&lt;/p>
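&lt;p>A minimal NumPy sketch makes the identity mechanical. The values are arbitrary; only the algebra matters.&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4)      # weight vector w(n)
x = rng.normal(size=4)      # input vector x(n): tap-delay line or feature vector
d = 1.7                     # desired response d(n), illustrative value
mu = 0.05                   # step size / learning rate

e = d - w @ x               # error e(n) = d(n) - y(n)

w_lms = w + mu * e * x      # LMS update, exactly as written above

grad = -e * x               # gradient of xi(n) = 0.5 * e(n)**2 w.r.t. w
w_sgd = w - mu * grad       # standard SGD step on the instantaneous loss

assert np.allclose(w_lms, w_sgd)   # identical, term for term
&lt;/code>&lt;/pre>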
&lt;p>The &amp;ldquo;stochastic&amp;rdquo; qualifier deserves the same scrutiny. Classical Wiener filtering minimizes the &lt;em>expected&lt;/em> squared error \(E[e^2(n)]\), which presupposes knowledge of the input statistics. LMS sidesteps that requirement by using the &lt;em>instantaneous&lt;/em> squared error \(e^2(n)\) as a high-variance, zero-cost estimator of the expectation. SGD uses the same trick at higher dimensionality: the per-batch loss is a noisy estimate of the population loss, and the noise is doing real work in shaping the optimization trajectory.&lt;/p>
&lt;p>The mapping between the two vocabularies for a linear MSE layer is one-to-one:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Symbol&lt;/th>
&lt;th>Adaptive filtering (DSP)&lt;/th>
&lt;th>Deep learning (ML)&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>\(\mathbf{w}\)&lt;/td>
&lt;td>Filter coefficients / weight vector&lt;/td>
&lt;td>Weight vector / parameters&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>\(\mathbf{x}\)&lt;/td>
&lt;td>Input vector / tap-delay line&lt;/td>
&lt;td>Feature vector / activations&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>\(e(n)\)&lt;/td>
&lt;td>Desired response \(d(n)\) minus output&lt;/td>
&lt;td>Target label minus prediction&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>\(\mu\)&lt;/td>
&lt;td>Step size&lt;/td>
&lt;td>Learning rate \(\eta\)&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>\(\tfrac{1}{2}e^2\)&lt;/td>
&lt;td>Instantaneous squared error&lt;/td>
&lt;td>MSE loss on a single example&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>So if you have ever shipped an LMS equalizer or echo canceller, you have implemented the local update rule that underlies a large fraction of modern machine learning: small steps proportional to error times input. Notation in Haykin&amp;rsquo;s &lt;em>Adaptive Filter Theory&lt;/em> differs from the PyTorch docs. The mathematics does not.&lt;/p>
&lt;p>Multi-layer networks add the chain rule (backpropagation) so error can be attributed back through nonlinear layers. The local update at any linear layer trained on MSE is still the same structural move: adjust weights in proportion to error and activations. Momentum, Adam, and adaptive learning rates are engineering on top of that spine, not departures from it.&lt;/p>
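&lt;p>The same check can be run through a framework&amp;rsquo;s autograd. Here is a short PyTorch sketch: one SGD step on a single-example MSE loss, compared against the hand-written LMS update (sizes and values are arbitrary):&lt;/p>
&lt;pre>&lt;code class="language-python">import torch

torch.manual_seed(0)
x = torch.randn(4)                   # feature vector / tap-delay line
d = torch.tensor(1.7)                # target / desired response
w = torch.randn(4, requires_grad=True)
mu = 0.05

y = w @ x
loss = 0.5 * (d - y) ** 2            # instantaneous squared error xi(n)
loss.backward()                      # autograd produces grad = -e * x

e = (d - y).item()
with torch.no_grad():
    w_sgd = w - mu * w.grad          # what the framework's SGD step does
    w_lms = w + mu * e * x           # the textbook LMS update
assert torch.allclose(w_sgd, w_lms)
&lt;/code>&lt;/pre>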
&lt;h2 id="step-size-learning-rate-and-the-geometry-of-the-loss-surface">Step Size, Learning Rate, and the Geometry of the Loss Surface
&lt;/h2>&lt;p>In telecommunications, the step size \(\mu\) governs the classic compromise between convergence speed and steady-state misadjustment. Too large and the filter can diverge or oscillate. Too small and the filter cannot track a fast-fading channel or a moving echo path. Adaptive filtering textbooks devote whole chapters to stability bounds on \(\mu\), usually expressed in terms of input power and filter length, and to variants designed to fix the worst-case behavior.&lt;/p>
&lt;p style="text-align: center;">
&lt;img src="https://corebaseit.com/diagrams/Step_size_learning_rate_trade-off_diagram.png" alt="Step size / learning rate trade-off diagram" style="max-width: 900px; width: 100%;" />
&lt;/p>
&lt;p>There is a more specific geometric point worth surfacing, because it carries directly into deep learning. The allowable bounds on \(\mu\) for LMS depend on the eigenvalue spread of the input autocorrelation matrix \(\mathbf{R} = E[\mathbf{x}\mathbf{x}^\top]\). When \(\mathbf{R}\) is poorly conditioned — long tap-delay lines with strongly correlated inputs are a familiar offender — convergence in slow eigen-directions becomes painful, and a step size that is safe in one direction is too aggressive in another. Increasing the filter length tends to worsen this conditioning rather than help it.&lt;/p>
&lt;p>This is the same pathology that shows up in deep learning when the loss surface is poorly conditioned. Plain gradient descent crawls along flat directions and bounces along steep ones. Per-coordinate scaling inside Adam, preconditioners, and the entire warm-up / cosine-decay literature exist to attack different facets of that conditioning problem. The DSP community called these &lt;em>eigenvalue spread&lt;/em> problems for decades before the ML community started calling them &lt;em>ill-conditioned loss landscapes&lt;/em>.&lt;/p>
&lt;p>In deep learning, the learning rate \(\eta\) plays the same role as \(\mu\) at higher abstraction. Too high and training diverges or chatters around a minimum. Too low and you underfit or burn compute without making progress. Learning-rate schedules, warm-up, and cosine decay are all variations on a single instinct: the right step size depends on the local geometry and may need to change over time.&lt;/p>
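&lt;p>As a concrete instance of that instinct, here is a minimal schedule sketch, linear warm-up into cosine decay, with every constant an illustrative assumption rather than a recommendation:&lt;/p>
&lt;pre>&lt;code class="language-python">import math

def lr_schedule(step, base_lr=3e-4, warmup_steps=500, total_steps=10_000):
    """Linear warm-up followed by cosine decay. All constants are illustrative."""
    warmup = min(1.0, (step + 1) / warmup_steps)
    progress = min(1.0, max(0.0, (step - warmup_steps) / (total_steps - warmup_steps)))
    return base_lr * warmup * 0.5 * (1.0 + math.cos(math.pi * progress))

# Ramp up, peak at base_lr, decay smoothly toward zero:
for step in (0, 250, 500, 5_000, 10_000):
    print(step, f"{lr_schedule(step):.2e}")
&lt;/code>&lt;/pre>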
&lt;h2 id="normalization-nlms-rmsprop-adam">Normalization: NLMS, RMSProp, Adam
&lt;/h2>&lt;p>To handle wide variations in input signal power, adaptive-filtering practitioners developed Normalized LMS (NLMS), which scales the update by the inverse of the current input energy:&lt;/p>
$$
\mathbf{w}(n+1) = \mathbf{w}(n) + \frac{\mu}{\|\mathbf{x}(n)\|^2 + \delta} \, e(n) \, \mathbf{x}(n)
$$&lt;p>with \(\delta\) a small regularizer that keeps the denominator from collapsing.&lt;/p>
&lt;p>There is a real conceptual line from NLMS to modern adaptive optimizers like RMSProp and Adam. There is also a real mechanical distinction that pop-ML writeups tend to flatten. NLMS normalizes by the &lt;em>instantaneous&lt;/em> input energy of the current sample. RMSProp and Adam normalize by an &lt;em>exponential moving average of squared gradients&lt;/em>. The intent is shared — keep the update from being driven by signal scale rather than error — but they react on different time horizons and stabilize different things in practice.&lt;/p>
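&lt;p>Side by side, the mechanical difference is a one-liner. Both steps below are textbook forms; the hyperparameter values are illustrative assumptions:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def nlms_step(w, x, d, mu=0.5, delta=1e-8):
    """NLMS: normalize by the *instantaneous* input energy of the current sample."""
    e = d - w @ x
    return w + (mu / (x @ x + delta)) * e * x

def rmsprop_step(w, grad, v, lr=1e-3, beta=0.9, eps=1e-8):
    """RMSProp: normalize by an exponential moving average of squared gradients,
    a statistic with memory across steps rather than a per-sample quantity."""
    v = beta * v + (1.0 - beta) * grad ** 2
    return w - lr * grad / (np.sqrt(v) + eps), v
&lt;/code>&lt;/pre>
&lt;p>The shared intent and the different time horizons are both visible: NLMS reads \(\|\mathbf{x}\|^2\) fresh on every sample, while RMSProp&amp;rsquo;s running average \(v\) remembers.&lt;/p>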
&lt;p>Two further points are worth stating without dressing them up.&lt;/p>
&lt;p>First, computational cost is part of why LMS and NLMS won and stayed. Both are \(O(N)\) per update in the filter length \(N\). Recursive Least Squares (RLS) and Newton-style methods give faster theoretical convergence but cost \(O(N^2)\) or worse, which is why nobody runs full second-order optimization on a 70-billion-parameter model either. &amp;ldquo;Good enough updates, very fast&amp;rdquo; beat &amp;ldquo;perfect updates, eventually&amp;rdquo; both in real-time DSP hardware and in the GPU cluster.&lt;/p>
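&lt;p>The cost asymmetry is visible directly in code. A sketch of one LMS step next to one exponentially-weighted RLS step (the forgetting factor 0.99 is an illustrative choice); the RLS update carries and rewrites an \(N \times N\) matrix on every sample:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

def lms_step(w, x, d, mu=0.05):
    """One LMS update: O(N), a dot product and a scaled vector add."""
    e = d - w @ x
    return w + mu * e * x

def rls_step(w, P, x, d, lam=0.99):
    """One exponentially-weighted RLS update: O(N^2), dominated by the
    N x N inverse-correlation matrix P that must be stored and updated."""
    e = d - w @ x                       # a priori error
    Px = P @ x
    k = Px / (lam + x @ Px)             # gain vector
    w = w + k * e
    P = (P - np.outer(k, Px)) / lam     # rank-one downdate of P
    return w, P
&lt;/code>&lt;/pre>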
&lt;p>Second, normalization by signal-scale statistics is the primary defense against divergence in high-variance environments. The exact statistic differs by algorithm, but the principle is shared across NLMS, RMSProp, Adam, and most of the layer-norm / batch-norm family: bound the update magnitude using something derived from the signal, so error can drive the direction without scale dominating the magnitude.&lt;/p>
&lt;h2 id="tracking-not-convergence">Tracking, Not Convergence
&lt;/h2>&lt;p>Adaptive filters were built for non-stationary environments: multipath fading, time-varying echoes, drifting noise floors. The &amp;ldquo;true&amp;rdquo; optimal weights are not fixed; they move. The filter is not supposed to converge once and freeze. It is supposed to &lt;em>track&lt;/em>. That mindset is much closer to production ML than the static batch fit on a fixed dataset that introductory courses still tend to lead with.&lt;/p>
&lt;p>Modern systems face the same phenomenon under different labels:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Distribution shift&lt;/strong> — the input distribution drifts.&lt;/li>
&lt;li>&lt;strong>Concept drift&lt;/strong> — the input-to-target relationship drifts.&lt;/li>
&lt;li>&lt;strong>Stale features&lt;/strong> — engineered signals decay as the underlying world changes.&lt;/li>
&lt;/ul>
&lt;p>The framing change matters. In a non-stationary environment, &amp;ldquo;convergence&amp;rdquo; is the wrong success metric. The right metric is &lt;em>tracking&lt;/em>: how well the system follows the moving optimum, with what lag, and at what excess error. DSP literature has been formal about this for decades — excess mean-square error, lag noise, tracking misadjustment — while production ML still tends to discover it informally, usually after a regression has shipped.&lt;/p>
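&lt;p>A toy simulation makes the distinction concrete. Every constant here is an assumption chosen so the drift is visible within a few thousand updates:&lt;/p>
&lt;pre>&lt;code class="language-python">import numpy as np

rng = np.random.default_rng(0)
N, T, mu = 8, 5_000, 0.05               # filter length, steps, step size; illustrative
w_true = rng.normal(size=N)             # the moving "optimal" weights
w = np.zeros(N)
dev = np.empty(T)

for t in range(T):
    w_true += 1e-3 * rng.normal(size=N)     # nonstationarity: a slow random walk
    x = rng.normal(size=N)
    d = w_true @ x + 1e-2 * rng.normal()    # desired response plus observation noise
    e = d - w @ x
    w += mu * e * x                         # LMS chases the moving optimum
    dev[t] = np.sum((w - w_true) ** 2)

# The deviation does not go to zero; it settles at a floor set by drift rate,
# noise, and step size: the tracking regime, not the convergence regime.
print(f"mean squared weight deviation, last 1000 steps: {dev[-1000:].mean():.4f}")
&lt;/code>&lt;/pre>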
&lt;p>Research on in-context learning in linear models (Akyürek et al., 2022) even investigates which classical learning algorithms are implicitly approximated by transformers under simplified conditions. That is one more sign that the boundary between adaptive signal processing and contemporary ML is thinner than course catalogs suggest.&lt;/p>
&lt;h2 id="a-continuous-mathematical-thread">A Continuous Mathematical Thread
&lt;/h2>&lt;p style="text-align: center;">
&lt;img src="https://corebaseit.com/diagrams/Historical_lineage_timeline_diagram.png" alt="Historical lineage / timeline diagram" style="max-width: 900px; width: 100%;" />
&lt;/p>
&lt;p>The transition from digital signal processing to artificial intelligence is more a change of vocabulary than a change of fundamental principles. The engineering rigor required to stabilize an echo canceller in 1970 is the same rigor required to train a multi-billion-parameter model today: error-driven updates, step-size discipline, normalization to handle input-scale variance, and the discipline of tracking a moving target rather than converging on a fixed one.&lt;/p>
&lt;p>For engineers coming from telecommunications, the move into AI is not a career pivot. It is recognition that the tools they already carry — stability analysis, second-order statistics, non-stationary tracking, complexity-aware design — are exactly the tools the new domain still depends on. For engineers coming from ML, the DSP literature is a large, well-tested archive of solutions to problems that show up again at scale.&lt;/p>
&lt;p>If you understand LMS, you already understand a piece of what every deep learning framework is doing when it steps the weights. The rest is scale, architecture, and tooling.&lt;/p>
&lt;p style="text-align: center;">
&lt;img src="https://corebaseit.com/diagrams/lms_corebaseit.png" alt="Step size / learning rate trade-off diagram" style="max-width: 900px; width: 100%;" />
&lt;/p>
&lt;hr>
&lt;h2 id="references">References
&lt;/h2>&lt;ul>
&lt;li>Widrow, B., &amp;amp; Hoff, M. E. &amp;ldquo;Adaptive switching circuits.&amp;rdquo; &lt;em>IRE WESCON Convention Record&lt;/em>, 4, 96–104, 1960.&lt;/li>
&lt;li>Haykin, S. &lt;em>Adaptive Filter Theory&lt;/em> (4th ed.). Prentice Hall, 2002.&lt;/li>
&lt;li>Rumelhart, D. E., Hinton, G. E., &amp;amp; Williams, R. J. &amp;ldquo;Learning representations by back-propagating errors.&amp;rdquo; &lt;em>Nature&lt;/em>, 323, 533–536, 1986.&lt;/li>
&lt;li>Sayed, A. H. &lt;em>Adaptive Filters&lt;/em>. Wiley-IEEE Press, 2008.&lt;/li>
&lt;li>Kingma, D. P., &amp;amp; Ba, J. &amp;ldquo;Adam: A Method for Stochastic Optimization.&amp;rdquo; &lt;em>ICLR&lt;/em>, 2015. &lt;a class="link" href="https://arxiv.org/abs/1412.6980" target="_blank" rel="noopener"
>arxiv.org/abs/1412.6980&lt;/a>&lt;/li>
&lt;li>Tieleman, T., &amp;amp; Hinton, G. &amp;ldquo;Lecture 6.5 — RMSProp: Divide the gradient by a running average of its recent magnitude.&amp;rdquo; &lt;em>COURSERA: Neural Networks for Machine Learning&lt;/em>, 2012.&lt;/li>
&lt;li>Akyürek, E., Schuurmans, D., Andreas, J., Ma, T., &amp;amp; Zhou, D. &amp;ldquo;What learning algorithm is in-context learning? Investigations with linear models.&amp;rdquo; 2022. &lt;a class="link" href="https://arxiv.org/abs/2211.15661" target="_blank" rel="noopener"
>arxiv.org/abs/2211.15661&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="further-reading">Further Reading
&lt;/h2>&lt;ul>
&lt;li>&lt;a class="link" href="https://corebaseit.com/posts/lms-adaptive-filters-and-neural-network-training/" >I Spent Years on Adaptive Filters. I Was Already Training Neural Networks.&lt;/a> — the same lineage, told from personal experience.&lt;/li>
&lt;li>&lt;a class="link" href="https://corebaseit.com/posts/stochastic-entropy-ai/" >Stochastic, Entropy &amp;amp; AI: From Thermodynamics to Information Theory to Modern Machine Learning&lt;/a> — adjacent thread on probability, information, and ML foundations.&lt;/li>
&lt;/ul></description></item></channel></rss>