<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Telecommunications on Corebaseit — POS · EMV · Payments · AI</title><link>https://corebaseit.com/tags/telecommunications/</link><description>Recent content in Telecommunications on Corebaseit — POS · EMV · Payments · AI</description><generator>Hugo -- gohugo.io</generator><language>en-us</language><managingEditor>contact@corebaseit.com (Vincent Bevia)</managingEditor><webMaster>contact@corebaseit.com (Vincent Bevia)</webMaster><lastBuildDate>Tue, 14 Apr 2026 10:00:00 +0100</lastBuildDate><atom:link href="https://corebaseit.com/tags/telecommunications/index.xml" rel="self" type="application/rss+xml"/><item><title>I Spent Years on Adaptive Filters. I Was Already Training Neural Networks.</title><link>https://corebaseit.com/corebaseit_posts_in_review/lms-adaptive-filters-and-neural-network-training/</link><pubDate>Tue, 14 Apr 2026 10:00:00 +0100</pubDate><author>contact@corebaseit.com (Vincent Bevia)</author><guid>https://corebaseit.com/corebaseit_posts_in_review/lms-adaptive-filters-and-neural-network-training/</guid><description>&lt;p>&lt;strong>I spent years implementing LMS-based equalizers and echo cancellers in telecommunications. Only later did I fully appreciate what I had been doing mathematically: the same family of update rules that powers neural network training today.&lt;/strong>&lt;/p>
&lt;p>Not as a loose analogy — as the same structure of optimization. Widrow and Hoff formalized the Least Mean Squares (LMS) algorithm in 1960 for the Adaline. Rumelhart, Hinton, and Williams scaled related ideas through multi-layer networks with backpropagation in 1986. The vocabulary changed from &lt;em>adaptive filtering&lt;/em> to &lt;em>deep learning&lt;/em>, but the core idea — adjust parameters in the direction that reduces error, one small step at a time — is continuous across both worlds.&lt;/p>
&lt;p>This post is my attempt to make that lineage explicit: what LMS actually is, why it is structurally the same rule as stochastic gradient descent on a linear model, how the engineering trade-offs line up, and why non-stationarity remains the hard problem in both domains.&lt;/p>
&lt;hr>
&lt;h2 id="lms-is-not-a-metaphor-for-training--it-is-the-algorithm">LMS Is Not a Metaphor for Training — It Is the Algorithm
&lt;/h2>&lt;p>The LMS update for a linear combiner (FIR filter or single Adaline) is:&lt;/p>
&lt;p>$$
\mathbf{w}(n+1) = \mathbf{w}(n) + \mu \, e(n) \, \mathbf{x}(n)
$$&lt;/p>
&lt;p style="text-align: center;">
&lt;img src="https://corebaseit.com/diagrams/LMS_SGD_structural_equivalence_diagram.png" alt="LMS = SGD structural equivalence diagram" style="max-width: 900px; width: 100%;" />
&lt;/p>
&lt;p>Here \(\mathbf{w}(n)\) is the weight vector at time \(n\), \(\mathbf{x}(n)\) is the input vector (tap-delay line or feature vector), \(e(n) = d(n) - y(n)\) is the error between the desired response \(d(n)\) and the output \(y(n) = \mathbf{w}^\top(n)\,\mathbf{x}(n)\), and \(\mu\) is the step size.&lt;/p>
&lt;p>That is &lt;strong>stochastic gradient descent&lt;/strong> on the instantaneous squared error \(\tfrac{1}{2}e^2(n)\) with respect to \(\mathbf{w}\). The gradient of \(\tfrac{1}{2}(d - \mathbf{w}^\top\mathbf{x})^2\) with respect to \(\mathbf{w}\) is \(-e\,\mathbf{x}\). Walking in the opposite direction of the gradient (or equivalently, in the direction \(+e\,\mathbf{x}\) when you define the update as above) is exactly the LMS rule.&lt;/p>
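&lt;p>The equivalence is easy to check numerically. A minimal numpy sketch (the vectors and step size below are illustrative, not from any real filter): one LMS step and one SGD step on the instantaneous loss produce the same weights, bit for bit.&lt;/p>

```python
import numpy as np

# Illustrative values: a 4-tap linear combiner at one time step n.
rng = np.random.default_rng(0)
w = rng.normal(size=4)    # current weights w(n)
x = rng.normal(size=4)    # input vector x(n) (tap-delay line)
d = 1.0                   # desired response d(n)
mu = 0.05                 # step size

# LMS form: w(n+1) = w(n) + mu * e(n) * x(n)
e = d - w @ x
w_lms = w + mu * e * x

# SGD form: w(n+1) = w(n) - mu * grad_w( 0.5 * (d - w.x)^2 )
grad = -(d - w @ x) * x   # analytic gradient of the instantaneous loss
w_sgd = w - mu * grad

assert np.allclose(w_lms, w_sgd)  # same update, two vocabularies
```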
&lt;p>So if you have ever shipped an LMS equalizer or echo canceller, you have implemented the foundational learning rule that underlies a huge fraction of modern machine learning: &lt;strong>small steps proportional to error times input&lt;/strong>. The notation in Haykin&amp;rsquo;s &lt;em>Adaptive Filter Theory&lt;/em> differs from PyTorch docs; the mathematics does not.&lt;/p>
&lt;p>Multi-layer networks add the chain rule (backpropagation) to compute how error propagates to earlier layers, but the &lt;strong>local&lt;/strong> update at a linear layer trained with mean squared error is still the same structural move: adjust weights in proportion to error and activations. Everything else — momentum, Adam, adaptive learning rates — is engineering on top of that spine.&lt;/p>
&lt;hr>
&lt;h2 id="the-engineering-trade-offs-are-the-same-trade-offs">The Engineering Trade-Offs Are the Same Trade-Offs
&lt;/h2>&lt;p>In telecommunications, the step size \(\mu\) controls the classic compromise: &lt;strong>convergence speed versus steady-state misadjustment&lt;/strong>. Too large — the filter can diverge or oscillate. Too small — the filter cannot track a fast-fading channel or a moving echo path. Entire chapters of adaptive filtering textbooks are devoted to stability bounds on \(\mu\) (often expressed in terms of input power and filter length) and to variants that fix the worst-case behavior.&lt;/p>
&lt;p style="text-align: center;">
&lt;img src="https://corebaseit.com/diagrams/Step_size_learning_rate_trade-off_diagram.png" alt="Step size / learning rate trade-off diagram" style="max-width: 900px; width: 100%;" />
&lt;/p>
&lt;p>In deep learning, the learning rate \(\eta\) plays the same role at a higher level: too high and training diverges or chatters around a minimum; too low and you underfit or burn compute without making progress. The community talks about learning-rate schedules, warm-up, and cosine decay — different names for the same instinct: &lt;strong>the right step size depends on the landscape and may need to change over time&lt;/strong>.&lt;/p>
&lt;p>&lt;strong>Normalized LMS (NLMS)&lt;/strong> scales the update by the inverse of the input energy \(\lVert\mathbf{x}(n)\rVert^2\) (with a small regularizer to avoid division by zero). The goal is stable convergence when input power varies — the same motivation that shows up in adaptive optimizers that normalize updates by running statistics of gradients (RMSProp-style normalization is not identical to NLMS, but the &lt;em>intent&lt;/em> — tame the step when the signal scale changes — is shared). The DSP community spent decades refining these ideas for real-time hardware; ML rediscovered many of the same pressures when training became unstable at scale.&lt;/p>
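&lt;p>A sketch of the NLMS step (function and parameter names are mine, not from any particular library). The self-normalizing property is easy to verify: with \(\mu = 1\), one step drives the a-posteriori error \(d - \mathbf{w}(n+1)^\top\mathbf{x}\) to essentially zero regardless of how the input is scaled.&lt;/p>

```python
import numpy as np

# NLMS: divide the raw LMS step by the input energy ||x||^2, with a
# small eps regularizer so quiet inputs never cause division by zero.
def nlms_step(w, x, d, mu=1.0, eps=1e-12):
    e = d - w @ x
    return w + (mu / (x @ x + eps)) * e * x

rng = np.random.default_rng(2)
w = rng.normal(size=4)
d = 0.7
for scale in (1.0, 1e-3, 1e3):          # wildly different input power
    x = scale * rng.normal(size=4)
    w_new = nlms_step(w, x, d)
    # With mu = 1 the a-posteriori error is (almost) exactly zero,
    # independent of input energy: the step self-normalizes.
    print(abs(d - w_new @ x))
```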
&lt;hr>
&lt;h2 id="non-stationarity-was-always-the-real-problem--and-still-is">Non-Stationarity Was Always the Real Problem — and Still Is
&lt;/h2>&lt;p>Adaptive filters were built for &lt;strong>non-stationary&lt;/strong> environments: multipath fading, time-varying echoes, drifting noise floors. The “true” optimal weights are not fixed; they move. The filter is not supposed to converge once and freeze — it is supposed to &lt;strong>track&lt;/strong>. That mindset is closer to production ML than a static batch fit on a fixed dataset.&lt;/p>
&lt;p>Modern systems face the same phenomenon under different labels: &lt;strong>distribution shift&lt;/strong>, &lt;strong>concept drift&lt;/strong>, stale features, changing user behavior, adversarial drift in inputs. The model that was optimal last month is not guaranteed to be optimal this month. Retraining on a schedule, online updates, monitoring, and guardrails are the engineering response — conceptually in the same family as “never assume the channel is static.”&lt;/p>
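&lt;p>A toy tracking experiment shows the family resemblance (the drift model and constants are assumed for illustration): the &amp;ldquo;true&amp;rdquo; weights follow a slow random walk, so the optimum keeps moving. An LMS filter that keeps adapting tracks it; a snapshot frozen after an initial fit goes stale — the adaptive-filter version of training once and deploying into distribution shift.&lt;/p>

```python
import numpy as np

# Non-stationary system identification: the optimal weights drift,
# so "converge once and freeze" loses to continuous tracking.
rng = np.random.default_rng(3)
L, n_steps, mu = 4, 5000, 0.05
w_true = rng.normal(size=L)
w_track = np.zeros(L)
w_frozen = None
err_track = err_frozen = 0.0
for n in range(n_steps):
    w_true = w_true + 0.01 * rng.normal(size=L)  # slowly drifting optimum
    x = rng.normal(size=L)
    d = w_true @ x
    e = d - w_track @ x
    w_track = w_track + mu * e * x               # keep adapting (tracking)
    if n == 500:
        w_frozen = w_track.copy()                # "train once, deploy"
    if n > 500:
        err_track += e ** 2
        err_frozen += (d - w_frozen @ x) ** 2

print(err_track < err_frozen)  # tracking beats the frozen snapshot
```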
&lt;p>Research on in-context learning in linear models (for example Akyürek et al., 2022) even investigates which learning algorithms are implicitly approximated by transformers under simplified settings — another reminder that the boundary between classical adaptive signal processing and contemporary ML is thinner than course catalogs suggest.&lt;/p>
&lt;hr>
&lt;h2 id="the-bigger-picture">The Bigger Picture
&lt;/h2>&lt;p style="text-align: center;">
&lt;img src="https://corebaseit.com/diagrams/Historical_lineage_timeline_diagram.png" alt="Historical lineage / timeline diagram" style="max-width: 900px; width: 100%;" />
&lt;/p>
&lt;p>For engineers who came up through &lt;strong>telecommunications and signal processing&lt;/strong>, the move into AI is often described as a career pivot. In my experience, it is closer to a &lt;strong>change of vocabulary&lt;/strong> on top of a continuous mathematical thread: error-driven updates, step-size discipline, stability under non-stationarity, and the centrality of second-order statistics (explicitly in LMS, implicitly in much of modern training).&lt;/p>
&lt;p>The boundary between DSP and machine learning was never as sharp as the literature implied. If you understand LMS, you already understand a piece of what every deep learning framework is doing when it steps the weights. The rest is scale, architecture, and tooling — important, but not magic.&lt;/p>
&lt;hr>
&lt;h2 id="references">References
&lt;/h2>&lt;ul>
&lt;li>Widrow, B., &amp;amp; Hoff, M. E. &amp;ldquo;Adaptive switching circuits.&amp;rdquo; &lt;em>IRE WESCON Convention Record&lt;/em>, 4, 96–104, 1960.&lt;/li>
&lt;li>Haykin, S. &lt;em>Adaptive Filter Theory&lt;/em> (4th ed.). Prentice Hall, 2002.&lt;/li>
&lt;li>Rumelhart, D. E., Hinton, G. E., &amp;amp; Williams, R. J. &amp;ldquo;Learning representations by back-propagating errors.&amp;rdquo; &lt;em>Nature&lt;/em>, 323, 533–536, 1986.&lt;/li>
&lt;li>Akyürek, E. et al. &amp;ldquo;What learning algorithm is in-context learning? Investigations with linear models.&amp;rdquo; 2022. &lt;a class="link" href="https://arxiv.org/abs/2211.15661" target="_blank" rel="noopener"
>arxiv.org/abs/2211.15661&lt;/a>&lt;/li>
&lt;/ul>
&lt;h2 id="further-reading">Further reading
&lt;/h2>&lt;ul>
&lt;li>&lt;a class="link" href="https://corebaseit.com/posts/stochastic-entropy-ai/" >Stochastic, Entropy &amp;amp; AI: From Thermodynamics to Information Theory to Modern Machine Learning&lt;/a> — related thread on probability, information, and ML foundations&lt;/li>
&lt;li>&lt;em>The Obsolescence Paradox: Why the Best Engineers Will Thrive in the AI Era&lt;/em> — engineering judgment as tools and vocabulary change&lt;/li>
&lt;/ul></description></item></channel></rss>