Mathematical Foundations on Corebaseit — POS · EMV · Payments · AI · Telecommunications

Nyquist is not Shannon: why more samples does not mean more information

contact@corebaseit.com (Vincent Bevia) — Fri, 08 May 2026 12:00:00 +0200

Digital engineers are trained to treat “more” as a default win: higher clock rates, wider bandwidths, deeper bit depths. Applied to analog-to-digital conversion, the intuition follows naturally. Sample faster, capture more points per second, and the digital record should carry more information about the signal.

That intuition breaks in a specific way. A receiver can ingest samples at gigahertz rates and still fail to recover the message. The waveform may be digitized faithfully while the symbols remain indistinguishable. Sampling governs representation. Channel capacity governs recoverable information. Those are related problems in a receiver chain, but they answer different questions.

Two limits, two questions

Nyquist (more precisely, the Nyquist–Shannon sampling theorem) sets the conditions under which a bandlimited continuous-time signal can be represented by discrete samples and reconstructed without aliasing. The sampling rate $f_s$ must exceed twice the highest frequency $f_{\max}$ in the signal. In practice, anti-aliasing filters enforce a usable band so that energy above $f_s/2$ never reaches the ADC.

Shannon channel capacity sets the maximum rate at which information can be transmitted reliably over a noisy channel with bandwidth $B$ and signal-to-noise ratio $\mathrm{SNR}$ (a linear power ratio, not decibels):

$$ C = B \log_2(1 + \mathrm{SNR}) $$

Nyquist is about the analog–digital boundary: sample rate, filter transition bands, and spectral folding. Shannon is about the communication path: how many bits per second the channel can support given noise.

A system can satisfy Nyquist and still fail at the Shannon layer. An ADC may capture the passband waveform with high fidelity, but if channel SNR is too low, constellation points smear across decision boundaries. The receiver has samples. It lacks evidence to pick the correct symbol. The waveform is preserved; the message is not recoverable at the target rate.

Aspect	Nyquist (sampling)	Shannon (capacity)
Primary concern	Waveform representation at the ADC	Reliable information rate over a noisy channel
Key variables	$f_s$, anti-aliasing filter, signal bandwidth	Channel bandwidth $B$, $\mathrm{SNR}$
Typical failure	Aliasing: spectral overlap in the digitized band	Symbol errors: noise dominates decision regions
Fix direction	Raise $f_s$ or tighten the analog filter	More power, better coding, wider channel, lower noise

Aliasing is structural damage

When $f_s \le 2f_{\max}$, high-frequency content folds into lower bands. That is aliasing: distinct spectral components become indistinguishable in the sampled sequence.

Once overlap lands inside the band you intend to process, the mixture is not invertible. DSP, blind source separation, and forward error correction operate on the samples you already have. They cannot reconstruct which energy came from which original frequency component. FEC adds redundancy to survive bit errors; it does not undo a frequency-domain fold that happened at the sampler.
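A minimal sketch of that non-invertibility, assuming an illustrative 1 kHz sample rate and two hypothetical tones at 300 Hz and 700 Hz (none of these values come from a real design). Once sampled, the two tones produce the same sequence, so nothing downstream can tell them apart.

```python
import math


def alias_frequency(f_hz: float, fs_hz: float) -> float:
    """Apparent frequency in [0, fs/2] of a real tone at f_hz sampled at fs_hz."""
    f_mod = f_hz % fs_hz
    return min(f_mod, fs_hz - f_mod)


def sample_cosine(f_hz: float, fs_hz: float, n_samples: int) -> list[float]:
    """Ideal samples of cos(2*pi*f*t) taken at t = n / fs."""
    return [math.cos(2.0 * math.pi * f_hz * n / fs_hz) for n in range(n_samples)]


fs = 1000.0                               # assumed sample rate, Hz
in_band = sample_cosine(300.0, fs, 16)    # tone below fs/2
folded = sample_cosine(700.0, fs, 16)     # tone above fs/2

# Largest sample-by-sample difference between the two records.
max_diff = max(abs(a - b) for a, b in zip(in_band, folded))

print("700 Hz appears at", alias_frequency(700.0, fs), "Hz")
print("max difference between the two sample records:", max_diff)
```

The difference is at floating-point rounding level: the sampled record carries no trace of which tone was present, which is why the fix has to happen in the analog filter ahead of the sampler.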

This is why the anti-aliasing filter sits at the front of the chain. Its job is to attenuate out-of-band energy before it reaches the ADC so the first stored sample is not already corrupted.

Confirmed: Sampling below the Nyquist rate, so that folded energy lands in the wanted band, is a sampling-theorem violation with no post-hoc digital fix.

Nuance: If you oversample and aliased images fall entirely outside the band of interest, digital filters can remove them. The irreversibility applies when folded energy overlaps the signal band you need to recover.

The oversampling fallacy about Shannon capacity

A common mistake is to treat a higher ADC sample rate $f_s$ as if it were extra Shannon bandwidth. It is not.

Shannon’s $B$ is the bandwidth of the physical channel (or the effective information-bearing band after filtering), not the converter clock. Raising $f_s$ without increasing usable bandwidth or SNR at the channel does not raise $C$. Sampling a 10 MHz channel at 100 GHz with poor SNR gives you a dense record of a signal buried in noise. The sample count increases. The recoverable information rate does not.

Oversampling also does not create independent observations of the channel. Samples taken faster than the signal bandwidth are correlated. Averaging them after proper filtering can improve implementation SNR in specific architectures, but the spectral-efficiency ceiling for the band is still set by Shannon’s formula on that band’s $B$ and $\mathrm{SNR}$.
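As a rough illustration, the capacity formula can be written as a function whose only inputs are $B$ and $\mathrm{SNR}$. The 10 MHz bandwidth echoes the example above; the 0 dB SNR and the list of ADC rates are assumptions chosen for the sketch.

```python
import math


def shannon_capacity_bps(bandwidth_hz: float, snr_db: float) -> float:
    """Shannon capacity C = B * log2(1 + SNR) for an AWGN channel, in bit/s."""
    snr_linear = 10.0 ** (snr_db / 10.0)  # the formula needs a linear power ratio
    return bandwidth_hz * math.log2(1.0 + snr_linear)


channel_bw_hz = 10e6    # channel bandwidth from the example in the text
channel_snr_db = 0.0    # assumed "poor SNR" operating point

# The ADC rate is listed only to show that it never enters the calculation.
for adc_rate_hz in (25e6, 1e9, 100e9):
    capacity = shannon_capacity_bps(channel_bw_hz, channel_snr_db)
    print(f"f_s = {adc_rate_hz:.3g} Hz -> C = {capacity / 1e6:.2f} Mbit/s")
```

Every row prints the same capacity. The only ways to move the number are the two arguments of the function: usable bandwidth and SNR.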

What oversampling actually buys you

Oversampling is an architecture lever, not a capacity multiplier.

At the minimum Nyquist rate, the analog anti-aliasing filter must approximate a brick wall: very sharp rolloff, tight transition band, difficult phase and group-delay control. High-order analog filters are expensive, sensitive to component tolerance, and prone to passband ripple and group-delay variation. That group-delay flatness matters when you run high-order QAM: phase distortion at band edges directly hurts EVM.

Sampling well above Nyquist widens the gap between the signal band and the first aliased image around $f_s - f_{\max}$. The analog filter can roll off gently. The steep rejection moves to digital filters, where coefficients are exact, repeatable, and field-updatable.

Oversampling also buys margin in the digital domain: more samples per symbol for timing recovery, simpler interpolation in clock recovery loops, and headroom against clock jitter in mobile links. These are implementation advantages. They simplify hardware and improve robustness. They do not rewrite the Shannon limit on the RF or wired channel feeding the receiver.

Quantization noise: spread, shape, filter

The other major payoff appears inside the ADC itself.

A finite-resolution quantizer injects quantization noise. In the usual first-order model, total quantization noise power is fixed by the step size $\Delta$ (approximately $\Delta^2/12$ for a uniform quantizer), independent of sample rate. When you oversample by factor $M$, that fixed noise power spreads across a wider Nyquist band (up to $f_s/2$). Noise power spectral density in the original signal band drops by approximately a factor of $M$, so once a digital filter removes the out-of-band portion you gain roughly 3 dB of in-band SNR (about half a bit of ENOB) per doubling of sample rate for white quantization noise. That trades sample rate for effective resolution. This is process gain against quantization noise, not an increase in Shannon channel capacity.
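The arithmetic is short enough to sketch. This assumes white quantization noise and ideal digital filtering to the original band; the function names are illustrative.

```python
import math


def oversampling_gain_db(oversampling_ratio: float) -> float:
    """In-band SNR gain from spreading white quantization noise over M times the band."""
    return 10.0 * math.log10(oversampling_ratio)


def enob_gain_bits(oversampling_ratio: float) -> float:
    """Equivalent resolution gain: each 6.02 dB of SNR is worth one bit."""
    return 0.5 * math.log2(oversampling_ratio)


for m in (2, 4, 16, 64):
    print(f"M = {m:3d}: {oversampling_gain_db(m):5.2f} dB, "
          f"{enob_gain_bits(m):.2f} bits")
```

Without noise shaping, one extra bit costs a factor of four in sample rate, which is the motivation for the delta-sigma approach described next.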

Delta–sigma converters push further. A feedback loop shapes quantization noise out of the band of interest and into high frequencies where the oversampling headroom lives. A digital decimation filter then removes the shaped noise. The analog front end can stay coarse; precision emerges from digital filtering and noise shaping rather than from an impractically linear multi-bit analog quantizer at the input.

Typical sequence:

Oversample — spread quantization noise across a wide band.
Shape — move noise energy out of the signal band (delta–sigma loop).
Decimate and filter — keep the signal band, discard the shaped noise.
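A toy version of that sequence, assuming a first-order loop with a 1-bit quantizer and a DC input. The boxcar average stands in for the decimation filter; it is far cruder than the sinc-type filters typically used, and every name here is illustrative.

```python
def first_order_delta_sigma(x: float, n_samples: int) -> list[float]:
    """1-bit first-order modulator: integrate the error, quantize to +/-1."""
    integrator = 0.0
    feedback = 0.0          # previous quantizer output fed back to the input
    bits = []
    for _ in range(n_samples):
        integrator += x - feedback
        feedback = 1.0 if integrator >= 0.0 else -1.0
        bits.append(feedback)
    return bits


def decimate_by_averaging(bits: list[float], factor: int) -> list[float]:
    """Crude decimation filter: average non-overlapping blocks of `factor` samples."""
    return [sum(bits[i:i + factor]) / factor
            for i in range(0, len(bits) - factor + 1, factor)]


dc_input = 0.3                                   # assumed input, |x| < 1
bitstream = first_order_delta_sigma(dc_input, 4096)
decimated = decimate_by_averaging(bitstream, 256)

print("first bits:", bitstream[:12])
print("decimated estimates:", [round(v, 4) for v in decimated[:4]])
```

Each stored sample is only +1 or -1, yet the block averages sit close to the input value: the resolution comes from the loop and the digital filter, not from the quantizer.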

Layered constraints in real receivers

Modern receivers in cellular base stations, Wi-Fi access points, and satellite modems hit both layers. Nyquist failures corrupt the waveform before baseband processing starts. Shannon failures leave the waveform intact but the link rate unsupportable at the target modulation and coding scheme.

Useful design questions stay separate:

Nyquist layer: Did we band-limit and sample so the digitized waveform preserves the intended spectrum?
Shannon layer: Given channel $B$ and $\mathrm{SNR}$, can this modulation and code rate operate at the target BER or BLER?

Confusing the two produces expensive mistakes: chasing sample rate when the link is SNR-limited, or squeezing the analog filter while ignoring aliasing risk.

In receiver work I have seen both failure modes. A lab trace with clean time-domain samples and a constellation plot that looks like a fuzzy disk is a Shannon-layer problem. A spectrum with folded interferers sitting on top of the wanted channel is a Nyquist-layer problem. The debug path differs completely.

Oversampling belongs in the toolkit for filter relaxation, timing margin, and quantization noise management. Shannon’s formula still defines how much information the channel can carry. Nyquist still defines whether the ADC output is a faithful starting point. Keep the layers separate and the architecture gets easier to reason about.

References

H. Nyquist, “Certain topics in telegraph transmission theory,” Transactions of the American Institute of Electrical Engineers, vol. 47, no. 2, pp. 617–644, 1928.
C. E. Shannon, “Communication in the presence of noise,” Proceedings of the IRE, vol. 37, no. 1, pp. 10–21, Jan. 1949. (Shannon capacity formula for bandlimited AWGN channels.)
C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, 1948.
A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, 3rd ed. Pearson, 2010. (Sampling, aliasing, and discrete-time analysis.)
J. G. Proakis and M. Salehi, Digital Communications, 5th ed. McGraw-Hill, 2008. (Channel capacity, modulation, and receiver performance.)
R. G. Lyons, Understanding Digital Signal Processing, 3rd ed. Pearson, 2011. (Oversampling, decimation, and practical ADC considerations.)
R. Schreier and G. C. Temes, Understanding Delta-Sigma Data Converters. IEEE Press, 2005. (Noise shaping and oversampled converter architectures.)

FIR Filter Design on FPGA: Manual Engineering vs AI-Assisted Workflows

contact@corebaseit.com (Vincent Bevia) — Tue, 05 May 2026 17:12:00 +0200

Designing an FIR filter on FPGA still starts with fundamentals: convolution, z-domain structure, fixed-point limits, and synthesis constraints. AI can accelerate the workflow, but it does not remove the need for engineering judgment. In practice, the speed gains only hold when the output is verified with the same rigor used in a manual flow.

This post walks through both paths using the same 8-tap low-pass example:

Manual path: derive and validate the math, implement C reference code, write synthesizable Verilog, and verify behavior.
AI-assisted path: use an agent to generate and iterate quickly, then validate outputs against deterministic checks.

Companion document

The full engineering reference (extended derivations, full C and Verilog listings, and workflow comparison tables) is available as a companion PDF.

The baseline: what must be correct before writing RTL

For a causal FIR of length $N$:

\[ y[n] = \sum_{k=0}^{N-1} h[k]\,x[n-k] \]

and

\[ H(z) = \sum_{k=0}^{N-1} h[k] z^{-k} \]

If you rewrite $H(z)$ as $P(z)/z^{N-1}$, the denominator gives $N-1$ poles at $z=0$. That part is structural. The design behavior is driven by the zeros of $P(z)$.

For the coefficient set used here (Q15):

\[ h = [1024, 2048, 4096, 8192, 8192, 4096, 2048, 1024] \]

Key properties:

Symmetric taps $h[k]=h[N-1-k]$ give linear phase.
Group delay is constant: $(N-1)/2 = 3.5$ samples.
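DC gain is $\sum h / 32768 = 30720/32768 = 0.9375$.
Nyquist response is zero for this set: the alternating-sign sum of the taps cancels exactly, as it does for any even-length symmetric FIR, which is consistent with low-pass behavior.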

These are fast checks and they catch many implementation mistakes early.
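These checks can be scripted directly from the tap list. The sketch below uses plain Python; the helper name `freq_response` is hypothetical and not taken from the companion listings.

```python
import cmath
import math

H_Q15 = [1024, 2048, 4096, 8192, 8192, 4096, 2048, 1024]
Q15_ONE = 1 << 15  # 32768, the Q15 representation of 1.0


def freq_response(h: list[int], omega: float) -> complex:
    """H(e^{j*omega}) for Q15 taps, scaled back to real-valued gain."""
    return sum(c * cmath.exp(-1j * omega * k) for k, c in enumerate(h)) / Q15_ONE


is_symmetric = H_Q15 == H_Q15[::-1]            # h[k] == h[N-1-k]
group_delay = (len(H_Q15) - 1) / 2             # samples, valid for symmetric taps
dc_gain = abs(freq_response(H_Q15, 0.0))       # omega = 0
nyquist_gain = abs(freq_response(H_Q15, math.pi))  # omega = pi

print("symmetric taps:", is_symmetric)
print("group delay (samples):", group_delay)
print("tap sum:", sum(H_Q15), "DC gain:", dc_gain)
print("Nyquist gain:", nyquist_gain)
```

Running the same four lines against coefficients read back from RTL simulation or a generated header is a cheap way to catch reordered or mis-scaled taps.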

Pole-zero analysis: manual C vs generated code

For $N=8$, the numerator polynomial is degree 7, so closed-form roots are not practical. A numerical root finder (for example, Durand-Kerner) is the right engineering move.

In a manual flow, writing and debugging the complex arithmetic takes time but gives full control over convergence and diagnostics. In an AI-assisted flow, the same solver is typically generated in seconds.

The trade-off is straightforward:

Manual: slower, but you understand every line and failure mode.
AI-assisted: much faster draft, but you must verify normalization, convergence tolerance, and root interpretation.

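Because the taps are real and symmetric, the zeros of $P(z)$ come in conjugate pairs and in reciprocal pairs: any zero off the unit circle at $z_0$ is accompanied by $1/z_0$, $z_0^*$, and $1/z_0^*$. In this coefficient set, three of the seven zeros lie on the unit circle (one at $z=-1$ and a conjugate pair at roughly $\pm 159^\circ$), which matches stopband notch behavior; the remaining four form a reciprocal-conjugate quadruplet off the circle.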
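A compact Durand-Kerner sketch for locating the zeros numerically. It is not the listing from the companion document; the starting points, iteration cap, and tolerance are conventional assumptions and may need tuning for other polynomials.

```python
def durand_kerner(coeffs, max_iter=500, tol=1e-12):
    """Roots of a polynomial; coeffs run from the highest power to the constant."""
    lead = coeffs[0]
    monic = [c / lead for c in coeffs]       # normalize to a monic polynomial
    degree = len(monic) - 1
    # Standard non-symmetric starting points: powers of 0.4 + 0.9j.
    roots = [(0.4 + 0.9j) ** k for k in range(degree)]

    def evaluate(z):
        acc = 0j
        for c in monic:                      # Horner evaluation
            acc = acc * z + c
        return acc

    for _ in range(max_iter):
        max_step = 0.0
        for i in range(degree):
            denom = 1 + 0j
            for j in range(degree):
                if j != i:
                    denom *= roots[i] - roots[j]
            step = evaluate(roots[i]) / denom
            roots[i] -= step
            max_step = max(max_step, abs(step))
        if max_step < tol:                   # all estimates have stopped moving
            break
    return roots


H_Q15 = [1024, 2048, 4096, 8192, 8192, 4096, 2048, 1024]
zeros = durand_kerner(H_Q15)   # P(z) = sum h[k] z^(N-1-k), highest power first

for z in sorted(zeros, key=abs):
    print(f"|z| = {abs(z):.6f}   z = {z.real:+.6f} {z.imag:+.6f}j")
```

Sorting by magnitude makes the structure easy to read: zeros with magnitude 1 are notches on the unit circle, and any zero with magnitude r off the circle should have a partner at 1/r.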

RTL architecture choice: direct form or transpose form

Both forms are valid. The right choice depends on clock target and debug priorities.

Direct form is often easier to inspect in simulation because the delay line is explicit. Transpose form usually closes timing at higher frequencies on modern FPGA DSP slices due to cascade-friendly accumulation.

For Xilinx UltraScale+ devices, transpose form maps naturally to DSP48E2 cascade paths (PCIN/PCOUT), reducing fabric routing pressure in the accumulator chain.

Practical rule:

If you are iterating quickly or tap count is modest, direct form is easier to debug.
If Fmax is tight, transpose form is usually the safer production architecture.
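Whichever structure you pick, the two must be bit-identical in integer arithmetic. A small behavioral model, written here in Python as an assumed software reference rather than RTL, makes that easy to confirm before any Verilog exists.

```python
H_Q15 = [1024, 2048, 4096, 8192, 8192, 4096, 2048, 1024]


def fir_direct(h, samples):
    """Direct form: explicit delay line, then one sum of products per sample."""
    delay = [0] * len(h)
    out = []
    for x in samples:
        delay = [x] + delay[:-1]                 # shift the new sample in
        out.append(sum(c * d for c, d in zip(h, delay)))
    return out


def fir_transpose(h, samples):
    """Transpose form: the input feeds every tap, partial sums ripple forward."""
    n = len(h)
    state = [0] * (n - 1)                        # partial-sum registers
    out = []
    for x in samples:
        out.append(h[0] * x + state[0])
        # Each register takes its own tap product plus the next register's old value.
        state = [h[k + 1] * x + (state[k + 1] if k + 1 < n - 1 else 0)
                 for k in range(n - 1)]
    return out


impulse = [1] + [0] * 9
stimulus = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, 0, 0]   # arbitrary test samples

print("impulse response:", fir_direct(H_Q15, impulse)[:8])
print("forms match:", fir_direct(H_Q15, stimulus) == fir_transpose(H_Q15, stimulus))
```

The outputs here are full-precision accumulator values, with no Q15 scaling or saturation applied, so any mismatch points at structure rather than rounding.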

Fixed-point discipline is not optional

A reliable accumulator width rule is:

\[ ACC\_WIDTH \ge DATA\_WIDTH + COEF\_WIDTH + \lceil \log_2(N) \rceil \]

For 16-bit data, 16-bit coefficients, and $N=8$:

\[ ACC\_WIDTH \ge 16 + 16 + 3 = 35 \]

Using 40 bits gives practical margin and simplifies edge-case handling when pre-adders and saturation logic are included.
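The width rule is easy to encode, and it is worth comparing against the exact worst case for a specific tap set. The sketch below assumes signed two's-complement data and uses hypothetical function names.

```python
import math

H_Q15 = [1024, 2048, 4096, 8192, 8192, 4096, 2048, 1024]


def acc_width_rule(data_width: int, coef_width: int, n_taps: int) -> int:
    """Conservative rule: product width plus growth for summing n_taps products."""
    return data_width + coef_width + math.ceil(math.log2(n_taps))


def acc_width_worst_case(h: list[int], data_width: int) -> int:
    """Exact signed width needed when every sample sits at full-scale magnitude."""
    max_input = 1 << (data_width - 1)            # |most negative sample|
    max_acc = max_input * sum(abs(c) for c in h)
    return max_acc.bit_length() + 1              # +1 for the sign bit


print("rule of thumb:", acc_width_rule(16, 16, 8), "bits")
print("worst case for these taps:", acc_width_worst_case(H_Q15, 16), "bits")
```

The rule is deliberately pessimistic because it assumes full-scale coefficients. The exact figure is useful when DSP slice width or routing is tight, but it has to be recomputed whenever the coefficients change.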

Before touching synthesis reports, validate three items in software:

Impulse response reproduces coefficients in order.
DC and Nyquist points match expected values.
Overflow behavior is explicit and deterministic.

This avoids burning time in RTL debug for issues that are purely numeric.
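For the overflow item, one possible convention is round-half-up followed by saturation to the output width. The post does not prescribe a rounding mode, so treat both the mode and the helper names below as assumptions to be matched against whatever the RTL actually does.

```python
def saturate(value: int, width: int) -> int:
    """Clamp to the signed two's-complement range of the given width."""
    hi = (1 << (width - 1)) - 1
    lo = -(1 << (width - 1))
    return max(lo, min(hi, value))


def q15_round_and_saturate(acc: int, out_width: int = 16) -> int:
    """Scale a Q15-weighted accumulator back to sample units, then saturate."""
    rounded = (acc + (1 << 14)) >> 15    # add half an LSB, arithmetic shift right
    return saturate(rounded, out_width)


tap_sum = 30720                          # sum of the Q15 taps used in this post

print(q15_round_and_saturate(32767 * tap_sum))    # full-scale DC input
print(q15_round_and_saturate(1 << 34))            # forced positive overflow
print(q15_round_and_saturate(-(1 << 34)))         # forced negative overflow
```

Python's `>>` on negative integers is an arithmetic (flooring) shift, which mirrors a signed shift in Verilog; a reference written in C needs care here because right-shifting a negative value is implementation-defined.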

What AI changes in the engineering loop

AI is strong at producing boilerplate and first-pass implementations:

C99 FIR helpers (frequency response, coefficient generation, fixed-point simulation)
Parameterized Verilog modules and testbenches
Documentation and consistency checks across artifacts

AI is weaker where project context and hardware constraints dominate:

Timing closure decisions
Device-specific synthesis behavior
Numerical edge cases at format boundaries
Silent assumptions in prompt wording

The practical operating model is:

Use AI for acceleration.
Keep deterministic verification as gate criteria.
Treat generated code as untrusted until it passes the same checks as handwritten code.

That model preserves velocity without lowering engineering standards.

A verification protocol that scales

When using an agent for DSP + FPGA tasks, this checklist keeps quality stable:

Confirm poles and zero magnitudes against expected structure.
Recompute key response points independently (at least DC and Nyquist).
Run impulse and sine tests in simulation and compare with software reference output.
Inspect synthesis utilization to confirm intended DSP inference.
Check timing reports for the actual critical path, not the expected one.

If these checks pass, AI is a clear multiplier. If they are skipped, speed gains disappear quickly.

Closing perspective

Manual engineering and AI-assisted engineering are not competing methods. They are layered methods. The manual path builds the model in your head; the AI path compresses execution time once that model exists.

In FIR-on-FPGA work, that distinction matters. The engineer still owns correctness, timing, and numerical integrity. AI can do a lot of work. It cannot assume responsibility.

This diagram summarizes the full thread of the post: start from FIR fundamentals in the z-domain, choose an RTL structure that fits timing targets, validate fixed-point behavior before synthesis, and use AI as an accelerator only after deterministic checks are in place. It is the same workflow, with and without AI; the difference is execution speed, not verification rigor.

References

A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, 3rd ed., Pearson, 2010.
J. G. Proakis and D. G. Manolakis, Digital Signal Processing: Principles, Algorithms, and Applications, 4th ed., Pearson, 2006.
Xilinx, FIR Compiler v7.2 Product Guide (PG149), AMD/Xilinx.
Xilinx, UltraScale Architecture DSP Slice User Guide (UG579), AMD/Xilinx.
Intel, AN 306: Implementing FIR Filters in Intel FPGA Devices, Intel Corporation.
IEEE, IEEE Std 1800-2017: SystemVerilog Language Reference Manual, IEEE.

Adaptive Filters and Stochastic Gradient Descent: One Update Rule, Two Vocabularies

contact@corebaseit.com (Vincent Bevia) — Wed, 29 Apr 2026 10:00:00 +0200

Modern AI is often framed as a clean break from classical engineering. For anyone who has worked in adaptive signal processing, that framing is misleading. The mathematical spine of stochastic gradient descent (SGD) is the same spine that has driven adaptive filters in telecommunications since 1960, and the engineering trade-offs that made LMS-based equalizers and echo cancellers reliable in production map almost directly onto the ones that govern training a deep network today.

This post lays out that mapping carefully. The intent is not poetry. It is to give engineers who came up through DSP a precise translation table into modern ML, and engineers who came up through ML a sense of what more than sixty years of adaptive filtering literature already settled.

A Continuous Lineage, Not a New Beginning

The historical sequence is short and well-documented. Widrow and Hoff formalized the Least Mean Squares (LMS) algorithm in 1960 for the Adaline (Adaptive Linear Neuron). Rumelhart, Hinton, and Williams scaled the same family of error-driven update rules to multi-layer networks with backpropagation in 1986. The vocabulary moved from adaptive filtering to deep learning, but the underlying move — iterative parameter adjustment proportional to error — was continuous across both worlds.

The claim of this post is narrower than “AI came from DSP” and more useful: for a linear model trained on mean squared error, the LMS update is exactly the SGD update. Multi-layer training adds the chain rule on top of that local move; it does not replace it.

LMS as Stochastic Gradient Descent

The LMS update for a linear combiner (FIR filter or single Adaline) is:

$$ \mathbf{w}(n+1) = \mathbf{w}(n) + \mu \, e(n) \, \mathbf{x}(n) $$

with $\mathbf{w}(n)$ the weight vector at step $n$, $\mathbf{x}(n)$ the input vector (tap-delay line in DSP, feature vector in ML), $e(n) = d(n) - y(n)$ the error between the desired response $d(n)$ and the output $y(n) = \mathbf{w}^\top(n)\,\mathbf{x}(n)$, and $\mu$ the step size.

The instantaneous squared error is $\xi(n) = \tfrac{1}{2}e^2(n)$, and its gradient with respect to $\mathbf{w}$ is $-e(n)\,\mathbf{x}(n)$. Substituting that gradient into the standard SGD form recovers the LMS update line for line.

The “stochastic” qualifier deserves the same scrutiny. Classical Wiener filtering minimizes the expected squared error $E[e^2(n)]$, which presupposes knowledge of the input statistics. LMS sidesteps that requirement by using the instantaneous squared error $e^2(n)$ as a high-variance, zero-cost estimator of the expectation. SGD uses the same trick at higher dimensionality: the per-batch loss is a noisy estimate of the population loss, and the noise is doing real work in shaping the optimization trajectory.

The mapping between the two vocabularies for a linear MSE layer is one-to-one:

Symbol	Adaptive filtering (DSP)	Deep learning (ML)
$\mathbf{w}$	Filter coefficients / weight vector	Weight vector / parameters
$\mathbf{x}$	Input vector / tap-delay line	Feature vector / activations
$e(n)$	Desired response $d(n)$ minus output	Target label minus prediction
$\mu$	Step size	Learning rate $\eta$
$\tfrac{1}{2}e^2$	Instantaneous squared error	MSE loss on a single example

So if you have ever shipped an LMS equalizer or echo canceller, you have implemented the local update rule that underlies a large fraction of modern machine learning: small steps proportional to error times input. Notation in Haykin’s Adaptive Filter Theory differs from the PyTorch docs. The mathematics does not.
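The equivalence is easy to check numerically. The sketch below writes the update twice, once in LMS notation and once as an explicit gradient step on $\tfrac{1}{2}e^2$, and then runs a noise-free system identification. The target weights, step size, and input distribution are arbitrary assumptions.

```python
import random


def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def lms_step(w, x, d, mu):
    """LMS: w <- w + mu * e * x, with e = d - w^T x."""
    e = d - predict(w, x)
    return [wi + mu * e * xi for wi, xi in zip(w, x)]


def sgd_step(w, x, d, lr):
    """SGD on loss 0.5 * (d - w^T x)^2: w <- w - lr * gradient."""
    grad = [-(d - predict(w, x)) * xi for xi in x]
    return [wi - lr * gi for wi, gi in zip(w, grad)]


random.seed(0)
true_w = [0.5, -0.3]          # hypothetical system to identify
w = [0.0, 0.0]
for _ in range(2000):
    x = [random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)]
    d = predict(true_w, x)    # noise-free desired response
    w = lms_step(w, x, d, mu=0.05)

print("one LMS step:", lms_step([0.1, -0.2], [1.0, 2.0], 0.7, 0.05))
print("one SGD step:", sgd_step([0.1, -0.2], [1.0, 2.0], 0.7, 0.05))
print("identified weights:", [round(wi, 6) for wi in w])
```

The two single-step results agree to rounding, and the loop recovers the target weights. Swapping `lms_step` for `sgd_step` in the loop changes nothing but the name.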

Multi-layer networks add the chain rule (backpropagation) so error can be attributed back through nonlinear layers. The local update at any linear layer trained on MSE is still the same structural move: adjust weights in proportion to error and activations. Momentum, Adam, and adaptive learning rates are engineering on top of that spine, not departures from it.

Step Size, Learning Rate, and the Geometry of the Loss Surface

In telecommunications, the step size $\mu$ governs the classic compromise between convergence speed and steady-state misadjustment. Too large and the filter can diverge or oscillate. Too small and the filter cannot track a fast-fading channel or a moving echo path. Adaptive filtering textbooks devote whole chapters to stability bounds on $\mu$, usually expressed in terms of input power and filter length, and to variants designed to fix the worst-case behavior.

There is a more specific geometric point worth surfacing, because it carries directly into deep learning. For LMS, the stability bound on $\mu$ is set by the largest eigenvalue of the input autocorrelation matrix $\mathbf{R} = E[\mathbf{x}\mathbf{x}^\top]$ (for convergence in the mean, $0 < \mu < 2/\lambda_{\max}$), while the slowest mode converges at a rate set by the smallest eigenvalue. The eigenvalue spread $\lambda_{\max}/\lambda_{\min}$ therefore controls how slow convergence is at any stable step size. When $\mathbf{R}$ is poorly conditioned (long tap-delay lines with strongly correlated inputs are a familiar offender), convergence in slow eigen-directions becomes painful, and a step size that is safe in one direction is too aggressive in another. Increasing the filter length tends to worsen this conditioning rather than help it.

This is the same pathology that shows up in deep learning when the loss surface is poorly conditioned. Plain gradient descent crawls along flat directions and bounces along steep ones. Per-coordinate scaling inside Adam, preconditioners, and the entire warm-up / cosine-decay literature exist to attack different facets of that conditioning problem. The DSP community called these eigenvalue spread problems for decades before the ML community started calling them ill-conditioned loss landscapes.
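A two-weight toy makes the effect concrete. The sketch iterates the deterministic (mean) gradient recursion on a quadratic bowl with $\mathbf{R} = \begin{bmatrix}1 & \rho\\ \rho & 1\end{bmatrix}$, whose eigenvalues are $1 \pm \rho$. It is not stochastic LMS, and the step-size choice $\mu = 1/\lambda_{\max}$ is an assumption picked to stay safely inside the stability bound.

```python
def iterations_to_converge(rho: float, tol: float = 1e-6,
                           max_iter: int = 100_000) -> int:
    """Gradient descent on 0.5 * err^T R err until the weight error norm < tol."""
    lam_max = 1.0 + abs(rho)
    mu = 1.0 / lam_max                    # safe step: well below 2 / lam_max
    err = [1.0, 0.0]                      # initial weight error, w - w_opt
    for it in range(1, max_iter + 1):
        grad = [err[0] + rho * err[1], rho * err[0] + err[1]]   # R @ err
        err = [err[0] - mu * grad[0], err[1] - mu * grad[1]]
        if (err[0] ** 2 + err[1] ** 2) ** 0.5 < tol:
            return it
    return max_iter


for rho in (0.0, 0.5, 0.9, 0.99):
    spread = (1.0 + rho) / (1.0 - rho)
    print(f"rho = {rho:4.2f}  eigenvalue spread = {spread:7.1f}  "
          f"iterations = {iterations_to_converge(rho)}")
```

The iteration count grows roughly in proportion to the eigenvalue spread, even though the step size is tuned to the fast direction each time.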

In deep learning, the learning rate $\eta$ plays the same role as $\mu$ at higher abstraction. Too high and training diverges or chatters around a minimum. Too low and you underfit or burn compute without making progress. Learning-rate schedules, warm-up, and cosine decay are all variations on a single instinct: the right step size depends on the local geometry and may need to change over time.

Normalization: NLMS, RMSProp, Adam

To handle wide variations in input signal power, adaptive-filtering practitioners developed Normalized LMS (NLMS), which scales the update by the inverse of the current input energy:

$$ \mathbf{w}(n+1) = \mathbf{w}(n) + \frac{\mu}{\|\mathbf{x}(n)\|^2 + \delta} \, e(n) \, \mathbf{x}(n) $$

with $\delta$ a small regularizer that keeps the denominator from collapsing.

There is a real conceptual line from NLMS to modern adaptive optimizers like RMSProp and Adam. There is also a real mechanical distinction that pop-ML writeups tend to flatten. NLMS normalizes by the instantaneous input energy of the current sample. RMSProp and Adam normalize by an exponential moving average of squared gradients. The intent is shared — keep the update from being driven by signal scale rather than error — but they react on different time horizons and stabilize different things in practice.
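The mechanical difference shows up clearly when the two updates are written side by side for the same linear MSE model. This is a simplified sketch: the RMSProp step follows the common textbook form with decay `beta` and stabilizer `eps`, and the default values are assumptions rather than anything a particular framework guarantees.

```python
def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def nlms_step(w, x, d, mu, delta=1e-8):
    """NLMS: one scalar normalizer, the energy of the current input vector."""
    e = d - predict(w, x)
    energy = sum(xi * xi for xi in x)
    scale = mu / (energy + delta)
    return [wi + scale * e * xi for wi, xi in zip(w, x)]


def rmsprop_step(w, x, d, lr, v, beta=0.9, eps=1e-8):
    """RMSProp: per-coordinate normalizer, a running average of squared gradients."""
    e = d - predict(w, x)
    grad = [-e * xi for xi in x]
    v_new = [beta * vi + (1.0 - beta) * g * g for vi, g in zip(v, grad)]
    w_new = [wi - lr * g / (vn ** 0.5 + eps)
             for wi, g, vn in zip(w, grad, v_new)]
    return w_new, v_new


w0 = [0.1, -0.2]
x, d = [1.0, 2.0], 0.7

small = nlms_step(w0, x, d, mu=0.5)
large = nlms_step(w0, [10.0 * xi for xi in x], 10.0 * d, mu=0.5)  # 20 dB hotter
rms_w, rms_v = rmsprop_step(w0, x, d, lr=0.01, v=[0.0, 0.0])

print("NLMS, nominal level :", small)
print("NLMS, 10x amplitude :", large)
print("RMSProp first step  :", rms_w, rms_v)
```

NLMS lands on the same weights at both signal levels because its normalizer is recomputed from the current sample. RMSProp carries state in `v` from one step to the next, so its response to a level change depends on the decay constant.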

Two further points are worth stating without dressing them up.

First, computational cost is part of why LMS and NLMS won and stayed. Both are $O(N)$ per update in the filter length $N$. Recursive Least Squares (RLS) and Newton-style methods give faster theoretical convergence but cost $O(N^2)$ or worse, which is why nobody runs full second-order optimization on a 70-billion-parameter model either. “Good enough updates, very fast” beat “perfect updates, eventually” both in real-time DSP hardware and in the GPU cluster.

Second, normalization by signal-scale statistics is the primary defense against divergence in high-variance environments. The exact statistic differs by algorithm, but the principle is shared across NLMS, RMSProp, Adam, and most of the layer-norm / batch-norm family: bound the update magnitude using something derived from the signal, so error can drive the direction without scale dominating the magnitude.

Tracking, Not Convergence

Adaptive filters were built for non-stationary environments: multipath fading, time-varying echoes, drifting noise floors. The “true” optimal weights are not fixed; they move. The filter is not supposed to converge once and freeze. It is supposed to track. That mindset is much closer to production ML than the static batch fit on a fixed dataset that introductory courses still tend to lead with.

Modern systems face the same phenomenon under different labels:

Distribution shift — the input distribution drifts.
Concept drift — the input-to-target relationship drifts.
Stale features — engineered signals decay as the underlying world changes.

The framing change matters. In a non-stationary environment, “convergence” is the wrong success metric. The right metric is tracking: how well the system follows the moving optimum, with what lag, and at what excess error. DSP literature has been formal about this for decades — excess mean-square error, lag noise, tracking misadjustment — while production ML still tends to discover it informally, usually after a regression has shipped.
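Lag is easy to see in the simplest possible case: a single weight whose optimum drifts linearly, tracked by LMS with unit input and no noise. Under those assumptions the error recursion is $e(n+1) = (1-\mu)\,e(n) + a$, which settles at $a/\mu$ for drift $a$ per step. The drift rate and step sizes below are illustrative.

```python
def steady_state_lag(mu: float, drift_per_step: float = 1e-3,
                     n_steps: int = 5000) -> float:
    """Track a linearly drifting scalar optimum with LMS; return the final error."""
    w_true = 0.0
    w_est = 0.0
    error = 0.0
    for _ in range(n_steps):
        x = 1.0                        # constant unit input keeps the model scalar
        d = w_true * x                 # noise-free desired response
        error = d - w_est * x
        w_est += mu * error * x        # LMS update
        w_true += drift_per_step       # the optimum moves every step
    return error


for mu in (0.01, 0.05, 0.1):
    print(f"mu = {mu:4.2f}  lag = {steady_state_lag(mu):.4f}  "
          f"predicted a/mu = {1e-3 / mu:.4f}")
```

The filter never converges in the static sense; it settles into a constant lag that shrinks as the step size grows. With measurement noise added, a larger step would also raise the gradient-noise contribution, which is the trade-off that tracking analysis formalizes.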

Research on in-context learning in linear models (Akyürek et al., 2022) even investigates which classical learning algorithms are implicitly approximated by transformers under simplified conditions. That is one more sign that the boundary between adaptive signal processing and contemporary ML is thinner than course catalogs suggest.

A Continuous Mathematical Thread

The transition from digital signal processing to artificial intelligence is more a change of vocabulary than a change of fundamental principles. The engineering rigor required to stabilize an echo canceller in 1970 is the same rigor required to train a multi-billion-parameter model today: error-driven updates, step-size discipline, normalization to handle input-scale variance, and the discipline of tracking a moving target rather than converging on a fixed one.

For engineers coming from telecommunications, the move into AI is not a career pivot. It is recognition that the tools they already carry — stability analysis, second-order statistics, non-stationary tracking, complexity-aware design — are exactly the tools the new domain still depends on. For engineers coming from ML, the DSP literature is a large, well-tested archive of solutions to problems that show up again at scale.

If you understand LMS, you already understand a piece of what every deep learning framework is doing when it steps the weights. The rest is scale, architecture, and tooling.

References

Widrow, B., & Hoff, M. E. “Adaptive switching circuits.” IRE WESCON Convention Record, 4, 96–104, 1960.
Haykin, S. Adaptive Filter Theory (4th ed.). Prentice Hall, 2002.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. “Learning representations by back-propagating errors.” Nature, 323, 533–536, 1986.
Sayed, A. H. Adaptive Filters. Wiley-IEEE Press, 2008.
Kingma, D. P., & Ba, J. “Adam: A Method for Stochastic Optimization.” ICLR, 2015. arxiv.org/abs/1412.6980
Tieleman, T., & Hinton, G. “Lecture 6.5 — RMSProp: Divide the gradient by a running average of its recent magnitude.” COURSERA: Neural Networks for Machine Learning, 2012.
Akyürek, E., Schuurmans, D., Andreas, J., Ma, T., & Zhou, D. “What learning algorithm is in-context learning? Investigations with linear models.” 2022. arxiv.org/abs/2211.15661

Deriving MMSE: what the Wiener filter actually minimizes

contact@corebaseit.com (Vincent Bevia) — Sun, 26 Apr 2026 15:00:00 +0200

The Wiener filter has a clean claim: among all linear filters operating on wide-sense stationary input, it produces the minimum mean-square estimation error. That claim has a closed-form proof, and the proof is short enough to walk through completely.

This post derives the Minimum Mean-Square Error (MMSE) starting from the cost function, through the Wiener-Hopf equation, to the final result. Every step is explicit. If you have worked with adaptive filters (LMS, NLMS, RLS), this is the deterministic optimum those algorithms approach iteratively without knowing the statistics in advance.

A companion lab implements the full derivation in MATLAB, Python, C, and C++: MMSE-Wiener-Filter-Lab.

Setup and notation

A linear filter with weight vector $\mathbf{w}$ of length $N$ receives input vector $\mathbf{x}(n)$ and produces the estimate:

\[ y(n) = \mathbf{w}^T \mathbf{x}(n) \]

The desired (target) signal is $d(n)$. The estimation error is:

\[ e(n) = d(n) - y(n) = d(n) - \mathbf{w}^T \mathbf{x}(n) \]

The filter’s job is to choose $\mathbf{w}$ so that $e(n)$ is as small as possible in a statistical sense. For real-valued signals, the cost function is the mean-square error:

\[ \xi(\mathbf{w}) = E\!\left[e^2(n)\right] \]

Two statistical quantities determine everything that follows:

Symbol	Definition	Structure
$\mathbf{R}$	$E[\mathbf{x}(n)\mathbf{x}^T(n)]$	$N \times N$ autocorrelation matrix of the input
$\mathbf{p}$	$E[\mathbf{x}(n)\,d(n)]$	$N \times 1$ cross-correlation between input and desired signal

For a wide-sense stationary process, $\mathbf{R}$ is Toeplitz and positive semi-definite by construction. It is positive definite when no sample in the observation window is a perfect linear combination of the others. That non-degeneracy condition is what guarantees a unique optimum.

Expanding the cost function

Start from the definition and expand the square:

\[ \begin{aligned} \xi &= E\!\left[(d(n) - \mathbf{w}^T \mathbf{x}(n))^2\right] \\ &= E\!\left[d^2(n)\right] - 2\,\mathbf{w}^T E\!\left[\mathbf{x}(n)\,d(n)\right] + \mathbf{w}^T E\!\left[\mathbf{x}(n)\mathbf{x}^T(n)\right]\mathbf{w} \end{aligned} \]

Substituting the definitions of $\mathbf{p}$ and $\mathbf{R}$:

\[ \xi = E\!\left[d^2(n)\right] - 2\,\mathbf{w}^T \mathbf{p} + \mathbf{w}^T \mathbf{R}\,\mathbf{w} \]

This is a scalar-valued quadratic in $\mathbf{w}$. The first term, $E[d^2(n)]$, is the power of the desired signal (its variance when $d(n)$ is zero-mean), fixed by the data. The second and third terms depend on $\mathbf{w}$. When $\mathbf{R}$ is positive definite, $\xi(\mathbf{w})$ is strictly convex with a single global minimum. Geometrically, it is a bowl-shaped surface in $N$-dimensional weight space.

The weight vector $\mathbf{w}$ is the only thing under the designer’s control. $\mathbf{R}$ and $\mathbf{p}$ are determined by the signal environment.

The Wiener-Hopf equation

The minimum of a differentiable convex function is where the gradient vanishes. Taking the gradient of $\xi$ with respect to $\mathbf{w}$:

\[ \nabla_{\mathbf{w}}\,\xi = -2\,\mathbf{p} + 2\,\mathbf{R}\,\mathbf{w} \]

Setting to zero and solving for the optimal weight vector:

\[ \mathbf{R}\,\mathbf{w}_{\text{opt}} = \mathbf{p} \quad\Longrightarrow\quad \mathbf{w}_{\text{opt}} = \mathbf{R}^{-1}\,\mathbf{p} \]

This is the Wiener-Hopf equation [1][2]. It has a unique solution when $\mathbf{R}$ is positive definite (and therefore invertible). The optimal weights depend only on the second-order statistics of the input and the cross-statistics between input and desired signal. No higher-order moments are needed.

One thing to note about practice: computing $\mathbf{R}^{-1}$ directly is $O(N^3)$ and numerically fragile when the eigenvalue spread of $\mathbf{R}$ is large (ill-conditioned input). This is a primary motivation for adaptive algorithms. LMS approximates $\mathbf{w}_{\text{opt}}$ iteratively at $O(N)$ per update, without ever forming or inverting $\mathbf{R}$.
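For small $N$ the system can be solved directly without forming an explicit inverse. The sketch below uses plain Gaussian elimination with partial pivoting on a hypothetical 2-tap example; the values of $\mathbf{R}$ and $\mathbf{p}$ are invented for illustration and are not taken from the companion lab.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix [a | b]
    for col in range(n):
        # Move the largest remaining entry in this column onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):                   # back-substitution
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


# Assumed statistics: unit-power input with lag-1 correlation 0.5.
R = [[1.0, 0.5],
     [0.5, 1.0]]
p = [0.5, 0.25]

w_opt = solve_linear_system(R, p)
print("w_opt =", w_opt)
```

Solving the linear system is generally preferable to computing $\mathbf{R}^{-1}$ and multiplying, since it avoids amplifying rounding error through an explicit inverse.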

Deriving the minimum MSE

To find the error floor, substitute $\mathbf{w}_{\text{opt}} = \mathbf{R}^{-1}\mathbf{p}$ back into the cost function:

\[ \xi_{\min} = E\!\left[d^2(n)\right] - 2\,(\mathbf{R}^{-1}\mathbf{p})^T\,\mathbf{p} + (\mathbf{R}^{-1}\mathbf{p})^T\,\mathbf{R}\,(\mathbf{R}^{-1}\mathbf{p}) \]

The simplification uses three facts. The transpose of a product reverses order: $(\mathbf{AB})^T = \mathbf{B}^T\mathbf{A}^T$. The autocorrelation matrix $\mathbf{R}$ of a real-valued signal is symmetric ($\mathbf{R} = \mathbf{R}^T$), so its inverse is also symmetric: $(\mathbf{R}^{-1})^T = \mathbf{R}^{-1}$. And a matrix times its own inverse yields the identity: $\mathbf{R}^{-1}\mathbf{R} = \mathbf{I}$.

Apply these to the second term:

\[ -2\,(\mathbf{R}^{-1}\mathbf{p})^T\mathbf{p} = -2\,\mathbf{p}^T(\mathbf{R}^{-1})^T\mathbf{p} = -2\,\mathbf{p}^T\mathbf{R}^{-1}\mathbf{p} \]

And the third term:

\[ (\mathbf{R}^{-1}\mathbf{p})^T\,\mathbf{R}\,(\mathbf{R}^{-1}\mathbf{p}) = \mathbf{p}^T\mathbf{R}^{-1}\,\mathbf{R}\,\mathbf{R}^{-1}\mathbf{p} = \mathbf{p}^T\,(\mathbf{I})\,\mathbf{R}^{-1}\mathbf{p} = \mathbf{p}^T\,\mathbf{R}^{-1}\,\mathbf{p} \]

The second term contributes $-2\,\mathbf{p}^T\mathbf{R}^{-1}\mathbf{p}$. The third contributes $+\mathbf{p}^T\mathbf{R}^{-1}\mathbf{p}$. They partially cancel:

\[ \boxed{\xi_{\min} = E\!\left[d^2(n)\right] - \mathbf{p}^T\,\mathbf{R}^{-1}\,\mathbf{p}} \]

Since $\mathbf{R}^{-1}\mathbf{p} = \mathbf{w}_{\text{opt}}$, this is equivalently:

\[ \xi_{\min} = E\!\left[d^2(n)\right] - \mathbf{p}^T\,\mathbf{w}_{\text{opt}} \]

The quantity $\mathbf{p}^T \mathbf{w}_{\text{opt}}$ is a scalar, so it equals its own transpose $\mathbf{w}_{\text{opt}}^T \mathbf{p}$. Both forms appear in the literature [2][3].

What the result says

The MMSE has two components. The first, $E[d^2(n)]$, is the total power of the signal you are trying to estimate. The second, $\mathbf{p}^T \mathbf{R}^{-1} \mathbf{p}$, is the portion of that power the filter can explain using the input data. The difference is what remains unexplained: the irreducible estimation error for any linear filter under these statistics.

The MSE cannot go negative. If $\mathbf{p}^T \mathbf{R}^{-1} \mathbf{p} = E[d^2(n)]$, the filter reconstructs the desired signal with zero error. That happens when $d(n)$ lies entirely in the column span of the input observation matrix. In a channel equalization context, it means the channel distortion is fully invertible within the filter’s tap length.

$\xi_{\min}$ is a hard bound on linear estimators. Deviating from $\mathbf{w}_{\text{opt}}$ by any $\Delta\mathbf{w}$ adds a penalty of $\Delta\mathbf{w}^T \mathbf{R}\,\Delta\mathbf{w}$ to the error. Because $\mathbf{R}$ is positive definite, that penalty is strictly positive for every nonzero $\Delta\mathbf{w}$. Any weight vector other than $\mathbf{w}_{\text{opt}}$ produces strictly higher MSE.
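A quick numerical check of both claims, as a sketch under assumed statistics: the values of `d_power`, `R`, and `p` are hypothetical (with $\mathbf{w}_{\text{opt}} = [0.5, 0]$ solving $\mathbf{R}\mathbf{w} = \mathbf{p}$ for them), and `delta` is an arbitrary perturbation. The code evaluates the quadratic cost $\xi(\mathbf{w}) = E[d^2(n)] - 2\mathbf{p}^T\mathbf{w} + \mathbf{w}^T\mathbf{R}\mathbf{w}$ directly and compares it with the shortcut form of $\xi_{\min}$ and with the penalty term.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def mse(w, d_power, R, p):
    """Quadratic cost xi(w) = E[d^2] - 2 p^T w + w^T R w."""
    return d_power - 2.0 * dot(p, w) + dot(w, matvec(R, w))


# Hypothetical statistics; w_opt satisfies R w_opt = p for these numbers.
d_power = 1.0
R = [[1.0, 0.5], [0.5, 1.0]]
p = [0.5, 0.25]
w_opt = [0.5, 0.0]

xi_min = d_power - dot(p, w_opt)           # shortcut form E[d^2] - p^T w_opt
delta = [0.1, -0.2]                        # arbitrary deviation from w_opt
w = [wo + dw for wo, dw in zip(w_opt, delta)]
penalty = dot(delta, matvec(R, delta))     # delta^T R delta
excess = mse(w, d_power, R, p) - xi_min    # extra MSE caused by the deviation

print(xi_min, mse(w_opt, d_power, R, p), excess, penalty)
```

The excess MSE and the penalty agree up to floating-point rounding, and the cost evaluated at `w_opt` matches the shortcut form.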

One more property worth stating explicitly: the Wiener filter is optimal among linear filters under MSE without requiring the input to be Gaussian. If the input and desired signal happen to be jointly Gaussian, the conditional expectation $E[d(n)|\mathbf{x}(n)]$ is itself linear, and the Wiener filter becomes optimal among all estimators, linear and nonlinear [2, Ch. 2][5]. For non-Gaussian inputs, nonlinear estimators can potentially do better, but the Wiener filter remains the best you can do with a linear structure.

Connection to LMS and adaptive filtering

The Wiener-Hopf equation requires $\mathbf{R}$ and $\mathbf{p}$ in closed form. In most real systems (equalizers, echo cancellers, noise reduction) you do not have these. The channel changes. The statistics drift. Estimating $\mathbf{R}$ from finite data and inverting it is both noisy and expensive.

LMS sidesteps the problem by replacing the expected gradient $-2\mathbf{p} + 2\mathbf{R}\mathbf{w}$ with the instantaneous gradient $-2\,e(n)\,\mathbf{x}(n)$, then taking a small step against it at each sample, with the factor of 2 absorbed into the step size $\mu$:

\[ \mathbf{w}(n+1) = \mathbf{w}(n) + \mu\,e(n)\,\mathbf{x}(n) \]

This is stochastic gradient descent on $\xi$. The noise in the gradient estimate is the cost of not needing $\mathbf{R}$ and $\mathbf{p}$. The step size $\mu$ controls the trade-off between convergence speed and steady-state misadjustment. Its stability bound is set by the largest eigenvalue of the same $\mathbf{R}$ that the Wiener-Hopf equation inverts directly, and the convergence speed by the eigenvalue spread of that matrix [2, Ch. 5][3].

The Wiener filter defines the destination. LMS is one way to get there without knowing the terrain. The MMSE is the floor that LMS approaches but does not reach: for any $\mu > 0$, the steady-state excess MSE is strictly positive. That gap is the price of real-time adaptation.
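The behavior can be sketched with a small system-identification run. Everything here is an assumption for illustration: the unknown 2-tap system `w_true`, the white unit-power input (so that $\mathbf{R} = \mathbf{I}$ and $\mathbf{w}_{\text{opt}}$ equals `w_true`), the noise level, and the step size `mu`. It is not the code from the companion repository.

```python
import random

rng = random.Random(0)          # fixed seed so the run is repeatable
w_true = [0.5, -0.3]            # hypothetical unknown system to identify
mu = 0.01                       # step size (assumed; small relative to input power)
w = [0.0, 0.0]                  # adaptive weights, started at zero
x_prev = 0.0

for n in range(5000):
    x_now = rng.gauss(0.0, 1.0)                     # white, unit-power input sample
    x = [x_now, x_prev]                             # tap-input vector x(n)
    noise = rng.gauss(0.0, 0.1)                     # measurement noise on d(n)
    d = w_true[0] * x[0] + w_true[1] * x[1] + noise
    y = w[0] * x[0] + w[1] * x[1]                   # filter output
    e = d - y                                       # error e(n)
    w = [wi + mu * e * xi for wi, xi in zip(w, x)]  # LMS update
    x_prev = x_now

print(w)
```

The weights settle close to `w_true` but keep jittering around it, which is the steady-state misadjustment described above. A larger `mu` converges faster and jitters more.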

Lab

The full derivation is implemented in MATLAB, Python, C, and C++ in the companion repository: MMSE-Wiener-Filter-Lab.

References

[1] N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series, MIT Press, 1949.
[2] S. Haykin, Adaptive Filter Theory, 5th ed., Pearson, 2014.
[3] B. Widrow and S. D. Stearns, Adaptive Signal Processing, Prentice Hall, 1985.
[4] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, 3rd ed., Pearson, 2010.
[5] S. M. Kay, Fundamentals of Statistical Signal Processing, Vol. I: Estimation Theory, Prentice Hall, 1993.

Maximum Entropy Inverse Reinforcement Learning: Understanding the Trajectory Formula

contact@corebaseit.com (Vincent Bevia) — Thu, 23 Apr 2026 10:00:00 +0100

Inverse reinforcement learning (IRL) asks a different question from classical RL: instead of assuming a reward function and learning a policy, you observe expert behavior and infer what reward would make that behavior look rational. One of the cleanest probabilistic formulations is the maximum entropy trajectory model. This post is a practical engineering note on what the formula means, why entropy matters, and where the Markov decision process (MDP) shows up under the hood.

The formula

Over candidate trajectories $T_i$, the model defines:

$$ P(T_i \mid \mathbf{w}) = \frac{1}{Z(\mathbf{w})} \exp\Bigl( \sum_{(s,a) \in T_i} \mathbf{w} \cdot \boldsymbol{\phi}(s,a) \Bigr) $$

$T_i$ — a complete trajectory (a sequence of state–action pairs).
$\mathbf{w}$ — a weight vector; in linear IRL, $\mathbf{w}$ parameterizes the implied reward $R(s,a) = \mathbf{w} \cdot \boldsymbol{\phi}(s,a)$.
$\boldsymbol{\phi}(s,a)$ — a feature vector for taking action $a$ in state $s$ (progress toward a goal, proximity to obstacles, smoothness, etc.).
$\mathbf{w} \cdot \boldsymbol{\phi}(s,a)$ — scalar reward contribution for one step.
$\sum_{(s,a) \in T_i} \mathbf{w} \cdot \boldsymbol{\phi}(s,a)$ — trajectory score (total reward along the path).
$Z(\mathbf{w})$ — the partition function: sum of $\exp(\text{score})$ over all trajectories so probabilities normalize to one.

Notation at a glance

Term	Meaning
$T_i$	One candidate trajectory, path, or behavior sequence
$\mathbf{w}$	Learned weights; defines the linear reward in this setup
$\boldsymbol{\phi}(s,a)$	Features for $(s,a)$: progress, risk, smoothness, compliance, …
$\mathbf{w} \cdot \boldsymbol{\phi}(s,a)$	Scalar reward for one step
$\sum_{(s,a) \in T_i}$	Sum of step rewards along the trajectory
$\exp(\cdot)$	Turns scores into positive, unnormalized masses
$Z(\mathbf{w})$	Normalizer over trajectories (the hard part in large spaces)

Intuition: a softmax over trajectories

Operationally, the recipe is:

Score each trajectory by summing $\mathbf{w} \cdot \boldsymbol{\phi}(s,a)$ along the path.
Exponentiate each score.
Divide by the sum of all exponentiated scores.

That is a softmax over trajectories: higher-scoring paths get higher probability, but nothing is forced to probability one unless the data and weights make that inevitable.

That matters for real experts. Demonstrations are rarely perfectly deterministic; several trajectories can be good. Maximum-entropy modeling keeps that uncertainty instead of collapsing everything onto a single “best” path.

IRL, MDPs, and feature matching

Forward RL: given dynamics and reward, find a good policy. IRL: given demonstrations, infer a reward (or reward parameters) that rationalizes them.

In maximum-entropy IRL, one assumes $R(s,a) = \mathbf{w} \cdot \boldsymbol{\phi}(s,a)$. The trajectory score is the sum of per-step rewards. Expert data are treated as if they were drawn from a distribution where higher-reward trajectories are exponentially more likely — exactly the Boltzmann form above.

Estimation goal: find $\mathbf{w}$ such that observed demonstrations are likely under $P(T \mid \mathbf{w})$. In practice, that means learning what the expert seems to care about — safety, efficiency, smooth motion, rule compliance — only through the features you encode.

A standard theoretical consequence is feature matching (in expectation): the model’s distribution aligns expected feature counts with those implied by the demonstrations. If experts consistently avoid obstacles and move smoothly, the inferred reward should induce similar statistics. Feature design is not cosmetic; it is the language in which preferences become identifiable.
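Feature matching can be seen directly in a toy fit. For this model the gradient of the mean log-likelihood with respect to $\mathbf{w}$ is the demonstrated feature expectation minus the model's expected feature counts, so gradient ascent stops when the two agree. The sketch below assumes a tiny candidate set that can be enumerated. The feature counts, the demonstration indices, and the learning rate are all hypothetical.

```python
import math

# Feature counts f_T = sum of phi(s, a) along each candidate trajectory (hypothetical).
feature_counts = [[2.0, 2.0], [1.0, 2.0], [2.0, 1.0]]
# Indices of the demonstrated trajectories (hypothetical expert data).
demos = [0, 2, 2, 1]


def trajectory_probs(w):
    scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in feature_counts]
    m = max(scores)                               # shift scores for numerical stability
    unnorm = [math.exp(s - m) for s in scores]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]


def expected_features(w):
    probs = trajectory_probs(w)
    return [sum(p * f[k] for p, f in zip(probs, feature_counts)) for k in range(2)]


# Empirical feature expectation from the demonstrations.
f_demo = [sum(feature_counts[i][k] for i in demos) / len(demos) for k in range(2)]

w = [0.0, 0.0]
for _ in range(2000):
    f_model = expected_features(w)
    grad = [fd - fm for fd, fm in zip(f_demo, f_model)]  # gradient of mean log-likelihood
    w = [wi + 0.5 * g for wi, g in zip(w, grad)]         # plain gradient ascent

print(f_demo, expected_features(w), w)
```

At convergence the model's expected features equal the demonstrated ones. In a real problem the expectation under the model cannot be enumerated like this and has to be computed or approximated inside the MDP.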

Where the MDP enters

Trajectories $T_i$ are usually feasible paths in an environment: states, actions, and transitions. In the finite case, that is a finite MDP: discrete states and actions, and transitions $P(s' \mid s, a)$ constrain which sequences are valid. The formula does not print the transition table, but candidate trajectories are generated inside that structure. When the state space is huge, you cannot enumerate all trajectories; $Z(\mathbf{w})$ is approximated with dynamic programming, sampling, or other tractable surrogates — that is the central engineering bottleneck.
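To make the dynamic-programming route concrete, here is a minimal sketch that computes $Z(\mathbf{w})$ two ways on a toy problem: by brute-force enumeration of every action sequence, and by a backward recursion over states. It assumes deterministic transitions and a fixed horizon, which keeps the recursion simple; stochastic dynamics need more care. The MDP, the features in `phi`, and the weights are hypothetical.

```python
import itertools
import math

# Hypothetical toy MDP: 3 states, 2 actions, deterministic transitions, fixed horizon.
N_STATES, ACTIONS, HORIZON = 3, (0, 1), 4
w = [1.0, -0.5]


def step(s, a):
    # Action 0 stays put; action 1 moves to the next state (wrapping around).
    return s if a == 0 else (s + 1) % N_STATES


def phi(s, a):
    # Feature 1: the agent moved. Feature 2: the agent is in state 2 (a "risky" state).
    return [float(a == 1), float(s == 2)]


def reward(s, a):
    return sum(wi * fi for wi, fi in zip(w, phi(s, a)))


def z_bruteforce(s0):
    """Sum exp(score) over every action sequence of length HORIZON."""
    total = 0.0
    for actions in itertools.product(ACTIONS, repeat=HORIZON):
        s, score = s0, 0.0
        for a in actions:
            score += reward(s, a)
            s = step(s, a)
        total += math.exp(score)
    return total


def z_dp(s0):
    """Backward recursion: z[s] is the partition function of the remaining steps."""
    z = [1.0] * N_STATES                      # no steps left: empty path has score 0
    for _ in range(HORIZON):
        z = [
            sum(math.exp(reward(s, a)) * z[step(s, a)] for a in ACTIONS)
            for s in range(N_STATES)
        ]
    return z[s0]


print(z_bruteforce(0), z_dp(0))
```

Enumeration visits `len(ACTIONS) ** HORIZON` sequences, while the recursion does `HORIZON * N_STATES * len(ACTIONS)` updates. That gap is why the recursion is preferred when the state space is small enough to tabulate.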

What “maximum entropy” means here

Entropy measures how spread out a distribution is. High entropy: mass over many trajectories. Low entropy: mass concentrated on a few.

Maximum entropy means: among all distributions that satisfy chosen constraints (typically, matching statistics of the demonstrations, especially feature expectations), pick the distribution that is otherwise least committal — it adds no extra assumptions beyond those constraints.

If several trajectories fit the data, probability should spread across them instead of assigning one path probability one and the rest zero. The exponential-family form is not arbitrary: it arises from maximizing entropy subject to feature-matching constraints. The same structure appears in maximum-entropy IRL, conditional random fields, Gibbs distributions, and softmax policies.

Tiny worked example (features and weights)

Suppose step-level features capture: progress toward goal, collision / obstacle contact, and smooth motion. A weight vector might be $\mathbf{w} = [2.0,\,-5.0,\,1.0]$: progress is good, collisions are strongly penalized, smoothness is mildly rewarded.

Trajectory A reaches the goal while avoiding obstacles → total score 6.
Trajectory B reaches the goal but clips an obstacle → total score 1.

Because $\exp(6) \gg \exp(1)$, A gets much higher probability under $P(T \mid \mathbf{w})$. B is not impossible — only less likely. That is the maximum-entropy mindset: bias toward what explains the expert, without zeroing out plausible alternatives.
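To put numbers on that, the sketch below normalizes the two scores, assuming A and B are the only candidate trajectories (an assumption made for the example). It shifts the scores by their maximum before exponentiating, a common trick that leaves the probabilities unchanged and avoids overflow for large scores.

```python
import math


def softmax(scores):
    m = max(scores)                          # shift by the max so exp() cannot overflow
    unnorm = [math.exp(s - m) for s in scores]
    Z = sum(unnorm)
    return [u / Z for u in unnorm]


p_a, p_b = softmax([6.0, 1.0])               # scores of trajectories A and B
print(round(p_a, 4), round(p_b, 4))
```

With a score gap of 5, A takes a little over 99% of the mass and B keeps under 1%: small, but not zero.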

Python: scores, partition function, probabilities

The snippet below scores a small finite set of candidate trajectories (enumeration is only toy-scale here).

import math

# Reward weights, one per feature.
w = [0.8, -0.2]
# Each trajectory is a list of per-step feature vectors phi(s, a).
T1 = [[1, 0], [0, 1], [1, 1]]
T2 = [[0, 1], [0, 1], [1, 0]]
T3 = [[1, 1], [1, 0]]
trajectories = [T1, T2, T3]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def trajectory_score(T, w):
    # Total reward along the path: sum of w . phi over its steps.
    return sum(dot(w, phi) for phi in T)


scores = [trajectory_score(T, w) for T in trajectories]
unnorm = [math.exp(score) for score in scores]  # unnormalized masses
Z = sum(unnorm)  # partition function over the enumerated set
probs = [u / Z for u in unnorm]

for i, (score, prob) in enumerate(zip(scores, probs), start=1):
    print(f"T{i}: score={score:.3f}, P(T{i}|w)={prob:.3f}")

Each step is a feature vector $\boldsymbol{\phi}$.
The trajectory score is $\sum \mathbf{w} \cdot \boldsymbol{\phi}$.
$\exp(\text{score})$ gives an unnormalized weight; $Z$ sums them; division yields valid probabilities.

Consider a warehouse example: a mobile robot moves to a packing station while avoiding shelves, slowing near people, and preferring smooth motion. A natural feature vector per $(s,a)$ might include: moves toward goal, close to obstacle, near human worker, smooth motion. Learned $\mathbf{w}$ encodes implicit priorities from demonstrations (positive weights reward preferred behavior, negative weights penalize risk or inefficiency).

Pipeline (conceptual):

Collect expert demonstrations (human teleop, planner traces, etc.).
Represent each trajectory by its feature counts, summing $\boldsymbol{\phi}(s,a)$ along the path; the trajectory score is then $\mathbf{w}$ dotted with those counts, which equals the sum of per-step $\mathbf{w}\cdot\boldsymbol{\phi}$.
Use $P(T \mid \mathbf{w}) \propto \exp(\sum \mathbf{w}\cdot\boldsymbol{\phi})$ so expert-like paths get higher mass.
Fit $\mathbf{w}$ so demonstrated trajectories are likely and expected features match the data.
Deploy the implied reward for planning in new layouts.

In a real warehouse, many paths can be safe. Maximum entropy avoids pretending there is a single “correct” route; it learns a reward that ranks paths while preserving uncertainty where data are ambiguous.

Feature (example)	Typical meaning
Moves toward goal	Progress to target
Close to obstacle	Penalize risk near shelves or walls
Near human worker	Tighter safety margin
Smooth motion	Predictable, executable motion

Numerical walkthrough (two trajectories, two features)

Let $\boldsymbol{\phi}(s,a) = [\text{towardGoal}, \text{energyCost}]$ and $\mathbf{w} = [2,\,-1]$: reward progress toward the goal, penalize energy.

Two toy three-step trajectories (chosen by hand for readability — in real IRL they would come from demos, simulation, or a planner):

Trajectory	Step 1 $\boldsymbol{\phi}$	Step 2 $\boldsymbol{\phi}$	Step 3 $\boldsymbol{\phi}$	Total score $\sum \mathbf{w}\cdot\boldsymbol{\phi}$
$T_1$	$[1,1]$	$[1,0]$	$[1,1]$	4
$T_2$	$[1,1]$	$[0,1]$	$[1,1]$	1

Step scores for $T_1$: $1,\,2,\,1$ → sum 4. For $T_2$: $1,\,-1,\,1$ → sum 1.

Unnormalized weights: $\exp(4) \approx 54.60$, $\exp(1) \approx 2.72$. So $Z(\mathbf{w}) \approx 57.32$, and

$$ P(T_1 \mid \mathbf{w}) \approx 0.953,\qquad P(T_2 \mid \mathbf{w}) \approx 0.047. $$

The lower-scoring trajectory is unlikely but not zero — by design.

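import math

w = [2, -1]  # reward progress toward the goal, penalize energy
# Per-step feature vectors [towardGoal, energyCost] for each trajectory.
T1 = [[1, 1], [1, 0], [1, 1]]
T2 = [[1, 1], [0, 1], [1, 1]]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def score(T, w):
    # Trajectory score: sum of per-step rewards w . phi.
    return sum(dot(w, phi) for phi in T)


s1, s2 = score(T1, w), score(T2, w)
e1, e2 = math.exp(s1), math.exp(s2)
Z = e1 + e2  # partition function over the two candidates
print(round(e1 / Z, 3), round(e2 / Z, 3))  # prints: 0.953 0.047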

Engineering takeaways

Takeaway	Why it matters
Feature quality dominates	Weak or wrong features → weak or misleading inferred rewards.
Maximum entropy reduces brittleness	Multiple near-optimal behaviors can coexist instead of a forced deterministic story.
$Z(\mathbf{w})$ is the hard part	Exact enumeration is intractable in large MDPs; implementations use DP, sampling, or approximations.
IRL targets objectives, not only actions	A learned reward often generalizes to new situations better than pure behavior cloning — when the model is right.

Where else this pattern appears

The same score-then-normalize structure shows up wherever you need a distribution over structured sequences: imitation learning under ambiguity, trajectory prediction in driving, user / gameplay sequence modeling, and structured prediction in ML (exponential models over outputs). The unifying pattern: meaningful features, learned weights, trajectory scores, and a normalized probability over candidates.

Bottom line

The maximum-entropy trajectory model gives a precise way to say: expert-like trajectories should be more probable when they score higher under a hidden linear reward, while the model stays honest about uncertainty when the data do not support a sharper conclusion.

For builders: define features carefully, infer $\mathbf{w}$ so demonstrations and feature statistics are explained, and use the maximum-entropy distribution to avoid overfitting a single story to limited data.

References

Ziebart, B. D., Maas, A. L., Bagnell, J. A., & Dey, A. K. “Maximum entropy inverse reinforcement learning.” AAAI, 2008.
Ng, A. Y., & Russell, S. “Algorithms for inverse reinforcement learning.” ICML, 2000.
Sutton, R. S., & Barto, A. G. Reinforcement Learning: An Introduction (2nd ed.). MIT Press, 2018. incompleteideas.net
Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.

Markov Decision Processes: The Mathematical Foundation of Reinforcement Learning

contact@corebaseit.com (Vincent Bevia) — Wed, 22 Apr 2026 10:00:00 +0100

The Markov Decision Process (MDP) is the standard formal object for sequential decision-making under uncertainty. It separates problem definition — states, actions, how the world evolves, what you want to optimize — from solution methods (value iteration, Q-learning, policy gradients, and their deep variants). That separation is why the same vocabulary shows up across robotics, games, RLHF-tuned language models, and tool-using agents.

I keep coming back to this when people treat LLMs, RL, and “agents” as unrelated product categories. Implementations differ, but state, action, reward, policy, and value recur for a reason: a large class of systems is still answering “what should I do next to maximize something, given what I know now?”

This post builds the MDP core with enough precision to be useful — Markov property, the five-tuple, policies and Bellman equations, how classical methods differ, and inverse reinforcement learning — then connects it to LLMs, RLHF, DPO, and agentic stacks. It is not universal: many deployed models are one-shot predictors or rankers with no explicit sequential RL loop. Where the MDP applies, the mapping is operational, not metaphorical.

The Markov Property and the Five-Tuple

At the center is the Markov property (memorylessness): the next state depends on the history only through the current state and action:

$$ P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots) = P(s_{t+1} \mid s_t, a_t). $$

So the state must summarize whatever matters for the future; if something important is missing from $s_t$, the model is misspecified and you are really in a POMDP (partial observability) — more on that when we get to context windows.

An MDP is usually written as $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

Symbol	Component	Role
$\mathcal{S}$	State space	Configurations the agent can be in
$\mathcal{A}$	Action space	Choices (discrete or continuous)
$P(s' \mid s, a)$	Transition law	Dynamics: where you land after $(s,a)$
$R$	Reward	Immediate signal, often $R(s,a)$ or $R(s,a,s')$
$\gamma$	Discount	$\gamma \in [0,1]$ weights future reward vs. now

The objective is a policy $\pi$ that maximizes expected discounted return $ \mathbb{E}\big[\sum_t \gamma^t R_t\big] $. An optimal policy $\pi^*$ achieves the best achievable value from each state (under the usual regularity conditions).

State and action design is an engineering problem: too little information in $s$ and the Markov assumption is false; too much and you fight the curse of dimensionality. Actions drive algorithm choice — small discrete spaces admit tabular methods; huge discrete vocabularies (e.g. tens of thousands of tokens) or continuous control push you toward function approximation and deep RL.

$\gamma$ sets the effective horizon: $\gamma = 0$ is myopic (only immediate reward); $\gamma$ close to $1$ cares about the long run (in infinite-horizon settings, $\gamma < 1$ keeps returns bounded). Pure $\gamma = 1$ is typical for finite-horizon episodic problems without discounting.
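A short sketch of how $\gamma$ shapes the return, using a hypothetical reward stream of 1 per step over 200 steps. With a constant reward the discounted sum approaches $1/(1-\gamma)$, which is one common way to read $\gamma$ as an effective horizon.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


rewards = [1.0] * 200          # hypothetical: constant reward of 1 per step

for gamma in (0.0, 0.5, 0.9, 0.99):
    g = discounted_return(rewards, gamma)
    print(gamma, round(g, 3))
```

At $\gamma = 0$ only the first reward counts. At $\gamma = 0.9$ the return is close to 10, and at $\gamma = 0.99$ the 200-step window is already too short to reach the limiting value of 100.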

Rewards are the lever everyone feels in production: misspecify them and you get reward hacking — policies that maximize the signal you wrote, not the outcome you wanted. That story continues unchanged in RLHF.

Policies, Value Functions, and the Bellman Equation

A policy $\pi$ maps states to actions (deterministic: $a = \pi(s)$) or distributions (stochastic: $\pi(a \mid s)$). To rank policies, RL uses value functions:

State value $V^\pi(s)$: expected return starting in $s$ and following $\pi$.
Action-value $Q^\pi(s,a)$: expected return from taking $a$ in $s$, then following $\pi$.

They satisfy Bellman consistency. For example, for $V^\pi$:

$$ V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s,a) \big[ R(s,a,s') + \gamma V^\pi(s') \big]. $$
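The equation can be turned into an algorithm by applying its right-hand side repeatedly until the values stop changing (iterative policy evaluation). The sketch below does that on a hypothetical 2-state, 2-action MDP with a uniform random policy. The transition table, rewards, and discount are made up for illustration.

```python
GAMMA = 0.9

# Hypothetical MDP: P[s][a] = [(probability, next_state, reward), ...]
P = {
    0: {0: [(1.0, 0, 0.0)],
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)],
        1: [(1.0, 1, 2.0)]},
}
# Stochastic policy pi[s][a]: uniform over both actions (an assumption for the demo).
pi = {s: {0: 0.5, 1: 0.5} for s in P}


def backup(V, s):
    """Right-hand side of the Bellman expectation equation for state s."""
    return sum(
        pi[s][a] * sum(pr * (r + GAMMA * V[s2]) for pr, s2, r in P[s][a])
        for a in P[s]
    )


V = {s: 0.0 for s in P}
for _ in range(1000):
    V = {s: backup(V, s) for s in P}      # synchronous sweep over all states

# At the fixed point, V equals its own backup in every state.
residual = max(abs(V[s] - backup(V, s)) for s in P)
print(V, residual)
```

Because $\gamma < 1$ the backup is a contraction, so the sweep converges from any starting values. The residual printed at the end is a direct check of Bellman consistency.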

Optimal values $V^*$, $Q^*$ obey the Bellman optimality recursion and are the target of dynamic programming when $P$ and $R$ are known. When the model is unknown, you fall back to sample-based methods.

Solution Methods (High Level)

Family	Idea	When it fits
Dynamic programming	Value / policy iteration using $P$	Model known, moderate $|\mathcal{S}|$
Monte Carlo	Return estimates from full episodes	Episodic, no step-by-step model
Temporal difference	Bootstrap from current estimates (e.g. Q-learning, SARSA)	Online learning, unknown model
Deep RL	Neural nets for $Q$ or $\pi$ (DQN, PPO, …)	Large or continuous state spaces

Deep RL does not change the MDP; it changes how you represent and optimize value and policy when tabulation is impossible — including settings as large as language.

Inverse Reinforcement Learning

Forward RL: given $R$ (and dynamics), find a good $\pi$. Inverse RL (IRL) flips the problem: given demonstrations from a (near-)expert, infer an $R$ that makes those trajectories rational. That matters when rewards are hard to write down but behavior is easy to show — classic examples include imitation-style control and parsing “what the human cared about” from what they did.

Maximum-entropy IRL (Ziebart et al.) models the expert as stochastic but reward-seeking: trajectories are scored by accumulated reward features, and the probability of a trajectory takes a Boltzmann form, with a partition function coupling normalization to the underlying MDP structure. The details are involved; the takeaway for this post is that IRL is still built on the same sequential decision formalism. You are inferring preferences compatible with observed paths, not escaping the MDP language.

Where the Markov Assumption Meets LLMs

In autoregressive generation, a standard idealization is: state = prompt plus all generated tokens so far; action = next token; transition = append token (deterministic at the string level); policy = conditional distribution from the model. Then the next distribution depends only on the prefix — Markov in that state representation.

The usual engineering gap: true conversational or task state may live outside the window or never be observed. That is partial observability again (POMDP / belief-state view). “Lost context” is often a finite window or a wrong state summary rather than a random failure of tone, which is why memory, retrieval, and tool traces are architecture, not cosmetics.

RLHF, DPO, and the Same Sequential Picture

RLHF (InstructGPT-style): the LM is a policy over tokens; a reward model from human preferences scores completions; optimization (often PPO-class in the original stack) increases reward while a KL penalty to a reference policy limits drift. Mapping:

MDP role	Typical RLHF instantiation
State	Prompt + generated prefix
Action	Next token (or chunk)
Transition	Append; dynamics deterministic given action choice
Reward	Learned preference score (minus KL / auxiliary terms)

Framed this way, alignment pain is largely reward specification and optimization under misspecified proxy rewards, the same family of failure modes as classical reward hacking. OpenAI’s GPT-4o sycophancy rollback (April 2025) is a concrete example of short-term preference signals diverging from what you want long term. See also AI Sycophancy here.

DPO (Direct Preference Optimization) and related methods avoid an explicit online RL loop by optimizing from pairwise preferences in a way derived from the RLHF objective — still preference-driven alignment, but not “PPO on tokens” in implementation. The MDP is still the right mental model for what is being aligned (sequential decisions under a goal), even when the optimizer is not vanilla policy gradients.

A Practical Decision Landscape (Not Five Silos)

The field is messier than any chart, but this is a useful lens for choosing tooling:

Situation	Common approach
Known reward, safe exploration	Forward RL (e.g. PPO, Q-learning variants)
Expert demos, unclear reward	IRL / imitation / inverse-optimal-control style methods
Broad open-ended language capability	Pretrained LM (supervised / next-token objective)
Align to human taste or policy	RLHF, DPO-class preference training, or hybrids
Multi-step tools + retrieval + planning	Agentic systems (often LM policy + search / ReAct-style loops)

Agentic systems stack LLMs (policy / world-model substrate), search or tree exploration, RAG (state enrichment), and tools (expanded actions). Under the hood it is still: maintain state, choose actions, observe outcomes, repeat — with stochasticity from both the model and the environment.

The Engineering Takeaway

You do not need to re-derive Bellman on a whiteboard every sprint. You do need to:

Separate problem definition from algorithms — clarify $\mathcal{S}, \mathcal{A}, R$ before debating PPO vs. DPO vs. prompts.
Treat alignment bugs as reward–policy interaction, not vague “personality.”
Design memory and retrieval as state construction when Markov fails.
Ask what each agent demo actually optimizes — implicit reward, success predicate, or human-in-the-loop only.

The MDP is not a graduate-school ornament. It is the backbone that makes much of RL debuggable and much of sequential AI legible — whether or not your README says “Markov.”

References

Bellman, R. Dynamic Programming. Princeton University Press, 1957.
Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.
Sutton, R. S., & Barto, A. G. Reinforcement Learning: An Introduction (2nd ed.). MIT Press, 2018. incompleteideas.net
Ng, A. Y., & Russell, S. “Algorithms for inverse reinforcement learning.” ICML, 2000.
Ziebart, B. D., Maas, A. L., Bagnell, J. A., & Dey, A. K. “Maximum entropy inverse reinforcement learning.” AAAI, 2008.
Ouyang, L. et al. “Training language models to follow instructions with human feedback.” NeurIPS, 2022. arxiv.org/abs/2203.02155
Schulman, J. et al. “Proximal Policy Optimization Algorithms.” 2017. arxiv.org/abs/1707.06347
Rafailov, R. et al. “Direct Preference Optimization: Your Language Model is Secretly a Reward Model.” NeurIPS, 2023. arxiv.org/abs/2305.18290
Yao, S. et al. “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR, 2023. arxiv.org/abs/2210.03629
OpenAI. “Sycophancy in GPT-4o: What happened and what we’re doing about it.” April 2025. openai.com
Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. “Planning and acting in partially observable stochastic domains.” Artificial Intelligence, 101(1–2), 99–134, 1998.

SNR: the number that decides whether a signal survives

contact@corebaseit.com (Vincent Bevia) — Thu, 16 Apr 2026 14:00:00 +0200

Every communication system starts with the same goal: move a signal from one place to another and recover its meaning at the far end. In practice the signal passes through copper, air, fiber, antennas, amplifiers, filters, and ADCs. At each stage it picks up thermal noise, interference, quantization error, phase noise, and distortion.

By the time the waveform reaches the receiver, the question is not whether something arrived. The question is whether the useful signal is strong enough relative to the noise for the receiver to decide what was sent. That ratio is signal-to-noise ratio (SNR).

The diagram above ties the pieces together: SNR as a power ratio, what falling SNR does to a QPSK constellation, why higher-order QAM needs more margin, and where Shannon capacity sets the ceiling.

What SNR measures

SNR compares signal power to noise power:

$$ \mathrm{SNR} = \frac{P_s}{P_n} $$

When $P_s \gg P_n$, symbol decisions are reliable. When the two powers are comparable, the receiver is guessing. When noise dominates, the message is buried.

SNR is therefore both a measurement and a statement about decision confidence. Communication receivers are, at bottom, machines that infer which symbol or bit was transmitted from a noisy observation.

Why engineers use decibels

Power ratios in radio links span enormous dynamic range. Expressing SNR in decibels keeps the arithmetic manageable:

$$ \mathrm{SNR}_{\mathrm{dB}} = 10 \log_{10}\left(\frac{P_s}{P_n}\right) $$

Each 10 dB step is a tenfold change in power ratio:

Linear SNR	SNR (dB)
10	10
100	20
1,000	30
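The conversion in both directions is a one-liner each way. The helper names `to_db` and `from_db` below are illustrative only.

```python
import math


def to_db(ratio):
    """Linear power ratio -> decibels."""
    return 10.0 * math.log10(ratio)


def from_db(db):
    """Decibels -> linear power ratio."""
    return 10.0 ** (db / 10.0)


for ratio in (10, 100, 1000):
    print(ratio, to_db(ratio))

print(from_db(3.0))   # about 2: a 3 dB step is roughly a doubling of power
```

Note the factor of 10 applies to power ratios. Amplitude (voltage) ratios use 20 instead, because power scales with amplitude squared.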

Link budgets, antenna gains, cable losses, and amplifier noise figures are almost always handled in dB for this reason. The underlying idea stays simple: higher SNR means the signal stands farther above the noise floor.

The received signal and the QPSK picture

A simplified continuous-time model of what the receiver sees is:

$$ r(t) = s(t) + n(t) $$

The receiver must map each observation $r(t)$ (or its sampled form) to the most likely transmitted symbol. Small noise keeps the sample near the correct decision region. Large noise pushes it toward a neighbor. That is where bit errors start.

QPSK maps two bits to one of four phases in the I/Q plane. At high SNR, received points cluster tightly around the ideal corners. As SNR falls, the clouds spread. Points cross the I/Q axes that separate symbols, and the demodulator starts flipping bits. The symbol energy is still present; the evidence for which symbol it was is not.

Confirmed: Constellation spreading with falling SNR is the standard AWGN intuition for square QAM and PSK families.

Nuance: Real channels add fading, frequency offset, and ISI. Constellation diagrams then show rotation, elliptical spreading, or smeared trajectories — not just larger circular clouds. SNR alone does not fully describe those impairments.
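For the AWGN-only case, the spreading can be simulated in a few lines. This is a sketch under stated assumptions: Gray-coded QPSK with unit symbol energy, SNR interpreted as $E_s/N_0$ at the decision point, independent Gaussian noise of variance $N_0/2$ on I and Q, and no fading, offset, or ISI. Under those assumptions the textbook bit error rate is $Q(\sqrt{E_s/N_0})$, which the code compares against the simulated rate. Function names are illustrative.

```python
import math
import random


def qpsk_ber_theory(snr_linear):
    """Gray-coded QPSK on AWGN: BER = Q(sqrt(Es/N0)), Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(math.sqrt(snr_linear) / math.sqrt(2.0))


def qpsk_ber_sim(snr_db, n_symbols=20000, seed=0):
    rng = random.Random(seed)
    snr = 10.0 ** (snr_db / 10.0)
    amp = 1.0 / math.sqrt(2.0)             # unit symbol energy: I, Q = +/- 1/sqrt(2)
    sigma = math.sqrt(1.0 / (2.0 * snr))   # noise std per I/Q dimension
    errors = 0
    for _ in range(n_symbols):
        for _ in range(2):                 # one bit on I, one bit on Q
            bit = rng.getrandbits(1)
            tx = amp if bit else -amp
            rx = tx + rng.gauss(0.0, sigma)
            errors += (rx > 0.0) != bool(bit)   # sign decision against the axis
    return errors / (2 * n_symbols)


for snr_db in (0, 5, 10):
    print(snr_db, qpsk_ber_sim(snr_db), qpsk_ber_theory(10.0 ** (snr_db / 10.0)))
```

The simulated and theoretical rates track each other, and the error rate drops by orders of magnitude between 0 dB and 10 dB. That is the cloud tightening described above, expressed as a number.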

SNR and data rate

SNR also limits how aggressively a link can modulate.

BPSK and QPSK place constellation points far apart relative to bits per symbol. They tolerate lower SNR. Higher-order formats — 16-QAM, 64-QAM, 256-QAM — pack more bits into the same bandwidth by moving points closer together. Spectral efficiency rises. Noise margin falls.

That trade-off shows up in adaptive modulation and coding (AMC) in Wi-Fi, LTE, and 5G: when measured SNR (or SINR) is high, the link selects a higher-order modulation and a higher code rate (less redundancy); when it drops, the stack retreats to a more robust mode with lower-order modulation and a lower code rate. That fallback is not waste. It is the system staying inside a BER or BLER target.
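The shrinking noise margin can be quantified with a distance-only estimate. For square $M$-QAM normalized to unit average symbol energy, the minimum distance between points is $\sqrt{6/(M-1)}$. The sketch below computes that distance and the extra SNR needed to restore the QPSK spacing. It deliberately ignores nearest-neighbor counts, bit mapping, and coding gain, so treat the numbers as a rough guide, not a link-budget figure. Function names are illustrative.

```python
import math


def qam_min_distance(M):
    """Minimum distance of square M-QAM normalized to unit average symbol energy."""
    return math.sqrt(6.0 / (M - 1))


def snr_penalty_db_vs_qpsk(M):
    """Extra SNR needed to restore QPSK's minimum distance (distance-only estimate)."""
    return 20.0 * math.log10(qam_min_distance(4) / qam_min_distance(M))


for M in (4, 16, 64, 256):
    bits = int(math.log2(M))
    print(M, bits, round(qam_min_distance(M), 3), round(snr_penalty_db_vs_qpsk(M), 2))
```

Each step from 16-QAM to 64-QAM to 256-QAM adds two bits per symbol and costs roughly 6 dB of SNR on this estimate, which is the kind of threshold an AMC table encodes.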

Connection to Shannon capacity

SNR enters Shannon’s capacity formula for an AWGN channel with bandwidth $B$:

$$ C = B \log_2(1 + \mathrm{SNR}) $$

Here $\mathrm{SNR}$ is a linear power ratio, not a dB value. Bandwidth and SNR both lift capacity, but the log term means returns diminish: doubling transmit power does not double capacity. At high SNR, capacity grows roughly as $\log_2(\mathrm{SNR})$.

Confirmed: Shannon’s bound sets a theoretical ceiling for reliable rate on a noisy channel [1][2].

Interpretation: Pushing more bits per second through a fixed band requires more SNR, more bandwidth, stronger coding gain, or some combination. There is no free margin once you are near the bound.
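The diminishing returns are easy to see numerically. The sketch below evaluates the capacity formula for a hypothetical 1 MHz channel and reports how much capacity is gained by doubling signal power at each SNR. The bandwidth and SNR values are arbitrary examples.

```python
import math


def capacity_bps(bandwidth_hz, snr_db):
    snr = 10.0 ** (snr_db / 10.0)          # the formula needs the linear ratio, not dB
    return bandwidth_hz * math.log2(1.0 + snr)


B = 1e6   # hypothetical 1 MHz channel
for snr_db in (0, 10, 20, 30):
    c = capacity_bps(B, snr_db)
    # Doubling signal power raises SNR by 10*log10(2), about 3 dB.
    c_double = capacity_bps(B, snr_db + 10.0 * math.log10(2.0))
    print(snr_db, round(c / 1e6, 3), round((c_double - c) / 1e6, 3))
```

At high SNR, doubling the power buys close to one extra bit per second per hertz regardless of where you start, so each doubling is a smaller relative improvement than the last.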

In deployed systems, raising transmit power is only one lever — and often not the best. Regulatory EIRP limits, battery drain, PA nonlinearity, and co-channel interference all cap how far “turn it up” can go. Filtering, FEC, MIMO, equalization, and better channel estimation usually share the workload with power.

Not every “SNR” is the same number

SNR is quoted at many points in a receiver chain:

At the antenna port
After the LNA
After channel filtering
At the ADC
After digital gain and correction

Related metrics answer slightly different questions:

Metric	What it emphasizes
$\mathrm{SINR}$	Signal vs. noise plus interference
$\mathrm{EVM}$	How far received symbols deviate from ideal constellation points
$\mathrm{BER}$ / $\mathrm{PER}$	End-to-end error rate after demodulation and decoding
$E_b/N_0$	Bit energy relative to noise spectral density $N_0$

$E_b/N_0$ is the usual figure for comparing modulation and coding schemes on an AWGN reference channel. It ties to SNR through data rate and bandwidth; they are not interchangeable without stating assumptions.
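The relationship follows from $P_s = E_b R_b$ and $P_n = N_0 B$, which gives $\mathrm{SNR} = (E_b/N_0)\,(R_b/B)$. The sketch below applies it in dB, assuming $B$ is the noise bandwidth over which the SNR is measured and $R_b$ is the information bit rate. The function name and the example numbers are hypothetical.

```python
import math


def snr_db_from_ebn0(ebn0_db, bit_rate_bps, noise_bandwidth_hz):
    """SNR = (Eb/N0) * (Rb / B), expressed in dB."""
    return ebn0_db + 10.0 * math.log10(bit_rate_bps / noise_bandwidth_hz)


# Hypothetical link: Eb/N0 = 10 dB, 2 Mbit/s carried in a 1 MHz noise bandwidth.
print(round(snr_db_from_ebn0(10.0, 2e6, 1e6), 2))
```

When the bit rate equals the noise bandwidth the two figures coincide. At 2 bit/s/Hz the SNR sits about 3 dB above $E_b/N_0$, which is why the two cannot be swapped without stating the rate and bandwidth.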

A headline SNR can look acceptable while the link still fails — for example, when phase noise rotates the constellation, timing error shifts samples, co-channel interference raises the effective noise floor, or channel-estimation error smears the reference. EVM, BER, and SINR often localize the failure better than a single RF SNR number.

What falling SNR looks like in practice

On a QPSK link, high SNR gives four separated clusters and negligible errors. Medium SNR widens the clusters; most symbols still decode, but edge cases near boundaries fail. Low SNR produces overlapping clouds: the demodulator runs, yet BER climbs, packets retry, and throughput collapses. To the user it feels like a slow connection. To the receiver it is a maximum-likelihood decision with weak evidence.
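The cliff can be sketched with the textbook AWGN result for Gray-coded QPSK, $\mathrm{BER} = \tfrac{1}{2}\,\mathrm{erfc}\big(\sqrt{E_b/N_0}\big)$. The packet error estimate below assumes independent bit errors and no FEC, which is a simplification, and the 12000-bit packet length is an arbitrary illustrative choice.

```python
import math


def qpsk_ber_awgn(ebn0_db: float) -> float:
    """Theoretical bit error rate of Gray-coded QPSK on an AWGN channel."""
    ebn0_linear = 10.0 ** (ebn0_db / 10.0)
    return 0.5 * math.erfc(math.sqrt(ebn0_linear))


def packet_error_rate(ber: float, packet_bits: int) -> float:
    """PER assuming independent bit errors and no error correction."""
    return 1.0 - (1.0 - ber) ** packet_bits


for ebn0_db in (12.0, 9.0, 6.0, 3.0, 0.0):
    ber = qpsk_ber_awgn(ebn0_db)
    per = packet_error_rate(ber, packet_bits=12000)
    print(f"Eb/N0 = {ebn0_db:4.1f} dB  BER = {ber:.2e}  PER = {per:.3f}")
```

A few dB of lost margin moves the packet error rate from negligible to near certain, which is why the user experiences a collapse rather than a gentle slowdown.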

The same pattern appears across domains — Wi-Fi rate adaptation, cellular handover margins, satellite link closures in rain fade, optical OSNR limits, and ADC dynamic range before quantization noise dominates.

References

[1] C. E. Shannon, “Communication in the presence of noise,” Proceedings of the IRE, vol. 37, no. 1, pp. 10–21, Jan. 1949.
[2] J. G. Proakis and M. Salehi, Digital Communications, 5th ed. McGraw-Hill, 2008. (SNR, modulation, and AWGN channel performance.)
[3] B. Sklar, Digital Communications: Fundamentals and Applications, 2nd ed. Prentice Hall, 2001. (Constellation diagrams, $E_b/N_0$, and link budgets.)
[4] D. Tse and P. Viswanath, Fundamentals of Wireless Communication. Cambridge University Press, 2005. (SINR, fading, and adaptive modulation.)
[5] R. G. Lyons, Understanding Digital Signal Processing, 3rd ed. Pearson, 2011. (SNR in sampled and quantized systems.)

I Spent Years on Adaptive Filters. I Was Already Training Neural Networks.

contact@corebaseit.com (Vincent Bevia) — Tue, 14 Apr 2026 10:00:00 +0100

I spent years implementing LMS-based equalizers and echo cancellers in telecommunications. Only later did I fully appreciate what I had been doing mathematically: the same family of update rules that powers neural network training today.

Not as a loose analogy — as the same structure of optimization. Widrow and Hoff formalized the Least Mean Squares (LMS) algorithm in 1960 for the Adaline. Rumelhart, Hinton, and Williams scaled related ideas through multi-layer networks with backpropagation in 1986. The vocabulary changed from adaptive filtering to deep learning, but the core idea — adjust parameters in the direction that reduces error, one small step at a time — is continuous across both worlds.

This post is my attempt to make that lineage explicit: what LMS actually is, why it is structurally the same rule as stochastic gradient descent on a linear model, how the engineering trade-offs line up, and why non-stationarity remains the hard problem in both domains.

LMS Is Not a Metaphor for Training — It Is the Algorithm

The LMS update for a linear combiner (FIR filter or single Adaline) is:

$$ \mathbf{w}(n+1) = \mathbf{w}(n) + \mu \, e(n) \, \mathbf{x}(n) $$

Here $\mathbf{w}(n)$ is the weight vector at time $n$, $\mathbf{x}(n)$ is the input vector (tap-delay line or feature vector), $e(n) = d(n) - y(n)$ is the error between the desired response $d(n)$ and the output $y(n) = \mathbf{w}^\top(n)\mathbf{x}(n)$, and $\mu$ is the step size.

That is stochastic gradient descent on the instantaneous squared error $\frac{1}{2}e^2(n)$ with respect to $\mathbf{w}$. The gradient of $\frac{1}{2}(d - \mathbf{w}^\top\mathbf{x})^2$ with respect to $\mathbf{w}$ is $-e\,\mathbf{x}$. Walking in the opposite direction of the gradient (or equivalently, in the direction $+e\,\mathbf{x}$ when you define the update as above) is exactly the LMS rule.
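A minimal sketch of that rule in plain Python, used here for system identification: the filter adapts until its weights match an unknown FIR response. The white Gaussian input, the noise-free desired signal, the three-tap response, and the name `lms_identify` are all assumptions made to keep the example small.

```python
import random


def lms_identify(true_weights, mu=0.05, num_samples=2000, seed=0):
    """Adapt an FIR filter with LMS so that it matches `true_weights`."""
    rng = random.Random(seed)
    n_taps = len(true_weights)
    w = [0.0] * n_taps   # adaptive weights w(n)
    x = [0.0] * n_taps   # tap-delay line x(n), newest sample first
    for _ in range(num_samples):
        x = [rng.gauss(0.0, 1.0)] + x[:-1]                  # shift in a new input sample
        d = sum(h * xi for h, xi in zip(true_weights, x))   # desired response d(n)
        y = sum(wi * xi for wi, xi in zip(w, x))            # filter output y(n)
        e = d - y                                           # error e(n)
        # LMS / SGD step: w(n+1) = w(n) + mu * e(n) * x(n)
        w = [wi + mu * e * xi for wi, xi in zip(w, x)]
    return w


estimated = lms_identify([0.5, -0.3, 0.1])
print([round(wi, 4) for wi in estimated])
```

The update line is the whole algorithm. Replace `d` with a label and `x` with a feature vector and it is single-example SGD on a linear model with squared loss.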

So if you have ever shipped an LMS equalizer or echo canceller, you have implemented the foundational learning rule that underlies a huge fraction of modern machine learning: small steps proportional to error times input. The notation in Haykin’s Adaptive Filter Theory differs from PyTorch docs; the mathematics does not.

Multi-layer networks add the chain rule (backpropagation) to compute how error propagates to earlier layers, but the local update at a linear layer trained with mean squared error is still the same structural move: adjust weights in proportion to error and activations. Everything else — momentum, Adam, adaptive learning rates — is engineering on top of that spine.

The Engineering Trade-Offs Are the Same Trade-Offs

In telecommunications, the step size $\mu$ controls the classic compromise: convergence speed versus steady-state misadjustment. Too large — the filter can diverge or oscillate. Too small — the filter cannot track a fast-fading channel or a moving echo path. Entire chapters of adaptive filtering textbooks are devoted to stability bounds on $\mu$ (often expressed in terms of input power and filter length) and to variants that fix the worst-case behavior.
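The compromise is easiest to see in a deliberately tiny case: one weight, a constant input $x$, and no noise. There the weight error shrinks by a factor $(1 - \mu x^2)$ per step, so this toy filter converges only for $0 < \mu < 2/x^2$. This sketch is an assumption-laden illustration of the bound, not the general multi-tap result from the textbooks.

```python
def weight_error_after(mu, steps=50, x=1.0, w_true=1.0):
    """Run scalar LMS with a constant input and return |w_true - w| at the end."""
    w = 0.0
    for _ in range(steps):
        e = w_true * x - w * x   # error between desired and actual output
        w += mu * e * x          # LMS update
    return abs(w_true - w)


for mu in (0.1, 1.0, 1.9, 2.1):
    print(f"mu = {mu:3.1f} -> weight error after 50 steps: {weight_error_after(mu):.3e}")
```

With $x = 1$, a small $\mu$ converges slowly, $\mu = 1$ lands in a single step, $\mu = 1.9$ converges while overshooting on every step, and $\mu = 2.1$ diverges.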

In deep learning, the learning rate $\eta$ plays the same role at a higher level: too high and training diverges or chatters around a minimum; too low and you underfit or burn compute without making progress. The community talks about learning-rate schedules, warm-up, and cosine decay — different names for the same instinct: the right step size depends on the landscape and may need to change over time.

Normalized LMS (NLMS) scales the update by the inverse of the input energy $\|\mathbf{x}(n)\|^2$ (with a small regularizer to avoid division by zero). The goal is stable convergence when input power varies — the same motivation that shows up in adaptive optimizers that normalize updates by running statistics of gradients (RMSProp-style normalization is not identical to NLMS, but the intent — tame the step when the signal scale changes — is shared). The DSP community spent decades refining these ideas for real-time hardware; ML rediscovered many of the same pressures when training became unstable at scale.
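A sketch of the NLMS update, assuming white Gaussian input and a noise-free desired signal. The same step size is run against inputs whose amplitude differs by four orders of magnitude; `input_scale`, `eps`, and the function name are illustrative choices.

```python
import random


def nlms_identify(true_weights, input_scale, mu=0.5, eps=1e-8,
                  num_samples=2000, seed=0):
    """Adapt an FIR filter with NLMS; the step is divided by the input energy."""
    rng = random.Random(seed)
    n_taps = len(true_weights)
    w = [0.0] * n_taps
    x = [0.0] * n_taps   # tap-delay line, newest sample first
    for _ in range(num_samples):
        x = [input_scale * rng.gauss(0.0, 1.0)] + x[:-1]
        d = sum(h * xi for h, xi in zip(true_weights, x))
        e = d - sum(wi * xi for wi, xi in zip(w, x))
        energy = eps + sum(xi * xi for xi in x)   # regularized ||x(n)||^2
        # NLMS step: w(n+1) = w(n) + (mu / (eps + ||x||^2)) * e(n) * x(n)
        w = [wi + (mu / energy) * e * xi for wi, xi in zip(w, x)]
    return w


for scale in (0.01, 1.0, 100.0):
    w_hat = nlms_identify([0.5, -0.3, 0.1], input_scale=scale)
    print(f"input scale {scale:6.2f} -> {[round(wi, 4) for wi in w_hat]}")
```

Plain LMS with a fixed $\mu$ would crawl at the smallest scale and blow up at the largest; dividing by the input energy makes the effective step independent of signal level.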

Non-Stationarity Was Always the Real Problem — and Still Is

Adaptive filters were built for non-stationary environments: multipath fading, time-varying echoes, drifting noise floors. The “true” optimal weights are not fixed; they move. The filter is not supposed to converge once and freeze — it is supposed to track. That mindset is closer to production ML than a static batch fit on a fixed dataset.

Modern systems face the same phenomenon under different labels: distribution shift, concept drift, stale features, changing user behavior, adversarial drift in inputs. The model that was optimal last month is not guaranteed to be optimal this month. Retraining on a schedule, online updates, monitoring, and guardrails are the engineering response — conceptually in the same family as “never assume the channel is static.”

Research on in-context learning in linear models (for example Akyürek et al., 2022) even investigates which learning algorithms are implicitly approximated by transformers under simplified settings — another reminder that the boundary between classical adaptive signal processing and contemporary ML is thinner than course catalogs suggest.

The Bigger Picture

For engineers who came up through telecommunications and signal processing, the move into AI is often described as a career pivot. In my experience, it is closer to a change of vocabulary on top of a continuous mathematical thread: error-driven updates, step-size discipline, stability under non-stationarity, and the centrality of second-order statistics (explicitly in LMS, implicitly in much of modern training).

The boundary between DSP and machine learning was never as sharp as the literature implied. If you understand LMS, you already understand a piece of what every deep learning framework is doing when it steps the weights. The rest is scale, architecture, and tooling — important, but not magic.

References

Widrow, B., & Hoff, M. E. “Adaptive switching circuits.” IRE WESCON Convention Record, 4, 96–104, 1960.
Haykin, S. Adaptive Filter Theory (4th ed.). Prentice Hall, 2002.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. “Learning representations by back-propagating errors.” Nature, 323, 533–536, 1986.
Akyürek, E. et al. “What learning algorithm is in-context learning? Investigations with linear models.” 2022. arxiv.org/abs/2211.15661

Stochastic, Entropy & AI: From Thermodynamics to Information Theory to Modern Machine Learning

contact@corebaseit.com (Vincent Bevia) — Sat, 07 Mar 2026 12:00:00 +0100

I was listening to a podcast the other day about AI and the mathematics behind it — especially stochastic processes, entropy, and probability — and it immediately drew me in. With a background in electrical engineering and telecommunications, I have always found this intersection fascinating, so I decided to write this article. I hope you enjoy it.

There is a thread running through thermodynamics, information theory, and modern artificial intelligence — and it is deeper than analogy. The mathematics used to describe the disorder of a gas, the uncertainty of a message, and the optimization of a neural network are closely related. Understanding that connection is not merely academic. It clarifies why stochasticity and entropy are not bugs in AI systems, but foundational design principles.

This post traces that thread: from Boltzmann and Shannon to cross-entropy loss, temperature settings, and elliptic curves in modern cryptography. Physics, information theory, and language models rest on deeply connected mathematical foundations.

1. Stochastic: Governed by Probability

The word stochastic comes from the Ancient Greek στοχαστικός (stokhastikós), related to στοχάζομαι (“to aim, to guess”) and στόχος (“target”). In modern science and engineering, it means governed by probability. A stochastic process is one where outcomes are not deterministic; they are drawn from a probability distribution. Given the same initial conditions, you may get a different result each time.

The opposite is deterministic — the same input always yields exactly the same output. But not all randomness is the same.

Epistemic vs. Ontic Randomness

A coin flip, at the level of classical mechanics, is deterministic. Given exact knowledge of initial position, velocity, air currents, and surface properties, Newtonian physics would predict the outcome with certainty. The randomness we assign to it is epistemic — a product of our ignorance of initial conditions, not of any fundamental indeterminacy in nature. We model it as a fair Bernoulli trial because we cannot practically measure or control those conditions.

Thermal noise — Johnson–Nyquist noise — is different. It arises from the random thermal agitation of charge carriers in a conductor and is rooted in quantum and statistical mechanics. At practical engineering scales, such fluctuations are treated as fundamentally irreducible and modeled statistically. This is ontic randomness — intrinsic to the physical system.

Epistemic randomness reflects our ignorance; in principle, a perfect observer could remove it. Ontic randomness is intrinsic; no amount of additional information eliminates it. This distinction matters for how we interpret probabilistic models in physics, engineering, and AI.

2. Entropy in Communications: Shannon’s Measure of Uncertainty

In 1948, Claude Shannon published A Mathematical Theory of Communication. He defined a precise mathematical measure of uncertainty — which he called entropy — deliberately borrowing the term from thermodynamics.

Shannon entropy measures the average uncertainty of an information source:

H(X) = −∑ p(x) · log₂ p(x)

Where p(x) is the probability of each possible symbol. If a source always sends the same symbol, entropy is zero — no surprise, no information. If all symbols are equally likely, entropy is maximized — maximum uncertainty and maximum information per symbol.

Shannon entropy is the theoretical lower bound on how many bits you need to encode a message without loss. It answers the question: how unpredictable is this source? A source with low entropy can be heavily compressed. A source with high entropy cannot be compressed further — it is already maximally dense with information.
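A small sketch of the entropy formula, with three example sources chosen for illustration. Terms with zero probability are skipped, following the usual convention that $0 \cdot \log 0 = 0$.

```python
import math


def shannon_entropy_bits(probs):
    """H(X) = -sum p(x) * log2 p(x), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


sources = {
    "constant source": [1.0],
    "skewed source": [0.5, 0.25, 0.25],
    "uniform over four symbols": [0.25, 0.25, 0.25, 0.25],
}
for name, probs in sources.items():
    print(f"{name}: H = {shannon_entropy_bits(probs):.2f} bits/symbol")
```

The skewed source needs 1.5 bits per symbol on average, against 2 bits for the uniform one: that gap is the room a compressor has to work with.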

This is the foundation of data compression and channel capacity theory. The famous Shannon limit defines the maximum rate at which information can be transmitted over a noisy channel with arbitrarily small probability of error.

3. Thermodynamic and Information Entropy: Shared Mathematical Form

The relationship between Shannon’s information entropy and Boltzmann’s thermodynamic entropy is not a metaphor. It is a deep mathematical connection.

Boltzmann defined thermodynamic entropy as:

S = k · ln(W)

Where k is Boltzmann’s constant and W is the number of possible microstates a physical system can occupy. A gas with molecules spread randomly everywhere has more possible configurations — higher entropy. A perfectly ordered crystal has very few microstates — low entropy.

When Shannon showed his formula to John von Neumann and asked what to call it, von Neumann reportedly replied: “Call it entropy. Nobody knows what entropy really is, so in a debate you will always have the advantage.”

Beyond the wit, Shannon recognized something profound: the formulas are closely related in structure. Both describe multiplicity, uncertainty, and the distribution of possible states. Both measure, in different domains, how much is not fully specified about a system.

Landauer’s Principle: Information Has Physical Cost

Maxwell’s Demon, a thought experiment from 1867, imagined a tiny demon sorting fast molecules from slow ones, seemingly reducing thermodynamic entropy without doing work. The resolution, built on Rolf Landauer’s 1961 analysis and later worked out by Charles Bennett, is that the demon must store information about each molecule. When it erases that information from memory, the erasure costs energy and generates heat.

Landauer’s Principle: Erasing one bit of information dissipates at least k · T · ln(2) of energy, where T is the absolute temperature of the environment, and produces a corresponding increase in thermodynamic entropy.
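To put a number on that minimum, the Landauer bound is k · T · ln(2) per erased bit. The sketch below evaluates it at a few temperatures; the 300 K room-temperature figure and the function name are illustrative assumptions.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # Boltzmann constant (exact SI value)


def landauer_limit_joules(temperature_k: float, bits: float = 1.0) -> float:
    """Minimum energy dissipated when erasing `bits` bits at temperature T."""
    return bits * BOLTZMANN_J_PER_K * temperature_k * math.log(2)


for temperature_k in (4.0, 77.0, 300.0):
    energy = landauer_limit_joules(temperature_k)
    print(f"T = {temperature_k:5.1f} K -> {energy:.3e} J per erased bit")
```

At room temperature the bound is on the order of 10⁻²¹ joules per bit, far below what practical logic dissipates, which is why it matters as a principle more than as a design constraint today.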

Information is not abstract. It has a physical cost. The second law of thermodynamics and the limits of data compression are deeply connected constraints viewed from different angles.

Thermodynamics	Information Theory
Physical disorder	Message unpredictability
Heat dissipation	Bit erasure cost
Second law: entropy increases	Cannot compress below Shannon entropy
Equilibrium tends toward high entropy	Random noise is a maximum-entropy source

4. How These Concepts Percolate into AI

Stochastic processes and entropy are structurally embedded in how neural networks are trained, how language models generate text, and how reinforcement learning agents explore.

Cross-Entropy Loss

The most widely used training objective in neural networks — especially for classification and language models — is cross-entropy loss. It measures how different the model’s predicted probability distribution is from the target distribution. Minimizing cross-entropy loss is equivalent to maximizing the likelihood of correct outputs. Every time a language model trains, it is performing optimization grounded in Shannon-style information measures.
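A minimal sketch of cross-entropy for a single three-class example. With a one-hot target the loss reduces to the negative log-probability assigned to the correct class, which is why minimizing it maximizes likelihood. The `eps` clamp and the example probabilities are assumptions for illustration.

```python
import math


def cross_entropy_nats(target, predicted, eps=1e-12):
    """H(target, predicted) = -sum t * ln(q), in nats; eps guards against log(0)."""
    return -sum(t * math.log(max(q, eps))
                for t, q in zip(target, predicted) if t > 0)


one_hot_target = [0.0, 1.0, 0.0]      # correct class is index 1
confident = [0.05, 0.90, 0.05]
hesitant = [0.30, 0.40, 0.30]
wrong = [0.80, 0.10, 0.10]

for name, q in (("confident", confident), ("hesitant", hesitant), ("wrong", wrong)):
    print(f"{name:9s} prediction -> loss = {cross_entropy_nats(one_hot_target, q):.3f} nats")
```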

Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) samples random mini-batches instead of computing gradients over the full dataset. The randomness this introduces is not merely a computational shortcut — it also helps models explore the loss surface more effectively than a fully deterministic optimizer would.

Temperature as an Entropy Control

When a large language model generates the next token, it samples from a probability distribution over the vocabulary. Temperature directly affects the entropy of that distribution:

Low temperature — peaky distribution, near-deterministic, low entropy. The model tends to pick the highest-probability token.
High temperature — flatter distribution, more random, higher entropy. The model explores less likely but sometimes more creative options.

When you adjust temperature in an LLM, you are rescaling the logits, which usually makes the next‑token distribution lower‑entropy (more peaked) at low temperatures and higher‑entropy (flatter) at high temperatures. In doing so, you reshape uncertainty in the output distribution. Physics, information theory, and language models all rely on closely related mathematics.
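The effect is easy to check numerically. The sketch below divides a set of logits by the temperature before the softmax and reports the entropy of the resulting distribution. The four logits are made-up values, not the output of any real model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / T, computed with the max subtracted for stability."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [v / total for v in exps]


def entropy_bits(probs):
    """Shannon entropy of a distribution, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


logits = [2.0, 1.0, 0.5, 0.1]   # hypothetical next-token logits
for temperature in (0.2, 0.7, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(f"T = {temperature:3.1f}  top-token p = {max(probs):.3f}  "
          f"H = {entropy_bits(probs):.3f} bits")
```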

KL Divergence and Entropy Regularization

Kullback–Leibler divergence measures how one probability distribution diverges from another. It equals the cross-entropy between the two distributions minus the entropy of the first, so it is zero only when the distributions match. It is used in settings such as variational autoencoders and RLHF to keep models from drifting too far from a target distribution.
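A short sketch of KL divergence for discrete distributions, including a check of the identity KL(p ‖ q) = H(p, q) − H(p). The two distributions are arbitrary examples, and the code assumes q is nonzero wherever p is.

```python
import math


def entropy_bits(p):
    """H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy_bits(p, q):
    """H(p, q) in bits; assumes q > 0 wherever p > 0."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence_bits(p, q):
    """KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.5]
q = [0.25, 0.75]
print(f"KL(p || q) = {kl_divergence_bits(p, q):.4f} bits")
print(f"KL(q || p) = {kl_divergence_bits(q, p):.4f} bits  (not symmetric)")
print(f"H(p, q) - H(p) = {cross_entropy_bits(p, q) - entropy_bits(p):.4f} bits")
```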

In reinforcement learning, entropy regularization — used in algorithms like Soft Actor-Critic (SAC) — explicitly rewards a policy for maintaining high entropy, encouraging exploration rather than premature collapse into a single deterministic strategy.

These ideas also surface in modern cryptography, where secure systems rely on mathematical structure, one-way functions, and carefully managed randomness.

An elliptic curve is defined by the Weierstrass equation:

y² = x³ + ax + b

In practical cryptography, elliptic curves are defined over finite fields, turning the curve into a discrete set of points with useful algebraic properties.

Public key Q = Private key k × G

Where G is a fixed public generator point and k is a secret private integer. Point multiplication means repeated elliptic-curve addition under well-defined algebraic rules.

Why this is a one-way function: Computing Q from k is efficient. Recovering k from Q and G is believed to be computationally infeasible on classical computers for well-chosen curves and key sizes. This is the Elliptic Curve Discrete Logarithm Problem (ECDLP). A 256-bit elliptic-curve key is commonly regarded as offering security comparable to a 3072-bit RSA key, roughly 128 bits of security.
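The mechanics can be sketched on a toy curve, y² = x³ + 2x + 2 over the integers modulo 17 with generator G = (5, 1). These parameters are a small textbook-style example chosen for illustration and offer no security. The double-and-add loop below is also not constant-time, so treat it as a teaching sketch only.

```python
P_MOD = 17      # toy field prime (real curves use primes of about 256 bits)
A, B = 2, 2     # curve coefficients: y^2 = x^3 + A*x + B (mod P_MOD)
G = (5, 1)      # generator point; None represents the point at infinity


def is_on_curve(point):
    if point is None:
        return True
    x, y = point
    return (y * y - (x * x * x + A * x + B)) % P_MOD == 0


def point_add(p1, p2):
    """Elliptic-curve point addition, including doubling and the identity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                                    # P + (-P) = infinity
    if p1 == p2:
        slope = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD)
    else:
        slope = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    slope %= P_MOD
    x3 = (slope * slope - x1 - x2) % P_MOD
    y3 = (slope * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def scalar_mult(k, point):
    """Compute k * point with double-and-add (not constant-time)."""
    result, addend = None, point
    while k > 0:
        if k & 1:
            result = point_add(result, addend)
        addend = point_add(addend, addend)
        k >>= 1
    return result


def brute_force_log(target, max_k=100):
    """Recover k by trying every value; only feasible because the group is tiny."""
    for k in range(1, max_k + 1):
        if scalar_mult(k, G) == target:
            return k
    return None


private_key = 7
public_key = scalar_mult(private_key, G)
print("Q =", public_key, "on curve:", is_on_curve(public_key))
print("k recovered by exhaustive search:", brute_force_log(public_key))
```

The forward direction takes a handful of additions. The exhaustive search works here only because the group has a few points; at 256 bits the same loop is out of reach, and that asymmetry is the one-way property.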

The connection to entropy is subtle but important: digital signature schemes such as ECDSA rely on per-signature randomness. If that randomness is reused or becomes predictable, the private key may be exposed. In cryptography, randomness is not a convenience. It is a security requirement.

Key Takeaways

Stochasticity is the mechanism — uncertainty is not a failure of understanding, but a fundamental feature of physical and informational systems.
Entropy is the measurement — a precise mathematical way to quantify that uncertainty.
These domains share related mathematical structures — from Boltzmann in the nineteenth century to Shannon in the twentieth, and from there to cross-entropy loss, temperature scaling, and KL divergence in modern AI.
Information has physical cost — Landauer’s principle links information theory and thermodynamics at a physical level.
Cryptography and AI both depend on structured uncertainty — whether in probabilistic modeling, optimization, or secure randomness.

The second law of thermodynamics and the limits of data compression are deeply connected constraints, viewed through different lenses. The disorder of a physical system, the uncertainty of a message, and the probabilistic behavior of a language model can all be described using closely related mathematical ideas. That is one of the most elegant continuities in the history of science.

References

Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379–423.
Landauer, R. (1961). Irreversibility and Heat Generation in the Computing Process. IBM Journal of Research and Development, 5(3), 183–191.
Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. (Chapters on cross-entropy, SGD, and variational methods.)
Hopsworks (2025). LLM Temperature. Covers low-T = peaked/predictable and high-T = flat/creative in LLMs.
Gebodh, N. (2024). Why Does My LLM Have A Temperature?. Softmax and temperature math.

Symbol	Definition	Structure
\(\mathbf{R}\)	\(E[\mathbf{x}(n)\mathbf{x}^T(n)]\)	\(N \times N\) autocorrelation matrix of the input
\(\mathbf{p}\)	\(E[\mathbf{x}(n)\,d(n)]\)	\(N \times 1\) cross-correlation between input and desired signal

Trajectory	Step 1 \(\boldsymbol{\phi}\)	Step 2 \(\boldsymbol{\phi}\)	Step 3 \(\boldsymbol{\phi}\)	Total score \(\sum \mathbf{w}\cdot\boldsymbol{\phi}\)
\(T_1\)	\([1,1]\)	\([1,0]\)	\([1,1]\)	4
\(T_2\)	\([1,1]\)	\([0,1]\)	\([1,1]\)	1

Symbol	Adaptive filtering (DSP)	Deep learning (ML)
\(\mathbf{w}\)	Filter coefficients / weight vector	Weight vector / parameters
\(\mathbf{x}\)	Input vector / tap-delay line	Feature vector / activations
\(e(n)\)	Desired response \(d(n)\) minus output	Target label minus prediction
\(\mu\)	Step size	Learning rate \(\eta\)
\(\tfrac{1}{2}e^2\)	Instantaneous squared error	MSE loss on a single example

Term	Meaning
\(T_i\)	One candidate trajectory, path, or behavior sequence
\(\mathbf{w}\)	Learned weights; defines the linear reward in this setup
\(\boldsymbol{\phi}(s,a)\)	Features for \((s,a)\): progress, risk, smoothness, compliance, …
\(\mathbf{w} \cdot \boldsymbol{\phi}(s,a)\)	Scalar reward for one step
\(\sum_{(s,a) \in T_i}\)	Sum of step rewards along the trajectory
\(\exp(\cdot)\)	Turns scores into positive, unnormalized masses
\(Z(\mathbf{w})\)	Normalizer over trajectories (the hard part in large spaces)

Symbol	Component	Role
\(\mathcal{S}\)	State space	Configurations the agent can be in
\(\mathcal{A}\)	Action space	Choices (discrete or continuous)
\(P(s' \mid s, a)\)	Transition law	Dynamics: where you land after \((s,a)\)
\(R\)	Reward	Immediate signal, often \(R(s,a)\) or \(R(s,a,s')\)
\(\gamma\)	Discount	\(\gamma \in [0,1]\) weights future reward vs. now

Family	Idea	When it fits
Dynamic programming	Value / policy iteration using \(P\)	Model known, moderate \(|\mathcal{S}|\)
Monte Carlo	Return estimates from full episodes	Episodic, no step-by-step model
Temporal difference	Bootstrap from current estimates (e.g. Q-learning, SARSA)	Online learning, unknown model
Deep RL	Neural nets for \(Q\) or \(\pi\) (DQN, PPO, …)	Large or continuous state spaces
