From MATLAB to metal: 6 hard-won lessons in SoC integration

contact@corebaseit.com (Vincent Bevia) — Fri, 12 Jun 2026 10:00:00 +0200

The PHY algorithm passes every simulation in MATLAB. The Bit Error Rate (BER) curve hits the theoretical bound. The Error Vector Magnitude (EVM) budget has margin. The golden model is signed off. Then the design goes onto silicon, and the system fails to lock. Or it locks, but sits on a BER floor that no amount of loop-gain tuning can break through.

I have seen this pattern enough times to know where it comes from. The gap between a floating-point MATLAB model and a working SoC is not one problem. It is a stack of them: fixed-point quantization, compiler assumptions about memory visibility, register interface timing, coefficient update atomicity. Each is individually tractable. Together, they form a discipline that rarely shows up in a single textbook or course.

These are six lessons from that discipline. All of them cost real silicon time to learn.

Languages are a pipeline, not a resume line

A common mistake is treating MATLAB, C/C++, and Python as separate skills. In practice, they are stages in one verification pipeline:

MATLAB defines the golden reference: the algorithm, the expected BER curve, the coefficient sets, the input/output vectors. C implements the firmware-facing datapath and hardware register interface. C++ provides simulation infrastructure and reusable DSP blocks (RAII for resource management, templates for parameterized filter stages) without obscuring the hardware cost model. Python runs regression harnesses, BER sweeps, and sample-by-sample comparisons between hardware captures and MATLAB references.

The flow looks like this:

Reference Model → Vector Export → C/C++ Implementation → Register Interface → Hardware Capture → Python Comparison → Debug → Spec Update.

That last step matters more than it usually gets credit for. The loop only closes when the debug results feed back into the specification. Specifications that are not updated after bring-up become a liability on the next tape-out. The documented behavior and the actual silicon behavior quietly diverge, and the next team inherits assumptions that no longer match the hardware they are building on.

Each language handles a different fidelity level and a different verification role. Treating them as isolated tools breaks the chain of evidence between “mathematically correct” and “physically correct on this die.”

The register map is a contract, not a header file

The boundary between hardware and firmware is the decision with the highest cost of change in SoC design. Once the chip is taped out, the register interface is fixed. Firmware can be patched. Silicon cannot.

Hardware should own the sample-rate and symbol-rate datapath: FFT/IFFT engines, FIR filters, FEC datapaths, timing recovery (the Timing Error Detector and interpolator). These run at rates where firmware cannot keep up.

Firmware should own intelligence and policy: loop sequencing, gain updates (Kp/Ki in a PLL, step-size μ in an equalizer), power-state transitions, and calibration routines. These change across operating modes, channel conditions, and device revisions.

The architectural discipline is in what goes where and how the two sides communicate through registers.

Mixing control (read/write) and status (read-only) fields in the same register invites Read-Modify-Write (RMW) hazards. The firmware reads the register, modifies a control bit, and writes it back. Between the read and the write, the hardware sets a status bit. The write overwrites it. The event is lost. The fix is structural: keep control and status in separate registers. This is standard practice in ARM AMBA peripheral designs and is explicitly recommended in the Accellera SystemRDL 2.0 register description methodology [1][2].

For sticky status bits (saturation flags, overflow indicators, lock-loss events), use Write-1-to-Clear (W1C) semantics. The hardware sets the bit on the event. The firmware clears it by writing a 1 to that bit position. The event persists until explicitly acknowledged. Writing 0 has no effect, which eliminates the RMW race entirely for that field.
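A minimal sketch of why a W1C acknowledge should be a plain write of the bit mask and never a `|=`. The bit positions and helper names are hypothetical, and the register is simulated with a small software model so the behavior can be checked on a host machine:

```c
#include <stdint.h>

#define STATUS_SAT_BIT   (1u << 1)  /* hypothetical sticky saturation flag */
#define STATUS_LOSS_BIT  (1u << 2)  /* hypothetical sticky lock-loss flag */

/* Software model of a W1C write: each 1 in `value` clears that bit,
 * each 0 leaves the bit unchanged. Returns the new register contents. */
static inline uint32_t w1c_model_write(uint32_t current, uint32_t value) {
    return current & ~value;
}

/* Correct acknowledge: write only the bit being acknowledged. */
static inline uint32_t w1c_ack(uint32_t current, uint32_t bit) {
    return w1c_model_write(current, bit);
}

/* Anti-pattern (what `reg |= bit` does): the read returns every pending
 * flag, and writing them all back clears events nobody has handled yet. */
static inline uint32_t w1c_ack_rmw(uint32_t current, uint32_t bit) {
    uint32_t readback = current;
    return w1c_model_write(current, readback | bit);
}
```

With both flags pending, `w1c_ack` clears only the saturation flag and leaves the lock-loss event visible, while the read-modify-write form silently acknowledges both.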

`volatile` is necessary, and not always sufficient

A classic bring-up surprise: the firmware reads a hardware register and gets stale data. The engineer stares at the waveform, confirms the hardware updated the value two microseconds ago, and the C code is still reading the old one.

The cause is compiler optimization. The C standard permits a conforming compiler to cache the value of a memory read if no intervening software operation modifies that address (ISO/IEC 9899:2011, §6.7.3) [3]. On an SoC, hardware modifies memory-mapped registers asynchronously to the CPU. The compiler has no visibility into that.

The volatile qualifier instructs the compiler: do not cache reads or defer writes to this address. Every access must produce an actual bus transaction.

#define IQ_STATUS (*(volatile uint32_t *)0x40001004)

uint32_t status = IQ_STATUS; // guaranteed load from 0x40001004

That handles the compiler. It does not handle the memory system.

On multi-core SoCs or any system where caches sit between the CPU and the peripheral bus, volatile alone is insufficient. You also need cache management (invalidate before reads, clean/flush after writes) or the memory-mapped I/O region must be configured as Device or Strongly-Ordered memory in the MMU/MPU page tables [4]. On ARM Cortex-M class cores without data caches, volatile is typically the whole story. On Cortex-A cores running Linux, the kernel’s readl()/writel() accessors bundle the volatile semantics with the required memory barriers and, on cached systems, operate through regions already mapped as Device memory.

The short version: volatile is the floor. Whether it is also the ceiling depends on the memory hierarchy between your CPU and your peripheral bus.
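One way to keep the qualifier from being forgotten is to route every register access through small accessor functions. The sketch below is an assumption about how such helpers might look (the names `reg_read32`, `reg_write32`, and `reg_wait_set` are hypothetical), and it covers only the compiler side: barriers and cache maintenance are platform-specific and deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Read a 32-bit register; volatile forces a real load on every call. */
static inline uint32_t reg_read32(const volatile uint32_t *addr) {
    return *addr;
}

/* Write a 32-bit register; the compiler must emit the store. */
static inline void reg_write32(volatile uint32_t *addr, uint32_t value) {
    *addr = value;
}

/* Poll until all bits in `mask` are set, giving up after `max_polls` reads.
 * Returns true on success, false on timeout. */
static inline bool reg_wait_set(const volatile uint32_t *addr,
                                uint32_t mask, uint32_t max_polls) {
    for (uint32_t i = 0; i < max_polls; ++i) {
        if ((reg_read32(addr) & mask) == mask) {
            return true;
        }
    }
    return false;
}
```

On target, a call such as `reg_wait_set((const volatile uint32_t *)0x40001004, 1u << 0, 1000)` would poll the register from the example above. The bounded loop is a design choice: a lock bit that never sets should surface as a timeout, not as a hung CPU.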

Fixed-point success is won in the accumulator

Moving from MATLAB’s double-precision floats to C’s fixed-point integers is where many BER floors originate. The arithmetic is straightforward. The discipline is in managing bit growth, rounding bias, and overflow behavior.

Multiplying two Q1.15 values produces a Q2.30 result. In an N-tap FIR filter, the accumulator sums N such products, adding \(\lceil \log_2 N \rceil\) bits of potential growth. For a 64-tap filter:

\[ \text{ACC\_WIDTH} \geq 16 + 16 + \lceil \log_2(64) \rceil = 38 \text{ bits} \]

Using int64_t for the accumulator provides 64 bits, leaving 26 bits of headroom above the minimum. That margin absorbs coefficient sets you have not tested yet and input sequences that hit worst-case correlation patterns. On Xilinx UltraScale+ devices, the DSP48E2 slice provides a 48-bit accumulator natively [5]; on Intel FPGAs, the variable-precision DSP block goes up to 64 bits in accumulation mode.
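The width formula and the multiply-accumulate loop can be sketched as follows. The function names are hypothetical, and the code assumes Q1.15 samples and coefficients stored as `int16_t`:

```c
#include <stddef.h>
#include <stdint.h>

/* ceil(log2(n)) for n >= 1: extra bits needed to sum n terms. */
static inline unsigned ceil_log2(size_t n) {
    unsigned bits = 0;
    while (((size_t)1 << bits) < n) {
        ++bits;
    }
    return bits;
}

/* Minimum accumulator width for an N-tap FIR with the given operand widths. */
static inline unsigned acc_width_bits(unsigned data_bits, unsigned coef_bits,
                                      size_t taps) {
    return data_bits + coef_bits + ceil_log2(taps);
}

/* Multiply-accumulate of Q1.15 samples and coefficients.
 * Each product is Q2.30 and fits in 32 bits; the running sum does not,
 * so the accumulator is 64 bits wide. */
static inline int64_t fir_mac_q15(const int16_t *x, const int16_t *h,
                                  size_t taps) {
    int64_t acc = 0;
    for (size_t i = 0; i < taps; ++i) {
        acc += (int32_t)x[i] * (int32_t)h[i];
    }
    return acc;
}
```

The worst case for 64 taps is every sample and every coefficient at −32768: each product is 2^30 and the sum is 2^36, which overflows a 32-bit accumulator but fits in the 38 bits the formula calls for.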

When scaling the accumulator back to Q1.15 (right-shift by 15), truncation introduces a negative DC bias of half an LSB on average. Over a long integration, that bias accumulates. In a feedback loop (equalizer, PLL loop filter), it shows up as a carrier offset or a BER floor that does not respond to gain adjustments. The fix is to add a rounding constant before the shift:

int16_t result = (int16_t)((acc + (1 << 14)) >> 15);

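This implements round-half-up and removes the half-LSB systematic bias. What remains is a far smaller positive bias from exact ties, which always round up. Note that the cast to int16_t still wraps if the rounded value falls outside the 16-bit range, and that right-shifting a negative signed value is implementation-defined in C (mainstream compilers use an arithmetic shift). Oppenheim and Schafer treat the statistical properties of roundoff noise in detail [6, Ch. 6].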
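To put numbers on the difference, the sketch below sums the quantization error over every possible 15-bit residue for both scaling methods. The helper names are hypothetical, and the measurement assumes the discarded bits are uniformly distributed:

```c
#include <stdint.h>

/* Q30 -> Q15 by plain truncation. */
static inline int32_t q30_to_q15_trunc(int64_t acc) {
    return (int32_t)(acc >> 15);
}

/* Q30 -> Q15 with round-half-up: add half an output LSB before shifting. */
static inline int32_t q30_to_q15_round(int64_t acc) {
    return (int32_t)((acc + (1 << 14)) >> 15);
}

/* Sum of (quantized - exact) over all residues 0..32767, expressed in
 * Q30 LSBs (1/32768 of an output LSB). use_round selects the method. */
static inline int64_t residue_error_sum(int use_round) {
    int64_t sum = 0;
    for (int64_t acc = 0; acc < (1 << 15); ++acc) {
        int64_t q = use_round ? q30_to_q15_round(acc) : q30_to_q15_trunc(acc);
        sum += (q << 15) - acc;
    }
    return sum;
}
```

Dividing each sum by the 32768 residues and by 32768 again to convert to output LSBs gives a mean error of roughly −0.49998 LSB for truncation and roughly +0.000015 LSB for round-half-up. The residual positive value comes from the single tie case, which always rounds up.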

Saturation is the other non-negotiable. If a signed 16-bit value overflows, two’s complement wraps: +32768 becomes −32768. In a signal chain, that is a full-scale phase inversion — a single sample can corrupt an entire symbol or frame. The fix:

static inline int16_t sat16(int32_t x) {
 if (x > 32767) return 32767;
 if (x < -32768) return -32768;
 return (int16_t)x;
}

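On ARM Cortex-M3 and later, the SSAT instruction performs signed saturation in a single cycle [4]. In MATLAB, plain double-precision arithmetic never overflows at 16 bits, so the floating-point model shows neither wrap nor saturation. If you are using the Fixed-Point Designer toolbox, set OverflowAction (and RoundingMethod) explicitly on your fi() objects to match the RTL instead of relying on defaults: the default fimath saturates and rounds to nearest, which matches a saturating datapath but hides wraparound in one that does not saturate. If your golden model’s overflow behavior differs from the hardware’s, the comparison against hardware is testing the wrong reference.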
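Rounding and saturation belong in the same step when the accumulator is wider than 32 bits, because passing a 64-bit value to a helper that takes `int32_t` would already discard the upper bits before the range check runs. A minimal sketch, assuming a Q30 accumulator held in `int64_t` and an arithmetic right shift (the function name is hypothetical):

```c
#include <stdint.h>

/* Scale a Q30 accumulator to Q1.15 with round-half-up, then saturate.
 * The range check happens on the 64-bit value, before any narrowing. */
static inline int16_t q30_to_q15_sat(int64_t acc) {
    int64_t scaled = (acc + (1 << 14)) >> 15;  /* arithmetic shift assumed */
    if (scaled > INT16_MAX) return INT16_MAX;
    if (scaled < INT16_MIN) return INT16_MIN;
    return (int16_t)scaled;
}
```

An accumulator holding exactly +1.0 (2^30) scales to 32768, one step past the Q1.15 maximum, and is clamped to 32767 instead of wrapping to −32768.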

Atomic register commits prevent transient nightmares

Firmware updates a 2×2 IQ correction matrix. Four coefficients: C11, C12, C21, C22. Firmware writes them sequentially over an APB bus. Between the write to C12 and the write to C21, the hardware processes a block of samples using two new coefficients and two old ones. The constellation rotates for a few symbols, then recovers when the remaining writes land. The BER measurement shows an intermittent spike that looks like a calibration instability.

I have spent more debug hours on this failure mode than I care to admit. It is particularly insidious because it is intermittent, timing-dependent, and the steady-state behavior after all writes complete looks correct.

The fix is shadow registers with an atomic commit. Firmware writes new coefficients to shadow locations that the active datapath does not read:

Offset	Register	Access	Function
0x0000	IQ_CTRL	R/W	ENABLE[0], MODE[2:1]
0x0004	IQ_STATUS	R/O, W1C	LOCKED[0] (R/O), SAT[1] (W1C)
0x0010	IQ_C11_SHADOW	R/W	Coefficient C11, Q2.14
0x0014	IQ_C12_SHADOW	R/W	Coefficient C12, Q2.14
0x0018	IQ_C21_SHADOW	R/W	Coefficient C21, Q2.14
0x001C	IQ_C22_SHADOW	R/W	Coefficient C22, Q2.14
0x0020	IQ_COMMIT	W/O	Loads all shadow → active at next symbol boundary

A write to IQ_COMMIT triggers the hardware to transfer all four shadow registers into the active datapath simultaneously, aligned to the next symbol or sample clock boundary. The datapath never sees a partial coefficient set.

This pattern applies to any multi-register parameter group: equalizer tap sets, AGC gain tables, NCO frequency words, anything where a partial update produces a transient that corrupts the signal chain.
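The firmware side of the commit sequence might look like the sketch below. The struct and function names are hypothetical, and several details are assumptions: 32-bit registers, an unused gap at 0x0008 to 0x000C, and a write of 1 as the commit trigger. It also assumes the register region is mapped so that writes reach the peripheral in program order; on systems where that is not guaranteed, a barrier before the commit write would be needed.

```c
#include <stddef.h>
#include <stdint.h>

/* Register block laid out to match the offsets in the table above. */
typedef struct {
    volatile uint32_t ctrl;         /* 0x00 */
    volatile uint32_t status;       /* 0x04 */
    volatile uint32_t reserved[2];  /* 0x08..0x0C, assumed unused */
    volatile uint32_t c_shadow[4];  /* 0x10..0x1C: C11, C12, C21, C22 */
    volatile uint32_t commit;       /* 0x20 */
} iq_regs_t;

/* Stage all four Q2.14 coefficients, then commit them with one write.
 * The active datapath is untouched until the final store. */
static inline void iq_set_coeffs(iq_regs_t *regs, const int16_t coeffs[4]) {
    for (size_t i = 0; i < 4; ++i) {
        /* Zero-extend the 16-bit pattern into the 32-bit register. */
        regs->c_shadow[i] = (uint16_t)coeffs[i];
    }
    regs->commit = 1u;  /* written value is an assumption */
}
```

Keeping the four shadow writes and the commit inside one function means no caller can update a single coefficient and forget the commit, or commit a half-staged set.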

Debugging by bisection

When the BER is wrong, comparing the final output against the golden model tells you that something failed. It does not tell you where. On a signal chain with six or eight processing stages, that distinction is the difference between hours of debug and weeks.

The method that works is bisection. Place capture buffers at internal tap points along the datapath: post-ADC, post-AGC, post-channel-filter, post-FFT, post-equalizer, pre-demapper. Firmware reads each capture buffer into memory. Python compares the captured vectors, sample by sample, against the corresponding intermediate outputs from the MATLAB golden model.

The divergence point is the stage where the implementation first departs from the reference. Upstream stages match; this stage does not. The fault is between those two tap points. The search space collapses from “the entire SoC” to one processing block.

In practice, these capture buffers become permanent regression infrastructure. After any firmware or RTL change, the same tap-point comparison runs automatically. If a stage regresses, the CI pipeline flags the specific block before anyone looks at the overall BER curve. This turns debugging from opinion into measurement.
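The comparison logic itself is small and language-agnostic. A minimal sketch in C, assuming 16-bit captures, a per-sample tolerance in LSBs, and tap points listed in datapath order (the type and function names are hypothetical). It walks the stages linearly; when captures are expensive to collect, a binary search over the same ordered list is the natural refinement.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One tap point: a hardware capture and the matching golden-model vector. */
typedef struct {
    const char    *name;
    const int16_t *captured;
    const int16_t *reference;
    size_t         length;
} tap_point_t;

/* Index of the first sample differing by more than tol LSB,
 * or tap->length if the whole vector matches. */
static inline size_t first_mismatch(const tap_point_t *tap, int tol) {
    for (size_t i = 0; i < tap->length; ++i) {
        if (abs((int)tap->captured[i] - (int)tap->reference[i]) > tol) {
            return i;
        }
    }
    return tap->length;
}

/* Index of the first stage that diverges from the reference,
 * or count if every stage matches. */
static inline size_t first_divergent_stage(const tap_point_t *taps,
                                           size_t count, int tol) {
    for (size_t s = 0; s < count; ++s) {
        if (first_mismatch(&taps[s], tol) < taps[s].length) {
            return s;
        }
    }
    return count;
}
```

The returned stage index, together with the sample index from `first_mismatch`, narrows the fault to one block and one point in time, which is usually enough to pick the right waveform window.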

One practical note: the capture buffers consume block RAM (on FPGA) or SRAM area (on ASIC). Budget for them early in the floorplan. Removing debug infrastructure to save area and then needing it during bring-up is a trade you make only once.

The discipline underneath

SoC integration is the work of making decisions that are hard to reverse, with as much evidence as you can gather before they become permanent. The register map freezes at tape-out. The fixed-point format freezes when RTL is signed off. The coefficient update scheme freezes when the micro-architecture is locked.

The programming side of this work is not separate from the algorithm. It is the mechanism by which the algorithm becomes something that ships. The pipeline from MATLAB to C to silicon is a validation chain. Every link either builds confidence in the design or hides a failure that surfaces during bring-up, when the cost of finding it is highest.

If the register boundary, the quantization scheme, and the update atomicity are right, bring-up is calibration. If any of them are wrong, bring-up is archaeology.

References

[1] ARM, AMBA APB Protocol Specification, ARM IHI 0024E, Rev 2.0, 2021.

[2] Accellera Systems Initiative, SystemRDL 2.0 Register Description Language, 2018.

[3] ISO/IEC 9899:2011, Programming Languages - C (C11 Standard), §6.7.3 (Type qualifiers).

[4] ARM, Cortex-A Series Programmer’s Guide for ARMv8-A, ARM DEN0024A. Device memory ordering, barriers, and DSP/SIMD saturation instructions.

[5] AMD/Xilinx, UltraScale Architecture DSP Slice User Guide (UG579). DSP48E2 accumulator width and cascade paths.

[6] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, 3rd ed., Pearson, 2010.

[7] J. G. Proakis and M. Salehi, Digital Communications, 5th ed., McGraw-Hill, 2008.

[8] R. G. Lyons, Understanding Digital Signal Processing, 3rd ed., Pearson, 2011.
