There’s a gap between using language models and understanding them. You can call an API, get a response, and build a product on top of it — without ever knowing what happens between the prompt and the output. For most use cases, that’s fine. But if you want to make informed engineering decisions about AI systems — what they can do, where they fail, and why — you need to look inside the machine.
That’s why I built BeviaLLM: a miniature GPT-like language model implemented entirely from scratch using Python and NumPy. No PyTorch. No TensorFlow. No autograd. Every matrix multiplication, every gradient calculation, every optimization step is explicit and traceable.
The full implementation and playbook are available:
- Source code: github.com/Bevia/BeviaLLM
- Playbook PDF: Download the BeviaLLM Playbook →
Why Build From Scratch?
Modern deep learning frameworks hide enormous complexity behind convenient abstractions. torch.nn.Linear gives you a working layer. loss.backward() computes all your gradients. But do you understand what actually happens during that backward pass? Do you know why the attention mechanism divides by the square root of the key dimension? Do you know what layer normalization is actually normalizing, and why it matters for training stability?
When you implement every component manually, three things happen:
Understanding replaces mystery. Terms like “self-attention,” “residual connections,” and “causal masking” stop being jargon and become concrete operations you can trace through with a debugger.
Design decisions become visible. Why does GPT use pre-normalization instead of post-normalization? Why AdamW instead of vanilla SGD? Why causal masking? Building from scratch reveals the engineering trade-offs behind these choices.
Failure modes become predictable. When you’ve manually implemented softmax and watched it overflow, you understand why numerical stability matters. When you’ve traced gradients through six transformer blocks, you understand why residual connections exist.
What BeviaLLM Implements
BeviaLLM is a decoder-only transformer — the same architecture family as GPT — operating at the character level. It predicts the next character given a sequence of previous characters. The model is intentionally small: designed to train on a laptop CPU in minutes, not on GPU clusters for weeks.
The Full Stack
Every component is implemented from scratch:
Embeddings. Token embeddings map character indices to dense vectors. Position embeddings encode sequence order. Both are simple lookup tables — but the backward pass requires careful gradient accumulation when the same token appears multiple times.
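A rough NumPy sketch of that accumulation (the names here are illustrative, not BeviaLLM's actual functions): the forward pass is a plain row lookup, and the backward pass scatters gradients back into the table with `np.add.at`, which sums contributions correctly when the same index appears more than once.

```python
import numpy as np

def embedding_forward(table, idx):
    # table: (vocab_size, dim); idx: (batch, seq) integer indices
    return table[idx]                          # (batch, seq, dim) row lookup

def embedding_backward(table, idx, grad_out):
    # np.add.at accumulates on repeated indices, whereas plain fancy-index
    # assignment (grad_table[idx] += grad_out) would silently drop duplicates.
    grad_table = np.zeros_like(table)
    np.add.at(grad_table, idx, grad_out)
    return grad_table

# Tiny check: token 1 appears twice, so its row should accumulate two gradients
table = np.random.randn(10, 4) * 0.02
idx = np.array([[1, 3, 1]])
grad_out = np.ones((1, 3, 4))
print(embedding_backward(table, idx, grad_out)[1])  # -> [2. 2. 2. 2.]
```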
Self-Attention. The core of the transformer. Three linear projections produce Query, Key, and Value matrices. The attention formula computes weighted relevance scores across all positions in the sequence. Causal masking ensures the model can’t look at future tokens during generation — implemented by setting future positions to negative infinity before softmax.
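A rough single-head sketch of that computation in NumPy (illustrative, not the repository's exact code), with the causal mask applied as negative infinity before the softmax:

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    # x: (seq, dim); Wq, Wk, Wv: (dim, dim) projection matrices
    T, d = x.shape
    Q, K, V = x @ Wq, x @ Wk, x @ Wv

    scores = Q @ K.T / np.sqrt(d)                    # (seq, seq) relevance scores
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)       # block attention to future positions

    # Numerically stable softmax over each row
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    return weights @ V                               # weighted combination of values
```

Row i of `weights` sums to 1 and is zero beyond position i, which is exactly the "can't look ahead" guarantee.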
Layer Normalization. Normalizes activations across the embedding dimension to stabilize training. The backward pass through layer normalization is one of the more complex gradient computations in the model — involving dependencies on both mean and variance.
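One way to write both passes in NumPy (a sketch under simplifying assumptions, not the exact BeviaLLM code). The backward pass is involved precisely because every input in a row affects that row's mean and variance, which shows up as the two correction terms:

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    # Normalize over the last (embedding) dimension
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    xhat = (x - mu) / np.sqrt(var + eps)
    return gamma * xhat + beta, (xhat, var, gamma, eps)

def layernorm_backward(dout, cache):
    xhat, var, gamma, eps = cache
    D = xhat.shape[-1]
    dgamma = (dout * xhat).sum(axis=tuple(range(dout.ndim - 1)))
    dbeta = dout.sum(axis=tuple(range(dout.ndim - 1)))
    dxhat = dout * gamma
    inv_std = 1.0 / np.sqrt(var + eps)
    # The last two terms propagate each element's influence on the mean and variance
    dx = inv_std / D * (
        D * dxhat
        - dxhat.sum(axis=-1, keepdims=True)
        - xhat * (dxhat * xhat).sum(axis=-1, keepdims=True)
    )
    return dx, dgamma, dbeta
```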
Feed-Forward Network (MLP). A two-layer network with ReLU activation, processing each position independently. This is where the model adds computational capacity beyond what attention provides.
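In NumPy the whole sublayer is two matrix multiplications and an elementwise nonlinearity (a sketch with illustrative names; the hidden width is conventionally about four times the embedding dimension):

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    # x: (seq, dim); W1: (dim, 4*dim); W2: (4*dim, dim)
    h = np.maximum(0.0, x @ W1 + b1)   # ReLU: zero out negative activations
    return h @ W2 + b2                 # project back to dim, independently per position
```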
Residual Connections. Each sublayer’s input is added to its output, creating skip connections that allow gradients to flow directly through the network. Without these, deep transformers are effectively untrainable.
AdamW Optimizer. Combines momentum, adaptive learning rates, and decoupled weight decay. Implemented step by step: first moment estimation, second moment estimation, bias correction, parameter update, weight decay.
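The per-parameter update can be written in a few lines of NumPy (a sketch with assumed default hyperparameters, not the repository's exact optimizer class):

```python
import numpy as np

def adamw_step(p, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # Exponential moving averages of the gradient and its square
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction compensates for the zero-initialized moments (t starts at 1)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Adaptive update, then decoupled weight decay applied directly to the parameter
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps)
    p = p - lr * weight_decay * p
    return p, m, v
```

Applying the decay directly to the parameter rather than folding it into the gradient is the "decoupled" part that distinguishes AdamW from Adam with L2 regularization.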
Cross-Entropy Loss. Measures how well the model’s predicted probability distribution matches the true next character. The gradient has an elegant form: subtract 1 from the probability assigned to the correct class.
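A compact NumPy sketch of the loss and its gradient (illustrative, not the exact BeviaLLM code). The gradient really is just the softmax output with 1 subtracted at the target index:

```python
import numpy as np

def cross_entropy(logits, targets):
    # logits: (batch, vocab); targets: (batch,) indices of the true next character
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(shifted)
    probs = probs / probs.sum(axis=-1, keepdims=True)

    n = targets.shape[0]
    loss = -np.log(probs[np.arange(n), targets]).mean()

    # dL/dlogits = softmax(logits) - one_hot(targets), averaged over the batch
    grad = probs.copy()
    grad[np.arange(n), targets] -= 1.0
    return loss, grad / n
```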
The Architecture in Practice
When BeviaLLM processes a sequence of text, six stages execute in order:
| Stage | Component | What Happens |
|---|---|---|
| 1 | Tokenization | Characters are converted to integer indices |
| 2 | Token Embedding | Indices are mapped to dense vectors |
| 3 | Position Embedding | Sequence position information is added |
| 4 | Transformer Blocks | Attention + MLP with residual connections |
| 5 | Output Projection | Vectors are projected back to vocabulary size |
| 6 | Sampling | Next character is sampled from the probability distribution |
The backward pass mirrors this in reverse: gradients flow from the loss through the output projection, back through each transformer block (in reverse order), and finally through the embedding layers.
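Stitching the stages together, a deliberately simplified single-block forward pass might look like the sketch below (layer normalization is omitted for brevity, and the parameter names are illustrative rather than BeviaLLM's):

```python
import numpy as np

def forward(idx, tok_emb, pos_emb, Wq, Wk, Wv, Wo, W1, b1, W2, b2, W_out):
    # idx: (seq,) integer character indices
    T = idx.shape[0]
    x = tok_emb[idx] + pos_emb[:T]                        # stages 2-3: embeddings
    d = x.shape[-1]

    # Stage 4: one simplified transformer block (attention + MLP, with residuals)
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(np.triu(np.ones((T, T), dtype=bool), 1), -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    x = x + (w @ V) @ Wo                                  # attention + residual
    x = x + (np.maximum(0.0, x @ W1 + b1) @ W2 + b2)      # MLP + residual

    return x @ W_out                                      # stage 5: logits over the vocabulary
```

Stage 6 then applies softmax to the last row of the logits and samples the next character from that distribution.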
What You Learn by Building It
Attention Is a Soft Lookup Table
Think of attention as a dynamic, learnable lookup. Given a query, the model finds which keys are most relevant and returns a weighted combination of their values. Unlike a dictionary (exact match, single value), attention uses soft matching and returns blended results.
For “The cat sat on the mat” — when processing “sat,” the model can attend heavily to “cat” (the subject doing the sitting) and less to other words. This dynamic information routing is what gives transformers their power over sequential architectures.
Scale Prevents Gradient Collapse
The attention formula divides by √d (the square root of the key dimension). Without this scaling, dot products between Q and K grow large as the dimension increases, pushing softmax into regions with near-zero gradients. A small detail in the formula — but implementing it manually makes you understand why it’s there.
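A quick numerical illustration (with assumed toy values) shows the effect. For standard-normal queries and keys, the unscaled dot products have a standard deviation of roughly √d, so at d = 256 the softmax is typically already saturated:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 256
q = rng.standard_normal(d)
keys = rng.standard_normal((8, d))

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

raw = keys @ q                 # dot products with std around sqrt(d) = 16
scaled = raw / np.sqrt(d)      # rescaled back to roughly unit variance

print(softmax(raw))            # typically near one-hot: softmax gradients vanish
print(softmax(scaled))         # spread out: useful gradient signal survives
```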
Pre-Norm Is More Stable Than Post-Norm
The original transformer paper placed layer normalization after the residual addition (post-norm). GPT-2 and subsequent models moved it before the sublayer (pre-norm). When you train both variants from scratch, you see the difference directly: pre-norm trains more smoothly, especially as you add layers.
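In code the difference is only where the normalization sits relative to the residual addition. A schematic sketch (the sublayers are passed in as callables here; this is not the actual BeviaLLM block):

```python
def post_norm_block(x, attn, mlp, ln1, ln2):
    # Original transformer: normalize after each residual addition
    x = ln1(x + attn(x))
    x = ln2(x + mlp(x))
    return x

def pre_norm_block(x, attn, mlp, ln1, ln2):
    # GPT-2 style: normalize the input to each sublayer; the residual path
    # itself is left untouched, so gradients can flow through it unchanged
    x = x + attn(ln1(x))
    x = x + mlp(ln2(x))
    return x
```

That untouched residual path is the usual explanation for why pre-norm behaves better as depth grows: in the post-norm layout, every layer's normalization sits directly on the gradient highway.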
Temperature Controls the Creativity-Coherence Trade-off
During text generation, temperature scales the logits before softmax:
- Low temperature (0.5): Conservative, repetitive output — the model strongly favors high-probability characters
- Balanced (1.0): The natural learned distribution
- High temperature (1.5): Creative but chaotic — low-probability characters get a real chance
This single parameter controls the exploration-exploitation balance in generation.
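A minimal sampling sketch (illustrative, not the repository's generation loop): the logits are divided by the temperature before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it.

```python
import numpy as np

def sample_next(logits, temperature=1.0, rng=np.random.default_rng()):
    # logits: (vocab,) scores for the next character
    z = logits / temperature
    z = z - z.max()                        # numerical stability
    probs = np.exp(z)
    probs = probs / probs.sum()
    return rng.choice(len(probs), p=probs)
```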
Running It Yourself
BeviaLLM requires only Python and NumPy:
```bash
git clone https://github.com/Bevia/BeviaLLM.git
cd BeviaLLM
python -m venv .venv
source .venv/bin/activate
pip install numpy
python main.py --data data.txt --ctx 64 --dim 64 --layers 1 --batch 8 --steps 2000
```
Start with conservative settings. Watch the loss decrease. Watch the generated text improve from random noise to recognizable patterns. Then start experimenting:
- Increase model size — `--dim 128 --layers 2` gives better quality at the cost of slower training
- Try different data — code, poetry, and technical docs each produce different learned patterns
- Extend the context — `--ctx 256` lets the model capture longer dependencies (at quadratic memory cost)
Exercises That Deepen Understanding
Once you’re comfortable with the base implementation:
- Implement multi-head attention. The current implementation uses single-head attention. Splitting into multiple heads lets the model attend to different types of relationships simultaneously.
- Replace ReLU with GELU. GPT-2 uses GELU activation — implement it and compare training dynamics (a starting sketch follows this list).
- Add dropout. Regularization that randomly zeros activations during training, reducing overfitting.
- Implement learning rate scheduling. Warmup followed by cosine decay — the standard training recipe for transformers.
- Visualize attention maps. See which characters the model attends to during generation. This makes the abstract concept of “attention” concrete and interpretable.
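As a starting point for the GELU exercise, the tanh approximation used by GPT-2 is short enough to drop straight into the MLP (a sketch, not part of the current repository):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))
```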
The Point
BeviaLLM is simply a friendly way to peek behind the curtain and understand how large language models like ChatGPT actually work. When you trace through the matrix multiplications, debug a gradient calculation, and watch the loss decrease — you build intuition that reading documentation alone can’t provide.
The best way to understand deep learning is to get your hands dirty with the math. And the best way to make informed decisions about AI systems is to understand what’s actually happening inside them.
Resources
- Source code: github.com/Bevia/BeviaLLM
- BeviaLLM Playbook (PDF): Download →
- Vaswani, A. et al. “Attention Is All You Need.” NeurIPS 2017. arxiv.org/abs/1706.03762
- Radford, A. et al. “Language Models are Unsupervised Multitask Learners.” OpenAI, 2019. (GPT-2)
- Brown, T. et al. “Language Models are Few-Shot Learners.” NeurIPS 2020. (GPT-3)
- Goodfellow, I. et al. Deep Learning. MIT Press.
- Karpathy, A. “Let’s build GPT” video series.
- Alammar, J. “The Illustrated Transformer.”
- Transformers vs. Diffusion Models — companion post on AI architectures
- The Obsolescence Paradox: Why the Best Engineers Will Thrive in the AI Era — why understanding the internals matters more than ever