01 Transformer Architecture — Core Concepts
Token Embedding
Converts discrete tokens (words/sub-words) into dense vectors. Used in both encoder and decoder. Maps vocabulary IDs to d_model-dimensional space.
Positional Encoding
Added to token embeddings in both encoder and decoder to inject sequence order information. Transformers have no inherent sense of order without this.
Sinusoidal Encoding
Each position's encoding vector uses sine for even dimensions and cosine for odd dimensions. Both functions appear within every position's vector; the sin/cos alternation is across embedding dimensions, not across positions.
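A minimal numpy sketch of this scheme (the 10000 base and interleaved sin/cos layout follow the original Transformer formulation; function and variable names are illustrative and assume an even d_model):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) matrix of positional encodings."""
    positions = np.arange(max_len)[:, None]                 # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                # even dimension indices
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)   # (1, d_model/2)
    angles = positions * angle_rates                         # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dims -> sine
    pe[:, 1::2] = np.cos(angles)   # odd dims  -> cosine
    return pe

# Added to token embeddings before the first encoder/decoder layer:
# x = token_embeddings + sinusoidal_positional_encoding(T, d_model)[:T]
```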
Encoder
Processes the full input sequence in parallel using bidirectional self-attention. Produces context-rich representations. Used in seq2seq models.
Decoder
Generates output tokens auto-regressively (one at a time). Uses masked self-attention + cross-attention to encoder output. GPT-style models are decoder-only.
Look-Ahead Mask
Applied in every decoder self-attention layer (not just the first). Prevents each position from attending to future tokens. Future scores set to −∞ before softmax.
Masked Self-Attention
For token x_k at position k, only positions 1…k are visible. Future positions masked to −∞ so softmax assigns them zero weight.
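A small numpy sketch of how such a mask is typically built and applied to the raw score matrix before softmax (names are illustrative):

```python
import numpy as np

def look_ahead_mask(T: int) -> np.ndarray:
    """(T, T) boolean mask: True where attention is allowed (j <= i)."""
    return np.tril(np.ones((T, T), dtype=bool))

def apply_causal_mask(scores: np.ndarray) -> np.ndarray:
    """Set scores for future positions to -inf so softmax gives them zero weight."""
    T = scores.shape[-1]
    return np.where(look_ahead_mask(T), scores, -np.inf)

# Example: position k can only see positions up to and including k.
scores = np.random.randn(4, 4)
masked = apply_causal_mask(scores)
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # attention weights above the diagonal are exactly zero
```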
Key architecture facts: Token embeddings → both encoder & decoder. Positional encoding → both encoder & decoder. Look-ahead mask → every decoder self-attention layer, not just the first.
02Attention Mechanism & Multi-Head Attention
Scaled Dot-Product Attention
Attention(Q,K,V) = softmax(QKᵀ / √d_k) · V. Dividing by √d_k prevents dot products from growing too large, which would push softmax into regions with vanishing gradients.
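A direct numpy transcription of the formula (an illustrative sketch; shapes assume a single head with Q, K of shape (T, d_k) and V of shape (T, d_v)):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V, with an optional boolean mask (True = attend)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (T, T) compatibility scores
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                          # (T, d_v)
```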
Why divide by √d_k?
To avoid numerical issues and vanishing gradients during training. Large dot products → extreme softmax outputs → near-zero gradients → slow/failed learning.
Query (Q), Key (K), Value (V)
Each input is projected into three separate spaces. Q × Kᵀ computes compatibility scores. Softmax normalises them. V is aggregated by these weights.
Multi-Head Attention
Runs h parallel attention heads, each with its own Q/K/V projections. Outputs concatenated and projected by WO. Allows learning diverse local and global relations simultaneously.
Head Dimensionality
When the number of heads h increases while keeping d_model fixed, each head's dimension d_k = d_model / h decreases. More heads = narrower per-head attention.
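A minimal numpy sketch of the multi-head wiring described above (single sequence, no batching; random matrices stand in for learned projections, and all names are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, h, rng=np.random.default_rng(0)):
    """x: (T, d_model). Returns (T, d_model)."""
    T, d_model = x.shape
    d_k = d_model // h                        # per-head width shrinks as h grows
    heads = []
    for _ in range(h):
        W_q, W_k, W_v = (rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
                         for _ in range(3))
        Q, K, V = x @ W_q, x @ W_k, x @ W_v             # each (T, d_k)
        weights = softmax(Q @ K.T / np.sqrt(d_k))        # (T, T) scores per head
        heads.append(weights @ V)                        # (T, d_k)
    concat = np.concatenate(heads, axis=-1)              # (T, h * d_k) = (T, d_model)
    W_o = rng.standard_normal((h * d_k, d_model)) / np.sqrt(d_model)
    return concat @ W_o                                  # output projection W_O

out = multi_head_attention(np.random.randn(12, 512), h=8)
print(out.shape)   # (12, 512)
```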
Attention Score Matrix
For sequence length T with causal masking, each head produces T×T scores per sequence, but only the lower-triangular entries (including the diagonal) are non-zero, giving T(T+1)/2 non-zero scores per head.
Formulas — Attention & Parameters
Scaled Dot-Product
softmax(QKᵀ / √d_k) · V
Head dimension
d_k = d_model / h
h = number of heads
Scalar attention scores (1 pass)
B × h × T × T
B=batch, h=heads, T=seq length
Q projection matrix size
d_model × d_k
= d_model × (d_model/h)
W_O output projection
(h × d_v) × d_model
= d_model × d_model (when h × d_v = d_model)
Non-zero scores (causal, per head)
T(T+1) / 2
Lower-triangular including diagonal
Worked Examples — from paper
Q166: B=4, T=12, h=2 → total scalar scores = 4 × 2 × 12 × 12 = 1152
Q168: d_model=512, h=8, d_k=64 → Q matrix params = 512 × 64 = 32768
────────────────────────────────────────────────────────
GPT setup (Q169-171): T=1024, B=32, d_model=768, h=8, d_v=96
Q169: next-token targets = B × (T−1) = 32 × 1023 = 32736
Q170: non-zero attn scores per head = T(T+1)/2 = 1024×1025/2 = 524800
Q171: W_O params = (h × d_v) × d_model = 768 × 768 = 589824
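The arithmetic in Q166, Q168 and Q169-171 can be checked in a few lines (a sanity-check sketch, not part of the original questions):

```python
# Q166: total scalar attention scores in one forward pass
B, T, h = 4, 12, 2
assert B * h * T * T == 1152

# Q168: parameters in one Q projection matrix
d_model, h = 512, 8
d_k = d_model // h                       # 64
assert d_model * d_k == 32768

# GPT setup (Q169-171)
T, B, d_model, h, d_v = 1024, 32, 768, 8, 96
assert B * (T - 1) == 32736              # next-token targets per batch
assert T * (T + 1) // 2 == 524800        # non-zero causal scores per head
assert (h * d_v) * d_model == 589824     # W_O parameters (768 x 768)
```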
03 Training — Teacher Forcing & Language Modelling
Teacher Forcing
During training, the correct ground-truth token from the training data is fed as input for each next time step — regardless of what the model predicted. Speeds up and stabilises training.
Decoder Input (teacher forcing)
At time step t, the decoder input is the ground-truth token from position t−1. Example: if target = [je, suis, heureux], at t=3 the input is 'suis' (t=2 ground truth).
Causal Language Modelling
Next-token prediction. Each token in a sequence of length T generates one prediction target (except the last). For T tokens: T−1 prediction targets per sequence.
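A tiny sketch of how the shifted inputs and targets are typically built for teacher forcing and next-token prediction (the French tokens come from the example above; the '<bos>' start symbol is an assumption for illustration):

```python
target = ["je", "suis", "heureux"]          # ground-truth output sequence (T = 3)

# Teacher forcing: at step t the decoder sees the ground-truth token from t-1.
decoder_inputs = ["<bos>"] + target[:-1]    # ['<bos>', 'je', 'suis'] -> at t=3, input is 'suis'

# Causal LM view: each token except the last predicts its successor.
pairs = list(zip(target[:-1], target[1:]))  # [('je','suis'), ('suis','heureux')]
print(len(pairs))                           # T - 1 = 2 prediction targets
```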
Autoregressive Inference
At inference (GPT), each new token is conditioned on all previously generated tokens. Generation is sequential — no teacher forcing. Decoding strategies choose each next token.
Batch Normalisation
Normalises each feature across the batch. For feature column: subtract batch mean, divide by batch std (ε=0). With γ=1, β=0: normalised values are z-scores across the batch dimension.
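A numpy sketch of the normalisation described above (ε = 0, γ = 1, β = 0, i.e. plain z-scores per feature across the batch dimension):

```python
import numpy as np

def batch_norm(x: np.ndarray) -> np.ndarray:
    """x: (batch, features). Normalise each feature column across the batch."""
    mean = x.mean(axis=0, keepdims=True)    # per-feature batch mean
    std = x.std(axis=0, keepdims=True)      # per-feature batch std (eps = 0)
    return (x - mean) / std                 # gamma = 1, beta = 0 -> z-scores

x = np.array([[1.0, 10.0],
              [3.0, 30.0],
              [5.0, 50.0]])
print(batch_norm(x))   # each column now has mean 0 and std 1
```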
Teacher forcing vs inference: Training uses teacher forcing (ground-truth inputs) → fast, stable. Inference is autoregressive (model's own outputs as next inputs) → error can accumulate (exposure bias).
04 Decoding Strategies
Greedy Search
At each step, picks the single highest-probability token. Always 1 beam. Fast but suboptimal — can miss better sequences. Deterministic.
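A one-step sketch of greedy decoding over a next-token probability vector (illustrative only):

```python
import numpy as np

def greedy_step(probs: np.ndarray) -> int:
    """Pick the single highest-probability token id (deterministic, 1 'beam')."""
    return int(np.argmax(probs))

probs = np.array([0.1, 0.5, 0.3, 0.1])
print(greedy_step(probs))   # always 1
```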
Beam Search
Keeps the top-k (beam size) partial sequences at each step. Deterministic — always keeps the same top-k. Beam search with size 2 always keeps both top-2 sequences.
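A compact beam-search sketch over a toy scoring model (the `log_prob_fn` interface is an assumption for illustration; it maps a partial sequence to a vector of next-token log-probabilities):

```python
import numpy as np

def beam_search(log_prob_fn, vocab_size, beam_size, max_len, bos=0):
    """Keep the top-`beam_size` partial sequences (by total log-prob) at every step."""
    beams = [([bos], 0.0)]                    # (token ids, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            log_probs = log_prob_fn(seq)      # (vocab_size,) next-token log-probs
            for tok in range(vocab_size):
                candidates.append((seq + [tok], score + log_probs[tok]))
        # Deterministic: the same top-k candidates always survive.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams

# Toy model: fixed next-token distribution regardless of prefix.
toy = lambda seq: np.log(np.array([0.1, 0.6, 0.3]))
for seq, score in beam_search(toy, vocab_size=3, beam_size=2, max_len=3):
    print(seq, round(float(score), 3))
```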
Exhaustive (Exact) Decoding
Evaluates all possible sequences. At step t with vocab size V: V^t candidates. Computationally infeasible for large V or t. Deterministic.
Nucleus (Top-p) Sampling
Samples from the smallest set of tokens whose cumulative probability ≥ p. Inherently non-deterministic — introduces randomness via sampling.
Top-k Sampling
Restricts choices to top-k tokens, then samples one probabilistically. Non-deterministic. Top-2 sampling picks 1 token from the top 2 by sampling, not greedily.
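A numpy sketch of both sampling strategies (illustrative; the fixed `rng` seed is only for reproducibility, the strategies themselves are non-deterministic):

```python
import numpy as np

def top_k_sample(probs, k, rng):
    """Keep the k most probable tokens, renormalise, then sample one."""
    top = np.argsort(probs)[::-1][:k]
    p = probs[top] / probs[top].sum()
    return int(rng.choice(top, p=p))

def nucleus_sample(probs, p_threshold, rng):
    """Smallest set of tokens whose cumulative probability >= p, then sample one."""
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cum, p_threshold)) + 1   # include the crossing token
    nucleus = order[:cutoff]
    p = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=p))

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.25, 0.15, 0.1])
print(top_k_sample(probs, k=2, rng=rng))        # one of tokens {0, 1}, chosen by sampling
print(nucleus_sample(probs, 0.7, rng=rng))      # nucleus = {0, 1} since 0.5 + 0.25 >= 0.7
```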
Savings Heuristic (TSP)
A heuristic for Travelling Salesman Problem. Tries each city as base (fulcrum) and computes a tour. Always terminates (finite cities). Not suitable for computing the optimal TSP tour — it's a heuristic.
| Strategy | Candidates at step 1 (V=10) | Candidates at step 3 | Deterministic? |
|---|---|---|---|
| Greedy | 1 | 1 | Yes |
| Beam (size=3) | 3 | 3 | Yes |
| Exhaustive | 10 (=V) | 1000 (=V³) | Yes |
| Nucleus / Top-k Sampling | — | — | No (random) |
Beam vs Top-k sampling: Beam search with size 2 → always keeps both top-2 candidates (deterministic). Top-2 sampling → restricts to top 2 tokens then samples one probabilistically (non-deterministic).
05 Sequences, Vocabulary & Complexity
Sequences of max length L
With vocabulary size V, total sequences of max length L = Σ_{k=1}^{L} V^k. This counts sequences of length 1, 2, …, L.
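A quick numeric instance of the sum (illustrative values, V = 10 and L = 3):

```python
V, L = 10, 3
total = sum(V**k for k in range(1, L + 1))
print(total)   # 10 + 100 + 1000 = 1110 sequences of length 1..3
```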
Exhaustive decoding growth
Going from length L to L+1 adds (V−1)·V^L additional sequences to evaluate. (Not V^L — those already existed.)
Attention score memory
Each score is a 32-bit float (4 bytes). Total memory = h × T² × 4 bytes per sequence. Memory scales linearly with h: doubling the number of heads doubles the memory.
Key Formulas — Sequences & Memory
Total sequences (max length L)
Σ V^k for k=1 to L
Extra seqs (L → L+1)
(V − 1) · V^L
Attention score memory (per seq)
h × T² × 4 bytes
Training targets (full batch)
B × (T − 1)
Each token except last predicts next
Q167 — Memory comparison worked example
T=1000, float32=4 bytes, 1 MB=10⁶ bytes
Model A (h=8): 8 × 1000² × 4 = 32,000,000 bytes = 32 MB
Model B (h=16): 16 × 1000² × 4 = 64,000,000 bytes = 64 MB
% increase = (64−32)/32 × 100 = 100%
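The same comparison as a quick script (a sanity-check sketch, using the 1 MB = 10⁶ bytes convention stated above):

```python
T, bytes_per_score = 1000, 4              # sequence length, float32 size

def attn_score_memory(h, T=T):
    """Attention-score memory per sequence, in bytes: h * T^2 * 4."""
    return h * T**2 * bytes_per_score

mem_a = attn_score_memory(h=8)            # 32,000,000 bytes
mem_b = attn_score_memory(h=16)           # 64,000,000 bytes
print(mem_a / 1e6, mem_b / 1e6)           # 32.0 MB, 64.0 MB
print((mem_b - mem_a) / mem_a * 100)      # 100.0 % increase
```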
06 Quick Reference — Rules to Remember
Architecture Facts
Token embeddings → encoder AND decoder
Positional encoding → encoder AND decoder
Look-ahead mask → every decoder self-attn layer
Sinusoidal → sin=even dims, cos=odd dims
√d_k scaling → prevent vanishing gradients
Decoding Rules
Only non-deterministic → Nucleus / sampling
Beam keeps → top-k deterministically
Top-k sampling → samples 1 from top-k
Savings heuristic → always terminates, not optimal
All Formulas at a Glance
Attention = softmax(QKᵀ/√d_k)·V
d_k = d_model / h
Scores (1 pass) = B × h × T × T
Q params = d_model × d_k
W_O params = (h × d_v) × d_model (= d_model × d_model when h × d_v = d_model)
Non-zero causal scores = T(T+1)/2
Train targets = B × (T−1)
Total seqs (max L) = Σ V^k, k=1..L
Extra seqs (L→L+1) = (V−1)·V^L