How GPT-2 flows
Week 1 — AI Models Deep Dive · Saran ·
Source repo · Pure CSS motion (respects prefers-reduced-motion)
1 · Tokens enter
Autoregressive generation: each step predicts the next token conditioned on all previous tokens; the causal mask enforces this.
Prompt tokens: The · future · of · AI · is
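A minimal sketch of that generation loop, assuming a hypothetical `next_token` callable in place of the model. The lookup table is invented for illustration and only reads the last token; GPT-2 itself conditions on the whole prefix and produces a distribution over its vocabulary.

```python
from typing import Callable, List


def generate(prompt: List[str],
             next_token: Callable[[List[str]], str],
             max_new_tokens: int) -> List[str]:
    """Greedy autoregressive loop: every step receives all tokens so far."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        # The new token is appended and becomes part of the next step's input.
        tokens.append(next_token(tokens))
    return tokens


# Hypothetical stand-in for a model (an assumption, not GPT-2's behaviour).
TOY_TABLE = {"is": "bright", "bright": "."}


def toy_next_token(prefix: List[str]) -> str:
    return TOY_TABLE.get(prefix[-1], "<unk>")


result = generate(["The", "future", "of", "AI", "is"], toy_next_token, 2)
print(result)
```

The point is the feedback: output from step t is input at step t+1, so generation cannot be parallelised across new tokens.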
2 · Through the stack
Token + position embeddings → dropout
LayerNorm → Multi-head self-attention → + residual
LayerNorm → FFN (768→3072→768, GELU) → + residual
Repeat ×12 blocks — same pattern, different learned weights
Final LayerNorm → LM head (weights tied to the token embedding) → next-token logits
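The sizes above fix the parameter count of the block stack. The arithmetic below is a sketch that assumes a bias on every linear layer, a fused query/key/value projection (768→2304), and a learned scale and shift in each LayerNorm. Embeddings and the final LayerNorm are left out because the text does not state vocabulary or context size.

```python
D_MODEL = 768    # hidden width, from the text
D_FFN = 3072     # FFN inner width, from the text
N_BLOCKS = 12    # number of blocks, from the text


def linear_params(n_in: int, n_out: int) -> int:
    return n_in * n_out + n_out  # weight matrix plus bias (bias is an assumption)


def layernorm_params(d: int) -> int:
    return 2 * d  # learned scale and shift


def block_params(d: int, d_ffn: int) -> int:
    attention = linear_params(d, 3 * d) + linear_params(d, d)  # QKV + output proj
    ffn = linear_params(d, d_ffn) + linear_params(d_ffn, d)    # expand + contract
    return 2 * layernorm_params(d) + attention + ffn


per_block = block_params(D_MODEL, D_FFN)
all_blocks = N_BLOCKS * per_block
print(per_block, all_blocks)
```

Because the LM head is weight-tied, it adds no parameters of its own beyond the token embedding matrix.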
Pulse = activation moving down the forward pass
Residual connections keep a direct additive path for gradients through every block
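The wiring of one block can be sketched as below. This is an illustration of the pre-LayerNorm residual pattern only: `attention` and `ffn` are hypothetical callables supplied by the caller, and `layer_norm` omits the learned scale and shift.

```python
import math
from typing import Callable, List

Vector = List[float]
Sequence = List[Vector]
Sublayer = Callable[[Sequence], Sequence]


def layer_norm(seq: Sequence, eps: float = 1e-5) -> Sequence:
    """Normalise each position's vector to zero mean and unit variance."""
    out = []
    for x in seq:
        mean = sum(x) / len(x)
        var = sum((v - mean) ** 2 for v in x) / len(x)
        out.append([(v - mean) / math.sqrt(var + eps) for v in x])
    return out


def add(a: Sequence, b: Sequence) -> Sequence:
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def block(x: Sequence, attention: Sublayer, ffn: Sublayer) -> Sequence:
    x = add(x, attention(layer_norm(x)))  # LayerNorm -> attention -> + residual
    x = add(x, ffn(layer_norm(x)))        # LayerNorm -> FFN -> + residual
    return x


def run_blocks(x: Sequence, attention: Sublayer, ffn: Sublayer, n: int) -> Sequence:
    for _ in range(n):  # a real model uses different learned weights per block
        x = block(x, attention, ffn)
    return x


def zero_sublayer(seq: Sequence) -> Sequence:
    return [[0.0] * len(v) for v in seq]


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
stack_out = run_blocks(x, zero_sublayer, zero_sublayer, 12)
print(stack_out == x)
```

With sublayers that output zeros, the input passes through all 12 blocks unchanged. That identity route is the same one gradients can follow on the backward pass.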
3 · Causal attention (toy 4×4)
Lower triangle, including the diagonal, can attend: each token sees itself and earlier tokens. Upper triangle is masked (no peeking at future tokens).
Brighter cells = positions where the current query token can place attention weight at this step
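The toy 4×4 case can be reproduced in a few lines. This sketch assumes uniform raw scores (all zeros) so the effect of the mask alone is visible. It drops masked cells from the softmax directly, which gives the same result as the common approach of setting those scores to negative infinity first.

```python
import math
from typing import List


def causal_mask(n: int) -> List[List[bool]]:
    """True where query position i may attend to key position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores: List[List[float]],
                   mask: List[List[bool]]) -> List[List[float]]:
    weights = []
    for row, allowed in zip(scores, mask):
        # Subtract the max allowed score for numerical stability.
        top = max(s for s, ok in zip(row, allowed) if ok)
        exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [[0.0] * 4 for _ in range(4)]  # toy scores, an assumption
weights = masked_softmax(scores, causal_mask(4))
for row in weights:
    print(row)
```

Row i spreads its weight evenly over positions 0..i and gives exactly zero to every later position, so the first token attends only to itself and the fourth attends to all four.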