How GPT-2 flows

Week 1 — AI Models Deep Dive · Saran · Source repo · Pure CSS motion (respects prefers-reduced-motion)

1 · Tokens enter

Autoregressive generation: each step predicts the next token conditioned on all previous tokens; the causal mask enforces this inside attention.

The future of AI is
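The loop behind that prompt can be sketched as below. This is a minimal illustration, not GPT-2 itself: `toy_next_token` and `TOY_TABLE` are hypothetical stand-ins for the model, the continuation words are made up, and the tokens are whole words rather than the byte-pair-encoded IDs GPT-2 actually consumes.

```python
# Hypothetical lookup table standing in for a trained model (made-up continuations).
TOY_TABLE = {"is": "bright", "bright": "."}

context_lengths = []  # records how much context each step saw


def toy_next_token(context):
    """Stand-in for a model call: receives the FULL prefix, returns one token."""
    context_lengths.append(len(context))
    return TOY_TABLE.get(context[-1], "<eos>")


def generate(next_token, prompt, steps):
    """Autoregressive loop: append one token per step, then feed everything back in."""
    tokens = list(prompt)  # copy so the caller's prompt is not mutated
    for _ in range(steps):
        tokens.append(next_token(tokens))
    return tokens


prompt = ["The", "future", "of", "AI", "is"]
generated = generate(toy_next_token, prompt, steps=2)
print(generated)        # prompt plus two new tokens
print(context_lengths)  # the context grows by one token each step
```

The point of the sketch is the shape of the loop: every step hands the whole sequence so far back to the model, so the context grows from 5 tokens to 6 here.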

2 · Through the stack

Token + position embeddings → dropout
LayerNorm → Multi-head self-attention → + residual
LayerNorm → FFN (768→3072→768, GELU) → + residual
Repeat ×12 blocks — same pattern, different learned weights
Final LayerNorm → LM head (weight-tied) → next-token logits
Pulse = activation moving down the forward pass
Residuals preserve a path for gradients
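The stack above can be sketched in NumPy as follows. This is a rough illustration under stated assumptions, not the source repo's code: the sizes are toy values (the page's GPT-2 uses 768, 3072 and 12 blocks), the weights are random, bias terms in the linear layers are left out, dropout is skipped as it would be at inference, and GELU uses the tanh approximation. All names (`forward`, `block`, `init_block`, and so on) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy sizes (assumption): model width, heads, FFN width, blocks, vocab, max positions.
D, H, FF, LAYERS, VOCAB, CTX = 16, 4, 64, 2, 50, 8


def layer_norm(x, g, b, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return g * (x - mu) / np.sqrt(var + eps) + b


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))


def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)


def attention(x, p):
    T = x.shape[0]
    q, k, v = np.split(x @ p["w_qkv"], 3, axis=-1)  # each (T, D)
    # Split the width into heads: (T, D) -> (H, T, D // H)
    q, k, v = [a.reshape(T, H, D // H).transpose(1, 0, 2) for a in (q, k, v)]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(D // H)  # (H, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)           # causal mask
    out = softmax(scores) @ v                            # (H, T, D // H)
    return out.transpose(1, 0, 2).reshape(T, D) @ p["w_o"]


def block(x, p):
    # LayerNorm -> multi-head self-attention -> + residual
    x = x + attention(layer_norm(x, p["g1"], p["b1"]), p)
    # LayerNorm -> FFN (expand, GELU, project back) -> + residual
    h = gelu(layer_norm(x, p["g2"], p["b2"]) @ p["w_up"]) @ p["w_down"]
    return x + h


def init_block():
    def w(*shape):
        return rng.normal(0, 0.02, shape)
    return dict(w_qkv=w(D, 3 * D), w_o=w(D, D), w_up=w(D, FF), w_down=w(FF, D),
                g1=np.ones(D), b1=np.zeros(D), g2=np.ones(D), b2=np.zeros(D))


wte = rng.normal(0, 0.02, (VOCAB, D))  # token embeddings
wpe = rng.normal(0, 0.02, (CTX, D))    # position embeddings
blocks = [init_block() for _ in range(LAYERS)]  # same pattern, different weights
g_f, b_f = np.ones(D), np.zeros(D)


def forward(ids):
    ids = np.asarray(ids)
    x = wte[ids] + wpe[np.arange(len(ids))]  # token + position embeddings
    for p in blocks:
        x = block(x, p)
    x = layer_norm(x, g_f, b_f)              # final LayerNorm
    return x @ wte.T                         # weight-tied LM head -> logits


logits = forward([3, 1, 4, 1, 5])
print(logits.shape)  # one row of next-token logits per input position
```

Reusing `wte` for the output projection is what "weight-tied" refers to: the same matrix maps token IDs to vectors on the way in and vectors to vocabulary scores on the way out.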

3 · Causal attention (toy 4×4)

Cells on or below the diagonal can attend (each token sees itself and earlier tokens); cells above the diagonal are masked (no peeking at future tokens).

Brighter cells = where probability mass can flow this step
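The toy 4×4 grid can be reproduced numerically with a short NumPy sketch. It assumes all raw attention scores are equal (zeros), which is an illustrative choice rather than anything a trained model would produce; it makes the effect of the mask easy to read off.

```python
import numpy as np

T = 4                                              # four tokens, as in the toy grid
scores = np.zeros((T, T))                          # assumption: equal raw scores
future = np.triu(np.ones((T, T), dtype=bool), k=1) # True strictly above the diagonal
masked = np.where(future, -np.inf, scores)         # future positions get -inf

# Row-wise softmax; exp(-inf) is exactly 0, so masked cells receive no weight.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))
```

Row i (rows are queries, columns are keys) spreads its probability evenly over positions 0 through i: the first token can only attend to itself, while the fourth gives 0.25 to each of the four positions. Everything above the diagonal is exactly zero.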