How GPT-2 flows

Week 1 — AI Models Deep Dive · Saran · Source repo · Pure CSS motion (respects prefers-reduced-motion)

1 · Tokens enter

Autoregressive generation: each step conditions on all previous tokens (causal mask).
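As a minimal sketch of that loop, assuming greedy (argmax) decoding: the names `greedy_generate` and `toy_model` are hypothetical, and `toy_model` is only a stand-in for GPT-2. The point is that every step passes the whole prefix back to the model and appends one token.

```python
from typing import Callable, List, Sequence


def greedy_generate(
    next_token_logits: Callable[[Sequence[int]], Sequence[float]],
    prompt_ids: Sequence[int],
    max_new_tokens: int,
) -> List[int]:
    """Append one token per step, always conditioning on the full prefix."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)  # scores for the next position only
        next_id = max(range(len(logits)), key=lambda i: logits[i])  # greedy argmax
        ids.append(next_id)
    return ids


def toy_model(ids: Sequence[int], vocab_size: int = 5) -> List[float]:
    """Hypothetical stand-in for a model: favours (last token + 1) mod vocab_size."""
    target = (ids[-1] + 1) % vocab_size
    return [1.0 if v == target else 0.0 for v in range(vocab_size)]


print(greedy_generate(toy_model, [0, 1], max_new_tokens=4))  # [0, 1, 2, 3, 4, 0]
```

In practice the argmax is often replaced by sampling (temperature, top-k), but the prefix-in, one-token-out structure stays the same.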

Example prompt: "The future of AI is"

Token + position embeddings → dropout

LayerNorm → Multi-head self-attention → + residual

LayerNorm → FFN (768→3072→768, GELU) → + residual

Repeat ×12 blocks — same pattern, different learned weights

Final LayerNorm → LM head (weights tied to the token embedding matrix) → next-token logits
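The stages above can be sketched end to end in NumPy. This is an illustrative forward pass, not GPT-2's actual code. It assumes tiny dimensions and random weights instead of the trained 768-wide model, omits dropout (inference only), and uses the tanh approximation of GELU. All function and parameter names are made up for this sketch.

```python
import numpy as np


def layer_norm(x, gain, bias, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def attention(x, p, n_head):
    T, D = x.shape
    q, k, v = np.split(x @ p["w_qkv"] + p["b_qkv"], 3, axis=-1)
    # (T, D) -> (n_head, T, D // n_head)
    split = lambda t: t.reshape(T, n_head, D // n_head).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(D // n_head)  # (n_head, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)        # strictly upper triangle
    scores = np.where(future, -np.inf, scores)                # causal mask
    out = (softmax(scores) @ v).transpose(1, 0, 2).reshape(T, D)
    return out @ p["w_proj"] + p["b_proj"]


def block(x, p, n_head):
    # LayerNorm -> multi-head self-attention -> + residual
    x = x + attention(layer_norm(x, p["ln1_g"], p["ln1_b"]), p, n_head)
    # LayerNorm -> FFN (d -> 4d -> d, GELU) -> + residual
    h = gelu(layer_norm(x, p["ln2_g"], p["ln2_b"]) @ p["w_fc"] + p["b_fc"])
    return x + h @ p["w_out"] + p["b_out"]


def forward(ids, params, n_head):
    x = params["wte"][ids] + params["wpe"][: len(ids)]  # token + position embeddings
    for p in params["blocks"]:                          # same pattern, different weights
        x = block(x, p, n_head)
    x = layer_norm(x, params["lnf_g"], params["lnf_b"])  # final LayerNorm
    return x @ params["wte"].T                           # weight-tied LM head


def init_params(vocab, n_ctx, d_model, n_layer, seed=0):
    rng = np.random.default_rng(seed)
    w = lambda *shape: rng.normal(0.0, 0.02, size=shape)
    blocks = [{
        "ln1_g": np.ones(d_model), "ln1_b": np.zeros(d_model),
        "w_qkv": w(d_model, 3 * d_model), "b_qkv": np.zeros(3 * d_model),
        "w_proj": w(d_model, d_model), "b_proj": np.zeros(d_model),
        "ln2_g": np.ones(d_model), "ln2_b": np.zeros(d_model),
        "w_fc": w(d_model, 4 * d_model), "b_fc": np.zeros(4 * d_model),
        "w_out": w(4 * d_model, d_model), "b_out": np.zeros(d_model),
    } for _ in range(n_layer)]
    return {"wte": w(vocab, d_model), "wpe": w(n_ctx, d_model), "blocks": blocks,
            "lnf_g": np.ones(d_model), "lnf_b": np.zeros(d_model)}


params = init_params(vocab=50, n_ctx=16, d_model=8, n_layer=2)
logits = forward([3, 7, 1, 4], params, n_head=2)
print(logits.shape)  # (4, 50): one row of next-token logits per input position
```

With d_model=768, n_layer=12 and n_head=12 the same structure has the shapes described above, including the 768→3072→768 FFN. Only the last row of `logits` is needed to pick the next token.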

Pulse = activation moving down the forward pass

Residuals preserve a path for gradients

Lower triangle (including the diagonal) can attend; upper triangle is masked (no peeking at future tokens).

Brighter cells = where probability mass can flow this step
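A small sketch of that attention grid, assuming all raw scores are equal so that only the effect of the mask shows. The helper names `causal_mask` and `masked_softmax` are hypothetical. Each row is a query position; the nonzero entries are the positions it can put probability mass on.

```python
import math


def causal_mask(n):
    """True where query position i may attend to key position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax that gives zero weight to masked (False) entries."""
    out = []
    for row, allowed in zip(scores, mask):
        m = max(s for s, ok in zip(row, allowed) if ok)  # for numerical stability
        exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


n = 4
weights = masked_softmax([[0.0] * n for _ in range(n)], causal_mask(n))
for row in weights:
    print(" ".join(f"{w:.2f}" for w in row))
# 1.00 0.00 0.00 0.00
# 0.50 0.50 0.00 0.00
# 0.33 0.33 0.33 0.00
# 0.25 0.25 0.25 0.25
```

The first token can only attend to itself, and each later row spreads its weight over one more position. With real scores the nonzero entries would differ, but the zeros in the upper triangle stay fixed.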