Building GPT-2 From Scratch (and Loading Real Weights)
Week 1 of a 24-model series: implement every layer in PyTorch, load OpenAI's checkpoint, and see why today's LLMs are still this architecture.
About 7 min read · Technical, beginner-friendly if you know Python
I’m Saran, co-founder of Tekvo. This is week 1 of an open notebook: one model a week, each with runnable code and visuals. The code lives in model-atlas on GitHub.
At a glance
- What: A small GPT-2 (Generative Pre-trained Transformer 2) decoder-only model written by hand in PyTorch, then loaded with real pretrained weights so outputs match a reference implementation.
- Why: Almost every big LLM (large language model) today still uses the same core idea (attention + feed-forward blocks, predict the next token). GPT-2 is the shortest path to seeing that clearly.
- How to use this post: Skim the bullets and diagrams first, then run the commands in Run it locally if you want to verify on your machine.
Who this is for
- You can run Python in a terminal and are OK installing packages in a virtual environment (an isolated folder of packages so projects do not clash).
- You do not need a GPU (graphics processing unit) for the steps below—inference for this size is fine on a laptop CPU (central processing unit) for short generations.
- If terms like “embedding” or “softmax” are new, follow the diagrams anyway—the flow is more important than memorizing names. Use Acronyms & terms and the links there as a sidebar.
Why GPT-2 still matters
It is easy to dismiss GPT-2 as old news. In practice, GPT-4-class models, Claude, Llama, and DeepSeek still share the same skeleton: decoder-only transformer, causal self-attention, pre-norm residuals, and next-token prediction at the end. What changed is scale, data, and engineering—not a totally different architecture.
What I built
A from-scratch GPT-2 (124M parameters) in PyTorch (the open-source tensor / neural-network library for Python, originally developed at Facebook, now Meta): explicit Linear, LayerNorm, multi-head attention, GELU feed-forward, and weight tying on the LM (language modeling) head, with no Hugging Face AutoModel shortcut for the core forward pass.
Then I loaded the real checkpoint (a saved file of trained weights). That is the sanity check: if even one shape is wrong, weights do not line up or the logits (unnormalized scores before softmax) are garbage. When generations look right, the code matches the math.
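One cheap shape check before loading anything is to reproduce the 124M figure from the layer sizes in this post. The sketch below is plain arithmetic, not code from the repo; it assumes every Linear layer carries a bias and every LayerNorm has a gain and a bias, which is how the released GPT-2 checkpoint is usually described.

```python
# Hyperparameters as listed in this post for GPT-2 small.
vocab, n_ctx, d_model, n_layer, d_ff = 50_257, 1_024, 768, 12, 3_072

def linear(n_in, n_out):
    """Weights plus bias for one Linear layer (bias assumed present)."""
    return n_in * n_out + n_out

layer_norm = 2 * d_model  # gain and bias

per_block = (
    layer_norm                       # norm before attention
    + linear(d_model, 3 * d_model)   # Q, K, V projections (fused or separate, same count)
    + linear(d_model, d_model)       # attention output projection W_O
    + layer_norm                     # norm before the feed-forward sublayer
    + linear(d_model, d_ff)          # expand 768 -> 3072
    + linear(d_ff, d_model)          # contract 3072 -> 768
)

embeddings = vocab * d_model + n_ctx * d_model         # token + position tables
total = embeddings + n_layer * per_block + layer_norm  # plus the final LayerNorm

# The LM head adds nothing here: its weights are tied to the token embedding table.
print(f"{total:,}")
```

If a hand-written model reports a different count, some layer shape disagrees with the checkpoint before a single weight is copied.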
Run it locally
You need: Python 3.10+, git, and about 5–10 minutes the first time (downloads weights).
- Clone the repo and enter the code folder.
- Create and activate a virtual environment (python -m venv .venv).
- Install dependencies (pip install -r requirements.txt). Stack: PyTorch, Hugging Face transformers (for tokenizer / checkpoint loading helpers only), NumPy (numerical arrays).
- Run generation or the optional attention visualization.
git clone https://github.com/saran-io/model-atlas.git
cd model-atlas/models/01-gpt2-from-scratch/code
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
python gpt2.py --prompt "The future of AI is" --max_tokens 80
python gpt2.py --visualize --viz_text "The cat sat on the mat"
If a command fails, check that your venv is active and that PyTorch installed for your platform (see pytorch.org).
The stack (one screen)
Raw text
The future of AI is …
BPE tokenizer
Subword pieces mapped to vocabulary IDs.
token IDs · vocab 50,257
Token embedding
W_e · 50,257 × 768
Position embedding
1,024 positions × 768
Dropout · residual stream
768-dimensional hidden state enters the transformer blocks.
Repeat 12 times
Same block shape; different learned weights per layer.
Self-attention sublayer: LayerNorm, then 12 heads × 64-d, dropout. Causal mask: position i attends only to j ≤ i.
Feed-forward sublayer: LayerNorm, Linear 768→3072, GELU, Linear 3072→768, dropout.
Final LayerNorm · LM head
Weights tied with W_e. Logits over 50,257 tokens → softmax → next token.
How to read it, in one breath: text becomes token IDs; IDs become vectors (token + position); then the same block runs twelve times—attention mixes information across positions, feed-forward (FFN / MLP) refines each position, and residuals (adding the input back) keep training stable. At the end, the LM head outputs scores over the vocabulary; weights are tied with the input embeddings (predicting token k and representing k share one space).
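The wiring in that paragraph can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the repo's code: the sizes (vocabulary of 4, hidden size 3), the tables `W_e` and `W_p`, and the stand-in sublayer `halve` are all hypothetical, and LayerNorm's learned gain and bias are left out. It shows only the order of operations: token plus position, pre-norm residual blocks, final LayerNorm, then a head that reuses the embedding table.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (learned gain/bias omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def block(x, attn, ffn):
    """Pre-norm residual wiring: normalize, run the sublayer, add the input back."""
    x = [a + b for a, b in zip(x, attn(layer_norm(x)))]
    x = [a + b for a, b in zip(x, ffn(layer_norm(x)))]
    return x

def tied_logits(h, W_e):
    """Tied LM head: score the hidden state against every row of the embedding table."""
    return [sum(hi * wi for hi, wi in zip(h, row)) for row in W_e]

# Hypothetical toy tables: 4 tokens, 2 positions, hidden size 3.
W_e = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
W_p = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0]]

token_id, position = 2, 1
x = [e + p for e, p in zip(W_e[token_id], W_p[position])]  # token + position

def halve(v):
    """Stand-in sublayer so the wiring runs; the real ones are attention and the MLP."""
    return [0.5 * t for t in v]

for _ in range(2):  # GPT-2 small repeats the block 12 times
    x = block(x, attn=halve, ffn=halve)

logits = tied_logits(layer_norm(x), W_e)
next_token = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
print(next_token)
```

Because the head reuses `W_e`, no separate output matrix exists; the same rows that represent a token on the way in score it on the way out.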
Self-attention in one pass
The same pattern shows up in modern models (sometimes with extras like GQA and RoPE—see glossary).
1. Project: linear maps from the hidden state. $Q = XW_Q$, $K = XW_K$, $V = XW_V$
2. Scores: scaled dot products. $S = QK^\top / \sqrt{d_k}$
3. Causal mask: future positions set to $-\infty$ before softmax.
4. Mix values: normalized weights × values. $A = \mathrm{softmax}(S)$, $\mathrm{out} = AV$
Multi-head: concat heads → $W_O$ (768×768)
Residual adds this back into the stream before the FFN sublayer.
Toy 4×4 causal mask: lower triangle allowed; upper triangle blocked.
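The four steps and the toy mask can be run together on a 4-token example. This is a minimal single-head sketch in plain Python, not the repo's implementation; as a simplifying assumption, the projections are skipped and the same made-up matrix `X` is used for queries, keys, and values.

```python
import math

def softmax(row):
    """Turn a row of scores into probabilities; -inf entries become exactly 0."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask."""
    T, d_k = len(Q), len(Q[0])
    out, weights = [], []
    for i in range(T):
        scores = []
        for j in range(T):
            if j <= i:  # position i may look at itself and the past
                dot = sum(q * k for q, k in zip(Q[i], K[j]))
                scores.append(dot / math.sqrt(d_k))
            else:       # future positions are blocked before softmax
                scores.append(float("-inf"))
        a = softmax(scores)
        weights.append(a)
        out.append([sum(a[j] * V[j][c] for j in range(T)) for c in range(len(V[0]))])
    return out, weights

# Made-up hidden states: 4 positions, 2 features each.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
out, weights = causal_attention(X, X, X)

# Print the mask pattern: "x" where attention is allowed, "." where it is blocked.
for row in weights:
    print(" ".join("x" if w > 0 else "." for w in row))
```

The printed pattern is lower-triangular, and the first position can only copy its own value vector because it has nothing earlier to attend to.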
Scaling (the $1/\sqrt{d_k}$ step): attention scores are divided by the square root of the head dimension (here $d_k = 64$, so the divisor is 8) so dot products stay in a range where softmax (a function that turns scores into probabilities that sum to 1) does not saturate. I removed that division once with correct weights loaded and got nonsense output. One divisor is the difference between a toy and a working stack.
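A small numeric illustration of the saturation effect, using made-up scores rather than anything measured from the model. The magnitudes are chosen on the assumption that query and key components are roughly independent with unit variance, in which case a 64-dimensional dot product has a standard deviation of about 8.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64                       # head dimension from this post
raw = [8.0, 0.0, -8.0, 16.0]   # hypothetical unscaled dot products
scaled = [s / math.sqrt(d_k) for s in raw]

unscaled_probs = softmax(raw)
scaled_probs = softmax(scaled)

print("unscaled max prob:", round(max(unscaled_probs), 4))
print("scaled max prob:  ", round(max(scaled_probs), 4))
```

Without the divisor, nearly all the probability mass lands on one position, so the head behaves like a hard lookup; with it, the weights stay spread out enough to mix several positions.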
Why this matters in production
At Tekvo I work with LLM APIs (application programming interfaces) and agents (systems that loop: model → tool → model) every day. Internals are not academic trivia: they help explain why longer context costs more, why tokenizer quirks become bugs, and why models sometimes “reach” for the wrong context (attention is doing what it was trained to do).
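The "longer context costs more" point follows from the shape of the score matrix: each head in each layer builds a T × T grid of scores for a sequence of T tokens. A rough count for the GPT-2 small layout in this post, assuming a naive implementation that materializes the full grid before masking:

```python
n_layer, n_head = 12, 12  # GPT-2 small, as described in this post

def attention_score_entries(seq_len):
    """Entries across all T x T score matrices for one forward pass."""
    return n_layer * n_head * seq_len * seq_len

for seq_len in (256, 512, 1024):
    print(seq_len, f"{attention_score_entries(seq_len):,}")
```

Doubling the sequence length quadruples the count, which is the basic reason long prompts are disproportionately expensive.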
What is next
Week 2 in the repo is GPT-4o vs GPT-4.1 on real agent-style tasks: not leaderboard scores, but workloads that look like production.
If you are learning LLMs, implement one classic model end to end before chasing the newest API. GPT-2 remains one of the best compressed lessons in how the industry builds language models.
Acronyms & terms
- API
- Application Programming Interface: a contract for how one piece of software talks to another. Here: remote model services you call over the network. Intro: MDN, APIs
- BPE
- Byte Pair Encoding: a way to turn text into subword tokens by merging frequent pairs. GPT-2 uses a byte-level BPE vocabulary (50,257 units). Paper (subword NMT): Sennrich et al., 2016 · Course: Hugging Face, Byte-Pair Encoding tokenization
- CPU / GPU
- Central / Graphics Processing Unit: general-purpose vs parallel math hardware. Big training needs a GPU; small GPT-2 inference can run on CPU for short texts. PyTorch: Learn the basics
- FFN / MLP
- Feed-Forward Network / Multi-Layer Perceptron: the per-position MLP inside each transformer block (expand → activation → contract). See transformer FFN in The Illustrated Transformer
- GELU
- Gaussian Error Linear Unit: a smooth activation function used in GPT-2’s FFN instead of ReLU. Paper: Hendrycks & Gimpel, 2016
- GPT
- Generative Pre-trained Transformer: OpenAI’s family of decoder-only LMs; GPT-2 is the 2019 public release (~124M–1.5B parameters). Report: Radford et al., Language Models are Unsupervised Multitask Learners (PDF)
- GQA
- Grouped-Query Attention: shares key/value heads across query heads to save memory in long contexts (not in vanilla GPT-2). Paper: Ainslie et al., 2023
- LLM
- Large Language Model: a neural net trained to model text, usually by predicting the next token. Gentle stack walkthrough: The Illustrated GPT-2
- LM head
- Language Modeling head: final linear map from hidden size to vocabulary size to produce logits for the next token. Same Illustrated GPT-2 link above
- Logits
- Raw scores before softmax; higher means “more likely” after normalization. Softmax intuition: DeepAI glossary, softmax
- PyTorch
- Open-source Python library for tensors and differentiable programs (how most research and many products train models). Official tutorials
- RoPE
- Rotary Position Embedding: encodes position by rotating features (common in Llama-style models; GPT-2 used learned absolute position embeddings). Paper: Su et al., 2021
- Transformer
- Architecture based on attention + FFN stacks; “decoder-only” means one direction (left-to-right), suited to generation. Original paper: Vaswani et al., Attention Is All You Need · Illustrated Transformer
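To make the BPE entry concrete, here is a toy version of the "merge the most frequent pair" step. It is a deliberately simplified sketch: it works on characters of one made-up string and counts pairs across spaces, whereas GPT-2's tokenizer works on bytes and applies a fixed, pre-learned merge table. The helper names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the adjacent pair that occurs most often (first seen wins ties)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of the pair with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")  # start from single characters
for _ in range(2):                 # two merge rounds
    tokens = merge_pair(tokens, most_frequent_pair(tokens))

print(tokens)
```

After two rounds the shared stem "low" has become one token, while the rarer endings are still split into characters; repeating the process many times over a large corpus is what builds a subword vocabulary.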
References & sources
What this post draws on
Primary papers and standard teaching material—not a claim of novelty. Implementations were checked against public GPT-2 weights and the repo’s own DEEP-DIVE.md.
- Vaswani, Shazeer, Parmar, et al. (2017). Attention Is All You Need. arXiv:1706.03762
- Radford, Wu, Child, et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2). OpenAI PDF
- Sennrich, Haddow, Birch (2016). Neural Machine Translation of Rare Words with Subword Units (BPE). arXiv:1508.07909
- Alammar, J. The Illustrated Transformer / The Illustrated GPT-2. jalammar.github.io
- Hugging Face. NLP Course (tokenizers, transformers). huggingface.co/learn
- Karpathy, A. Let’s reproduce GPT-2 (educational video series). YouTube
Resources
- model-atlas: full series, week 1 code under models/01-gpt2-from-scratch/
- DEEP-DIVE.md: longer technical write-up in the repo
- Animated explainer
Connect
If this was useful—questions, corrections, or you’re building in the same space—say hi. Saran — I post builds and notes on X and LinkedIn.