Building GPT-2 From Scratch (and Loading Real Weights)

About 7 min read · Technical, beginner-friendly if you know Python

I’m Saran, co-founder of Tekvo. This is week 1 of an open notebook: one model a week, each with runnable code and visuals. The code lives in model-atlas on GitHub.

At a glance

What: A small GPT-2 (Generative Pre-trained Transformer 2) decoder-only model written by hand in PyTorch, then loaded with real pretrained weights so outputs match a reference implementation.
Why: Almost every big LLM (large language model) today still uses the same core idea (attention + feed-forward blocks, predict the next token). GPT-2 is the shortest path to seeing that clearly.
How to use this post: Skim the bullets and diagrams first, then run the commands in Run it locally if you want to verify on your machine.

Who this is for

You can run Python in a terminal and are OK installing packages in a virtual environment (an isolated folder of packages so projects do not clash).
You do not need a GPU (graphics processing unit) for the steps below—inference for this size is fine on a laptop CPU (central processing unit) for short generations.
If terms like “embedding” or “softmax” are new, follow the diagrams anyway—the flow is more important than memorizing names. Use Acronyms & terms and the links there as a sidebar.

Why GPT-2 still matters

It is easy to dismiss GPT-2 as old news. In practice, Llama and DeepSeek (whose architectures are published) and, as far as is publicly known, GPT-4-class models and Claude still share the same skeleton: decoder-only transformer, causal self-attention, pre-norm residuals, and next-token prediction at the end. What changed is scale, data, and engineering, not the basic architecture.

What I built

A from-scratch GPT-2 (124M parameters) in PyTorch (the open-source tensor / neural-network library for Python, originally developed at Facebook, now Meta): explicit Linear, LayerNorm, multi-head attention, GELU feed-forward, and weight tying on the LM (language modeling) head, with no Hugging Face AutoModel shortcut for the core forward pass.
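
As a rough check on the 124M figure, the count can be reproduced from the layer sizes listed in the stack section of this post (vocabulary 50,257, 1,024 positions, width 768, 12 blocks, FFN width 3,072). This is a sketch under two assumptions: every Linear layer carries a bias, and every LayerNorm has a scale and a shift. The function and argument names are illustrative and are not taken from the repo.

```python
def gpt2_param_count(vocab=50257, n_ctx=1024, d_model=768, n_layer=12, d_ff=3072):
    """Count learnable parameters for a GPT-2-style decoder with a tied LM head."""
    token_emb = vocab * d_model            # W_e, also reused as the LM head
    pos_emb = n_ctx * d_model              # learned absolute positions
    layer_norm = 2 * d_model               # scale + shift

    # Attention: fused Q/K/V projection plus the output projection W_O (with biases).
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # Feed-forward: expand to d_ff, then contract back (with biases).
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

    block = layer_norm + attn + layer_norm + ffn
    # Final LayerNorm; the tied LM head adds no new parameters.
    return token_emb + pos_emb + n_layer * block + layer_norm


total = gpt2_param_count()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

With the defaults this comes to 124,439,808. Almost a third of that (38,597,376) is the token embedding matrix, which is why tying the LM head to it matters at this size.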

Then I loaded the real checkpoint (a saved file of trained weights). That is the sanity check: if even one shape is wrong, weights do not line up or the logits (unnormalized scores before softmax) are garbage. When generations look right, the code matches the math.

Run it locally

You need: Python 3.10+, git, and about 5–10 minutes the first time (downloads weights).

Clone the repo and enter the code folder.
Create and activate a virtual environment (python -m venv .venv).
Install dependencies (pip install -r requirements.txt). Stack: PyTorch, Hugging Face transformers (for tokenizer / checkpoint loading helpers only), NumPy (numerical arrays).
Run generation or the optional attention visualization.

git clone https://github.com/saran-io/model-atlas.git
cd model-atlas/models/01-gpt2-from-scratch/code
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
python gpt2.py --prompt "The future of AI is" --max_tokens 80
python gpt2.py --visualize --viz_text "The cat sat on the mat"

If a command fails, check that your venv is active and that PyTorch installed for your platform (see pytorch.org).

The stack (one screen)

Raw text: The future of AI is …
BPE tokenizer: subword pieces mapped to vocabulary IDs (token IDs · vocab 50,257).
Token embedding: W_e · 50,257 × 768
Position embedding: 1,024 positions × 768
Dropout · residual stream: a 768-dimensional hidden state enters the transformer blocks.

Repeat 12 times (same block shape; different learned weights per layer):

Pre-norm self-attention: LayerNorm, then 12 heads × 64-d, dropout. Causal mask: position i attends only to j ≤ i.
out = x + Attention(LayerNorm(x))
Pre-norm feed-forward: LayerNorm, Linear 768→3072, GELU, Linear 3072→768, dropout.
out = x + FFN(LayerNorm(x))

Final LayerNorm · LM head: weights tied with W_e. Logits over 50,257 tokens → softmax → next token.

How to read it, in one breath: text becomes token IDs; IDs become vectors (token + position); then the same block runs twelve times—attention mixes information across positions, feed-forward (FFN / MLP) refines each position, and residuals (adding the input back) keep training stable. At the end, the LM head outputs scores over the vocabulary; weights are tied with the input embeddings (predicting token k and representing k share one space).
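
The pre-norm residual pattern, out = x + sublayer(LayerNorm(x)), can be sketched for a single position in plain Python. This is a toy under stated assumptions, not the repo's code: sizes are 4 and 16 instead of 768 and 3072, the LayerNorm omits its learned scale and shift, weights are random, and the GELU is the tanh approximation commonly used with GPT-2 checkpoints. All function names are made up for illustration.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (learned scale/shift omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gelu(v):
    """Tanh approximation of GELU."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def linear(x, weight, bias):
    """y = x @ weight + bias, with weight stored as [in_features][out_features]."""
    return [sum(xi * weight[i][j] for i, xi in enumerate(x)) + bias[j]
            for j in range(len(bias))]


def ffn(x, w1, b1, w2, b2):
    """Expand, apply GELU, contract back to the model width."""
    return linear([gelu(h) for h in linear(x, w1, b1)], w2, b2)


def pre_norm_residual(x, sublayer):
    """out = x + sublayer(LayerNorm(x))"""
    return [xi + si for xi, si in zip(x, sublayer(layer_norm(x)))]


random.seed(0)
d_model, d_ff = 4, 16  # toy sizes; GPT-2 small uses 768 and 3072
w1 = [[random.gauss(0, 0.02) for _ in range(d_ff)] for _ in range(d_model)]
w2 = [[random.gauss(0, 0.02) for _ in range(d_model)] for _ in range(d_ff)]
b1, b2 = [0.0] * d_ff, [0.0] * d_model

x = [1.0, -2.0, 0.5, 3.0]
out = pre_norm_residual(x, lambda h: ffn(h, w1, b1, w2, b2))
print(out)
```

The same wrapper would take an attention function as the sublayer. Because the input is added back unchanged, a sublayer that outputs zeros leaves the stream untouched, which is one way to see why residuals make deep stacks easier to train.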

Self-attention in one pass

The same pattern shows up in modern models (sometimes with extras like GQA and RoPE—see glossary).

Project: linear maps from the hidden state
Q = X W_Q,  K = X W_K,  V = X W_V

Scores: scaled dot products
S = QK^T / √d_k

Causal mask: future positions set to −∞ before softmax.

Mix values: normalized weights × values
A = softmax(S),  out = A V

Multi-head: concat heads → W_O (768×768). The residual adds this back into the stream before the FFN sublayer.

Toy 4×4 causal mask: lower triangle allowed; upper triangle blocked.
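
The four steps above (project, score, mask, mix) fit in a few lines for a single head. The sketch below is plain Python on a 4-token toy input; it skips the projection step by setting Q, K, and V by hand, and the function names are illustrative rather than the repo's.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head attention; q, k, v are lists of T vectors, each of length d_k."""
    d_k = len(q[0])
    weights = []
    for i, q_i in enumerate(q):
        scores = []
        for j, k_j in enumerate(k):
            if j > i:
                scores.append(float("-inf"))   # causal mask: no peeking at the future
            else:
                dot = sum(a * b for a, b in zip(q_i, k_j))
                scores.append(dot / math.sqrt(d_k))
        weights.append(softmax(scores))
    # out[i] is the attention-weighted average of the value vectors.
    out = [[sum(w * v_j[c] for w, v_j in zip(w_row, v)) for c in range(len(v[0]))]
           for w_row in weights]
    return weights, out


# Toy input: 4 positions, d_k = 2. Q, K, V are set by hand rather than projected from X.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0]]

weights, out = causal_attention(q, k, v)
for row in weights:
    print(" ".join(f"{w:.2f}" for w in row))
```

The printed weights form the lower-triangular pattern of the toy mask: the first row is 1.00 0.00 0.00 0.00 because position 0 can only attend to itself, and every entry above the diagonal is zero.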

Scaling (the 1/√d_k step): attention scores are divided by the square root of the head dimension so dot products stay in a range where softmax (a function that turns scores into probabilities that sum to 1) does not saturate. I removed that division once with correct weights loaded: nonsense output. One divisor is the difference between a toy and a working stack.
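
A small experiment shows the saturation effect. It assumes query and key entries are independent unit Gaussians, which trained weights are not, so treat the numbers as an illustration of the trend rather than a measurement of GPT-2. With d_k = 64 the divisor is 8.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over one list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
d_k, n_keys = 64, 8   # d_k matches one GPT-2 head; n_keys is arbitrary

query = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(n_keys)]

# Dot products of 64-dimensional unit-Gaussian vectors have a standard deviation near 8.
raw = [sum(a * b for a, b in zip(query, key)) for key in keys]
scaled = [s / math.sqrt(d_k) for s in raw]

peak_unscaled = max(softmax(raw))
peak_scaled = max(softmax(scaled))
print(f"largest weight without scaling: {peak_unscaled:.3f}")
print(f"largest weight with 1/sqrt(d_k): {peak_scaled:.3f}")
```

Without the division, the largest attention weight is pushed toward 1 and the rest toward 0, so each position effectively copies a single other position. Dividing by √d_k keeps the distribution softer, which is the regime the pretrained weights were trained in.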

Why this matters in production

At Tekvo I work with LLM APIs (application programming interfaces) and agents (systems that loop: model → tool → model) every day. Internals are not academic trivia: they help explain why longer context costs more, why tokenizer quirks become bugs, and why models sometimes “reach” for the wrong context (attention is doing what it was trained to do).
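
One reason longer context costs more is that the attention score matrix S = QK^T has one entry per pair of positions. The back-of-the-envelope count below uses GPT-2 small's 12 layers and 12 heads and covers only the score matrices for a full pass over the prompt; it ignores the FFN, the savings from causal masking, and key/value caching during generation. The function name is illustrative.

```python
def attention_score_entries(n_tokens, n_layer=12, n_head=12):
    """Entries in all QK^T score matrices for one full forward pass over n_tokens."""
    return n_layer * n_head * n_tokens * n_tokens


for n in (256, 512, 1024):
    print(f"{n:>5} tokens -> {attention_score_entries(n):>12,} score entries")
```

Doubling the context quadruples the number of score entries, so attention work grows quadratically while the rest of the model grows linearly with length.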

What is next

Week 2 in the repo is GPT-4o vs GPT-4.1 on real agent-style tasks: not leaderboard scores, but workloads that look like production.

If you are learning LLMs, implement one classic model end to end before chasing the newest API. GPT-2 remains one of the best compressed lessons in how the industry builds language models.

Acronyms & terms

API: Application Programming Interface, a contract for how one piece of software talks to another. Here: remote model services you call over the network. Intro: MDN (APIs)
BPE: Byte Pair Encoding, a way to turn text into subword tokens by merging frequent pairs. GPT-2 uses a byte-level BPE vocabulary (50,257 units). Paper (subword NMT): Sennrich et al., 2016 · Course: Hugging Face (Byte-Pair Encoding tokenization)
CPU / GPU: Central / Graphics Processing Unit, general-purpose vs parallel math hardware. Big training needs a GPU; small GPT-2 inference can run on CPU for short texts. PyTorch: Learn the basics
FFN / MLP: Feed-Forward Network / Multi-Layer Perceptron, the per-position MLP inside each transformer block (expand → activation → contract). See transformer FFN in The Illustrated Transformer
GELU: Gaussian Error Linear Unit, a smooth activation function used in GPT-2’s FFN instead of ReLU. Paper: Hendrycks & Gimpel, 2016
GPT: Generative Pre-trained Transformer, OpenAI’s family of decoder-only LMs; GPT-2 is the 2019 public release (~124M–1.5B parameters). Report: Radford et al., Language Models are Unsupervised Multitask Learners (PDF)
GQA: Grouped-Query Attention, which shares key/value heads across query heads to save memory in long contexts (not in vanilla GPT-2). Paper: Ainslie et al., 2023
LLM: Large Language Model, a neural net trained to model text, usually by predicting the next token. Gentle stack walkthrough: The Illustrated GPT-2
LM head: Language Modeling head, the final linear map from hidden size to vocabulary size that produces logits for the next token. Same Illustrated GPT-2 link as above
Logits: Raw scores before softmax; higher means “more likely” after normalization. Softmax intuition: DeepAI glossary (softmax)
PyTorch: Open-source Python library for tensors and differentiable programs (how most research and many products train models). Official tutorials
RoPE: Rotary Position Embedding, which encodes position by rotating features (common in Llama-style models; GPT-2 used learned absolute position embeddings). Paper: Su et al., 2021
Transformer: Architecture based on attention + FFN stacks; “decoder-only” means one direction (left-to-right), suited to generation. Original paper: Vaswani et al., Attention Is All You Need · Illustrated Transformer

References & sources

What this post draws on

Primary papers and standard teaching material—not a claim of novelty. Implementations were checked against public GPT-2 weights and the repo’s own DEEP-DIVE.md.

Vaswani, Shazeer, Parmar, et al. (2017). Attention Is All You Need. arXiv:1706.03762
Radford, Wu, Child, et al. (2019). Language Models are Unsupervised Multitask Learners (GPT-2). OpenAI PDF
Sennrich, Haddow, Birch (2016). Neural Machine Translation of Rare Words with Subword Units (BPE). arXiv:1508.07909
Alammar, J. The Illustrated Transformer / The Illustrated GPT-2. jalammar.github.io
Hugging Face. NLP Course (tokenizers, transformers). huggingface.co/learn
Karpathy, A. Let’s reproduce GPT-2 (educational video series). YouTube

Connect

If this was useful—questions, corrections, or you’re building in the same space—say hi. Saran — I post builds and notes on X and LinkedIn.