Building Llama 3.2 From Scratch (How Modern LLMs Improved on GPT-2)

About 6 min read · Technical, beginner-friendly if you know Python

I’m Saran, co-founder of Tekvo. This is week 2 of an open notebook: one model a week, each with runnable code and visuals. Last week I built GPT-2 by hand. This week I rebuilt the small Llama 3.2-style decoder block to answer one question: what actually changed between the classic GPT-2 stack and the LLM architecture most open models use today? The code lives in model-atlas on GitHub.

TL;DR

Llama 3.2 is still a decoder-only transformer: tokens in, next-token logits out.
The modern upgrades are mostly engineering choices: RMSNorm, RoPE, GQA, and SwiGLU.
Compared with GPT-2, Llama-style models are better suited for long context, efficient serving, and local deployment.
The big lesson: modern LLMs did not replace the transformer block. They made the block cheaper, longer-context, and easier to scale.

Why Llama after GPT-2?

GPT-2 is the cleanest starting point because it shows the original recipe: learned token embeddings, learned position embeddings, multi-head attention, GELU MLP, LayerNorm, and next-token prediction.

Llama is the best next step because it keeps the same mental model while swapping in the parts that became standard in modern open-weight LLMs. If GPT-2 teaches the skeleton, Llama teaches the production shape.

The Llama 3.2 1B model is also small enough to reason about. Meta describes the Llama 3.2 text models as optimized transformer models with grouped-query attention and long-context support in the public model card on Hugging Face and the Llama docs. Sebastian Raschka’s LLM Architecture Gallery is useful here because it makes these architecture differences visible instead of hiding them behind benchmark numbers.

The architecture

At a high level, Llama 3.2 is still:

tokens -> embeddings -> repeated decoder blocks -> final norm -> LM head

The interesting changes are inside the decoder block.

Input hidden state
  -> RMSNorm
  -> Grouped-Query Self-Attention with RoPE
  -> Residual add
  -> RMSNorm
  -> SwiGLU feed-forward network
  -> Residual add

Llama 3.2 decoder block — RMSNorm, GQA + RoPE, RMSNorm, SwiGLU, residual adds

What changed from GPT-2?

Component	GPT-2	Llama 3.2-style stack	Why it matters
Position	Learned absolute position embeddings	RoPE	Better relative position behavior and long-context scaling
Attention	Multi-head attention	Grouped-query attention	Lower KV-cache memory during inference
Normalization	LayerNorm	RMSNorm	Simpler normalization, no mean subtraction
FFN	GELU MLP	SwiGLU MLP	Stronger gated feed-forward block
Context	1,024 tokens in GPT-2 small	Much longer context in Llama 3.2 text models	More useful for documents, agents, and code
Tokenizer	GPT-2 BPE vocabulary	Larger modern tokenizer	Better coverage across code and multilingual text

GPT-2 vs Llama 3.2 — same skeleton, four structural upgrades

None of these changes alter the core training objective. The model still predicts the next token. But each change removes a bottleneck that shows up when you scale model size, context length, or serving traffic.

The four modern upgrades

1. RMSNorm

GPT-2 uses LayerNorm:

y = (x - mean(x)) / sqrt(var(x) + eps)

Llama uses RMSNorm:

y = x / sqrt(mean(x * x) + eps)

RMSNorm skips mean subtraction. That sounds small, but at LLM scale “small” operations happen billions of times. The idea is simple: normalize the magnitude of the vector, then learn a scale.

LayerNorm vs RMSNorm — same regularizing effect, two operations removed

2. RoPE

GPT-2 learns a position embedding table. Position 0 has a vector. Position 1 has a vector. The model adds those vectors to token embeddings.

RoPE does something different: it rotates query and key vectors based on position before computing attention. This lets attention scores carry relative position information directly.

The useful intuition:

GPT-2: "add a learned position vector to the token"
RoPE:  "rotate attention vectors so distance is built into Q dot K"

That is one reason Llama-style models are a better foundation for long-context systems.

RoPE rotates Q and K by position so the dot product carries relative distance

3. Grouped-query attention

In classic multi-head attention, every query head has its own key and value head. During generation, the server stores all previous keys and values in the KV cache. Long context makes this expensive.

Grouped-query attention keeps many query heads but shares fewer key/value heads across them.

MHA: 32 query heads, 32 key heads, 32 value heads
GQA: 32 query heads,  8 key heads,  8 value heads

The output still has many query heads, but the cache is smaller. That matters in production because KV-cache memory is often the reason long-context inference gets slow or expensive.

Multi-Head vs Grouped-Query Attention — fewer K/V heads serve more Q heads

KV-cache memory at long context — GQA cuts the bill 4×

4. SwiGLU

GPT-2 uses a simple GELU feed-forward network:

x -> Linear -> GELU -> Linear

Llama uses a gated feed-forward block:

x -> SiLU(gate_proj(x)) * up_proj(x) -> down_proj

One branch decides what to let through; the other branch carries candidate features. The multiplication makes the feed-forward layer more expressive without changing the basic transformer loop.

SwiGLU feed-forward — gate branch (SiLU) ⊙ candidate branch, then down-projection

What I built

The code for this week implements the core Llama-style decoder in PyTorch:

token embeddings
RMSNorm
RoPE
grouped-query self-attention
SwiGLU feed-forward network
tied LM head
small config that runs on a laptop

Run it locally:

git clone https://github.com/saran-io/model-atlas.git
cd model-atlas/models/02-llama32-from-scratch/code
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python llama32.py --prompt "The future of open models is" --max_new_tokens 40

The default config is intentionally tiny. It is not trying to reproduce Meta’s weights. It is a readable implementation of the architecture pattern, so the math lines up with the diagrams.

Llama 3.2 end-to-end — tokens in, next-token logits out, all four upgrades inside the decoder block

Where this applies

Llama-style models are useful when you care about:

Local AI features — small open-weight models can run on laptops, edge devices, or private servers.
Private enterprise workflows — teams can host models closer to sensitive data.
Agent backends — long context plus efficient KV cache helps tool-using systems carry more state.
RAG pipelines — open models can be tuned, quantized, routed, and evaluated inside your own stack.
Model learning — Llama is the architecture family you need to understand before reading most modern open LLM repos.

Production lesson

At Tekvo AI, most model decisions are not “which benchmark is highest?” They are tradeoffs:

How much context does the task really need?
What is the latency budget?
Can we send this data to a hosted API?
Does the model need tool use, coding ability, or multilingual coverage?
Can a small local model handle 80% of the traffic?

Llama matters because it gives builders a model family they can inspect, deploy, quantize, and adapt. You do not get that with a closed API.

Remember

GPT-2 and Llama are the same species: decoder-only transformers trained for next-token prediction.
RMSNorm simplifies normalization.
RoPE moves position information into attention instead of adding a learned position vector.
GQA reduces KV-cache pressure, which matters for long-context inference.
SwiGLU makes the feed-forward block stronger through gating.

Verdict

GPT-2 is the best model to learn the transformer. Llama is the best next model to learn what modern LLM engineering looks like.

If Week 1 answered “how does a language model work?”, Week 2 answers “what did modern open models change to make this practical at scale?”

Week 2/24. Next up: Mixture-of-Experts with Mistral/Mixtral, where the model gets bigger without activating every parameter on every token.