Understanding the Transformer Architecture: From Attention to GPT
A deep dive into the transformer architecture that powers modern LLMs. Learn how self-attention, positional encoding, and feed-forward layers work together.
The transformer architecture, introduced in the landmark 2017 paper "Attention Is All You Need," fundamentally changed how we build language models. In this post, we'll break down each component and understand why it works so well.
Why Transformers Matter
Before transformers, sequence modeling relied heavily on RNNs and LSTMs. These architectures processed tokens sequentially, creating a bottleneck for long sequences. Transformers solve this with parallel processing through self-attention.
Key Insight
The self-attention mechanism allows every token in a sequence to directly attend to every other token, regardless of distance. This is what makes transformers so powerful for language understanding.
The Self-Attention Mechanism
At its core, self-attention computes three vectors for each token:
- Query (Q): What am I looking for?
- Key (K): What do I contain?
- Value (V): What information do I provide?
In code, scaled dot-product attention looks like this:
import torch
import torch.nn.functional as F
def self_attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
    attention_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attention_weights, V)
Multi-Head Attention
Rather than computing a single attention function, transformers use multi-head attention — running several attention operations in parallel:
class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = torch.nn.Linear(d_model, d_model)
        self.W_k = torch.nn.Linear(d_model, d_model)
        self.W_v = torch.nn.Linear(d_model, d_model)
        self.W_o = torch.nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size = x.size(0)
        # Project, then split the model dimension into num_heads heads of size d_k
        Q = self.W_q(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        # Run scaled dot-product attention on every head in parallel
        attn_output = self_attention(Q, K, V)
        # Concatenate the heads and project back to d_model
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        return self.W_o(attn_output)
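As a quick shape check, here is how this layer might be used; the batch size, sequence length, and model dimension below are arbitrary example values:

# Arbitrary example: a batch of 2 sequences, 10 tokens each, d_model = 512
x = torch.randn(2, 10, 512)
mha = MultiHeadAttention(d_model=512, num_heads=8)
out = mha(x)
print(out.shape)  # torch.Size([2, 10, 512]): same shape as the input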
Positional Encoding
Since transformers process all tokens in parallel, they need a way to understand token order. Positional encodings add position information to the input embeddings using sine and cosine functions at different frequencies.
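Here is a minimal sketch of that scheme; the function name and arguments are my own, and it assumes an even d_model:

import math

def sinusoidal_positional_encoding(max_len, d_model):
    # pe[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)        # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                    # (d_model / 2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # added to the token embeddings, broadcast over the batch

Because every position gets a unique, smoothly varying pattern across the embedding dimensions, the model can learn to infer both absolute and relative positions from it.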
Modern Approaches
While the original paper used fixed sinusoidal encodings, modern models such as LLaMA and GPT-NeoX use Rotary Position Embeddings (RoPE), which encode relative positions more effectively.
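RoPE rotates each pair of dimensions in the queries and keys by an angle proportional to the token's position, so the dot product between a query and a key depends on their relative offset. Here is a rough illustration of the idea; it follows the pairwise-rotation formulation from the RoPE paper, the function name is made up, and production implementations differ in layout details:

def apply_rope(x, base=10000.0):
    # x: (batch, num_heads, seq_len, d_k) queries or keys; d_k assumed even
    d_k, seq_len = x.size(-1), x.size(-2)
    freqs = 1.0 / (base ** (torch.arange(0, d_k, 2, dtype=torch.float32) / d_k))   # (d_k/2,)
    angles = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1) * freqs       # (seq_len, d_k/2)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]            # split each dimension pair
    rotated = torch.stack((x1 * cos - x2 * sin,    # rotate each pair by its
                           x1 * sin + x2 * cos),   # position-dependent angle
                          dim=-1)
    return rotated.flatten(-2)                     # back to (batch, num_heads, seq_len, d_k)

Applying this to Q and K before the dot product makes the attention scores a function of relative rather than absolute position.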
From Transformer to GPT
GPT-style models use the decoder-only variant of the transformer:
- Causal masking prevents tokens from attending to future positions (see the sketch after this list)
- Autoregressive generation predicts one token at a time
- Scale — GPT-4 is estimated to use ~1.8 trillion parameters across a mixture-of-experts architecture
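To make the first two points concrete, here is a minimal sketch of the self-attention function from earlier with a causal mask added; the function name is mine, not from any particular library:

def causal_self_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
    # Mask the upper triangle so position i can only attend to positions j <= i
    seq_len = scores.size(-1)
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))  # future positions get zero weight after softmax
    attention_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attention_weights, V)

Generation then proceeds autoregressively: the model runs a forward pass, picks (or samples) the next token from the last position's logits, appends it to the sequence, and repeats.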
What's Next
In upcoming posts, we'll explore:
- How fine-tuning adapts pretrained transformers to specific tasks
- The role of RLHF in making models safe and helpful
- Emerging architectures like Mamba and state space models
Understanding transformers is the foundation for everything happening in modern AI. Master this, and the rest follows.