Understanding the Transformer Architecture: From Attention to GPT
A deep dive into the transformer architecture that powers modern LLMs. Learn how self-attention, positional encoding, and feed-forward layers work together.
The transformer architecture, introduced in the landmark 2017 paper "Attention Is All You Need," fundamentally changed how we build language models. In this post, we'll break down each component and understand why it works so well.
Why Transformers Matter
Before transformers, sequence modeling relied heavily on RNNs and LSTMs. These architectures processed tokens sequentially, creating a bottleneck for long sequences. Transformers solve this with parallel processing through self-attention.
Key Insight
The self-attention mechanism allows every token in a sequence to directly attend to every other token, regardless of distance. This is what makes transformers so powerful for language understanding.
The Self-Attention Mechanism
At its core, self-attention computes three vectors for each token:
- Query (Q): What am I looking for?
- Key (K): What do I contain?
- Value (V): What information do I provide?
In code, scaled dot-product attention looks like this:
import torch
import torch.nn.functional as F
def self_attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
    attention_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attention_weights, V)
Multi-Head Attention
Rather than computing a single attention function, transformers use multi-head attention — running several attention operations in parallel:
class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = torch.nn.Linear(d_model, d_model)
        self.W_k = torch.nn.Linear(d_model, d_model)
        self.W_v = torch.nn.Linear(d_model, d_model)
        self.W_o = torch.nn.Linear(d_model, d_model)

    def forward(self, x):
        batch_size = x.size(0)
        # Project, then split the model dimension into num_heads heads of size d_k
        Q = self.W_q(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        # Run scaled dot-product attention on every head in parallel
        attn_output = self_attention(Q, K, V)
        # Concatenate the heads and project back to d_model
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)
        return self.W_o(attn_output)
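As a quick shape check, here is how this layer might be used; the batch size, sequence length, and model dimension below are arbitrary example values:

# Arbitrary example: a batch of 2 sequences, 10 tokens each, d_model = 512
x = torch.randn(2, 10, 512)
mha = MultiHeadAttention(d_model=512, num_heads=8)
out = mha(x)
print(out.shape)  # torch.Size([2, 10, 512]): same shape as the input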
Positional Encoding
Since transformers process all tokens in parallel, they need a way to understand token order. Positional encodings add position information to the input embeddings using sine and cosine functions at different frequencies.
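Here is a minimal sketch of that scheme; the function name and arguments are my own, and it assumes an even d_model:

import math

def sinusoidal_positional_encoding(max_len, d_model):
    # pe[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # pe[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)        # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                    # (d_model / 2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe  # added to the token embeddings, broadcast over the batch

Because every position gets a unique, smoothly varying pattern across the embedding dimensions, the model can learn to infer both absolute and relative positions from it.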
Modern Approaches
While the original paper used fixed sinusoidal encodings, modern models such as LLaMA and GPT-NeoX use Rotary Position Embeddings (RoPE), which encode relative positions more effectively.
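RoPE rotates each pair of dimensions in the queries and keys by an angle proportional to the token's position, so the dot product between a query and a key depends on their relative offset. Here is a rough illustration of the idea; it follows the pairwise-rotation formulation from the RoPE paper, the function name is made up, and production implementations differ in layout details:

def apply_rope(x, base=10000.0):
    # x: (batch, num_heads, seq_len, d_k) queries or keys; d_k assumed even
    d_k, seq_len = x.size(-1), x.size(-2)
    freqs = 1.0 / (base ** (torch.arange(0, d_k, 2, dtype=torch.float32) / d_k))   # (d_k/2,)
    angles = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1) * freqs       # (seq_len, d_k/2)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]            # split each dimension pair
    rotated = torch.stack((x1 * cos - x2 * sin,    # rotate each pair by its
                           x1 * sin + x2 * cos),   # position-dependent angle
                          dim=-1)
    return rotated.flatten(-2)                     # back to (batch, num_heads, seq_len, d_k)

Applying this to Q and K before the dot product makes the attention scores a function of relative rather than absolute position.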
From Transformer to GPT
GPT-style models use the decoder-only variant of the transformer:
- Causal masking prevents tokens from attending to future positions (see the sketch after this list)
- Autoregressive generation predicts one token at a time
- Scale — GPT-4 is estimated to use ~1.8 trillion parameters across a mixture-of-experts architecture
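To make the first two points concrete, here is a minimal sketch of the self-attention function from earlier with a causal mask added; the function name is mine, not from any particular library:

def causal_self_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / torch.sqrt(torch.tensor(d_k, dtype=torch.float32))
    # Mask the upper triangle so position i can only attend to positions j <= i
    seq_len = scores.size(-1)
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))  # future positions get zero weight after softmax
    attention_weights = F.softmax(scores, dim=-1)
    return torch.matmul(attention_weights, V)

Generation then proceeds autoregressively: the model runs a forward pass, picks (or samples) the next token from the last position's logits, appends it to the sequence, and repeats.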
What's Next
In upcoming posts, we'll explore:
- How fine-tuning adapts pretrained transformers to specific tasks
- The role of RLHF in making models safe and helpful
- Emerging architectures like Mamba and state space models
Understanding transformers is the foundation for everything happening in modern AI. Master this, and the rest follows.