Tier 2: Architecture

Assemble the core components into a complete Transformer


🎯 Learning goals

After finishing this tier, you will be able to:

  • ✅ Understand how residual connections work
  • ✅ Understand the Pre-Norm Transformer block layout
  • ✅ Understand gradient flow in deep networks
  • ✅ Implement a complete Transformer block from scratch

📚 Module list

01. Residual Connection

Core questions:

  • Why do we need residual connections?
  • How do residuals mitigate the vanishing-gradient problem?
  • What is the mathematical form of residuals?

Key experiments:

  • Exp 1: training with/without residuals
  • Exp 2: visualize gradient flow (a toy sketch follows this list)
  • Exp 3: depth impact (shallow vs deep)
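
A toy sketch in the spirit of Exp 2, assuming a deep stack of `Linear` + `Tanh` layers (`DeepStack` and all sizes are illustrative, not from the module code). It prints the gradient norm that survives back to the input, with and without the residual form y = x + f(x):

```python
import torch
import torch.nn as nn

class DeepStack(nn.Module):
    """Toy model: a deep stack of Linear+Tanh layers,
    optionally wrapped in residual connections y = x + f(x)."""
    def __init__(self, depth=32, width=64, residual=True):
        super().__init__()
        self.residual = residual
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Linear(width, width), nn.Tanh())
            for _ in range(depth)
        ])

    def forward(self, x):
        for f in self.layers:
            x = x + f(x) if self.residual else f(x)
        return x

for residual in (False, True):
    torch.manual_seed(0)
    model = DeepStack(residual=residual)
    x = torch.randn(8, 64, requires_grad=True)
    model(x).sum().backward()
    # without residuals the gradient reaching the input shrinks with depth;
    # with residuals the identity path keeps it at a healthy scale
    print(f"residual={residual}: input grad norm = {x.grad.norm():.4e}")
```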

Estimated time: 1 hour

Prerequisites:

  • Tier 1: Foundation (all modules)

02. Transformer Block

Core questions:

  • How do we assemble Norm, Attention, FFN, Residual?
  • Why this order?
  • Pre-Norm vs Post-Norm: what’s the difference?

Key experiments:

  • Exp 1: component ablation
  • Exp 2: ordering impact (both orderings are sketched after this list)
  • Exp 3: depth impact (2 layers vs 8 layers)
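
For Exp 2, note that Pre-Norm and Post-Norm differ only in where normalization sits relative to the residual add. A minimal sketch, with `sublayer` and `norm` standing in for Attention/FFN and RMSNorm:

```python
# Pre-Norm: normalize the sub-layer input; the residual path stays a pure
# identity, so gradients flow straight through every layer
def pre_norm_step(x, sublayer, norm):
    return x + sublayer(norm(x))

# Post-Norm (original Transformer): normalize after the residual add, which
# puts the normalization on the gradient path of every layer
def post_norm_step(x, sublayer, norm):
    return norm(x + sublayer(x))
```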

Estimated time: 1.5 hours

Prerequisites:

  • 01. Residual Connection

🏗️ From components to architecture

Assembly logic of a Transformer block

```
Input x
  │  (save x as the residual)
  ▼
RMSNorm            (normalize, stabilize)
  ▼
Attention + RoPE   (attend, with position info)
  ▼
 (+) residual #1   (add the saved x back)
  │  (save the current state as the residual)
  ▼
RMSNorm            (normalize again)
  ▼
FeedForward        (process, store knowledge)
  ▼
 (+) residual #2   (add the saved state back)
  ▼
Output
```

Design points (a runnable sketch follows this list):

  1. Pre-Norm: normalize before each sub-layer

    • Stable input distribution per sub-layer
    • Smoother gradient flow
  2. Double residuals: one for Attention, one for FFN

    • Gradients can skip complex sub-layers
    • Incremental learning: only learn the “delta”
  3. Fixed order: Attention → FFN

    • Attention: find relevant information
    • FFN: process the information
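
Putting the three design points together, here is a minimal runnable sketch of a Pre-Norm block. It assumes PyTorch 2.4+ for `nn.RMSNorm`, and substitutes `nn.MultiheadAttention` plus a plain GELU MLP as stand-ins for the Tier 1 Attention-with-RoPE and FeedForward modules:

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Minimal Pre-Norm block: Norm -> sub-layer -> residual add, twice."""
    def __init__(self, hidden_size, num_heads=8):
        super().__init__()
        self.attn_norm = nn.RMSNorm(hidden_size)  # PyTorch >= 2.4
        # stand-in for the Tier 1 attention (the real module adds RoPE
        # and a causal mask)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads,
                                          batch_first=True)
        self.ffn_norm = nn.RMSNorm(hidden_size)
        # stand-in for the Tier 1 FeedForward
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )

    def forward(self, x):
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual #1
        x = x + self.ffn(self.ffn_norm(x))                 # residual #2
        return x
```

Both residual paths are pure identity adds: each sub-layer only learns the update (the "delta") applied to `x`.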

🚀 Learning advice

Follow this order:

  1. Residual Connection (understand residuals)
  2. Transformer Block (understand full assembly)

Practice projects (optional)

After finishing this tier, try:

Project 1: Implement a Transformer Decoder from scratch

```python
import torch.nn as nn

class TransformerDecoder(nn.Module):
    def __init__(self, num_layers=8, hidden_size=512, **block_kwargs):
        super().__init__()  # required before registering sub-modules
        # stack num_layers identical Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(hidden_size, **block_kwargs)
            for _ in range(num_layers)
        ])

    def forward(self, x):
        # each block maps (batch, seq_len, hidden_size) to the same shape
        for block in self.blocks:
            x = block(x)
        return x
```
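
A quick shape check for the sketch above, assuming the Pre-Norm block sketched earlier on this page:

```python
import torch

decoder = TransformerDecoder(num_layers=4, hidden_size=256)
x = torch.randn(2, 16, 256)   # (batch, seq_len, hidden_size)
print(decoder(x).shape)       # torch.Size([2, 16, 256])
```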

Project 2: Train a tiny model

  • Data: TinyShakespeare (1MB)
  • Model: 4-layer Transformer, 256 hidden size
  • Goal: perplexity < 2.5 (defined in the sketch below)
  • Time: ~30 minutes (CPU)
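
Perplexity is the exponential of the mean cross-entropy loss, so the goal corresponds to a loss below ln 2.5 ≈ 0.92. A minimal sketch of the metric, assuming `logits` shaped (batch, seq_len, vocab_size) and matching next-token `targets`:

```python
import torch
import torch.nn.functional as F

def perplexity(logits, targets):
    # mean cross-entropy over all predicted tokens, then exponentiate
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch * seq_len, vocab_size)
        targets.reshape(-1),                  # (batch * seq_len,)
    )
    return torch.exp(loss)
```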

📊 Completion check

After finishing this tier, you should be able to answer:

Theory

  • [ ] What is the mathematical form of a residual connection?
  • [ ] Why do residuals mitigate the vanishing-gradient problem?
  • [ ] How does gradient flow differ in Pre-Norm vs Post-Norm?
  • [ ] Why does Attention come before FFN?

Practice

  • [ ] Implement residual connections from scratch
  • [ ] Implement a Pre-Norm Transformer block
  • [ ] Stack blocks into a full model
  • [ ] Train a simple language model

Design intuition

  • [ ] Explain why deep nets need residuals
  • [ ] Explain why Pre-Norm is more stable than Post-Norm
  • [ ] Explain why double residuals are necessary
  • [ ] Draw the full Transformer block dataflow

🎓 Next step

After finishing this tier, go to: 👉 Tier 3: Training (planned)

You will learn how to train these models:

  • Data preparation and tokenization
  • Optimizers and LR scheduling
  • Distributed training
  • Evaluation and debugging

🔗 Resources

Code references

  • MiniMind: model/model_minimind.py:359-380 (TransformerBlock)
  • MiniMind: model/model_minimind.py:430-520 (full model)

Built on MiniMind for learning and experiments