Tier 1: Foundation
Understand the basic building blocks of a Transformer
🎯 Learning goals
After finishing this tier, you will be able to:
- ✅ Explain the role and math behind each core component
- ✅ Understand why modern LLMs choose these designs (over alternatives)
- ✅ Validate design choices via experiments
- ✅ Implement these components from scratch
📚 Module list
01. Normalization
Core questions:
- Why do deep networks need normalization?
- RMSNorm vs LayerNorm: what is different?
- Pre-LN vs Post-LN: why is Pre-LN more stable?
Key experiments:
- Exp 1: visualize gradient vanishing (no norm vs norm)
- Exp 2: compare four configs (NoNorm / Post-LN / Pre-LN / RMSNorm)
- Exp 3: precision impact (FP32 / FP16 / BF16)
Estimated time: 1 hour
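To make the RMSNorm-vs-LayerNorm comparison concrete, here is a minimal NumPy sketch (not the MiniMind implementation) of both: RMSNorm skips LayerNorm's mean-subtraction and bias, rescaling only by the root-mean-square of the features.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: rescale by the root-mean-square of the features;
    # no mean subtraction and no bias, unlike LayerNorm.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def layer_norm(x, weight, bias, eps=1e-6):
    # LayerNorm: subtract the mean, then divide by the standard deviation.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * weight + bias

x = np.random.randn(2, 8).astype(np.float32)
w = np.ones(8, dtype=np.float32)
y = rms_norm(x, w)
# After RMSNorm, each row's root-mean-square is ~1
print(np.sqrt(np.mean(y * y, axis=-1)))
```

Because RMSNorm drops the mean and bias terms, it does slightly less work per token, which is one reason modern LLMs prefer it.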
02. Position Encoding
Core questions:
- Why does Attention need position?
- How does RoPE work?
- Why does RoPE extrapolate better?
Key experiments:
- Exp 1: permutation invariance without position encoding
- Exp 2: RoPE vs absolute position encoding
- Exp 3: visualize multi-frequency mechanism
- Exp 4: length extrapolation test
Estimated time: 1.5 hours
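The RoPE idea above can be sketched in a few lines of NumPy (a simplified illustration, not the MiniMind implementation): each channel pair is rotated by an angle proportional to the token position, with a different frequency per pair, and the payoff is that attention scores depend only on the *relative* offset between positions.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    # Rotary Position Embedding: rotate channel pairs (x_i, x_{i+half})
    # by position * frequency; each pair uses a different frequency
    # (the "multi-frequency" mechanism).
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)     # (half,)
    angles = positions[:, None] * freqs[None, :]  # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

# Key property: dot products depend only on relative position.
q = np.random.randn(1, 8)
k = np.random.randn(1, 8)
pos = np.arange(1)
a = rope(q, pos + 3) @ rope(k, pos + 5).T    # positions 3 and 5
b = rope(q, pos + 10) @ rope(k, pos + 12).T  # positions 10 and 12
print(np.allclose(a, b))  # same offset (2) -> same attention score
```

This relative-position property is what the extrapolation experiment probes: scores for an offset seen during training stay meaningful even at absolute positions the model never saw.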
03. Attention
Core questions:
- What is the intuition behind QKV?
- Why do we need Multi-Head Attention?
- How does GQA improve efficiency?
Key experiments:
- Exp 1: visualize attention weights
- Exp 2: single-head vs multi-head
- Exp 3: GQA efficiency test
- Exp 4: causal mask effect
Estimated time: 2 hours
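Before running the experiments, it may help to see scaled dot-product attention with a causal mask in a minimal NumPy sketch (single head, no batching; an illustration rather than the repo's implementation):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sdpa(q, k, v, causal=True):
    # Scaled dot-product attention: scores = QK^T / sqrt(d),
    # optionally masked so position i only attends to j <= i.
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if causal:
        n = scores.shape[0]
        scores = np.where(np.tril(np.ones((n, n), bool)), scores, -np.inf)
    return softmax(scores) @ v

T, d = 4, 8
q = np.random.randn(T, d)
k = np.random.randn(T, d)
v = np.random.randn(T, d)
out = sdpa(q, k, v)
print(out.shape)  # (4, 8)
```

Note the causal-mask effect directly: the first token can only attend to itself, so its output equals `v[0]`. Multi-head attention runs several such heads on sliced projections, and GQA lets groups of query heads share one K/V head to shrink the KV cache.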
04. FeedForward
Core questions:
- What role does FFN play in Transformer?
- Why the expand–compress structure?
- SwiGLU vs ReLU: what is different?
Key experiments:
- Exp 1: expansion ratio impact
- Exp 2: activation function comparison
- Exp 3: ablation (remove FFN)
Estimated time: 1 hour
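The expand-compress structure with SwiGLU can be sketched as follows (a NumPy illustration with made-up weight names, not the MiniMind code): two projections expand the input, a SiLU-gated elementwise product combines them, and a third projection compresses back to the model dimension.

```python
import numpy as np

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU FFN: expand via two parallel projections (gate and up),
    # gate elementwise with SiLU, then compress back down.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

d, hidden = 8, 32
rng = np.random.default_rng(0)
x = rng.standard_normal((2, d))
w_gate = rng.standard_normal((d, hidden))
w_up = rng.standard_normal((d, hidden))
w_down = rng.standard_normal((hidden, d))
print(swiglu_ffn(x, w_gate, w_up, w_down).shape)  # (2, 8)
```

Because SwiGLU uses three weight matrices instead of ReLU's two, LLaMA-style models typically shrink the hidden size (to roughly 8/3 of the model dimension) to keep the parameter count comparable.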
🚀 Learning advice
Recommended order
Follow this order:
- Normalization → stabilize training
- Position Encoding → encode position
- Attention → core mechanism
- FeedForward → knowledge storage
Reason: later modules rely on earlier concepts.
Learning method
For each module:
Run experiments first (20 min)
- Build intuition: “so that’s what happens”
- No need to fully understand code yet
Read theory (20 min)
- Read teaching.md - understand the math and intuition
Read code (10 min)
- Read code_guide.md - linked to the MiniMind original implementation
Self-check (10 min)
- Complete quiz.md - check your understanding
Common questions
Q: What if experiments fail?
A: Check the following:
- Did you activate the virtual environment? `source venv/bin/activate`
- Did you download the data? `cd modules/common && python datasets.py --download-all`
- Are you in the correct folder? Experiments must run inside `experiments/`
Q: Experiments are too slow?
A: Use quick mode:
`python exp_xxx.py --quick`
Quick mode reduces the number of steps (to 100) so you can verify the concept quickly.
Q: Want to go deeper on a topic?
A: Each module’s teaching.md ends with “Further Reading”:
- Original papers
- Blogs
- Video tutorials
📊 Completion check
After finishing this tier, you should be able to answer:
Theory
- [ ] Why do deep networks need normalization?
- [ ] How does RoPE encode relative position?
- [ ] What do Q, K, V represent in attention?
- [ ] Why does FFN expand then compress?
Practice
- [ ] Implement RMSNorm from scratch
- [ ] Implement RoPE from scratch
- [ ] Implement Scaled Dot-Product Attention
- [ ] Implement SwiGLU FFN
Design intuition
- [ ] Explain why Pre-LN is more stable than Post-LN
- [ ] Explain why RoPE beats absolute position encoding
- [ ] Explain why Multi-Head beats Single-Head
- [ ] Explain why FFN cannot be removed
🎓 Next step
After finishing this tier, go to: 👉 Tier 2: Architecture
Learn how to assemble these components into a full Transformer block.