03. Attention
How does Self-Attention work? What is the intuition behind Q, K, V?
🎯 Learning goals
After this module, you will be able to:
- ✅ Understand the math of Self-Attention
- ✅ Understand the intuition of Q, K, V
- ✅ Understand the benefits of Multi-Head Attention
- ✅ Understand GQA (Grouped Query Attention)
- ✅ Implement Scaled Dot-Product Attention from scratch
📚 Learning path
1️⃣ Quick experience (15 min)
```bash
cd experiments
# Exp 1: Attention basics
python exp1_attention_basics.py
# Exp 2: Q, K, V explained
python exp2_qkv_explained.py
```
🔬 Experiment list
| Experiment | Purpose | Time |
|---|---|---|
| exp1_attention_basics.py | permutation invariance + basic computation | 5 min |
| exp2_qkv_explained.py | intuition for Q, K, V | 5 min |
| exp3_multihead_attention.py | multi-head mechanism | 10 min |
💡 Key points
1. Core Self-Attention formula

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Intuition:
- $QK^\top$: relevance scores (who relates to whom)
- $\sqrt{d_k}$: scaling factor (avoids softmax saturation)
- softmax: normalizes each row to a probability distribution
- multiplying by $V$: weighted sum (extracts the relevant info)
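A minimal PyTorch sketch of the formula above. The function name, tensor shapes, and mask convention are illustrative choices, not the MiniMind implementation:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, seq_len, d_k); mask: 0 where attention is disallowed."""
    d_k = q.size(-1)
    # Q @ K^T: relevance scores (who relates to whom), scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    # softmax: normalize each row into a probability distribution
    weights = F.softmax(scores, dim=-1)
    # weighted sum over V: extract the relevant information
    return weights @ v, weights

# quick shape check with random tensors (self-attention: Q = K = V = x)
x = torch.randn(1, 6, 16)
out, w = scaled_dot_product_attention(x, x, x)
print(out.shape, w.shape)  # torch.Size([1, 6, 16]) torch.Size([1, 6, 6])
```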
2. Intuition for Q, K, V
| Role | Full name | Question | Analogy |
|---|---|---|---|
| Q | Query | What am I looking for? | Library search |
| K | Key | What labels do I provide? | Book keywords |
| V | Value | What content do I return? | Book contents |
Example: sentence "The cat sat on the mat"
- Query of "cat": “find info about animals”
- Key of "sat": “I am an action word”
- If Q·K matches → use the Value of "sat"
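A small sketch of where Q, K, V come from: the same token embeddings pass through three learned linear projections, and row $i$ of $QK^\top$ scores token $i$'s query against every token's key. The dimensions and layer names below are made up for illustration:

```python
import torch
import torch.nn as nn

d_model = 64                            # embedding size (illustrative)
x = torch.randn(1, 6, d_model)          # e.g. "The cat sat on the mat" -> 6 token embeddings

w_q = nn.Linear(d_model, d_model, bias=False)   # Q: "what am I looking for?"
w_k = nn.Linear(d_model, d_model, bias=False)   # K: "what labels do I provide?"
w_v = nn.Linear(d_model, d_model, bias=False)   # V: "what content do I return?"

q, k, v = w_q(x), w_k(x), w_v(x)
# a high score between "cat"'s query and "sat"'s key means
# "sat"'s value gets a large weight in "cat"'s output
scores = q @ k.transpose(-2, -1)        # (1, 6, 6) relevance matrix
print(scores.shape)
```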
3. Why Multi-Head?
Problem: single-head attention learns only one “relation pattern.”
Solution: multiple heads in parallel, each learns a different pattern
- Head 1: syntax (subject–verb–object)
- Head 2: semantics (synonyms)
- Head 3: position (neighboring words)
- ...
Formula:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O, \quad \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
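A compact sketch of this formula, assuming d_model = 64 and 8 heads (illustrative values, not tied to the MiniMind code):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)   # W^O

    def forward(self, x):
        b, t, _ = x.shape
        # project, then split the feature dim into n_heads independent heads
        def split(z):
            return z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        out = scores.softmax(dim=-1) @ v                # each head attends in parallel
        out = out.transpose(1, 2).reshape(b, t, -1)     # Concat(head_1, ..., head_h)
        return self.w_o(out)                            # final projection W^O

mha = MultiHeadAttention()
print(mha(torch.randn(1, 6, 64)).shape)  # torch.Size([1, 6, 64])
```

Because every head works on its own head_dim-sized slice, the total compute stays roughly the same as one full-width head, while each head is free to learn a different pattern.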
4. GQA (Grouped Query Attention)
MHA: each head has its own Q, K, V
- Projection outputs per token: 3 × n_heads × head_dim
GQA: several Q heads share one set of K, V heads
- Projection outputs per token: (n_heads + 2 × n_kv_heads) × head_dim
- Smaller KV cache, faster inference
MiniMind: n_heads=8, n_kv_heads=2
- 8 Q heads share 2 KV heads
- Every 4 Q heads share one KV head
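A sketch of the GQA trick with the 8/2 split above; the tensors are random placeholders, and `repeat_interleave` simply copies each cached KV head across its group of 4 Q heads before the usual attention math:

```python
import torch

b, t, head_dim = 1, 6, 16
n_heads, n_kv_heads = 8, 2             # MiniMind-style split: 8 Q heads, 2 KV heads
group = n_heads // n_kv_heads          # every 4 Q heads share one KV head

q = torch.randn(b, n_heads, t, head_dim)
k = torch.randn(b, n_kv_heads, t, head_dim)   # only 2 K heads are stored / cached
v = torch.randn(b, n_kv_heads, t, head_dim)   # only 2 V heads are stored / cached

# expand each KV head so it lines up with its group of Q heads
k_rep = k.repeat_interleave(group, dim=1)     # (b, n_heads, t, head_dim)
v_rep = v.repeat_interleave(group, dim=1)

scores = q @ k_rep.transpose(-2, -1) / head_dim ** 0.5
out = scores.softmax(dim=-1) @ v_rep          # (b, n_heads, t, head_dim)
print(out.shape)
```

Only the n_kv_heads K and V tensors ever need to be cached, so the KV cache shrinks by a factor of n_heads / n_kv_heads (4× with this split).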
📖 Docs
- 📘 teaching.md - full concept explanation
- 💻 code_guide.md - MiniMind code walkthrough
- 📝 quiz.md - self-check
✅ Completion check
After finishing, you should be able to:
Theory
- [ ] Write the attention formula
- [ ] Explain the roles of Q, K, V
- [ ] Explain the role of the $\sqrt{d_k}$ scaling factor
- [ ] Explain why multi-head helps
Practice
- [ ] Implement Scaled Dot-Product Attention from scratch
- [ ] Visualize attention weights
- [ ] Understand causal masking
🔗 Resources
Papers
- Attention Is All You Need - original Transformer
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints - grouped-query attention
Code reference
- MiniMind:
model/model_minimind.py:250-330
🎓 Next step
After finishing this module, go to: 👉 04. FeedForward