
03. Attention

How does Self-Attention work? What is the intuition behind Q, K, V?


🎯 Learning goals

After this module, you will be able to:

  • ✅ Understand the math of Self-Attention
  • ✅ Understand the intuition of Q, K, V
  • ✅ Understand the benefits of Multi-Head Attention
  • ✅ Understand GQA (Grouped Query Attention)
  • ✅ Implement Scaled Dot-Product Attention from scratch

📚 Learning path

1️⃣ Quick experience (15 min)

```bash
cd experiments

# Exp 1: Attention basics
python exp1_attention_basics.py

# Exp 2: Q, K, V explained
python exp2_qkv_explained.py
```

🔬 Experiment list

| Experiment | Purpose | Time |
|---|---|---|
| exp1_attention_basics.py | Permutation invariance + basic computation | 5 min |
| exp2_qkv_explained.py | Intuition for Q, K, V | 5 min |
| exp3_multihead_attention.py | Multi-head mechanism | 10 min |

💡 Key points

1. Core Self-Attention formula

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

Intuition:

  • $QK^\top$: relevance scores (who relates to whom)
  • $\sqrt{d_k}$: scaling factor (avoids softmax saturation)
  • softmax: normalize to a probability distribution
  • $\times V$: weighted sum (extract relevant info)
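
A minimal PyTorch sketch of this formula (the function name, shapes, and toy inputs are illustrative, not taken from the MiniMind source):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, d_k); mask broadcastable to (batch, seq_len, seq_len)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # QK^T / sqrt(d_k): relevance scores
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                  # normalize to a probability distribution
    return weights @ v, weights                          # weighted sum of the values

# Toy check: 1 batch, 4 tokens, d_k = 8; self-attention uses the same source for Q, K, V
q = torch.randn(1, 4, 8)
out, w = scaled_dot_product_attention(q, q, q)
print(out.shape, w.shape)    # torch.Size([1, 4, 8]) torch.Size([1, 4, 4])
```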

2. Intuition for Q, K, V

| Role | Full name | Question | Analogy |
|---|---|---|---|
| Q | Query | What am I looking for? | Library search |
| K | Key | What labels do I provide? | Book keywords |
| V | Value | What content do I return? | Book contents |

Example: sentence "The cat sat on the mat"

  • Query of "cat": “find info about animals”
  • Key of "sat": “I am an action word”
  • If the Q·K score is high → the output for "cat" draws heavily on the Value of "sat"
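
The key point is that Q, K, and V are three learned linear projections of the same token embeddings. A small sketch under assumed toy dimensions (the weight names W_q/W_k/W_v are illustrative):

```python
import torch
import torch.nn as nn

d_model, d_k = 16, 16
x = torch.randn(1, 6, d_model)               # 6 token embeddings, e.g. "The cat sat on the mat"

# Three learned linear maps turn the same embeddings into three different "views"
W_q = nn.Linear(d_model, d_k, bias=False)    # what each token is looking for
W_k = nn.Linear(d_model, d_k, bias=False)    # what each token advertises
W_v = nn.Linear(d_model, d_k, bias=False)    # what each token hands over when selected

Q, K, V = W_q(x), W_k(x), W_v(x)
scores = Q @ K.transpose(-2, -1)             # scores[0, 1, 2] ~ how much "cat" attends to "sat"
print(scores.shape)                          # torch.Size([1, 6, 6])
```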

3. Why Multi-Head?

Problem: single-head attention learns only one “relation pattern.”

Solution: multiple heads in parallel, each learns a different pattern

  • Head 1: syntax (subject–verb–object)
  • Head 2: semantics (synonyms)
  • Head 3: position (neighboring words)
  • ...

Formula:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O$$
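
A compact sketch of the split–attend–concat pattern (a simplified stand-in, not MiniMind's actual attention class; hyperparameters are placeholders):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=64, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        self.w_o = nn.Linear(d_model, d_model, bias=False)    # the W^O in the formula

    def forward(self, x):
        b, t, _ = x.shape
        # Project, then split the feature dimension into n_heads independent heads
        q = self.w_q(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.w_k(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.w_v(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        out = torch.softmax(scores, dim=-1) @ v                # each head attends independently
        out = out.transpose(1, 2).contiguous().view(b, t, -1)  # Concat(head_1, ..., head_h)
        return self.w_o(out)

x = torch.randn(2, 10, 64)
print(MultiHeadAttention()(x).shape)    # torch.Size([2, 10, 64])
```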

4. GQA (Grouped Query Attention)

MHA: each head has its own Q, K, V

  • Params: 3 × n_heads × head_dim

GQA: multiple Q heads share a set of K, V

  • Params: (n_heads + 2 × n_kv_heads) × head_dim
  • Smaller KV cache, faster inference

MiniMind: n_heads=8, n_kv_heads=2

  • 8 Q heads share 2 KV heads
  • Every 4 Q heads share one KV head
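
A shape-level sketch of the sharing (random tensors and an assumed head_dim=16 stand in for real projections; MiniMind's real layer is in model/model_minimind.py):

```python
import torch

n_heads, n_kv_heads, head_dim, t = 8, 2, 16, 10
group = n_heads // n_kv_heads                  # 4 query heads per KV head

q = torch.randn(1, n_heads, t, head_dim)       # 8 query heads
k = torch.randn(1, n_kv_heads, t, head_dim)    # only 2 key heads
v = torch.randn(1, n_kv_heads, t, head_dim)    # only 2 value heads -> smaller KV cache

# Replicate each KV head across its group of 4 query heads
k = k.repeat_interleave(group, dim=1)          # (1, 8, t, head_dim)
v = v.repeat_interleave(group, dim=1)

scores = q @ k.transpose(-2, -1) / head_dim ** 0.5
out = torch.softmax(scores, dim=-1) @ v
print(out.shape)                               # torch.Size([1, 8, 10, 16])
```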


✅ Completion check

After finishing, you should be able to:

Theory

  • [ ] Write the attention formula
  • [ ] Explain the roles of Q, K, V
  • [ ] Explain the role of $\sqrt{d_k}$
  • [ ] Explain why multi-head helps

Practice

  • [ ] Implement Scaled Dot-Product Attention from scratch
  • [ ] Visualize attention weights
  • [ ] Understand causal masking
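
For the causal masking item above, a minimal sketch: a lower-triangular mask blocks attention to future tokens before the softmax (the shapes here are toy values):

```python
import torch

t = 5
# Lower-triangular mask: token i may only attend to positions <= i
causal_mask = torch.tril(torch.ones(t, t, dtype=torch.bool))

scores = torch.randn(t, t)
scores = scores.masked_fill(~causal_mask, float("-inf"))  # hide future positions
weights = torch.softmax(scores, dim=-1)
print(weights[0])    # the first token attends only to itself: [1, 0, 0, 0, 0]
```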

🔗 Resources


Code reference

  • MiniMind: model/model_minimind.py:250-330

🎓 Next step

After finishing this module, go to: 👉 04. FeedForward

Built on MiniMind for learning and experiments