
Attention Quiz

Answer the following questions to check your understanding.


Q1: Why divide by √d_k in Self-Attention?

  A. To speed up computation
  B. To reduce parameters
  C. To prevent softmax saturation and stabilize gradients
  D. To support longer sequences

Q2: What do Q, K, V represent?

  A. Query=question, Key=key, Value=value
  B. Query=problem, Key=keyword, Value=answer
  C. Query=what I want, Key=what tags I have, Value=actual content
  D. Query=input, Key=output, Value=intermediate

Q3: What is the main advantage of Multi-Head Attention?

  A. Reduce computation
  B. Reduce parameters
  C. Learn multiple relationship patterns in parallel
  D. Support longer sequences

Q4: What does GQA (Grouped Query Attention) do?

  A. Improve accuracy
  B. Reduce KV cache memory and speed up inference
  C. Increase model capacity
  D. Support more heads

Q5: What is the role of a causal mask?

  A. Speed up computation
  B. Prevent the model from seeing future tokens
  C. Reduce parameters
  D. Improve accuracy

Q6: What does repeat_kv do in GQA?

  A. Copy K and V to match the number of Q heads
  B. Increase KV cache size
  C. Improve numerical precision
  D. Reduce memory usage

Q7: What is the advantage of Flash Attention over the standard implementation?

  A. Improves accuracy
  B. Reduces parameters
  C. Reduces GPU memory traffic and speeds up computation
  D. Supports more attention heads

🎯 Comprehensive Questions

Q8: Practical scenario

If you find that attention weights are almost uniform (each position is ~1/seq_len), what might be wrong and how can you fix it?

Reference answer:

Possible causes:

  1. Q/K initialization issues:

    • projection weights initialized too small
    • Q·K scores end up near 0
    • softmax(0, 0, ..., 0) ≈ uniform (see the sketch after this list)
  2. Scaling factor issues:

    • divided by too large a value
    • or forgot the square root (divide by d_k instead of √d_k)
  3. Head dim misconfigured:

    • overly large head_dim makes variance huge
    • but this tends to cause extreme, not uniform, distributions
  4. Model didn’t learn meaningful patterns:

    • data issues
    • insufficient capacity
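
To see how near-zero scores produce uniform attention (cause 1), here is a minimal, self-contained sketch, assuming PyTorch: softmax over near-zero scores is almost uniform, while scores with std ≈ 1 give a clearly peaked distribution.

python
import torch

torch.manual_seed(0)
tiny = torch.randn(8) * 0.01    # scores with std ≈ 0.01 (e.g., under-scaled Q·K)
healthy = torch.randn(8)        # scores with std ≈ 1

print(torch.softmax(tiny, dim=-1))     # all values ≈ 1/8 → near-uniform attention
print(torch.softmax(healthy, dim=-1))  # clearly peaked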

Diagnostics:

python
# check Q·K scores before softmax (xq, xk: [batch, n_heads, seq_len, head_dim])
scores = torch.matmul(xq, xk.transpose(-2, -1)) / math.sqrt(head_dim)
print(f"scores mean: {scores.mean()}, std: {scores.std()}")

# normal: mean ≈ 0, std ≈ 1
# problematic: std too small (near 0)
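
Beyond checking mean and std, the uniformity can be quantified directly (a sketch; attn_weights is assumed to be the post-softmax tensor of shape [batch, n_heads, seq_len, seq_len]): if the mean row entropy is close to log(seq_len), attention is effectively uniform.

python
import torch

def attention_entropy(attn_weights: torch.Tensor) -> torch.Tensor:
    # attn_weights: [batch, n_heads, seq_len, seq_len], rows sum to 1 (assumed)
    p = attn_weights.clamp_min(1e-9)
    return -(p * p.log()).sum(dim=-1)   # entropy per query position

# example: near-zero scores → near-uniform weights → entropy ≈ log(seq_len)
w = torch.softmax(torch.randn(1, 4, 16, 16) * 0.01, dim=-1)
print(attention_entropy(w).mean().item())       # ≈ 2.77
print(torch.log(torch.tensor(16.0)).item())     # log(16) ≈ 2.77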

Fixes:

  1. Check initialization:

    python
    # use Xavier or Kaiming initialization
    nn.init.xavier_uniform_(self.wq.weight)
  2. Verify scaling factor:

    python
    # ensure sqrt(head_dim), not head_dim
    scale = math.sqrt(self.head_dim)
  3. Visualize attention:

    python
    import matplotlib.pyplot as plt
    plt.imshow(attn_weights[0, 0].detach().numpy())  # first batch element, first head
    plt.title("Attention weights")
    plt.colorbar()
    plt.show()

Q9: Conceptual understanding

Why do Q, K, V all come from the same input x, but still require three different projection matrices? Can we just use x directly?

Reference answer:

Problem with using x directly:

If Q = K = V = x, then:

python
scores = x @ x.T  # self dot-products

This is essentially raw dot-product similarity between token embeddings (cosine similarity scaled by the vector norms).

Issues:

  1. Symmetry: the raw score between token_i and token_j is the same in both directions

    • but language relations are often asymmetric
    • e.g., “cat eats fish”: cat should attend to fish, fish doesn’t need to attend to cat
  2. Limited expressiveness:

    • only captures similarity
    • cannot represent relations like subject/object or modifier

Why three projections matter:

python
Q = x @ W_Q  # "as a query, what should I focus on?"
K = x @ W_K  # "as a key, what should I reveal?"
V = x @ W_V  # "what content should I pass?"

Advantages:

  1. Asymmetry: Q and K differ, enabling directional relations
  2. Role separation: query view, key view, content view
  3. Expressiveness: can learn arbitrary relation patterns

Analogy:

  • Library search:
    • your question (Q): written in natural language
    • index labels (K): keywords/tags
    • content (V): the actual text
  • each speaks a different “language,” but matching the question against the labels retrieves the right content
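
A minimal sketch of the symmetry argument (assuming PyTorch; the sizes are made up for illustration): without projections the score matrix is symmetric, while separate W_Q and W_K generally break the symmetry.

python
import torch

torch.manual_seed(0)
d = 16
x = torch.randn(5, d)                    # 5 tokens with dimension d (made-up sizes)
W_Q, W_K = torch.randn(d, d), torch.randn(d, d)

no_proj = x @ x.T                        # raw scores: score(i, j) == score(j, i)
with_proj = (x @ W_Q) @ (x @ W_K).T      # generally asymmetric

print(torch.allclose(no_proj, no_proj.T))      # True
print(torch.allclose(with_proj, with_proj.T))  # False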

Q10: Code understanding

Explain why contiguous() is necessary here:

python
output = output.transpose(1, 2).contiguous().view(batch, seq_len, -1)

Reference answer:

Background:

PyTorch tensors have two concepts:

  1. Storage: physical memory layout
  2. View: how you interpret that memory

What transpose does:

python
# suppose output is [batch, n_heads, seq, head_dim]
# memory is laid out in that order

output = output.transpose(1, 2)
# now shape is [batch, seq, n_heads, head_dim]
# but memory layout is unchanged

Problem:

view() requires contiguous memory, but transpose makes it non-contiguous:

python
# original memory order (simplified):
# [head0_pos0, head0_pos1, head1_pos0, head1_pos1, ...]

# logical order after transpose:
# [head0_pos0, head1_pos0, head0_pos1, head1_pos1, ...]

# non-contiguous → view fails
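
A self-contained check of this behavior (a sketch with made-up shapes):

python
import torch

x = torch.randn(2, 4, 3, 8)        # [batch, n_heads, seq, head_dim] (made-up sizes)
y = x.transpose(1, 2)              # [batch, seq, n_heads, head_dim]
print(y.is_contiguous())           # False

try:
    y.view(2, 3, -1)               # fails: view() needs compatible contiguous memory
except RuntimeError as e:
    print("view failed:", e)

z = y.contiguous().view(2, 3, -1)  # works after the copy
print(z.shape)                     # torch.Size([2, 3, 32])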

What contiguous() does:

python
output = output.transpose(1, 2).contiguous()
# 1. reallocate memory
# 2. reorder data to match new view
# 3. now view is safe

Performance note:

  • contiguous() copies the data, which has a cost
  • but the copy is unavoidable before calling view() on a transposed tensor
  • alternative: reshape() performs the same copy implicitly when needed, but the cost is then less visible in the code

Best practice:

python
# when you know you need contiguous memory
output = output.transpose(1, 2).contiguous().view(...)

# or use reshape
output = output.transpose(1, 2).reshape(...)

✅ Completion check

After finishing all questions, check whether you can:

  • [ ] Get Q1–Q7 all correct: solid basics
  • [ ] Provide 2+ diagnostics in Q8: debugging ability
  • [ ] Explain projections in Q9: design principles understood
  • [ ] Explain why contiguous() is needed in Q10: PyTorch memory model

If anything is unclear, return to teaching.md or rerun the experiments.


🎓 Advanced challenge

Want to go deeper? Try:

  1. Modify experiment code:

    • implement Attention without scaling and observe the softmax output (a starter sketch follows this list)
    • implement MQA (Multi-Query Attention) and compare with GQA
    • visualize attention patterns from different heads
  2. Read papers:

  3. Implement variants:

    • implement Cross-Attention (Q and KV from different inputs)
    • implement Sliding Window Attention
    • implement Sparse Attention
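
For the first item, a possible starting point (a minimal sketch, assuming PyTorch; sizes are made up): compare softmax over unscaled and scaled scores and observe how much more saturated the unscaled version is.

python
import torch

torch.manual_seed(0)
head_dim = 64
q = torch.randn(8, head_dim)            # 8 query positions (made-up size)
k = torch.randn(8, head_dim)

scores = q @ k.T                        # unscaled: variance grows with head_dim
scaled = scores / head_dim ** 0.5       # scaled by √d_k

print("unscaled max weight:", torch.softmax(scores, dim=-1).max().item())
print("scaled max weight:  ", torch.softmax(scaled, dim=-1).max().item())
# the unscaled softmax is far more saturated (max weight close to 1)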

Next: go to 04. FeedForward to learn about feed-forward networks.

Built on MiniMind for learning and experiments