
Normalization Quiz

Check your understanding of core concepts in the Normalization module


🧠 Quiz (multiple choice)

Q1

What is the main cause of gradient vanishing in deep networks?

  A. Learning rate is too high
  B. Activation scale shrinks across layers so signals vanish
  C. The model has too many parameters
  D. The dataset is too small

Q2

What is the key difference between RMSNorm and LayerNorm?

  A. RMSNorm uses more trainable parameters
  B. RMSNorm does not subtract the mean; it normalizes by RMS only
  C. RMSNorm is only used in Transformers
  D. RMSNorm is slower than LayerNorm

Q3

Why is Pre-LN more stable than Post-LN?

  A. It has better GPU utilization
  B. It reduces parameter count
  C. It keeps a clean residual path and stabilizes gradients
  D. It removes normalization completely

Q4

In MiniMind, how many RMSNorm layers does each TransformerBlock contain?

  A. 1
  B. 2
  C. 4
  D. 8

Q5

Why does RMSNorm use .float() and .type_as(x) in forward?

  A. To speed up the CPU
  B. To reduce parameter count
  C. To avoid FP16/BF16 underflow during normalization
  D. To avoid PyTorch errors

💬 Open questions

Q6: Debugging scenario

You train a 16-layer Transformer and the loss becomes NaN around step 100. Based on what you learned, list possible causes and fixes.

Suggested answer

Possible causes:

  1. Missing normalization → vanishing/exploding gradients
  2. Using Post-LN → unstable in deep stacks
  3. Learning rate set too high
  4. Poor initialization
  5. FP16 numerical instability

Fixes (top options):

  1. Switch to Pre-LN + RMSNorm (see the RMSNorm sketch after this list)

    ```python
    class TransformerBlock(nn.Module):
        def forward(self, x):
            x = x + self.attn(self.norm1(x))  # Pre-Norm: normalize before attention
            x = x + self.ffn(self.norm2(x))   # Pre-Norm: normalize before the FFN
            return x
    ```
  2. Lower the learning rate (see the warmup sketch after this list)

    • From 1e-3 down to 1e-4 or lower
    • Add warmup
  3. Check initialization (see the initialization sketch after this list)

    • Use Kaiming or Xavier initialization
    • Ensure activation std starts near 1.0
  4. Use gradient clipping

    ```python
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    ```
  5. Prefer BF16 over FP16

    ```python
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        output = model(input)
    ```
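
For fix 1, here is a minimal RMSNorm sketch in the common LLaMA-style formulation; it also illustrates the `.float()` / `.type_as(x)` pattern asked about in Q5. MiniMind's actual implementation may differ in details:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm: no mean subtraction, normalization by root-mean-square only."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # single learned scale, no bias

    def forward(self, x):
        # Do the reduction in float32 so squaring small FP16/BF16 values does not
        # underflow, then cast back to the input dtype (the pattern from Q5).
        rms = torch.rsqrt(x.float().pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x.float() * rms).type_as(x) * self.weight
```

Pre-LN then simply means that `norm1`/`norm2` in the TransformerBlock above are RMSNorm instances applied before attention and the FFN, not after.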
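
For fix 2, one simple way to add warmup is a LambdaLR schedule. The peak learning rate and warmup length below are placeholder values, not settings taken from MiniMind:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)  # stand-in for the real Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # lowered peak LR

warmup_steps = 500  # placeholder warmup length

def lr_lambda(step: int) -> float:
    # Ramp the LR linearly from ~0 to the peak over warmup_steps, then hold it.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In the training loop, call optimizer.step() and then scheduler.step() each step.
```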
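
For fix 3, a sketch of Xavier initialization applied to the linear layers (swap in `nn.init.kaiming_normal_` if you prefer Kaiming):

```python
import torch.nn as nn

def init_weights(module: nn.Module):
    # Xavier-normal init keeps the activation std close to 1.0 at the start of training.
    if isinstance(module, nn.Linear):
        nn.init.xavier_normal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Usage: model.apply(init_weights)  # applies the function recursively to every submodule
```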

Debug checklist:

  1. Log activation stats per layer (mean/std/max/min); see the hook sketch below
  2. Monitor gradients for NaN/Inf values
  3. Compare with smaller configs
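
One way to implement checklist items 1 and 2 is with forward hooks plus a quick gradient scan after `loss.backward()`. This is only a sketch; what you log should be adapted to your model:

```python
import torch
import torch.nn as nn

def log_activation_stats(model: nn.Module):
    # Attach a forward hook to every submodule that prints activation statistics.
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor):
            print(f"{module.__class__.__name__}: "
                  f"mean={output.mean():.4f} std={output.std():.4f} "
                  f"max={output.max():.4f} min={output.min():.4f}")
    return [m.register_forward_hook(hook) for m in model.modules()]

def grads_have_nan(model: nn.Module) -> bool:
    # Call after loss.backward(): True if any gradient contains NaN or Inf.
    return any(p.grad is not None and not torch.isfinite(p.grad).all()
               for p in model.parameters())
```

Remove the hooks (call `.remove()` on each returned handle) once you have located the layer where the statistics blow up.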

Q7: Explain the intuition

Explain in your own words why normalization stabilizes training, using a simple analogy.

Suggested answer

Example intuition:

  • Without normalization: each layer acts like a filter that can amplify or shrink signals, so the signal scale quickly drifts.
  • With normalization: each layer first normalizes the input to a stable scale, so gradients flow reliably.

Analogy:

  • A water system without pressure control: higher floors get almost no water.
  • A system with pressure regulators: every floor receives stable pressure.

Key takeaway: Normalization keeps signal scale stable, so gradients remain healthy and training converges.


✅ Completion checklist

  • [ ] Q1–Q5 answered correctly (core concepts confirmed)
  • [ ] Q6: listed at least three causes with matching fixes (practical debugging)
  • [ ] Q7: gave a clear analogy (intuitive understanding)

If you are not fully confident, review teaching.md and rerun the experiments.


🎓 Next step

Continue to 02. Position Encoding.

Built on MiniMind for learning and experiments