
Normalization Quiz

Check your understanding of core concepts in the Normalization module


🧠 Quiz (multiple choice)

Q1

What is the main cause of gradient vanishing in deep networks?

  A. Learning rate is too high
  B. Activation scale shrinks across layers so signals vanish
  C. The model has too many parameters
  D. The dataset is too small

Q2

What is the key difference between RMSNorm and LayerNorm?

  A. RMSNorm uses more trainable parameters
  B. RMSNorm does not subtract the mean; it normalizes by RMS only
  C. RMSNorm is only used in Transformers
  D. RMSNorm is slower than LayerNorm

Q3

Why is Pre-LN more stable than Post-LN?

  A. It has better GPU utilization
  B. It reduces parameter count
  C. It keeps a clean residual path and stabilizes gradients
  D. It removes normalization completely

Q4

In MiniMind, how many RMSNorm layers does each TransformerBlock contain?

  A. 1
  B. 2
  C. 4
  D. 8

Q5

Why does RMSNorm use .float() and .type_as(x) in forward?

  A. To speed up the CPU
  B. To reduce parameter count
  C. To avoid FP16/BF16 underflow during normalization
  D. To avoid PyTorch errors

💬 Open questions

Q6: Debugging scenario

You train a 16-layer Transformer and the loss becomes NaN around step 100. Based on what you learned, list possible causes and fixes.

Suggested answer

Possible causes:

  1. Missing normalization → vanishing/exploding gradients
  2. Using Post-LN → unstable in deep stacks
  3. Learning rate set too high
  4. Poor initialization
  5. FP16 numerical instability

Fixes (top options):

  1. Switch to Pre-LN + RMSNorm (see the RMSNorm sketch after this list)

    ```python
    class TransformerBlock(nn.Module):
        def forward(self, x):
            x = x + self.attn(self.norm1(x))  # Pre-Norm: normalize before attention
            x = x + self.ffn(self.norm2(x))   # Pre-Norm: normalize before the FFN
            return x
    ```
  2. Lower the learning rate (see the warmup sketch after this list)

    • From 1e-3 down to 1e-4 or lower
    • Add warmup
  3. Check initialization (see the initialization sketch after this list)

    • Use Kaiming or Xavier initialization
    • Ensure activation std starts near 1.0
  4. Use gradient clipping

    ```python
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    ```
  5. Prefer BF16 over FP16

    ```python
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        output = model(input)
    ```
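
For fix 1, here is a minimal RMSNorm sketch in the common LLaMA-style formulation; it also illustrates the `.float()` / `.type_as(x)` pattern asked about in Q5. MiniMind's actual implementation may differ in details:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm: no mean subtraction, normalization by root-mean-square only."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # single learned scale, no bias

    def forward(self, x):
        # Do the reduction in float32 so squaring small FP16/BF16 values does not
        # underflow, then cast back to the input dtype (the pattern from Q5).
        rms = torch.rsqrt(x.float().pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return (x.float() * rms).type_as(x) * self.weight
```

Pre-LN then simply means that `norm1`/`norm2` in the TransformerBlock above are RMSNorm instances applied before attention and the FFN, not after.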
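
For fix 2, one simple way to add warmup is a LambdaLR schedule. The peak learning rate and warmup length below are placeholder values, not settings taken from MiniMind:

```python
import torch
import torch.nn as nn

model = nn.Linear(512, 512)  # stand-in for the real Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # lowered peak LR

warmup_steps = 500  # placeholder warmup length

def lr_lambda(step: int) -> float:
    # Ramp the LR linearly from ~0 to the peak over warmup_steps, then hold it.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# In the training loop, call optimizer.step() and then scheduler.step() each step.
```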
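
For fix 3, a sketch of Xavier initialization applied to the linear layers (swap in `nn.init.kaiming_normal_` if you prefer Kaiming):

```python
import torch.nn as nn

def init_weights(module: nn.Module):
    # Xavier-normal init keeps the activation std close to 1.0 at the start of training.
    if isinstance(module, nn.Linear):
        nn.init.xavier_normal_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Usage: model.apply(init_weights)  # applies the function recursively to every submodule
```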

Debug checklist:

  1. Log activation stats per layer (mean/std/max/min); see the hook sketch below
  2. Monitor gradients for NaN/Inf values
  3. Compare with smaller configs
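
One way to implement checklist items 1 and 2 is with forward hooks plus a quick gradient scan after `loss.backward()`. This is only a sketch; what you log should be adapted to your model:

```python
import torch
import torch.nn as nn

def log_activation_stats(model: nn.Module):
    # Attach a forward hook to every submodule that prints activation statistics.
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor):
            print(f"{module.__class__.__name__}: "
                  f"mean={output.mean():.4f} std={output.std():.4f} "
                  f"max={output.max():.4f} min={output.min():.4f}")
    return [m.register_forward_hook(hook) for m in model.modules()]

def grads_have_nan(model: nn.Module) -> bool:
    # Call after loss.backward(): True if any gradient contains NaN or Inf.
    return any(p.grad is not None and not torch.isfinite(p.grad).all()
               for p in model.parameters())
```

Remove the hooks (call `.remove()` on each returned handle) once you have located the layer where the statistics blow up.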

Q7: Explain the intuition

Explain in your own words why normalization stabilizes training, using a simple analogy.

Suggested answer

Example intuition:

  • Without normalization: each layer acts like a filter that can amplify or shrink signals, so the signal scale quickly drifts.
  • With normalization: each layer first normalizes the input to a stable scale, so gradients flow reliably.

Analogy:

  • A water system without pressure control: higher floors get almost no water.
  • A system with pressure regulators: every floor receives stable pressure.

Key takeaway: Normalization keeps signal scale stable, so gradients remain healthy and training converges.


✅ Completion checklist

  • [ ] Q1–Q5 answered correctly (core concepts confirmed)
  • [ ] Q6: listed at least three causes with matching fixes (practical debugging)
  • [ ] Q7: gave a clear analogy (intuitive understanding)

If you are not fully confident, review teaching.md and rerun the experiments.


🎓 Next step

Continue to 02. Position Encoding.

Built on MiniMind for learning and experiments