FeedForward Quiz
Answer the following questions to check your understanding.
🎮 Interactive Quiz (Recommended)
Q1: Why does FeedForward use "expand-compress"?
Q2: How many projection matrices does SwiGLU use?
Q3: What is the formula for SiLU?
Q4: What does the gating mechanism do in SwiGLU?
Q5: What is the main difference between FeedForward and Attention?
Q6: Why is SwiGLU often better than ReLU?
Q7: In MiniMind, intermediate_size is typically how many times hidden_size?
🎯 Comprehensive Questions
Q8: Practical scenario
If FeedForward outputs are always close to 0, what might be wrong and how would you debug it?
Show reference answer
Possible causes:
Initialization issues:
- weights initialized too small
- outputs become tiny
Gate signal issues:
- gate_proj outputs mostly negative
- SiLU(negative) ≈ 0
- gate * up ≈ 0
Vanishing gradients:
- deep network, gradients don’t flow
- weights stop updating
Precision issues:
- low precision (e.g., FP16)
- small values underflow to 0
Debug steps:
```python
# 1. inspect intermediates
gate = self.gate_proj(x)
print(f"gate mean: {gate.mean()}, std: {gate.std()}")
silu_gate = F.silu(gate)
print(f"silu_gate mean: {silu_gate.mean()}, std: {silu_gate.std()}")
up = self.up_proj(x)
print(f"up mean: {up.mean()}, std: {up.std()}")
hidden = silu_gate * up
print(f"hidden mean: {hidden.mean()}, std: {hidden.std()}")

# 2. check weight initialization
for name, param in self.named_parameters():
    print(f"{name}: mean={param.mean():.6f}, std={param.std():.6f}")

# 3. check gradients
output.sum().backward()
for name, param in self.named_parameters():
    if param.grad is not None:
        print(f"{name} grad: {param.grad.abs().mean():.6f}")
```
Solutions:
Adjust initialization:
```python
nn.init.xavier_uniform_(self.gate_proj.weight)
```
Check RMSNorm:
- ensure inputs are normalized
- avoid extreme ranges
Mixed precision:
- use FP32 for critical ops
- BF16 for the rest
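If FP16 underflow is the suspect, here is a minimal mixed-precision sketch (assuming PyTorch's autocast API on a CUDA device; `model` and `x` are placeholders for your FeedForward module and its input):
```python
import torch

# Assumption: `model` is the FeedForward module and `x` its input, both on a CUDA device.
# BF16 keeps FP32's exponent range, so tiny activations don't underflow to 0 the way FP16 can.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    hidden = model(x)

# Keep precision-sensitive steps (e.g. the loss) in FP32.
loss = hidden.float().pow(2).mean()
loss.backward()
```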
Q9: Conceptual understanding
Why does FeedForward process each position independently instead of interacting globally like Attention?
Show reference answer
Design rationale:
Clear division of labor:
- Attention already mixes information
- FeedForward focuses on "deep processing"
- avoids redundant functionality
Compute efficiency:
- global interaction: O(n²d)
- independent processing: O(nd²)
- when n > d, independent processing is cheaper (see the back-of-the-envelope sketch at the end of this answer)
Parameter efficiency:
- a learned global mixing layer would need weights tied to sequence length and position
- position-wise processing shares the same weights across every position
Theory:
- Transformer design:
- Attention = global routing
- FFN = local feature transformation
- similar to spatial conv + 1x1 conv in CNNs
Analogy:
- meeting (Attention): exchange info
- thinking (FFN): digest and organize
- alternating works best
Empirical evidence:
- papers show this division works well
- all-FFN or all-Attention designs perform worse
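A back-of-the-envelope check of the compute-efficiency point above (a rough sketch that ignores constant factors and attention heads; d = 512 is an assumed MiniMind-like hidden size):
```python
# Attention's global mixing scales as ~n^2 * d; a per-position FFN scales as ~n * d^2.
d = 512                      # hidden size (assumed)
for n in (128, 512, 2048):   # sequence lengths
    attn = n * n * d         # every position interacts with every other position
    ffn = n * d * d          # every position runs through the same MLP
    print(f"n={n:5d}  attention ~ {attn:.2e}  ffn ~ {ffn:.2e}  ratio = {attn / ffn:.2f}")
```
Once n grows past d, the quadratic attention term dominates, which is why the per-position FFN is the cheaper place to add capacity.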
Q10: Code understanding
Explain each step in the following code:
```python
def forward(self, x):
    return self.down_proj(
        F.silu(self.gate_proj(x)) * self.up_proj(x)
    )
```
Show reference answer
Step-by-step:
```python
def forward(self, x):
    # x: [batch, seq, hidden_dim]
    # e.g., [32, 512, 512]

    # Step 1: gate signal
    gate = self.gate_proj(x)
    # gate: [batch, seq, intermediate_dim]
    # e.g., [32, 512, 2048]

    # Step 2: SiLU activation
    gate_activated = F.silu(gate)
    # SiLU(x) = x * sigmoid(x)
    # smooth activation, preserves some negative info
    # gate_activated: [32, 512, 2048]

    # Step 3: value signal
    up = self.up_proj(x)
    # up: [32, 512, 2048]

    # Step 4: gating
    hidden = gate_activated * up
    # elementwise multiply
    # gate_activated controls the flow of up
    # hidden: [32, 512, 2048]

    # Step 5: compress back
    output = self.down_proj(hidden)
    # output: [32, 512, 512]

    return output
```
Key points:
- Two parallel paths: gate_proj and up_proj
- Gating: SiLU(gate) controls up
- Dimension flow: 512 → 2048 → 512
- No bias: all Linear layers use bias=False
Why this style?
- concise: one-line formula
- efficient: frameworks optimize this pattern
- direct mapping to SwiGLU formula
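To make the snippet self-contained, here is a minimal sketch of the surrounding module (illustrative sizes 512 → 2048 → 512 from the key points above; a simplified stand-in, not the exact MiniMind source):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Minimal SwiGLU FFN: expand via gate/up, gate elementwise, compress via down."""

    def __init__(self, hidden_size: int = 512, intermediate_size: int = 2048):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Quick shape check: [batch, seq, hidden] in -> [batch, seq, hidden] out
ffn = SwiGLUFeedForward()
x = torch.randn(2, 16, 512)
print(ffn(x).shape)  # torch.Size([2, 16, 512])
```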
✅ Completion check
After finishing all questions, check whether you can:
- [ ] Get Q1–Q7 all correct: solid basics
- [ ] Provide 2+ debugging steps in Q8: troubleshooting ability
- [ ] Explain design rationale in Q9: architecture understanding
- [ ] Explain code step-by-step in Q10: implementation clarity
If anything is unclear, return to teaching.md or rerun the experiments.
🎓 Advanced challenge
Want to go deeper? Try:
Modify experiment code:
- compare ReLU, GELU, SiLU activation distributions (a starter sketch follows this list)
- visualize gating patterns
- test different intermediate_size values
Read papers:
- GLU Variants Improve Transformer - GLU family comparison
- Searching for Activation Functions - Swish activation
Implement variants:
- implement GeGLU (GELU gating)
- implement standard FFN (compare results)
- implement MoE FeedForward
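As a starting point for the first experiment above, a small sketch that prints summary statistics for the three activations (swap in a histogram plot if you want the full distributions):
```python
import torch
import torch.nn.functional as F

# Feed the same random pre-activations through ReLU, GELU, and SiLU and compare.
torch.manual_seed(0)
x = torch.randn(100_000)
for name, fn in [("ReLU", F.relu), ("GELU", F.gelu), ("SiLU", F.silu)]:
    y = fn(x)
    zeros = (y == 0).float().mean()
    print(f"{name:4s}  mean={y.mean():+.4f}  std={y.std():.4f}  "
          f"exact zeros={zeros:.1%}  min={y.min():+.4f}")
```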
Next: go to 05. Residual Connection to learn about residual connections.