FeedForward Quiz
Answer the following questions to check your understanding.
🎮 Interactive Quiz (Recommended)
Q1: Why does FeedForward use "expand-compress"?
Q2: How many projection matrices does SwiGLU use?
Q3: What is the formula for SiLU?
Q4: What does the gating mechanism do in SwiGLU?
Q5: What is the main difference between FeedForward and Attention?
Q6: Why is SwiGLU often better than ReLU?
Q7: In MiniMind, intermediate_size is typically how many times hidden_size?
🎯 Comprehensive Questions
Q8: Practical scenario
If FeedForward outputs are always close to 0, what might be wrong and how would you debug it?
Show reference answer
Possible causes:
Initialization issues:
- weights initialized too small
- outputs become tiny
Gate signal issues:
- gate_proj outputs mostly negative
- SiLU(negative) ≈ 0
- gate * up ≈ 0
Vanishing gradients:
- deep network, gradients don’t flow
- weights stop updating
Precision issues:
- low precision (e.g., FP16)
- small values underflow to 0
Debug steps:
```python
# 1. inspect intermediates
gate = self.gate_proj(x)
print(f"gate mean: {gate.mean()}, std: {gate.std()}")
silu_gate = F.silu(gate)
print(f"silu_gate mean: {silu_gate.mean()}, std: {silu_gate.std()}")
up = self.up_proj(x)
print(f"up mean: {up.mean()}, std: {up.std()}")
hidden = silu_gate * up
print(f"hidden mean: {hidden.mean()}, std: {hidden.std()}")

# 2. check weight initialization
for name, param in self.named_parameters():
    print(f"{name}: mean={param.mean():.6f}, std={param.std():.6f}")

# 3. check gradients
output.sum().backward()
for name, param in self.named_parameters():
    if param.grad is not None:
        print(f"{name} grad: {param.grad.abs().mean():.6f}")
```
Solutions:
Adjust initialization:
```python
nn.init.xavier_uniform_(self.gate_proj.weight)
```
Check RMSNorm:
- ensure inputs are normalized
- avoid extreme ranges
Mixed precision:
- use FP32 for critical ops
- BF16 for the rest
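If FP16 underflow is the suspect, here is a minimal mixed-precision sketch (assuming PyTorch's autocast API on a CUDA device; `model` and `x` are placeholders for your FeedForward module and its input):
```python
import torch

# Assumption: `model` is the FeedForward module and `x` its input, both on a CUDA device.
# BF16 keeps FP32's exponent range, so tiny activations don't underflow to 0 the way FP16 can.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    hidden = model(x)

# Keep precision-sensitive steps (e.g. the loss) in FP32.
loss = hidden.float().pow(2).mean()
loss.backward()
```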
Q9: Conceptual understanding
Why does FeedForward process each position independently instead of interacting globally like Attention?
Show reference answer
Design rationale:
Clear division of labor:
- Attention already mixes information
- FeedForward focuses on "deep processing"
- avoids redundant functionality
Compute efficiency:
- global interaction: O(n²d)
- independent processing: O(nd²)
- when n > d, independent processing is cheaper (see the back-of-the-envelope sketch at the end of this answer)
Parameter efficiency:
- a learned global mixing layer would need weights tied to sequence length and position
- position-wise processing shares the same weights across every position
Theory:
- Transformer design:
- Attention = global routing
- FFN = local feature transformation
- similar to spatial conv + 1x1 conv in CNNs
Analogy:
- meeting (Attention): exchange info
- thinking (FFN): digest and organize
- alternating works best
Empirical evidence:
- papers show this division works well
- all-FFN or all-Attention designs perform worse
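A back-of-the-envelope check of the compute-efficiency point above (a rough sketch that ignores constant factors and attention heads; d = 512 is an assumed MiniMind-like hidden size):
```python
# Attention's global mixing scales as ~n^2 * d; a per-position FFN scales as ~n * d^2.
d = 512                      # hidden size (assumed)
for n in (128, 512, 2048):   # sequence lengths
    attn = n * n * d         # every position interacts with every other position
    ffn = n * d * d          # every position runs through the same MLP
    print(f"n={n:5d}  attention ~ {attn:.2e}  ffn ~ {ffn:.2e}  ratio = {attn / ffn:.2f}")
```
Once n grows past d, the quadratic attention term dominates, which is why the per-position FFN is the cheaper place to add capacity.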
Q10: Code understanding
Explain each step in the following code:
```python
def forward(self, x):
    return self.down_proj(
        F.silu(self.gate_proj(x)) * self.up_proj(x)
    )
```
Show reference answer
Step-by-step:
```python
def forward(self, x):
    # x: [batch, seq, hidden_dim]
    # e.g., [32, 512, 512]

    # Step 1: gate signal
    gate = self.gate_proj(x)
    # gate: [batch, seq, intermediate_dim]
    # e.g., [32, 512, 2048]

    # Step 2: SiLU activation
    gate_activated = F.silu(gate)
    # SiLU(x) = x * sigmoid(x)
    # smooth activation, preserves some negative info
    # gate_activated: [32, 512, 2048]

    # Step 3: value signal
    up = self.up_proj(x)
    # up: [32, 512, 2048]

    # Step 4: gating
    hidden = gate_activated * up
    # elementwise multiply
    # gate_activated controls the flow of up
    # hidden: [32, 512, 2048]

    # Step 5: compress back
    output = self.down_proj(hidden)
    # output: [32, 512, 512]

    return output
```
Key points:
- Two parallel paths: gate_proj and up_proj
- Gating: SiLU(gate) controls up
- Dimension flow: 512 → 2048 → 512
- No bias: all Linear layers use bias=False
Why this style?
- concise: one-line formula
- efficient: frameworks optimize this pattern
- direct mapping to SwiGLU formula
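To make the snippet self-contained, here is a minimal sketch of the surrounding module (illustrative sizes 512 → 2048 → 512 from the key points above; a simplified stand-in, not the exact MiniMind source):
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Minimal SwiGLU FFN: expand via gate/up, gate elementwise, compress via down."""

    def __init__(self, hidden_size: int = 512, intermediate_size: int = 2048):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Quick shape check: [batch, seq, hidden] in -> [batch, seq, hidden] out
ffn = SwiGLUFeedForward()
x = torch.randn(2, 16, 512)
print(ffn(x).shape)  # torch.Size([2, 16, 512])
```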
✅ Completion check
After finishing all questions, check whether you can:
- [ ] Get Q1–Q7 all correct: solid basics
- [ ] Provide 2+ debugging steps in Q8: troubleshooting ability
- [ ] Explain design rationale in Q9: architecture understanding
- [ ] Explain code step-by-step in Q10: implementation clarity
If anything is unclear, return to teaching.md or rerun the experiments.
🎓 Advanced challenge
Want to go deeper? Try:
Modify experiment code:
- compare ReLU, GELU, SiLU activation distributions (a starter sketch follows this list)
- visualize gating patterns
- test different intermediate_size values
Read papers:
- GLU Variants Improve Transformer - GLU family comparison
- Searching for Activation Functions - Swish activation
Implement variants:
- implement GeGLU (GELU gating)
- implement standard FFN (compare results)
- implement MoE FeedForward
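As a starting point for the first experiment above, a small sketch that prints summary statistics for the three activations (swap in a histogram plot if you want the full distributions):
```python
import torch
import torch.nn.functional as F

# Feed the same random pre-activations through ReLU, GELU, and SiLU and compare.
torch.manual_seed(0)
x = torch.randn(100_000)
for name, fn in [("ReLU", F.relu), ("GELU", F.gelu), ("SiLU", F.silu)]:
    y = fn(x)
    zeros = (y == 0).float().mean()
    print(f"{name:4s}  mean={y.mean():+.4f}  std={y.std():.4f}  "
          f"exact zeros={zeros:.1%}  min={y.min():+.4f}")
```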
Next: go to 05. Residual Connection to learn about residual connections.