# MiniMind Learning Guide
Principles + experiments + practice
| Focus | Time | Status |
| --- | --- | --- |
| Pre-LN vs Post-LN, why normalization matters | 1 hour | Complete |
| RoPE and position encoding choices | 1.5 hours | Complete |
| Q/K/V and multi-head attention | 2 hours | Complete |
| FFN design and SwiGLU | 1 hour | Complete |
| Assemble the components into a Transformer block | 2.5 hours | In progress |
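As a preview of how the modules above fit together, here is a minimal Pre-LN Transformer block in PyTorch: RMSNorm before each sub-layer, multi-head attention, a SwiGLU feed-forward network, and residual connections around both. This is an illustrative sketch under names I chose here (`RMSNorm`, `SwiGLU`, `PreLNBlock`, `dim`, `n_heads`), not the MiniMind implementation — each module's teaching notes walk through the real code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square norm: rescale by the RMS of the features, no mean subtraction."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)


class SwiGLU(nn.Module):
    """Gated feed-forward network: silu(x @ W_gate) * (x @ W_up), projected back down."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


class PreLNBlock(nn.Module):
    """Pre-LN block: normalize *before* each sub-layer, add its output back via a residual."""

    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        self.norm1 = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = RMSNorm(dim)
        self.ffn = SwiGLU(dim, 4 * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        # causal masking is omitted to keep the sketch short
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention
        x = x + self.ffn(self.norm2(x))                     # residual around the FFN
        return x


x = torch.randn(2, 16, 256)      # (batch, seq_len, dim)
print(PreLNBlock()(x).shape)     # torch.Size([2, 16, 256])
```

Pre-LN places the normalization inside the residual branch, so the identity path stays untouched — one reason deep stacks of these blocks train more stably than the Post-LN arrangement.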
Each experiment runs in 5–10 minutes on CPU, so you can quickly grasp the essentials behind LLM training.
- Observe gradient vanishing and see how RMSNorm stabilizes training (see the RMSNorm sketch below).
- Compare absolute position encoding with RoPE and learn why RoPE extrapolates better (see the RoPE sketch below).
- Validate gradient-flow problems in deep networks and see the power of residual connections (see the residual sketch below).
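To get a feel for the first experiment before cloning anything, here is a toy version of the same measurement — a sketch of mine, not the repo's `exp1_gradient_vanishing.py`: push random activations through a deep stack of tanh layers and track the activation std with and without an RMSNorm step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 30, 256
x0 = torch.randn(64, dim)


def final_std(use_rmsnorm: bool) -> float:
    x = x0.clone()
    for _ in range(depth):
        # default-init linear + tanh shrinks the activation scale layer by layer
        x = torch.tanh(nn.Linear(dim, dim)(x))
        if use_rmsnorm:
            # RMSNorm (no learned gain here): rescale every vector to unit RMS
            x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6)
    return x.std().item()


print(f"no norm : activation std after {depth} layers = {final_std(False):.2e}")  # collapses toward 0
print(f"RMSNorm : activation std after {depth} layers = {final_std(True):.2e}")   # stays near 1
```

Without the norm the std collapses toward zero within a few dozen layers; with RMSNorm every layer's output is rescaled to unit RMS, so the signal — and with it the gradient scale — stays in a usable range.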
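The position-encoding experiment hinges on one property of RoPE: each pair of query/key dimensions is rotated by an angle proportional to the token's position, so the positional part of an attention score depends only on the relative offset. A compact sketch of that rotation — my own layout, with the conventional base of 10000 assumed rather than taken from the repo:

```python
import torch


def rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate dimension pairs of x (shape: seq_len x dim, dim even) by position-dependent angles."""
    seq_len, dim = x.shape
    half = dim // 2
    # one rotation frequency per dimension pair, from 1 down to roughly 1/base
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # standard 2-D rotation applied to each (x1, x2) pair
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


q, k = torch.randn(8, 64), torch.randn(8, 64)
# the positional contribution to the score between positions i and j depends only on i - j
scores = rope(q) @ rope(k).T
print(scores.shape)  # torch.Size([8, 8])
```

Because only the offset `i - j` enters the rotation-dependent part of the score, positions beyond the training length still produce meaningful relative angles — the intuition behind RoPE's better extrapolation.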
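And for the third experiment, a toy comparison of gradient flow — again my own sketch, not the repo's script: measure how much gradient reaches the input of a 50-layer stack with and without skip connections.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 50, 128
layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(depth)])


def input_grad_norm(use_residual: bool) -> float:
    x = torch.randn(32, dim, requires_grad=True)
    h = x
    for layer in layers:
        out = torch.tanh(layer(h))
        h = h + out if use_residual else out  # the residual keeps an identity path open
    h.sum().backward()
    return x.grad.norm().item()


print(f"plain stack   : gradient norm at the input = {input_grad_norm(False):.2e}")  # vanishes
print(f"with residuals: gradient norm at the input = {input_grad_norm(True):.2e}")   # does not vanish
```

The residual path contributes an identity term to every layer's Jacobian, so the gradient can reach the input without being multiplied by fifty small factors along the way.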
Start the experiments:

```bash
git clone https://github.com/joyehuang/minimind-notes.git
cd minimind-notes
source venv/bin/activate

# Experiment 1: Why normalization?
cd modules/01-foundation/01-normalization/experiments
python exp1_gradient_vanishing.py

# What you will see:
# ❌ No normalization: activation std drops (vanishing gradients)
# ✅ RMSNorm: activation std stays stable

# Read teaching notes for the why/what/how
cat modules/01-foundation/01-normalization/teaching.md
```

✅ Principles first
Run experiments first, then read theory. Focus on why each design choice exists.
🔬 Experiment-driven learning
Each module includes experiments that answer: “What breaks if we don’t do this?”
💻 Low barrier
TinyShakespeare (1MB) or TinyStories (10–50MB) run on CPU in minutes. GPU is optional for learning.
- Upstream project: jingyaogong/minimind
- Learning roadmap: Roadmap
- Executable examples: Learning materials
- Learning notes: Learning log · Knowledge base