Normalization
Why It Matters
Normalization is one of those unglamorous techniques that make everything else work. Without it, neural networks train slowly or not at all, gradient-based optimization gets stuck, and features with larger scales dominate unfairly. It's a required step in virtually every ML pipeline.
How It Works
Data normalization has two common forms. Min-max scaling transforms values to a fixed range (usually 0-1). Standardization (z-score normalization) transforms values to have zero mean and unit variance. Standardization is generally preferred because it is less sensitive to outliers than min-max scaling (whose parameters are set entirely by the extreme values) and works well with gradient-based optimization.
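A minimal numpy sketch of both forms (function names are illustrative, not from any particular library):

```python
import numpy as np

def min_max_scale(x):
    # Rescale each feature (column) to the [0, 1] range.
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    return (x - x_min) / (x_max - x_min)

def standardize(x):
    # Z-score: zero mean, unit variance per feature (column).
    return (x - x.mean(axis=0)) / x.std(axis=0)

data = np.array([[1.0, 200.0],
                 [2.0, 400.0],
                 [3.0, 600.0]])
scaled = min_max_scale(data)   # every column now spans [0, 1]
z = standardize(data)          # every column now has mean 0, std 1
```

Note how the raw second column (200-600) would dwarf the first (1-3) in any distance or dot-product computation; after either transform, the two features contribute on comparable scales.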
Neural network normalization operates during the forward pass. Batch normalization (BatchNorm) normalizes activations across the batch dimension, computing mean and variance from all examples in the current mini-batch. It was a breakthrough in 2015, enabling much deeper networks to train reliably. However, it has issues with small batch sizes and doesn't work well in recurrent networks.
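The batch-dimension computation can be sketched as follows (a simplified training-time forward pass; real implementations also track running statistics for inference, which is omitted here):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize across the batch dimension (axis 0): each feature is
    # normalized using statistics pooled over ALL examples in the batch.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta  # learned per-feature scale and shift

x = np.random.randn(32, 8) * 5 + 3  # batch of 32 examples, 8 features
out = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
```

Because `mean` and `var` are pooled over the batch, a batch of 2 gives much noisier statistics than a batch of 32, which is exactly the small-batch weakness noted above.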
Layer normalization (LayerNorm) normalizes across the feature dimension within each individual example, independent of batch size. This is what transformers use and it's the default for LLM architectures. RMSNorm is a simplified variant (normalizing by root mean square only) used in models like LLaMA.
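Both variants reduce over the last (feature) axis rather than the batch axis; a sketch without the learned scale parameters that production implementations add:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize across the feature dimension (last axis) of each example,
    # so the result is independent of what else is in the batch.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # RMSNorm: divide by the root mean square only; no mean subtraction.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms

x = np.random.randn(4, 16)
ln = layer_norm(x)
rn = rms_norm(x)
```

RMSNorm drops the mean subtraction, saving a reduction per call while working about as well in practice, which is why recent LLMs favor it.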
Group normalization and instance normalization are variants for specific use cases: group norm works well in vision tasks with small batches, and instance norm is used in style transfer.
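A group-norm sketch for image-shaped activations (assumes `(batch, channels, height, width)` layout; instance norm is the special case where each group holds one channel):

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    # x: (batch, channels, height, width). Normalize within each group of
    # channels, per example, so statistics never depend on the batch size.
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

out = group_norm(np.random.randn(2, 8, 4, 4), num_groups=4)
```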
Pre-norm vs. post-norm placement matters in transformers. Original transformers used post-norm (normalize after the residual connection), but modern models prefer pre-norm (normalize before the sub-layer) because it provides more stable training for very deep models.
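The two placements differ only in where the normalization sits relative to the residual addition; a sketch with a trivial stand-in sub-layer (in a real transformer it would be attention or a feed-forward network):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def post_norm_block(x, sublayer):
    # Original transformer: normalize AFTER adding the residual.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Modern LLMs: normalize BEFORE the sub-layer; the residual path
    # stays an untouched identity, which stabilizes very deep stacks.
    return x + sublayer(layer_norm(x))

sublayer = lambda h: 0.5 * h  # hypothetical stand-in for attention/FFN
x = np.random.randn(2, 16)
post = post_norm_block(x, sublayer)
pre = pre_norm_block(x, sublayer)
```

In the pre-norm block, gradients can flow through the bare `x + ...` path without passing through any normalization, which is the usual explanation for its stability in very deep models.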
Common Mistakes
Common mistake: Fitting the normalization parameters (mean, std) on the entire dataset including the test set
Fit normalization statistics only on the training set, then apply those same parameters to validation and test data to avoid data leakage.
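A sketch of the correct workflow with plain numpy (the same fit-on-train, apply-everywhere pattern that scikit-learn's `fit`/`transform` split enforces):

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=10.0, scale=3.0, size=(100, 2))
test = rng.normal(loc=10.0, scale=3.0, size=(20, 2))

# Fit the statistics on the training set ONLY...
mean, std = train.mean(axis=0), train.std(axis=0)

# ...then reuse those SAME parameters for the test set. Recomputing them
# on the test set (or on train + test combined) leaks test information.
train_z = (train - mean) / std
test_z = (test - mean) / std
```

The test set's normalized mean will not be exactly zero, and that is correct: the test data is treated as unseen, transformed with parameters it had no part in estimating.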
Common mistake: Using batch normalization with very small batch sizes, causing noisy statistics
Switch to layer normalization or group normalization when batch sizes are small (less than 16). These don't depend on batch statistics.
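The batch-independence claim is easy to check directly: LayerNorm reduces only over each example's own features, so shrinking the batch to a single example changes nothing (a simplified sketch without learned parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Reduce over the last axis only; the batch axis is never touched.
    mean = x.mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

x = np.random.randn(16, 8)
full = layer_norm(x)        # normalized as part of a batch of 16
single = layer_norm(x[:1])  # the same example, alone in a batch of 1
# Per-example outputs are identical: LayerNorm never looks at the batch.
```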
Career Relevance
Normalization is a fundamental skill for any ML practitioner. It's part of every data pipeline and every neural network architecture. Interviewers expect candidates to understand when and how to normalize, and production issues often trace back to normalization mistakes.