Gradient Descent
Why It Matters
Gradient descent is how every neural network learns. While prompt engineers don't implement it directly, understanding the basics explains why models behave the way they do, why training can fail, and what fine-tuning actually does under the hood.
How It Works
Gradient descent works in three steps, repeated millions of times. First, the model makes a prediction and the loss function measures how wrong it is. Second, backpropagation calculates the gradient: how much each weight contributed to the error. Third, the optimizer updates each weight by a small amount in the direction that reduces the error. The size of each update is controlled by the learning rate.
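The three-step loop can be sketched in a few lines of Python. This is a toy one-dimensional loss, not a real network: the "model" is a single weight w, and the gradient is computed by hand rather than by backpropagation.

```python
# Minimal gradient descent on a 1D loss: loss(w) = (w - 3)^2, minimized at w = 3.
def loss(w):
    return (w - 3.0) ** 2          # step 1: measure how wrong the prediction is

def gradient(w):
    return 2.0 * (w - 3.0)         # step 2: how much w contributed to the error

w = 0.0                            # initial weight
learning_rate = 0.1

for step in range(100):
    w -= learning_rate * gradient(w)   # step 3: small update against the gradient

print(round(w, 4))  # → 3.0, the minimum of the loss
```

In a real network, step 2 is what backpropagation automates across millions of weights; the update rule itself is the same.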
In practice, pure gradient descent (computing gradients over the entire dataset) is too slow. Stochastic gradient descent (SGD) computes gradients on small random batches, which is noisy but much faster. Modern optimizers like Adam combine adaptive learning rates with momentum (remembering the direction of recent updates) to converge faster and more reliably.
The learning rate is the most critical hyperparameter. Too high, and the model overshoots the optimal weights, oscillating wildly. Too low, and training takes forever or gets stuck in a poor local minimum. Learning rate schedules (starting high and decreasing over time) are standard practice. This is why fine-tuning typically uses a much smaller learning rate than pre-training: you want to make small adjustments, not overwrite what the model already knows.
Common Mistakes
Common mistake: Setting the learning rate too high, causing training instability
Start with a small learning rate (e.g., 1e-5 is a common starting point for fine-tuning) and increase gradually. Use learning rate schedulers for automatic adjustment.
Common mistake: Training for too many epochs, causing overfitting
Monitor validation loss during training. Stop when validation performance plateaus or starts degrading.
Common mistake: Assuming gradient descent always finds the best solution
Gradient descent finds local optima, not guaranteed global optima. This is why training with different random seeds can produce different results.
Career Relevance
While prompt engineers don't code gradient descent, understanding it is expected in technical interviews for AI roles. It helps you communicate effectively with ML engineers and understand training reports, fine-tuning parameters, and why models sometimes produce unexpected results.