Model Training

Gradient Descent

Quick Answer: The core optimization algorithm used to train neural networks.
Gradient descent works by calculating how much each model weight contributes to the error, then adjusting those weights in small steps to reduce it. Think of it as rolling a ball downhill to find the lowest point in a landscape of possible errors.

Example

During training, the model predicts 'cat' but the correct answer is 'dog.' The loss function calculates the error. Gradient descent computes which weights to adjust and by how much, nudging the model's predictions closer to 'dog' for similar inputs next time.

Why It Matters

Gradient descent is how every neural network learns. While prompt engineers don't implement it directly, understanding the basics explains why models behave the way they do, why training can fail, and what fine-tuning actually does under the hood.

How It Works

Gradient descent works in three steps, repeated millions of times. First, the model makes a prediction and the loss function measures how wrong it is. Second, backpropagation calculates the gradient: how much each weight contributed to the error. Third, the optimizer updates each weight by a small amount in the direction that reduces the error. The size of each update is controlled by the learning rate.
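The three steps above can be sketched in a few lines of plain Python. This is an illustrative toy, not real training code: a one-parameter model y = w * x with a gradient derived by hand, standing in for the backpropagation an actual framework would perform.

```python
# Toy gradient descent on a one-parameter model: y = w * x.
# Data and hyperparameters are illustrative; the true weight is 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0              # initial weight
learning_rate = 0.05

for step in range(200):
    # Step 1: predict and measure the error (mean squared error).
    loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
    # Step 2: compute the gradient dL/dw (here by hand; frameworks
    # use backpropagation to do this automatically).
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    # Step 3: update the weight a small step opposite the gradient.
    w -= learning_rate * grad

print(round(w, 3))  # converges to 2.0
```

Each pass through the loop is one update; real training repeats this over millions of weights and batches.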

In practice, pure gradient descent (computing gradients over the entire dataset) is too slow. Stochastic gradient descent (SGD) computes gradients on small random batches, which is noisy but much faster. Modern optimizers like Adam combine adaptive learning rates with momentum (remembering the direction of recent updates) to converge faster and more reliably.
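A minimal sketch of mini-batch SGD with momentum on the same kind of toy problem. The dataset, batch size, learning rate, and momentum value here are illustrative assumptions, not recommended settings.

```python
import random

# Mini-batch SGD with momentum on the toy model y = w * x (true w = 2).
random.seed(0)
data = [(x, 2.0 * x) for x in range(1, 6)]

w = 0.0
velocity = 0.0
learning_rate = 0.001
momentum = 0.9
batch_size = 3

for epoch in range(100):
    random.shuffle(data)                      # stochastic: new random order each epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # Noisy gradient estimate from a small random batch.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        # Momentum remembers the direction of recent updates.
        velocity = momentum * velocity - learning_rate * grad
        w += velocity

print(round(w, 2))  # should approach 2.0
```

The batch gradients are noisy estimates of the full gradient, but the momentum term smooths out that noise while still letting each update run on only a fraction of the data.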

The learning rate is the most critical hyperparameter. Too high, and the model overshoots the optimal weights, oscillating wildly. Too low, and training takes forever or gets stuck in a poor local minimum. Learning rate schedules (starting high and decreasing over time) are standard practice. This is why fine-tuning typically uses a much smaller learning rate than pre-training: you want to make small adjustments, not overwrite what the model already knows.
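The effect of the learning rate can be seen directly on the same toy problem. The three rates below are illustrative, not recommendations: the high one overshoots and diverges, the middle one converges, and the low one barely moves in the allotted steps.

```python
# Same one-parameter problem, y = w * x with true w = 2, trained
# with three different (hypothetical) learning rates.
def train(learning_rate, steps=50):
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= learning_rate * grad
    return w

for lr in (0.3, 0.05, 0.0001):
    # 0.3 overshoots further each step (divergence),
    # 0.05 converges to ~2, 0.0001 is still near 0 after 50 steps.
    print(lr, train(lr))
```

On real loss landscapes the same dynamics play out in millions of dimensions at once, which is why the learning rate dominates training stability.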

Common Mistakes

Common mistake: Setting the learning rate too high, causing training instability

Start with a small learning rate (e.g., 1e-5 for fine-tuning) and increase it gradually only if training remains stable. Use learning rate schedulers for automatic adjustment.
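One common scheduler shape is linear warmup followed by cosine decay. A minimal sketch, with hypothetical step counts and a peak rate in the fine-tuning range:

```python
import math

# Linear warmup then cosine decay. All numbers are illustrative.
def lr_at(step, total_steps=1000, warmup_steps=100, peak_lr=1e-5):
    if step < warmup_steps:
        return peak_lr * step / warmup_steps               # ramp up from 0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))  # decay to 0

print(lr_at(50))    # mid-warmup: half the peak rate
print(lr_at(100))   # peak rate
print(lr_at(1000))  # end of training: 0
```

Training frameworks ship schedulers like this built in; the point of the sketch is only the shape: a gentle ramp up, then a slow decay toward zero.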

Common mistake: Training for too many epochs, causing overfitting

Monitor validation loss during training. Stop when validation performance plateaus or starts degrading.
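A simple patience-based early-stopping rule can be sketched as follows. The validation losses are made-up numbers for illustration.

```python
# Stop when validation loss fails to improve for `patience` epochs in a row.
def early_stop_epoch(val_losses, patience=2):
    best = float("inf")
    since_best = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0    # new best: reset the counter
        else:
            since_best += 1
            if since_best >= patience:
                return epoch              # stop: validation is degrading
    return len(val_losses) - 1

# Validation loss improves, bottoms out, then rises (overfitting begins).
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
print(early_stop_epoch(losses))  # → 5
```

Note that the decision uses validation loss, not training loss: training loss keeps falling even while the model overfits.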

Common mistake: Assuming gradient descent always finds the best solution

Gradient descent finds local optima, not guaranteed global optima. This is why training with different random seeds can produce different results.
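A toy demonstration of this: gradient descent on the non-convex function f(w) = w**4 - 3*w**2 + w, which has two local minima. Different starting points (which random seeds influence in real training) land in different minima.

```python
# Gradient descent on f(w) = w**4 - 3*w**2 + w, a function with two
# local minima. The learning rate and step count are illustrative.
def descend(w, lr=0.01, steps=500):
    for _ in range(steps):
        grad = 4 * w**3 - 6 * w + 1   # f'(w), derived by hand
        w -= lr * grad
    return w

print(round(descend(1.5), 3))    # lands near 1.131 (a local minimum)
print(round(descend(-1.5), 3))   # lands near -1.301 (a deeper minimum)
```

Neither run "knows" about the other basin: each start rolls downhill into whichever valley it began in, which is exactly why retraining with a different seed can yield a different model.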

Career Relevance

While prompt engineers don't code gradient descent, understanding it is expected in technical interviews for AI roles. It helps you communicate effectively with ML engineers and understand training reports, fine-tuning parameters, and why models sometimes produce unexpected results.
