Cross-Entropy

Quick Answer: A mathematical measure of the difference between a model's predicted probability distribution and the actual distribution of outcomes.
In language models, cross-entropy loss measures how well the model predicts each next token. Lower cross-entropy means better predictions and a more capable model.

Example

If the true next word is 'cat' and the model assigns 80% probability to 'cat,' the cross-entropy for that token is low (good prediction). If the model only assigns 5% to 'cat,' the cross-entropy is high (bad prediction). Training minimizes this across trillions of tokens.
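
The arithmetic in this example can be checked directly. A minimal sketch in Python, using the natural logarithm (the base most frameworks use):

```python
import math

# Cross-entropy for a single token is the negative log of the
# probability the model assigned to the true token.
def token_cross_entropy(p_true: float) -> float:
    return -math.log(p_true)

good = token_cross_entropy(0.80)  # model is confident in 'cat'
bad = token_cross_entropy(0.05)   # model barely considers 'cat'

print(f"p=0.80 -> loss {good:.3f}")  # ~0.223: low loss, good prediction
print(f"p=0.05 -> loss {bad:.3f}")   # ~2.996: high loss, bad prediction
```

Note how the loss grows sharply as the assigned probability shrinks: a confidently wrong model is penalized far more than a mildly uncertain one.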

Why It Matters

Cross-entropy is the objective function that LLMs are trained to minimize. Understanding it explains why models sometimes generate fluent, high-probability text that is factually wrong (hallucinations) and why temperature adjustments change the randomness and diversity of output.
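
The temperature adjustment mentioned above works by dividing the model's logits before the softmax. A minimal sketch, with made-up logit values for illustration:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Dividing logits by a temperature < 1 sharpens the distribution
    # (more probability on the top token); > 1 flattens it toward uniform.
    scaled = [x / temperature for x in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                     # hypothetical next-token logits
print(softmax_with_temperature(logits, 1.0))
print(softmax_with_temperature(logits, 0.5))  # sharper: top token dominates
print(softmax_with_temperature(logits, 2.0))  # flatter: closer to uniform
```

Lower temperatures make the model more deterministic; higher temperatures make it more varied but also more likely to pick low-probability tokens.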

How It Works

Cross-entropy measures the difference between two probability distributions: what the model predicted and what actually happened. For language models, it measures how surprised the model is by each token in the training data. The goal of training is to minimize this surprise across trillions of tokens.

The formula computes the negative log probability assigned to the correct token at each position. If the model assigned high probability to the correct token, the cross-entropy for that position is low. If it assigned low probability, the cross-entropy is high. Averaging across all positions gives the model's overall loss.
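The averaging step described above can be sketched in a few lines; the per-position probabilities here are made up for illustration:

```python
import math

# Average cross-entropy over a sequence: at each position, take the
# probability the model assigned to the token that actually appeared,
# and average the negative log probabilities.
def sequence_cross_entropy(probs_for_true_tokens):
    n = len(probs_for_true_tokens)
    return sum(-math.log(p) for p in probs_for_true_tokens) / n

# Hypothetical probabilities assigned to the correct token at each position
probs = [0.9, 0.6, 0.2, 0.75]
loss = sequence_cross_entropy(probs)
print(f"mean loss: {loss:.3f}")  # the low-probability position dominates the average
```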

Cross-entropy connects to perplexity through a simple relationship: perplexity = 2^(cross-entropy) when the loss is measured in bits (log base 2). A model with a cross-entropy of 3.32 bits therefore has a perplexity of about 10. Most frameworks report loss in nats (natural log), in which case perplexity = e^(cross-entropy). Understanding this relationship helps interpret training curves and model comparisons.
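
The base of the logarithm matters here: a quick check, converting the same loss between bits and nats:

```python
import math

# Perplexity is the exponential of cross-entropy, in whichever base
# the loss was computed. With base-2 loss: perplexity = 2 ** loss.
loss_bits = 3.32                       # cross-entropy in bits (log base 2)
print(2 ** loss_bits)                  # ~10: the perplexity

loss_nats = loss_bits * math.log(2)    # the same loss expressed in nats
print(math.exp(loss_nats))             # same perplexity, via e ** loss
```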

Common Mistakes

Common mistake: Confusing training loss (cross-entropy) with model quality for downstream tasks

Lower training loss means better next-token prediction, not necessarily better task performance. Models are typically evaluated on downstream tasks, not training loss.

Common mistake: Expecting cross-entropy to decrease monotonically during training

Loss curves have noise, and validation loss may increase while training loss decreases (overfitting). Monitor validation loss and use early stopping when it starts rising.
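
The early-stopping rule above can be sketched as a simple patience counter; the validation-loss values are made up for illustration:

```python
# Minimal early-stopping sketch: stop when validation loss has not
# improved for `patience` consecutive evaluations.
def early_stop_index(val_losses, patience=2):
    best = float("inf")
    since_best = 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return i               # stop training at this evaluation
    return None                        # never triggered

# Validation loss falls, then starts rising as the model overfits
val_losses = [2.1, 1.8, 1.6, 1.65, 1.7, 1.72]
print(early_stop_index(val_losses))    # stops after two evaluations without improvement
```

Real training loops typically also restore the checkpoint with the best validation loss rather than the final one.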

Career Relevance

Cross-entropy understanding is fundamental for ML engineers and researchers working on model training. It's the objective function that drives all language model development, making it important background knowledge for anyone in the AI field.
