Model Distillation (Knowledge Distillation)

Quick Answer: A training technique where a smaller 'student' model learns to replicate the behavior of a larger 'teacher' model.
Model Distillation (Knowledge Distillation) trains a smaller 'student' model to replicate the behavior of a larger 'teacher' model. The student is trained on the teacher's outputs rather than on raw labeled data, capturing the larger model's knowledge in a more compact, faster, and cheaper form.

Example

A company uses GPT-4 to classify 100,000 customer support tickets into 15 categories with 95% accuracy. They then fine-tune a small open-source model (Llama 8B) on GPT-4's classifications. The distilled model achieves 92% accuracy at 1/100th the inference cost and 10x the speed, suitable for real-time classification.

Why It Matters

Distillation is how companies move from expensive prototype to affordable production. Most AI applications prototype with a large model (GPT-4, Claude Opus) then distill into a smaller, faster model for deployment. This pattern reduces inference costs by 90-99% while retaining most of the quality.

How It Works

Distillation works on a simple insight: the probability distributions generated by a large model contain richer information than raw labels alone.

When GPT-4 classifies a support ticket as 'billing issue,' it also assigns lower probabilities to related categories like 'account access' and 'payment failure.' These soft probability distributions, called 'soft labels' or 'dark knowledge,' capture nuanced relationships between categories that binary labels miss.
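The effect of soft labels is easy to see with a temperature-scaled softmax, the formulation from Hinton et al.'s original distillation work. The logits below are invented for illustration; raising the temperature flattens the distribution and exposes the teacher's lower-probability judgments about related categories:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution.

    Higher temperatures flatten the distribution, surfacing the
    'dark knowledge' about how categories relate to each other.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for one support ticket over three
# categories: ['billing issue', 'account access', 'payment failure']
logits = [4.0, 1.5, 2.5]

hard_label = [1.0, 0.0, 0.0]            # what a binary label records
soft_t1 = softmax(logits, temperature=1.0)
soft_t3 = softmax(logits, temperature=3.0)

print([round(p, 3) for p in soft_t1])   # peaked: mostly 'billing issue'
print([round(p, 3) for p in soft_t3])   # flatter: related categories visible
```

In both cases the top category is the same, but the temperature-3 distribution preserves the relative ordering of the runner-up categories, which is exactly the signal a hard label throws away.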

The distillation process involves three steps:

First, generate training data using the teacher model. Run your inputs through GPT-4 or Claude Opus and collect both the outputs and (if available) the probability distributions.

Second, train the student model on this data. The student learns to mimic the teacher's behavior, including the subtle probability patterns, not just the final answers.

Third, evaluate the student against a held-out test set. The quality gap between teacher and student is called the 'distillation gap.' A well-executed distillation achieves 85-95% of the teacher's quality.
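The three steps can be sketched end to end with a deliberately toy setup: a rule-based 'teacher' stands in for an expensive API model, and the 'student' is a trivial keyword-count classifier trained only on the teacher's labels. All of the categories, keywords, and tickets here are invented for illustration:

```python
import random
from collections import Counter, defaultdict

random.seed(0)

KEYWORDS = {"invoice": "billing", "charge": "billing",
            "password": "access", "login": "access",
            "package": "shipping", "delivery": "shipping"}

def teacher(ticket):
    """Stand-in for an expensive teacher model (e.g. a GPT-4 API call)."""
    for word in ticket.split():
        if word in KEYWORDS:
            return KEYWORDS[word]
    return "billing"  # teacher's fallback guess

# Step 1: generate training data by running inputs through the teacher.
vocab = list(KEYWORDS) + ["please", "help", "urgent"]
tickets = [" ".join(random.choices(vocab, k=5)) for _ in range(500)]
labeled = [(t, teacher(t)) for t in tickets]

# Step 2: "train" the student on the teacher's outputs. The student just
# learns per-word label counts -- a tiny stand-in for fine-tuning a
# smaller model on the teacher's labels.
counts = defaultdict(Counter)
for ticket, label in labeled:
    for word in ticket.split():
        counts[word][label] += 1

def student(ticket):
    votes = Counter()
    for word in ticket.split():
        votes.update(counts[word])
    return votes.most_common(1)[0][0] if votes else "billing"

# Step 3: evaluate the student against the teacher on held-out tickets.
held_out = [" ".join(random.choices(vocab, k=5)) for _ in range(200)]
agreement = sum(student(t) == teacher(t) for t in held_out) / len(held_out)
print(f"student agrees with teacher on {agreement:.0%} of held-out tickets")
```

The agreement rate measured in step 3 is this toy's version of the distillation gap: it compares the student to the teacher, not to ground truth.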

Common distillation targets include moving from GPT-4 to GPT-4o-mini, from Claude Opus to a fine-tuned Haiku, or from any large model to an open-source model you can self-host.

Common Mistakes

Common mistake: Distilling on too little data, producing a student that memorizes examples rather than learning patterns

Use at least 10,000-50,000 examples for distillation. More data produces better students. Quality matters too: ensure the teacher's outputs are diverse and representative.

Common mistake: Comparing the student model to human performance rather than to the teacher model

The student's ceiling is the teacher's performance. If GPT-4 achieves 93% accuracy on your task, a distilled model achieving 90% is excellent. Expecting it to exceed the teacher is unrealistic.

Common mistake: Ignoring the tradeoff between model size and quality for your latency requirements

Benchmark multiple student sizes. Sometimes a 3B model at 88% accuracy with 50ms latency beats a 7B model at 91% accuracy with 120ms latency. Let your application's latency budget guide the choice.
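One way to make that choice systematic is to benchmark each candidate and filter by the latency budget before comparing accuracy. The model names and numbers below are illustrative, not real benchmarks:

```python
# Candidate distilled students: (name, accuracy, p50 latency in ms).
# All values are hypothetical, for illustration only.
CANDIDATES = [
    ("student-3b", 0.88, 50),
    ("student-7b", 0.91, 120),
    ("student-13b", 0.93, 260),
]

def pick_student(candidates, latency_budget_ms):
    """Return the most accurate candidate that fits the latency budget."""
    eligible = [c for c in candidates if c[2] <= latency_budget_ms]
    if not eligible:
        raise ValueError("no candidate fits the latency budget")
    return max(eligible, key=lambda c: c[1])

print(pick_student(CANDIDATES, 100))   # only the 3B model fits 100 ms
print(pick_student(CANDIDATES, 300))   # a looser budget admits the 13B model
```

Treating latency as a hard constraint and accuracy as the objective (rather than averaging the two into a single score) mirrors how a real-time application actually experiences the tradeoff.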

Career Relevance

Distillation is a core MLOps and AI engineering skill. Companies hiring AI engineers expect candidates to understand when and how to distill large models into production-ready smaller models. It is the bridge between prototyping with expensive APIs and deploying cost-effective AI at scale.

Frequently Asked Questions

What is the difference between distillation and fine-tuning?

Fine-tuning trains a model on task-specific data (human-labeled examples). Distillation trains a smaller model on a larger model's outputs. Distillation is a form of fine-tuning, but the training data comes from a model rather than from humans. You can combine both: fine-tune with human data, then distill.

Is it legal to distill from commercial APIs like GPT-4?

This depends on the provider's terms of service. As of 2026, OpenAI's terms restrict using API outputs to train competing models. Anthropic has similar restrictions. Always check current terms before distilling. Using outputs to train models for your own internal use is generally permitted.
