Model Distillation (Knowledge Distillation)
Why It Matters
Distillation is how companies move from an expensive prototype to affordable production. Most AI applications are prototyped with a large model (GPT-4, Claude Opus) and then distilled into a smaller, faster model for deployment. This pattern can reduce inference costs by 90-99% while retaining most of the quality.
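The cost arithmetic is straightforward. A minimal sketch, using hypothetical per-token prices and volumes (check your provider's current pricing before relying on these numbers):

```python
# Hypothetical per-1M-output-token prices -- NOT real price quotes.
teacher_price = 30.00   # assumed $/1M tokens for a large frontier model
student_price = 0.60    # assumed $/1M tokens for a small distilled model

monthly_tokens = 500    # assumed millions of output tokens per month

teacher_cost = teacher_price * monthly_tokens
student_cost = student_price * monthly_tokens
savings = 1 - student_cost / teacher_cost

print(f"Teacher: ${teacher_cost:,.0f}/mo, student: ${student_cost:,.0f}/mo")
print(f"Cost reduction: {savings:.0%}")
```

At these assumed prices the reduction lands at 98%, squarely in the 90-99% range cited above; the real figure depends entirely on which teacher and student you pair.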
How It Works
Distillation works on a simple insight: the probability distributions generated by a large model contain richer information than raw labels alone.
When GPT-4 classifies a support ticket as 'billing issue,' it also assigns lower probabilities to related categories like 'account access' and 'payment failure.' These soft probability distributions, called 'soft labels' or 'dark knowledge,' capture nuanced relationships between categories that binary labels miss.
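A minimal sketch of the hard-label vs. soft-label distinction, using made-up logits for the support-ticket example (the `softmax` helper and the logit values are illustrative assumptions, not output from any real model):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution; higher temperature softens it."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for one support ticket over three categories.
categories = ["billing issue", "account access", "payment failure"]
teacher_logits = [4.0, 2.0, 1.5]

hard_label = softmax(teacher_logits, temperature=0.01)  # near one-hot: ~[1, 0, 0]
soft_label = softmax(teacher_logits, temperature=2.0)   # the "dark knowledge"

# The soft label still ranks 'billing issue' first, but it preserves the
# teacher's view that the other two categories are plausible neighbors --
# information a binary label throws away.
for name, h, s in zip(categories, hard_label, soft_label):
    print(f"{name:16s} hard={h:.3f} soft={s:.3f}")
```

Training the student against the soft distribution (typically with a temperature-scaled cross-entropy or KL-divergence loss) is what lets it absorb those inter-category relationships.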
The distillation process involves three steps:
First, generate training data using the teacher model. Run your inputs through GPT-4 or Claude Opus and collect both the outputs and (if available) the probability distributions.
Second, train the student model on this data. The student learns to mimic the teacher's behavior, including the subtle probability patterns, not just the final answers.
Third, evaluate the student against a held-out test set. The quality gap between teacher and student is called the 'distillation gap.' A well-executed distillation achieves 85-95% of the teacher's quality.
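The three steps above can be sketched end to end on a toy problem. Here the "teacher" is a fixed scoring function standing in for API calls to a large model, and the student is a tiny logistic model trained to mimic the teacher's soft probabilities; everything here is an illustrative assumption, not a production recipe:

```python
import math
import random

random.seed(0)

# --- Step 1: generate training data with the teacher ---
# Stand-in teacher: a fixed linear scorer over two features. In practice
# this would be GPT-4 or Claude Opus calls collecting outputs/probabilities.
def teacher_prob(x):
    z = 3.0 * x[0] - 2.0 * x[1] + 0.5
    return 1 / (1 + math.exp(-z))  # a soft label, not just 0/1

data = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(2000)]
soft_labels = [teacher_prob(x) for x in data]

# --- Step 2: train the student to mimic the teacher ---
w = [0.0, 0.0]
b = 0.0
lr = 0.5
for epoch in range(200):
    for x, t in zip(data, soft_labels):
        p = 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))
        g = p - t  # gradient of cross-entropy w.r.t. the student's logit
        w[0] -= lr * g * x[0]
        w[1] -= lr * g * x[1]
        b -= lr * g

# --- Step 3: measure the distillation gap on a held-out set ---
test = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(500)]
agree = sum(
    (teacher_prob(x) > 0.5) == ((w[0] * x[0] + w[1] * x[1] + b) > 0)
    for x in test
)
print(f"Student agrees with teacher on {agree / len(test):.1%} of held-out inputs")
```

Because this toy student has the same capacity as its teacher, the gap nearly closes; with real LLM distillation the student is much smaller, which is exactly why the held-out evaluation in step three matters.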
Common distillation targets include moving from GPT-4 to GPT-4o-mini, from Claude Opus to a fine-tuned Haiku, or from any large model to an open-source model you can self-host.
Common Mistakes
Common mistake: Distilling on too little data, producing a student that memorizes examples rather than learning patterns
Use at least 10,000-50,000 examples for distillation. More data produces better students. Quality matters too: ensure the teacher's outputs are diverse and representative.
Common mistake: Comparing the student model to human performance rather than to the teacher model
The student's ceiling is the teacher's performance. If GPT-4 achieves 93% accuracy on your task, a distilled model achieving 90% is excellent. Expecting it to exceed the teacher is unrealistic.
Common mistake: Ignoring the tradeoff between model size and quality for your latency requirements
Benchmark multiple student sizes. Sometimes a 3B model at 88% accuracy with 50ms latency beats a 7B model at 91% accuracy with 120ms latency. Let your application's latency budget guide the choice.
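That selection logic is easy to make explicit. A minimal sketch, where the model names, accuracy/latency numbers, and the 80ms budget are all hypothetical placeholders for your own benchmark results:

```python
# Hypothetical benchmark results for candidate student models.
candidates = [
    {"model": "student-1b", "accuracy": 0.82, "latency_ms": 25},
    {"model": "student-3b", "accuracy": 0.88, "latency_ms": 50},
    {"model": "student-7b", "accuracy": 0.91, "latency_ms": 120},
]

LATENCY_BUDGET_MS = 80  # assumed application p95 budget

# Pick the most accurate student that fits within the latency budget.
eligible = [c for c in candidates if c["latency_ms"] <= LATENCY_BUDGET_MS]
best = max(eligible, key=lambda c: c["accuracy"])
print(f"Chose {best['model']}: {best['accuracy']:.0%} accuracy "
      f"at {best['latency_ms']}ms")
```

With these assumed numbers the 7B model is excluded outright despite its higher accuracy, which is the point: the latency budget, not raw quality, picks the winner.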
Career Relevance
Distillation is a core MLOps and AI engineering skill. Companies hiring AI engineers expect candidates to understand when and how to distill large models into production-ready smaller models. It is the bridge between prototyping with expensive APIs and deploying cost-effective AI at scale.
Frequently Asked Questions
What is the difference between distillation and fine-tuning?
Fine-tuning trains a model on task-specific data (human-labeled examples). Distillation trains a smaller model on a larger model's outputs. Distillation is a form of fine-tuning, but the training data comes from a model rather than from humans. You can combine both: fine-tune with human data, then distill.
Is it legal to distill from commercial APIs like GPT-4?
This depends on the provider's terms of service. As of 2026, OpenAI's terms restrict using API outputs to train competing models. Anthropic has similar restrictions. Always check current terms before distilling. Using outputs to train models for your own internal use is generally permitted.