The biggest gap between demo-quality AI and production-quality AI is evaluation. Most teams can write a prompt that works for 10 test inputs. Far fewer can prove it works for 10,000.
Evaluation isn't optional. Without it, every prompt change is a gamble. You might improve one case while breaking three others. You won't know until users complain.
This guide gives you a framework that scales from a solo developer testing prompts in a notebook to a team shipping AI features to millions of users.
Why Evaluation Is Hard for LLMs
Evaluating traditional software is straightforward: the function returns the expected output or it doesn't. LLM evaluation is harder for three reasons.
Multiple valid outputs
Ask an LLM to summarize an article and there are hundreds of correct summaries. You can't just string-match against an expected answer. You need to assess quality on dimensions like accuracy, completeness, and conciseness, where reasonable people might disagree.
Subjective quality dimensions
Is the tone right? Is the response helpful? Is it too verbose? These are judgment calls. Different evaluators will score the same output differently. You need evaluation methods that account for subjectivity without being useless.
Failure modes are subtle
A model that hallucinates doesn't throw an error. It confidently produces wrong information that looks right. Catching these failures requires domain knowledge and careful testing, not just automated checks.
The Three-Layer Evaluation Framework
Use three layers of evaluation, each catching different types of issues.
Layer 1: Automated Deterministic Checks
These are pass/fail checks you can run automatically on every response.
- Format compliance: Does the output match the expected structure? If you asked for JSON, is it valid JSON? If you asked for a list of 5 items, are there exactly 5?
- Length constraints: Is the response within acceptable length bounds? Too short suggests the model skipped content. Too long suggests it ignored instructions.
- Required content: Does the response include specific elements you requested? If the prompt says "include a confidence score," check that it's there.
- Forbidden content: Does the response avoid things it shouldn't include? Check for PII leakage, competitor mentions, or off-topic tangents.
- Factual anchors: For responses with verifiable facts (dates, numbers, names), spot-check against ground truth data.
These checks catch 40-60% of issues and cost essentially nothing to run. Build them first.
Write these as simple Python functions that take the model output and return True/False with a reason string. Run them as a post-processing step after every LLM call. Log failures. Alert when failure rate exceeds your threshold (start with 5%).
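A minimal sketch of what such check functions might look like. The function names (`check_valid_json`, `check_length`, `check_required`, `run_checks`) and the default thresholds are illustrative, not a prescribed API:

```python
import json

def check_valid_json(output: str):
    """Pass if the output parses as JSON."""
    try:
        json.loads(output)
        return True, "valid JSON"
    except json.JSONDecodeError as e:
        return False, f"invalid JSON: {e}"

def check_length(output: str, min_chars: int = 50, max_chars: int = 2000):
    """Pass if the output length falls within the accepted bounds."""
    n = len(output)
    if n < min_chars:
        return False, f"too short ({n} < {min_chars} chars)"
    if n > max_chars:
        return False, f"too long ({n} > {max_chars} chars)"
    return True, f"length ok ({n} chars)"

def check_required(output: str, required: list[str]):
    """Pass if every required phrase appears in the output."""
    missing = [r for r in required if r.lower() not in output.lower()]
    if missing:
        return False, f"missing required content: {missing}"
    return True, "all required content present"

def run_checks(output: str, checks):
    """Run each check; return (name, passed, reason) triples for logging."""
    return [(c.__name__, *c(output)) for c in checks]
```

Each check returns a pass/fail flag plus a reason string, so failures can be logged and aggregated into the failure-rate alert described above.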
Layer 2: LLM-as-Judge
Use a stronger or different model to evaluate outputs from your primary model. This catches quality issues that deterministic checks miss.
How it works: send the original prompt, the model's response, and a scoring rubric to an evaluator model. Ask it to score on specific dimensions and provide reasoning.
Example rubric dimensions:
- Relevance (1-5): Does the response address the user's actual question?
- Accuracy (1-5): Are the claims factually correct based on provided context?
- Completeness (1-5): Does the response cover all aspects of the question?
- Clarity (1-5): Is the response easy to understand?
- Conciseness (1-5): Does the response avoid unnecessary repetition or tangents?
Use a different model from the one you're evaluating to avoid self-preference bias. Claude evaluating GPT outputs (or vice versa) tends to produce more honest scores.
Include the scoring rubric in the evaluation prompt. Don't just ask "is this good?" Ask for specific scores on specific dimensions with specific criteria for each score level.
Require the evaluator to provide reasoning before the score. This improves scoring accuracy through chain-of-thought and gives you insight into failure patterns.
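The pieces above can be combined into a judge prompt. This is one possible sketch: the `RUBRIC` wording mirrors the dimensions listed earlier, and `parse_judge_output` assumes you've instructed the judge to put its reasoning first and a JSON score object on the final line. The actual model call is left out; plug in whatever client you use:

```python
import json

# Rubric dimensions from the list above, each scored 1-5.
RUBRIC = {
    "relevance": "Does the response address the user's actual question?",
    "accuracy": "Are the claims factually correct based on provided context?",
    "completeness": "Does the response cover all aspects of the question?",
    "clarity": "Is the response easy to understand?",
    "conciseness": "Does the response avoid unnecessary repetition or tangents?",
}

def build_judge_prompt(original_prompt: str, response: str) -> str:
    """Assemble an evaluation prompt: task, response, rubric, output format."""
    criteria = "\n".join(f"- {name} (1-5): {desc}" for name, desc in RUBRIC.items())
    return (
        "You are evaluating a model response against a rubric.\n\n"
        f"Original prompt:\n{original_prompt}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        f"Score each dimension from 1 to 5:\n{criteria}\n\n"
        "First write your reasoning, then output a JSON object on the final "
        'line, e.g. {"relevance": 4, "accuracy": 5, ...}.'
    )

def parse_judge_output(judge_text: str) -> dict:
    """Extract the JSON score object from the judge's final line."""
    last_line = judge_text.strip().splitlines()[-1]
    scores = json.loads(last_line)
    missing = set(RUBRIC) - set(scores)
    if missing:
        raise ValueError(f"judge omitted dimensions: {missing}")
    return scores
```

Asking for reasoning before the score, as the prompt does, is the chain-of-thought step; the reasoning text is also worth logging for failure-pattern analysis.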
Calibrate by having humans score 50-100 examples and comparing their scores with the LLM-as-judge scores. Adjust your rubric until agreement exceeds 80%.
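The calibration comparison is a one-liner in spirit. A small sketch (names are illustrative); for 1-5 scales, allowing a tolerance of 1 point is a common, more forgiving variant:

```python
def agreement_rate(human_scores, judge_scores, tolerance=0):
    """Fraction of examples where the judge matches the human score
    within `tolerance` points (tolerance=0 means exact match)."""
    assert len(human_scores) == len(judge_scores)
    matches = sum(
        abs(h - j) <= tolerance for h, j in zip(human_scores, judge_scores)
    )
    return matches / len(human_scores)
```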
Layer 3: Human Evaluation
Humans evaluate a random sample of production outputs on a regular cadence. This is the ground truth that validates your automated systems.
Structure human evaluation carefully:
- Sample size: 50-100 outputs per evaluation round is sufficient for most applications
- Cadence: Weekly for new features, monthly for stable features
- Multiple evaluators: Have 2-3 people score each output to account for subjectivity. Measure inter-annotator agreement.
- Blind evaluation: Evaluators shouldn't know which prompt version produced each output. This prevents bias.
- Calibration sessions: Before scoring, have evaluators discuss and align on rubric interpretation using 5-10 example outputs.
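For measuring inter-annotator agreement between two evaluators, Cohen's kappa is a standard choice because it corrects for chance agreement. A self-contained sketch:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement between two raters,
    corrected for the agreement expected by chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Kappa of 1.0 is perfect agreement; 0 means the raters agree no more than chance would predict, which is a strong signal the rubric needs work.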
Human eval is expensive and slow, but it catches things automated systems miss: subtle tone issues, culturally insensitive responses, technically correct but misleading answers.
Building Your Test Suite
A good test suite is the foundation of everything above. Here's how to build one.
Start with Real User Inputs
Don't invent test cases from scratch. Collect real user queries from your application (anonymized if needed). Real inputs have the messiness, ambiguity, and variety that synthetic inputs lack. If you don't have production data yet, recruit 10-20 people to use your system and log their inputs.
Cover the Distribution
Your test suite should match the distribution of real queries:
- Common cases (60%): The straightforward queries that make up most of your traffic
- Edge cases (25%): Unusual but valid inputs (very long queries, multi-part questions, non-English text)
- Adversarial cases (15%): Intentionally tricky inputs (prompt injection attempts, out-of-scope questions, contradictory instructions)
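A quick way to keep the suite honest against these targets is to count tags and flag drift. A sketch, assuming each test case carries a `tag` field with one of the three category names:

```python
from collections import Counter

# Target mix from the distribution above.
TARGET = {"common": 0.60, "edge": 0.25, "adversarial": 0.15}

def distribution_report(test_cases, tolerance=0.05):
    """For each category, return (actual fraction, within-tolerance flag)."""
    counts = Counter(tc["tag"] for tc in test_cases)
    total = sum(counts.values())
    report = {}
    for tag, target in TARGET.items():
        actual = counts.get(tag, 0) / total
        report[tag] = (actual, abs(actual - target) <= tolerance)
    return report
```

Running this in CI whenever test cases are added keeps the suite from drifting toward only-happy-path coverage.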
Size Your Test Suite Appropriately
For a new project: start with 50-100 test cases. That's enough to catch major regressions without being overwhelming to manage. Grow to 200-500 as your system matures. Enterprise applications with high stakes may need 1,000+.
Each test case should include: a unique ID, the input (user message + any context), the expected behavior (not an exact expected output, but criteria the output must meet), tags for categorization (common/edge/adversarial, topic area), and a difficulty rating. Store test cases as JSON or CSV for easy automation.
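Here's one hypothetical test case following those fields (the specific values and field names are illustrative):

```python
import json

# A single test case: ID, input, behavioral criteria (not an exact
# expected output), categorization tags, and a difficulty rating.
test_case = {
    "id": "tc-0042",
    "input": {
        "user_message": "Summarize this refund policy in two sentences.",
        "context": "Refunds are accepted within 30 days with a receipt.",
    },
    "expected_behavior": [
        "mentions the 30-day window",
        "mentions the receipt requirement",
        "is at most two sentences",
    ],
    "tags": ["common", "summarization"],
    "difficulty": "easy",
}

# Serialize the suite so it can be versioned alongside code.
suite_json = json.dumps([test_case], indent=2)
```

Note that `expected_behavior` holds criteria a checker or judge can verify, not a literal string to match, which is what makes the case robust to the many-valid-outputs problem.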
Maintain and Update
Test suites decay. Add new test cases when you discover production failures. Remove cases that are no longer relevant. Review the full suite quarterly to ensure it still represents your actual user base. Treat your test suite like code: version it, review changes, and don't let it rot.
Evaluation Metrics That Matter
Pick metrics that align with your application's goals. Here are the most useful ones.
Task Completion Rate
What percentage of queries result in a response that fully addresses the user's need? This is the single most important metric for most applications. Measure it through human evaluation or user feedback signals (thumbs up/down, follow-up questions).
Accuracy / Correctness
For factual applications: what percentage of claims in the output are verifiable and correct? Measure by spot-checking against ground truth data. For RAG systems, you can automate this by checking if the response is supported by the retrieved documents.
Consistency
Does the same input produce similar quality outputs across multiple runs? Run each test case 3-5 times and measure variance. High variance means your prompt is fragile and will produce unpredictable results in production.
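A sketch of the repeated-run measurement. Here `run_fn` is a placeholder for whatever produces a numeric quality score for one run (for example, an LLM-as-judge score on the model's output); substitute your own:

```python
from statistics import mean, pstdev

def measure_consistency(run_fn, test_input, n_runs=5):
    """Score the same input `n_runs` times and report mean and spread.
    High stdev relative to the mean signals a fragile prompt."""
    scores = [run_fn(test_input) for _ in range(n_runs)]
    return {"mean": mean(scores), "stdev": pstdev(scores), "scores": scores}
```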
Latency
Time from request to response. Set a p95 target (e.g., 95% of responses under 3 seconds) and monitor it. Latency affects user experience directly and is easy to measure automatically.
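Computing p95 from logged latencies needs no dependencies. A minimal nearest-rank sketch:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct% of observations are at or below it (e.g. pct=95 for p95)."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]
```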
Cost Per Query
Total token cost for each query (input + output tokens). Track this per query type. Some queries should cost more than others, but spikes indicate prompt inefficiency or runaway generation.
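The per-query arithmetic is simple but worth centralizing so every query type is costed the same way. The prices below are hypothetical placeholders; substitute your model's actual per-million-token rates:

```python
# Hypothetical prices in dollars per million tokens -- replace with
# your provider's actual rates.
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00

def query_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one query at the assumed token prices."""
    return (
        input_tokens / 1_000_000 * PRICE_PER_M_INPUT
        + output_tokens / 1_000_000 * PRICE_PER_M_OUTPUT
    )
```

Tagging each logged cost with its query type makes the spikes mentioned above easy to attribute to a specific prompt or feature.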
Common Evaluation Pitfalls
Testing only the happy path
If your test suite is all clean, well-formatted, straightforward queries, it doesn't represent reality. Real users send typos, incomplete sentences, ambiguous questions, and occasionally try to break your system. Your evals need to cover this.
Optimizing for one metric at the expense of others
Maximizing accuracy might make responses longer and slower. Maximizing conciseness might lose important details. Track multiple metrics and watch for tradeoffs when you make prompt changes.
Evaluating too infrequently
Running evals once before launch and never again is a recipe for silent quality degradation. Model updates, query distribution shifts, and knowledge changes all affect output quality over time. Automate your evals and run them continuously.
Not involving domain experts
Engineers can evaluate format and structure. But for a medical Q&A system, you need doctors evaluating accuracy. For a legal document generator, you need lawyers. Domain expertise in evaluation is non-negotiable for specialized applications.
Ignoring inter-annotator disagreement
If your human evaluators disagree on 40% of scores, your rubric is too vague. Refine it until agreement is above 75-80%. Disagreement isn't noise to average out. It's a signal that your quality criteria need clarification.
Putting It All Together
Here's the evaluation stack I recommend for most production LLM applications:
- Every request: Automated deterministic checks (format, length, required content)
- Every prompt change: Full test suite run with LLM-as-judge scoring
- Weekly: LLM-as-judge evaluation on a sample of production outputs
- Monthly: Human evaluation on 50-100 production outputs
- Quarterly: Test suite review and update, rubric calibration
Start with Layer 1 (automated checks). It takes a day to implement and catches the most obvious issues. Add Layer 2 (LLM-as-judge) in week two. Layer 3 (human eval) can start as informal reviews and formalize over time.
The teams that ship reliable AI products aren't the ones with the best prompts. They're the ones with the best evaluation systems. Build yours early and maintain it continuously.
For related reading, see our guide on prompt engineering best practices and the prompt engineering glossary entry.