The biggest gap between demo-quality AI and production-quality AI is evaluation. Most teams can write a prompt that works for 10 test inputs. Far fewer can prove it works for 10,000.
Evaluation isn't optional. Without it, every prompt change is a gamble. You might improve one case while breaking three others. You won't know until users complain.
This guide gives you a framework that scales from a solo developer testing prompts in a notebook to a team shipping AI features to millions of users.
Why Evaluation Is Hard for LLMs
Evaluating traditional software is straightforward: the function returns the expected output or it doesn't. LLM evaluation is harder for three reasons.
Multiple valid outputs
Ask an LLM to summarize an article and there are hundreds of correct summaries. You can't just string-match against an expected answer. You need to assess quality on dimensions like accuracy, completeness, and conciseness, where reasonable people might disagree.
Subjective quality dimensions
Is the tone right? Is the response helpful? Is it too verbose? These are judgment calls. Different evaluators will score the same output differently. You need evaluation methods that account for subjectivity without being useless.
Failure modes are subtle
A model that hallucinates doesn't throw an error. It confidently produces wrong information that looks right. Catching these failures requires domain knowledge and careful testing, not just automated checks.
The Three-Layer Evaluation Framework
Use three layers of evaluation, each catching different types of issues.
Layer 1: Automated Deterministic Checks
These are pass/fail checks you can run automatically on every response.
- Format compliance: Does the output match the expected structure? If you asked for JSON, is it valid JSON? If you asked for a list of 5 items, are there exactly 5?
- Length constraints: Is the response within acceptable length bounds? Too short suggests the model skipped content. Too long suggests it ignored instructions.
- Required content: Does the response include specific elements you requested? If the prompt says "include a confidence score," check that it's there.
- Forbidden content: Does the response avoid things it shouldn't include? Check for PII leakage, competitor mentions, or off-topic tangents.
- Factual anchors: For responses with verifiable facts (dates, numbers, names), spot-check against ground truth data.
These checks catch 40-60% of issues and cost essentially nothing to run. Build them first.
Write these as simple Python functions that take the model output and return True/False with a reason string. Run them as a post-processing step after every LLM call. Log failures. Alert when failure rate exceeds your threshold (start with 5%).
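A minimal sketch of what such check functions might look like. The function names (`check_valid_json`, `check_length`, `check_required`, `run_checks`) and the default thresholds are illustrative, not a prescribed API:

```python
import json

def check_valid_json(output: str):
    """Pass if the output parses as JSON."""
    try:
        json.loads(output)
        return True, "valid JSON"
    except json.JSONDecodeError as e:
        return False, f"invalid JSON: {e}"

def check_length(output: str, min_chars: int = 50, max_chars: int = 2000):
    """Pass if the output length falls within the accepted bounds."""
    n = len(output)
    if n < min_chars:
        return False, f"too short ({n} < {min_chars} chars)"
    if n > max_chars:
        return False, f"too long ({n} > {max_chars} chars)"
    return True, f"length ok ({n} chars)"

def check_required(output: str, required: list[str]):
    """Pass if every required phrase appears in the output."""
    missing = [r for r in required if r.lower() not in output.lower()]
    if missing:
        return False, f"missing required content: {missing}"
    return True, "all required content present"

def run_checks(output: str, checks):
    """Run each check; return (name, passed, reason) triples for logging."""
    return [(c.__name__, *c(output)) for c in checks]
```

Each check returns a pass/fail flag plus a reason string, so failures can be logged and aggregated into the failure-rate alert described above.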
Layer 2: LLM-as-Judge
Use a stronger or different model to evaluate outputs from your primary model. This catches quality issues that deterministic checks miss.
How it works: send the original prompt, the model's response, and a scoring rubric to an evaluator model. Ask it to score on specific dimensions and provide reasoning.
Example rubric dimensions:
- Relevance (1-5): Does the response address the user's actual question?
- Accuracy (1-5): Are the claims factually correct based on provided context?
- Completeness (1-5): Does the response cover all aspects of the question?
- Clarity (1-5): Is the response easy to understand?
- Conciseness (1-5): Does the response avoid unnecessary repetition or tangents?
Use a different model from the one you're evaluating to avoid self-preference bias. Claude evaluating GPT outputs (or vice versa) tends to produce more honest scores.
Include the scoring rubric in the evaluation prompt. Don't just ask "is this good?" Ask for specific scores on specific dimensions with specific criteria for each score level.
Require the evaluator to provide reasoning before the score. This improves scoring accuracy through chain-of-thought and gives you insight into failure patterns.
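The pieces above can be combined into a judge prompt. This is one possible sketch: the `RUBRIC` wording mirrors the dimensions listed earlier, and `parse_judge_output` assumes you've instructed the judge to put its reasoning first and a JSON score object on the final line. The actual model call is left out; plug in whatever client you use:

```python
import json

# Rubric dimensions from the list above, each scored 1-5.
RUBRIC = {
    "relevance": "Does the response address the user's actual question?",
    "accuracy": "Are the claims factually correct based on provided context?",
    "completeness": "Does the response cover all aspects of the question?",
    "clarity": "Is the response easy to understand?",
    "conciseness": "Does the response avoid unnecessary repetition or tangents?",
}

def build_judge_prompt(original_prompt: str, response: str) -> str:
    """Assemble an evaluation prompt: task, response, rubric, output format."""
    criteria = "\n".join(f"- {name} (1-5): {desc}" for name, desc in RUBRIC.items())
    return (
        "You are evaluating a model response against a rubric.\n\n"
        f"Original prompt:\n{original_prompt}\n\n"
        f"Response to evaluate:\n{response}\n\n"
        f"Score each dimension from 1 to 5:\n{criteria}\n\n"
        "First write your reasoning, then output a JSON object on the final "
        'line, e.g. {"relevance": 4, "accuracy": 5, ...}.'
    )

def parse_judge_output(judge_text: str) -> dict:
    """Extract the JSON score object from the judge's final line."""
    last_line = judge_text.strip().splitlines()[-1]
    scores = json.loads(last_line)
    missing = set(RUBRIC) - set(scores)
    if missing:
        raise ValueError(f"judge omitted dimensions: {missing}")
    return scores
```

Asking for reasoning before the score, as the prompt does, is the chain-of-thought step; the reasoning text is also worth logging for failure-pattern analysis.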
Calibrate by having humans score 50-100 examples and comparing their scores with the LLM-as-judge scores. Adjust your rubric until agreement exceeds 80%.
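The calibration comparison is a one-liner in spirit. A small sketch (names are illustrative); for 1-5 scales, allowing a tolerance of 1 point is a common, more forgiving variant:

```python
def agreement_rate(human_scores, judge_scores, tolerance=0):
    """Fraction of examples where the judge matches the human score
    within `tolerance` points (tolerance=0 means exact match)."""
    assert len(human_scores) == len(judge_scores)
    matches = sum(
        abs(h - j) <= tolerance for h, j in zip(human_scores, judge_scores)
    )
    return matches / len(human_scores)
```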
Layer 3: Human Evaluation
Humans evaluate a random sample of production outputs on a regular cadence. This is the ground truth that validates your automated systems.
Structure human evaluation carefully:
- Sample size: 50-100 outputs per evaluation round is sufficient for most applications
- Cadence: Weekly for new features, monthly for stable features
- Multiple evaluators: Have 2-3 people score each output to account for subjectivity. Measure inter-annotator agreement.
- Blind evaluation: Evaluators shouldn't know which prompt version produced each output. This prevents bias.
- Calibration sessions: Before scoring, have evaluators discuss and align on rubric interpretation using 5-10 example outputs.
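For measuring inter-annotator agreement between two evaluators, Cohen's kappa is a standard choice because it corrects for chance agreement. A self-contained sketch:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement between two raters,
    corrected for the agreement expected by chance."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Kappa of 1.0 is perfect agreement; 0 means the raters agree no more than chance would predict, which is a strong signal the rubric needs work.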
Human eval is expensive and slow, but it catches things automated systems miss: subtle tone issues, culturally insensitive responses, technically correct but misleading answers.
Building Your Test Suite
A good test suite is the foundation of everything above. Here's how to build one.
Start with Real User Inputs
Don't invent test cases from scratch. Collect real user queries from your application (anonymized if needed). Real inputs have the messiness, ambiguity, and variety that synthetic inputs lack. If you don't have production data yet, recruit 10-20 people to use your system and log their inputs.
Cover the Distribution
Your test suite should match the distribution of real queries:
- Common cases (60%): The straightforward queries that make up most of your traffic
- Edge cases (25%): Unusual but valid inputs (very long queries, multi-part questions, non-English text)
- Adversarial cases (15%): Intentionally tricky inputs (prompt injection attempts, out-of-scope questions, contradictory instructions)
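A quick way to keep the suite honest against these targets is to count tags and flag drift. A sketch, assuming each test case carries a `tag` field with one of the three category names:

```python
from collections import Counter

# Target mix from the distribution above.
TARGET = {"common": 0.60, "edge": 0.25, "adversarial": 0.15}

def distribution_report(test_cases, tolerance=0.05):
    """For each category, return (actual fraction, within-tolerance flag)."""
    counts = Counter(tc["tag"] for tc in test_cases)
    total = sum(counts.values())
    report = {}
    for tag, target in TARGET.items():
        actual = counts.get(tag, 0) / total
        report[tag] = (actual, abs(actual - target) <= tolerance)
    return report
```

Running this in CI whenever test cases are added keeps the suite from drifting toward only-happy-path coverage.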
Size Your Test Suite Appropriately
For a new project: start with 50-100 test cases. That's enough to catch major regressions without being overwhelming to manage. Grow to 200-500 as your system matures. Enterprise applications with high stakes may need 1,000+.
Each test case should include: a unique ID, the input (user message + any context), the expected behavior (not an exact expected output, but criteria the output must meet), tags for categorization (common/edge/adversarial, topic area), and a difficulty rating. Store test cases as JSON or CSV for easy automation.
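Here's one hypothetical test case following those fields (the specific values and field names are illustrative):

```python
import json

# A single test case: ID, input, behavioral criteria (not an exact
# expected output), categorization tags, and a difficulty rating.
test_case = {
    "id": "tc-0042",
    "input": {
        "user_message": "Summarize this refund policy in two sentences.",
        "context": "Refunds are accepted within 30 days with a receipt.",
    },
    "expected_behavior": [
        "mentions the 30-day window",
        "mentions the receipt requirement",
        "is at most two sentences",
    ],
    "tags": ["common", "summarization"],
    "difficulty": "easy",
}

# Serialize the suite so it can be versioned alongside code.
suite_json = json.dumps([test_case], indent=2)
```

Note that `expected_behavior` holds criteria a checker or judge can verify, not a literal string to match, which is what makes the case robust to the many-valid-outputs problem.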
Maintain and Update
Test suites decay. Add new test cases when you discover production failures. Remove cases that are no longer relevant. Review the full suite quarterly to ensure it still represents your actual user base. Treat your test suite like code: version it, review changes, and don't let it rot.
Evaluation Metrics That Matter
Pick metrics that align with your application's goals. Here are the most useful ones.
Task Completion Rate
What percentage of queries result in a response that fully addresses the user's need? This is the single most important metric for most applications. Measure it through human evaluation or user feedback signals (thumbs up/down, follow-up questions).
Accuracy / Correctness
For factual applications: what percentage of claims in the output are verifiable and correct? Measure by spot-checking against ground truth data. For RAG systems, you can automate this by checking if the response is supported by the retrieved documents.
Consistency
Does the same input produce similar quality outputs across multiple runs? Run each test case 3-5 times and measure variance. High variance means your prompt is fragile and will produce unpredictable results in production.
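A sketch of the repeated-run measurement. Here `run_fn` is a placeholder for whatever produces a numeric quality score for one run (for example, an LLM-as-judge score on the model's output); substitute your own:

```python
from statistics import mean, pstdev

def measure_consistency(run_fn, test_input, n_runs=5):
    """Score the same input `n_runs` times and report mean and spread.
    High stdev relative to the mean signals a fragile prompt."""
    scores = [run_fn(test_input) for _ in range(n_runs)]
    return {"mean": mean(scores), "stdev": pstdev(scores), "scores": scores}
```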
Latency
Time from request to response. Set a p95 target (e.g., 95% of responses under 3 seconds) and monitor it. Latency affects user experience directly and is easy to measure automatically.
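Computing p95 from logged latencies needs no dependencies. A minimal nearest-rank sketch:

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value such that at least
    pct% of observations are at or below it (e.g. pct=95 for p95)."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]
```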
Cost Per Query
Total token cost for each query (input + output tokens). Track this per query type. Some queries should cost more than others, but spikes indicate prompt inefficiency or runaway generation.
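The per-query arithmetic is simple but worth centralizing so every query type is costed the same way. The prices below are hypothetical placeholders; substitute your model's actual per-million-token rates:

```python
# Hypothetical prices in dollars per million tokens -- replace with
# your provider's actual rates.
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00

def query_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one query at the assumed token prices."""
    return (
        input_tokens / 1_000_000 * PRICE_PER_M_INPUT
        + output_tokens / 1_000_000 * PRICE_PER_M_OUTPUT
    )
```

Tagging each logged cost with its query type makes the spikes mentioned above easy to attribute to a specific prompt or feature.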
Common Evaluation Pitfalls
Testing only the happy path
If your test suite is all clean, well-formatted, straightforward queries, it doesn't represent reality. Real users send typos, incomplete sentences, ambiguous questions, and occasionally try to break your system. Your evals need to cover this.
Optimizing for one metric at the expense of others
Maximizing accuracy might make responses longer and slower. Maximizing conciseness might lose important details. Track multiple metrics and watch for tradeoffs when you make prompt changes.
Evaluating too infrequently
Running evals once before launch and never again is a recipe for silent quality degradation. Model updates, query distribution shifts, and knowledge changes all affect output quality over time. Automate your evals and run them continuously.
Not involving domain experts
Engineers can evaluate format and structure. But for a medical Q&A system, you need doctors evaluating accuracy. For a legal document generator, you need lawyers. Domain expertise in evaluation is non-negotiable for specialized applications.
Ignoring inter-annotator disagreement
If your human evaluators disagree on 40% of scores, your rubric is too vague. Refine it until agreement is above 75-80%. Disagreement isn't noise to average out. It's a signal that your quality criteria need clarification.
Putting It All Together
Here's the evaluation stack I recommend for most production LLM applications:
- Every request: Automated deterministic checks (format, length, required content)
- Every prompt change: Full test suite run with LLM-as-judge scoring
- Weekly: LLM-as-judge evaluation on a sample of production outputs
- Monthly: Human evaluation on 50-100 production outputs
- Quarterly: Test suite review and update, rubric calibration
Start with Layer 1 (automated checks). It takes a day to implement and catches the most obvious issues. Add Layer 2 (LLM-as-judge) in week two. Layer 3 (human eval) can start as informal reviews and formalize over time.
The teams that ship reliable AI products aren't the ones with the best prompts. They're the ones with the best evaluation systems. Build yours early and maintain it continuously.
For related reading, see our guide on prompt engineering best practices and the prompt engineering glossary entry.