Evaluating AI Features: How Do You Know If Your LLM Is Actually Good?
Your team just shipped an AI feature. The demo went well. Leadership is excited. But someone on the team asks the uncomfortable question: "How do we know this is actually good?" You realize you don't have an answer. There are no unit tests. There's no pass/fail. The model gives a different response every time. Welcome to the world of probabilistic quality assurance—and most teams are woefully unprepared for it.
The paradigm shift: Deterministic vs. probabilistic testing
Traditional software testing is deterministic. Given input X, expect output Y. If f(2, 3) doesn't return 5, the test fails. This mental model is so deeply ingrained that teams instinctively try to apply it to LLM outputs—and it doesn't work.
LLM outputs are probabilistic. The same input can produce different outputs. "Good" is subjective—a response can be factually correct but poorly worded, or beautifully written but missing key information. The same response might be excellent for one user and terrible for another.
This doesn't mean you can't evaluate AI features rigorously. It means you need different tools and frameworks. The field calls this practice "evals"—and it's rapidly becoming the most important discipline in AI product development.
The three levels of evaluation
Level 1: Automated metrics (necessary but insufficient)
These are quantitative measurements you can run without human judgment. They tell you something about quality, but they don't tell you everything.
- Factual accuracy (for RAG systems): Does the response align with the retrieved context? You can automate this with an "LLM-as-judge" approach: use a second model to evaluate whether the response is faithful to the source documents.
- Format compliance: If the model should return JSON, does it? If it should include citations, does it? These are binary and easy to automate.
- Toxicity / safety: Run outputs through safety classifiers. This is table stakes for any user-facing AI.
- Latency and cost: Quantitative, easily measured, and directly tied to user experience and business viability.
- Response length: Is the model being appropriately concise or verbosely padding responses? Track token counts and compare against your target range.
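The format and length checks above are simple enough to sketch directly. This is a minimal illustration, not a production validator; the citation pattern (`[n]`-style markers) and the target token range are assumptions you would tune for your own product:

```python
import json
import re

TARGET_TOKEN_RANGE = (50, 400)  # illustrative target range; tune per product

def check_format(output: str) -> dict:
    """Binary automated checks: JSON validity, citation presence, length."""
    results = {}
    # Format compliance: does the output parse as JSON?
    try:
        json.loads(output)
        results["valid_json"] = True
    except json.JSONDecodeError:
        results["valid_json"] = False
    # Citation check: at least one [n]-style marker (assumed convention)
    results["has_citation"] = bool(re.search(r"\[\d+\]", output))
    # Length check: rough token estimate via whitespace split
    token_count = len(output.split())
    lo, hi = TARGET_TOKEN_RANGE
    results["in_length_range"] = lo <= token_count <= hi
    return results
```

Because every check returns a boolean, these slot naturally into a CI pipeline as pass/fail assertions.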
Level 2: LLM-as-judge evaluation
This is one of the most powerful patterns in modern AI evaluation. You use a separate LLM (often a more capable one) to evaluate the outputs of your production model. It sounds circular, but it works surprisingly well in practice.
Here's how to set it up:
- Define a rubric with specific criteria: relevance, accuracy, completeness, tone, helpfulness.
- For each criterion, define a 1-5 scoring scale with clear descriptions for each level.
- Send the original query, the context (if RAG), and the model's response to the judge model.
- The judge returns scores and brief explanations for each criterion.
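The four steps above can be sketched as follows. The rubric text, the prompt template, and the `call_llm` parameter (a stand-in for whatever client wraps your judge model's API) are all illustrative assumptions:

```python
import json

# Abbreviated rubric; a real one would describe all five levels per criterion
RUBRIC = {
    "relevance": "1 = off-topic ... 5 = directly answers the query",
    "accuracy": "1 = contradicts the context ... 5 = fully faithful to it",
    "completeness": "1 = misses key information ... 5 = covers everything needed",
}

JUDGE_TEMPLATE = """You are an evaluation judge. Score the response on each
criterion from 1 to 5 and give a one-sentence explanation per criterion.

Criteria:
{rubric}

Query: {query}
Context: {context}
Response: {response}

Reply with JSON: {{"scores": {{"criterion": 1-5}}, "explanations": {{"criterion": "..."}}}}"""

def judge(query: str, context: str, response: str, call_llm) -> dict:
    """Send query, context, and response to a judge model; parse its verdict.
    `call_llm` is a hypothetical function: prompt string in, completion out."""
    rubric_text = "\n".join(f"- {name}: {scale}" for name, scale in RUBRIC.items())
    prompt = JUDGE_TEMPLATE.format(rubric=rubric_text, query=query,
                                   context=context, response=response)
    return json.loads(call_llm(prompt))
```

In practice you would also handle the judge returning malformed JSON (retry, or fall back to a default score) rather than letting `json.loads` raise.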
Research shows that GPT-4-class judges agree with human evaluators about 80-85% of the time on quality assessments—roughly the same rate at which two human evaluators agree with each other. This doesn't make LLM judges perfect, but it makes them useful at scale.
PM takeaway: LLM-as-judge can run on every output in production, giving you continuous quality monitoring. Human evaluation can't scale like this. Use automated judges for breadth and human review for depth.
Level 3: Human evaluation (the gold standard)
Ultimately, AI quality is determined by human judgment. Automated metrics and LLM judges are proxies. For high-stakes features, you need humans reviewing outputs regularly.
Practical approaches:
- Weekly sample review: Pull 50-100 random production interactions per week. Have 2-3 team members rate them on a rubric. Track scores over time. This takes 2-3 hours/week and gives you ground truth.
- Adversarial testing: Assign someone (or a rotating group) to actively try to break the AI. Jailbreak attempts, edge cases, ambiguous queries, domain-specific gotchas. Document failures in a shared registry.
- User feedback loops: Thumbs up/down on responses, "Was this helpful?" prompts, or explicit feedback forms. These are noisy but valuable at scale; a sudden drop in thumbs-up rate is often the first visible sign of quality degradation.
- A/B testing: When comparing models, prompt variants, or system configurations, run controlled A/B tests with real users. Track completion rates, follow-up questions (a proxy for response quality), and user satisfaction.
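For the weekly sample review, a small aggregation script keeps the scores trackable over time. The rating record shape here is an assumption; adapt it to however your reviewers log their rubric scores:

```python
from collections import defaultdict
from statistics import mean

def weekly_review_summary(ratings: list[dict]) -> dict:
    """Aggregate human rubric ratings into per-criterion means.
    Each rating is assumed to look like:
    {"reviewer": "alice", "criterion": "accuracy", "score": 4}"""
    by_criterion = defaultdict(list)
    for rating in ratings:
        by_criterion[rating["criterion"]].append(rating["score"])
    # One mean per criterion; chart these week over week to spot trends
    return {criterion: mean(scores) for criterion, scores in by_criterion.items()}
```

Plotting these weekly means is usually enough to spot a regression before it shows up in support tickets.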
Building an eval suite: The practical playbook
Step 1: Define your golden dataset
Create a curated set of 100-500 test cases that represent the full range of your use case. Each test case should include the input (query + context), one or more reference "good" answers, and scoring criteria. This is your regression suite. Run it whenever you change models, prompts, or system configuration.
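A test case with the structure described above might look like this as a schema. The field names and tags are illustrative, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class GoldenCase:
    """One regression test case in the golden dataset."""
    case_id: str
    query: str
    context: str                       # retrieved documents, if RAG
    reference_answers: list[str]       # one or more known-good answers
    criteria: dict[str, str] = field(default_factory=dict)  # criterion -> rubric text
    tags: list[str] = field(default_factory=list)           # e.g. "edge-case", "adversarial"
```

Storing these as JSON or YAML files in the repo keeps the suite versioned alongside the prompts it tests.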
Step 2: Automate what you can
Build a pipeline that runs your golden dataset through the model, evaluates outputs using automated metrics and LLM-as-judge, and produces a quality scorecard. Run this in CI/CD. If the score drops below a threshold, block the deployment.
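The gating step can be as simple as the sketch below. The threshold value and the scorecard shape (per-case judge scores) are assumptions to adapt:

```python
THRESHOLD = 4.0  # illustrative minimum mean judge score (1-5 scale)

def quality_gate(scorecards: list[dict], threshold: float = THRESHOLD) -> bool:
    """Aggregate per-case judge scores; decide whether deployment may proceed.
    Each scorecard is assumed to look like {"scores": {"relevance": 4, ...}}."""
    all_scores = [score for card in scorecards for score in card["scores"].values()]
    mean_score = sum(all_scores) / len(all_scores)
    print(f"mean quality score: {mean_score:.2f} (threshold {threshold})")
    return mean_score >= threshold  # False => CI blocks the deploy
```

In CI, the wrapper script would exit nonzero when `quality_gate` returns `False`, which fails the build and blocks the deployment.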
Step 3: Monitor production continuously
Log every production interaction (input, output, latency, token count). Run LLM-as-judge on a sample of production outputs daily. Set up alerts for quality degradation. Models can degrade silently—a provider update, a changed system prompt, a subtle data pipeline bug—and without monitoring, you won't know until users complain.
Step 4: Close the feedback loop
Route human evaluations and user feedback back into your golden dataset. When you discover a new failure mode, add it as a test case. Your eval suite should grow over time, becoming an increasingly comprehensive representation of what "good" looks like for your product.
Metrics that actually matter
Not all metrics are equally useful. Here's a prioritized list for most AI products:
- Task completion rate: Did the user accomplish what they came to do? This is the North Star metric. If users are abandoning mid-conversation, your AI isn't working—regardless of how eloquent the responses are.
- Factual accuracy: For any product that provides information, this is critical. Measure it through human review and LLM-as-judge evaluation against source documents.
- User satisfaction: Thumbs up/down ratio, NPS, CSAT. These are lagging indicators but they capture the holistic experience.
- Escalation rate: For support use cases, how often does the AI fail and need to hand off to a human? This is both a quality metric and a cost metric.
- Hallucination rate: What percentage of responses contain fabricated information? Sample and measure weekly.
- Cost per successful interaction: Not just cost per API call, but cost per interaction that actually helped the user. This combines cost optimization with quality measurement.
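Cost per successful interaction falls out of the logs directly. The log-entry fields here (`cost` in dollars, a boolean `success` flag such as task completion) are assumed names:

```python
def cost_per_successful_interaction(logs: list[dict]) -> float:
    """Total spend divided by the number of interactions that actually
    helped the user. Each entry is assumed to carry a `cost` (API spend
    in dollars) and a `success` flag (e.g. task completed)."""
    total_cost = sum(entry["cost"] for entry in logs)
    successes = sum(1 for entry in logs if entry["success"])
    if successes == 0:
        return float("inf")  # spending money with nothing to show for it
    return total_cost / successes
```

Note how failed interactions still count toward the numerator: a cheap model with a low success rate can easily cost more per useful answer than an expensive one.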
Common mistakes in AI evaluation
- Testing only with "happy path" queries: Your demo dataset is not representative of production traffic. Real users ask ambiguous, misspelled, multi-part, adversarial, and out-of-scope questions.
- Evaluating on vibes: "It seems good" is not a quality strategy. Define rubrics, measure scores, track trends.
- Over-indexing on benchmarks: Public benchmarks (MMLU, HumanEval, etc.) measure general capability. They don't measure performance on your specific use case with your specific data. Always evaluate on your own golden dataset.
- Ignoring regression: You change the system prompt "just a little" and don't re-run evals. Two weeks later, support tickets spike. Always run your eval suite before deploying prompt changes.
- One-time evaluation: Evaluating at launch and never again. Models change, data changes, user behavior changes. Evaluation must be continuous.
The eval maturity model
Where does your team fall?
- Level 0 — No evals: "We tried a few prompts and it seemed fine." (Most teams start here. Don't stay long.)
- Level 1 — Ad hoc testing: Manual spot-checking before releases. Better than nothing, but inconsistent.
- Level 2 — Golden dataset: A curated test suite with automated scoring. Run before each release.
- Level 3 — Continuous monitoring: Production outputs are evaluated continuously. Alerts fire when quality drops.
- Level 4 — Eval-driven development: Evals are written before features, like TDD for AI. Prompt changes are evaluated automatically in CI/CD. The golden dataset grows from production feedback.
Most teams are at Level 0 or 1. Getting to Level 2 is achievable in a sprint. Getting to Level 3 takes a quarter. Level 4 is aspirational but worth pursuing for any serious AI product.
| Method | Speed | Cost | Accuracy | Best For |
|---|---|---|---|---|
| Automated (LLM-as-judge) | Fast | Low | Moderate | Scale testing, regression |
| Human evaluation | Slow | High | High | Nuanced quality, UX |
| A/B testing | Slow | Medium | High | User preference, engagement |
| Unit evals (assertions) | Instant | Free | Binary | Format validation, guardrails |