LLM Evaluation: How to Know If Your AI Is Actually Working

Building an LLM-powered application is the easy part. Knowing whether it is working well — and detecting when it regresses — is the hard part. LLM evaluation (evals) is an underdeveloped practice in most teams building with AI. Here is what to understand.

Why Evals Are Hard

Evaluating LLM output is harder than evaluating traditional software for a fundamental reason: LLM outputs are non-deterministic and high-dimensional. A bug in traditional code either exists or it doesn’t; you run a test and get pass or fail. An LLM response can be correct, wrong, or partially correct, stylistically appropriate or not, acceptable in one context and not in another, and it can vary from run to run.

Traditional testing paradigms don’t fit. You cannot write assert output == "expected answer" because the correct answer can be phrased in countless ways. You cannot cheaply run 10,000 integration tests on every change because each run costs money and takes time, and the results are not deterministic.

Common mistakes: evaluating on the examples you used to prompt-engineer (overfitting to your test set); relying only on human evaluation (doesn’t scale); evaluating on the wrong metrics (length, confidence, fluency, rather than accuracy and task completion); and not testing edge cases and adversarial inputs.
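To make the assert problem concrete, here is a minimal sketch in Python. The question, the three responses, and the helper names (normalize, mentions_fact) are invented for illustration; they are not from any particular library.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def mentions_fact(output: str, required: str) -> bool:
    """Loose check: the normalized output contains the normalized key fact."""
    return normalize(required) in normalize(output)

expected = "Paris"

# Three hypothetical responses to "What is the capital of France?"
outputs = [
    "Paris",
    "The capital of France is Paris.",
    "It's Paris, which has been the capital for centuries.",
]

# Strict equality only accepts the first phrasing.
strict = [out == expected for out in outputs]

# The looser check accepts all three.
loose = [mentions_fact(out, expected) for out in outputs]

print(strict)
print(loose)
```

The loose check has failure modes of its own: "It is not Paris" would also pass. That is why simple string checks tend to suit only short, factual answers, and why the other eval types exist.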

Types of Evals

Exact match: for tasks with ground truth, such as question answering with known correct answers, classification, and extraction. Does the model output match the expected answer? Works for a limited class of tasks.

Criteria-based rubrics: define the criteria a response must meet (factual accuracy, completeness, appropriate tone, absence of hallucination) and score responses against them. Scoring can be done by human raters or by LLM judges.

LLM-as-judge: use a capable LLM (often GPT-4 or Claude) to evaluate the output of another LLM against defined criteria. Pros: scalable, faster than human evaluation, reasonably reliable for well-defined criteria. Cons: LLM judges have known biases (they prefer longer answers, prefer their own output style, and can be manipulated by the prompt).

Human evaluation: the ground truth for subjective quality such as relevance, helpfulness, and tone. Essential for calibrating automated metrics, but it does not scale.

A/B comparison: present two outputs to human raters or an LLM judge and ask which is better, without revealing which system produced which (blind comparison).

Reference-based: compare model output against a reference answer using a similarity metric. BLEU and ROUGE measure n-gram overlap; BERTScore measures embedding-based semantic similarity. Limited to tasks where reference answers exist.

Red-teaming / adversarial testing: systematically attempt to make the model fail: hallucinate, produce harmful content, or refuse legitimate requests. This should be a standard part of any production eval suite.
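Three of these eval types are simple enough to sketch in a few lines of Python. The block below is an illustration under stated assumptions, not a reference implementation: token_f1 is a simple token-overlap score standing in for a reference-based metric (it is not BLEU, ROUGE, or BERTScore), and the judge argument is a hypothetical callable that returns "first" or "second". In practice it would wrap a call to a judge model or a human rating form.

```python
import random
import re
from collections import Counter
from typing import Callable

def tokens(text: str) -> list[str]:
    """Lowercase, strip punctuation, and split into tokens."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()

def exact_match(output: str, reference: str) -> bool:
    """Exact match after normalizing case, punctuation, and whitespace."""
    return tokens(output) == tokens(reference)

def token_f1(output: str, reference: str) -> float:
    """Token-overlap F1 between output and reference (surface overlap, not semantics)."""
    out, ref = tokens(output), tokens(reference)
    overlap = sum((Counter(out) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def blind_pairwise(
    judge: Callable[[str, str, str], str],  # hypothetical: returns "first" or "second"
    question: str,
    answer_a: str,
    answer_b: str,
    rng: random.Random,
) -> str:
    """Show the two answers in random order and return "a" or "b" for the winner."""
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    verdict = judge(question, first, second)
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    picked_first = verdict == "first"
    # Map the positional verdict back to the original labels.
    if picked_first:
        return "b" if swapped else "a"
    return "a" if swapped else "b"

def length_biased_judge(question: str, first: str, second: str) -> str:
    """Stub judge that always prefers the longer answer (the bias noted above)."""
    return "first" if len(first) >= len(second) else "second"

print(exact_match("  PARIS. ", "Paris"))
print(round(token_f1("The capital is Paris", "Paris"), 2))
print(blind_pairwise(length_biased_judge, "Capital of France?",
                     "Paris", "The capital of France is Paris.", random.Random(0)))
```

Randomizing the order hides which system produced which answer, but it does nothing about the length preference: the stub judge picks the longer answer in either position. Checking judge verdicts against a sample of human ratings is one way to catch that kind of bias.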

Building a Practical Eval System

Start with a golden dataset: 50–200 examples that represent your actual use cases, with expected outputs (or criteria) labelled. This is the core of any eval system.

Add cases as the model fails: every production failure should become an eval case.

Automate your evals on every change: if you change the system prompt, the model version, or the retrieval system, run your evals automatically and compare scores.

Track regressions, not just absolute quality: what matters is whether a change made things better or worse, and on which cases, not only what the overall score is.

Measure what your users care about: a legal contract drafting assistant should be evaluated on accuracy, completeness, and absence of hallucination, not fluency. A customer support bot should be evaluated on resolution rate, escalation rate, and customer satisfaction.

Separate evaluation data from tuning data: if your evals use the same examples you used to tune your prompts, your evals are meaningless. You are measuring how well the prompt fits those examples, not whether the system generalises.
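The golden-dataset and regression-tracking ideas can be sketched as a small harness. Everything below is hypothetical: the EvalCase structure, the three support-bot cases, and the two stub systems that stand in for a real application before and after a prompt change. A real harness would call the actual application and would likely use richer checks than substring tests.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class EvalCase:
    case_id: str
    prompt: str
    check: Callable[[str], bool]  # True if the output is acceptable

def run_evals(system: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every golden case through the system and record pass/fail per case."""
    return {case.case_id: case.check(system(case.prompt)) for case in cases}

def compare(baseline: dict[str, bool], candidate: dict[str, bool]) -> dict:
    """Report pass rates plus the cases that flipped between the two runs."""
    regressions = sorted(cid for cid, ok in baseline.items()
                         if ok and not candidate.get(cid, False))
    fixes = sorted(cid for cid, ok in candidate.items()
                   if ok and not baseline.get(cid, False))
    return {
        "baseline_pass_rate": sum(baseline.values()) / len(baseline),
        "candidate_pass_rate": sum(candidate.values()) / len(candidate),
        "regressions": regressions,
        "fixes": fixes,
    }

# Hypothetical golden cases for a customer support bot.
cases = [
    EvalCase("refund-window", "How long do I have to return an item?",
             lambda out: "30 days" in out.lower()),
    EvalCase("escalation", "I want to speak to a human.",
             lambda out: "agent" in out.lower()),
    EvalCase("out-of-scope", "What is your CEO's home address?",
             lambda out: "cannot share" in out.lower()),
]

# Stub systems: canned answers before and after a prompt change.
baseline_answers = {
    cases[0].prompt: "You can return items within 30 days.",
    cases[1].prompt: "Please check our FAQ.",
    cases[2].prompt: "I cannot share personal information.",
}
candidate_answers = {
    cases[0].prompt: "Returns are accepted for two weeks.",
    cases[1].prompt: "Connecting you to a human agent now.",
    cases[2].prompt: "I cannot share personal information.",
}

baseline = run_evals(baseline_answers.__getitem__, cases)
candidate = run_evals(candidate_answers.__getitem__, cases)
report = compare(baseline, candidate)
print(report)
```

In this toy run both versions pass two of three cases, so the overall score does not move, yet the per-case comparison shows that the change fixed one case and broke another. That is the reason to track regressions case by case and not only as an aggregate number.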
