LLM-as-Judge
What is LLM-as-Judge?
LLM-as-Judge is an evaluation strategy where you use a second LLM to evaluate the output of your primary LLM. Instead of relying solely on human judgment or code-based checks, you send your LLM's output to another LLM along with evaluation criteria, and ask it to score or judge the output along specific quality dimensions.
This approach is particularly useful for evaluating subjective qualities that would be difficult to codify in traditional code—such as whether questions are leading, whether responses are appropriate in tone, or whether content meets nuanced quality criteria.
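To make the pattern concrete, here is a minimal sketch in Python. It assumes a hypothetical `call_llm` helper that wraps whatever model API you use, and the criteria and JSON score format shown here are purely illustrative.

```python
# Minimal LLM-as-Judge sketch. `call_llm` is a hypothetical helper that
# wraps whichever model API you use and returns the model's text reply.
import json

def call_llm(prompt: str) -> str:
    """Placeholder for your model client; swap in a real API call."""
    raise NotImplementedError

JUDGE_PROMPT = """You are evaluating the output of a coaching assistant.

Criteria: the response should be appropriate in tone and should not ask
leading questions.

Respond with JSON: {{"score": 1-5, "reason": "<one sentence>"}}

Output to evaluate:
{output}
"""

def judge_output(output: str) -> dict:
    """Send the primary LLM's output plus criteria to a judge LLM and parse its verdict."""
    reply = call_llm(JUDGE_PROMPT.format(output=output))
    return json.loads(reply)  # assumes the judge returns valid JSON
```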
How does LLM-as-Judge work in practice?
Implementing LLM-as-Judge requires careful attention to context and prompt design. You create a prompt that explains the evaluation criteria to the judge LLM, then send it specific content to evaluate. For example, rather than sending an entire coaching response and asking "Are there any leading questions?", effective implementations extract individual questions and evaluate each one separately.
The effectiveness of LLM-as-Judge depends on controlling what context you provide and iterating on the prompt. Teams typically experiment with different approaches—evaluating all items at once versus one at a time, providing more or less context, adjusting the evaluation criteria—to find what works best for their specific use case.
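As a rough illustration of the one-item-at-a-time approach, the sketch below pulls individual questions out of a response and asks the judge about each one separately. The `extract_questions` heuristic, the prompt wording, and the `call_llm` placeholder are all assumptions for the example, not the actual Interview Coach implementation.

```python
import re

def call_llm(prompt: str) -> str:
    """Placeholder for your model API wrapper; swap in a real client."""
    raise NotImplementedError

LEADING_QUESTION_PROMPT = """A coaching question is "leading" if it suggests
the answer the coach wants to hear.

Is the following question leading? Answer only YES or NO.

Question: {question}
"""

def extract_questions(response: str) -> list[str]:
    # Naive heuristic: treat any sentence ending in "?" as a question.
    return [q.strip() for q in re.findall(r"[^.?!]*\?", response)]

def judge_questions(response: str) -> list[dict]:
    # Judge each extracted question on its own rather than the whole response.
    results = []
    for question in extract_questions(response):
        verdict = call_llm(LEADING_QUESTION_PROMPT.format(question=question))
        results.append({
            "question": question,
            "leading": verdict.strip().upper().startswith("YES"),
        })
    return results
```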
How do you validate LLM-as-Judge evals?
Since you're using one LLM to judge another, teams face a natural question: How do you know the judge is accurate? This is where human graders become essential. Teams compare the judge LLM's evaluations against human evaluations using the same traces, measuring true positives, false positives, and false negatives.
This validation helps teams understand whether their LLM-as-Judge eval correctly identifies good and bad outputs. Without this human validation, teams risk building on flawed judgments: a circular setup in which they can't be sure any evaluation is accurate.
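As a simple sketch of that comparison, the function below takes paired human and judge labels for the same traces and reports true positives, false positives, false negatives, and the resulting precision and recall. The data format is an assumption made for illustration.

```python
# Sketch of validating the judge against human graders. Each pair is
# (human_says_bad, judge_says_bad) for the same trace.

def validate_judge(labels: list[tuple[bool, bool]]) -> dict:
    """Compare judge verdicts to human verdicts on the same traces."""
    tp = sum(1 for human, judge in labels if human and judge)
    fp = sum(1 for human, judge in labels if not human and judge)
    fn = sum(1 for human, judge in labels if human and not judge)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "true_positives": tp,
        "false_positives": fp,
        "false_negatives": fn,
        "precision": precision,
        "recall": recall,
    }

# Example: four traces graded by both a human and the judge LLM.
print(validate_judge([(True, True), (True, False), (False, False), (False, True)]))
# -> 1 TP, 1 FP, 1 FN, precision 0.5, recall 0.5
```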
Learn more:
- Building My First AI Product: 6 Lessons from My 90-Day Deep Dive
- How I Designed & Implemented Evals for Product Talk's Interview Coach
Related terms: