Code-Based Eval
What is a code-based eval?
A code-based eval is an evaluation strategy that uses traditional deterministic code to assess the quality of an LLM's response. Unlike LLM-as-judge evals, which use another AI model to evaluate output, code-based evals use programmatic logic to check for specific patterns, keywords, or structural requirements.
Code-based evals are one of three common eval strategies, alongside golden datasets and LLM-as-judge evals. They're also sometimes called code assertions or code-based assertions.
What do code-based evals check for?
Code-based evals can verify many different aspects of LLM output:
- Format validation: Checking if the LLM is returning valid JSON or other structured data
- Keyword detection: Looking for red-flag words that indicate problems (like "typically," "usually," or "generally" in contexts where specific examples are needed)
- Pattern matching: Verifying that responses follow expected patterns or rules
For example, if your AI coach should suggest specific questions rather than general ones, a code-based eval might check for the presence of words like "typical" or "usual" and fail the test if it finds them.
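The keyword check above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the word list and the function name are assumptions, and a real eval would likely cover more hedge words and context-specific rules.

```python
import re

# Hypothetical red-flag word list: generic hedging terms that suggest
# the coach asked a general question instead of a specific one.
HEDGE_WORDS = ["typical", "usual", "generally"]

def check_for_hedge_words(response: str) -> dict:
    """Return a pass/fail result plus any flagged words found.

    The regex matches each word and its variants (e.g. "typically"),
    case-insensitively, on word boundaries.
    """
    found = [
        word for word in HEDGE_WORDS
        if re.search(rf"\b{word}\w*\b", response, re.IGNORECASE)
    ]
    return {"passed": not found, "flagged_words": found}
```

A question like "What does a typical day look like?" would fail this check, while "Tell me about yesterday morning" would pass, because the eval is looking for signals that the question invites a generalization rather than a specific story.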
Why use code-based evals instead of LLM-as-judge?
Code-based evals are particularly useful for catching consistent, rule-based errors that can be reliably detected through code. They're deterministic—they'll always produce the same result for the same input—which makes them predictable and easy to debug.
They're also especially important in workflows where a subsequent step needs to parse the LLM's output. For instance, if your next step expects JSON data, a code-based eval can verify the format is correct before it reaches that downstream process.
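A format check like this is straightforward to write with standard-library JSON parsing. The required keys below are an assumed schema for illustration; substitute whatever fields your downstream step actually expects.

```python
import json

# Assumed schema for illustration: the downstream step expects a JSON
# object with these two fields.
REQUIRED_KEYS = {"question", "rationale"}

def check_json_output(response: str) -> bool:
    """Pass only if the response parses as JSON and has the expected keys."""
    try:
        data = json.loads(response)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS <= data.keys()
```

Running this check before the downstream step means a malformed response fails fast in the eval rather than causing a confusing parse error later in the workflow.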
Code-based evals complement LLM-as-judge evals as part of a comprehensive evaluation strategy. Use code-based evals for clear, rule-based checks and LLM-as-judge for more nuanced quality assessments.
Learn more:
- Behind the Scenes: Building the Product Talk Interview Coach
- Building My First AI Product: 6 Lessons from My 90-Day Deep Dive
- AI Evals & Discovery - All Things Product Podcast with Teresa Torres & Petra Wille