AI Evals
What are AI evals?
AI evals (short for evaluations) are methods for measuring whether an AI product or feature is performing well, analogous to unit and integration testing for deterministic code. Just as unit tests verify that individual functions work correctly and integration tests ensure that components work together, evals give teams confidence that their AI applications are doing what they expect them to do, helping them maintain quality and catch issues before they reach users.
What are the three types of AI evals?
There are three common eval strategies teams use to measure AI performance (see the sketch after this list):
- Golden datasets — Creating datasets with a wide range of expected user inputs and clearly defined desired outputs. Teams run these inputs against their AI system and compare the actual output to the expected output.
- Code-based assertions — Using traditional deterministic code to evaluate the quality of LLM responses. A common example is checking whether the LLM returns valid JSON, particularly important when subsequent steps need to parse the output.
- LLM-as-Judge — Having a second LLM evaluate the output of the first LLM. For example, one LLM might assess whether an AI coach suggested any leading questions in its response.
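To make the three strategies concrete, here is a minimal Python sketch. The dataset entries, prompt wording, and function names are illustrative assumptions, and `call_llm` is a placeholder for whatever model client you actually use; this is not a prescribed implementation.

```python
import json

# Golden dataset: expected user inputs paired with clearly defined desired outputs
# (hypothetical entries for an interview-coach-style feature).
GOLDEN_DATASET = [
    {
        "input": "I want to interview customers about onboarding.",
        "expected": {"question_type": "open-ended", "is_leading": False},
    },
]


def check_valid_json(raw_response: str) -> bool:
    """Code-based assertion: the LLM response must be parseable JSON."""
    try:
        json.loads(raw_response)
        return True
    except json.JSONDecodeError:
        return False


def check_against_golden(raw_response: str, expected: dict) -> bool:
    """Golden-dataset check: compare the parsed output to the expected output."""
    if not check_valid_json(raw_response):
        return False
    return json.loads(raw_response) == expected


JUDGE_PROMPT = """You are grading an AI interview coach.
Did the coach's response below suggest any leading questions?
Answer with a single word: PASS (no leading questions) or FAIL.

Coach response:
{response}
"""


def judge_with_llm(call_llm, coach_response: str) -> bool:
    """LLM-as-Judge: a second model grades the first model's output.

    `call_llm` stands in for your own client function that takes a prompt
    string and returns the model's text response.
    """
    verdict = call_llm(JUDGE_PROMPT.format(response=coach_response))
    return verdict.strip().upper().startswith("PASS")
```

In practice a team would run every input in the golden dataset through the AI system, apply the code-based assertions to each response, and reserve the LLM-as-Judge check for qualities (like "no leading questions") that deterministic code can't easily detect.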
How should teams validate their evals?
In all three eval strategies, human graders should be used to evaluate how well the evals themselves perform. This meta-evaluation ensures that the automated testing is actually measuring what matters. Without human validation, teams can't be certain their evals are catching the right issues or properly assessing AI output quality.
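One simple way to ground this meta-evaluation is to track how often the automated eval agrees with human graders on the same set of responses. The sketch below assumes you have collected boolean pass/fail verdicts from both the eval and the humans; the function name and sample numbers are illustrative.

```python
def eval_agreement(auto_verdicts: list[bool], human_labels: list[bool]) -> float:
    """Fraction of graded cases where the automated eval matches the human grader.

    A low agreement rate suggests the eval itself needs work before you trust it.
    """
    assert len(auto_verdicts) == len(human_labels), "one human label per graded case"
    matches = sum(a == h for a, h in zip(auto_verdicts, human_labels))
    return matches / len(human_labels)


# Example: the LLM judge agreed with human graders on 4 of 5 responses.
print(eval_agreement([True, True, False, True, False],
                     [True, True, False, False, False]))  # 0.8
```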
Learn more:
- Building My First AI Product: 6 Lessons from My 90-Day Deep Dive
- How I Designed & Implemented Evals for Product Talk's Interview Coach
- Behind the Scenes: Building the Product Talk Interview Coach