Test Set

What is a test set?

A test set is a collection of data examples reserved specifically for evaluating the performance of an AI system after development is complete. Unlike the dev set—used during active development for iteration—the test set provides an unbiased measure of how well the system will perform on unseen data.

Best practice is to keep the test set larger than the dev set and to protect it from exposure during development, so the system is never tuned to its specific examples.
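For illustration, here is a minimal sketch of reserving a held-out test set, assuming the evaluation examples live in a JSON file (the file name examples.json and the make_splits helper are hypothetical):

```python
import json
import random

def make_splits(path: str, dev_size: int = 50, seed: int = 0):
    """Shuffle evaluation examples once and reserve everything beyond
    the small dev slice as a held-out test set."""
    with open(path) as f:
        examples = json.load(f)            # assumed: a JSON list of eval examples

    random.Random(seed).shuffle(examples)  # fixed seed keeps the split reproducible

    dev_set = examples[:dev_size]          # small set for rapid iteration
    test_set = examples[dev_size:]         # larger set, reserved for final evals only
    return dev_set, test_set

if __name__ == "__main__":
    dev, test = make_splits("examples.json")
    # Store the splits separately so the test set is never opened during development.
    with open("dev_set.json", "w") as f:
        json.dump(dev, f, indent=2)
    with open("test_set.json", "w") as f:
        json.dump(test, f, indent=2)
```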

Why keep a separate test set and dev set?

The separation between the dev set and the test set serves a critical purpose in AI development:

During development, teams iterate rapidly on a small dev set. They run evals frequently, adjust prompts, tune parameters, and test new approaches. This repeated exposure means the team naturally optimizes for the specific examples in the dev set.

Before release, teams run evals on the larger test set to confirm that improvements made during development actually generalize to new examples. The test set acts as a proxy for real-world performance because the system hasn't been specifically tuned for these examples.

This separation catches a common problem: improvements that work on the dev set but fail to generalize to broader use cases.
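A rough sketch of that workflow, with toy data and a placeholder system standing in for a real eval harness (the run_eval helper and the example records below are assumptions for illustration):

```python
from typing import Callable

def run_eval(system: Callable[[str], str], examples: list[dict]) -> float:
    """Score a system on a list of {"input", "expected"} examples and
    return the fraction it answers correctly (exact match, for simplicity)."""
    passed = sum(1 for ex in examples if system(ex["input"]) == ex["expected"])
    return passed / len(examples)

# Toy stand-ins for real eval data (assumptions for this sketch).
dev_set = [{"input": "2+2", "expected": "4"}, {"input": "3+5", "expected": "8"}]
test_set = [{"input": "7+1", "expected": "8"}, {"input": "9+4", "expected": "13"}]

def candidate_v1(prompt: str) -> str:
    return str(eval(prompt))  # placeholder "system" standing in for a real model

# During development: run evals freely and frequently against the small dev set.
print("dev score:", run_eval(candidate_v1, dev_set))

# Before release: one final eval on the held-out test set to confirm that
# dev-set improvements generalize to examples the team never tuned on.
print("test score:", run_eval(candidate_v1, test_set))
```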

What makes a good test set?

A robust test set has several characteristics:

Larger than the dev set. While dev sets might contain 25-50 examples for fast iteration, test sets should include hundreds or more examples to provide reliable performance estimates.

Representative of real-world usage. The test set should cover the full range of scenarios, edge cases, and input variations that the system will encounter in production.

Protected from contamination. Teams should use the test set only for final validation, not for iterative development decisions, to maintain its value as an unbiased evaluation tool.
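To make the size guidance concrete, a quick back-of-the-envelope calculation shows how the uncertainty in a measured pass rate shrinks as the number of examples grows, under the simplifying assumptions that examples are independent and scored pass/fail:

```python
import math

def margin_of_error(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width for a pass rate measured on
    n independent examples (normal approximation to the binomial)."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)

# An 80% pass rate measured on a 50-example dev set vs. a 500-example test set.
print(f"n=50:  ±{margin_of_error(0.8, 50):.1%}")   # roughly ±11%
print(f"n=500: ±{margin_of_error(0.8, 500):.1%}")  # roughly ±3.5%
```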
