
Designing an evaluation set your team will actually run

Mira Halton·Apr 28, 2026·7 min read

Most teams building agents test with vibes: someone types five questions, the agent answers them, and the change ships. A week later something regresses, no one notices, and a customer escalates a topic the agent should have nailed.

An evaluation set fixes that, but most eval suites die in the first month — too long, too academic, too far from real conversations. Here’s the smallest version that actually gets run, drawn from how our team and dozens of Knowiz customers build and run theirs.

1. Start with thirty real conversations

Pull the last thirty conversations your agent handled. Skip the hand-picked highlight reel and the hand-picked failures — pick a representative slice. The shape of the eval set should match the shape of production.

For each conversation, write down: the customer’s actual goal (not what they typed), the answer the agent gave, and a one-line judgment of whether it served the goal. That third column is the eval signal.
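If it helps to make the rows concrete, here is one minimal sketch of that layout as a Python dataclass. The field names are illustrative, not a Knowiz schema; any flat format your whole team can read works just as well.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalRow:
    """One row per real conversation pulled from production."""
    conversation_id: str
    customer_goal: str               # what they actually wanted, not what they typed
    agent_answer: str                # the answer the agent gave
    served_goal: bool                # did the answer serve the goal?
    judgment: str                    # the one-line judgment behind that bool
    failure_mode: Optional[str] = None  # tagged in step 2, only when served_goal is False
```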

2. Tag for failure modes, not topics

Topics (“billing”, “shipping”) tell you nothing about how the agent failed. Failure modes do: missed_intent, wrong_citation, over_escalation, policy_violation.

Limit yourself to six modes. More than that and the labels stop being meaningful; fewer and you can’t see patterns. The right six depend on what your agent actually does — list yours, not ours.
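As a sketch, the tag set can be as small as an enum: the four modes above plus two of your own. The specific values are exactly the part you should replace.

```python
from enum import Enum

class FailureMode(str, Enum):
    MISSED_INTENT = "missed_intent"
    WRONG_CITATION = "wrong_citation"
    OVER_ESCALATION = "over_escalation"
    POLICY_VIOLATION = "policy_violation"
    # ...two more, drawn from your own production failures rather than ours
```

Keeping the tags in code rather than as free text in a spreadsheet is what makes the counts in step 3 trustworthy.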

3. Run before every prompt change

The eval set is only useful if you run it. Make it cheap: a single command, results in under five minutes, and a table of row counts per failure mode with a diff against the last run.
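A rough sketch of what that runner can look like, assuming the row layout from step 1 stored as JSONL. run_agent() and judge() are placeholders for your own agent call and pass/fail check, and the file names are ours:

```python
import json
from collections import Counter
from pathlib import Path

def run_agent(goal: str) -> str:
    return ""  # placeholder: call your agent here

def judge(row: dict, answer: str) -> str | None:
    return None  # placeholder: return a failure-mode tag, or None if the answer served the goal

def run_evals(eval_path: str = "eval_set.jsonl", last_path: str = "last_run.json") -> None:
    counts: Counter = Counter()
    for line in Path(eval_path).read_text(encoding="utf-8").splitlines():
        row = json.loads(line)
        mode = judge(row, run_agent(row["customer_goal"]))
        if mode:
            counts[mode] += 1

    # Diff against the previous run, if there was one.
    last = Counter(json.loads(Path(last_path).read_text())) if Path(last_path).exists() else Counter()

    print(f"{'failure mode':<20}{'count':>6}{'diff':>6}")
    for mode in sorted(set(counts) | set(last)):
        print(f"{mode:<20}{counts[mode]:>6}{counts[mode] - last[mode]:>+6}")

    Path(last_path).write_text(json.dumps(counts))
```

Wire it into whatever gate you already have on prompt changes, so that skipping the run takes more effort than doing it.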


4. Grow it from misses, not from imagination

When a real customer hits a failure mode you didn’t have a row for, add it. When you stop seeing new failure modes from production, your eval set is approximately complete.

Resist the urge to invent edge cases. Every speculative row dilutes signal from the rows that came from real customers.
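Mechanically, adding a row from a miss should cost nothing. With the layout above it is a single append, something like:

```python
import json

def add_row_from_miss(eval_path: str, row: dict) -> None:
    """Append one production miss to the eval set (row follows the step-1 layout)."""
    with open(eval_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(row) + "\n")
```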

What we learned the hard way

Our first eval set had 240 rows and took 90 minutes to run. We ran it once. The version we use now is 38 rows, runs in 4 minutes, and is the gating check on every prompt change. It catches more regressions than the 240-row version ever did.

The shape of evaluation depends on what you’re evaluating, but the meta-rule generalizes: the eval set you actually run beats the perfect one you don’t.

Build agents on a foundation you trust.

Knowiz comes with eval suites, audit logs, and a way to ship agents that get better week over week.