Evaluation
Why evaluate
An AI that works on easy calls can still fail where it matters. Evaluation finds those failures before customers do.
Build an adversarial suite
Most test cases should be hard: ambiguous requests, callers pushing for more than they're owed, and unexpected turns mid-call.
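One way to keep the suite weighted toward hard cases is to tag each case with a category and check the mix. The sketch below is illustrative only: the EvalCase type, the category labels, the sample cases, and the hard_share helper are all hypothetical names, not part of any particular product.

```python
from dataclasses import dataclass

# Hypothetical labels for the three kinds of hard case named above.
HARD_CATEGORIES = {"ambiguous", "overreach", "mid_call_turn"}


@dataclass(frozen=True)
class EvalCase:
    name: str
    category: str      # one of HARD_CATEGORIES, or "routine"
    opening_line: str  # what the simulated caller says first


def hard_share(cases):
    """Return the fraction of cases whose category counts as hard."""
    if not cases:
        return 0.0
    hard = sum(1 for case in cases if case.category in HARD_CATEGORIES)
    return hard / len(cases)


# Made-up example cases, for illustration only.
SUITE = [
    EvalCase("vague_billing", "ambiguous", "There's something wrong with my account."),
    EvalCase("refund_outside_policy", "overreach", "I want a full refund for last year."),
    EvalCase("topic_switch", "mid_call_turn", "Actually, forget that, cancel my plan instead."),
    EvalCase("balance_check", "routine", "What's my current balance?"),
]

print(hard_share(SUITE))  # 0.75
```

A check such as hard_share(SUITE) > 0.5 would then fail if routine cases start to crowd out the hard ones.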
What to measure
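Track resolution, escalation, and re-contact together. Any one alone can be gamed: for example, a system can raise its resolution rate by closing calls too early, or lower its escalation rate by holding on to calls it should hand off. Re-contact exposes both, so read the three side by side.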
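As a rough sketch of reporting the three numbers side by side, the code below computes each rate from per-call records. The CallOutcome fields and the summarize function are assumptions made for illustration, as is the extra "resolved_and_stayed" figure (calls marked resolved with no later re-contact), which is one possible way to spot a resolution rate that looks better than it is.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallOutcome:
    resolved: bool     # the AI marked the issue as resolved
    escalated: bool    # the call was handed to a human
    recontacted: bool  # the caller came back about the same issue


def summarize(outcomes):
    """Return all three rates together, plus a combined figure."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("no outcomes to summarize")
    return {
        "resolution": sum(o.resolved for o in outcomes) / n,
        "escalation": sum(o.escalated for o in outcomes) / n,
        "recontact": sum(o.recontacted for o in outcomes) / n,
        # Hypothetical derived figure: resolved and never re-contacted.
        "resolved_and_stayed": sum(o.resolved and not o.recontacted for o in outcomes) / n,
    }


# Made-up sample: three of four calls were marked resolved,
# but two of those callers came back.
sample = [
    CallOutcome(resolved=True, escalated=False, recontacted=False),
    CallOutcome(resolved=True, escalated=False, recontacted=True),
    CallOutcome(resolved=False, escalated=True, recontacted=False),
    CallOutcome(resolved=True, escalated=False, recontacted=True),
]

report = summarize(sample)
print(report["resolution"], report["escalation"], report["recontact"])  # 0.75 0.25 0.5
```

In this made-up sample the resolution rate of 0.75 looks healthy on its own, while the re-contact rate of 0.5 shows that only one call in four was resolved and stayed resolved.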
Re-run on every change
Treat the evaluation suite like a software test suite: run it whenever you change knowledge, actions, or approved patterns.
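One possible way to enforce this is to fingerprint the three inputs and re-run the suite whenever the fingerprint changes. This is a minimal sketch that assumes knowledge, actions, and approved patterns can each be represented as JSON-serializable data; the fingerprint and needs_rerun functions are hypothetical names.

```python
import hashlib
import json


def fingerprint(knowledge, actions, approved_patterns):
    """Hash the three inputs so that any change produces a different value."""
    payload = json.dumps(
        {
            "knowledge": knowledge,
            "actions": actions,
            "approved_patterns": approved_patterns,
        },
        sort_keys=True,  # stable key order, so equal content hashes equally
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_rerun(last_fingerprint, knowledge, actions, approved_patterns):
    """True when the inputs differ from those behind the last evaluation run."""
    return fingerprint(knowledge, actions, approved_patterns) != last_fingerprint


# Made-up inputs, for illustration only.
last = fingerprint(["refund policy v1"], ["issue_refund"], ["apologize, then verify"])
unchanged = needs_rerun(last, ["refund policy v1"], ["issue_refund"], ["apologize, then verify"])
changed = needs_rerun(last, ["refund policy v2"], ["issue_refund"], ["apologize, then verify"])
print(unchanged, changed)  # False True
```

In practice the same effect can come from wiring the suite into whatever pipeline already deploys changes, so that no edit to these inputs ships without a fresh run.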