CSAT was designed to measure how a customer felt about a human interaction. It tells you almost nothing about whether an AI agent did its job.
At a recent industry panel on autonomous customer support in insurance, the discussion turned to measurement. The consensus was clear: most of the metrics carriers use to evaluate AI-led support were inherited from the human agent era, and they're measuring the wrong things.
This matters because the metrics you choose determine the behavior you optimize for. Measure the wrong things, and you'll build a system that looks good on a dashboard but doesn't actually resolve customer issues.
Every AI vendor reports first response time. It's always under ten seconds, usually under five. For AI-led interactions, this metric is no longer interesting.
When human agents handled calls, first response time mattered because it correlated with staffing levels, queue management, and customer wait frustration. With AI, the response is instant by default. Reporting it is like reporting that a website loads - it's a baseline expectation, not a performance indicator.
Carriers that still track first response time as a key metric are measuring a constraint that no longer exists.
CSAT scores for AI interactions are shaped by factors that have nothing to do with the AI's performance. A customer who is upset about a premium increase will give a low CSAT score regardless of how well the AI handled the interaction. A customer whose claim was denied will score the experience low even if the AI delivered the denial accurately, with the right regulatory language, and with a clear explanation of next steps.
CSAT measures the customer's emotional state after the interaction. It doesn't measure whether the AI executed the workflow correctly, used the right knowledge source, or arrived at the right resolution. Those are different questions.
This doesn't mean CSAT is useless. It still captures sentiment trends and can flag systemic issues. But using it as the primary metric for AI agent quality is like evaluating a surgeon by patient mood scores - it tells you something, but not the thing that matters most.
Containment rate measures how many interactions the AI handled without escalating to a human. On the surface, higher containment looks like better performance. In practice, it often masks the opposite.
An AI that deflects a customer to a FAQ link has "contained" the interaction. An AI that sends a customer in circles through three menu options before they hang up has "contained" the interaction. An AI that gives a confident but incorrect answer about coverage has "contained" the interaction.
Containment tells you whether the AI kept the customer away from a human. It doesn't tell you whether the customer's problem was actually solved. That distinction - between containment and resolution - is the single most important diagnostic in evaluating any AI support system.
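To make the distinction concrete, here's a minimal sketch in Python. The field names (`escalated_to_human`, `workflow_completed`, and so on) are illustrative assumptions, not a real schema; the resolution criteria mirror the ones used later in this piece - workflow completed end-to-end, system of record updated, no human re-entry required.

```python
from dataclasses import dataclass

@dataclass
class Interaction:
    escalated_to_human: bool        # did the customer reach a human agent?
    workflow_completed: bool        # did the requested workflow execute end-to-end?
    system_of_record_updated: bool  # was the policy/claim system actually changed?
    human_reentry_required: bool    # did a human later have to redo or finish the work?

def containment_rate(log: list[Interaction]) -> float:
    # Containment only asks: did the AI keep the customer away from a human?
    return sum(not i.escalated_to_human for i in log) / len(log)

def resolution_rate(log: list[Interaction]) -> float:
    # Resolution asks whether the problem was actually solved: workflow done
    # end-to-end, system of record updated, no human re-entry required.
    return sum(
        i.workflow_completed
        and i.system_of_record_updated
        and not i.human_reentry_required
        for i in log
    ) / len(log)
```

An interaction that dead-ends at a FAQ link or delivers a confident wrong answer raises the first number and does nothing for the second.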
The alternative is a metric built from the ground up for AI-led interactions. We call it the agent score, and it measures execution quality across four dimensions: resolution, knowledge selection, data accuracy, and classification. Each is described below, followed by a sketch of how they combine.
Resolution: Did the customer actually get what they needed? Not "was the interaction contained" - was the issue resolved? If a policyholder called to add a vehicle to their auto policy, is that vehicle now on the policy? If a broker requested a certificate of insurance, was the COI issued correctly? The outcome is binary and verifiable.
Knowledge selection: Did the AI pull the right information to answer the question? In insurance, this matters enormously. A general response about claims timelines is different from the specific timeline that applies in the customer's state, under their policy type, for their claim category. The AI needs to select the right knowledge source, and we need to measure whether it did.
Data accuracy: Did the AI use the correct data to reach its conclusion? If it's pulling policy details, claim history, or account status, is that data current and correctly attributed? A confident answer built on stale or mismatched data is worse than no answer at all.
Classification: Did the AI correctly classify the customer's intent, the workflow required, and the resolution path? Misclassification is the silent failure mode in AI support. The customer asks a coverage question and the AI routes it as a billing inquiry. The interaction might still get "resolved" from a containment standpoint, but the customer didn't get what they needed.
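Put together, a per-interaction agent score might be assembled along these lines. This is a minimal sketch: the record shape, field names, and equal weighting across the four dimensions are assumptions for illustration, not Notch's actual scoring model.

```python
from dataclasses import dataclass

@dataclass
class AgentScore:
    resolved: bool              # resolution: the issue is verifiably fixed
    knowledge_correct: bool     # knowledge selection: right source for state/policy/claim
    data_correct: bool          # data accuracy: current, correctly attributed records
    classified_correctly: bool  # classification: intent, workflow, and path all matched

    def value(self) -> float:
        # Equal weighting is an assumption; a production system would
        # calibrate these weights against outcomes.
        checks = (self.resolved, self.knowledge_correct,
                  self.data_correct, self.classified_correctly)
        return sum(checks) / len(checks)
```

The useful property is that every dimension is checkable per interaction, so a low aggregate score can be traced to the specific failure - wrong source, stale data, or misrouted intent - rather than a vague sense that quality dipped.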
When you measure CSAT and containment, you optimize for keeping customers in the AI channel and making them feel OK about it. The system gets better at sounding helpful.
When you measure agent score, you optimize for getting the right answer, from the right source, using the right data, through the right workflow. The system gets better at actually being helpful.
The difference matters at scale. In production, Notch agents process 20M+ conversations with 70-73% autonomous resolution. That resolution rate isn't a containment number - it's measured against whether the workflow completed end-to-end, the system of record was updated, and no human re-entry was required. The 15-20% CSAT improvement over human agents is a secondary effect. The primary metric is whether the work was done correctly.
There's a secondary benefit to centralized AI measurement that most carriers haven't considered: fraud pattern detection.
When different human agents handle each interaction independently, fraud patterns that span multiple interactions are hard to spot. The adjuster handling claim A doesn't know about the suspicious similarity to claim B that a different adjuster reviewed last week.
When a single AI layer handles first-touch interactions consistently, it becomes a centralized observation point. The system can detect if someone is attempting the same approach across multiple channels, if claim narratives share suspicious similarities, or if interaction patterns match known fraud signatures. This isn't replacing fraud investigation teams - it's giving them a signal layer they couldn't have when every interaction was handled by a different person.
That centralized visibility is a direct consequence of measuring and logging AI execution quality at the interaction level. The agent score framework doesn't just tell you how well the AI performed. It creates the data layer that makes pattern detection possible.
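As a toy illustration of that signal layer, here's a sketch that flags claim-narrative pairs with suspicious textual overlap. The string-similarity measure and the 0.85 threshold are assumptions; a production system would correlate embeddings, metadata, and cross-channel behavior rather than raw text ratios.

```python
from difflib import SequenceMatcher
from itertools import combinations

def flag_similar_narratives(narratives: dict[str, str],
                            threshold: float = 0.85) -> list[tuple[str, str, float]]:
    # Pairwise comparison is only possible because every first-touch
    # interaction flows through one logged layer instead of being split
    # across unconnected human agents.
    flags = []
    for (id_a, text_a), (id_b, text_b) in combinations(narratives.items(), 2):
        ratio = SequenceMatcher(None, text_a, text_b).ratio()
        if ratio >= threshold:
            flags.append((id_a, id_b, round(ratio, 3)))
    return flags
```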
If you're evaluating AI for insurance customer operations, ask your vendor - or your internal team - three questions:
1. How do you define resolution? Is it containment (customer didn't reach a human) or completion (the workflow executed end-to-end and the system of record was updated)?
2. Can you show me the breakdown of knowledge selection accuracy, data accuracy, and classification correctness for a sample of interactions? Not aggregate scores - individual interaction-level traceability (sketched after this list).
3. What happens when the AI gets it wrong? Is there a governed feedback loop that identifies the error, corrects the system, and prevents recurrence - without the model drifting on its own?
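On question 2, interaction-level traceability might look something like the record below. Every field name here is hypothetical - the point is that each dimension of the agent score should be answerable for any single interaction, not just in aggregate.

```python
# Hypothetical per-interaction trace - illustrative field names, not any
# vendor's actual logging schema.
trace = {
    "interaction_id": "int-000123",
    "intent_classified": "add_vehicle",          # what the AI decided the customer wanted
    "intent_actual": "add_vehicle",              # ground truth from audit or outcome review
    "knowledge_source": "auto_endorsements/TX",  # which source the answer came from
    "data_snapshot_current": True,               # policy data was fresh at answer time
    "workflow_completed": True,
    "system_of_record_updated": True,
    "escalated_to_human": False,
}
```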
The metrics you use to evaluate AI agents determine the kind of AI agent you end up with. Choose the ones that measure whether the customer's problem was actually solved.
See how Notch measures agent execution quality across 20M+ conversations - book a demo.