Most QA tools can tell you what changed. Almost none can tell you why, or how to fix it.
Your CSAT dropped 8 points last week. Something is clearly wrong, but there's no obvious culprit. So you start the manual slog: sample 50 tickets, read through conversations, try to spot patterns. Three hours later, you think maybe it's the new refund policy. Or the updated agent training. Or that product bug engineering hasn't prioritised yet.
Meanwhile, the undiagnosed problem that tanked your CSAT last week is still hitting customers this week.
This is the QA tool gap. The entire category has spent a decade getting better at measurement, building prettier scorecards and slicker dashboards, while the actual problem - figuring out why something broke and how to fix it - stays manual.
Your sample size is lying to you
Here's a statistical reality that most QA vendors would prefer you not think about too hard. Most support teams sample 2-5% of tickets for quality review. If your real CSAT is 85% and your weekly sample works out to 50 tickets, that sample could show you anywhere from 75% to 95% depending on which tickets you randomly pulled - the margin of error depends on the absolute number of tickets reviewed, not the percentage. The confidence interval is so wide that you're basically reading tea leaves with a calibrated teacup.
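You can check that spread yourself with a quick simulation. This is a minimal sketch assuming a true CSAT of 85% and a 50-ticket weekly review (the figures are illustrative, not from any vendor's data):

```python
import random

random.seed(42)
TRUE_CSAT = 0.85   # the "real" satisfaction rate across all tickets
SAMPLE_SIZE = 50   # tickets actually reviewed per week
TRIALS = 10_000    # simulated review weeks

# For each simulated week, measure CSAT from a random 50-ticket sample.
observed = sorted(
    sum(random.random() < TRUE_CSAT for _ in range(SAMPLE_SIZE)) / SAMPLE_SIZE
    for _ in range(TRIALS)
)

# The middle 95% of outcomes: what your dashboard could plausibly show
# even though nothing about the underlying quality changed.
low = observed[int(TRIALS * 0.025)]
high = observed[int(TRIALS * 0.975)]
print(f"95% of sampled weeks land between {low:.0%} and {high:.0%}")
```

Run it and the spread covers roughly twenty percentage points - more than double the 8-point drop that triggered the investigation in the first place.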
But this isn't really a statistical problem. It's an operational one. By the time you've sampled enough tickets to understand why metrics moved, you're investigating last week's issue while this week's issue spreads unchecked. Legacy QA tools were built for a world where human agents worked slowly enough that sampling could catch problems before they became widespread. That world is gone.
AI agents can introduce problems at scale instantly. One bad update to your knowledge base and suddenly thousands of customers get wrong answers before your morning standup. Your 2-5% sample won't catch it in time, and when it does catch something, you'll spend another three hours figuring out whether the problem was the knowledge base update, a model behaviour change, or a policy the AI is interpreting differently than you intended.
The QA tool landscape
Let's be specific about what's actually available. This isn't exhaustive, but it covers the major categories and players you'll encounter when evaluating QA tooling today.
Sample-based QA platforms
These make manual sampling workflows more efficient with better scorecards, cleaner interfaces, team performance dashboards, and collaboration features. They are, fundamentally, a nicer way to do the thing that doesn't scale.
Zendesk QA (formerly Klaus, acquired 2024). Native integration within Zendesk, which is both its strength and its constraint. If you're on Zendesk and want QA that lives inside the same ecosystem, it's the path of least resistance. But it inherits Zendesk's broader limitation: the platform was architected for human agent workflows, and bolting AI-aware QA onto that foundation shows the seams. Sample-based with manual review. No automated root cause diagnosis. No AI-specific failure detection for drift, hallucination, or compliance gaps.
MaestroQA. Customisable scorecards and team performance tracking. Established player with flexible scoring frameworks and a genuine depth of features for traditional QA programs. The core limitation is structural: manual review doesn't scale to 100% coverage, and the diagnosis step - figuring out why scores changed - stays manual. If you have dedicated QA staff and your support operation is primarily human agents, MaestroQA is a solid, well-built tool for what it does. It just doesn't do the part that matters most.
The category verdict: If you're committed to sample-based QA, these tools make the process less painful. Good scorecards, clear workflows, performance tracking. What they don't do is diagnose root causes automatically, handle 100% coverage, detect AI-specific failures, or implement fixes. You get a better microscope for the 3% of tickets you're already looking at.
Automated scoring platforms
These use AI to score conversations automatically, aiming for higher coverage than manual sampling. It's the right instinct (cover everything, not just a sample) applied incompletely (scoring without diagnosis is still just measurement).
Solidroad. Positions around 100% automated review and scoring, which addresses the sampling gap directly. The current focus is on scoring rather than diagnosis or fix implementation, which means you'll know that quality dropped across all your conversations instead of just a sample, but you'll still be doing the manual work to figure out why and what to do about it. Emerging player worth watching.
Intercom's CX Score. This one deserves its own paragraph because it's genuinely clever marketing wrapped around a structural conflict of interest. Intercom built a proprietary AI-driven metric to replace CSAT, and their AI conveniently scores their AI highly. The metric is a black box that can't be benchmarked externally or verified independently. You can't compare your Intercom CX Score against industry benchmarks, against competitors' scores, or even against your own historical CSAT in any meaningful way. It's the equivalent of a restaurant rating itself five stars and then telling you Yelp is outdated. If your QA vendor is also your AI vendor and they've invented a proprietary metric that only they can calculate, you should be asking some pointed questions about incentive alignment.
The category verdict: Higher coverage than manual sampling, reduced QA headcount requirements, consistent scoring criteria. But most stop at scoring without explaining why scores changed or how to fix problems. Proprietary metrics create vendor lock-in by design. You get a better thermometer, but the thermometer doesn't tell you what's causing the fever.
Diagnostic and fix-oriented tools
This is the emerging category that goes beyond measurement to diagnose why metrics moved and propose or implement fixes.
Lorikeet Coach. This is ours, so I'll keep the assessment brief and let the framework do the positioning (more on limitations below). Agent-based QA with 100% coverage, root cause diagnosis, conversational interface via Slack or Claude or ChatGPT, automated fix proposals, and AI-specific failure detection. Launched January 2026.
The comparison table
| Capability | Sample-based (Zendesk QA, MaestroQA) | Automated scoring (Solidroad, Intercom CX Score) | Diagnostic (Coach) |
|---|---|---|---|
| Coverage | 2-5% sample | 100% scored | 100% diagnosed |
| Root cause diagnosis | Manual | Manual | Automated with evidence |
| AI failure detection | Limited | Varies | Drift, hallucination, compliance, KB contradictions |
| Fix implementation | Manual | Manual | Proposed and testable |
| Works across human + AI | Yes (human focus) | Varies | Yes (unified standards) |
| Metric independence | Industry-standard metrics | Some proprietary (CX Score) | Industry-standard metrics |
| Maturity | Established | Mixed | New to market (Jan 2026) |
What actually matters when evaluating QA tools
The feature lists on QA vendor websites tend to blur together after the third demo. Here's what actually separates tools that help from tools that just measure, and why the distinction matters operationally.
Coverage that's real, not aspirational. There's a meaningful difference between "we can score 100% of conversations" and "we diagnose 100% of conversations." Scoring at scale tells you the shape of the problem. Diagnosis at scale tells you the cause. If your vendor says "100% coverage", ask them: coverage of what, exactly? Scoring? Tagging? Or actual root cause analysis? Most AI issues - a knowledge base article that contradicts your refund policy, a model that's started hallucinating shipping timelines, an agent that handles the first question well but fumbles the follow-up - don't surface from scoring alone. They surface from diagnosis, from something that looks at the conversation and works backwards to why it went wrong.
Diagnosis speed that matches the speed of the problem. AI agents can break thousands of conversations in hours. If your QA tool needs a week of accumulated data before it can tell you what went wrong, you've got a monitoring tool, not a diagnostic one. The question to ask: how quickly after a problem starts can this tool tell me what's causing it and what to do? If the answer involves "after your next QA review cycle," that's a cycle designed for human-speed problems applied to machine-speed ones.
AI-specific failure detection. This is where most legacy QA tools fall down completely, and it's not their fault - they were built before AI agents existed at any real scale. But the failure modes are genuinely different. AI agents drift over time as models update. They hallucinate confidently. They find creative interpretations of policies that technically satisfy the letter but miss the spirit. They can contradict your knowledge base in wording that shares almost no surface text with the article they contradict, which makes the contradiction almost impossible to catch with keyword-based rules. If your QA tool can't distinguish between "the agent gave a wrong answer" and "the agent gave a wrong answer because the knowledge base article on returns was updated Tuesday and now contradicts the policy document from March," you're still doing the diagnosis manually.
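The keyword problem is easy to demonstrate. Here's a toy sketch - the banned promise and both agent replies are invented for illustration - showing how a naive string rule catches the verbatim phrasing but misses a paraphrase that means exactly the same thing:

```python
def keyword_rule_flags(text: str) -> bool:
    """A naive compliance rule: flag any mention of the banned promise."""
    return "refund after 30 days" in text.lower()

# The literal phrasing is caught...
verbatim = "Sure, we can refund after 30 days."
# ...but a semantically equivalent paraphrase sails straight through.
paraphrase = "No problem - returns are accepted even once a month has passed."

print(keyword_rule_flags(verbatim))    # True
print(keyword_rule_flags(paraphrase))  # False
```

Catching the second reply requires comparing meaning, not strings - which is exactly the diagnosis work that legacy rule engines leave to humans.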
The gap between knowing and fixing. Most QA tools stop at measurement. The good ones stop at diagnosis. Very few close the loop to fix implementation. Ask your vendor: when your tool identifies a problem, what happens next? If the answer is "we surface it in a dashboard and your team investigates," that's a reporting tool. If the answer is "we diagnose the root cause, propose a fix, let you test it, and implement it," that's a QA tool.
Questions to ask every QA vendor
These are slightly adversarial on purpose. The QA category has spent years selling dashboards as solutions, and you deserve direct answers.
What percentage of conversations do you actually analyse? Not score. Not tag. Analyse for root causes. If it's less than 100%, how do you handle the conversations you miss?
When my CSAT drops 8 points, how quickly can your tool tell me why? Not that it dropped. Why. With evidence.
How do you detect AI-specific failures? Drift, hallucination, compliance gaps, knowledge base contradictions. If the answer is "the same way we detect human agent issues," that's not good enough.
Is your scoring metric proprietary or industry-standard? Can I benchmark it externally? Can I take my data to another vendor and get comparable scores?
What happens after you identify a problem? Dashboard alert? Suggested fix? Tested fix? Implemented fix? The further right on that spectrum, the more useful the tool actually is.
If your company also sells the AI agent, how do you handle the conflict of interest in grading its performance? This one's for Intercom specifically, but the principle applies anywhere the QA vendor and the agent vendor are the same company.
Where Coach fits (and where it doesn't)
I've kept the framework above vendor-neutral because I think it's genuinely useful regardless of what you buy. But we built Coach specifically because we saw this gap with our own customers, so here's the honest assessment.
Coach is limited in ways you should know about. It launched in January 2026, which means it's new to market with the rough edges that implies. Setup requires investment - this isn't a plug-and-play widget. And while Coach works across any support operation, it's strongest when paired with Lorikeet's AI agent because the diagnosis layer has deeper access to the agent's reasoning.
Coach is not a fit if: you're happy with sampling-based QA and your current process works, you don't have AI-specific failure modes to worry about, or your compliance requirements don't mandate complete conversation monitoring. Not every team needs 100% diagnostic coverage, and if you don't, the simpler tools will serve you fine at lower cost and complexity.
The bottom line
The QA tool market is splitting into two philosophies. Measurement-focused tools give you better scoring, higher coverage, prettier dashboards. They tell you what's happening with increasing precision. Action-focused tools give you diagnosis, root cause analysis, fix implementation. They tell you why things are happening and help you fix them.
Sample-based monitoring made sense when humans were the constraint. They're not anymore. When you're evaluating QA tools, demand complete coverage, diagnosis with evidence, fix proposals you can test and implement, AI failure detection that's purpose-built rather than retrofitted, and consistent measurement that you can benchmark independently.
Your customers don't care about your internal QA methodology. They care whether you fix problems fast.
FAQ
Do I really need 100% conversation coverage?
If you're running AI agents at any meaningful scale, yes. The math is straightforward: AI agents can introduce systematic errors across thousands of conversations simultaneously. A 3% sample reviewed on a weekly cycle won't surface a systematic failure in its first hour - you'd have about as much luck guessing which specific customer will complain on Twitter. By the time your sample catches it, the damage is already done. For regulated industries (fintech, insurance, healthcare), the argument is even simpler: your compliance team will eventually ask "how many conversations did you not review?" and you need a better answer than "ninety-seven percent of them."
How does Intercom's CX Score compare to industry-standard QA metrics?
It doesn't compare, and that's the point. CX Score is a proprietary metric designed to replace CSAT, not complement it. You can't benchmark it against industry data, you can't compare it across vendors, and you can't independently verify how it's calculated. Intercom argues this is a feature because CSAT is flawed (which is partially true). But replacing a flawed open standard with a proprietary black box from the same company that sells you the AI agent being measured isn't a solution to the measurement problem. It's a solution to Intercom's competitive positioning problem.
What if my team is all human agents with no AI?
Sample-based QA tools like MaestroQA or Zendesk QA are genuinely good for this use case. The diagnosis gap matters less when problems emerge at human speed rather than machine speed. If you're planning to add AI agents in the next 12 months, factor diagnostic capabilities into your evaluation now so you don't have to rip and replace later, but if AI isn't on the roadmap, the traditional tools work.
What's the minimum setup investment for diagnostic QA?
Honest answer: more than sample-based QA, less than building your own. Coach typically takes 1-2 weeks to configure properly, including defining what "good" looks like for your specific operation, connecting your data sources, and calibrating the diagnosis layer. Sample-based tools can be running in days. The tradeoff is setup time versus ongoing investigation time, and most teams doing 10+ hours per week of manual root cause analysis recoup the setup investment within the first month.
How do diagnostic tools handle false positives?
This is a legitimate concern and one we think about constantly. Coach uses a confidence-scored approach where diagnoses come with supporting evidence (specific conversations, patterns, and statistical backing) rather than binary alerts. The goal is to surface probable root causes ranked by evidence strength rather than a firehose of "something might be wrong" notifications. That said, any system doing automated diagnosis will occasionally identify patterns that aren't real problems. The mitigation is transparency: showing you the evidence and letting you decide, rather than hiding the reasoning behind a score.
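As a sketch of the general shape of evidence-ranked output - this is a hypothetical illustration of the pattern, not Coach's actual data model, and every identifier in it is invented:

```python
from dataclasses import dataclass, field

@dataclass
class Diagnosis:
    """A probable root cause, backed by concrete evidence."""
    cause: str
    evidence: list = field(default_factory=list)  # supporting conversation IDs
    confidence: float = 0.0                       # 0-1 evidence-strength score

diagnoses = [
    Diagnosis("KB article on returns contradicts refund policy",
              evidence=["conv-481", "conv-502", "conv-517"], confidence=0.91),
    Diagnosis("Possible drift on shipping-timeline answers",
              evidence=["conv-390"], confidence=0.42),
]

# Surface probable causes ranked by evidence strength, not as binary alerts:
# the reviewer sees the evidence and decides, rather than trusting a score.
for d in sorted(diagnoses, key=lambda d: d.confidence, reverse=True):
    print(f"{d.confidence:.0%}  {d.cause}  ({len(d.evidence)} conversations)")
```

The design point is the ranking plus the attached evidence: a weakly supported hypothesis still surfaces, but it arrives labeled as weak instead of indistinguishable from a confirmed root cause.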
Which industries benefit most from diagnostic QA?
Any industry where getting the answer wrong has consequences beyond customer frustration. Fintech (wrong information about fees, accounts, or transactions), insurance (incorrect coverage guidance), healthcare (inaccurate medical information), and regulated e-commerce (compliance with consumer protection laws) all have failure modes where "we sample 3% and hope for the best" isn't a defensible position. That said, even unregulated businesses with high conversation volumes benefit from the operational speed: finding and fixing problems in hours rather than weeks helps regardless of your compliance obligations.
Should I use a dashboard or a conversational interface for QA?
Dashboards are good for scheduled reviews and trend monitoring. Conversational interfaces (asking your QA tool "why did CSAT drop last Tuesday?" and getting an answer with evidence) are better for ad-hoc diagnosis and faster time-to-understanding. The best setup is both: a dashboard for the weekly review cadence, and a conversational interface for the "something broke and I need to know what" moments. If you have to pick one, pick the one that matches how your team actually works. If your QA lead checks a dashboard every Monday morning and that process works, a conversational interface is nice-to-have. If your team spends hours each week manually investigating metric movements, the conversational interface will save significantly more time.