How to QA 100% of Support Tickets (Human and AI Agents) in 2026


Lorikeet News Desk

Manual QA samples 2-5% of tickets and calls it quality assurance. That works until the ticket your reviewer never opened becomes the one your regulator asks about. Moving to 100% coverage is no longer a luxury, and in 2026 it is cheap enough that sampling is hard to defend.

QA on 100% of support tickets means scoring every interaction your team handles, human and AI, against the same rubric, instead of grading a small random sample. The shift is now practical because automated QA can read a full ticket, check the logic and the facts against your SOPs, and assign a score for cents per ticket (roughly $0.10 with Lorikeet Coach). This guide walks through how to make the move: what to score, how to score humans and AI fairly on the same scale, how to prioritize the review queue when a human still needs to look, how to close the coaching loop, and how to model the ROI before you commit.

  • Traditional manual QA reviews 2-5% of tickets, which means 95-98% of your support quality is unmeasured and most coaching is based on anecdote.

  • Automated QA scores every ticket against the same rubric, so a single missed PII disclosure or wrong refund decision surfaces whether it happened on ticket 12 or ticket 12,000.

  • Human and AI agents should be scored on the same axes (logic and factual accuracy against SOPs), with the differences handled in weighting, not in separate rubrics.

  • The review queue should be prioritized by risk and disagreement, not by recency, so human reviewer time goes to the tickets where a wrong score is most expensive.

  • Lorikeet Coach scores tickets at roughly $0.10 per ticket, which makes 100% coverage cheaper than the manual sampling it replaces at most team sizes.

Last updated: June 2026

Support QA has a measurement problem that has been hidden by tradition. For two decades the standard has been a QA analyst pulling a handful of tickets per agent per week, scoring them against a scorecard, and extrapolating. The math never worked. If you review 3% of tickets, you are blind to 97% of what your customers experienced, and the 3% you saw was chosen because it was convenient, not because it was risky. Now that AI agents handle a growing share of volume, the gap is wider still: an AI agent can produce thousands of resolutions a day, and sampling 3% of those tells you almost nothing about the edge cases where it went wrong. The good news is that the same large language models that made AI resolution viable also made 100% QA viable. This guide is the practical playbook for getting there without drowning your team in noise.

100% QA: Scoring every support interaction your team produces against a defined rubric, rather than a random sample. Coverage is total; human review is still selective, but it is applied on top of complete automated scoring instead of replacing it.

Ticket quality score: A composite score assigned to a single resolved ticket that combines logic (did the agent follow the right steps and make the right decision) and factual accuracy (was what the agent said true and consistent with your SOPs and knowledge base).

Why 2-5% Sampling Stopped Being Enough

Sampling was a budget decision dressed up as a methodology. Reviewing every ticket by hand was impossible, so teams reviewed what they could afford and treated the sample as representative. It is not representative of the thing you actually care about. A random sample is good at estimating an average and terrible at catching rare, expensive failures. In support, the rare expensive failures are the whole point: the one ticket where an agent gave incorrect tax guidance, approved a refund that violated policy, or disclosed account details to the wrong person. Those do not show up in a 3% sample often enough to manage, and when they do show up it is usually after the customer has already complained or the regulator has already asked.
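To put a number on how poorly sampling catches rare failures, here is a minimal sketch. It assumes each ticket is pulled into the review sample independently with the same probability, which is a simplification of how real QA sampling works. The function name is hypothetical.

```python
def p_sample_catches_any(sample_rate: float, failing_tickets: int) -> float:
    """Probability that a random sample contains at least one failing ticket.

    Assumes every ticket is sampled independently with probability
    `sample_rate` (an illustrative simplification).
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if failing_tickets < 0:
        raise ValueError("failing_tickets must be non-negative")
    # Chance that every failing ticket is missed, subtracted from 1.
    return 1.0 - (1.0 - sample_rate) ** failing_tickets


if __name__ == "__main__":
    for k in (1, 5, 10, 50):
        chance = p_sample_catches_any(0.03, k)
        print(f"{k:>3} failing tickets at a 3% sample: {chance:.1%} chance of seeing any")
```

Under these assumptions, a single bad ticket has a 3% chance of ever being opened, and even ten bad tickets give a reviewer only about a one-in-four chance of seeing one of them.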

The arrival of AI agents changes the urgency. A human agent handles maybe 40-60 tickets a day, so a team of 50 produces a few thousand interactions a day and sampling at least touches every agent. An AI concierge can resolve tens of thousands of tickets a day, and its failure modes are different: it does not get tired, but it can apply a subtly wrong policy consistently across thousands of tickets before anyone notices. Sampling an AI agent at 3% is not risk management; it is hoping the 3% you read contains enough instances of the systematic error for a reviewer to recognize it as a pattern. When the error is confined to a narrow ticket type, it usually will not.

The cost argument has also flipped. The reason teams sampled was that human review is expensive, roughly the same loaded cost as handling the ticket in the first place. Automated QA reads the full ticket, evaluates it against your rubric, and returns a score for a small fraction of that. When scoring every ticket costs less than sampling 5% of them used to, the only reason to keep sampling is inertia.

Step 1: Define What You Score (Logic and Factual Accuracy Against SOPs)

Before you automate anything, get the rubric right, because automation makes a bad rubric scale as fast as a good one. Most legacy QA scorecards are a grab bag of 15-30 line items: greeting used, empathy shown, branding correct, hold time announced, closing statement present. Many of those are tone and process compliance, and while they matter, they are not where regulated and complex businesses lose money. The two axes that carry the weight are logic and factual accuracy, both judged against your standard operating procedures.

Logic asks whether the agent did the right thing: did it follow the correct procedure for this ticket type, gather the required verification before taking an action, make the decision your SOP dictates, and escalate when escalation was warranted. A refund issued without the required identity check fails on logic even if the customer was delighted. Factual accuracy asks whether what the agent said was true: were the policy details correct, the account figures right, the next steps accurate, and the claims consistent with your current knowledge base. An agent that confidently quotes last year's fee schedule fails on factual accuracy even if every step it took was procedurally clean.

Anchor both axes to your SOPs and knowledge base rather than to a reviewer's opinion. The single biggest source of unreliable QA is a rubric that depends on what the individual reviewer would have done. If the scoring criterion is "is this consistent with SOP 4.2 and the current refunds article," two reviewers (and an automated scorer) can agree. If the criterion is "was this a good response," they cannot. Write the rubric so that every item points at a documented source of truth. This is also what makes automated scoring defensible: the score is not a vibe, it is a check against a citable policy.

Keep tone, empathy, and brand-voice items, but separate them into their own bucket with their own weight. They are real, and customers feel them, but a tone miss and a wrong-refund miss are not the same severity and should never average into a single undifferentiated number. Severity weighting matters more than the count of items: one critical factual error should be able to fail a ticket on its own, regardless of how clean the rest of it was.
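One way to express separate buckets and a critical-failure override is sketched below. The bucket weights, the pass threshold, and all names are hypothetical values chosen for illustration; they are not Lorikeet's scoring formula, and your own rubric would set them.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical weights and threshold, chosen only for illustration.
BUCKET_WEIGHTS: Dict[str, float] = {"logic": 0.45, "factual": 0.45, "tone": 0.10}
PASS_THRESHOLD = 0.80


@dataclass(frozen=True)
class RubricItem:
    name: str               # e.g. "identity verified before refund"
    bucket: str             # "logic", "factual", or "tone"
    passed: bool
    critical: bool = False  # True if a miss should fail the ticket outright


def score_ticket(items: List[RubricItem]) -> dict:
    """Return per-bucket scores, a weighted composite, and a pass/fail verdict."""
    critical_failures = [i.name for i in items if i.critical and not i.passed]

    bucket_scores: Dict[str, float] = {}
    for bucket in BUCKET_WEIGHTS:
        in_bucket = [i for i in items if i.bucket == bucket]
        if in_bucket:  # skip buckets with no applicable items on this ticket
            bucket_scores[bucket] = sum(i.passed for i in in_bucket) / len(in_bucket)

    # Renormalise so a ticket with no items in a bucket is not penalised for it.
    total_weight = sum(BUCKET_WEIGHTS[b] for b in bucket_scores)
    composite = (
        sum(BUCKET_WEIGHTS[b] * s for b, s in bucket_scores.items()) / total_weight
        if total_weight
        else 0.0
    )

    return {
        "bucket_scores": bucket_scores,
        "composite": composite,
        "critical_failures": critical_failures,
        # A single critical miss fails the ticket regardless of the composite.
        "passed": not critical_failures and composite >= PASS_THRESHOLD,
    }
```

With these illustrative weights, a ticket that passes every logic and tone item but gets one of three factual items wrong scores about 0.85. It passes if that miss was minor and fails outright if the miss was marked critical, which is the behavior the paragraph above argues for.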

Step 2: Score Human and AI Agents on the Same Scale, Fairly

As soon as you have AI agents and human agents in the same operation, the obvious temptation is to build two QA programs. Resist it. If humans and AI are scored on different rubrics, you cannot compare them, you cannot route work to whichever resolves a ticket type better, and you cannot tell whether your AI is actually outperforming the team it augments. Score both on the same axes (logic and factual accuracy against the same SOPs) so the numbers are comparable.

Fairness lives in the weighting and the interpretation, not in separate scorecards. A few principles keep it honest:

  • Judge both against the SOP, not against each other. The question for every ticket is "did this resolution match what our policy requires," whether the agent was a person or a model. The bar is the policy, not the other population's average.

  • Account for difficulty. AI agents and humans often handle different ticket mixes, and the AI may take the high-volume routine work while humans take the messy escalations. Comparing raw scores without controlling for ticket complexity flatters whoever got the easier queue. Segment by ticket type before you compare populations.

  • Use the same critical-failure definitions for both. A PII disclosure or an out-of-policy financial action is a critical failure whether a human or an AI committed it. Holding AI to a stricter critical-failure bar than humans (or the reverse) breaks the comparison and usually hides a real problem on the lenient side.

  • Score AI on what it could control. If an AI agent escalated correctly because a tool was down, that is a pass on logic, not a failure for not resolving. Penalizing the agent for a correct escalation teaches the wrong lesson and is the same mistake as dinging a human for a justified transfer.
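The "account for difficulty" principle above can be made concrete with a small sketch: compare mean scores within each ticket type rather than across the whole population. The field names (`agent_kind`, `ticket_type`, `score`) and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def mean_score_by_segment(tickets: Iterable[dict]) -> Dict[Tuple[str, str], float]:
    """Mean QA score per (ticket_type, agent_kind) pair."""
    grouped: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for t in tickets:
        grouped[(t["ticket_type"], t["agent_kind"])].append(t["score"])
    return {key: mean(scores) for key, scores in grouped.items()}


def mean_score_by_agent_kind(tickets: Iterable[dict]) -> Dict[str, float]:
    """Raw mean QA score per agent kind, ignoring ticket mix."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for t in tickets:
        grouped[t["agent_kind"]].append(t["score"])
    return {kind: mean(scores) for kind, scores in grouped.items()}


# Made-up data: the AI takes mostly routine tickets, humans mostly disputes.
SAMPLE = (
    [{"agent_kind": "ai", "ticket_type": "order_status", "score": 0.90}] * 3
    + [{"agent_kind": "ai", "ticket_type": "dispute", "score": 0.60}]
    + [{"agent_kind": "human", "ticket_type": "order_status", "score": 0.95}]
    + [{"agent_kind": "human", "ticket_type": "dispute", "score": 0.70}] * 3
)

if __name__ == "__main__":
    print("raw:", mean_score_by_agent_kind(SAMPLE))
    print("segmented:", mean_score_by_segment(SAMPLE))
```

In this made-up data the AI has the higher raw average, yet humans score higher within both ticket types. The raw comparison is rewarding the easier queue, not the better resolver.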

There is one structural advantage to scoring AI agents that humans do not offer: the AI produces a complete, replayable record of its reasoning and every tool call. You are not inferring what the agent was thinking from a transcript, you can see the actual decision path. That makes AI QA both easier and stricter, because the evidence is unambiguous. Use that to hold the AI to a high bar, not to grade it more leniently because it is new.

Step 3: Prioritize the Review Queue by Risk, Not Recency

Scoring 100% of tickets automatically does not mean a human reads 100% of tickets. The point of total automated coverage is that it tells you exactly which tickets a human should read. The wrong default is chronological or random review, because it spends your most expensive resource (a senior reviewer's attention) on tickets that are probably fine. Prioritize the queue so human review lands where a wrong or missed score is most costly.

A practical prioritization stack, in order:

  • Critical-failure flags first. Any ticket the automated scorer flagged as a potential PII disclosure, out-of-policy financial action, or compliance breach goes to a human immediately, regardless of overall score. These are the tickets where a false negative is unacceptable.

  • Low scores on high-risk ticket types next. A low-scoring KYC, dispute, or account-closure ticket is more urgent than a low-scoring "where is my order," because the downside is larger. Weight by the risk of the workflow, not just the size of the score gap.

  • Disagreement and low confidence. When the automated scorer is uncertain, or when its score conflicts with a customer satisfaction signal (a high QA score on a ticket the customer rated one star, or the reverse), a human should adjudicate. These are also your best calibration data.

  • Sampled passes for calibration. Keep a small random sample of high-scoring tickets in the human queue specifically to verify the automated scorer is not rubber-stamping. This is how you catch the scorer drifting, and it is the inverse of old-school sampling: you are auditing the auditor, not the agents.
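The stack above can be sketched as a tiering function plus a sort. Every threshold, field name, and the calibration rate here is a hypothetical placeholder, and the sketch is not Lorikeet Coach's implementation. As a simplification, low scores on low-risk ticket types are left to aggregate reporting rather than queued for a human.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical thresholds; tune them against your own calibration data.
HIGH_RISK_TYPES = {"kyc", "dispute", "account_closure"}
LOW_SCORE = 0.70
HIGH_SCORE = 0.90
LOW_CONFIDENCE = 0.60


@dataclass
class ScoredTicket:
    ticket_id: str
    ticket_type: str
    qa_score: float             # 0.0-1.0 from the automated scorer
    scorer_confidence: float    # 0.0-1.0 self-reported by the scorer
    critical_flag: bool         # potential PII, financial, or compliance breach
    csat: Optional[int] = None  # 1-5 customer rating, if one exists


def review_tier(t: ScoredTicket) -> Optional[int]:
    """Return 0, 1, or 2 for tickets a human should read, else None."""
    if t.critical_flag:
        return 0
    if t.qa_score < LOW_SCORE and t.ticket_type in HIGH_RISK_TYPES:
        return 1
    csat_conflict = t.csat is not None and (
        (t.qa_score >= HIGH_SCORE and t.csat <= 2)
        or (t.qa_score < LOW_SCORE and t.csat >= 4)
    )
    if t.scorer_confidence < LOW_CONFIDENCE or csat_conflict:
        return 2
    return None


def build_review_queue(
    tickets: List[ScoredTicket],
    calibration_rate: float = 0.02,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Order ticket IDs by tier, then by lowest score within a tier."""
    rng = rng or random.Random()
    queue = []
    for t in tickets:
        tier = review_tier(t)
        # Tier 3: a small random sample of passes, to audit the scorer itself.
        if tier is None and t.qa_score >= HIGH_SCORE and rng.random() < calibration_rate:
            tier = 3
        if tier is not None:
            queue.append((tier, t.qa_score, t.ticket_id))
    queue.sort()
    return [ticket_id for _, _, ticket_id in queue]
```

Sorting on a tuple keeps the ordering rule easy to audit: a critical flag always outranks a low score, and within a tier the lowest-scoring ticket is read first.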

Done well, this inverts the economics of QA. Instead of a reviewer reading 50 random tickets to find the 2 that mattered, the reviewer reads 50 tickets that were chosen because they are the 50 most likely to matter. The same headcount produces far more caught failures, and the failures it catches are the expensive ones.

Step 4: Close the Coaching Loop

A QA score that does not change behavior is a number you paid for and ignored. The value of 100% coverage is realized in the loop: scores reveal patterns, patterns drive coaching, coaching changes behavior, and the next round of scores confirms whether it worked. With sampling, this loop barely closed because the data was too thin to see patterns and too stale to act on. With full coverage you can see, for a specific human agent, that they consistently miss the verification step on account-change tickets, and coach that exact behavior with the exact tickets as evidence.

For human agents, full-coverage QA changes coaching from "here are two tickets I happened to review" to "here is your pattern across every ticket you handled this month." That is both fairer and more actionable. Agents trust coaching more when it is based on their whole body of work rather than a reviewer's small and possibly unlucky sample, and managers can distinguish a one-off slip from a genuine skill gap.

For AI agents, the coaching loop is a configuration loop, and it is faster. When QA shows the AI systematically mishandling a ticket type (applying an outdated policy, escalating when it should resolve, missing a disclosure), that is not a training conversation, it is a fix to a workflow, a knowledge article, or a guardrail. Because the QA score points at the exact reasoning step that failed, you can change the SOP or the prompt, then re-run the affected scenarios to confirm the fix before it touches another live ticket. The same QA data that grades the AI becomes the spec for improving it.

The discipline that ties it together is treating QA as a closed system. Every coaching action should produce a measurable change in the next scoring period, and if it does not, the coaching (or the rubric) is wrong. Full coverage gives you the statistical power to actually see those changes instead of guessing.
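To illustrate the statistical-power point, the sketch below applies a standard two-proportion z-test to a miss rate before and after coaching. The function name and the example counts are hypothetical. The normal approximation behind the test is rough for small counts, so treat it as a sanity check rather than a verdict.

```python
import math


def miss_rate_change(
    misses_before: int, n_before: int, misses_after: int, n_after: int
) -> dict:
    """Compare a miss rate across two scoring periods with a two-proportion z-test."""
    if n_before <= 0 or n_after <= 0:
        raise ValueError("both periods need at least one scored ticket")
    p_before = misses_before / n_before
    p_after = misses_after / n_after
    # Pooled rate under the assumption that coaching changed nothing.
    pooled = (misses_before + misses_after) / (n_before + n_after)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    z = (p_before - p_after) / se if se > 0 else 0.0
    return {
        "before": p_before,
        "after": p_after,
        "z": z,
        # |z| > 1.96 corresponds to roughly the 5% two-sided significance level.
        "distinguishable_from_noise": abs(z) > 1.96,
    }


if __name__ == "__main__":
    # Full coverage: every account-change ticket the agent handled is scored.
    print(miss_rate_change(60, 200, 30, 200))
    # Sampling: ten reviewed tickets per period.
    print(miss_rate_change(3, 10, 1, 10))
```

With these made-up counts, a drop from 30% to 15% measured over 200 tickets per period stands clear of noise. A drop from 3 misses in 10 to 1 in 10 does not, even though the apparent improvement is larger.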

Step 5: Model the ROI Before You Commit

The case for 100% QA is usually easier to make than teams expect, because the comparison is not "free sampling versus paid full coverage," it is "expensive sampling versus cheaper full coverage." Build the model on three lines.

First, the cost of the QA you do today. A manual reviewer scoring tickets carefully manages perhaps 20-40 tickets an hour, and at the loaded cost of a support specialist that puts manual QA in the range of a dollar or more per reviewed ticket. Multiply by your sample volume. That is your current spend to see 2-5% of your operation.

Second, the cost of automated full coverage. Automated QA prices per ticket scored, and at a rate around $0.10 per ticket (where Lorikeet Coach sits), scoring 100% of tickets often costs less in absolute dollars than the manual program that covered a sliver of them. Even where total spend rises, the cost per unit of coverage collapses.

Third, and this is the line that actually justifies the program, the cost of the failures you currently miss. A single out-of-policy refund, a mishandled dispute that becomes a chargeback, or a compliance miss that draws a regulator's attention can cost more than a year of QA tooling. Sampling is designed to miss exactly these, because they are rare. Full coverage is designed to catch them. You do not need to catch many to pay for the program, and in a regulated business catching one can be the difference between a quiet quarter and a bad one.
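The three lines above can be put into a small calculator so you can plug in your own numbers. This is a sketch under stated assumptions: the function and parameter names are hypothetical, the $0.10 default is the approximate per-ticket rate quoted in this guide, and the average cost of a missed failure is something only you can estimate.

```python
from typing import Optional


def qa_cost_model(
    monthly_tickets: int,
    sample_rate: float,               # e.g. 0.03 for a 3% manual sample
    reviewer_hourly_cost: float,      # loaded cost of a reviewer, per hour
    reviews_per_hour: float,          # the guide suggests 20-40
    automated_cost_per_ticket: float = 0.10,
    avg_missed_failure_cost: float = 0.0,  # your own estimate; 0 to skip
) -> dict:
    """Compare manual sampling spend with automated full-coverage spend."""
    # Line 1: what manual QA costs today.
    manual_cost_per_review = reviewer_hourly_cost / reviews_per_hour
    manual_spend = monthly_tickets * sample_rate * manual_cost_per_review

    # Line 2: what scoring every ticket would cost.
    automated_spend = monthly_tickets * automated_cost_per_ticket

    # Line 3: how many avoided failures would cover any extra spend.
    extra_spend = max(0.0, automated_spend - manual_spend)
    failures_to_cover: Optional[float] = (
        extra_spend / avg_missed_failure_cost if avg_missed_failure_cost > 0 else None
    )

    return {
        "manual_cost_per_review": manual_cost_per_review,
        "manual_monthly_spend": manual_spend,
        "automated_monthly_spend": automated_spend,
        # Manual sample rate at which both programs cost the same in total.
        "break_even_sample_rate": automated_cost_per_ticket / manual_cost_per_review,
        "failures_to_cover_extra_spend": failures_to_cover,
    }


if __name__ == "__main__":
    print(qa_cost_model(100_000, 0.03, 45.0, 30, avg_missed_failure_cost=2_000))
```

With those hypothetical inputs (100,000 tickets a month, a 3% sample, $45 an hour, 30 reviews an hour), manual QA costs $1.50 per reviewed ticket and about $4,500 a month, while full coverage costs $10,000. Total spend rises here, and avoiding roughly three $2,000 failures a month would cover the difference. The break-even sample rate is about 6.7%, so whether full coverage is cheaper in absolute dollars depends on your reviewer cost and current sample rate.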

A note on overclaiming: 100% QA supports your compliance obligations and surfaces failures far earlier, but it does not guarantee you will never have a bad ticket reach a customer. Automated scoring is post-resolution by nature. The honest framing is that it shrinks the window between a failure happening and you knowing about it from weeks (or never) to the same day, and it makes that window the same for ticket one and ticket one million.

How Lorikeet Coach Fits

Lorikeet runs two agents: the Concierge, which resolves tickets across chat, email, voice, SMS, and WhatsApp, and Coach, which handles analytics and QA. Coach is built for the workflow in this guide. It scores 100% of tickets, human and AI, against your rubric, produces a ticket quality score that combines logic and factual accuracy, performs root-cause analysis on failures, and verifies resolution. Internally we describe it as the AI evaluating the AI, and it is deployable standalone at roughly $0.10 per ticket, so you can run Coach over a human or third-party support operation even if your resolution agent is something else.

Coach exists because of how Lorikeet thinks about reliability in regulated industries. The platform's approach is defense in depth: adversarial simulations before launch, message checks on the way in, guardrails on the way out, and 100% post-resolution QA as the final layer. Coach is that final layer. The honest limitation is that Coach scores tickets after they resolve, so it is a detection and improvement system, not a real-time block. The pre-resolution guardrails and simulations are what catch a failure before it reaches a customer; Coach is what makes sure you learn from the failures that get through, and from every other ticket as well.

If your QA program today is a reviewer sampling a few percent of tickets and extrapolating, the move to 100% coverage is the highest-leverage change available to a support quality team in 2026. Score everything, score humans and AI on the same honest scale, send people to the tickets that actually matter, and close the loop. See how Lorikeet Coach scores 100% of your tickets.

Key Takeaways

  • Sampling 2-5% of tickets measures an average and misses the rare, expensive failures (out-of-policy refunds, PII disclosures, compliance breaches) that are the entire point of QA in a regulated business.

  • Score humans and AI on the same two axes, logic and factual accuracy against your SOPs, and handle fairness through weighting, difficulty segmentation, and shared critical-failure definitions rather than separate rubrics.

  • 100% automated coverage does not mean humans read every ticket; it tells you which tickets to read, so prioritize the review queue by critical-failure flags, high-risk workflows, and scorer disagreement.

  • The coaching loop is where the value lands: full coverage turns coaching from anecdote into pattern for humans and into a fast configuration-and-re-test loop for AI.

  • At roughly $0.10 per ticket, automated QA usually costs less than the manual sampling it replaces, and catching even one missed high-cost failure pays for the program.