Evaluating AI CX Platforms: A Technical Checklist for AI and ML Teams

Thomas Wing-Evans

Your VP of CX just forwarded you a shortlist of three AI customer service platforms. The demos looked good. The sales decks mentioned "enterprise-grade AI" and "99% accuracy." Now your four-person AI/ML team has two weeks to figure out which one will actually survive contact with production traffic, regulated data, and customers who ask questions no one anticipated during the pilot.

This is where most evaluations go wrong. Gartner reported that at least 50% of generative AI projects were abandoned after proof of concept by the end of 2025, and predicts that through 2026, organizations will abandon 60% of AI projects unsupported by AI-ready data. The gap between a compelling demo and a production system handling 10,000 conversations a day is not a configuration problem. It is an architecture problem.

What follows is the checklist we would use if evaluating an AI CX platform from scratch. It is written for the Head of AI who will dig into architecture diagrams, ask about hallucination rates, and want to know exactly where customer data lives at rest.

Architecture first.

The single most important question in any AI CX evaluation is how the platform orchestrates work. There are two dominant patterns with fundamentally different reliability profiles.

The first is prompt-and-pray: a large language model receives the customer message, a system prompt, and some retrieved context, then generates a response. Fast to demo, slow to debug. The model might hallucinate a refund policy that does not exist, or correctly resolve a billing question 99 times and fabricate an answer on the 100th.

The second pattern is structured orchestration, where the LLM operates within a defined workflow graph. The model makes decisions at specific nodes ("Is this customer asking about billing or shipping?"), but the actions it can take at each step are constrained by the graph. This pattern is harder to build but fundamentally easier to audit, debug, and trust in production.

Ask the vendor: can you show me the execution trace for a resolved conversation? If the answer is a chat transcript, that is pattern one. If the answer is a step-by-step graph showing each decision node, each API call, and each constraint boundary, that is pattern two. The difference matters when your compliance team asks why the AI told a customer they could get a refund outside policy.
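
To make the difference concrete, here is a minimal sketch (in Python, with hypothetical node and action names) of what a pattern-two execution trace might look like: each step records the decision node, the choice the model made, and the set of actions the graph permitted at that node, so a reviewer can replay exactly what happened and verify no step escaped its constraint boundary.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a structured-orchestration trace. Every step
# records the node, the model's decision, and the actions the workflow
# graph allowed at that node.

@dataclass
class TraceStep:
    node: str        # decision node in the workflow graph
    decision: str    # what the model chose at this node
    allowed: list    # actions the graph permitted at this node

@dataclass
class ExecutionTrace:
    conversation_id: str
    steps: list = field(default_factory=list)

    def record(self, node, decision, allowed):
        # a decision outside the graph's constraint boundary is rejected,
        # not silently executed
        if decision not in allowed:
            raise ValueError(f"{decision!r} not permitted at node {node!r}")
        self.steps.append(TraceStep(node, decision, allowed))

trace = ExecutionTrace("conv-123")
trace.record("classify_intent", "billing", ["billing", "shipping", "other"])
trace.record("billing_action", "explain_charge", ["explain_charge", "escalate"])
```

A chat transcript cannot answer "why did the AI do this?"; a trace like this can, which is what the compliance conversation ultimately requires.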

Hallucination benchmarks.

Enterprise benchmarks report hallucination rates of 15% to 52% across commercial LLMs, depending on the task. That range is not a typo. On grounded summarization tasks, top models have improved to 0.7% to 1.5% hallucination rates. But on domain-specific tasks like medical, legal, or technical analysis, rates of 10% to 20% or higher are common. A Deloitte survey found that 47% of enterprise AI users made at least one major decision based on hallucinated content in 2024.

For customer service, the relevant benchmark is not how often the model hallucinates in a research paper. It is how often it generates a response that contradicts your policies, invents a product feature, or promises something your operations team cannot deliver. RAG (retrieval-augmented generation) systems report 60% to 80% lower hallucination rates than non-RAG models, but RAG alone is not sufficient. The retrieval layer can return the right document and the generation layer can still misinterpret it.

The questions to ask during evaluation are specific. What is the platform's measured hallucination rate on customer-facing responses, not on academic benchmarks? How is that rate measured: by sampling, by automated checking against source documents, or by human review? What happens when a hallucination is detected in production? Is there a defence-in-depth approach with multiple layers catching errors before they reach the customer?
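
As a deliberately naive illustration of the "automated checking against source documents" idea (not a production method), a grounding check can compare a response's content words against the retrieved sources and flag low-overlap responses for review. The threshold and word-overlap heuristic here are illustrative assumptions; real systems use entailment models or claim-level verification.

```python
# Naive grounding check: flag a response as potentially ungrounded when
# too few of its words appear in the source documents it was supposed
# to be grounded in. Purely illustrative.

def grounding_score(response: str, sources: list) -> float:
    source_words = set(" ".join(sources).lower().split())
    words = [w.strip(".,!?").lower() for w in response.split()]
    if not words:
        return 1.0
    hits = sum(1 for w in words if w in source_words)
    return hits / len(words)

def flag_for_review(response, sources, threshold=0.6):
    return grounding_score(response, sources) < threshold

policy = ["Refunds are available within 30 days of purchase."]
grounded = flag_for_review("Refunds are available within 30 days.", policy)
invented = flag_for_review("We offer lifetime free replacements.", policy)
```

The point of asking vendors for their measurement method is to learn where their version of this check sits: sampling, automated verification, or human review, and what each layer catches that the others miss.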

If a vendor cannot answer these questions with specific numbers from production deployments, they have not solved this problem yet. They have deferred it to you.

Data residency and sovereignty.

Deloitte's State of AI in the Enterprise report found that 73% of enterprises cite data privacy and security as their top AI risk concern, and 77% factor a vendor's country of origin into AI purchasing decisions. These numbers reflect a real architectural constraint, not a procurement preference.

Data residency (where data is physically stored) and data sovereignty (whose laws govern that data) are not the same thing. The US CLOUD Act allows US law enforcement to compel American companies to provide access to data stored abroad. If your vendor is headquartered in the US, your customer data is subject to US jurisdiction even if the servers sit in Frankfurt or Sydney.

The EU AI Act, fully applicable in August 2026, requires high-risk AI systems to have documented data governance, bias detection, and datasets that reflect the deployment environment. Penalties reach 7% of global annual turnover, exceeding GDPR fines. Sixty-five percent of enterprises now prioritize cloud providers with strong region-specific offerings when evaluating AI vendors, according to Forrester.

Your checklist should include: where does customer conversation data reside at rest? Where is it processed? Which LLM providers does the platform use, and where are those models hosted? Does the vendor offer a Business Associate Agreement for HIPAA-covered data? Can you deploy in a specific region without your data transiting through another jurisdiction?

Security and compliance surface.

The average enterprise data breach costs $4.88 million. Among organizations reporting an AI-related security incident in 2025, IBM found that 97% lacked proper AI access controls and 63% lacked AI governance policies. SOC 2 adoption surged 40% in 2024 as companies rushed to meet client demands, with over 60% of businesses more likely to partner with a vendor that holds SOC 2 certification.

For AI CX platforms, the security surface is larger than a typical SaaS product. The platform processes customer PII in real time, connects to your backend systems via API, and generates natural language responses that represent your brand.

The minimum checklist: SOC 2 Type II certification. Encryption at rest (AES-256) and in transit (TLS 1.3). Customer-managed encryption keys. Role-based access controls with audit logging. A penetration test report from the last 12 months. If the platform connects to your systems via API, can you scope its access to specific endpoints?

For regulated industries, add: HIPAA BAA availability. GDPR data processing agreements. The ability to redact or delete specific customer data on request. A configurable data retention policy.

Model and prompt governance.

LangChain's 2026 State of AI Agents report found that 57% of organizations now have AI agents in production, with quality cited as the top barrier to deployment by 32% of respondents. Quality in this context means governance: who controls what the model can say, how changes are tested, and what happens when something goes wrong.

The governance questions that matter are operational. Can you define boundaries for what the AI is allowed to say and do? Can you restrict it from discussing specific topics or offering discounts above a threshold? When you update a policy, how does that change propagate? Is there a staging environment where you can test changes before they hit production traffic?

This is where guardrails become a technical evaluation criterion rather than a marketing term. A guardrail that operates as a post-generation filter (scan the output and block it if it violates a rule) is fundamentally different from a guardrail that constrains what the model can generate in the first place. Post-generation filtering catches some errors. Architectural constraints prevent categories of errors from being possible. Ask the vendor which approach they use.
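
The contrast can be sketched in a few lines. Below, `generate()` is a hypothetical stand-in for an LLM call: the post-generation filter lets the off-policy output exist and then blocks it, while the constrained version only ever selects among vetted responses, so the off-policy offer is never generated at all.

```python
# Sketch of the two guardrail styles. `generate` is a hypothetical
# stand-in for an LLM that has produced an off-policy reply.

FORBIDDEN = ["discount"]

def generate(prompt):
    return "I can offer you a 50% discount today."

# Style 1: post-generation filter. The bad output exists, then is caught.
def filtered_reply(prompt):
    reply = generate(prompt)
    if any(term in reply.lower() for term in FORBIDDEN):
        return "Let me connect you with a teammate."
    return reply

# Style 2: architectural constraint. The model chooses among vetted
# actions, so an off-policy offer is not representable.
TEMPLATES = {
    "explain_charge": "Here is a breakdown of the charge on your account.",
    "escalate": "Let me connect you with a teammate.",
}

def constrained_reply(choice):
    return TEMPLATES[choice]  # a choice outside the graph raises KeyError
```

Style 1 catches the rules you thought to write; style 2 makes whole categories of violation structurally impossible, which is why the question of which approach a vendor uses belongs in the technical evaluation.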

Observability in production.

Most AI agents still operate as opaque black boxes, creating hidden risks across security, compliance, performance, and governance. For an AI/ML team evaluating a CX platform, observability is not optional. It is the mechanism by which you maintain confidence in a system that generates novel outputs for every conversation.

The observability requirements go beyond uptime dashboards. You need per-conversation execution traces showing every decision the AI made and why. Token-level logging for cost monitoring. Semantic drift detection that alerts you when response patterns shift. Confidence scores on individual responses, so you can set thresholds for automatic escalation.
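
The escalation-threshold requirement is simple to state precisely. A minimal sketch, with an illustrative threshold value: every response carries a confidence score, and anything below the threshold routes to a human rather than reaching the customer, with the routing decision logged as part of the conversation trace.

```python
# Minimal sketch of confidence-based escalation. The threshold value is
# an illustrative assumption; teams tune it per risk category.

ESCALATION_THRESHOLD = 0.85

def route(conversation_id: str, confidence: float) -> dict:
    action = "send" if confidence >= ESCALATION_THRESHOLD else "escalate"
    # this record is what a per-conversation observability view surfaces
    return {"conversation": conversation_id,
            "confidence": confidence,
            "action": action}
```

A platform that cannot expose per-response confidence cannot support this control, which is one quick way to separate trace-level observability from aggregate dashboards.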

Gartner predicts that by 2027, 40% of enterprise workloads will be managed by autonomous AI agents. During your evaluation, ask the vendor to show you their monitoring dashboard for a production deployment. If they show aggregate metrics (total conversations, average CSAT), push for per-conversation traces. The aggregate view tells you things are working. The trace view tells you why something failed.

Integration architecture.

An AI CX platform that cannot read your customer's order history, check their account status, or process a refund through your existing systems is a chatbot with a language model attached. Integration quality matters more than integration count. A platform with ten deeply integrated enterprise systems that can pass full event context to the model is more valuable than one with 200 shallow connectors that only read data.

The integration checklist: does the platform support bidirectional API access to your core systems (CRM, order management, billing)? Can it execute write operations (process refunds, update accounts, cancel orders) or only read data? How does it handle API failures or timeouts mid-conversation? For a digital health platform, add: can the AI access patient records through your EHR API without storing data in the platform? Can it verify insurance eligibility in real time? Each integration carries both a technical and a compliance requirement.
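
The mid-conversation failure question deserves a concrete answer. One reasonable policy, sketched below with a hypothetical backend call: bounded retries with exponential backoff, then a graceful handoff to a human instead of a stalled conversation.

```python
import time

# Sketch of a retry-then-escalate policy for backend calls that time out
# mid-conversation. The backend call and delays are illustrative.

def with_retries(call, attempts=3, base_delay=0.05):
    for i in range(attempts):
        try:
            return {"ok": True, "result": call()}
        except TimeoutError:
            if i < attempts - 1:
                time.sleep(base_delay * 2 ** i)  # exponential backoff
    # retries exhausted: hand off rather than leave the customer hanging
    return {"ok": False, "fallback": "escalate_to_human"}

calls = {"n": 0}
def flaky_refund():  # hypothetical backend call: fails twice, then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError
    return "refund-issued"
```

Ask the vendor to demonstrate the equivalent behavior: what the customer sees during the retries, and what the execution trace records when the fallback fires.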

Testing and validation.

Sixty percent of new RAG deployments now include systematic evaluation from day one, up from less than 30% in early 2025. Your evaluation should assess the vendor's testing infrastructure as seriously as their product features. Can you run automated test suites against the AI's responses? Can you define test cases for your highest-risk scenarios and run them on every deployment?

The testing capability you care about most is regression testing. When the vendor updates their underlying model or you change a workflow, you need to know whether behavior on existing test cases has changed. Without regression testing, every change is a deployment with unknown side effects.
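
The regression-testing requirement can be sketched as a diff between suite runs: fixed test cases with recorded baseline behavior, re-run after every model or workflow change, with any divergence surfaced before release. The agents and cases below are hypothetical stand-ins.

```python
# Sketch of deployment-time regression testing: run a fixed suite,
# diff against the previous run, surface every behavioral change.

def run_suite(agent, cases):
    return {case_id: agent(prompt) for case_id, prompt in cases.items()}

def regressions(previous, current):
    return {k for k in previous if current.get(k) != previous[k]}

cases = {
    "refund-window": "What is your refund window?",
    "cancel-order": "Cancel my order",
}

# hypothetical stand-ins for two versions of the deployed agent
agent_v1 = lambda p: "30 days" if "refund" in p else "done"
agent_v2 = lambda p: "14 days" if "refund" in p else "done"

baseline = run_suite(agent_v1, cases)
changed = regressions(baseline, run_suite(agent_v2, cases))
```

Here the suite catches that the refund answer changed between versions, which is exactly the signal you need before an underlying model update reaches production traffic.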

Total cost of ownership.

Organizations investing in AI initiatives average $1.0 million to $2.6 million per use case, according to procurement research. The sticker price is a fraction of the total cost. The rest includes integration development, internal team time for prompt engineering and monitoring, escalation handling, compliance review, and model retraining costs.

Ask for a transparent pricing model covering conversation volume, API calls, and per-token charges from underlying LLM providers. Ask what happens when volume doubles. The cheapest platform per conversation is not the cheapest to operate if it requires three full-time engineers to keep running.
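
A back-of-envelope TCO model makes the "cheapest per conversation" trap visible. All figures below are made-up illustrative assumptions, not vendor pricing: per-conversation platform fees, LLM token pass-through, and fractional engineering headcount often dominate the sticker price.

```python
# Illustrative annual TCO sketch. Every number here is an assumption
# for demonstration, not a real price.

def annual_tco(convs_per_month, price_per_conv, tokens_per_conv,
               price_per_1k_tokens, engineer_fte, fte_cost):
    platform = convs_per_month * 12 * price_per_conv
    tokens = convs_per_month * 12 * tokens_per_conv / 1000 * price_per_1k_tokens
    people = engineer_fte * fte_cost
    return {"platform": platform, "tokens": tokens,
            "people": people, "total": platform + tokens + people}

tco = annual_tco(convs_per_month=10_000, price_per_conv=0.50,
                 tokens_per_conv=4_000, price_per_1k_tokens=0.01,
                 engineer_fte=0.5, fte_cost=200_000)
```

With these assumed inputs, the half-time engineer costs more than the platform fees, which is the point: model the people and token line items before comparing sticker prices, and rerun the model at double the volume.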

Lorikeet's approach.

Lorikeet is an AI customer support platform that resolves tickets end-to-end across chat, email, and voice, handling complex multi-step workflows including processing refunds, updating accounts, and managing intricate procedures. Lorikeet's architecture uses Intelligent Graph orchestration, where the LLM generates and supervises customer support workflows rather than making them up on the fly. This means every conversation produces a debuggable execution trace showing each decision node, each API call, and each constraint boundary.

For AI/ML teams, Lorikeet addresses the checklist directly. The graph-based architecture makes hallucination containment structural rather than probabilistic. The AI follows the same standard operating procedures that top human agents follow, constrained by the workflow graph. Integration is bidirectional: Lorikeet reads from and writes to your existing systems, executing refunds and account changes within your backend.

Lorikeet operates in regulated industries including healthcare and financial services, with data residency controls and audit trails for every conversation. The platform's defence-in-depth approach to accuracy uses multiple validation layers rather than a single guardrail filter.

What is Lorikeet?

Lorikeet is an AI customer support platform that acts as a universal concierge across chat, email, voice, and SMS. Unlike prompt-and-response chatbots, Lorikeet uses Intelligent Graph orchestration to follow structured workflows, making every decision auditable and every action constrained by defined boundaries. It processes refunds, updates accounts, manages billing, schedules appointments, and executes complex multi-step procedures by integrating directly with existing systems like Zendesk, Stripe, and internal APIs. For AI/ML teams, Lorikeet provides per-conversation execution traces, production observability, and the architectural transparency needed to pass technical due diligence. See how Lorikeet handles technical evaluation.

The evaluation matrix.

Compiling the checklist into a decision framework, here are the categories and the questions that separate production-grade platforms from demo-ready prototypes.

Architecture: Structured orchestration vs. prompt-and-pray. Debuggable execution traces. Workflow versioning.

Accuracy: Measured hallucination rates from production. Defence-in-depth validation layers. Regression testing.

Data: Regional data residency with documented processing locations. Sovereignty analysis. HIPAA BAA and GDPR DPA.

Security: SOC 2 Type II. AES-256 at rest. TLS 1.3 in transit. CMEK. Scoped API access.

Governance: Configurable response boundaries. Staging environments. Audit logs for every conversation.

Observability: Per-conversation traces. Semantic drift detection. Confidence scoring with escalation thresholds.

Integration: Bidirectional API access. Write operations. Graceful handling of API failures.

Cost: Transparent volume-based pricing. Model update inclusion. Total cost modeling.

Run every vendor through each category. The vendors who struggle built for the demo. The vendors who thrive built for the conversation after the demo, when your AI/ML team starts asking the hard questions.

Book a technical evaluation with Lorikeet's engineering team.