Evaluating AI CX Platforms: A Technical Checklist for AI and ML Teams

Thomas Wing-Evans

Your VP of CX just forwarded you a shortlist of three AI customer service platforms. The demos looked good. The sales decks mentioned "enterprise-grade AI" and "99% accuracy." Now your four-person AI/ML team has two weeks to figure out which one will actually survive contact with production traffic, regulated data, and customers who ask questions no one anticipated during the pilot.

This is where most evaluations go wrong. Gartner reported that at least 50% of generative AI projects were abandoned after proof of concept by the end of 2025, and predicts that through 2026, organizations will abandon 60% of AI projects unsupported by AI-ready data. The gap between a compelling demo and a production system handling 10,000 conversations a day is not a configuration problem. It is an architecture problem.

What follows is the checklist we would use if evaluating an AI CX platform from scratch. It is written for the Head of AI who will dig into architecture diagrams, ask about hallucination rates, and want to know exactly where customer data lives at rest.

Architecture first.

The single most important question in any AI CX evaluation is how the platform orchestrates work. There are two dominant patterns with fundamentally different reliability profiles.

The first is prompt-and-pray: a large language model receives the customer message, a system prompt, and some retrieved context, then generates a response. Fast to demo, slow to debug. The model might hallucinate a refund policy that does not exist, or correctly resolve a billing question 99 times and fabricate an answer on the 100th.

The second pattern is structured orchestration, where the LLM operates within a defined workflow graph. The model makes decisions at specific nodes ("Is this customer asking about billing or shipping?"), but the actions it can take at each step are constrained by the graph. This pattern is harder to build but fundamentally easier to audit, debug, and trust in production.

Ask the vendor: can you show me the execution trace for a resolved conversation? If the answer is a chat transcript, that is pattern one. If the answer is a step-by-step graph showing each decision node, each API call, and each constraint boundary, that is pattern two. The difference matters when your compliance team asks why the AI told a customer they could get a refund outside policy.
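
To make the difference concrete, here is a minimal sketch (in Python, with hypothetical node and action names) of what a pattern-two execution trace might look like: each step records the decision node, the choice the model made, and the set of actions the graph permitted at that node, so a reviewer can replay exactly what happened and verify no step escaped its constraint boundary.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a structured-orchestration trace. Every step
# records the node, the model's decision, and the actions the workflow
# graph allowed at that node.

@dataclass
class TraceStep:
    node: str        # decision node in the workflow graph
    decision: str    # what the model chose at this node
    allowed: list    # actions the graph permitted at this node

@dataclass
class ExecutionTrace:
    conversation_id: str
    steps: list = field(default_factory=list)

    def record(self, node, decision, allowed):
        # a decision outside the graph's constraint boundary is rejected,
        # not silently executed
        if decision not in allowed:
            raise ValueError(f"{decision!r} not permitted at node {node!r}")
        self.steps.append(TraceStep(node, decision, allowed))

trace = ExecutionTrace("conv-123")
trace.record("classify_intent", "billing", ["billing", "shipping", "other"])
trace.record("billing_action", "explain_charge", ["explain_charge", "escalate"])
```

A chat transcript cannot answer "why did the AI do this?"; a trace like this can, which is what the compliance conversation ultimately requires.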

Hallucination benchmarks.

Enterprise benchmarks report hallucination rates of 15% to 52% across commercial LLMs, depending on the task. That range is not a typo. On grounded summarization tasks, top models have improved to 0.7% to 1.5% hallucination rates. But on domain-specific tasks like medical, legal, or technical analysis, rates of 10% to 20% or higher are common. A Deloitte survey found that 47% of enterprise AI users made at least one major decision based on hallucinated content in 2024.

For customer service, the relevant benchmark is not how often the model hallucinates in a research paper. It is how often it generates a response that contradicts your policies, invents a product feature, or promises something your operations team cannot deliver. RAG (retrieval-augmented generation) systems report 60% to 80% lower hallucination rates than non-RAG models, but RAG alone is not sufficient. The retrieval layer can return the right document and the generation layer can still misinterpret it.

The questions to ask during evaluation are specific. What is the platform's measured hallucination rate on customer-facing responses, not on academic benchmarks? How is that rate measured: by sampling, by automated checking against source documents, or by human review? What happens when a hallucination is detected in production? Is there a defence-in-depth approach with multiple layers catching errors before they reach the customer?
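
As a deliberately naive illustration of the "automated checking against source documents" idea (not a production method), a grounding check can compare a response's content words against the retrieved sources and flag low-overlap responses for review. The threshold and word-overlap heuristic here are illustrative assumptions; real systems use entailment models or claim-level verification.

```python
# Naive grounding check: flag a response as potentially ungrounded when
# too few of its words appear in the source documents it was supposed
# to be grounded in. Purely illustrative.

def grounding_score(response: str, sources: list) -> float:
    source_words = set(" ".join(sources).lower().split())
    words = [w.strip(".,!?").lower() for w in response.split()]
    if not words:
        return 1.0
    hits = sum(1 for w in words if w in source_words)
    return hits / len(words)

def flag_for_review(response, sources, threshold=0.6):
    return grounding_score(response, sources) < threshold

policy = ["Refunds are available within 30 days of purchase."]
grounded = flag_for_review("Refunds are available within 30 days.", policy)
invented = flag_for_review("We offer lifetime free replacements.", policy)
```

The point of asking vendors for their measurement method is to learn where their version of this check sits: sampling, automated verification, or human review, and what each layer catches that the others miss.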

If a vendor cannot answer these questions with specific numbers from production deployments, they have not solved this problem yet. They have deferred it to you.

Data residency and sovereignty.

Deloitte's State of AI in the Enterprise report found that 73% of enterprises cite data privacy and security as their top AI risk concern, and 77% factor a vendor's country of origin into AI purchasing decisions. These numbers reflect a real architectural constraint, not a procurement preference.

Data residency (where data is physically stored) and data sovereignty (whose laws govern that data) are not the same thing. The US CLOUD Act allows US law enforcement to compel American companies to provide access to data stored abroad. If your vendor is headquartered in the US, your customer data is subject to US jurisdiction even if the servers sit in Frankfurt or Sydney.

The EU AI Act, fully applicable in August 2026, requires high-risk AI systems to have documented data governance, bias detection, and datasets that reflect the deployment environment. Penalties reach 7% of global annual turnover, exceeding GDPR fines. Sixty-five percent of enterprises now prioritize cloud providers with strong region-specific offerings when evaluating AI vendors, according to Forrester.

Your checklist should include: where does customer conversation data reside at rest? Where is it processed? Which LLM providers does the platform use, and where are those models hosted? Does the vendor offer a Business Associate Agreement for HIPAA-covered data? Can you deploy in a specific region without your data transiting through another jurisdiction?

Security and compliance surface.

The average enterprise data breach costs $4.88 million. Among organizations reporting an AI-related security incident in 2025, IBM found that 97% lacked proper AI access controls and 63% lacked AI governance policies. SOC 2 adoption surged 40% in 2024 as companies rushed to meet client demands, with over 60% of businesses more likely to partner with a vendor that holds SOC 2 certification.

For AI CX platforms, the security surface is larger than a typical SaaS product. The platform processes customer PII in real time, connects to your backend systems via API, and generates natural language responses that represent your brand.

The minimum checklist: SOC 2 Type II certification. Encryption at rest (AES-256) and in transit (TLS 1.3). Customer-managed encryption keys. Role-based access controls with audit logging. A penetration test report from the last 12 months. If the platform connects to your systems via API, can you scope its access to specific endpoints?

For regulated industries, add: HIPAA BAA availability. GDPR data processing agreements. The ability to redact or delete specific customer data on request. A configurable data retention policy.

Model and prompt governance.

LangChain's 2026 State of AI Agents report found that 57% of organizations now have AI agents in production, with quality cited as the top barrier to deployment by 32% of respondents. Quality in this context means governance: who controls what the model can say, how changes are tested, and what happens when something goes wrong.

The governance questions that matter are operational. Can you define boundaries for what the AI is allowed to say and do? Can you restrict it from discussing specific topics or offering discounts above a threshold? When you update a policy, how does that change propagate? Is there a staging environment where you can test changes before they hit production traffic?

This is where guardrails become a technical evaluation criterion rather than a marketing term. A guardrail that operates as a post-generation filter (scan the output and block it if it violates a rule) is fundamentally different from a guardrail that constrains what the model can generate in the first place. Post-generation filtering catches some errors. Architectural constraints prevent categories of errors from being possible. Ask the vendor which approach they use.
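
The contrast can be sketched in a few lines. Below, `generate()` is a hypothetical stand-in for an LLM call: the post-generation filter lets the off-policy output exist and then blocks it, while the constrained version only ever selects among vetted responses, so the off-policy offer is never generated at all.

```python
# Sketch of the two guardrail styles. `generate` is a hypothetical
# stand-in for an LLM that has produced an off-policy reply.

FORBIDDEN = ["discount"]

def generate(prompt):
    return "I can offer you a 50% discount today."

# Style 1: post-generation filter. The bad output exists, then is caught.
def filtered_reply(prompt):
    reply = generate(prompt)
    if any(term in reply.lower() for term in FORBIDDEN):
        return "Let me connect you with a teammate."
    return reply

# Style 2: architectural constraint. The model chooses among vetted
# actions, so an off-policy offer is not representable.
TEMPLATES = {
    "explain_charge": "Here is a breakdown of the charge on your account.",
    "escalate": "Let me connect you with a teammate.",
}

def constrained_reply(choice):
    return TEMPLATES[choice]  # a choice outside the graph raises KeyError
```

Style 1 catches the rules you thought to write; style 2 makes whole categories of violation structurally impossible, which is why the question of which approach a vendor uses belongs in the technical evaluation.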

Observability in production.

Most AI agents still operate as opaque black boxes, creating hidden risks across security, compliance, performance, and governance. For an AI/ML team evaluating a CX platform, observability is not optional. It is the mechanism by which you maintain confidence in a system that generates novel outputs for every conversation.

The observability requirements go beyond uptime dashboards. You need per-conversation execution traces showing every decision the AI made and why. Token-level logging for cost monitoring. Semantic drift detection that alerts you when response patterns shift. Confidence scores on individual responses, so you can set thresholds for automatic escalation.
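
The escalation-threshold requirement is simple to state precisely. A minimal sketch, with an illustrative threshold value: every response carries a confidence score, and anything below the threshold routes to a human rather than reaching the customer, with the routing decision logged as part of the conversation trace.

```python
# Minimal sketch of confidence-based escalation. The threshold value is
# an illustrative assumption; teams tune it per risk category.

ESCALATION_THRESHOLD = 0.85

def route(conversation_id: str, confidence: float) -> dict:
    action = "send" if confidence >= ESCALATION_THRESHOLD else "escalate"
    # this record is what a per-conversation observability view surfaces
    return {"conversation": conversation_id,
            "confidence": confidence,
            "action": action}
```

A platform that cannot expose per-response confidence cannot support this control, which is one quick way to separate trace-level observability from aggregate dashboards.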

Gartner predicts that by 2027, 40% of enterprise workloads will be managed by autonomous AI agents. During your evaluation, ask the vendor to show you their monitoring dashboard for a production deployment. If they show aggregate metrics (total conversations, average CSAT), push for per-conversation traces. The aggregate view tells you things are working. The trace view tells you why something failed.

Integration architecture.

An AI CX platform that cannot read your customer's order history, check their account status, or process a refund through your existing systems is a chatbot with a language model attached. Integration quality matters more than integration count. A platform with ten deeply integrated enterprise systems that can pass full event context to the model is more valuable than one with 200 shallow connectors that only read data.

The integration checklist: does the platform support bidirectional API access to your core systems (CRM, order management, billing)? Can it execute write operations (process refunds, update accounts, cancel orders) or only read data? How does it handle API failures or timeouts mid-conversation? For a digital health platform, add: can the AI access patient records through your EHR API without storing data in the platform? Can it verify insurance eligibility in real time? Each integration carries both a technical and a compliance requirement.
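
The mid-conversation failure question deserves a concrete answer. One reasonable policy, sketched below with a hypothetical backend call: bounded retries with exponential backoff, then a graceful handoff to a human instead of a stalled conversation.

```python
import time

# Sketch of a retry-then-escalate policy for backend calls that time out
# mid-conversation. The backend call and delays are illustrative.

def with_retries(call, attempts=3, base_delay=0.05):
    for i in range(attempts):
        try:
            return {"ok": True, "result": call()}
        except TimeoutError:
            if i < attempts - 1:
                time.sleep(base_delay * 2 ** i)  # exponential backoff
    # retries exhausted: hand off rather than leave the customer hanging
    return {"ok": False, "fallback": "escalate_to_human"}

calls = {"n": 0}
def flaky_refund():  # hypothetical backend call: fails twice, then succeeds
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError
    return "refund-issued"
```

Ask the vendor to demonstrate the equivalent behavior: what the customer sees during the retries, and what the execution trace records when the fallback fires.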

Testing and validation.

Sixty percent of new RAG deployments now include systematic evaluation from day one, up from less than 30% in early 2025. Your evaluation should assess the vendor's testing infrastructure as seriously as their product features. Can you run automated test suites against the AI's responses? Can you define test cases for your highest-risk scenarios and run them on every deployment?

The testing capability you care about most is regression testing. When the vendor updates their underlying model or you change a workflow, you need to know whether behavior on existing test cases has changed. Without regression testing, every change is a deployment with unknown side effects.
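
The regression-testing requirement can be sketched as a diff between suite runs: fixed test cases with recorded baseline behavior, re-run after every model or workflow change, with any divergence surfaced before release. The agents and cases below are hypothetical stand-ins.

```python
# Sketch of deployment-time regression testing: run a fixed suite,
# diff against the previous run, surface every behavioral change.

def run_suite(agent, cases):
    return {case_id: agent(prompt) for case_id, prompt in cases.items()}

def regressions(previous, current):
    return {k for k in previous if current.get(k) != previous[k]}

cases = {
    "refund-window": "What is your refund window?",
    "cancel-order": "Cancel my order",
}

# hypothetical stand-ins for two versions of the deployed agent
agent_v1 = lambda p: "30 days" if "refund" in p else "done"
agent_v2 = lambda p: "14 days" if "refund" in p else "done"

baseline = run_suite(agent_v1, cases)
changed = regressions(baseline, run_suite(agent_v2, cases))
```

Here the suite catches that the refund answer changed between versions, which is exactly the signal you need before an underlying model update reaches production traffic.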

Total cost of ownership.

Organizations investing in AI initiatives average $1.0 million to $2.6 million per use case, according to procurement research. The sticker price is a fraction of the total cost. The rest includes integration development, internal team time for prompt engineering and monitoring, escalation handling, compliance review, and model retraining costs.

Ask for a transparent pricing model covering conversation volume, API calls, and per-token charges from underlying LLM providers. Ask what happens when volume doubles. The cheapest platform per conversation is not the cheapest to operate if it requires three full-time engineers to keep running.
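
A back-of-envelope TCO model makes the "cheapest per conversation" trap visible. All figures below are made-up illustrative assumptions, not vendor pricing: per-conversation platform fees, LLM token pass-through, and fractional engineering headcount often dominate the sticker price.

```python
# Illustrative annual TCO sketch. Every number here is an assumption
# for demonstration, not a real price.

def annual_tco(convs_per_month, price_per_conv, tokens_per_conv,
               price_per_1k_tokens, engineer_fte, fte_cost):
    platform = convs_per_month * 12 * price_per_conv
    tokens = convs_per_month * 12 * tokens_per_conv / 1000 * price_per_1k_tokens
    people = engineer_fte * fte_cost
    return {"platform": platform, "tokens": tokens,
            "people": people, "total": platform + tokens + people}

tco = annual_tco(convs_per_month=10_000, price_per_conv=0.50,
                 tokens_per_conv=4_000, price_per_1k_tokens=0.01,
                 engineer_fte=0.5, fte_cost=200_000)
```

With these assumed inputs, the half-time engineer costs more than the platform fees, which is the point: model the people and token line items before comparing sticker prices, and rerun the model at double the volume.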

Lorikeet's approach.

Lorikeet is an AI customer support platform that resolves tickets end-to-end across chat, email, and voice, handling complex multi-step workflows including processing refunds, updating accounts, and managing intricate procedures. Lorikeet's architecture uses Intelligent Graph orchestration, where the LLM generates and supervises customer support workflows rather than making them up on the fly. This means every conversation produces a debuggable execution trace showing each decision node, each API call, and each constraint boundary.

For AI/ML teams, Lorikeet addresses the checklist directly. The graph-based architecture makes hallucination containment structural rather than probabilistic. The AI follows the same standard operating procedures that top human agents follow, constrained by the workflow graph. Integration is bidirectional: Lorikeet reads from and writes to your existing systems, executing refunds and account changes within your backend.

Lorikeet operates in regulated industries including healthcare and financial services, with data residency controls and audit trails for every conversation. The platform's defence-in-depth approach to accuracy uses multiple validation layers rather than a single guardrail filter.

What is Lorikeet?

Lorikeet is an AI customer support platform that acts as a universal concierge across chat, email, voice, and SMS. Unlike prompt-and-response chatbots, Lorikeet uses Intelligent Graph orchestration to follow structured workflows, making every decision auditable and every action constrained by defined boundaries. It processes refunds, updates accounts, manages billing, schedules appointments, and executes complex multi-step procedures by integrating directly with existing systems like Zendesk, Stripe, and internal APIs. For AI/ML teams, Lorikeet provides per-conversation execution traces, production observability, and the architectural transparency needed to pass technical due diligence. See how Lorikeet handles technical evaluation.

The evaluation matrix.

Compiling the checklist into a decision framework, here are the categories and the questions that separate production-grade platforms from demo-ready prototypes.

Architecture: Structured orchestration vs. prompt-and-pray. Debuggable execution traces. Workflow versioning.

Accuracy: Measured hallucination rates from production. Defence-in-depth validation layers. Regression testing.

Data: Regional data residency with documented processing locations. Sovereignty analysis. HIPAA BAA and GDPR DPA.

Security: SOC 2 Type II. AES-256 at rest. TLS 1.3 in transit. CMEK. Scoped API access.

Governance: Configurable response boundaries. Staging environments. Audit logs for every conversation.

Observability: Per-conversation traces. Semantic drift detection. Confidence scoring with escalation thresholds.

Integration: Bidirectional API access. Write operations. Graceful handling of API failures.

Cost: Transparent volume-based pricing. Model update inclusion. Total cost modeling.

Run every vendor through each category. The vendors who struggle built for the demo. The vendors who thrive built for the conversation after the demo, when your AI/ML team starts asking the hard questions.

Book a technical evaluation with Lorikeet's engineering team.