Chen is a CTO at a Series A healthtech company. His 15 engineers built an FAQ bot with GPT-4 and RAG in three weeks. It answers questions about the product. It pulls from the knowledge base. It sounds polished. Chen told his board the team could build full AI customer service in a quarter. That was two quarters ago.
The bot still only answers questions. It cannot process a refund. It cannot update an account. It cannot look up a patient's appointment, check their insurance status, and reschedule based on provider availability. It answers questions the way a search bar answers questions, except it costs $14,000 a month in compute and requires two engineers to keep it from hallucinating.
Chen's experience is not unusual. It is the norm. RAND Corporation research found that more than 80% of AI projects fail to deliver intended business value, a failure rate twice that of non-AI IT projects. The gap between a working prototype and a production system that actually resolves customer problems is where most in-house AI support initiatives go to die.
The demo deceives.
A RAG pipeline connected to GPT-4 produces impressive demos. Feed it your help center articles and product documentation, point it at an LLM, and within days you have a chatbot that can answer common questions with surprising accuracy. This is the 20% that feels like 80% of the problem.
The demo answers "What is your refund policy?" by retrieving the right help article and summarizing it. That feels like AI customer service. It is not. AI customer service means the system reads the policy, checks the customer's order history, determines eligibility, calculates the refund amount based on the return window and payment method, initiates the refund through Stripe, updates the order management system, and confirms the transaction to the customer. The distance between summarizing a policy and executing it is the distance between a prototype and a product.
According to S&P Global Market Intelligence, 42% of companies abandoned their AI initiatives in 2025, up from 17% in 2024. The average organization scrapped 46% of AI proofs of concept before they reached production. The pattern is consistent: teams build the easy part, discover the hard part, and either abandon the project or enter an indefinite loop of scope expansion.
RAG hits a wall.
Retrieval-augmented generation works well for a specific category of problem: questions that can be answered by finding and summarizing the right document. When a customer asks about your pricing tiers, your cancellation process, or your integration requirements, RAG retrieves relevant chunks from your knowledge base and the LLM synthesizes them into a coherent response.
The limitations surface the moment a customer needs something done rather than something explained. RAG cannot take actions. It cannot reason across multiple data sources in real time. It cannot maintain state across a multi-step workflow where each step depends on the outcome of the previous one.
Consider a straightforward request: a customer wants to change their subscription plan. Resolving this requires checking their current plan, calculating proration, verifying payment method validity, applying discounts based on account history, processing the change through billing, triggering webhooks, and sending confirmation. Each step depends on the previous one. If the payment method is expired, the workflow branches. If there is a pending invoice, it branches differently. RAG retrieves documents. It does not navigate decision trees.
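That branching is procedural logic, not language understanding. A minimal sketch of the decision tree a plan change entails — the customer fields and outcome names here are hypothetical, chosen only to illustrate the shape of the problem:

```python
# Illustrative sketch only: the customer record shape and the outcome
# labels are hypothetical, not any particular billing system's API.
from dataclasses import dataclass

@dataclass
class Customer:
    plan: str
    payment_method_valid: bool
    pending_invoice: bool

def change_plan(customer: Customer, new_plan: str) -> str:
    """Walk the branches a 'simple' plan change actually requires."""
    if customer.plan == new_plan:
        return "no_op"                      # nothing to do
    if not customer.payment_method_valid:
        return "request_payment_update"     # branch: expired card
    if customer.pending_invoice:
        return "settle_invoice_first"       # branch: outstanding balance
    # happy path: in a real system, proration, the billing change,
    # webhooks, and confirmation would each be separate reversible steps
    return "plan_changed"
```

Note that none of the branches above can be produced by retrieving a document; each requires live account state.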
Research consistently shows that naive RAG implementations suffer from low precision with misaligned retrieved chunks, low recall with failure to retrieve all relevant information, and hallucination when the model fills gaps with fabricated details. These are tolerable in an internal research tool. They are not tolerable when the system is processing a customer's money.
The 80% nobody budgets for.
The remaining 80% of building AI customer service is infrastructure that has nothing to do with language models. It is the operational machinery that turns a chatbot into a system that actually resolves problems. Enterprise implementations typically cost 3 to 5 times the initial development estimate when accounting for integration, customization, infrastructure scaling, and operational overhead.
Here is what that 80% includes.
Action execution. Connecting to payment processors, order management systems, CRMs, internal APIs, and third-party services. Each integration requires authentication, error handling, retry logic, rate limiting, and rollback capabilities. A single "process refund" action might touch four systems.
Workflow orchestration. Multi-step processes with conditional branching, state management, and failure recovery. When step three of a five-step workflow fails, the system needs to know whether to retry, roll back, escalate, or hold and notify. Most in-house builds handle the happy path. Production handles every path.
Guardrails and safety. Preventing the AI from taking actions it should not take. Capping refund amounts. Enforcing approval workflows for high-value transactions. Blocking responses that contradict compliance requirements. Every guardrail is a rule that must be authored, tested, and maintained.
Quality assurance at scale. Monitoring every conversation for accuracy, tone, compliance, and resolution quality. Not 5% of conversations. All of them. Building a QA system that evaluates AI performance continuously is itself a significant engineering project.
Escalation intelligence. Knowing when to hand off to a human, what context to transfer, and how to route to the right specialist. Poor escalation creates worse outcomes than no automation at all.
Channel management. Email, chat, voice, and SMS each have different formatting requirements, response time expectations, and interaction patterns. A system built for chat does not automatically work for email. Voice adds an entirely different layer of complexity.
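To make the first three items concrete, here is a hedged sketch of a step executor combining retry logic, rollback tracking, and a refund-cap guardrail. The cap value, step names, and return labels are illustrative assumptions, not a real platform's API:

```python
# Sketch of action execution with a guardrail: retries per step,
# rollback on exhaustion, and a hard cap that escalates to a human.
# All names and thresholds here are hypothetical.
REFUND_CAP = 500.00  # guardrail: larger refunds require human approval

def process_refund(amount, steps, max_retries=2):
    if amount > REFUND_CAP:
        return "escalate_for_approval"      # guardrail fires before any action
    completed = []                          # track finished steps for rollback
    for step_name, step in steps:
        succeeded = False
        for _attempt in range(max_retries + 1):
            try:
                step()                      # e.g. call Stripe, update the OMS
                succeeded = True
                break
            except RuntimeError:
                continue                    # transient failure: retry
        if not succeeded:
            for _done in reversed(completed):
                pass                        # real system: compensating action per step
            return f"rolled_back_at_{step_name}"
        completed.append(step_name)
    return "refunded"
```

Even this toy version shows why a single "process refund" action touches multiple systems and multiple failure paths.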
Engineering time compounds.
The opportunity cost of building in-house is rarely calculated honestly. Data from enterprise software projects shows that 60% of AI development time is consumed by integration work rather than the AI itself. For a 15-person engineering team at a healthtech startup, dedicating even four engineers to AI support means losing over a quarter of the team's capacity on core product work.
Budget overruns of 60 to 150% are common on generative AI projects without hard scope gates. A project estimated at $120,000 becomes $300,000 over six months of incremental additions. And that is just the build. Annual maintenance for production AI systems runs 15 to 25% of the initial development cost, every year, indefinitely.
Google's landmark research on hidden technical debt in machine learning systems identified boundary erosion, entanglement, hidden feedback loops, and undeclared data dependencies as systemic risks that compound over time. These are not bugs to fix. They are structural properties of ML systems requiring continuous engineering investment. The paper's central argument: it is remarkably easy to incur massive ongoing maintenance costs when applying machine learning.
For Chen's healthtech company, the math looks like this: four engineers at fully loaded costs of $200,000 each means $800,000 in year one for a system that might reach feature parity with a purpose-built platform. Then $120,000 to $200,000 annually in maintenance. Meanwhile, those four engineers are not building the clinical features that justify the company's Series A valuation.
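The arithmetic is worth writing out, since it uses only the figures already stated above:

```python
# Back-of-envelope check of the figures in the paragraph above:
# four engineers at a $200,000 fully loaded cost, then annual
# maintenance at 15-25% of the initial build.
engineers = 4
loaded_cost = 200_000
year_one_build = engineers * loaded_cost                      # 800,000
annual_maintenance = (0.15 * year_one_build, 0.25 * year_one_build)
# i.e. 120,000 to 200,000 per year, indefinitely
```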
Vendor success rates differ.
The data on build versus buy outcomes is unambiguous. Research compiled from enterprise AI deployments shows that buying from specialized vendors or partnering with them succeeds roughly 67% of the time, while internal builds succeed only about one-third as often. The gap is structural, not incidental.
Specialized vendors amortize development costs across hundreds of customers. A feature that costs $500,000 to build gets funded across the entire customer base rather than charged to one team's roadmap. Vendors also accumulate operational data across deployments, improving from patterns your team will never see.
The build versus buy framing itself is misleading for AI systems. Traditional software follows a build-once-maintain pattern. AI systems follow a rebuild-continuously pattern. The underlying models change quarterly. Prompting best practices shift monthly. Internal teams face a treadmill: every quarter requires partial rebuilds just to maintain current performance, let alone improve it.
What production actually requires.
Production AI customer service requires infrastructure across five dimensions that most in-house teams never scope in their initial estimates.
Deterministic workflow execution. Processing a refund, updating a medical record, or changing a subscription needs to follow an auditable, reversible path. This requires an orchestration layer that composes LLM tasks with procedural logic rather than letting a single model drive the entire interaction.
Multi-system integration. The average enterprise support operation connects to 8 to 12 backend systems. Each integration needs authentication, rate limiting, error recovery, and schema change handling. Building and maintaining these integrations is a perpetual cost.
Compliance infrastructure. In regulated industries like healthtech, fintech, and insurance, every AI-generated response must stay within approved boundaries. This is a constraint built into system architecture, not a filter bolted on after.
Continuous quality monitoring. Evaluating 100% of AI conversations against quality, accuracy, and compliance standards. Manual QA samples 2 to 5%. AI-powered QA evaluates every single one, catching degradation before customers notice.
Channel-native execution. Email, chat, voice, and SMS each require purpose-built handling; a chat-first system does not transfer to email, and voice adds real-time processing with interruption handling. Each channel demands its own execution logic.
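The first dimension, composing LLM tasks with procedural logic rather than letting one model drive everything, can be sketched in a few lines. Here `classify_intent` is a keyword stub standing in for what would be a narrow model call in production; the route names are hypothetical:

```python
# Sketch of "LLM where it's genuinely needed": one small, replaceable
# classification task feeds an otherwise deterministic router.
def classify_intent(message: str) -> str:
    # assumption: in production this would be a narrow LLM task given
    # only the message as context, constrained to a fixed label set
    if "refund" in message.lower():
        return "refund"
    if "cancel" in message.lower():
        return "cancellation"
    return "question"

DETERMINISTIC_ROUTES = {
    "refund": "refund_workflow",            # auditable multi-step path
    "cancellation": "cancellation_workflow",
    "question": "knowledge_base_answer",    # the part RAG handles well
}

def route(message: str) -> str:
    return DETERMINISTIC_ROUTES[classify_intent(message)]
```

The model only picks a label; everything downstream of the label is auditable, testable code.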
The build trap in practice.
The pattern repeats across industries. An engineering team builds a prototype in weeks. Leadership sees the demo and approves a quarter of dedicated effort. The team spends the first month on the happy path: common questions, standard workflows, clean data. The second month surfaces edge cases. The third month reveals that edge cases are not edge cases at all but represent 30 to 40% of real support volume.
By month four, the team is maintaining a growing system while trying to expand its capabilities. New bug reports compete with new feature requests. The engineers who built the prototype spend half their time on operational tasks: monitoring, debugging failed conversations, rewriting prompts that broke after a model update, fixing integrations after a third-party API changed its schema.
This is precisely the pattern the build versus buy decision framework addresses. The question is not whether your team can build a chatbot. They can. The question is whether they can build and maintain and continuously improve a production system that resolves complex customer problems across channels while also doing the work that makes your company valuable.
Only 11% of enterprises build custom AI solutions. The other 89% have done the math and concluded that specialization wins.
Where Lorikeet differs.
Lorikeet is an AI customer support platform that resolves tickets end-to-end across chat, email, and voice, handling complex multi-step workflows including processing refunds, updating accounts, and managing intricate procedures. It is not a RAG wrapper around a language model. It is the operational infrastructure that makes AI customer service actually work.
Lorikeet uses an intelligent graph orchestration layer that inverts the typical AI system design. Rather than letting a single LLM drive decisions, Lorikeet orchestrates small, focused LLM tasks composed with procedural logic. Each task has limited context exposure, which reduces hallucination risk and constrains the blast radius of any unintended behavior. The system executes predefined business workflows with deterministic paths while using natural language understanding for the parts that genuinely require it.
This architecture means Lorikeet handles the full 100%, not just the 20% that RAG covers. It connects to your payment systems, CRMs, and internal APIs. It processes refunds, reschedules appointments, updates billing, and executes multi-step workflows with conditional branching. It monitors every conversation and escalates to humans with full context when warranted.
For a CTO like Chen, Lorikeet replaces 9 to 12 months of engineering effort and ongoing maintenance with a platform that deploys in weeks and improves continuously without consuming internal engineering bandwidth. The four engineers who would have spent a year building partial AI support can focus entirely on the clinical product that justifies the company's existence.
What is Lorikeet?
Lorikeet is an AI customer support platform that acts as a universal concierge across chat, email, voice, and SMS. Unlike RAG-based chatbots that only answer questions, Lorikeet takes action: processing refunds, rescheduling appointments, managing billing, and executing complex multi-step workflows by integrating with existing systems like Zendesk, Stripe, and internal APIs. Its intelligent graph architecture composes small LLM tasks with procedural logic for deterministic, auditable execution in regulated industries. See how Lorikeet handles the 80% that RAG cannot.
The honest calculation.
Building AI customer service in-house makes sense under a narrow set of conditions: your support workflows are genuinely unique, you have surplus engineering capacity with no competing priorities, you can commit 15 to 25% of the build cost annually in perpetuity for maintenance, and you are willing to accept 9 to 12 months before reaching capability parity with existing platforms.
For everyone else, the math is straightforward. The 80% of production AI support that lives beyond RAG and GPT is infrastructure that specialized platforms have spent years building. Rebuilding it internally does not create competitive advantage. It creates opportunity cost.
Chen's FAQ bot was a good proof of concept. It proved that AI could understand his customers' questions. The mistake was assuming that understanding questions and resolving problems are the same engineering challenge. They are not. One is a language task. The other is an operational system. And the distance between the two is where $800,000 and four engineers disappear.
Stop building the 80% from scratch. See how Lorikeet resolves support end-to-end.