Latency in AI can make or break CX

Minh Le | Sep 30, 2025

AI customer support was supposed to deliver instant responses, but many customers still find themselves waiting for answers. In high-stakes support situations, like when someone's card is declined or their account is locked, those delays don't just inconvenience customers; they actively damage trust in your business.

Latency is the time between a user making a request and receiving a response. The stakes for support latency are different from casual chatbot interactions because people contact support when something is broken and needs fixing. As AI becomes mainstream, customers increasingly expect the fast, streaming responses they get from ChatGPT and other consumer AI products. This creates a new baseline expectation: being accurate isn't enough; the response also needs to be fast.

The expectation problem is particularly acute in voice support, where dead air and slow responses create anxiety that can escalate customer frustration before you even attempt to help them. Poor latency contributes to what we call "AI aversion". Even when the AI successfully solves the customer's problem, slow response times make customers more likely to demand a human agent because they assume something has gone wrong when they don't understand what's happening in the background.

Given how much it matters to the experience of end customers, latency has been and remains a major focus at Lorikeet. We've achieved some remarkable results: p50 chat latency under 5 seconds, and sub-1-second voice latency despite multi-step LLM processing. The key thing we've learned is that achieving low latency is an ongoing process rather than a one-off project. The rest of this post unpacks the lessons that inform that process.

Real vs perceived latency

When we talk about latency in customer support, we're actually talking about two different things.

Real latency is the time from when a customer sends a message to when your system generates a response. In AI systems, LLM calls dominate everything else, so they have the biggest impact on response times. You can optimize your database queries and API calls all you want, but if your LLM calls are orders of magnitude slower, those millisecond improvements won't matter.

Perceived latency is what the customer actually experiences. It's not just how long it takes to generate a response; it's whether the customer can tell something is happening. Do they see signs of life? Are there loading indicators? How long before they get any feedback that their message was received?

Managing perceived latency gives you room to make quality improvements: as you spend more time crafting better responses and handling complex issues, perceived latency techniques keep the experience feeling responsive.
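To make the distinction concrete, here's a minimal sketch of how you might measure both numbers for a streaming LLM call: real latency as the time to the full response, and time to first token as a rough proxy for when the customer first sees signs of life. It assumes an OpenAI-style streaming client; the model name and prompt are placeholders, not our production setup.

```python
# Minimal sketch: measuring real latency and time-to-first-token (TTFT)
# for a streaming LLM call. Assumes an OpenAI-style streaming client;
# the model name and prompt are placeholders.
import time
from openai import OpenAI

client = OpenAI()

def timed_completion(prompt: str) -> dict:
    start = time.monotonic()
    first_token_at = None
    chunks = []

    stream = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            if first_token_at is None:
                first_token_at = time.monotonic()  # first "sign of life" the customer sees
            chunks.append(delta)

    end = time.monotonic()
    return {
        "ttft_s": (first_token_at or end) - start,  # what perceived latency hinges on
        "total_s": end - start,                     # real latency of the full answer
        "text": "".join(chunks),
    }
```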

Latency trade-offs: pick your poison

Every latency optimization comes with trade-offs. You're constantly balancing four factors: latency, quality, cost, and reliability. Pull one corner out, and you have to sacrifice somewhere else.

Want it faster? You might need to use smaller, less capable models. Want higher quality responses? That probably means more processing time. Need rock-solid reliability? Fast providers often have less predictable availability.

This isn't a one-time architecture decision. It's an ongoing choice you make for every workflow, every model call, every feature. The key is matching the tool to the job: for complex, heavy jobs you may need to sacrifice real latency to protect quality and rely on perceived latency strategies to maintain a good customer experience, while other use cases can be simplified and optimized to reduce real latency directly.

Decisions we've made to reduce latency

Reducing latency in AI support requires both technical architecture choices and experience design decisions. Here are the key approaches we've taken at Lorikeet.

Break complex tasks into smaller, optimized components

Instead of sending every customer query to the largest, most capable model, we break workflows into discrete tasks that can run on smaller, faster models. For example, when a customer sends a vague message like "help me," we use a lightweight model to quickly classify whether they need disambiguation prompts before routing to our main conversation engine. Similarly, our message guardrails (subscriber-defined rules like "escalate if customer mentions topic X") run on fast, smaller models because these binary classification tasks don't require the reasoning capabilities of frontier models.

This architectural approach lets us reserve the expensive, slower models for tasks that actually need their full capabilities while handling routine operations with models that respond 2x to 10x faster.
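As an illustration of the pattern (not our actual prompts or models), a cheap binary classifier can gate whether a message needs disambiguation before anything reaches the larger model:

```python
# Illustrative sketch of task-splitting: a small, fast model handles a binary
# disambiguation check, and only messages that need real reasoning reach the
# larger model. Model names and prompts are placeholders, not our production setup.
from openai import OpenAI

client = OpenAI()

FAST_MODEL = "gpt-4o-mini"   # placeholder: small, fast model
LARGE_MODEL = "gpt-4o"       # placeholder: frontier model

def needs_disambiguation(message: str) -> bool:
    """Binary classification on the cheap model: is this message too vague to act on?"""
    resp = client.chat.completions.create(
        model=FAST_MODEL,
        messages=[
            {"role": "system",
             "content": "Reply YES if the customer message is too vague to act on, otherwise NO."},
            {"role": "user", "content": message},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def handle_message(message: str) -> str:
    if needs_disambiguation(message):
        return "Happy to help! Could you tell me a bit more about what you need?"
    # Only messages that need full reasoning pay the latency cost of the large model.
    resp = client.chat.completions.create(
        model=LARGE_MODEL,
        messages=[{"role": "user", "content": message}],
    )
    return resp.choices[0].message.content
```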

Prioritize speed over cost when it matters

We deliberately sacrifice cost efficiency to meet latency requirements by issuing multiple concurrent requests and sampling multiple responses for the same task. For latency-critical channels like voice and live chat, we race requests across providers and choose the fastest response. This approach costs more than sending a single request, but the customer experience improvement justifies the additional expense.
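A rough sketch of the racing pattern, with stubbed-out provider calls standing in for real SDKs:

```python
# Rough sketch of racing one request across two providers and using whichever
# responds first. The provider calls are stand-in stubs (simulated with sleeps);
# in practice they would wrap real provider SDKs.
import asyncio
import random

async def call_provider_a(prompt: str) -> str:
    await asyncio.sleep(random.uniform(0.3, 1.5))  # stand-in for a real API call
    return f"[provider A] answer to: {prompt}"

async def call_provider_b(prompt: str) -> str:
    await asyncio.sleep(random.uniform(0.3, 1.5))  # stand-in for a real API call
    return f"[provider B] answer to: {prompt}"

async def race_providers(prompt: str, timeout_s: float = 5.0) -> str:
    tasks = [
        asyncio.create_task(call_provider_a(prompt)),
        asyncio.create_task(call_provider_b(prompt)),
    ]
    done, pending = await asyncio.wait(
        tasks, timeout=timeout_s, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # both requests are paid for; only the fastest is used
    if not done:
        raise TimeoutError("no provider met the latency budget")
    return next(iter(done)).result()

# asyncio.run(race_providers("Why was my card declined?"))
```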

Optimize provider mix based on channel requirements

We maintain relationships with every major model provider specifically to avoid latency bottlenecks. Voice and chat conversations only use our fastest models on our most reliable providers, while email support gets routed to whatever capacity is available, since response time expectations are different. Some competitors have exclusive partnerships with a single LLM provider, which limits their ability to route traffic based on performance characteristics or to leverage faster open-source alternatives when appropriate.
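Conceptually, the routing looks like a small per-channel priority table. The provider and model names below are placeholders, not our actual mix:

```python
# Illustrative per-channel routing table. Provider and model names are
# placeholders: latency-critical channels get the fastest, most reliable
# options, while email can fall back to whatever capacity is available.
ROUTING_TABLE = {
    # channel: candidate (provider, model) pairs in priority order
    "voice": [("provider_fast", "small-fast-model")],
    "chat":  [("provider_fast", "small-fast-model"),
              ("provider_alt", "small-fast-model")],
    "email": [("provider_alt", "large-model"),
              ("provider_fast", "large-model"),
              ("provider_open", "open-weights-model")],
}

def providers_for(channel: str) -> list[tuple[str, str]]:
    """Return candidate providers in priority order for a given channel."""
    return ROUTING_TABLE.get(channel, ROUTING_TABLE["email"])
```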

Design for perceived latency, not just actual latency

We recognize that latency is fundamentally an experience problem, not just a technical one. Our voice agents include subtle background office sounds during processing to avoid dead air that makes customers think something has broken. In chat, we show "thinking events" that indicate what the agent is currently doing—reading specific help articles, checking account details, or processing a request. These techniques help customers understand that work is happening even when they're waiting for a response.
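A simplified sketch of the chat version: the handler yields thinking events to the client as it works, with the lookup and drafting steps stubbed out for illustration.

```python
# Simplified sketch of chat "thinking events": the handler yields progress
# events to the client while slower work runs. The lookup and drafting steps
# are illustrative stubs, not our real pipeline.
import asyncio

async def lookup_account(customer_id: str) -> dict:
    await asyncio.sleep(1.0)  # stand-in for a CRM / billing lookup
    return {"id": customer_id, "status": "locked"}

async def draft_reply(account: dict) -> str:
    await asyncio.sleep(2.0)  # stand-in for the main LLM call
    return "I can see your account is locked. Here's how we'll get it unlocked..."

async def handle_with_thinking_events(customer_id: str):
    yield {"type": "thinking", "text": "Checking your account details..."}
    account = await lookup_account(customer_id)
    yield {"type": "thinking", "text": "Reading the relevant help articles..."}
    reply = await draft_reply(account)
    yield {"type": "message", "text": reply}
```

Each yielded event would be streamed to the chat widget as it happens, so perceived latency improves even though real latency is unchanged.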

The goal isn't to eliminate all wait times, which is impossible, but to manage customer expectations so that necessary delays feel purposeful rather than frustrating.

Design your way out of latency problems

Some of the most effective latency optimizations happen at the design level, not the technical level.

Simpler workflows are faster workflows. Tightly scoped processes with fewer edge cases require less reasoning time and fewer decision branches. Every special case you handle adds potential latency. But if you can solve the customer's problem, and you use perceived latency techniques to show that work is happening in the background, any delay is much more likely to be tolerated.

Design semantic tools, not waterfall APIs. Instead of separate calls to get customer info, then profile, then linked accounts, then balance, create one semantic tool that takes a customer ID and returns everything you need. 
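A sketch of what that can look like, with the underlying fetches stubbed out: the agent makes one tool call, and the fan-out happens in parallel behind it.

```python
# Sketch of a semantic tool: one call that fans out to the underlying endpoints
# in parallel and returns everything the agent needs about a customer. The
# fetch_* helpers are illustrative stubs for your real APIs.
import asyncio

async def fetch_profile(customer_id: str) -> dict:
    await asyncio.sleep(0.2)  # stand-in for a real API call
    return {"name": "Sam", "tier": "premium"}

async def fetch_linked_accounts(customer_id: str) -> list[str]:
    await asyncio.sleep(0.2)
    return ["acct_123", "acct_456"]

async def fetch_balance(customer_id: str) -> float:
    await asyncio.sleep(0.2)
    return 142.50

async def get_customer_context(customer_id: str) -> dict:
    # One round trip from the agent's point of view; the fan-out happens here,
    # in parallel, instead of as sequential tool calls in the LLM loop.
    profile, accounts, balance = await asyncio.gather(
        fetch_profile(customer_id),
        fetch_linked_accounts(customer_id),
        fetch_balance(customer_id),
    )
    return {"profile": profile, "linked_accounts": accounts, "balance": balance}
```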

Preload everything you can. When a customer opens your chat widget or calls your support line, immediately pull their account information in the background. Don't make them wait while you "pull up their account"; have it ready before they even ask their question.
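A minimal sketch of the idea, with the context fetch stubbed out:

```python
# Minimal sketch of preloading: the context fetch starts the moment the chat
# session opens, so it's usually resolved before the first question arrives.
# get_customer_context is a stub standing in for the semantic tool above.
import asyncio

async def get_customer_context(customer_id: str) -> dict:
    await asyncio.sleep(0.5)  # stand-in for the real lookup
    return {"customer_id": customer_id, "profile": {"name": "Sam"}}

class ChatSession:
    """Constructed inside the async handler that serves the chat widget."""

    def __init__(self, customer_id: str):
        # Kick off the lookup immediately, before any message is sent.
        self._context_task = asyncio.create_task(get_customer_context(customer_id))

    async def context(self) -> dict:
        # Usually already resolved by the time the customer finishes typing.
        return await self._context_task

async def demo():
    session = ChatSession("cust_42")
    await asyncio.sleep(2.0)        # the customer types their first message
    return await session.context()  # returns immediately; the data is already there

# asyncio.run(demo())
```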

What this means for your business

Latency optimization isn't a one-time engineering project you complete and move on from. It requires ongoing decisions about trade-offs between speed, accuracy, cost, and reliability based on what matters most to your customers and your business.

The fundamentals won't change even as models get faster and cheaper, because we're simultaneously asking AI to handle more complex reasoning tasks with larger context windows. The optimization challenge moves but never disappears.

Questions to ask your vendor:

  • Do you have multiple LLM providers, and how do you think about their reliability?

  • How do you handle spikes in my support volume?

  • Do you offer latency SLAs or just availability SLAs?

  • How do you handle capacity constraints and request queuing?

  • What does your latency breakdown look like—LLM calls versus other processing?

  • What are your strategies for managing perceived latency?

  • What latency metrics do you actually measure and report on?

    • p95 latency (not just average response time)

    • Time to first token (TTFT)

    • Time to first response (TTFR)

    • Time to first audio byte (TTFAB) for voice

The vendors who give you the most honest answers about their latency trade-offs are typically the ones whose systems will perform best under the messy, unpredictable conditions of actual customer support operations. Most vendors won't discuss these complexities because it's easier to promise instant responses without acknowledging the engineering reality behind reliable performance.

Understanding these trade-offs upfront helps you set realistic expectations with your team and customers, and choose a solution that can actually deliver the experience you're promising rather than falling apart when you need it most.

Want to see how we handle latency optimization at Lorikeet? We're always happy to walk through our technical approach and discuss what trade-offs make sense for your specific use case.
