It's a Tuesday afternoon. A customer calls from a landline in a noisy office to dispute an insurance claim they can barely remember the details of. The voice AI agent has to hear them correctly before it can do anything else.
Voice AI is built on a stack of three technologies:
Speech-to-text (STT), or transcription, models convert audio to text.
Large language models (LLMs) process the text, interpret the intent, and decide what to do.
Text-to-speech (TTS) models convert the response text back into audio.
Most of the industry's attention goes to the middle layer, the LLM. That's where reasoning happens, where tool calls are made, and where hallucinations get caught. But if the text that reaches the LLM is wrong, everything downstream gets muddled too. The model reasons perfectly from bad input.
Streaming is a different game
STT models offer two modes: batch and streaming. Batch transcription processes the full audio file at once and achieves the word error rates you see in vendor benchmarks. Deepgram Nova-3 reports 5.26% word error rate (WER) on batch transcription. AssemblyAI Universal-3 Pro reports 5.6%.
Voice AI doesn't use batch. It uses streaming, because the agent needs to process audio in real time. Streaming STT works on small chunks (sometimes as little as 100ms) and it can't look ahead. The model might transcribe "I would like to" before hearing that the full phrase is "I would like two items." The same models that hit 5% on batch benchmarks jump to 7–8% on streaming. Deepgram Nova-3 goes from 5.26% to 6.84%.
If you need high-accuracy transcripts for analytics or quality assurance, it is worth transcribing the recording again using a batch model. The real-time transcript is optimised for speed, not precision.
Turn off formatting
STT models can format output for human readability, rendering "one hundred and twenty three" as "123." This sounds helpful, but it introduces latency that compounds across a conversation.
Consider "forty two dollars." The formatted output is "$42," but the dollar sign appears before the number, and "dollars" is spoken last. The model has to wait for the full phrase before it can commit to any transcription at all. Multiply that across every number, date, and currency in a support call, and the conversation starts to feel stilted.
Consider disabling formatting and allowing LLMs to handle it downstream. They can understand "forty two dollars" as a dollar amount just fine.
Adjust the keywords dynamically
STT models are trained on general speech: podcasts, videos, conversational corpora built to generalise. The vocabulary distribution on an insurance call looks nothing like a podcast. "Deductible" comes up constantly. "Third party property damage" is a coverage type. A customer reads their policy number off a letter and it's a string of letters and digits the model has never seen before.
Keyterm prompting adjusts the model's probability weighting for specific words. Tell it to expect "excess," "windscreen," and "third party property damage," and its accuracy on those terms measurably improves. Deepgram reports 5–15 percentage point WER gains when adding domain vocabulary. This isn't a workaround, it's the intended mechanism. The gap is that most teams configure a static list once and stop there.
We configure keywords dynamically. Before a phone call connects, we pull the customer's profile, extract the terms likely to appear (their name, their policy type, their product identifiers) and inject them as keyterms before the customer says a word. We use an LLM to identify which terms are likely to be spoken and boost their transcription.
This matters most for names. The worst experience in voice AI is saying your own name and having the agent fail to transcribe it, or butcher the pronunciation when repeating it back. Dynamic keyterm extraction solves this.
Don't maximise noise cancellation
The assumption is intuitive: cleaner audio means better transcription. Remove background noise, give the model the purest signal possible.
It's wrong. STT models are trained on noisy audio, and they expect ambient cues: the acoustic texture of a room, a background hum. Aggressive noise cancellation strips those cues and can actually degrade transcription accuracy. In our testing, dialing noise cancellation back improved results.
We use noise cancellation for turn detection (knowing when someone has stopped speaking), but we calibrate the level carefully. The setting that's best for detecting pauses isn't the same as what's best for transcription.
Accept that telephony is messy
Not all audio is created equal. Old landline phones use 8 kHz codecs. The audio can be hard for humans to parse, let alone a model. Mobile phones generally use 16 kHz. Web calling uses 48 kHz.
This means the same STT model will perform differently depending on how the customer is calling. A voice AI agent needs to handle all three gracefully, and "gracefully" mostly means not pretending the audio is better than it is. Transcription errors on 8 kHz landline calls are inevitable. The system should be designed to recover from them, not to assume they won't happen.
Teach the LLM it's listening
An LLM that doesn't know its input came from speech will treat transcription errors like typos. If the STT transcribes "I need to make a clam" instead of "claim," the LLM might take "clam" at face value and respond about seafood instead of insurance. It doesn't know this was a phonetic error, so it reasons about the wrong word.
A simple prompt fixes this: "The user's message is from a speech-to-text model. If a transcribed word sounds phonetically similar to an expected word, assume the user spoke the expected word."
This tells the LLM to interpret its input as speech, not text. "Clam" becomes "claim." The model stops correcting and starts inferring.
Know when to ask
Every transcript comes with a confidence score, a per-word signal indicating how certain the model is about what it heard. Most implementations ignore it.
We don't. When confidence drops below a threshold, we prompt the LLM to ask a clarifying question rather than proceeding from a guess. "Just to confirm, did you say your last name is Bradley?" recovers the information and sounds competent. The alternative (proceeding confidently from a wrong premise) leads to the kind of conversation where the customer keeps saying "no, I said X" and the agent keeps misunderstanding.
This works because LLMs are smart enough to handle the occasional bad transcription. A garbled word here and there usually doesn't matter. The model infers intent from context. But when confidence is consistently low across an utterance, the model's guesses compound and the human feels misunderstood.
Use the right model for each language
Almost every STT model handles English well. The quality gaps appear when you serve customers across multiple languages. The model with the best English accuracy may be mediocre at Vietnamese, unreliable at Hindi, or weak on regional accents within a single language.
"Which STT model is best?" is the wrong question. The right question is which model is best for each language your customers speak. The answer is usually different vendors for different languages, somewhere between art and science, requiring ongoing evaluation as models improve.
None of these are headline features. You won't find "we calibrated our noise cancellation" on a vendor's landing page. But they're the difference between a voice AI agent that works in a demo and one that works on that Tuesday afternoon call. And if the agent can't hear, it can't help.
Book a call
See what Lorikeet is capable of









