Gladia x Rime | Building better CX agents with STT and TTS
Dec 23, 2025
Jean-Louis Quéguiner
Despite rapid advances in voice agent stacks, including speech-to-text (STT), text-to-speech (TTS), and large language models (LLMs), the real-world promise of fully autonomous voice assistants remains largely unmet.
In a recent webinar, we chatted with Lily Clifford, a speech technology researcher and founder of Rime, a company specializing in TTS, to unpack what’s working, what isn’t, and what truly matters when building voice agents that deliver an on-call experience that matches user expectations.
You can watch our conversation below or read on for a practical summary of key learnings from the field.
Why human-like TTS can hurt voice agent performance
Since the earliest days of TTS, the end goal for many teams has been the same: make the voice sound human, warm, and natural. But one of the most surprising insights from the discussion was that this goal can sometimes backfire.
As Lily put it during the webinar:
“Overwhelmingly, every point of data that we have shows that if you hear professional voice actor TTS, you’re more likely to hang up.”
In demos, highly expressive voices often sound impressive. But in real customer interactions, especially over the phone, they can feel unnatural or even suspicious.
Another important point Lily made was about first impressions:
“If you hang up after the first thing the agent says, you never get to experience how good the system actually is.”
In other words, the voice sets expectations instantly. If that expectation feels wrong, the conversation ends before the intelligence of the system ever has a chance to show.
STT and TTS accuracy vs quality: why precision matters more
Another recurring theme from the webinar was the difference between accuracy and quality — two concepts that are often conflated.
When discussing STT and TTS performance, teams usually focus on:
Quality, meaning how natural or fluent the output sounds
Accuracy, meaning whether the output is correct
But in production voice agents, there’s a more important distinction: precision.
As Jean-Louis explained during the webinar:
“You can be 99% accurate, but if the only thing you get wrong is the first name, that’s enough to kill the experience.”
Precision on critical entities matters far more than global averages — and this is something many evaluation methods fail to capture.
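To make that distinction concrete, here is a minimal sketch (not from the webinar) contrasting a naive overall word accuracy with exact-match precision on critical entities such as a caller’s first name. The transcripts and entity list are hypothetical.

```python
# Minimal sketch: overall word accuracy vs. precision on critical entities.
# The transcripts and the entity list below are hypothetical examples.

def word_accuracy(reference: str, hypothesis: str) -> float:
    """Naive word-level accuracy (position-by-position match)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    matches = sum(r == h for r, h in zip(ref, hyp))
    return matches / max(len(ref), 1)

def entity_precision(hypothesis: str, critical_entities: list[str]) -> float:
    """Fraction of critical entities reproduced exactly in the transcript."""
    hyp = hypothesis.lower()
    found = sum(entity.lower() in hyp for entity in critical_entities)
    return found / max(len(critical_entities), 1)

reference = "hi this is marguerite calling about my appointment on tuesday"
hypothesis = "hi this is margaret calling about my appointment on tuesday"

print(word_accuracy(reference, hypothesis))          # 0.9: looks fine on average
print(entity_precision(hypothesis, ["Marguerite"]))  # 0.0: the one thing that mattered is wrong
```

An averaged score can look healthy while the single entity that determines the outcome of the call is wrong, which is exactly the failure mode the quote describes.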
Public ASR benchmarks are useful for research and high-level comparison, but they often fall short when applied to real-world voice agents.
Enterprise voice systems don’t operate on clean, labeled datasets. They operate in environments with:
Emerging or proprietary brand names
Customer names and addresses
Email addresses and alphanumeric identifiers
Code-switching between languages
Noisy audio and telephone artifacts
One example discussed in the webinar highlighted how benchmarks can even penalize correct behavior. In bilingual calls, a model may accurately transcribe speech in multiple languages, while the benchmark’s ground truth labels everything as “foreign language.” In this case, the model is right — but the evaluation framework is wrong.
This leads to a dangerous outcome: models that are optimized for benchmarks rather than for customer impact. When benchmarks don’t measure what matters in production, they incentivize the wrong trade-offs.
Evaluation should reflect business outcomes, not just synthetic test scores.
Voice agent design differences between inbound and outbound calls
Not all voice interactions are the same, and one important distinction is whether a call is inbound or outbound.
In inbound calls, the system has little to no prior context. The caller could be reaching out for many different reasons, which means broader vocabularies, more ambiguity, and greater tolerance for exploratory interaction.
Outbound calls are very different. The system usually knows who it is calling and why. In these scenarios, users are far less tolerant of latency or misunderstanding. Precision and speed become critical, especially when confirming names, appointments, or transactions.
Designing effective voice agents requires acknowledging these differences and configuring STT, TTS, and orchestration logic accordingly.
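As an illustration of what that configuration split might look like, here is a small sketch with hypothetical parameter names; the actual options depend on the STT, TTS, and orchestration tools you use.

```python
# Hypothetical configuration sketch: parameter names are illustrative,
# not actual Gladia or Rime API options.

INBOUND_PROFILE = {
    "stt": {"vocabulary_boost": [], "endpointing_ms": 800},   # broad intents, more patience
    "tts": {"voice": "neutral_conversational", "rate": 1.0},
    "orchestration": {"clarifying_questions": True},
}

OUTBOUND_PROFILE = {
    "stt": {"vocabulary_boost": ["customer names", "appointment slots"], "endpointing_ms": 400},
    "tts": {"voice": "concise_professional", "rate": 1.05},
    "orchestration": {"confirm_critical_entities": True},     # read back names, dates, amounts
}

def profile_for(call_direction: str) -> dict:
    """Pick the configuration profile based on call direction."""
    return OUTBOUND_PROFILE if call_direction == "outbound" else INBOUND_PROFILE
```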
Latency in voice agents is about perception, not just speed
Latency is often treated as a purely technical metric, measured in milliseconds. But in conversational systems, latency is as much about perception as it is about speed.
If a system responds too slowly, users feel like they weren’t heard. If it responds too quickly, it can feel unnatural or interruptive. What matters is conversational rhythm — the timing that aligns with human expectations of turn-taking.
Optimizing latency isn’t about minimizing it at all costs. It’s about making responses feel timely and intentional within the flow of conversation.
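One way to express this in code is to bound the response gap rather than minimize it. The sketch below delays a too-fast reply and plays a brief acknowledgement if generation runs long; the thresholds and the `generate_reply` and `speak` callables are illustrative assumptions, not recommendations from the webinar.

```python
import asyncio

# Illustrative thresholds: tune per language, channel, and audience.
MIN_GAP_S = 0.3   # responding faster than this can feel interruptive
MAX_GAP_S = 1.2   # beyond this, acknowledge so the caller knows they were heard

async def respond_with_rhythm(generate_reply, speak):
    """Aim for a natural turn-taking gap instead of the lowest possible latency."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    reply_task = asyncio.create_task(generate_reply())

    try:
        reply = await asyncio.wait_for(asyncio.shield(reply_task), timeout=MAX_GAP_S)
    except asyncio.TimeoutError:
        await speak("One moment.")        # short filler keeps the conversational rhythm alive
        reply = await reply_task

    elapsed = loop.time() - start
    if elapsed < MIN_GAP_S:
        await asyncio.sleep(MIN_GAP_S - elapsed)  # avoid answering unnaturally fast
    await speak(reply)
```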
Why A/B testing is essential for STT and TTS systems
The most successful voice teams don’t rely on intuition or static configurations. They test.
As Lily shared:
“Some of our customers are running forty different voices in production and seeing which one reduces abandonment.”
Based on her experience, most voice teams today run experiments across:
Different voices and speaking styles
Turn-taking strategies
Latency thresholds
Precision rules for critical entities
Small changes in these parameters can have an outsized impact on conversion rates, call completion, or customer satisfaction. At scale, even a one percent improvement can translate into significant business value.
For voice agents, experimentation isn’t optional — it’s foundational.
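A minimal way to run that kind of experiment is to assign each call a variant deterministically and compare abandonment afterwards. The sketch below uses hypothetical voice IDs and a hash-based split; it is not tied to any particular analytics stack.

```python
import hashlib

# Hypothetical voice variants under test; real IDs depend on your TTS provider.
VOICE_VARIANTS = ["voice_a", "voice_b", "voice_c"]

def assign_variant(call_id: str) -> str:
    """Deterministically bucket a call into a voice variant (stable across retries)."""
    digest = hashlib.sha256(call_id.encode()).hexdigest()
    return VOICE_VARIANTS[int(digest, 16) % len(VOICE_VARIANTS)]

def abandonment_rate(calls: list[dict], variant: str) -> float:
    """Share of calls on a variant where the caller hung up before completion."""
    subset = [c for c in calls if c["variant"] == variant]
    if not subset:
        return 0.0
    return sum(c["abandoned"] for c in subset) / len(subset)
```

Keeping the assignment deterministic per call ID means retries and reconnects stay in the same bucket, which keeps the comparison between variants clean.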
Voice agents are user experiences, not just STT and TTS models
It’s tempting to think of voice agents as pipelines: STT feeds into an LLM, which feeds into TTS. But the reality is far more complex.
Voice agents are user experiences. Many of the hardest problems don’t live inside model weights — they live in orchestration, integration, evaluation, and design.
Building better voice agents requires systems thinking, continuous measurement, and a deep understanding of how humans perceive and interact with voice interfaces.
How successful teams build production-ready voice agents
From the patterns discussed in the webinar, high-performing teams tend to:
Prioritize precision over perceived realism
Design differently for inbound and outbound interactions
Test and iterate continuously in production
Measure success through business metrics like conversion and churn
Treat evaluation as a product problem, not just a modeling task
These teams focus less on theoretical perfection and more on real-world performance.
Conclusion: building better voice agents with STT and TTS
If there’s one takeaway from this discussion, it’s this: in voice interfaces, “good enough” is not good enough.
Accuracy, precision, latency, and conversational flow all have disproportionate impact at scale. Building better voice agents with STT and TTS means grounding technical decisions in real user behavior and real business outcomes.
The technology is advancing quickly — but the teams that win will be the ones that design for how voice is actually experienced.