Gladia x Rime | Building better CX agents with STT and TTS
Dec 23, 2025
Jean-Louis Quéguiner
Despite rapid advances in voice agent stacks, including speech-to-text (STT), text-to-speech (TTS), and large language models (LLMs), the real-world promise of fully autonomous voice assistants remains largely unmet.
In a recent webinar, we chatted with Lily Clifford, a speech technology researcher and founder of Rime, a company specializing in TTS, to unpack what’s working, what isn’t, and what truly matters when building voice agents that deliver an on-call experience that matches user expectations.
You can watch our conversation below or read on for a practical summary of key learnings from the field.
Why human-like TTS can hurt voice agent performance
Since the earliest days of TTS, the end goal for many teams has been the same: make the voice sound human, warm, and natural. But one of the most surprising insights from the discussion was that this goal can sometimes backfire.
As Lily put it during the webinar:
“Overwhelmingly, every point of data that we have shows that if you hear professional voice actor TTS, you’re more likely to hang up.”
In demos, highly expressive voices often sound impressive. But in real customer interactions, especially over the phone, they can feel unnatural or even suspicious.
Another important point Lily made was about first impressions:
“If you hang up after the first thing the agent says, you never get to experience how good the system actually is.”
In other words, the voice sets expectations instantly. If that expectation feels wrong, the conversation ends before the intelligence of the system ever has a chance to show.
STT and TTS accuracy vs quality: why precision matters more
Another recurring theme from the webinar was the difference between accuracy and quality — two concepts that are often conflated.
When discussing STT and TTS performance, teams usually focus on:
Quality, meaning how natural or fluent the output sounds
Accuracy, meaning whether the output is correct
But in production voice agents, there’s a more important distinction: precision.
As Jean-Louis explained during the webinar:
“You can be 99% accurate, but if the only thing you get wrong is the first name, that’s enough to kill the experience.”
Precision on critical entities matters far more than global averages — and this is something many evaluation methods fail to capture.
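To make that distinction concrete, here is a minimal sketch (not from the webinar) contrasting a naive overall word accuracy with exact-match precision on critical entities such as a caller’s first name. The transcripts and entity list are hypothetical.

```python
# Minimal sketch: overall word accuracy vs. precision on critical entities.
# The transcripts and the entity list below are hypothetical examples.

def word_accuracy(reference: str, hypothesis: str) -> float:
    """Naive word-level accuracy (position-by-position match)."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    matches = sum(r == h for r, h in zip(ref, hyp))
    return matches / max(len(ref), 1)

def entity_precision(hypothesis: str, critical_entities: list[str]) -> float:
    """Fraction of critical entities reproduced exactly in the transcript."""
    hyp = hypothesis.lower()
    found = sum(entity.lower() in hyp for entity in critical_entities)
    return found / max(len(critical_entities), 1)

reference = "hi this is marguerite calling about my appointment on tuesday"
hypothesis = "hi this is margaret calling about my appointment on tuesday"

print(word_accuracy(reference, hypothesis))          # 0.9: looks fine on average
print(entity_precision(hypothesis, ["Marguerite"]))  # 0.0: the one thing that mattered is wrong
```

An averaged score can look healthy while the single entity that determines the outcome of the call is wrong, which is exactly the failure mode the quote describes.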
Public ASR benchmarks are useful for research and high-level comparison, but they often fall short when applied to real-world voice agents.
Enterprise voice systems don’t operate on clean, labeled datasets. They operate in environments with:
Emerging or proprietary brand names
Customer names and addresses
Email addresses and alphanumeric identifiers
Code-switching between languages
Noisy audio and telephone artifacts
One example discussed in the webinar highlighted how benchmarks can even penalize correct behavior. In bilingual calls, a model may accurately transcribe speech in multiple languages, while the benchmark’s ground truth labels everything as “foreign language.” In this case, the model is right — but the evaluation framework is wrong.
This leads to a dangerous outcome: models that are optimized for benchmarks rather than for customer impact. When benchmarks don’t measure what matters in production, they incentivize the wrong trade-offs.
Evaluation should reflect business outcomes, not just synthetic test scores.
Voice agent design differences between inbound and outbound calls
Not all voice interactions are the same, and one important distinction is whether a call is inbound or outbound.
In inbound calls, the system has little to no prior context. The caller could be reaching out for many different reasons, which means broader vocabularies, more ambiguity, and greater tolerance for exploratory interaction.
Outbound calls are very different. The system usually knows who it is calling and why. In these scenarios, users are far less tolerant of latency or misunderstanding. Precision and speed become critical, especially when confirming names, appointments, or transactions.
Designing effective voice agents requires acknowledging these differences and configuring STT, TTS, and orchestration logic accordingly.
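As an illustration of what that configuration split might look like, here is a small sketch with hypothetical parameter names; the actual options depend on the STT, TTS, and orchestration tools you use.

```python
# Hypothetical configuration sketch: parameter names are illustrative,
# not actual Gladia or Rime API options.

INBOUND_PROFILE = {
    "stt": {"vocabulary_boost": [], "endpointing_ms": 800},   # broad intents, more patience
    "tts": {"voice": "neutral_conversational", "rate": 1.0},
    "orchestration": {"clarifying_questions": True},
}

OUTBOUND_PROFILE = {
    "stt": {"vocabulary_boost": ["customer names", "appointment slots"], "endpointing_ms": 400},
    "tts": {"voice": "concise_professional", "rate": 1.05},
    "orchestration": {"confirm_critical_entities": True},     # read back names, dates, amounts
}

def profile_for(call_direction: str) -> dict:
    """Pick the configuration profile based on call direction."""
    return OUTBOUND_PROFILE if call_direction == "outbound" else INBOUND_PROFILE
```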
Latency in voice agents is about perception, not just speed
Latency is often treated as a purely technical metric, measured in milliseconds. But in conversational systems, latency is as much about perception as it is about speed.
If a system responds too slowly, users feel like they weren’t heard. If it responds too quickly, it can feel unnatural or interruptive. What matters is conversational rhythm — the timing that aligns with human expectations of turn-taking.
Optimizing latency isn’t about minimizing it at all costs. It’s about making responses feel timely and intentional within the flow of conversation.
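One way to express this in code is to bound the response gap rather than minimize it. The sketch below delays a too-fast reply and plays a brief acknowledgement if generation runs long; the thresholds and the `generate_reply` and `speak` callables are illustrative assumptions, not recommendations from the webinar.

```python
import asyncio

# Illustrative thresholds: tune per language, channel, and audience.
MIN_GAP_S = 0.3   # responding faster than this can feel interruptive
MAX_GAP_S = 1.2   # beyond this, acknowledge so the caller knows they were heard

async def respond_with_rhythm(generate_reply, speak):
    """Aim for a natural turn-taking gap instead of the lowest possible latency."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    reply_task = asyncio.create_task(generate_reply())

    try:
        reply = await asyncio.wait_for(asyncio.shield(reply_task), timeout=MAX_GAP_S)
    except asyncio.TimeoutError:
        await speak("One moment.")        # short filler keeps the conversational rhythm alive
        reply = await reply_task

    elapsed = loop.time() - start
    if elapsed < MIN_GAP_S:
        await asyncio.sleep(MIN_GAP_S - elapsed)  # avoid answering unnaturally fast
    await speak(reply)
```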
Why A/B testing is essential for STT and TTS systems
The most successful voice teams don’t rely on intuition or static configurations. They test.
As Lily shared:
“Some of our customers are running forty different voices in production and seeing which one reduces abandonment.”
Based on her experience, most voice teams today run experiments across:
Different voices and speaking styles
Turn-taking strategies
Latency thresholds
Precision rules for critical entities
Small changes in these parameters can have an outsized impact on conversion rates, call completion, or customer satisfaction. At scale, even a one percent improvement can translate into significant business value.
For voice agents, experimentation isn’t optional — it’s foundational.
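A minimal way to run that kind of experiment is to assign each call a variant deterministically and compare abandonment afterwards. The sketch below uses hypothetical voice IDs and a hash-based split; it is not tied to any particular analytics stack.

```python
import hashlib

# Hypothetical voice variants under test; real IDs depend on your TTS provider.
VOICE_VARIANTS = ["voice_a", "voice_b", "voice_c"]

def assign_variant(call_id: str) -> str:
    """Deterministically bucket a call into a voice variant (stable across retries)."""
    digest = hashlib.sha256(call_id.encode()).hexdigest()
    return VOICE_VARIANTS[int(digest, 16) % len(VOICE_VARIANTS)]

def abandonment_rate(calls: list[dict], variant: str) -> float:
    """Share of calls on a variant where the caller hung up before completion."""
    subset = [c for c in calls if c["variant"] == variant]
    if not subset:
        return 0.0
    return sum(c["abandoned"] for c in subset) / len(subset)
```

Keeping the assignment deterministic per call ID means retries and reconnects stay in the same bucket, which keeps the comparison between variants clean.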
Voice agents are user experiences, not just STT and TTS models
It’s tempting to think of voice agents as pipelines: STT feeds into an LLM, which feeds into TTS. But the reality is far more complex.
Voice agents are user experiences. Many of the hardest problems don’t live inside model weights — they live in orchestration, integration, evaluation, and design.
Building better voice agents requires systems thinking, continuous measurement, and a deep understanding of how humans perceive and interact with voice interfaces.
How successful teams build production-ready voice agents
From the patterns discussed in the webinar, high-performing teams tend to:
Prioritize precision over perceived realism
Design differently for inbound and outbound interactions
Test and iterate continuously in production
Measure success through business metrics like conversion and churn
Treat evaluation as a product problem, not just a modeling task
These teams focus less on theoretical perfection and more on real-world performance.
Conclusion: building better voice agents with STT and TTS
If there’s one takeaway from this discussion, it’s this: in voice interfaces, “good enough” is not good enough.
Accuracy, precision, latency, and conversational flow all have disproportionate impact at scale. Building better voice agents with STT and TTS means grounding technical decisions in real user behavior and real business outcomes.
The technology is advancing quickly — but the teams that win will be the ones that design for how voice is actually experienced.