The evolution and impact of Speech AI: An in-depth conversation with Gladia's CEO Jean-Louis

Published on Sep 3, 2024

Once in a while, we like to zoom out of our day-to-day to reflect on the bigger trends affecting our customers and, ultimately, adapt our product accordingly. What are the key shifts happening in voice-first platforms today, and how can speech recognition help them navigate these changes?

In a recent episode of the Caveminds podcast, Jean-Louis Quéguiner, Co-founder and CEO of Gladia, joined host Benson Sung to discuss the revolutionary world of Speech AI. Throughout the hour-long conversation, they delved into the applications of audio intelligence AI, the intricacies and challenges at the infrastructure level, and the broader implications of conversational AI for businesses.

Check out the TLDR version below.

The Business Case for Speech AI

One of the driving factors behind Gladia's focus on speech-to-text (STT) and audio intelligence was the sheer volume of valuable data embedded in audio communications like calls and meetings. Jean-Louis points out that a single hour of conversation can generate around 25,000 tokens – a substantial amount of textual information. This data is often underutilized in businesses, despite its potential to provide deep insights into customer interactions and operational efficiencies.

So, what are the specific use cases of this new technology?

Call Centers

Jean-Louis explained that one of the primary applications of speech AI is in call centers, which today spend over $300 billion annually on operations and have notoriously high turnover rates.

Traditional transcription services are often slow, costly, and inaccurate, hindering the ability of companies to analyze customer interactions effectively. Speech AI addresses these issues by providing rapid and precise transcription, enabling businesses to identify customer issues in real time and take swift corrective action. This can significantly improve customer satisfaction and operational efficiency.

Beyond standard transcription, Jean-Louis envisions a future where conversational AI can handle initial customer interactions and even automate them end-to-end to reduce costs and improve service quality.

Meeting Recorders

Meeting recorders represent another significant use case for audio intelligence AI. By integrating AI into meeting recording tools, companies can automate the transcription and summarization of meetings. The result? Streamlined follow-ups and action items, which allow employees to focus on creative problem-solving rather than mundane documentation tasks.

The efficiency gained through accurate AI-driven transcription can lead to higher productivity and better decision-making.

Healthcare Applications

As Jean-Louis and Benson discussed, the potential for speech AI in the healthcare sector is also immense.

Doctors often spend considerable time on documentation, which can be streamlined using AI. Accurate transcription of patient interactions helps maintain comprehensive medical records, improve patient care, and ensure crucial information is not lost during referrals to specialists. This efficiency can also alleviate the administrative burden on healthcare professionals, allowing them to dedicate more time to patient care.

Challenges with Speech AI

While the benefits are clear, several challenges accompany the implementation of STT and audio intelligence AI. Jean-Louis highlighted three key issues: quality and multimodality, catastrophic forgetting, and diarization & speaker operation.

Quality and Multimodality

Jean-Louis emphasizes that for AI to be effective in real-world applications, high-quality transcription is crucial, especially when the data is used for further processing by AI systems.

Multimodality – the ability to integrate audio, text, and visual data – is essential for creating a comprehensive understanding of the information. Achieving this level of quality and integration remains a significant challenge. That’s why Gladia focuses on providing exceptional accuracy across 99+ languages, setting a high standard in the industry.

Catastrophic Forgetting

AI systems can suffer from what’s called “catastrophic forgetting”, where models (particularly LLMs) lose previously learned information when acquiring new knowledge. A related problem shows up in long conversations or documents, where the AI might overlook crucial details from the middle of the content. Addressing these problems requires continuous advancements in AI model training and memory retention techniques.

Gladia tackles this challenge by refining models to ensure consistency and reliability over extended periods. More on this in our dedicated article on LLM-powered summarization available with our API.
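
To make the risk of losing details from the middle of a long conversation more concrete, here is a minimal sketch of one common mitigation: splitting a long transcript into overlapping chunks so that each part can be summarized on its own before the partial summaries are combined. This is an illustration only, not a description of how Gladia's models or API work; the function name `split_into_chunks` and its parameters are hypothetical, and the summarization step itself is left out.

```python
def split_into_chunks(words, chunk_size, overlap):
    """Split a list of transcript words into overlapping chunks.

    Hypothetical helper for illustration: each chunk holds at most
    `chunk_size` words and shares `overlap` words with the previous
    chunk, so sentences cut at a boundary still appear whole somewhere.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the transcript.
        if start + chunk_size >= len(words):
            break
    return chunks


# Usage: ten words, chunks of four words sharing one word with the neighbor.
transcript_words = "the customer asked about the refund policy for late orders".split()
chunks = split_into_chunks(transcript_words, chunk_size=4, overlap=1)
```

Each chunk could then be summarized separately, with the partial summaries merged in a final pass. The chunk size and overlap are tuning choices that depend on the model's context window.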

Diarization & Speaker Operation

Diarization, the process of distinguishing between different speakers in an audio recording, is another complex challenge. Accurate diarization is essential for ensuring the right information is attributed to the correct speaker, which is particularly important in multi-participant meetings or call center recordings. Overlapping speech, accents, and code-switching make this especially challenging.

Gladia sets itself apart by focusing on advanced speaker diarization techniques and speaker operation functionalities, providing more accurate and reliable results.
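
To illustrate what attributing the right information to the correct speaker involves, here is a minimal sketch that combines two hypothetical inputs: speaker turns with timestamps (the output of a diarization step) and transcribed words with timestamps. Each word is assigned to the speaker whose turn overlaps it the most. The `Word` and `Turn` types and the `assign_speakers` function are assumptions made for this example; they do not reflect Gladia's API or its diarization method.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it the most.

    Words that overlap no turn at all are labeled "unknown".
    """
    labeled = []
    for word in words:
        best_speaker, best_overlap = "unknown", 0.0
        for turn in turns:
            shared = overlap(word.start, word.end, turn.start, turn.end)
            if shared > best_overlap:
                best_speaker, best_overlap = turn.speaker, shared
        labeled.append((best_speaker, word.text))
    return labeled


# Usage: the two turns overlap slightly between 1.8 s and 2.0 s.
turns = [Turn("agent", 0.0, 2.0), Turn("customer", 1.8, 4.0)]
words = [
    Word("Hello", 0.2, 0.6),
    Word("there", 0.7, 1.0),
    Word("Hi", 1.9, 2.3),
    Word("thanks", 2.4, 2.9),
]
labeled = assign_speakers(words, turns)
```

Even in this toy example, the word "Hi" falls inside both turns and is resolved only by comparing overlap durations, which hints at why overlapping speech makes real-world diarization hard.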

The Future of Speech AI

Speech AI is poised to revolutionize the way we interact with and analyze spoken information, paving the way for a more efficient and productive future. Jean-Louis' vision for the future includes seamless integration of AI into everyday business operations, where AI can provide real-time insights and support, ultimately enhancing human creativity and decision-making rather than replacing them.

As Speech AI technology evolves, its applications will continue to expand across various sectors, driving innovation and helping companies transform their operations.

Want to learn more about Gladia’s API? You can try it for yourself for free, or book a demo to learn more. If you’re more interested in the latest news and trends, sign up for our newsletter.
