AI Model Biases: What went wrong with Whisper by OpenAI?
Published on Sep 1, 2024
When you start working with an AI model, however powerful, you can never be 100% sure of what will happen with it in practice. We've worked with Whisper ASR by OpenAI since its release in 2022 – and what we discovered is nothing short of surprising.
As the title suggests, this post is about (hidden) biases in AI – some of which may surprise you, too. We've handed the mic to our Founder and CEO, Jean-Louis Quéguiner, to break it down below.
Let's start from the beginning. In November 2023, OpenAI released Whisper Large-v3, an ambitious speech-to-text model to add to the Whisper family. Billed as a solution to the problem of “low-resource languages,” it promised unparalleled multilingual support. But did it deliver?
In speech recognition today, many languages suffer from a lack of annotated training data, especially online. Before Whisper Large v3, notable efforts like Mozilla's Common Voice and Meta's "No Language Left Behind" project, spearheaded by Yann LeCun, made strides in addressing this gap.
Large v2, considered the most accurate of the Whisper models before v3, already supported 99 languages with varying levels of word error rate (WER) – a metric used to assess how accurate the model is in capturing speech. The team at OpenAI aimed to push this further with v3 – especially with non-English data, which represented only ⅓ of the original training set.
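WER itself is simple to compute: it is the word-level Levenshtein (edit) distance between the model's hypothesis and the reference transcript, divided by the number of reference words. A minimal self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```

Production systems typically use a library such as jiwer for this, but the metric is exactly this ratio, which is why everything hinges on what the reference text looks like (more on that below).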
Despite the original excitement around the release, the new version introduced or amplified several widely reported issues:
Broken punctuation
Increased hallucinations
(Still) unreliable accuracy in under-represented languages
Having optimized the model at Gladia, I can testify that these issues are very real and affect the model’s performance in real-life use cases.
All of them trace back to a) how the original model was trained, and b) how the latest model was fine-tuned.
Not many people know this, but the fine-tuning happened in a very particular way. It took me almost a year to figure out how. Let's break it down together.
Hallucinations = training bias
Whisper is loved for many things – but it has a bad reputation for hallucinations, which introduce random words and repetitions and can ruin the transcript. Just recently, it made headlines again for hallucinating violent language.
The reason why it happens stems from its training data: Whisper was essentially trained on YouTube and movies available on the internet. Why? These are large volumes of manually annotated audio with timestamps — which is perfect material for training an ASR model.
As a result, in response to silence in an audio recording, Whisper is likely to hallucinate classic YouTube sign-offs like "Thank you for watching [till the end]" or "Subscribe to my channel!"
YouTube-inspired hallucinations from GitHub
Degraded punctuation further exacerbates this, as Whisper processes audio in 30-second chunks – meaning it can easily ‘miss’ punctuation in between the chunks. These flaws have been with Whisper since the start.
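The fixed-window behavior is easy to picture. The sketch below is illustrative (not Whisper's actual implementation): it splits a stream of samples into 30-second windows, and any sentence straddling a window boundary gets cut mid-stream, which is exactly where punctuation tends to drop.

```python
def chunk_samples(samples, sample_rate=16000, chunk_seconds=30):
    """Split raw audio into fixed 30-second windows, as Whisper does.
    A sentence spanning a window boundary is transcribed in two pieces
    with no shared context, which is where punctuation is often lost."""
    size = sample_rate * chunk_seconds
    return [samples[i:i + size] for i in range(0, len(samples), size)]

audio = list(range(16000 * 75))           # 75 seconds of fake samples
chunks = chunk_samples(audio)
print([len(c) / 16000 for c in chunks])   # [30.0, 30.0, 15.0]
```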
The fine-tuning controversy of low-resource languages
Now, back to the fine-tuned Whisper v3. Let's say you've already scraped all of YouTube and potentially every movie out there, but you have no more human-annotated data (with high-quality ground truth) to train on – especially for low-resource languages, since ⅔ of all the data you have is in English.
The cheapest way to improve despite this limitation is to use your current model to automatically annotate the unannotated data and feed it back into training, increasing the weights and representations for these languages in your model. This way, 5-6x more data was added.
So, this is how Whisper v3 was fine-tuned: by adding this new training data to the original dataset of low-resource languages.
The only problem is that the biases introduced in your original model – the hallucinations and slightly degraded punctuation – will now be replicated in your new "AI auto-labeled" unsupervised dataset. You end up multiplying the bias 5-6x for non-English languages!
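The loop described above can be sketched with a toy model. Everything here is invented for illustration (ToyASR, the sign-off string, the sample data), but it shows the mechanism: a bias baked into the labeling model propagates into every pseudo-labeled training pair.

```python
class ToyASR:
    """Toy stand-in for a speech model with a hallucination bias:
    it appends a YouTube-style sign-off to every transcript."""
    def transcribe(self, audio: str) -> str:
        return audio.lower() + " thank you for watching"

def pseudo_label(model, unlabeled_audio):
    # Auto-annotate raw audio with the model's own (biased) output.
    return [(clip, model.transcribe(clip)) for clip in unlabeled_audio]

model = ToyASR()
human_labeled = [("Bonjour tout le monde", "bonjour tout le monde")]  # clean ground truth
unlabeled = ["Guten Tag", "Hola amigos", "Ciao a tutti"]              # "low-resource" audio

dataset = human_labeled + pseudo_label(model, unlabeled)
biased = sum("thank you for watching" in text for _, text in dataset)
print(f"{biased}/{len(dataset)} training pairs now carry the hallucination")  # 3/4
```

Fine-tuning on this dataset would then reinforce the very artifact the labeling model introduced, which is the multiplication effect described above.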
And this didn't go unnoticed by the users.
Reactions to v3 on GitHub
The misleading WER (and more hidden biases)
So, we end up with a model that performs exceptionally well on paper despite having several hidden biases. How does that happen?
Among the most widely used ways to assess WER today are benchmarks involving datasets like FLEURS. These benchmarks are mostly one-sentence-based, with text being read by the speaker into the microphone in noise-less environments. Performing well against these benchmarks is much simpler than dealing with messy real-life audio.
Having worked with many benchmarks myself, I can say for a fact that WER is misleading: it fails to capture real-life limitations or reveal biases (not even the punctuation one), because WER is computed against normalized ground truth, which ignores readability, punctuation, and casing.
Perfect WER, dirty readability.
Official WER of v3 per language
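You can see the masking effect directly. The normalization step below is a simplified stand-in for real text normalizers: it lowercases and strips punctuation before scoring, so a hypothesis with no punctuation or casing at all scores a perfect WER of 0 against a fully punctuated reference.

```python
import re

def normalize(text: str) -> list[str]:
    """Typical WER normalization: lowercase, strip punctuation, split
    into words. Exactly the step that hides punctuation/casing errors."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

reference  = "Hello, Dr. Smith. How are you?"
hypothesis = "hello dr smith how are you"   # no punctuation, no casing

# After normalization the two are identical, so WER = 0 even though
# the hypothesis is far less readable.
print(normalize(reference) == normalize(hypothesis))  # True
```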
And it gets worse. Based on my experience, many training datasets are gender-biased, age-biased, and lack prosodic diversity. Male speakers typically achieve over 90% accuracy, female speakers around 80%, and children as low as 40%.
If we go back to what I said about using internet data for training, it all starts to make sense: a typical profile of abundant internet audio sources is a male dev/content creator working in a quiet, well-insulated environment with silent air conditioning and using a high-quality $300 headset.
The point of this post is not to criticize Whisper – it's still the leading speech recognition model and a key component of our product at Gladia. I'm thrilled to have contributed to optimizing it for enterprise use cases, including mitigating hallucinations and improving recognition of accents with Whisper-Zero.
The point is that there are inherent limitations to the ways we can currently train models. Reflecting on and addressing these biases is crucial to building more inclusive AI systems, wherever we sit in the value chain.
Learn more about Gladia
Want to learn more about Gladia's API? You can try it for yourself for free, or book a demo to learn more. If you're more interested in the latest news and trends, sign up for our newsletter below.