AI Model Biases: What went wrong with Whisper by OpenAI?
Published on
Aug 2024

When you start working with an AI model, however powerful, you can never be 100% sure of what will happen with it in practice. We've worked with Whisper ASR by OpenAI since its release in 2022 – and what we discovered is nothing short of surprising.

As the title suggests, this post is about (hidden) biases in AI – some of which may surprise you, too. We've handed the floor to our Founder and CEO, Jean-Louis Quéguiner, to break it down below.

Evaluating Whisper Large-v3’s speech recognition performance

Let's start from the beginning. In November 2023, OpenAI released Whisper Large-v3, an ambitious speech-to-text model to add to the Whisper family. Billed as a solution to the problem of “low-resource languages,” it promised unparalleled multilingual support. But did it deliver?

In speech recognition today, many languages suffer from a lack of annotated training data, especially online. Before Whisper Large v3, notable efforts like Mozilla's Common Voice and Meta's "No Language Left Behind" project, spearheaded by Yann LeCun, made strides in addressing this gap.

Large v2, considered the most accurate of the Whisper models before v3, already supported 99 languages with varying levels of word error rate (WER) – a metric used to assess how accurate the model is in capturing speech. The team at OpenAI aimed to push this further with v3 – especially with non-English data, which represented only ⅓ of the original training set.

Despite the original excitement around the release, the new version introduced or amplified widely reported issues like:

  • Broken punctuation
  • Increased hallucinations
  • (Still) unreliable accuracy in under-represented languages

Having optimized the model at Gladia, I can testify that these issues are very real and affect the model’s performance in real-life use cases.

All of them have to do with (a) how the original model was trained and (b) how the latest model was fine-tuned.

Not many people know this, but the fine-tuning happened in a very particular way. It took me almost a year to figure out how. Let's break it down together.

Hallucinations = training bias

Whisper is loved for many things – but it has a bad reputation for hallucinations, which introduce random words and repetitions and can ruin the transcript. Just recently, it made headlines again for hallucinating violent language.

The reason why it happens stems from its training data: Whisper was essentially trained on YouTube and movies available on the internet. Why? These are large volumes of manually annotated audio with timestamps — which is perfect material for training an ASR model.

As a result, in response to silence in an audio recording, Whisper is likely to hallucinate classic YouTube sign-offs like "Thank you for watching [till the end]" or "Subscribe to my channel!"

YouTube-inspired hallucinations from GitHub

Degraded punctuation further exacerbates this, as Whisper processes audio in 30-second chunks – meaning it can easily ‘miss’ punctuation in between the chunks. These flaws have been with Whisper since the start.
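
You can observe this effect directly. Below is a minimal sketch using the open-source openai-whisper package (pip install openai-whisper); the model name is a real checkpoint, but the exact hallucinated output is illustrative and varies from run to run.

```python
# Minimal sketch of the silence-hallucination effect, assuming the
# open-source `openai-whisper` package. Output varies between runs.
import numpy as np
import whisper

model = whisper.load_model("large-v3")

# 30 seconds of pure silence at Whisper's expected 16 kHz sample rate.
silence = np.zeros(30 * 16000, dtype=np.float32)

result = model.transcribe(silence, language="en")
print(repr(result["text"]))
# On silent input, you may get a YouTube-style sign-off such as
# "Thank you for watching!" instead of an empty transcript.
```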

The fine-tuning controversy of low-resource languages

Now, back to the fine-tuned Whisper V3. Let's say you've scraped all of YouTube and every movie out there, but you don't have any more human-annotated data (with high-quality ground truth) to train on – especially for low-resource languages, as ⅔ of all the data you have is in English.

The cheapest way to improve despite this limitation is to use your current AI model to automatically annotate the unannotated data, then feed the result back into training to strengthen the weights and representations for these languages. This is how roughly 5-6x more data was added.

So, this is how Whisper v3 was fine-tuned: by adding this new training data to the original dataset of low-resource languages.
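
In code terms, the loop looks something like the sketch below. To be clear, this is a simplified illustration of the pseudo-labeling idea, not OpenAI's actual pipeline, and the AsrModel interface is a hypothetical placeholder.

```python
# Simplified sketch of a pseudo-labeling (self-training) loop - not
# OpenAI's actual pipeline. `AsrModel` is a hypothetical interface.
from typing import Iterable, List, Protocol, Tuple

class AsrModel(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

def pseudo_label(model: AsrModel, unlabeled_audio: Iterable[bytes]) -> List[Tuple[bytes, str]]:
    """Auto-annotate unlabeled audio with the current model's own output."""
    dataset = []
    for audio in unlabeled_audio:
        transcript = model.transcribe(audio)  # machine-generated "ground truth"
        # Any hallucination or punctuation bias the model already has
        # is now baked into the label itself - and learned again.
        dataset.append((audio, transcript))
    return dataset

# Hypothetical usage: fine-tune on human labels plus ~5-6x machine labels.
# training_set = human_labeled + pseudo_label(model, unlabeled_audio)
# model.fine_tune(training_set)
```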

The only problem is that the biases introduced in your original model – hallucinations and slightly degraded punctuation – will now be replicated in your new "AI auto-labeled" unsupervised dataset. So you end up multiplying the bias 5-6x for non-English languages!

And this didn't go unnoticed by the users.

Reactions to v3 on GitHub

The misleading WER (and more hidden biases)

So, we end up with a model that performs exceptionally well on paper despite having several hidden biases. How does that happen?

Among the most widely used ways to assess WER today are benchmarks involving datasets like FLEURS. These benchmarks are mostly one-sentence-based, with text read by the speaker into a microphone in noiseless environments. Performing well against these benchmarks is much simpler than dealing with messy real-life audio.

Having worked with many benchmarks myself, I can say for a fact that WER is misleading: it fails to capture real-life limitations or reveal biases – not even the punctuation one – because it is computed on normalized ground truth, which takes no account of readability, punctuation, or casing.

Perfect WER, dirty readability.
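
To make this concrete, here is a toy sketch – not any official benchmark code – showing how standard normalization lets a transcript with broken punctuation and casing walk away with a perfect score:

```python
# Toy illustration of why normalized WER hides punctuation and casing
# errors: a minimal word-level edit distance, (S + D + I) / N.
import re

def normalize(text: str) -> str:
    # Typical WER normalization: lowercase, strip punctuation.
    return re.sub(r"[^\w\s]", "", text.lower())

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = normalize(reference).split(), normalize(hypothesis).split()
    # Standard Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)

reference = "Hello, Dr. Smith. How are you today?"
hypothesis = "hello dr smith how are you today"  # unreadable, but...
print(wer(reference, hypothesis))  # 0.0 - a "perfect" score
```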

Official WER of Whisper v3 per language


And it gets worse. Based on my experience, many training datasets are gender-biased, age-biased, and lack prosodic diversity. Male speakers typically achieve over 90% accuracy, female speakers around 80%, and children as low as 40%.

If we go back to what I said about using internet data for training, it all starts to make sense: a typical profile of abundant internet audio sources is a male dev/content creator working in a quiet, well-insulated environment with silent air conditioning and using a high-quality $300 headset.
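
If you evaluate ASR models yourself, one way to surface this kind of bias is to stratify WER by speaker group rather than reporting a single aggregate number. A minimal sketch, assuming the jiwer package (pip install jiwer) and a hypothetical test set with speaker metadata:

```python
# Stratified WER evaluation: a sketch assuming the `jiwer` package.
# `eval_samples` is a hypothetical (group, reference, hypothesis) set.
from collections import defaultdict
import jiwer

eval_samples = [
    ("male",   "turn the lights off", "turn the lights off"),
    ("female", "turn the lights off", "turn the light off"),
    ("child",  "turn the lights off", "turn delights of"),
]

by_group = defaultdict(lambda: ([], []))
for group, ref, hyp in eval_samples:
    by_group[group][0].append(ref)
    by_group[group][1].append(hyp)

# A single aggregate WER can look fine while one group lags badly.
for group, (refs, hyps) in by_group.items():
    print(f"{group}: WER = {jiwer.wer(refs, hyps):.2f}")
```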

The point of this post is not to criticize Whisper – it's still the leading speech recognition model and a key component of our product at Gladia. I'm thrilled to have contributed to optimizing it for better performance in enterprise use cases, including mitigating hallucinations and improving recognition of accents with Whisper-Zero.

The point is that there are some inherent limitations to the ways we can currently train models. Reflecting on and addressing these biases is crucial to building more inclusive AI systems, wherever we are in the value chain.

Learn more about Gladia

Want to learn more about Gladia's API? You can try it for yourself for free, or book a demo to learn more. If you're more interested in the latest news and trends, sign up for our newsletter below.

