OpenAI Whisper is a cutting-edge Automatic Speech Recognition (ASR) system designed to transcribe spoken language into written text, leveraging deep learning techniques. Released in September 2022, this neural net has by now become a legendary tool in natural language processing, offering unparalleled accuracy and versatility and giving rise to numerous open-source and commercial applications.
As a speech-to-text provider that has specialized in Whisper optimizations since its release, we’ve put together a comprehensive intro addressing the most frequently asked questions about Whisper ASR: how it works, what it can be used for, its key alternatives, and the factors to consider when deploying the model for in-house projects.
FAQs about OpenAI Whisper
Is Whisper a model or a system?
OpenAI Whisper can be referred to as both a model and a system, depending on the context.
At its core, Whisper is an AI/ML model, specifically an ASR model. The model encompasses neural network architectures designed to process audio input and generate accurate transcriptions. More precisely, Whisper is an umbrella name for several models of different sizes – ranging from 39M to 1.55B parameters – with ‘larger’ models delivering better accuracy at the expense of longer processing times and higher computational costs.
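To make the point about sizes concrete, here is a minimal sketch using the open-source `whisper` Python package (assuming it has been installed with `pip install openai-whisper` and that ffmpeg is available); the audio file name is a placeholder:

```python
import whisper

# Checkpoint names correspond to the model sizes mentioned above,
# from "tiny" (~39M parameters) up to "large" (~1.55B parameters).
model = whisper.load_model("base")  # swap for "tiny", "small", "medium" or "large"

# Larger checkpoints are generally more accurate but slower and more
# demanding on memory and compute.
result = model.transcribe("meeting.mp3")  # placeholder file
print(result["text"])
```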
In a broader sense, however, Whisper can also be considered a system because it involves not only the model architecture but the entire infrastructure and processes surrounding it.
What can Whisper do?
Whisper’s core purpose is to transcribe speech into text output. It can also translate speech from any of its supported languages to English text. Beyond these fundamental capabilities, Whisper can be optimized and fine-tuned for specific tasks and capabilities.
For instance, we at Gladia have optimized Whisper to perform additional functions like live-streaming transcription and speaker diarization. The model can also be fine-tuned to (better) recognize and transcribe new languages, dialects, and accents. It can also be made more sensitive to specific domains to recognize industry-specific jargon and terms. This flexibility enables developers to tailor Whisper to their specific use cases.
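As a rough illustration of the two built-in tasks – transcription in the source language and translation into English – here is a sketch with the open-source `whisper` package; the file name is a placeholder and exact keyword arguments may vary slightly between package versions:

```python
import whisper

model = whisper.load_model("small")

# Task 1: transcribe speech in its original language.
transcript = model.transcribe("interview_fr.mp3")

# Task 2: translate speech from a supported language into English text.
translation = model.transcribe("interview_fr.mp3", task="translate")

print(transcript["text"])
print(translation["text"])
```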
What has it been trained on?
OpenAI Whisper is trained on a vast dataset of 680,000 hours of supervised data, making it one of the most comprehensive ASR systems available. The dataset, sourced from across the internet and academic resources, encompasses a wide variety of domains and acoustic conditions, ensuring that Whisper can accurately transcribe speech in diverse real-world scenarios. What’s more, 117,000 hours – roughly 17% – of this labeled pre-training data is multilingual, resulting in checkpoints that can be applied to 99 languages, many of which are considered low-resource.
The sheer scale of the training data contributes to Whisper's ability to generalize well and perform effectively across various applications. Because the model is pre-trained directly on the supervised task of speech recognition, its average level of accuracy is superior to that of most other open-source models.
That said, given the generalist nature of its initial training dataset, the model is statistically biased towards everyday speech rather than specialized, professional audio, which means it would normally require at least some fine-tuning to yield consistently accurate results in business environments.
What makes Whisper so good?
Whisper stands out as a best-in-class ASR system due to its exceptional base accuracy and performance in handling diverse languages. Its adaptability to challenging acoustic conditions, such as noisy and multilingual audio, sets it apart from other speech recognition systems. According to the Open ASR Leaderboard, its average word error rate (WER) is 8.06%, meaning it is roughly 92% accurate by default (accuracy here being simply 100% minus WER).
Available in several sizes, Whisper also allows developers to balance computational cost, speed, and accuracy as required by the intended use, making it highly versatile and useful across a range of applications.
How long does Whisper take to transcribe?
On a GPU, a typical Whisper transcription takes anywhere from 8 to 30 minutes, depending on the type of audio and the model size used; running on CPUs only takes roughly twice as long.
How does it work exactly?
OpenAI Whisper is an end-to-end deep learning model based on an encoder-decoder Transformer architecture. Transformer models, first introduced in the seminal ‘Attention is All You Need’ paper in 2017, are distinguished by their ability to keep track of how multiple words and sentences relate to each other, making it possible to take long-range dependencies into account. In other words, Transformers can ‘remember’ what was said previously to contextualize words, which helps boost their transcription accuracy.
In the case of Whisper specifically, the model transcribes speech in a two-step encoder-decoder process. First, the input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and passed into an encoder to generate a mathematical representation of the audio. This representation is then decoded using a language model, i.e. the audio is processed through the model's layers to predict the most likely sequence of text tokens — basic units of text used for processing.
As explained by OpenAI, the text captions are then intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
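To make the chunking, spectrogram, and decoding steps more tangible, here is a sketch based on the lower-level functions documented in the open-source `whisper` package (the audio path is a placeholder, and exact signatures may differ slightly across package versions):

```python
import whisper

model = whisper.load_model("base")

# Load the audio and pad/trim it to the model's 30-second window.
audio = whisper.load_audio("sample.wav")
audio = whisper.pad_or_trim(audio)

# Encoder input: the waveform is converted into a log-Mel spectrogram.
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# Special tokens let the same model also identify the spoken language.
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Decoder: predict the most likely sequence of text tokens.
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)
```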
Is Whisper generative AI?
As stated before, Whisper’s pre-trained transformer architecture enables the model to infer the broader context of sentences transcribed and “fill in” the gaps in the transcript based on this understanding. In that sense, Whisper ASR can be said to leverage generative AI techniques to convert spoken language into written text.
Is there a Whisper API?
In March 2023, OpenAI made the large-v2 model available through its API, which gives faster performance than the open-source model and is priced at $0.006 per minute of transcription. The Whisper API is available through the transcriptions endpoint (transcribes in the source language) and the translations endpoint (transcribes into English), and accepts standard audio formats like m4a, mp3, mp4 and wav.
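For reference, a minimal call to the hosted endpoints might look like the sketch below, using the official `openai` Python SDK (v1-style client); the file name is a placeholder and the API key is read from the environment:

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Transcriptions endpoint: returns text in the source language.
with open("podcast_episode.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Translations endpoint: returns English text regardless of the source language.
with open("podcast_episode.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)
print(translation.text)
```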
There are also Whisper-based APIs like Gladia, which relies on a hybrid and enhanced Whisper architecture to deliver a more extended set of capabilities and features than the official OpenAI API.
What is Whisper used for exactly?
Whisper is a highly versatile model that can be used to build a variety of voice-enabled apps across industries and use cases, such as:
In call centers, Whisper can be used to build voice assistants capable of understanding speech and responding to customer inquiries through voice interactions.
Whisper's precise transcription abilities make it a great choice for automating transcription in virtual meetings and note-taking platforms catering to general audio and specific verticals like education, healthcare, journalism, legal, and more.
In media products, Whisper can be used to generate podcast transcripts and video captions, including in live streaming environments, to ensure a better watching experience and accessibility for users worldwide.
In sales-optimized apps, Whisper is commonly used to power CRM enrichment tools with transcripts from client and prospect meetings.
If you want to dive deeper, here are some hands-on guides to build transcription-based apps with Whisper:
We have also put together a list of our favorite GitHub Whisper ASR projects for those who want to use open-source code alone.
What are the limitations of Whisper AI?
Vanilla Whisper comes with several limitations. First, file uploads are capped at 25MB, and the model natively processes audio in 30-second segments. It cannot process URLs or callbacks. Powered by a predecessor of the iconic GPT-3 at the decoding phase, the model is also infamously prone to hallucinations, resulting in errors in the transcript. In terms of features, it provides speech-to-text transcription and translation into English, without additional audio intelligence features like speaker diarization, summarization, or others. Real-time transcription is likewise not supported.
What are the key alternatives to Whisper ASR?
When it comes to alternatives, both open-source and commercial options are available. Whichever route you choose depends on your use case, budget, and project requirements. You may wish to check out this dedicated article to learn more about the advantages and shortcomings of a Whisper-based API vs. OSS.
Some open-source Whisper alternatives
Mozilla DeepSpeech: An open-source ASR engine that allows developers to train custom models, providing flexibility for specific project requirements.
Kaldi: Widely used in research and industry, Kaldi is a powerful toolkit for speech recognition systems, offering extensive customization options.
Wav2vec: Meta AI’s self-supervised speech recognition framework for high-performing speech processing.
Specialized providers: Gladia, AssemblyAI, and Deepgram are among the leaders here, providing the same features as big tech providers with better core performance in speed and accuracy, at a more affordable price. To learn more, here is a full list of the best speech-to-text APIs today. Note that Gladia is the only Whisper-based API on that list.
For a more comprehensive analysis of the leading commercial ASR models, you may want to have a look at our recent market overview of ASR engines.
Is Whisper ASR suitable for my project?
The answer depends on your specific needs. Whisper is best suited for product demos and academic and indie projects with relatively low volumes of audio. The model was not designed as a production-ready enterprise tool – OpenAI themselves have acknowledged that the open-source model is ‘hard to run’ at scale.
Case in point, according to an internal survey we did on ASR usage, any professional project that needs over 100 hours of recurrent transcription per month risks running into practical issues with the model. Let’s break these down into key factors to consider about open-source Whisper:
1. Accuracy and speed
People using vanilla Whisper often report its speed and/or accuracy to be insufficient for their use case. With speech recognition models in general, getting the best of both worlds here is tricky due to an inherent technical trade-off in ASR, where an increase in the quality of transcription tends to come with a corresponding decrease in speed. Whisper, alas, is no exception to the rule.
Whisper is an accuracy-first model, though with the smaller Whisper checkpoints and open-source projects like Distil-Whisper, you can optimize it for speed. When it comes to accuracy, while official benchmarks measuring word error rate (WER) can give you a good idea of what to expect, consider running tests or trials to assess its performance in real-world scenarios with data that reflects your day-to-day transcription needs. This is especially true since some of the most popular benchmarks in ASR, like LibriSpeech, do not give Whisper full credit on accuracy because of the model’s generalist training dataset. To get a general idea of Whisper’s accuracy, we recommend checking this Hugging Face leaderboard featuring its WER across a wide range of benchmarks.
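As a rough sketch of such a trial run, here is one way to score Whisper on your own data, assuming the open-source `whisper` package plus the `jiwer` library for word error rate; the audio files and reference transcripts below are hypothetical:

```python
import whisper
from jiwer import wer

model = whisper.load_model("medium")

# Hypothetical evaluation set: (audio file, human-verified reference transcript).
samples = [
    ("call_001.wav", "thanks for calling how can i help you today"),
    ("call_002.wav", "i would like to update my billing address"),
]

errors = []
for audio_path, reference in samples:
    # In practice you would also strip punctuation and normalize numbers
    # before scoring, since WER is sensitive to such differences.
    hypothesis = model.transcribe(audio_path)["text"].lower()
    errors.append(wer(reference, hypothesis))

print(f"Average WER on in-domain data: {sum(errors) / len(errors):.2%}")
```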
2. Volume of audio
There are several input, output, and processing constraints that make Whisper impractical to use at scale. On the input side, file uploads are capped at 25MB, audio is natively processed in 30-second chunks, and there is no URL support. These and other limitations of Whisper make the open-source model hard to scale without at least some AI expertise and substantial hardware expenses.
3. Advanced features
Because open-source Whisper is limited in its core features, any product that requires additional audio intelligence functionalities like live transcription, speaker diarization, and word-level timestamps (not to be confused with phrase-level ones) will not be well served with the original open-source version alone.
4. Language support
Whisper officially supports 99 languages. Bear in mind that 65% of the data used to train Whisper was for English speech recognition, while only 17% was for multilingual speech recognition (and the remaining 18% for English translation). This means that you might need additional fine-tuning for non-English languages and accents, especially the less widely spoken ones.
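One simple, low-effort step before resorting to fine-tuning is to tell the model which language to expect instead of relying on auto-detection, which can already help with accented or lower-resource speech. A minimal sketch with the open-source package (the language code and file name are placeholders):

```python
import whisper

model = whisper.load_model("medium")

# Passing the language explicitly bypasses auto-detection.
result = model.transcribe("lecture_sw.mp3", language="sw")  # "sw" = Swahili
print(result["text"])
```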
5. Available in-house AI expertise and resources
Given all of the above, you should evaluate whether your engineering team possesses sufficient expertise to deploy Whisper internally and to adapt and enhance it to serve your specific needs. There’s also the issue of hardware and hosting capabilities. Alternatively, you may choose to go with an API for an all-batteries-included experience, leveraging a hybrid model architecture, regular updates, and SLAs.
6. Cost considerations
As an open-source model available for free, Whisper may seem like the most budget-friendly option at first glance. However, as mentioned above, scalability and maintenance costs need to be factored in as your project grows. If you’re curious about the total cost of ownership of open-source Whisper at the enterprise level, have a look at this estimation and cost breakdown.
Final remarks
OpenAI Whisper not only stands as a legendary ASR system but also proves to be a versatile and adaptable solution for a wide range of applications. We hope that this article was useful in addressing the key questions you may have had about it. If you'd like to discuss Whisper or our API further, do not hesitate to reach out here.