Last week, OpenAI released Whisper, its new open-source speech recognition model.

Two elements make this model special:

  • First, Whisper is among the first open-source multilingual speech recognition models.
  • Second, it's the first open-source model to smartly combine multimodality: speech recognition and language understanding.

From an inference standpoint, this makes a huge difference:


  1. Reduced latency: in the long run, being multilingual reduces the latency of an Automated Speech Recognition (ASR) system, as it lets the system skip the language classification step that would otherwise run before passing the audio to a language-specific recognizer.
  2. Lower accent sensitivity: having a single model for multiple languages potentially makes it less sensitive to accents (this seems to be confirmed by the community's first qualitative analyses).
  3. Context awareness: this topic has been a chimera for years in the speech recognition space. Combining speech recognition with the Transformer's masking technique could dramatically increase the accuracy of speech recognition systems in the future.
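The latency point above can be sketched as follows: a traditional pipeline runs a language classifier first, then dispatches the audio to a per-language recognizer, while a multilingual model handles any language in a single pass. The "models" below are toy stubs for illustration, not real ASR components:

```python
# Toy contrast between the two pipeline shapes; the "models" are stubs, not real ASR.

def identify_language(audio: bytes) -> str:
    """Stub language classifier: the extra stage a multilingual model skips."""
    return "en"

PER_LANGUAGE_MODELS = {
    "en": lambda audio: "hello world",  # stub English recognizer
    "fr": lambda audio: "bonjour",      # stub French recognizer
}

def classic_pipeline(audio: bytes) -> str:
    """Traditional ASR: classify the language, then dispatch. Two model calls."""
    lang = identify_language(audio)
    return PER_LANGUAGE_MODELS[lang](audio)

def multilingual_pipeline(audio: bytes) -> str:
    """Whisper-style ASR: one multilingual model, no separate language-ID stage."""
    return "hello world"  # stub: a single forward pass handles any input language

print(classic_pipeline(b""))       # two stages
print(multilingual_pipeline(b""))  # one stage
```

The saving is structural: one model invocation instead of two, regardless of how fast the classifier itself is.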


We hope Whisper creates traction toward multimodal models applied to speech recognition.


Test Whisper with gladia

Original Audio

Here is the original audio used:


Listen to the example:

You can compare the results of multiple models by selecting a model.



Try it from your laptop with your own audio

Replace audio.mp3 with your own file to transcribe.
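A minimal sketch of a local transcription, assuming the open-source `whisper` Python package is installed (`pip install -U openai-whisper`, with ffmpeg available on your system):

```python
# Minimal sketch using the open-source `whisper` package.
# Assumes `pip install -U openai-whisper` and ffmpeg installed.

def transcribe(path: str, model_name: str = "base") -> str:
    """Load a Whisper checkpoint and return the transcription of `path`."""
    import whisper  # imported lazily so the sketch degrades gracefully if missing

    model = whisper.load_model(model_name)  # downloads the weights on first use
    result = model.transcribe(path)         # the language is detected automatically
    return result["text"]

try:
    # Replace audio.mp3 with your own file to transcribe.
    print(transcribe("audio.mp3"))
except Exception as exc:  # whisper not installed, ffmpeg missing, or no such file
    print(f"Transcription skipped: {exc}")
```

The package also ships a CLI, so `whisper audio.mp3 --model base` achieves the same from a terminal.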



Here is a great video about Whisper from Louis Bouchard:




Prompt used for the illustration:


a photo of a robot writing with a microphone in a shape of a pen

