Top 5 Whisper GitHub projects: A practical guide for programmers

Published on Jan 30, 2024

In September 2022, OpenAI released Whisper, an open-source automatic speech recognition (ASR) model trained on 680,000 hours of diverse, multilingual speech. Since then, the model has been widely recognized for its robustness and accuracy: OpenAI reports that it approaches human-level performance on English speech recognition, and it set a new standard for multilingual transcription and translation.

This groundbreaking model has not only captured the attention of the academic community but has also led to a proliferation of high-quality open-source projects. 

In this guide, we look at five such projects (whisper.cpp, use-whisper, buzz, whisperX and distil-whisper), chosen for their innovative applications, practical utility, and distinct approaches to leveraging Whisper's capabilities. Together they show how versatile Whisper is across programming environments, from embedded systems to web applications.

1. whisper.cpp by ggerganov

What it does

whisper.cpp, developed by ggerganov, is a port of OpenAI's Whisper model to plain C/C++. It runs inference on model weights converted to the ggml format, without the Python and PyTorch stack, which can significantly speed up speech-to-text conversion. The port makes Whisper usable in environments where C/C++ is the language of choice, both as a library embedded in existing projects and as a command-line tool.

Project activity and maintenance

Demonstrates high activity with a total of 809 commits.

Features and uses

  • Use cases: Ideal for embedded systems, desktop applications, or integration with existing C/C++ codebases.
  • Platform support: Supports various platforms, including Apple Silicon, Android, and Windows, making it suitable for cross-platform applications.
  • Application areas: Useful in real-time audio processing and systems with limited resources due to its focus on performance and efficiency.          

Why we like it

whisper.cpp is a testament to the adaptability of AI models in varied programming landscapes. It brings the power of Whisper to developers working in C/C++, while Python bindings keep it approachable for a much wider range of developers. Its use in real-time audio processing and on resource-constrained systems showcases its performance and efficiency.

Example use

To get started with whisper.cpp, follow these steps:

Step 1: Clone the repository

Start by cloning the whisper.cpp repository to your local machine. This can be done using the following Git command:


   bash
   git clone https://github.com/ggerganov/whisper.cpp.git

Step 2: Download a Whisper Model in ggml format

Next, download a Whisper model that has been converted to the ggml format. For instance, to download the base.en model, use the following bash script included in the repository:


   bash
   bash ./models/download-ggml-model.sh base.en

  This command will download the `base` English model, which balances performance and accuracy.

Step 3: Optional - convert models yourself

If you prefer to convert Whisper models to ggml format yourself, you can find instructions in the `models/README.md` file within the repository. This step is optional and typically not necessary unless you have specific requirements.

Step 4: Build the main example

Compile the main example application provided in the repository. This is done using the make command:


   bash
   make

Step 5: Transcribe an audio file

Finally, use the compiled application to transcribe an audio file. For example, to transcribe the sample file `jfk.wav`, execute the following command:


    bash
   ./main -f samples/jfk.wav

Following these steps will get you started with whisper.cpp, so you can experiment with transcribing audio files using OpenAI's Whisper model in a C++ environment. Alternatively, you can call the model from Python through a binding such as the whispercpp package:


import ffmpeg
import numpy as np
from whispercpp import Whisper

# Whisper models expect 16 kHz mono audio.
sample_rate = 16000

w = Whisper.from_pretrained("tiny.en")

try:
    # Decode the file to raw 16-bit PCM with ffmpeg (uses the ffmpeg-python package).
    y, _ = (
        ffmpeg.input("sample.wav", threads=0)
        .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sample_rate)
        .run(cmd=["ffmpeg", "-nostdin"], capture_stdout=True, capture_stderr=True)
    )
except ffmpeg.Error as e:
    raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e

# Scale the int16 samples to float32 values in the range [-1.0, 1.0).
arr = np.frombuffer(y, np.int16).flatten().astype(np.float32) / 32768.0

text = w.transcribe(arr)
print(text)
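If you would rather drive the compiled binary from a script than use a binding, the steps above can be sketched in Python as follows. This is a minimal sketch with several assumptions: it is run from the root of the cloned repository, `ffmpeg` is on the PATH, the model file sits at `models/ggml-base.en.bin` (where the download script places it), and the input is first converted to a 16 kHz, mono, 16-bit WAV file, which is the format the whisper.cpp README asks for. The function names are hypothetical helpers, not part of whisper.cpp.

```python
import subprocess
from pathlib import Path
from typing import List


def ffmpeg_convert_cmd(src: str, dst: str) -> List[str]:
    """Build the ffmpeg command that produces a 16 kHz, mono, 16-bit PCM WAV file."""
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", dst]


def whisper_cpp_cmd(
    wav: str,
    model: str = "models/ggml-base.en.bin",  # assumed model location
    binary: str = "./main",                  # binary produced by `make`
) -> List[str]:
    """Build the whisper.cpp command line for one WAV file."""
    return [binary, "-m", model, "-f", wav]


def transcribe(src: str) -> str:
    """Convert `src` to WAV, run whisper.cpp on it, and return the tool's standard output."""
    wav = str(Path(src).with_suffix("")) + "_16k.wav"
    subprocess.run(ffmpeg_convert_cmd(src, wav), check=True)
    done = subprocess.run(whisper_cpp_cmd(wav), check=True, capture_output=True, text=True)
    return done.stdout
```

Keeping the command construction in small functions makes it easy to inspect or log the exact command before anything is executed, for example with `print(whisper_cpp_cmd("samples/jfk.wav"))`.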

2. use-whisper by chengsokdara

What it does

use-whisper, created by chengsokdara, is a React hook that connects web applications to OpenAI's Whisper API. It bundles speech recording, real-time transcription, and silence removal, so React developers can add speech-to-text functionality to a web interface with very little code.

Features and uses

  • Educational tools: Can be employed in language learning platforms, providing immediate transcription for language practice, pronunciation correction, and other interactive educational activities.
  • Accessibility features: Enhances the accessibility of web applications for users with disabilities. Speech-to-text capabilities can aid users who have difficulties with traditional input methods, such as typing.
  • Real-time communication platforms: Integrating use-whisper in platforms like chat applications, web conferencing tools, or customer service interfaces allows for real-time captioning and transcription, benefiting both users with hearing impairments and those in noisy environments.
  • Content creation tools: Useful for journalists, content creators, and podcasters for real-time transcription of interviews, creating subtitles, or generating written content from spoken words.

Why we like it

The simplicity and effectiveness of use-whisper in providing real-time transcription capabilities to web applications are commendable. Its silence detection feature is a notable enhancement, improving user experience in applications like virtual meetings or language learning tools.

Example use


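// React example
import { useWhisper } from '@chengsokdara/use-whisper';

function App() {
  const { transcript, startRecording, stopRecording } = useWhisper({
    apiKey: "YOUR_OPENAI_API_TOKEN", // placeholder; do not ship a real key in client-side code
    removeSilence: true,
  });

  return (
    <div>
      <p>Transcript: {transcript.text}</p>
      <button onClick={() => startRecording()}>Start</button>
      <button onClick={() => stopRecording()}>Stop</button>
    </div>
  );
}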

3. buzz by chidiwilliams

What it does

Buzz, developed by chidiwilliams, is an application for transcribing audio. It supports several backends, including Whisper, whisper.cpp, Hugging Face models, Faster Whisper, and the OpenAI API, so users can choose the one best suited to their needs. A version of Buzz is also available on the App Store, catering to a broader user base. When used with a local model, Buzz operates entirely offline and keeps audio and transcriptions on the user's device, which is particularly valuable for users concerned about data security and privacy.

Use cases

Buzz is valuable for people with hearing impairments and for anyone wary of sending audio to online tools. A downloadable desktop app is also available.

Features and uses

  • Nature: An application for transcribing audio offline, using OpenAI’s Whisper.
  • Accessibility: A valuable tool for individuals with hearing impairments, offering more control and independence.
  • Privacy concerns: Addresses privacy issues by functioning offline and not storing conversation contents.
  • Enhanced app experience: The Mac-native Buzz app available on the App Store features a more user-friendly interface, audio playback, drag-and-drop import, transcript editing, and search functionality.
  • Versatility: Tested on various systems, including Ubuntu, indicating wide operating system compatibility.

Users generally appreciate its offline operation and its independence from third-party cloud services. Performance without a CUDA-capable GPU is worth keeping in mind, especially on less powerful hardware.

Why we like it

Buzz addresses key concerns like privacy and accessibility. Its offline functionality and independence from cloud solutions are particularly appealing. The Mac-native Buzz app enhances user experience with features like audio playback and transcript editing.

Example use


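# Installation
pip install buzz-captions

# Command-line example
python -m buzz transcribe audio-file.mp3

Here the CLI transcribes a file completely offline, and installation is a single pip command. The exact subcommands and flags depend on the installed Buzz version, so consult the CLI section of the Buzz documentation if the command above is not recognized.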

4. whisperX by m-bain

What it does

WhisperX, developed by m-bain, is a cutting-edge extension of OpenAI's Whisper model, enhancing it with advanced features like word-level timestamps and speaker diarization. This project stands out for its ability to provide fast and accurate automatic speech recognition, which is crucial for applications requiring detailed and precise transcriptions.

Project activity and maintenance

Active with 336 commits, with the most recent update in November 2023.

Why we like it

WhisperX stands out for its detailed audio transcription capabilities. The addition of speaker diarization and word-level timestamps makes it invaluable for tasks requiring high transcription precision.

Example use

To use whisperX from its GitHub repository, follow these steps:

Step 1: Setup environment

Ensure you have Python 3.10 and PyTorch 2.0 installed. If you plan to run on a GPU, you'll also need NVIDIA libraries such as cuBLAS 11.x and cuDNN 8.x. For example, you can create a Python environment with Conda as follows; see the whisperX repository on GitHub for more details:


bash
conda create --name whisperx python=3.10
conda activate whisperx
conda install pytorch==2.0.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia
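
Before installing whisperX, it can help to confirm that the environment matches the requirements listed above. The following is a small sketch, not part of whisperX: `check_environment` is a hypothetical helper that reports whether the interpreter is at least Python 3.10, whether PyTorch can be found, and whether PyTorch can see a CUDA device.

```python
import importlib.util
import sys
from typing import Dict, Tuple


def check_environment(min_python: Tuple[int, int] = (3, 10)) -> Dict[str, bool]:
    """Report basic facts about the current Python environment."""
    report = {
        "python_ok": sys.version_info[:2] >= min_python,
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch  # imported only when present, so the check never crashes

            report["cuda_available"] = bool(torch.cuda.is_available())
        except ImportError:
            report["torch_installed"] = False
    return report


print(check_environment())
```

If `cuda_available` is reported as false, whisperX can still be run on the CPU, although transcription will typically be slower.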

Step 2: Install whisperX using pip.

If you want the latest version, you can directly install from the GitHub repository:


 pip install git+https://github.com/m-bain/whisperx.git

Step 3: Once installed, you can use whisperX for transcribing audio files via the command line

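Here's an example command that transcribes a file with the base model, enables speaker diarization, and highlights words in the subtitle output. Diarization relies on gated pyannote models, so you may also need to pass a Hugging Face access token with the --hf_token option: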


whisperx sample01.wav --model base --diarize --highlight_words True

Alternatively, you can import whisperx in Python.
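
As a rough sketch of the Python route, the function below follows the transcribe-then-align flow shown in the whisperX README (`load_model`, `load_audio`, `transcribe`, `load_align_model`, `align`). The default arguments (`"base"`, `"cpu"`, `"int8"`, batch size 8) are assumptions chosen for a machine without a GPU, and `format_words` is a hypothetical helper added here to print word-level timestamps; it assumes the result layout of segments containing `words` entries with `word`, `start`, `end`, and an optional `speaker` key.

```python
from typing import Any, Dict, List


def transcribe_with_word_timestamps(
    audio_path: str,
    model_name: str = "base",
    device: str = "cpu",
    compute_type: str = "int8",
    batch_size: int = 8,
) -> Dict[str, Any]:
    """Transcribe a file, then align the text to obtain word-level timestamps."""
    import whisperx  # imported lazily so format_words works without whisperX installed

    model = whisperx.load_model(model_name, device, compute_type=compute_type)
    audio = whisperx.load_audio(audio_path)
    result = model.transcribe(audio, batch_size=batch_size)

    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=device
    )
    return whisperx.align(
        result["segments"], align_model, metadata, audio, device,
        return_char_alignments=False,
    )


def format_words(result: Dict[str, Any]) -> List[str]:
    """Turn an aligned result into one printable line per timed word."""
    lines = []
    for segment in result.get("segments", []):
        for word in segment.get("words", []):
            start, end = word.get("start"), word.get("end")
            if start is None or end is None:
                continue  # skip words that carry no timestamps
            speaker = word.get("speaker", "UNKNOWN")
            lines.append(f"[{start:7.2f} -> {end:7.2f}] {speaker}: {word['word']}")
    return lines
```

Calling `format_words(transcribe_with_word_timestamps("sample01.wav"))` would then give one line per word. Speaker labels only appear if a diarization step has been run on the result, which this sketch leaves out.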

Use cases

Ideal for detailed audio transcription tasks where distinguishing between speakers or precise timing is crucial.

5. distil-whisper by huggingface

What it does

Distil-Whisper is a distilled version of the OpenAI Whisper model, developed by Hugging Face. It is designed to provide fast, efficient speech recognition while maintaining high accuracy. The distilled model is notably faster and smaller than the original Whisper model, making it well suited to low-latency or resource-constrained environments. Distil-Whisper can also serve as an assistant to the main Whisper model in speculative decoding: the small model drafts tokens and the large model verifies them, which accelerates inference while producing the same outputs as the main model.

Project activity and maintenance

Active, started late 2023. 

Example use


# Python example using Hugging Face Transformers
from transformers import pipeline

model_id = "distil-whisper/distil-large-v2"

asr_pipeline = pipeline("automatic-speech-recognition", model=model_id)
transcription = asr_pipeline("path/to/audio/file.mp3")
print(transcription["text"])

Features and uses

  • Nature: This is a more efficient, distilled version of Whisper; however, as of this writing, it only supports English.
  • Efficiency: Offers faster inference and smaller size while maintaining similar accuracy to Whisper.
  • Robustness: Performs well in low signal-to-noise scenarios and shows fewer word duplicates and lower insertion error rates than Whisper.
  • Application: Can be used as an assistant model to Whisper, providing a faster alternative for time-critical applications like live transcription or real-time translation.
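
To make the assistant-model idea concrete, here is a toy sketch of greedy speculative decoding. It does not use Whisper or Distil-Whisper at all: `main_model` and `draft_model` are made-up deterministic functions that map a token prefix to the next token, standing in for the large and small models. The point it illustrates is that the draft proposes several tokens, the main model checks them, and the final output is identical to what the main model would produce on its own.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]
EOS = 0  # toy end-of-sequence token


def greedy_decode(model: NextToken, prompt: Sequence[int], max_new_tokens: int) -> List[int]:
    """Plain greedy decoding: one model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
        if tokens[-1] == EOS:
            break
    return tokens


def speculative_decode(
    main: NextToken, draft: NextToken, prompt: Sequence[int], max_new_tokens: int, k: int = 4
) -> List[int]:
    """Greedy speculative decoding: the draft proposes up to k tokens, the main model verifies."""
    tokens = list(prompt)
    limit = len(tokens) + max_new_tokens
    while len(tokens) < limit:
        # 1. The draft model proposes a short continuation.
        proposal: List[int] = []
        for _ in range(min(k, limit - len(tokens))):
            proposal.append(draft(tokens + proposal))
            if proposal[-1] == EOS:
                break
        # 2. The main model checks each proposed token (a real system does this
        #    in a single batched forward pass). At the first disagreement, keep
        #    the agreed prefix and substitute the main model's own token.
        for i, guess in enumerate(proposal):
            expected = main(tokens + proposal[:i])
            if expected != guess:
                proposal = proposal[:i] + [expected]
                break
        tokens.extend(proposal)
        if tokens[-1] == EOS:
            break
    return tokens


def main_model(prefix: Sequence[int]) -> int:
    """Toy stand-in for the large model."""
    return (3 * sum(prefix) + len(prefix)) % 11


def draft_model(prefix: Sequence[int]) -> int:
    """Toy stand-in for the small model: agrees with the main model most of the time."""
    guess = main_model(prefix)
    return (guess + 1) % 11 if len(prefix) % 3 == 0 else guess


baseline = greedy_decode(main_model, [2, 5], max_new_tokens=20)
fast = speculative_decode(main_model, draft_model, [2, 5], max_new_tokens=20)
print(baseline == fast)
```

The speed-up in practice comes from the main model verifying a whole proposal in one pass instead of generating token by token. In the Hugging Face Transformers library, this pairing is usually set up by passing the distilled model as the `assistant_model` argument at generation time; check the library documentation for the exact usage in your version.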

Key benefits

Its strong performance across different environments adds to its versatility in both professional and academic settings.

Why we like it

Distil-whisper strikes a balance between efficiency and accuracy, making it suitable for time-critical applications. Its robust performance in various environments adds to its appeal in both professional and academic settings.

Conclusion

The Whisper-inspired projects described above demonstrate the versatility of speech recognition technology in various programming environments. From enhancing web applications with real-time transcription to creating private, offline transcription tools, these projects offer programmers a wealth of possibilities for application development. This guide aims to provide a starting point for exploring these technologies, empowering programmers to integrate advanced speech recognition into their solutions.

At Gladia, we build an enhanced version of Whisper in the form of a single API, optimized for enterprise-grade projects. If you’re curious to know more about the difference between our API and vanilla Whisper, feel free to check the landing page for our latest model, Whisper-Zero. You may also want to check out this blog post on the key factors to consider when choosing the open-source Whisper route vs. using an all-batteries-included API. To try the API, sign up for free below.

Footnotes

GitHub repositories cited in the article:

https://github.com/ggerganov/whisper.cpp

https://github.com/chengsokdara/use-whisper

https://github.com/chidiwilliams/buzz

https://github.com/m-bain/whisperX

https://github.com/huggingface/distil-whisper
