How to summarize audio using Whisper ASR and GPT 3.5
Published on Nov 6, 2023
From online meetings to voice memos and media content, the amount of audio data generated by companies daily is as vast as it is valuable.
However, listening to hours of audio content to extract key information is impractical. To enhance productivity and optimize workflows, the capability to quickly generate brief summaries from audio data is crucial.
Luckily, recent advances in automatic speech recognition (ASR) and natural language processing (NLP) have made these technologies robust and accessible enough to make custom audio summarization possible for a range of projects.
In this step-by-step guide, we explain how to summarize audio using OpenAI's Whisper ASR for transcription and GPT 3.5 to generate prompt-based summaries. We will explain how the different components work, and, most importantly, show you how to build your summarization API. We will also present Gladia as an enterprise-grade alternative to the Whisper API for transcription and demonstrate how to use it.
What is speech recognition?
Speech recognition is the process by which an AI system converts spoken words, in any supported language, into readable text. Today, speech recognition systems such as Whisper ASR by OpenAI are trained using a sequence-to-sequence learning approach, with an architecture that consists of an encoder block and a decoder block.
More specifically, a standard automatic speech recognition (ASR) system receives audio input from a recorded audio file, then processes the digital audio signal to extract the acoustic features (spectrograms) relevant to speech recognition. The resulting spectrograms are passed into an acoustic model (encoder), whose output is then mapped to a sequence of words, usually a direct transcription of the input audio file.
The encoder used in ASR is usually implemented using deep learning methods such as Long Short-Term Memory (LSTM) networks or Convolutional Neural Networks (CNN). The decoder in the ASR system is responsible for predicting the most probable transcription using the output of the encoder. To learn more, have a look at our recent article on how speech-to-text systems work.
Whisper ASR, state-of-the-art model
Whisper ASR was introduced as an open-source model by OpenAI in 2022, and was made available as an API in 2023. The model was trained on 680,000 hours of labeled audio (audio paired with the corresponding text) and can be used for tasks such as identifying the language spoken in an audio file, transcribing an audio file into text, and translating the speech into another language.
With its transformer architecture based on sequence-to-sequence learning, it is considered among the best open-source ASR engines ever built.
Gladia's API is powered by Whisper ASR and was designed as a more optimized and feature-rich version of the model, adapted to enterprise scale and needs.
Our API was developed to overcome the challenges and limitations of vanilla Whisper, such as hallucinations, long inference time, and usage limitations.
In this tutorial, we will demonstrate how to use both the original Whisper API and Gladia’s enterprise version thereof to transcribe audio, and then summarize the transcript using GPT-3.5.
GPT 3.5 and text summarization
Text summarization is the process of extracting and piecing together useful information from a long text to form a shorter text that retains the main discussion points or action items, while being easier to read and understand.
Types of summarization
When summarizing a text, there are two major methods involved [1]:
Extractive summarization: This type of summarization is closest to the definition above. In natural language processing, it involves identifying the most important sentences in the text that contribute to the main idea and piecing them together to form a shorter text.
Abstractive summarization: This technique uses natural language processing to understand the general idea of a text. The text is then rewritten in a different and shorter way while still maintaining the original idea of the full text.
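To make the extractive approach more concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not the method used later in this tutorial: the function name extractive_summary and the word-frequency scoring are our own assumptions, and real systems use far more sophisticated ranking.

```python
import re
from collections import Counter


def extractive_summary(text: str, max_sentences: int = 2) -> str:
    """Toy extractive summarizer: keep the highest-scoring sentences."""
    # Split the text into sentences after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Count word frequencies; skipping very short words is a crude stop-word filter
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence: str) -> int:
        # A sentence scores higher when it contains frequently used words
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    # Pick the top-scoring sentences, then restore their original order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


notes = (
    "The budget meeting covered the budget for next year. "
    "Lunch was good. "
    "The budget was approved by the team."
)
summary = extractive_summary(notes, max_sentences=2)
print(summary)
```

Note that the summary only reuses sentences from the input, which is what distinguishes it from an abstractive summary. This simple scoring also favors longer sentences, one of the reasons it should be treated as a toy.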
GPT 3.5 is a large language model (LLM) developed by OpenAI and trained on a very large corpus of text. It can perform a wide range of generation tasks, such as text generation, code generation, and text summarization. The model can also follow specific instructions given to it. For example, if you ask it to provide an abstractive summary of a meeting recorded in text format, it will follow those instructions.
In that case, the model follows the instructions we fed into it and returns a generated text that accurately reflects them. This process of specifying instructions to the GPT 3.5 model to generate text in the way you want is known as prompt engineering. This will be useful to us later in this tutorial when we build our application. Also, check our blog for more examples of summarization prompts for virtual meetings.
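As a rough illustration of prompt engineering, the sketch below builds the list of messages for a chat-style model from a transcript and a chosen summarization style. The helper name build_summary_messages and the wording of both instructions are hypothetical; only the role and content keys mirror the format used later in this tutorial.

```python
def build_summary_messages(transcript: str, style: str = "abstractive") -> list:
    """Build a chat message list asking for a given style of summary."""
    # Hypothetical instruction texts, one per summarization style
    instructions = {
        "abstractive": "Rewrite the main ideas of the transcript in your own words as a short summary.",
        "extractive": "Select the most important sentences from the transcript and return them verbatim.",
    }
    if style not in instructions:
        raise ValueError(f"unknown style: {style}")
    # The instruction and the transcript are combined into a single prompt
    prompt = f"{instructions[style]}\n\nTranscript:\n{transcript}"
    return [{"role": "system", "content": prompt}]


messages = build_summary_messages("Alice: the launch moves to May.", style="extractive")
print(messages[0]["content"])
```

Changing only the instruction text changes the kind of summary the model is asked to produce, while the transcript stays the same.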
Prerequisites
In this article, we will be building an audio summarization API, and it is assumed that you already have a foundation in the Python programming language. First, we will learn how to transcribe audio using the Whisper API and the alternative, Gladia. Then, we will learn how to summarize the transcript using the GPT-3.5 API, after which we will learn how to build a unified endpoint for both transcription and summarization.
You will need the following to code along with the code examples shown in this tutorial:
1. An Integrated Development Environment (IDE) that supports Python e.g. VSCode
If you have already tried to implement an audio summarization system similar to the one discussed in this tutorial using Whisper, GPT 3.5, and FastAPI, one problem you might have faced is a name attribute error that occurs when you try to upload the audio file using FastAPI’s UploadFile type and pass it to the Whisper API. We will address this issue in the next section.
Note: Here's the GitHub Repository containing all the code used in this tutorial.
How to transcribe audio using Whisper ASR Webservice
OpenAI provides two ways to access the pre-trained Whisper model. The first is a free, open-source package, available on GitHub, that lets you load the model directly onto your own system. However, the computational cost of running the open-source package increases with the size of the model used, which is why this tutorial uses the second option: the paid API, a more efficient solution.
First, we must install the required packages using the pip package manager. The requirements.txt file is available on GitHub. Note that the code in this tutorial relies on the openai.Audio and openai.ChatCompletion interfaces, which are only available in versions of the openai package earlier than 1.0.
pip install -r requirements.txt
Once the packages are successfully installed, you need to create an API key from OpenAI’s website. This is an essential step to access the model and get charged appropriately per usage. Once you have created the API key, store it in a variable that can be used in the code. Note that it is best practice to use a .env file to store your API keys, but for the sake of this tutorial, we will use variables.
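If you prefer not to hard-code keys, one possible approach is to read them from environment variables with the standard library, as sketched below. The helper name load_api_key and the variable names OPENAI_API_KEY and GLADIA_API_KEY are assumptions made for illustration; loading them from a .env file would additionally require a package such as python-dotenv.

```python
import os


def load_api_key(name: str) -> str:
    """Read an API key from an environment variable, failing loudly if it is missing."""
    key = os.environ.get(name, "")
    if not key:
        raise RuntimeError(f"Environment variable {name} is not set")
    return key


# Hypothetical usage, assuming both variables were exported in your shell:
# openai_api_key = load_api_key("OPENAI_API_KEY")
# gladia_key = load_api_key("GLADIA_API_KEY")
```

Failing early with a clear message is usually easier to debug than an authentication error returned later by the API.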
The next step is shown in the code sample below. We name this Python file main.py and we import the necessary libraries to be used and the API keys.
import io
import openai
import uvicorn
from fastapi import FastAPI, UploadFile
from openai import Audio
# Store your API keys in variables (prefer a .env file outside of this tutorial)
openai_api_key = ''
openai.api_key = openai_api_key
gladia_key = ''
When trying to create endpoints using FastAPI, the first important step is to set up the server. To do this, you need to initialize an instance of the FastAPI class, which will be used to define the routes for each endpoint that is created.
# Initialize instance of FastAPI class
app = FastAPI()
# Initialize Audio object
audio_object = Audio()
We create an asynchronous function called ‘whisper’ with a route /whisper-transcribe/ and declare a parameter "file" with an UploadFile type. We define ‘UploadFile’ as the type of the file parameter to inform the app that the whisper function expects a file upload.
# Define route
@app.post('/whisper-transcribe/', status_code=200)
async def whisper(file: UploadFile):
    # some code here
    pass
The next step involves reading the uploaded file into a BytesIO object. You may be wondering why we do this instead of simply reading the file with await file.read(). The transcribe method of the Audio() object expects the object passed into its file parameter to have a name attribute, which the raw bytes returned by await file.read() do not provide. To work around this, wrap the bytes in a BytesIO object and assign the uploaded file's filename to its name attribute.
@app.post('/whisper-transcribe/', status_code=200)
async def whisper(file: UploadFile):
    # Read the uploaded audio into BytesIO
    audio_file = io.BytesIO(await file.read())
    # Assign the filename to the name attribute of the BytesIO object
    audio_file.name = file.filename
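The workaround can be tried in isolation, without FastAPI or the Whisper API. The short sketch below uses placeholder bytes instead of real audio, and the helper name to_named_buffer is hypothetical.

```python
import io


def to_named_buffer(data: bytes, filename: str) -> io.BytesIO:
    """Wrap raw bytes in a file-like object that carries a name attribute."""
    buffer = io.BytesIO(data)
    # BytesIO instances accept extra attributes, so the name can be set directly
    buffer.name = filename
    return buffer


raw = b"placeholder audio bytes"
audio_file = to_named_buffer(raw, "meeting.wav")

print(hasattr(raw, "name"))  # False: plain bytes carry no filename
print(audio_file.name)       # meeting.wav
```

The resulting object can still be read like a file, and it now carries a filename for any client that looks for one.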
Then, you can create an instance of Audio and pass the model name, file, and API key as parameters into the transcribe method, as shown in the code example below.
@app.post('/whisper-transcribe/', status_code=200)
async def whisper(file: UploadFile):
    # Read the uploaded audio into BytesIO
    audio_file = io.BytesIO(await file.read())
    # Assign the filename to the name attribute of the BytesIO object
    audio_file.name = file.filename
    # Call the transcribe method with the audio_file
    transcript = audio_object.transcribe(model='whisper-1',
                                         file=audio_file,
                                         api_key=openai_api_key)
    # Return the final transcription
    return transcript
Before testing any new endpoint we create, make sure to add the following lines of code at the end of the file.
if __name__ == '__main__':
    uvicorn.run("main:app", port=6760, log_level="info")
We can test the new endpoint by opening a terminal in the directory where the main.py file is located. Once the terminal is launched, run the command below:
python -m main
Once the server is running, open its URL in a browser (http://127.0.0.1:6760 with the settings above) and add /docs to the end to open Swagger UI, an interface for testing endpoints.
To proceed, select the /whisper-transcribe/ endpoint to reveal more information. Then, select ‘Try it out’. This will reveal an option to upload a file, which in the case of this tutorial is an audio recording.
From the response body, we can deduce that the Whisper API can accurately transcribe the text from the audio file we uploaded to test the endpoint created.
How to transcribe audio using Gladia API
As an alternative to using open-source Whisper, Gladia provides a plug-and-play, production-ready version of Whisper ASR, which can be called with just the requests library in Python.
Here’s a code sample on how to use Gladia API for transcription. The requests library is added to the main.py file to make request calls to Gladia’s API.
import os
import requests # the new import
from fastapi import FastAPI, UploadFile
from openai import Audio
Next, we define a route named gladia using FastAPI and assign it to an asynchronous function defined with a parameter of type UploadFile. When using our API, you can either pass in an audio file or a URL to where the file is located.
In this tutorial, we will be passing in an audio file. To do this, our API expects three parameters: the name of the audio file, the audio file itself, and its content type. The UploadFile type provides access to the name using the .filename attribute and the content type using the .content_type attribute.
@app.post('/gladia/')
async def gladia(file: UploadFile):
    # read the uploaded file
    audio_file = await file.read()
    # set the filename
    filename = file.filename
    # set the content type
    content_type = file.content_type
Subsequently, we define a header parameter using the API key, and a file parameter using the three attributes: filename, file, and content type.
@app.post('/gladia/', status_code=200)
async def gladia(file: UploadFile):
    # read the uploaded file
    audio_file = await file.read()
    # set the filename
    filename = file.filename
    # set the content type
    content_type = file.content_type
    # Define API key as a header
    headers = {
        'x-gladia-key': f'{gladia_key}'
    }
    # Declare filename, file, and content type as a file for the API
    files = {'audio': (filename, audio_file, content_type)}
Once everything needed to make a request is defined, we can send a request from which we will receive a JSON response. To retrieve the full transcription from the response, we need to loop through the response dictionary and retrieve each transcription.
@app.post('/gladia/')
async def gladia(file: UploadFile):
    # see code above
    # Pass the headers and files and send as a request
    response = requests.post('https://api.gladia.io/audio/text/audio-transcription/', headers=headers, files=files)
    # Decode the JSON body of the response into a dictionary
    result = response.json()
    # Retrieve each transcribed sentence from the resulting dictionary
    sentences = [item["transcription"] for item in result["prediction"]]
    # Join the sentences together to form the full transcript
    transcript = " ".join(sentences)
    return transcript
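To see what the parsing step does without calling the API, the sketch below runs it on a hand-written dictionary. The sample payload is invented for illustration and only includes the prediction and transcription keys used in the code above; a real response contains more fields. The helper name join_transcript is hypothetical.

```python
def join_transcript(payload: dict) -> str:
    """Join the transcribed sentences of a decoded JSON response into one string."""
    # Fall back to an empty list if the response has no "prediction" key
    sentences = [item["transcription"] for item in payload.get("prediction", [])]
    # Drop empty entries and surrounding whitespace before joining
    return " ".join(s.strip() for s in sentences if s.strip())


# Invented sample payload, shaped like the response parsed above
sample = {
    "prediction": [
        {"transcription": "Hello everyone."},
        {"transcription": "Let's get started."},
    ]
}
print(join_transcript(sample))  # Hello everyone. Let's get started.
```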
To test the endpoint, simply run python -m main in your opened terminal.
The procedure for obtaining a response has been detailed above in ‘How to transcribe audio using Whisper ASR Webservice’.
In contrast to Whisper's API, which charges users for each request, Gladia offers a free plan that allows users to use our API for 10 hours per month without any restrictions on the many features we offer, such as batch transcription, speaker diarization, word-level timestamps, and live transcription.
To learn more about our free, pay-as-you-go and enterprise plans, please visit our pricing page.
Whisper ASR summarization using GPT 3.5
For the audio summarization system developed in this tutorial, we will be making use of OpenAI’s GPT 3.5 API for summarization.
To use the GPT 3.5 API to summarize the transcripts produced by Whisper ASR, we define an endpoint named /summarize/ and define a Python function with a string parameter transcript that expects an audio transcript.
Earlier in the tutorial, we discussed prompt engineering and how it can be used to instruct the model on a desired output for a corresponding input text. In this code, we give the AI several instructions to summarize the transcript while accounting for and correcting mistakes in the transcript. We also made sure to instruct the model to avoid adding unnecessary information to its generated text.
Next, we initialize a method of the ChatCompletion class from the openai package. In this method, we define the GPT 3.5 model in the model parameter and pass a list containing a dictionary with the keys: role and content. The role is set to system to let the model know that the content being provided is a system-level instruction.
According to OpenAI ‘a system level instruction is used to guide your model's behavior throughout the conversation’. Learn more about the parameters available here.
@app.post('/summarize/', status_code=200)
def summarize_gpt(transcript: str):
    prompt = f"""
    You are an AI agent given the sole task of summarizing an audio transcript which can either be of poor or good quality. The transcript generated from the audio file is given below.
    {transcript}.
    If the transcript is of poor quality or some words have been poorly transcribed, make sure to guess what the word is supposed to be and return a concise summary that contains all the important information from the transcript.
    Make sure that you only provide a summary of the conversation and nothing else. Don't add any additional words that aren't part of the summary.
    """
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo-0301',
        messages=[
            {"role": "system",
             "content": prompt}
        ],
        temperature=0.0
    )
    return response.choices[0].message.content
How to create an automatic workflow of the system using FastAPI
An automatic system will be defined as one that allows for audio upload using a FastAPI endpoint, passes the audio into Whisper for transcription, and immediately passes the transcription into GPT 3.5 for summarization which is returned to the end user.
To do this, you will make small adjustments to the code in the whisper.py, gladia.py, and summarize.py files. The adjustments consist of converting the endpoints in each of the files into standalone Python functions.
File structure
The first step before making the adjustments involves moving all 3 files into a utils folder, while you also create a main.py file outside the utils folder.
Your file structure should look like this when you are done.
- whisper-summarizer
  - main.py
  - requirements.txt
The main folder, whisper-summarizer, should contain all subfolders and files. The requirements.txt file should look like the image below. The purpose of this file is to specify all the libraries or packages that need to be installed when working in a virtual local or cloud environment.
To make our app fully functional, we will add standalone functions for each of the services being used, e.g. Gladia, Whisper, and GPT-3.5.
Add a standalone function for Whisper to the main.py file
def whisper_transcribe(audio_file):
    # Initialize Audio object
    audio_object = Audio()
    # Call the transcribe method with the audio_file
    transcript = audio_object.transcribe(model='whisper-1',
                                         file=audio_file,
                                         api_key=openai_api_key)
    # Return the final transcription
    return transcript
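A standalone function for Gladia could follow the same pattern. The version below is a sketch assembled from the Gladia endpoint code shown earlier, not code taken from the repository: the function name gladia_transcribe and its parameters are assumptions, and the HTTP call is passed in as http_post (for example requests.post) so that the function can be exercised without network access.

```python
GLADIA_URL = "https://api.gladia.io/audio/text/audio-transcription/"


def gladia_transcribe(audio_bytes, filename, content_type, api_key, http_post):
    """Send audio to Gladia and return the joined transcript.

    http_post is any callable with the same calling convention as
    requests.post (an assumption made to keep this sketch testable).
    """
    # Define the API key as a header
    headers = {"x-gladia-key": api_key}
    # Declare filename, file content, and content type as a file for the API
    files = {"audio": (filename, audio_bytes, content_type)}
    response = http_post(GLADIA_URL, headers=headers, files=files)
    # Raise an exception on HTTP error codes instead of parsing an error body
    response.raise_for_status()
    payload = response.json()
    # Join the transcribed sentences into the full transcript
    return " ".join(item["transcription"] for item in payload["prediction"])
```

In the application, this could be called as gladia_transcribe(audio_file, filename, content_type, gladia_key, requests.post), using the values read from the uploaded file.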
Add a standalone function for GPT-3.5
def summarize_gpt(transcript: str):
    prompt = f"""
    You are an AI agent given the sole task of summarizing an audio transcript which can either be of poor or good quality. The transcript generated from the audio file is given below.
    {transcript}.
    If the transcript is of poor quality or some words have been poorly transcribed, make sure to guess what the word is supposed to be and return a concise summary that contains all the important information from the transcript.
    Make sure that you only provide a summary of the conversation and nothing else. Don't add any additional words that aren't part of the summary.
    """
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo-0301',
        messages=[
            {"role": "system",
             "content": prompt}
        ],
        temperature=0.0
    )
    return response.choices[0].message.content
Creating a single endpoint for an automatic workflow
The code samples we provided above show you how to convert the endpoints in the three files into standalone functions by removing the lines of the code that had to do with uploading and reading a file. This was done to create a unified endpoint that works for both the transcriber and summarizer and handles all forms of data input, such as audio upload or passing in a string.
To proceed with creating a unified endpoint, we will update the main.py file. As before, we start by importing the libraries we will use, as seen below.
import io
import openai
from fastapi import FastAPI, File, UploadFile
from fastapi.middleware.cors import CORSMiddleware
from openai import Audio
import requests
import uvicorn
The next step involves creating the server. A description is added to tell users about the purpose of the API and we have also titled the API as Whisper ASR Summarizer. We have also added a middleware using CORS for cross-communication between the Whisper ASR API and a frontend server.
The middleware accepts various parameters such as
1. Origin, which is used to specify the origins (combinations of protocol, domain, and port) that can communicate with the Whisper ASR API. Here we set it to “*”, which means all origins are allowed to communicate with the backend.
2. Method, which is used to specify the HTTP methods that are allowed for communication with the backend. Here we set it to ‘POST’ and ‘GET’.
3. Header, which is used to specify the HTTP headers that are allowed for communication with the backend. Here we set it to Content-Type since we’ll be dealing with file uploads.
description = """
The Whisper ASR Summarizer is an API that allows you to upload audio files
and automatically provides you with a summarized version of the audio in text format.
"""
# Creating the server
app = FastAPI(
    title='Whisper ASR Summarizer',
    description=description,
    summary='Summarize audio'
)
# Initializing parameters for middleware
origins = ["*"]
methods = ["POST", "GET"]
headers = ["Content-Type"]
app.add_middleware(
    CORSMiddleware,
    allow_origins=origins,
    allow_credentials=True,
    allow_methods=methods,
    allow_headers=headers
)
To transcribe using the Whisper ASR API, we first create an endpoint named /upload-audio-whisper/ and an asynchronous function. We then specify the file parameter to be of type UploadFile and add a description to explain what the parameter expects.
Next, we read the uploaded file, set its name with BytesIO, and pass it into the whisper_transcribe function to obtain a transcription. The transcript is then passed into the summarize_gpt function to obtain a summary, and both the transcript and summary are returned.
@app.post('/upload-audio-whisper/')
async def transcribe_summarize_whisper(file: UploadFile):
    audio_file = io.BytesIO(await file.read())
    audio_file.name = file.filename
    # call function to transcribe using whisper
    transcript = whisper_transcribe(audio_file)
    # call function to summarize using gpt
    summary = summarize_gpt(transcript)
    return {
        'Transcript': transcript,
        'Summary': summary
    }
The steps for transcribing using Gladia are the same as above. First, we read the file. Then, we pass the uploaded file into the Gladia function. Next, we parse the output to retrieve the full sentences. Finally, we pass the resulting transcript into the GPT function to get a summary.
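Because both workflows share the same shape, the chaining logic can be sketched once and reused with either transcriber. In the example below, the function name run_pipeline is hypothetical and the two lambdas are stand-ins for the real transcription and summarization functions, so the sketch runs without any API key.

```python
from typing import Callable, Dict


def run_pipeline(audio,
                 transcribe: Callable[[object], str],
                 summarize: Callable[[str], str]) -> Dict[str, str]:
    """Transcribe the audio, summarize the transcript, and return both."""
    transcript = transcribe(audio)
    summary = summarize(transcript)
    return {"Transcript": transcript, "Summary": summary}


# Stand-ins for the real transcriber and summarizer, for illustration only
result = run_pipeline(
    b"placeholder audio bytes",
    transcribe=lambda audio: "We agreed to ship on Friday. Bob will write the notes.",
    summarize=lambda text: text.split(". ")[0] + ".",
)
print(result["Summary"])  # We agreed to ship on Friday.
```

Inside an endpoint, the stand-ins would be replaced by the Whisper or Gladia transcription function and by summarize_gpt.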
If you wish to deploy your Webservice, here is a tutorial on how to do so using Render.
Conclusion
In this tutorial, we have shown how Gladia and Whisper can be used to generate audio transcriptions, with GPT-3.5 used subsequently for different types of text generation, while leveraging FastAPI to build an API for audio summarization.
To learn about the many features Gladia provides for audio transcription, visit our developer documentation.
Footnotes
[1] Dutta, S., Das, A. K., Ghosh, S., & Samanta, D. (2022). Data analytics for social microblogging platforms. Elsevier.