One of the major obstacles for speech-to-text AI has been identifying individual speakers in a multi-speaker audio stream before transcribing the speech. This is where speaker separation, also known as diarization, comes into play.
Thanks to the latest advances in ASR, diarization mechanisms have evolved significantly over the past decade, from simple acoustic-based speaker recognition to a sophisticated dual-model approach based on embeddings that capture key identifying information about each speaker.
Diarization is a core feature of Gladia’s Speech-to-Text API powered by optimized Whisper ASR for companies. By separating out different speakers in an audio or video recording, the feature makes transcripts easier to read, summarize, and analyze.
In this blog, we dive into the mechanics of speaker diarization, present its use cases, and explain how our state-of-the-art diarization API – which has just undergone a major update for improved speed and accuracy – is designed to address those.
What is speaker diarization?
In speech recognition, diarization is the process of automatically partitioning an audio recording into segments that correspond to different speakers. This is done by using various techniques to distinguish and cluster segments of an audio signal according to the speaker's identity.
The result is a transcript that is split neatly into different sections, each corresponding to a different speaker, as shown below.
Key challenges of speaker diarization
Diarization, while a powerful tool for speaker identification, faces several common challenges and constraints that need to be taken into account when picking the right ASR diarization engine for one’s use case.
One common challenge is having overlapping speakers in an audio recording, which can significantly hinder voice separation – we explore this issue in more detail later in the article. Moreover, issues like high VRAM usage associated with diarization exert a strain on computational resources, impacting performance.
Prolonged inference times, especially in real-time applications or with large audio files, can also hamper the system's responsiveness. Addressing these challenges is pivotal for improving diarization efficiency in applications that rely on transcription and/or voice-controlled systems.
In ASR generally, and diarization specifically, it’s a matter of striking the right balance between the key variables. Building an engine that is scalable may compromise performance, measured with diarization error rate (DER) percentage as a key metric.
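To make the metric concrete: DER is conventionally computed as the time lost to false alarms, missed detections, and speaker confusion, divided by the total duration of reference speech. Here is a minimal sketch of that arithmetic; the function name and the durations are made up for illustration.

```python
def diarization_error_rate(false_alarm, missed_detection, confusion, total_speech):
    """Return DER as a fraction of the total reference speech time (all values in seconds)."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed_detection + confusion) / total_speech


# Illustrative numbers only: a 300-second reference with 30 seconds of errors.
der = diarization_error_rate(
    false_alarm=6.0,        # non-speech labelled as speech
    missed_detection=9.0,   # speech that was not detected
    confusion=15.0,         # speech attributed to the wrong speaker
    total_speech=300.0,
)
print(f"DER = {der:.1%}")
```

A lower DER is better. Note that the value can exceed 100% when false alarms are frequent, since they are not bounded by the reference speech duration.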
A dive into the inner workings of diarization
Before the widespread adoption of deep learning techniques in ASR, legacy diarization systems relied on traditional signal processing, feature engineering, and machine learning algorithms. These methods required more manual intervention and a lot of data, delivered limited performance, and generalized poorly to out-of-domain data.
By contrast, modern diarization models are distinguished by the dual-model approach, employing two separate models for segmentation and speaker embeddings.
Using two distinct models, each tailored to its own task, lets each model focus on what matters most for that task. Moreover, given that diarization engines are notoriously data-hungry, the dual-model approach reduces the amount of data required to yield accurate results.
There are a few open-source libraries that do this – with Pyannote as a model of reference in the field [1], which we’ll use as the foundation to explain how state-of-the-art diarization works today.
Here are the key steps of the diarization process:
1. Speech activity detection
2. Segmentation
3. Speaker embeddings
4. Clustering
Speech activity detection
In order for diarization to work, the first step is to convert the input audio signal into a format optimized for analysis. This involves extracting relevant features, often using techniques like Mel-frequency cepstral coefficients (MFCCs), and isolating parts of the audio without silences with the help of voice activity detection (VAD) technology. Doing this enables us to reduce the number of errors in the segment-labelling process that follows.
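To give a feel for what VAD does, here is a deliberately simple sketch that flags frames whose energy exceeds a fixed threshold. This is an assumption made for illustration: production systems, including the models discussed in this article, rely on trained neural networks rather than an energy threshold, and the frame size and threshold below are arbitrary.

```python
import math


def frame_energies(samples, frame_size):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        energies.append(sum(x * x for x in frame) / frame_size)
    return energies


def detect_speech(samples, frame_size=160, threshold=0.01):
    """Return one boolean per frame: True where the energy exceeds the threshold."""
    return [energy > threshold for energy in frame_energies(samples, frame_size)]


# Synthetic 16 kHz signal: 10 ms of silence, 20 ms of a 440 Hz tone, 10 ms of silence.
sample_rate = 16000
silence = [0.0] * 160
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(320)]
signal = silence + tone + silence

speech_mask = detect_speech(signal)
print(speech_mask)  # only the two middle frames are flagged as active
```

Only the frames flagged as active would be passed on to the next stage.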
Segmentation
As the next step, we want to separate the processed audio into chunks of audio corresponding to who’s speaking – but keep in mind that at this stage, we still have no person-specific information regarding those speakers!
That’s where our segmentation model comes into play – it’s trained specifically for the task of detecting when the voice of the speaker is replaced by that of another.
Instead of applying the model to the entire audio, which would be computationally intensive and strain the model significantly, we run the segmentation on small overlapping windows, lasting a few seconds each and offset from one another by a short step. Windowing lets us split the work effectively, keeping the model focused on shorter, more manageable tasks.
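The windowing itself is straightforward to sketch. The helper below is hypothetical, and the window length and step are illustrative values rather than the settings of any particular engine.

```python
def sliding_windows(duration, window=5.0, step=0.5):
    """Return (start, end) times in seconds for overlapping windows covering the audio."""
    windows = []
    index = 0
    while True:
        start = index * step  # multiply instead of accumulating to limit float drift
        end = min(start + window, duration)
        windows.append((start, end))
        if start + window >= duration:
            break
        index += 1
    return windows


# A 7-second recording cut into 5-second windows with a 1-second step.
windows = sliding_windows(duration=7.0, window=5.0, step=1.0)
for start, end in windows:
    print(f"{start:.1f}s -> {end:.1f}s")
```

Because consecutive windows overlap, most of the audio is seen several times, which is what later makes it possible to reconcile the speakers found in neighboring windows.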
For each of these windows, the model is expected to identify up to three distinct speakers. It's important to understand that these speakers are independent of each other within their respective windows. Therefore, the first speaker in one window might differ from the first speaker in the next window. This concept is referred to as "local speakers."
Here is a visualization of those results in a plot: each speaker is assigned a color. The higher the curve, the higher the probability that the speaker is actually speaking. For readability in this figure, a 2s step is used between windows, but the step used in practice would be smaller.
After the segmentation, the model calculates the probability of each person speaking. Here, we can simplify the results by converting them into binary outcomes – either "is speaking" or "is not speaking" – used for subsequent processing, resulting in the following plot.
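The binarization step can be sketched as a simple threshold on the per-frame probabilities. The 0.5 cutoff and the probability values are assumptions chosen for the example; real pipelines may use more elaborate rules.

```python
def binarize(probabilities, threshold=0.5):
    """Map per-frame speaking probabilities to 1 (is speaking) or 0 (is not speaking)."""
    return [
        [1 if p >= threshold else 0 for p in speaker]
        for speaker in probabilities
    ]


# Rows are local speakers, columns are consecutive frames within one window.
window_probabilities = [
    [0.05, 0.20, 0.85, 0.95, 0.60, 0.10],
    [0.90, 0.75, 0.40, 0.05, 0.55, 0.80],
]

activity = binarize(window_probabilities)
for row in activity:
    print(row)
```

In this toy window, both local speakers are marked active in the fifth frame, which is exactly the kind of overlap the later steps have to deal with.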
Speaker embeddings
As shown before, each local speaker operates independently within their designated window. Therefore, we need a method to determine whether two disparate local speakers correspond to the same “global speaker”, and this is where our embedding model comes into play.
In ASR, an embedding serves as a unique representation, akin to a footprint, capturing the speaker's voice pitch, tone, and other distinctive features. What we’re interested in here is to extract an embedding for each local speaker from the audio segments, to make sure that all bits of one’s voice dispersed throughout the audio conversation end up being attributed correctly to the right speaker.
If two local speakers are, indeed, the same global speaker, their respective embeddings will closely resemble each other. Conversely, if the two local speakers are distinct individuals, their embeddings will differ significantly.
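One common way to quantify how closely two embeddings resemble each other is cosine similarity. The sketch below uses tiny made-up vectors; real speaker embeddings have far more dimensions, and the choice of cosine similarity here is an assumption made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means very similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 4-dimensional embeddings, invented for the example.
local_a = [0.9, 0.1, 0.3, 0.0]
local_b = [0.8, 0.2, 0.4, 0.1]  # hypothetically the same voice as local_a
local_c = [0.1, 0.9, 0.0, 0.5]  # hypothetically a different voice

same_voice = cosine_similarity(local_a, local_b)
different_voice = cosine_similarity(local_a, local_c)
print(f"a vs b: {same_voice:.2f}")
print(f"a vs c: {different_voice:.2f}")
```

The first pair scores much higher than the second, which is the signal the clustering step relies on.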
To generate those embeddings, we use a model whose primary purpose is to produce a representation of the speakers within a given audio. To make its job easier, we remove the audio in every window where speakers overlap and concatenate the remaining audio per speaker.
Below, you can see the raw audio signal at the top. In the middle – the probability of each speaker speaking, and at the bottom – the resulting audio. If you compare the top and the bottom charts, we removed a chunk of audio at the beginning of the first window, where two overlapping speakers were detected.
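In code, removing overlapped audio amounts to keeping, for each local speaker, only the frames where that speaker is the sole active one, then stitching those frames together. This is a minimal sketch with hypothetical helper names and toy data, not the exact procedure of any specific library.

```python
def non_overlapping_frames(activity):
    """For each speaker, list the frame indices where only that speaker is active."""
    num_frames = len(activity[0])
    active_count = [sum(speaker[f] for speaker in activity) for f in range(num_frames)]
    return [
        [f for f in range(num_frames) if speaker[f] == 1 and active_count[f] == 1]
        for speaker in activity
    ]


def concatenate_audio(samples, frame_indices, frame_size):
    """Concatenate the samples of the selected frames into one clean clip."""
    clip = []
    for f in frame_indices:
        clip.extend(samples[f * frame_size:(f + 1) * frame_size])
    return clip


# Binary activity for two local speakers over six frames; frame 1 is an overlap.
activity = [
    [1, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1, 1],
]
clean_frames = non_overlapping_frames(activity)

# Toy audio: six frames of two samples each.
samples = list(range(12))
clip_speaker_0 = concatenate_audio(samples, clean_frames[0], frame_size=2)
print(clean_frames)
print(clip_speaker_0)
```

Each resulting clip contains a single voice, which gives the embedding model a much cleaner input.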
Clustering
Now that we have a representation of all local speakers, the only remaining step is to combine them together and generate the corresponding global speakers.
For that purpose, we use a clustering algorithm, which will compare all the embeddings with one another and will try to create groups resulting in global speakers.
Once the clusters are generated, we can assign the local speakers to their global value, as depicted in the first image. The results of each separate window must be merged, resulting in the second unified plot.
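To illustrate the idea, here is a small agglomerative clustering sketch: it repeatedly merges the two closest groups of embeddings until the remaining groups are too far apart. The centroid-based merging, the cosine distance, and the 0.3 threshold are all assumptions made for this example and not necessarily what a production engine uses.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def centroid(vectors):
    return [sum(values) / len(vectors) for values in zip(*vectors)]


def cluster_embeddings(embeddings, threshold=0.3):
    """Merge the two closest clusters until no pair of centroids is within `threshold`."""
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = centroid([embeddings[k] for k in clusters[i]])
                cj = centroid([embeddings[k] for k in clusters[j]])
                distance = cosine_distance(ci, cj)
                if best is None or distance < best[0]:
                    best = (distance, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    # Each cluster becomes one global speaker id.
    labels = [0] * len(embeddings)
    for global_id, members in enumerate(clusters):
        for k in members:
            labels[k] = global_id
    return labels


# Toy embeddings for the local speakers of two windows.
local_speakers = ["window 0 / local 0", "window 0 / local 1",
                  "window 1 / local 0", "window 1 / local 1"]
embeddings = [
    [0.9, 0.1, 0.3],
    [0.1, 0.9, 0.2],
    [0.2, 0.8, 0.1],
    [0.8, 0.2, 0.4],
]

labels = cluster_embeddings(embeddings)
for name, label in zip(local_speakers, labels):
    print(f"{name} -> global speaker {label}")
```

Note how the first local speaker of window 0 and the second local speaker of window 1 end up with the same global label: local numbering carries no meaning across windows, only the embeddings do.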
Ultimately, we proceed to convert the outcomes into a more easily understandable and user-friendly format by applying a binarization process, much like the example depicted in the following plot.
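A user-friendly format typically means a list of speaker turns with start and end times. The sketch below converts binary per-frame activity into such segments; the speaker names, the frame duration, and the tuple layout are hypothetical choices for the example.

```python
def frames_to_segments(activity, frame_duration):
    """Turn per-speaker binary frame activity into (speaker, start, end) segments."""
    segments = []
    for speaker, frames in activity.items():
        start = None
        # The trailing 0 acts as a sentinel that closes a run reaching the last frame.
        for index, active in enumerate(frames + [0]):
            if active and start is None:
                start = index
            elif not active and start is not None:
                segments.append((speaker, start * frame_duration, index * frame_duration))
                start = None
    return sorted(segments, key=lambda segment: segment[1])


# Merged, binarized activity for two global speakers, one value per 0.5 s frame.
global_activity = {
    "speaker_0": [1, 1, 1, 0, 0, 0, 1, 1],
    "speaker_1": [0, 0, 0, 1, 1, 1, 0, 0],
}

segments = frames_to_segments(global_activity, frame_duration=0.5)
for speaker, start, end in segments:
    print(f"{speaker}: {start:.1f}s - {end:.1f}s")
```

Aligning these time ranges with the timestamps of the transcribed words is what yields a speaker-labelled transcript.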
This final diarization result can be put to work in many ways: analyzing speaker interactions, extracting meaningful insights, or indexing and retrieving audio content by speaker.
Whisper speaker diarization, or what makes Gladia’s approach unique
When building a diarization engine, it is paramount to optimize its performance with all kinds of audio. Audio files essentially consist of channels, with the following classification used to distinguish between the different types:
Mono (single channel), aka a track
Stereo (dual-channel), aka two tracks
Multi-channel, aka several tracks
Making a diarization API versatile is trickier than it may appear at first glance, given that stereo and multi-channel audio can contain duplicate signals, with the same speech carried on more than one channel, which in turn leads to repetitions in the final transcript.
Gladia API is based on Whisper ASR, optimized for enterprise scale and needs. Speaker diarization is among the several core features added to the vanilla model. When building our API, our goal was to make sure it could process all kinds of files without compromising the quality and speed of diarization.
To help our users get the most out of this functionality, we rely on two main approaches:
Mechanical split diarization, most suitable for telecom and call center-type use cases. The speaker separation, in this case, is mechanical because each speaker corresponds to a unique channel (left channel — person 1, right — person 2).
AI-based (footprinting) split diarization, based on Gladia’s proprietary algorithm and suitable for use cases like multi-speaker virtual meetings, video content, and podcasts, where a single channel can contain multiple speakers and the input files are prone to duplicate signals.
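The mechanical case is simple enough to sketch in a few lines: when each speaker has their own channel, separating speakers is just a matter of deinterleaving the samples. The function and speaker names below are hypothetical and do not reflect Gladia's implementation.

```python
def split_channels(interleaved, num_channels=2):
    """Deinterleave PCM samples: [L0, R0, L1, R1, ...] -> [[L0, L1, ...], [R0, R1, ...]]."""
    return [interleaved[c::num_channels] for c in range(num_channels)]


# Toy interleaved stereo samples from a two-party call.
interleaved = [10, -3, 12, -4, 11, -2]
left, right = split_channels(interleaved)

# With one person per channel, the channel index is the speaker label.
speakers = {"speaker_0": left, "speaker_1": right}
print(speakers)
```

Each channel can then be transcribed on its own, with no speaker embeddings or clustering involved.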
As part of the proprietary AI-based diarization, we developed a special method to avoid transcript repetitions by seamlessly detecting unique vs. duplicate channels based on their similarity.
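Gladia's method is proprietary, so the following is only a generic illustration of similarity-based duplicate detection, not the actual algorithm. It assumes that duplicate channels are near-identical up to a gain factor and compares them with a normalized correlation; the 0.95 threshold is arbitrary.

```python
import math


def normalized_correlation(a, b):
    """Pearson correlation between two equal-length signals, in [-1, 1]."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    numerator = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    denominator = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return numerator / denominator if denominator else 0.0


def unique_channels(channels, threshold=0.95):
    """Keep the index of every channel that is not a near-copy of an earlier kept one."""
    kept = []
    for index, channel in enumerate(channels):
        if all(normalized_correlation(channel, channels[k]) < threshold for k in kept):
            kept.append(index)
    return kept


# Channel 1 is a quieter copy of channel 0; channel 2 carries a different signal.
tone_a = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
tone_b = [math.sin(2 * math.pi * 13 * n / 100) for n in range(100)]
channels = [tone_a, [0.8 * x for x in tone_a], tone_b]

kept = unique_channels(channels)
print(kept)  # indices of the channels worth transcribing
```

A real system would also need to handle time offsets and noise between channels, which this zero-lag comparison ignores.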
Our API will automatically pick the right diarization method for your audio file and produce a high-quality speaker-based transcript — without sacrificing the speed of transcription (≈120s per hour of audio).
Latency
One of the major updates in the latest diarization version is a new diarization model that improves inference time and reduces latency, making diarization 5x-10x faster overall. This means that large audio files, be they mono, stereo, or multi-channel, can be processed more quickly, in a way that is both efficient and scalable enough to accommodate ever-growing enterprise volumes.
Accuracy
Speaker recognition and attribution have also seen significant improvements. Despite overlaps resulting from simultaneous speech, the new state-of-the-art algorithm can recognize speakers accurately and assign dispersed speech fragments to unique individuals without repetitions.
The new model produces a list of speakers, along with the start and end times of their sentences, ensuring a more iterative and precise attribution of sentences per speaker in the transcribed text. Moreover, thanks to the latest upgrade, the overall accuracy of Gladia’s diarization engine has been greatly improved even in challenging contexts like quick and short exchanges.
Multilingual support
The new diarization version also offers multilingual support, making it a powerful tool for transcribing multilingual conversations and webinars. It can automatically detect the 99+ supported languages in the original audio without the need for an initial prompt.
Additionally, Gladia’s diarization API can handle sentence-based code-switching — distinguishing between speakers even as they switch from one language to another during the same conversation. This makes it highly versatile and adaptable to different language contexts — a truly valuable tool for transcription needs in international settings.
Given all of the above, Gladia’s API has an undisputed competitive edge when it comes to diarization.
Benefits of speaker diarization across industries
Diarization is an essential feature for getting the most out of transcribed interviews, focus groups, conference calls, and other situations with multiple speakers. Without it, attributing speech in such conversations has to be done by hand, which is time-consuming, labor-intensive, and error-prone.
Given the unmatched speed and accuracy of Gladia’s core speech-to-text API, this add-on feature can help make your transcription process a lot more accurate and efficient.
With diarization, your users will be able to:
• Automatically identify all speakers in a multi-person meeting;
• Detect the same speaker appearing in different channels, e.g., the same presenter heard in different conference rooms;
• Recognize the speakers even when they switch to a different language (i.e. code-switching).
In addition to its usefulness for transcription, speaker diarization has other potential applications. For example, it could be used for identifying different voices in a security context, such as identifying who is speaking in a phone call or security footage. It could also be used in the field of voice recognition, where accurate identification of different speakers could improve the accuracy of voice-based authentication systems.
We hope you enjoy this new feature and look forward to seeing even more exciting applications for it as our tech continues to evolve.
Footnotes
[1] Hervé Bredin (2023). Pyannote Audio 2.1 Speaker Diarization Pipeline: Principle, Benchmark, and Recipe. IRIT, Université de Toulouse, CNRS, Toulouse, France.
About Gladia
At Gladia, we built an optimized version of Whisper in the form of an API, adapted to real-life use cases and distinguished by exceptional accuracy, speed, extended multilingual capabilities and state-of-the-art features, including speaker diarization and word-level timestamps.