How to implement advanced speaker diarization and emotion analysis for online meetings
Published on Sep 2024
In our previous article, we discussed how to unlock valuable meeting data by building a speaker diarization proof of concept (POC) for online meetings that identifies speakers in audio streams and provides organizations with detailed speaker-based insights, meeting summaries, action items, and more.
However, effective communication goes beyond words. A subtle non-verbal cue or underlying emotional state can reveal how something was said and entirely change the meaning.
Analyzing emotions from various voice interactions such as customer service calls, sales meetings, or online interviews can help unlock deeper insights to predict behavior, take data-driven actions based on expected behavior, and improve quality monitoring over time.
For services like contact centers or sales-focused meeting platforms, this can translate into improved sales rep performance, personalized customer assistance, a better understanding of customer satisfaction, and more.
In this tutorial, we give you building blocks for integrating advanced speaker-based summaries for online meetings using speaker diarization and emotion analysis. Let’s dive in.
What is speaker diarization and emotion analysis?
Advanced speaker diarization segments an audio recording into speech segments, each corresponding to a specific speaker. It detects changes in speaker identity and groups segments that belong to the same speaker, answering the question of who spoke when.
Emotion analysis, as the name suggests, analyzes the emotional undertones of voice and classifies them into categories such as approval, disappointment, excitement, and curiosity, answering the question of how something was said. For emotion analysis in this tutorial, we’ll use a version of Whisper called Whisper-timestamped together with a Hugging Face emotion detection model.
Sentiment vs emotion analysis
Note: Emotion analysis is often confused with sentiment analysis. Sentiment analysis classifies information as positive, negative, or neutral. But it’s not always capable of identifying emotional nuances such as surprise or fear. Here's where emotion analysis comes in—it analyzes more complex emotions and undertones. Both sentiment and emotion analysis can be text or speech-based.
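To make the difference concrete, here is a minimal sketch comparing the two approaches with the Hugging Face Transformers pipeline. The emotion model is the one we use later in this tutorial; the sentiment model (distilbert-base-uncased-finetuned-sst-2-english) is just an illustrative choice, not something this POC depends on.
from transformers import pipeline

# Sentiment analysis: coarse labels such as POSITIVE / NEGATIVE
sentiment = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

# Emotion analysis: finer-grained labels such as surprise, curiosity, or disappointment
emotion = pipeline("text-classification", model="SamLowe/roberta-base-go_emotions")

text = "Wait, the demo actually worked on the first try?"
print(sentiment(text))  # e.g. [{'label': 'POSITIVE', 'score': ...}]
print(emotion(text))    # e.g. [{'label': 'surprise', 'score': ...}]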
This POC has various use cases:
Corporate governance and compliance: Automates meeting transcriptions for audit trails and legal documentation in highly regulated sectors like finance and healthcare.
Educational webinars and online classes: Allows students to search transcripts by speaker and helps educators refine their methods through emotion analysis.
Customer support and service reviews: Analyzes customer support calls and performs sentiment assessment to improve staff training and customer satisfaction.
Conference and event summaries: Provides easy access to specific parts of conferences, with emotion analysis providing insights into speaker engagement and audience sentiment.
Project management meetings: Improves understanding of team dynamics and communication flow, helping with conflict resolution and project success.
Challenges of implementing advanced identification and emotion analysis
High computational requirements
Challenge: Processing extensive audio data for diarization, transcription, and emotion analysis requires significant computational resources.
Solution: Leveraging cloud computing resources or optimizing algorithms.
Speaker diarization and identification
Challenge: Achieving high accuracy in speaker diarization can be difficult, especially in noisy environments or with overlapping speech.
Solution: Enhancing audio preprocessing and using advanced machine learning models trained on diverse datasets can improve accuracy.
Privacy and data security
Challenge: Handling sensitive audio data involves significant privacy and security concerns.
Solution: Implementing robust security protocols and complying with data protection regulations.
Limited context
Challenge: The current emotion analysis model often misses out on contextual audio cues like tone, pitch, and energy.
Solution: Future enhancements will include models that analyze emotions directly from audio, improving the understanding of sentiments.
Real-time processing requirements
Challenge: Real-time transcription is essential but challenging due to heavy computational requirements.
Solution: Implementing Voice Activity Detection (VAD) lets you process audio segments as soon as they arrive and helps predict speaker sentiment in real time.
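As a rough sketch of what that can look like, pyannote.audio ships a pre-trained voice activity detection pipeline on Hugging Face (it requires the same Hugging Face token we use later in this tutorial); the speech regions it returns can then be passed to diarization or transcription as they arrive. The file path below is a placeholder.
from pyannote.audio import Pipeline

# Pre-trained VAD pipeline (gated model: replace "Your_API_Key" with your Hugging Face token)
vad = Pipeline.from_pretrained("pyannote/voice-activity-detection", use_auth_token="Your_API_Key")

# Detect speech regions in a recording (placeholder path)
speech = vad("/content/your_recording.wav")
for region in speech.get_timeline().support():
    print(f"speech from {region.start:.2f}s to {region.end:.2f}s")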
How to implement advanced speaker diarization and emotion analysis
Here are the steps for implementing advanced speaker diarization and emotion analysis POC:
Create speaker embeddings: You need to upload audio samples of known speakers and create a unique digital signature for each participant based on their voice characteristics. These samples are then processed to generate distinct speaker embeddings and serve as a reference for speaker diarization.
Diarization to determine “who spoke when”: The audio file is analyzed to detect different speakers and divide the meeting into parts where each segment represents a single speaker's input.
Speaker identification to attribute speech segments to the corresponding speakers: Each audio segment is compared to the speaker embeddings to identify which segments correspond to which speakers. This comparison matches audio characteristics against the known embeddings.
ASR transcription to convert speech within each segment into text: The transcription also includes timestamps that link each piece of text to its specific time in the audio file.
Emotion analysis to classify the emotional undertone of each segment: Each transcribed piece of text is run through an emotion detection model, and the results are attached to the corresponding speaker segment.
Before you start, you’ll need the following:
1. Hugging Face API key
Navigate to your Hugging Face Account settings and get your API key. Store it securely.
2. Google Colab (optional)
Google Colab is recommended for running the notebook in an environment with a free GPU. This is especially useful for processing large audio files efficiently.
3. Install libraries and packages
We’ll use several libraries and packages in this tutorial.
Pyannote.audio: Speaker diarization with pre-trained models for segmenting and labeling speaker identities. Install it using pip:
!pip install pyannote.audio
SpeechBrain: An all-in-one, open-source speech toolkit enabling flexible speech technology experiments. Install it directly from the GitHub repository:
!pip install git+https://github.com/speechbrain/speechbrain.git
Torchaudio: Gives you access to audio files and transformations:
!pip install torchaudio
SciPy: Used for scientific computing; it helps in operations like computing distances between embeddings:
!pip install scipy
Hugging Face Transformers: A wealth of pre-trained models used for ASR and emotion analysis tasks:
!pip install transformers
Whisper-timestamped: An enhanced ASR model that offers precise transcription with timestamps, critical for synchronizing transcribed text with audio segments:
!pip install whisper-timestamped
4. Audio data
Ensure you have audio recordings to analyze. If you don't have your own recordings, you can use the audio samples featured in this tutorial.
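If you bring your own files, a quick sanity check with torchaudio helps. This is a sketch: the path is a placeholder, and 16 kHz mono is what the pre-trained models used below generally expect.
import torchaudio

waveform, sample_rate = torchaudio.load("/content/your_recording.wav")  # placeholder path
duration = waveform.shape[1] / sample_rate
print(f"channels: {waveform.shape[0]}, sample rate: {sample_rate} Hz, duration: {duration:.1f}s")

# Resample to 16 kHz if needed
if sample_rate != 16000:
    waveform = torchaudio.transforms.Resample(sample_rate, 16000)(waveform)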
Step-by-step tutorial
Step 1: Create speaker embeddings
Let’s start by creating unique speaker embeddings for known speakers:
import torch
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier
from pyannote.audio import Pipeline, Audio
from scipy.spatial.distance import cdist

# Check if CUDA is available and set the device accordingly
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained model for speaker embedding extraction and move it to the device
classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb", run_opts={"device": device})
classifier = classifier.to(device)

# Pre-trained pipeline for speaker diarization
# Note: this pipeline requires a Hugging Face API key; replace "Your_API_Key" with your token.
diarization = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
                                       use_auth_token="Your_API_Key")
Now we need to:
Load the audio files containing samples of known speakers and extract their embeddings.
Convert the audio files into waveforms and encode them into speaker embeddings using the previously loaded classifier model.
Assign meaningful labels to each known speaker for later reference.
# Extract embeddings for known speakers and move them to the device
known_speakers = []
known_speaker_ids = []  # To keep track of speaker IDs

for speaker_id, speaker_file in enumerate(["/content/Steve Job's sample.wav", "/content/elon musk sample.wav", "/content/mandela sample.wav"]):
    waveform, sample_rate = torchaudio.load(speaker_file)
    waveform = waveform.to(device)
    embedding = classifier.encode_batch(waveform)
    known_speakers.append(embedding.squeeze(1).cpu().numpy())  # Squeeze and move to CPU

    # Update speaker labels here
    if speaker_id == 0:
        known_speaker_ids.append("Steve Jobs")
    elif speaker_id == 1:
        known_speaker_ids.append("Elon Musk")
    elif speaker_id == 2:
        known_speaker_ids.append("Nelson Mandela")
Step 2: Diarization
Segment the audio file into different speaker segments to identify "who spoke when."
# Process meeting recording
segments = diarization("/content/mandela elon steve.wav")
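Before matching segments to known speakers, you can inspect the raw diarization output. A quick sketch using the same recording:
# Each track carries an anonymous speaker label (e.g. SPEAKER_00) with start/end times
for segment, _, label in segments.itertracks(yield_label=True):
    print(f"{label}: {segment.start:.2f}s -> {segment.end:.2f}s")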
Step 3: Speaker identification
Next we need to identify speakers in each segment of the recording. For each segment, we load the corresponding portion of the audio recording using torchaudio.load() and extract the waveform.
We then pass the waveform to the classifier model to obtain the speaker embedding using the encode_batch() method. The obtained embedding is compared with embeddings of known speakers using cosine distance.
The speaker with the minimum distance is identified as the speaker for the specific segment:
If the minimum distance is below a specified threshold, we print the speaker ID along with the start and end times.
If no matching speaker is found, we print a message indicating that no matching speaker was found.
# Set the threshold for similarity scores
threshold = 0.8

# Function to process each segment and identify speakers
def identify_speakers_in_segments(segments, known_speakers, known_speaker_ids, classifier, threshold):
    identified_segments = []
    for i, (segment, track, label) in enumerate(segments.itertracks(yield_label=True)):
        start_time, end_time = segment.start, segment.end

        # Load the recording and slice out the current segment
        waveform, sample_rate = torchaudio.load("/content/mandela elon steve.wav")
        start_index = int(start_time * sample_rate)
        end_index = int(end_time * sample_rate)
        waveform_segment = waveform[:, start_index:end_index].to(device)

        # Ensure the waveform segment is appropriately padded if necessary
        if waveform_segment.size(1) < 16000:  # Assuming 1 second as minimum length
            pad_size = 16000 - waveform_segment.size(1)
            waveform_segment = torch.nn.functional.pad(waveform_segment, (0, pad_size))

        embedding = classifier.encode_batch(waveform_segment)
        embedding_2d = embedding.squeeze(1).cpu().numpy()

        min_distance = float('inf')
        recognized_speaker_id = None

        # Iterate over each known speaker and find the minimum distance
        for j, speaker_embedding in enumerate(known_speakers):
            distances = cdist(embedding_2d, speaker_embedding, metric="cosine")  # No need to reshape for cdist
            min_distance_candidate = distances.min()
            if min_distance_candidate < min_distance:
                min_distance = min_distance_candidate
                recognized_speaker_id = known_speaker_ids[j]  # Get the corresponding speaker ID

        # Accept the match only if it is below the threshold; otherwise mark as Unknown
        if min_distance < threshold:
            segment_info = {
                "segment_id": i,
                "speaker": recognized_speaker_id,
                "begin_time": start_time,
                "end_time": end_time
            }
        else:
            segment_info = {
                "segment_id": i,
                "speaker": "Unknown",
                "begin_time": start_time,
                "end_time": end_time
            }
        identified_segments.append(segment_info)
    return identified_segments
# Call the function with appropriate parameters
identified_segments = identify_speakers_in_segments(segments, known_speakers, known_speaker_ids, classifier, threshold)
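A quick way to check the result of the matching step (sketch):
for seg in identified_segments:
    print(f"{seg['speaker']}: {seg['begin_time']:.2f}s -> {seg['end_time']:.2f}s")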
Step 4: Segment transcription with whisper-timestamped
Whisper-timestamped is based on OpenAI Whisper. However, it predicts word timestamps and provides a more accurate estimation of speech segments compared to Whisper models.
A confidence score is assigned to each word and each segment.
import whisper_timestamped as whisper
audio = whisper.load_audio("/content/mandela elon steve.wav")
model = whisper.load_model("large-v3")
result = whisper.transcribe(model, audio)
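The transcription result is a dictionary whose "segments" list carries start/end timestamps and the transcribed text, which is what the next snippet relies on. A quick way to inspect it (sketch):
# Print each transcription segment with its timestamps
for seg in result["segments"]:
    print(f"[{seg['start']:.2f}-{seg['end']:.2f}] {seg['text']}")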
Use the following code snippets to assign each transcription segment to the right speaker.
# Function to process the ASR output
def process_asr_output(asr_data):
    return asr_data['segments']

# Function to process the diarization and speaker identification output
def process_speaker_segments(speaker_data):
    return speaker_data

# Function to assign speakers to transcription segments
def assign_speakers_to_transcriptions(trans_segments, speaker_segments):
    for trans in trans_segments:
        trans_start = trans['start']
        trans_end = trans['end']
        trans['speaker'] = None
        for speaker in speaker_segments:
            if speaker['begin_time'] <= trans_start < speaker['end_time'] or \
               speaker['begin_time'] < trans_end <= speaker['end_time']:
                trans['speaker'] = speaker['speaker']
                break
    return trans_segments
# Main processing steps
transcription_segments = process_asr_output(result)
speaker_segments = process_speaker_segments(identified_segments)
updated_transcriptions = assign_speakers_to_transcriptions(transcription_segments, speaker_segments)

# Prepare data for JSON output
import json

transcription_w_speakers_data = []
for trans in updated_transcriptions:
    transcription_w_speakers_data.append({
        "id": trans['id'],
        "timestamp": {"start": trans['start'], "end": trans['end']},
        "speaker": trans['speaker'],
        "text": trans['text']
    })

# Convert the list to a JSON string and print it
json_string = json.dumps(transcription_w_speakers_data, indent=4)
print(json_string)
Step 5: Emotion analysis
We’ll use a pre-trained model, "SamLowe/roberta-base-go_emotions", from the Hugging Face Transformers library. This model is trained to recognize a wide range of emotions from text inputs.
from transformers import pipeline
sentiment_analysis = pipeline(
    "sentiment-analysis",
    framework="pt",
    model="SamLowe/roberta-base-go_emotions")

def analyze_sentiment(text):
    results = sentiment_analysis(text)
    sentiment_results = {
        result['label']: result['score'] for result in results
    }
    return sentiment_results
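For example, running the helper on one of the transcribed sentences returns the top emotion label with its score:
print(analyze_sentiment("Sport has the power to change the world."))
# e.g. {'optimism': 0.56}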
The analyze_and_append_sentiments function goes through transcription data, analyzes the sentiment of each segment, and assigns the results to the corresponding segments in the JSON data:
# Function to analyze sentiment and update the JSON data
def analyze_and_append_sentiments(transcription_w_speakers_data):
    for segment in transcription_w_speakers_data:
        text = segment['text']
        sentiment_results = analyze_sentiment(text)
        segment['sentiment'] = sentiment_results  # Append the sentiment result to each segment
    return transcription_w_speakers_data

# Analyze sentiments and update the JSON data
updated_json_data = analyze_and_append_sentiments(transcription_w_speakers_data)

# Print the updated JSON data with sentiment results
json_string = json.dumps(updated_json_data, indent=4)
print(json_string)
Here is the final result:
[
    {
        "id": 0,
        "timestamp": {
            "start": 0.0,
            "end": 3.06
        },
        "speaker": "Nelson Mandela",
        "text": " Sport has the power to change the world.",
        "sentiment": {
            "optimism": 0.5600042343139648
        }
    },
    {
        "id": 1,
        "timestamp": {
            "start": 3.66,
            "end": 4.88
        },
        "speaker": null,
        "text": " It has the power...",
        "sentiment": {
            "neutral": 0.8447567820549011
        }
    },
    {
        "id": 2,
        "timestamp": {
            "start": 5.3,
            "end": 6.32
        },
        "speaker": "Elon Musk",
        "text": " Which I get asked a lot.",
        "sentiment": {
            "neutral": 0.9046566486358643
        }
    },
    {
        "id": 3,
        "timestamp": {
            "start": 6.98,
            "end": 7.48
        },
        "speaker": "Nelson Mandela",
        "text": " ...to inspire.",
        "sentiment": {
            "neutral": 0.9702799916267395
        }
    },
    {
        "id": 4,
        "timestamp": {
            "start": 8.46,
            "end": 14.66
        },
        "speaker": "Nelson Mandela",
        "text": " It has the power to unite people in a way that little else does.",
        "sentiment": {
            "approval": 0.528308093547821
        }
    },
    {
        "id": 5,
        "timestamp": {
            "start": 15.56,
            "end": 18.54
        },
        "speaker": "Steve Jobs",
        "text": " It's a breakthrough internet communications device.",
        "sentiment": {
            "neutral": 0.4549199938774109
        }
    },
    {
        "id": 6,
        "timestamp": {
            "start": 20.06,
            "end": 24.36
        },
        "speaker": "Nelson Mandela",
        "text": " It speaks to youth in a language they understand.",
        "sentiment": {
            "neutral": 0.8397235870361328
        }
    },
    {
        "id": 7,
        "timestamp": {
            "start": 25.0,
            "end": 27.2
        },
        "speaker": "Elon Musk",
        "text": " When you think about the question of where are the aliens,",
        "sentiment": {
            "neutral": 0.7901335954666138
        }
    }
]
Wrap-up
The advanced speaker diarization and emotion analysis POC enhances the processing and analysis of online meetings, making them more accessible, organized, and efficient.
However, this model has its limitations. It currently processes transcribed text without considering the rich audio cues such as tone, pitch, and energy that can alter the emotion of a spoken sentence — the same sentence spoken in a cheerful tone versus an angry tone can convey entirely different emotions.
In our next tutorial, we will show you how to overcome this limitation. You’ll learn how to integrate models that can analyze emotions directly from audio data and leverage auditory cues to provide a more nuanced and accurate analysis. Stay tuned!
About Gladia
Gladia provides a speech-to-text and audio intelligence API for building virtual meeting and note-taking apps, call center platforms, and media products, providing transcription, translation, and insights powered by best-in-class ASR, LLMs and GenAI models.