How to build a Google Meet Bot for recording and video transcription
Published on Nov 23, 2023
Tools like Google Meet have revolutionized how we connect and conduct meetings remotely. However, it can be very challenging to keep track of all action items and key insights shared during long meetings.
Anyone who has used Google Meet's native transcription knows that relying on Google alone is hardly an option: the quality is poor, and processing takes around 30 minutes on average. One possible solution is to build a custom Google Meet transcription bot that records and transcribes online calls for you.
But crafting such a tool from scratch isn’t as simple as one would think. It must overcome two significant challenges: navigating around Google Meet’s anti-bot mechanisms and managing audio processing in server settings that lack a sound card. In this article, we’ll guide you through the ins and outs of building a Google Meet bot capable of recording and transcribing meetings — even when operating in a containerized environment sans sound card.
What is a Google Meet bot and what does it do?
The Google Meet bot created by our tech team is a specialized tool engineered to record and transcribe your Google Meet sessions, all within the confines of a container. This tool simplifies the complexities of remote meeting capture, making it effortless to manage sessions without the need for additional hardware or complex configurations.
Prerequisites
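Before setting up the bot, ensure you have the following: Git to clone the repository, Docker to build and run the container, a Gmail account the bot can sign in with, and a Gladia API key.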
Setting up the Google Meet bot is rewarding but somewhat complex. It offers useful features such as automatic recording and transcription, but there are three technical hurdles to overcome, each of which we took into account when designing the bot.
1. Avoiding Google Meet's bot detection
Google Meet uses a variety of methods, such as pixel trackers, to sniff out bots. We tackled this by driving Chrome through undetected_chromedriver, a Selenium-compatible Chrome driver built to avoid detection. This helps the bot remain inconspicuous and reduces the likelihood of being flagged. Signing in through Google's conventional login flow further helps to evade detection.
2. Capturing audio on sound card-less servers
The bot runs without extra setup on Docker and on local machines where it has access to a physical sound card. The real hurdle comes when you deploy it on a server that lacks one. Our workaround is PulseAudio, a sound server for Linux, which lets us create a virtual sound card and microphone for audio capture.
3. Simulating a user interface for video recording
Capturing video presented another challenge. We use a combination of X-Screen and Xvfb (the X virtual framebuffer) to create a virtual screen. This lets a Chrome session run and record the video of the meeting on a server with no physical display attached.
Setting up the Google Meet bot
Without further ado, let’s get started. This section provides a step-by-step guide to setting up the Google Meet Bot.
Step 1: Clone the GitHub Repository
You’ll need to clone a GitHub repository that contains the Dockerfile for this setup.
git clone https://github.com/gladiaio/gladia-samples.git
cd gmeet-automate
Step 2: Build the Docker container
Build the Docker container using the provided Dockerfile.
docker build -t gmeet -f Dockerfile .
Step 3: Set environment variables and run the container
Before running the container, you need to set a handful of environment variables. They carry the sensitive and per-meeting data the script depends on: the Google Meet link, your Gmail login details, the length of the meeting, and your Gladia API key, among others. Here's a breakdown of each one:
GMEET_LINK: This is the Google Meet link for the meeting you want to record.
GMAIL_USER_EMAIL: Plug in your Gmail email address here.
GMAIL_USER_PASSWORD: This one’s for your Gmail password.
DURATION_IN_MINUTES: Specify the length of the meeting you wish to record in minutes.
GLADIA_API_KEY: Insert your Gladia API key here for smooth integration.
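GLADIA_DIARIZATION: Choose whether to enable or disable the diarization feature in Gladia. Note that the transcription code shown later in this article reads this setting from a variable named DIARIZATION.
MAX_WAIT_TIME_IN_MINUTES: This is the cut-off time for waiting in the Google Meet lobby. Note that the join loop shown later in this article reads MAX_WAITING_TIME_IN_MINUTES and falls back to 5 minutes when it is unset.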
Setting these environment variables correctly is crucial for the successful execution of the script.
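As a rough illustration of how these settings could be checked before the bot starts, the sketch below reads them from a mapping and fails early when a required one is missing. The function name load_bot_config and the choice of which variables count as required are assumptions made for this example; they are not taken from the repository. The 15-minute fallback mirrors the recording default described later in this article.

```python
import os

# Variables the bot cannot run without (an assumption for this sketch).
REQUIRED_VARS = (
    "GMEET_LINK",
    "GMAIL_USER_EMAIL",
    "GMAIL_USER_PASSWORD",
    "GLADIA_API_KEY",
)


def load_bot_config(env=None):
    """Return the bot settings as a dict, raising if a required one is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED_VARS}
    # Environment values are always strings, so numeric settings need converting.
    config["DURATION_IN_MINUTES"] = int(env.get("DURATION_IN_MINUTES", "15"))
    return config
```

Passing a plain dictionary instead of the real environment makes the check easy to try out without exporting anything in your shell.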
Step 4: Retrieve the recordings
Once the meeting wraps up, you'll find the recording and the screenshots taken by the bot in two separate folders on your local computer: one for recordings and another for screenshots.
How the Gladia Google Meet bot works
The Gladia Google Meet Bot streamlines your Google Meet experience by automating session entry, controlling your audio and video preferences, and even recording the meeting for you. Below is an explanation of the code and how each part works:
Importing libraries
To kick things off, the code begins with a series of import statements. These bring in the necessary libraries and modules that empower the bot to perform its tasks efficiently.
import asyncio
import os
import subprocess
import cv2
import datetime
import requests
import pyaudio
import numpy as np
import io
from PIL import Image
from selenium.webdriver.common.keys import Keys
from time import sleep
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import undetected_chromedriver as uc
Here’s a breakdown of what each library does:
asyncio: This library is used for writing asynchronous programs in Python. It allows the bot to run multiple operations concurrently without waiting for each one to finish, which can make the bot more efficient.
os: Provides a way of interacting with the operating system. This is used for reading environment variables, working with directories, etc.
subprocess: Allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes.
click: This package simplifies the creation of command-line interfaces. It does not appear in the import block above and is not used in the code shown here, but it could be useful for future functionality.
cv2: OpenCV (Open Source Computer Vision Library) is an open-source computer vision library that contains various functions to perform operations on pictures or videos.
datetime: Used for working with dates and times.
requests: HTTP library for making requests to the internet. It can be used to make API calls.
pyaudio: Provides Python bindings for PortAudio, the cross-platform audio I/O library. It’s likely used to handle audio streams.
numpy: Stands for ‘Numerical Python,’ it’s used for numerical operations and working with arrays.
io: Provides the Python interfaces to stream handling. The builtin open function is defined in this module.
PIL (Pillow): Python Imaging Library used for opening, manipulating, and saving image files.
selenium.webdriver.common.keys.Keys: Provides keys in the keyboard like RETURN, F1, ALT, etc.
time.sleep: Used to halt the execution of the program for a given time in seconds.
selenium.webdriver.common.by.By: Provides a way to refer to HTML elements in the Selenium Webdriver.
selenium.webdriver.support.expected_conditions (as EC): Provides a set of predefined expected conditions to use with WebDriverWait.
undetected_chromedriver as uc: A library used to operate Google Chrome for web automation in a way that avoids detection mechanisms.
Asynchronous command execution function
The run_command_async function lets you execute shell commands without blocking the event loop, using Python's asyncio and subprocess modules. It starts the command, waits for it to finish, and returns the captured standard output and standard error, while other coroutines remain free to run in the meantime.
async def run_command_async(command):
    process = await asyncio.create_subprocess_shell(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE
    )
    # Wait for the process to complete
    stdout, stderr = await process.communicate()
    return stdout, stderr
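To show what the non-blocking behaviour buys you, here is a small self-contained sketch that runs two shell commands concurrently. The helper run_command is a variant written for this example (it also returns the exit code and decodes the output), and the echo commands are placeholders rather than anything the bot actually runs.

```python
import asyncio


async def run_command(command):
    """Run a shell command without blocking the event loop.

    Returns a tuple of (exit code, stdout text, stderr text).
    """
    process = await asyncio.create_subprocess_shell(
        command,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await process.communicate()
    return process.returncode, stdout.decode(), stderr.decode()


async def main():
    # Both commands are started together and awaited concurrently.
    return await asyncio.gather(
        run_command("echo first"),
        run_command("echo second"),
    )


if __name__ == "__main__":
    for code, out, err in asyncio.run(main()):
        print(code, out.strip())
```

asyncio.gather returns the results in the order the coroutines were passed in, regardless of which command finishes first.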
Google sign-in function
This function streamlines the whole login procedure for Google by using the provided email and password. It’s designed with built-in waiting intervals and screenshot capabilities, ensuring the operation is both well-managed and closely observed.
Here’s an asynchronous function called google_sign_in that accepts three arguments: email, password, and driver. The beauty of making it asynchronous is that it doesn't block the rest of your code from executing. While it's doing its thing, other functions can hop in and get their work done too.
driver.get("https://accounts.google.com")
In this step, the Selenium WebDriver takes you directly to Google’s Sign-In page.
The script identifies the email input box through its HTML name attribute and enters the email address using the send_keys() function.
driver.save_screenshot('screenshots/email.png')
Capturing a screenshot of the current browser display serves multiple purposes, including troubleshooting issues and maintaining a record of actions taken.
The script identifies the “Next” button using its unique ID and then clicks on it. This action navigates us to the page where we can enter the password.
Much like it does with the email input, the script locates the password field using its HTML name attribute. Once identified, it proceeds to enter the password.
password_field.send_keys(Keys.RETURN)
This mimics the action of hitting the ‘Return’ key, thereby submitting the form and successfully logging the user in.
The script pauses for a brief 5-second interval, allowing enough time for the sign-in to finalize before capturing a concluding screenshot.
Function to join a Google Meet
At the heart of the Gladia Google Meet Bot is the join_meet function. This multi-faceted function orchestrates a series of actions and seamlessly incorporates various other features. Here's an in-depth look:
async def join_meet():
    meet_link = os.getenv("GMEET_LINK", 'https://meet.google.com/dau-pztc-yad')
    print(f"start recorder for {meet_link}")

    # empty the screenshots folder if it exists, otherwise create it
    print("Cleaning screenshots")
    if os.path.exists('screenshots'):
        # delete each file in the folder
        for f in os.listdir('screenshots'):
            os.remove(f'screenshots/{f}')
    else:
        os.mkdir('screenshots')
First, the system initializes the Google Meet link. If there’s an environment variable labeled “GMEET_LINK,” it’ll use that. If not, it falls back to a pre-configured link. Following this, it clears out any previous screenshots from the “screenshots” directory to ensure a clean slate for the new run.
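The same clean-up step can be written with pathlib. The sketch below is an alternative under stated assumptions, not the bot's own code: the name reset_directory is made up for this example, and unlike the os.remove loop above it skips sub-folders instead of raising an error when it meets one.

```python
from pathlib import Path


def reset_directory(path):
    """Create `path` if needed and delete the regular files directly inside it."""
    folder = Path(path)
    folder.mkdir(parents=True, exist_ok=True)
    for entry in folder.iterdir():
        # Only plain files are removed; sub-folders are left untouched.
        if entry.is_file():
            entry.unlink()
    return folder
```

Calling reset_directory('screenshots') at the start of a run would leave an empty folder ready for new screenshots whether or not it existed beforehand.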
The system kicks off by running shell commands to establish a virtual audio setting. This is accomplished through the use of PulseAudio and pactl, which act as the virtual audio drivers in this environment.
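The exact commands are not reproduced in this article, so the following is only a sketch of what such a setup could look like. It assumes a PulseAudio daemon plus a null sink whose monitor serves as the recording source; the sink name virtual_speaker and the helper name virtual_audio_commands are invented for illustration. Each string could then be handed to run_command_async.

```python
def virtual_audio_commands(sink_name="virtual_speaker"):
    """Return shell commands that set up a virtual output and input device."""
    return [
        # Start the PulseAudio daemon and keep it alive while idle.
        "pulseaudio -D --exit-idle-time=-1",
        # Create a null sink: audio sent to it is discarded, but it exposes
        # a ".monitor" source from which that audio can be recorded.
        f"pactl load-module module-null-sink sink_name={sink_name}",
        # Route application audio (Chrome) to the virtual sink by default.
        f"pactl set-default-sink {sink_name}",
        # Use the sink's monitor as the default recording source.
        f"pactl set-default-source {sink_name}.monitor",
    ]


if __name__ == "__main__":
    for command in virtual_audio_commands():
        print(command)
```

Building the commands as data first makes it easy to log them or adjust the sink name before anything is executed.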
options = uc.ChromeOptions()
options.add_argument("--use-fake-ui-for-media-stream")
options.add_argument("--window-size=1920x1080")
options.add_argument("--no-sandbox")
options.add_argument("--disable-setuid-sandbox")
#options.add_argument('--headless=new')
options.add_argument('--disable-gpu')
options.add_argument("--disable-extensions")
options.add_argument('--disable-application-cache')
options.add_argument("--disable-dev-shm-usage")

log_path = "chromedriver.log"
driver = uc.Chrome(service_log_path=log_path, use_subprocess=False, options=options)
driver.set_window_size(1920, 1080)

email = os.getenv("GMAIL_USER_EMAIL", "")
password = os.getenv("GMAIL_USER_PASSWORD", "")
# keep this a plain string: a trailing comma would turn it into a tuple
# and the empty-string check below would never match
gladia_api_key = os.getenv('GLADIA_API_KEY', '')

if email == "" or password == "":
    print("No email or password specified")
    return

if gladia_api_key == "":
    print("No Gladia API key specified")
    print("Create one for free at https://app.gladia.io/")
    return

print("Google Sign in")
await google_sign_in(email, password, driver)
This code snippet automates Google account login using Selenium’s Chrome web driver. First, it sets up the driver through a ChromeOptions instance, tweaking the browser’s behavior with various arguments.
For example, the flag ‘--use-fake-ui-for-media-stream’ takes care of webcam and microphone permissions, while ‘--window-size=1920x1080’ sets the browser window’s dimensions.
When running Chrome with root permissions, you’ll need to disable certain security features, which is what the flags ‘--no-sandbox’ and ‘--disable-setuid-sandbox’ are for. Additional flags like ‘--disable-gpu’, ‘--disable-extensions’, and ‘--disable-application-cache’ disable GPU acceleration, Chrome extensions, and the app cache.
After configuring these options, the Chrome web driver is initialized, complete with these settings and a designated log path. Next, the code pulls environment variables for the Gmail email, password, and a Gladia API key. If any of these variables are missing, the code sends a warning message to the console and halts execution.
Wrapping it all up, the asynchronous function google_sign_in is invoked to handle the actual Google account login, using the email, password, and the initialized driver as its parameters.
The Selenium WebDriver uses the URL stored in the meet_link variable to navigate to the desired webpage. It essentially mimics the action of you manually typing the URL into the browser's address bar and hitting enter.
The execute_cdp_cmd method serves as a gateway for sending commands via the Chrome DevTools Protocol. In this specific case, it's employed to seamlessly approve a variety of browser permissions, such as capturing audio and video, for the website you're currently on—which happens to be a Google Meet link.
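As a minimal sketch of such a call, the helper below sends the DevTools Protocol command Browser.grantPermissions for a given origin. The function name grant_media_permissions and the exact list of permissions are assumptions for this example; the driver is expected to be a Chromium-based Selenium driver, which is what provides execute_cdp_cmd.

```python
def grant_media_permissions(driver, origin):
    """Ask Chrome, via the DevTools Protocol, to pre-approve media access for `origin`."""
    params = {
        "origin": origin,
        # Permission names as defined by the DevTools Protocol.
        "permissions": ["audioCapture", "videoCapture"],
    }
    driver.execute_cdp_cmd("Browser.grantPermissions", params)
    return params
```

In the bot this would be called right after navigating to the meeting, for example grant_media_permissions(driver, meet_link).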
print("screenshot")
driver.save_screenshot('screenshots/initial.png')
print("Done save initial")
The code takes a snapshot of the webpage when it first loads and stores the image in a folder named ‘screenshots,’ labelling the file as ‘initial.png.’
This snippet aims to locate a button on the web page using its XPath, primarily to close a pop-up window. Upon successfully finding the element, it clicks the button. If the button isn’t found, the code outputs “No popup.”
Here, a 10-second pause is introduced before setting a variable called ‘missing_mic’ to False. This variable will serve a key role down the line, helping us identify whether or not a microphone is absent.
try:
    driver.find_element(By.CLASS_NAME, "VfPpkd-vQzf8d").find_element(By.XPATH,"..")
    sleep(2)
    driver.save_screenshot('screenshots/missing_mic.png')
    with open('screenshots/webpage.html', 'w') as f:
        f.write(driver.page_source)
    missing_mic = True
except:
    pass
This try-except block attempts to search for an HTML element defined by a particular class name, along with its parent element, using XPath queries. If successful, it deduces that the microphone icon is absent. Consequently, it takes a screenshot for documentation and updates the ‘missing_mic’ variable to True.
Similar to the previous pop-up handling, this section aims to locate and click the button that grants microphone access, followed by capturing a screenshot.
print("Disable camera")
if not missing_mic:
    driver.find_element(By.XPATH,'...').click()
    sleep(2)
else:
    print("assuming missing mic = missing camera")
driver.save_screenshot('screenshots/disable_camera.png')
print("Done save camera")
This part of the code verifies whether missing_mic is set to False. If it is, the code attempts to turn off the camera by triggering a button click. On the other hand, if the microphone is absent, the code presumes the camera is also unavailable.
This code takes a series of steps aimed at completing the authentication process. It starts by clicking a designated button, then enters the name ‘TEST’, and finishes by clicking another button to advance. After the name is entered, a screenshot named give_non_registered_name.png is captured for record-keeping. The “except” section is a fallback: if any operation in the “try” section fails, the script assumes authentication is already complete, takes a screenshot named authentication_already_done.png, and moves on to its next task.
# try every 5 seconds for a maximum of 5 minutes
# current date and time
now = datetime.datetime.now()
# os.getenv returns a string when the variable is set, so convert it to a number
max_time = now + datetime.timedelta(minutes=int(os.getenv('MAX_WAITING_TIME_IN_MINUTES', 5)))

joined = False

while now < max_time and not joined:
    driver.save_screenshot('screenshots/joined.png')
    print("Done save joined")
    sleep(5)

    try:
        driver.find_element(By.XPATH,
                            '/html/body/div[1]/div[3]/span/div[2]/div/div/div[2]/div[1]/button').click()
        driver.save_screenshot('screenshots/remove_popup.png')
        print("Done save popup in meeting")
    except:
        print("No popup in meeting")

    print("Try to click expand options")
    elements = driver.find_elements(By.CLASS_NAME, "VfPpkd-Bz112c-LgbsSe")
    expand_options = False
    for element in elements:
        if element.get_attribute("aria-label") == "More options":
            try:
                element.click()
                expand_options = True
                print("Expand options clicked")
            except:
                print("Not able to click expand options")

    driver.save_screenshot('screenshots/expand_options.png')

    sleep(2)
    print("Try to move to full screen")

    if expand_options:
        li_elements = driver.find_elements(By.CLASS_NAME, "V4jiNc.VfPpkd-StrnGf-rymPhb-ibnC6b")
        for li_element in li_elements:
            txt = li_element.text.strip().lower()
            if "fullscreen" in txt:
                li_element.click()
                print("Full Screen clicked")
                joined = True
                break
            elif "minimize" in txt:
                # means that you are already in fullscreen for some reason
                joined = True
                break
            elif "close_fullscreen" in txt:
                # means that you are already in fullscreen for some reason
                joined = True
                break
            else:
                pass

    driver.save_screenshot('screenshots/full_screen.png')
    print("Done save full screen")

    # refresh the clock so that the time limit is enforced on the next pass
    now = datetime.datetime.now()
This automates the task of joining a virtual meeting and switching to fullscreen mode in a web browser using Selenium for the automation. It establishes a time limit for the script’s operation, setting it to 5 minutes by default. This can be adjusted via an environment variable. A flag named ‘joined’ is set to False to monitor if the meeting has been joined.
In the core of the script, a while loop executes several operations. It starts by snapping a screenshot, stored as ‘joined.png’, to assist in any future debugging. A pause of 5 seconds follows to give the system some breathing room. Next, the script hunts for a popup button to click. If it fails to find one, a console message gets displayed.
The loop then searches for a “More options” button and clicks it if found. At this point, another screenshot is taken and saved as ‘expand_options.png.’ A small delay occurs before the script searches for the elusive “Fullscreen” button in the newly revealed options. Finding and clicking this button sets the ‘joined’ flag to True, thus terminating the while loop.
Throughout this process, the script maintains a log, capturing essential messages and screenshots for debugging or for keeping records.
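The retry-until-deadline pattern behind this loop can be isolated into a small helper. The sketch below is a generic illustration rather than the bot's own code: wait_until and its parameters are names chosen for this example, and the check callable stands in for the Selenium logic that tries to reach fullscreen.

```python
import time


def wait_until(check, timeout_seconds, interval_seconds=5.0,
               clock=time.monotonic, pause=time.sleep):
    """Call `check` repeatedly until it returns True or the timeout elapses.

    Returns True if `check` succeeded, False if the deadline passed first.
    """
    deadline = clock() + timeout_seconds
    while True:
        if check():
            return True
        # The clock is read on every pass, which is what enforces the limit.
        if clock() >= deadline:
            return False
        pause(interval_seconds)


if __name__ == "__main__":
    attempts = []

    def joined_on_third_try():
        attempts.append(1)
        return len(attempts) == 3

    print(wait_until(joined_on_third_try, timeout_seconds=10, interval_seconds=0.1))
```

Using a monotonic clock rather than wall-clock time keeps the timeout stable even if the system time changes while the bot waits in the lobby.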
The script uses FFmpeg to capture meetings and bases its recording length on an environment variable called DURATION_IN_MINUTES. In the absence of this variable, it defaults to recording for 15 minutes. To prepare for the FFmpeg command, this duration is transformed into seconds by multiplying it by 60.
Before kicking off the recording process, “Start recording” is displayed in the console to signal the beginning. The script then builds an FFmpeg command, incorporating the calculated duration along with other settings such as video dimensions, frame rate, and audio input options. To execute the command, it calls run_command_async, the helper function defined earlier, which relies on Python’s asyncio library.
This function runs the command in an asynchronous manner, allowing other tasks to proceed without waiting. After the recording concludes, a “Done recording” message appears in the console, serving as a confirmation of the task’s completion.
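Since the recording command itself is not reproduced here, the following is only a sketch of how such a command might be assembled. It assumes the virtual screen is display :99 and that audio comes from the default PulseAudio source; the display number, frame rate, codecs, and the helper name build_ffmpeg_command are all illustrative assumptions. The output path matches the file the transcription step reads.

```python
import os


def build_ffmpeg_command(duration_minutes, display=":99",
                         output_path="recordings/output.mp4"):
    """Build an FFmpeg command that records the virtual screen and audio."""
    # FFmpeg's -t option takes seconds, hence the multiplication by 60.
    duration_seconds = int(duration_minutes) * 60
    return (
        "ffmpeg -y "
        # Video input: grab the X11 display at 1920x1080, 30 frames per second.
        "-video_size 1920x1080 -framerate 30 -f x11grab "
        f"-i {display} "
        # Audio input: the default PulseAudio source.
        "-f pulse -i default "
        # Stop after the requested duration and encode to H.264 / AAC.
        f"-t {duration_seconds} "
        "-c:v libx264 -pix_fmt yuv420p -c:a aac "
        f"{output_path}"
    )


if __name__ == "__main__":
    print(build_ffmpeg_command(os.getenv("DURATION_IN_MINUTES", "15")))
```

The resulting string could be passed to run_command_async, with “Start recording” and “Done recording” printed before and after the call.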
print("Transcribing using Gladia")

headers = {
    'x-gladia-key': os.getenv('GLADIA_API_KEY', ''),
    'accept': 'application/json',
}

file_path = 'recordings/output.mp4' # Change with your file path

if os.path.exists(file_path): # This is here to check if the file exists
    print("- File exists")
else:
    print("- File does not exist")

file_name, file_extension = os.path.splitext(file_path) # Get your audio file name + extension

if str(os.getenv('DIARIZATION')).lower() in ['true', 't', '1', 'yes', 'y', 'oui', 'o']:
    toggle_diarization = True
else:
    toggle_diarization = False

with open(file_path, 'rb') as f: # Open the file
    file_content = f.read() # Read the content of the file

files = {
    'video': (file_path, file_content, 'video/'+file_extension[1:]), # Use the file content here
    'toggle_diarization': (None, toggle_diarization),
}

print('- Sending request to Gladia API...')
response = requests.post('https://api.gladia.io/video/text/video-transcription/', headers=headers, files=files)

if response.status_code == 200:
    print('- Request successful')
    result = response.json()
    # save the json response to recordings folder as transcript.json
    with open('recordings/transcript.json', 'w') as f:
        f.write(response.text)
else:
    print('- Request failed')
    # save the json response to recordings folder as error.json
    with open('recordings/error.json', 'w') as f:
        f.write(response.text)

print('- End of work')
The code serves as a utility for transcribing videos through the Gladia API. First off, it arranges the essential API request headers and fetches the Gladia API key from the system’s environment variables. Then, it pinpoints the video file you’re interested in transcribing by using its file path. It also double-checks to ensure the file actually exists, providing a feedback message accordingly.
Next, the script pulls out both the file name and its extension. At the same time, it scans the environment variables for a ‘diarization toggle.’ This little switch decides if the transcription should distinguish between multiple speakers in the audio clip.
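The toggle check can be factored into a small reusable helper. This is a sketch, not the bot's code: env_flag is a name chosen for this example, while the set of accepted values mirrors the list used in the snippet above.

```python
import os

# Values treated as "enabled", mirroring the list in the snippet above.
TRUTHY = {"true", "t", "1", "yes", "y", "oui", "o"}


def env_flag(name, env=None):
    """Return True if the variable `name` is set to a truthy value."""
    env = os.environ if env is None else env
    return str(env.get(name, "")).strip().lower() in TRUTHY


if __name__ == "__main__":
    print(env_flag("DIARIZATION", {"DIARIZATION": "Yes"}))
    print(env_flag("DIARIZATION", {}))
```

An unset variable counts as disabled, which matches the behaviour of the original check.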
Once all the preliminaries are out of the way, the code goes ahead and reads the video file in binary format, stashing its contents into a dedicated variable. With this in hand, it compiles the necessary payload for firing off the API request. A POST request is then dispatched to the Gladia API, followed by a quick status code check on the returned response.
If the request succeeds, as indicated by a 200 status code, the transcription data is saved to a transcript.json file. If something goes wrong, the body of the response is written to an error.json file instead. Finally, a message is printed to the console, signalling the end of the operation.
Conclusion
Creating a Google Meet bot that handles both recording and transcription could appear overwhelming at first, with challenges like bot detection and server limitations. Yet, when you have the appropriate tools and a solid grasp of the workflow, it becomes completely doable.
As a result, this bot not only helps to streamline the recording process but also adds value by enabling effective summarization of the transcriptions afterwards.
As an alternative to building the bot from scratch, you may consider using Recall, which provides a single API for meeting bots on every platform, including Google Meet. More on how it works in combination with Gladia's transcription here.