How to set up a Node.js transcription WebSocket with the Gladia live audio transcription API: A step-by-step guide

Published on Mar 2024

Have you ever used an audio-to-text transcription application and wondered how it works, or how to build one? Are you a developer looking to add audio transcription to your next project? This article answers these questions and more.

This article covers real-time audio and video transcription with the Gladia API using WebSocket and Node.js. With what you learn here, you will be able to transcribe audio calls in real time, provide subtitles for live audio streams, and build voice-activated assistants and chatbots using Gladia’s audio-to-text transcription feature. Let’s get started!

What is real-time audio transcription?

Real-time audio transcription is the process of converting speech into text as words are being spoken. It is a simple and effective technology involving three core steps:

  1. Accept the audio input through a mic.
  2. Convert the audio into text using advanced algorithms and trained models.
  3. Display the transcription text on a screen or print it out.

Today, businesses use speech-to-text AI routinely. Virtual meeting platforms rely on live ASR to transcribe, generate summaries, and provide live captions during meetings. Voice assistants and search engines convert spoken requests to text before processing them.

Why is WebSocket used in real-time applications?

WebSocket is a web technology that enables seamless two-way communication between the client (browser) and the server. The connection is established with an initial handshake, after which it stays open for continuous data exchange.

WebSocket is used in building real-time applications for the following reasons:

  • Bi-directional: WebSocket allows audio data to flow continuously from the client to the server while transcribed text flows back from the server to the client. This two-way exchange is essential for real-time audio transcription applications.
  • Low latency: Latency is the delay that occurs during data transmission. WebSocket keeps this delay minimal, which is crucial in real-time transcription so that transcribed text arrives almost immediately after the audio is sent.
  • Continuous streaming: Scenarios like live meetings require sustained performance as audio is streamed. WebSocket lets transcription applications process the incoming audio efficiently.
  • Reduced network load: Unlike traditional HTTP, which opens a new connection for each request, WebSocket establishes a single connection that stays open for multiple messages. This is essential for an uninterrupted and stable client-server connection.

Why use the Gladia Audio Transcription API?

The Gladia audio transcription API is a versatile AI-powered API that transcribes audio and video data. It uses automatic speech recognition (ASR) and natural language processing (NLP) technologies to unlock a range of audio intelligence features for enterprise apps.

These include speaker diarization, which separates audio transcripts based on speakers; word-level timestamps, which assign timestamps in milliseconds to each transcribed word; and code-switching, which allows speech to be transcribed accurately even as speakers switch between two or more languages during a conversation.

The Gladia API is based on OpenAI’s Whisper ASR. Because the original version of the model doesn’t support live transcription and WebSockets, our approach consisted of reengineering Whisper to add top-tier real-time transcription while keeping its core functionality and quality intact. Today, we rely on advanced streaming capabilities and use a combination of WebSocket and voice activity detection (VAD) technologies for low-latency live transcription.

Setting up the Gladia Transcription API with Node.js and WebSocket

Prerequisites

  • JavaScript (ES6)
  • Node.js and its framework, Express.js

Create a Node.js server with WebSocket

Step 1: Install dependencies

Install the following dependencies:

  • express: The web framework used to create the HTTP server.
  • ws: A WebSocket npm package used to implement WebSocket connections.
  • dotenv: A package used to load environment variables.
  • mic: A package used to capture audio data.
  • nodemon: A development tool that automatically restarts the server when files change.

Now, let’s install these packages.
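A minimal install command, assuming npm as the package manager; express is included because the server code imports it, and nodemon is saved as a development dependency:

npm install express ws dotenv mic
npm install --save-dev nodemon

With the packages installed, we can set up the server: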


const http = require('http')
const express = require('express')
const webSocket = require('ws')
const mic = require('mic')

const app = express()
const server = http.createServer(app) // wrap the Express app in an HTTP server
const PORT = 3000

// Attach a WebSocket server to the HTTP server
const wss = new webSocket.Server({ server })

wss.on('connection', () => {
  console.log('New WebSocket Connection')
})

server.listen(PORT, () => {
  console.log(`Server is running at port ${PORT}`)
})

Let’s look at the code above. It sets up an HTTP server and listens for WebSocket connections.

First, create a file named index.js and import the npm packages. Next, we create an Express app, wrap it in an HTTP server, and initialize a port number. The ws module creates the WebSocket server and sets up an event listener for connections from the client side. When a connection occurs, the message New WebSocket Connection gets printed in the terminal.

Lastly, the server listens on the specified port, 3000 in this case. Once it is up, the callback function executes and the message Server is running at port 3000 gets printed on the screen.
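During development, you can start the server with nodemon so it restarts automatically whenever a file changes (assuming nodemon was installed as shown earlier):

npx nodemon index.js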

Connect to the Gladia Audio Transcription API

Step 1: Sign up and retrieve the API key

Sign up to create an account. Upon signing up, you will receive 10 hours of free API calls, and you can upgrade your plan at any time.

Navigate to the Home page and retrieve the API key. You can also generate a new API key.

Step 2: Connect to the Gladia Audio Transcription WebSocket

Back in the code editor, make sure your project contains two files: index.js and .env. Paste the API key into the .env file.
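The .env file might look like this, with a placeholder in place of your actual key; the variable name matches what the code reads with process.env.GLADIA_API_KEY:

GLADIA_API_KEY=your_api_key_here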


const webSocket = require('ws')
require('dotenv').config() // load variables from .env

const gladiaKey = process.env.GLADIA_API_KEY
const gladiaUrl = "wss://api.gladia.io/audio/text/audio-transcription"

// Open a WebSocket connection to the Gladia live transcription endpoint
const ws = new webSocket(gladiaUrl)

We start by importing the necessary modules. Then, we access the API key saved in the .env file and define the WebSocket endpoint of the Gladia API, which we use to open the connection.


const SAMPLE_RATE = 16000

ws.on("open", () => {
  // Send the configuration message as soon as the connection opens
  const configuration = {
    x_gladia_key: gladiaKey,
    language_behaviour: "automatic single language",
    sample_rate: SAMPLE_RATE,
    encoding: "WAV",
  }
  ws.send(JSON.stringify(configuration))
})

The WebSocket instance listens for an open event once the connection to the Gladia API is established. The application then sends a configuration message consisting of several properties to the API. Let’s look at these properties and what they mean.

  • x_gladia_key: Holds the API key as its value.
  • language_behaviour: Sets how the model should detect the language of the audio data.
  • sample_rate: Defines the number of audio samples captured per second, 16,000 Hz in this case.
  • encoding: Defines the format of the audio data.


ws.on("message", (event) => {
  if (!event) return
  const utterance = JSON.parse(event.toString())
  if (!Object.keys(utterance).length) {
    console.log("Empty ...")
    return
  }

  if (utterance.event === "connected") {
    console.log(`${utterance.event} sucessfully with Connection id: ${utterance.request_id} `)
  }
})

First, the WebSocket instance listens for a message event and returns early if the event is empty.

The application then parses the message and returns if it contains no properties. If the event type is connected, the WebSocket connection was established successfully.
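Based on the fields the handler reads, a connected message carries at least an event and a request_id property. An illustrative (not exhaustive) payload might look like this, with a placeholder id:

{ "event": "connected", "request_id": "a1b2c3" }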

Integrating Gladia's real-time transcription

Step 1: Sending audio data to the API

After sending the configuration message, the audio data needs to be received through an input device, in this case, a microphone.


// Create a microphone instance matching the sample rate sent to the API
const microphone = mic({
  rate: SAMPLE_RATE,
  channels: "1", // mono audio
})

const microphoneInputStream = microphone.getAudioStream()

microphoneInputStream.on("data", function (data) {
  // Encode each audio chunk as base64 and send it to the API
  const base64 = data.toString("base64")
  if (ws.readyState === webSocket.OPEN) {
    ws.send(JSON.stringify({ frames: base64 }))
  } else {
    console.log("WebSocket ready state is not [OPEN]")
  }
})

microphoneInputStream.on("error", function (err) {
  console.log("Error in Input Stream: " + err)
})

microphone.start()

The mic module creates the microphone object, configured with the sample rate and the number of channels. To get the audio stream from the microphone, getAudioStream() is called. The input stream listens for new audio data and converts each chunk into a base64-encoded string.

Next, the application confirms the WebSocket connection is in the OPEN ready state before sending the audio data to the API, and prints an alert message otherwise.

The readyState property reflects the current state of the WebSocket connection; it must be OPEN before data can be sent or received.
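For reference, the ws module exposes the same four readyState constants as the browser WebSocket API, which is why the code above compares against webSocket.OPEN:

// readyState values exposed by the ws module
// webSocket.CONNECTING === 0  (connection not yet open)
// webSocket.OPEN       === 1  (connected; data can be sent)
// webSocket.CLOSING    === 2  (connection is closing)
// webSocket.CLOSED     === 3  (connection is closed)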

The input stream also listens for an error event and logs a message if one occurs. It is important to note that WebSocket connections can drop due to errors on either the client or the server side, so it is good practice to retry whenever an error occurs, as sketched below.
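Here is a minimal reconnection sketch, assuming the connection setup shown earlier is wrapped in a connect() function; the retry delay is a hypothetical value you can tune:

// Reconnect sketch: re-establish the WebSocket after a close or error
const RETRY_DELAY_MS = 1000 // hypothetical delay between attempts

function connect() {
  const ws = new webSocket(gladiaUrl)
  // Re-attach the open and message handlers shown earlier here
  ws.on("error", (error) => {
    console.log("An error occurred:", error.message)
  })
  ws.on("close", () => {
    console.log(`Connection closed, retrying in ${RETRY_DELAY_MS} ms`)
    setTimeout(connect, RETRY_DELAY_MS)
  })
  return ws
}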

Lastly, the microphone captures the audio data after calling the start() method.

Step 2: Receiving transcription results

There are three possible events associated with the Gladia API, namely connected, transcript, and error. We have looked at the connected event already. In this part, we will look at the transcript event.

The transcript event contains some properties. Let’s understand the major ones.

  • type: Identifies the type of transcription result. Its value is either partial or final. A partial result is an interim transcription of an utterance still in progress; once the end of the utterance is detected, for example after a pause, the result is marked final.
  • transcription: Contains the transcribed text.
  • language: Identifies the language detected from the audio data.

// Inside the same message event listener, after the "connected" check
else if (utterance.event === "transcript" && utterance.transcription) {
  console.log(
    `${utterance.type}: (${utterance.language}) ${utterance.transcription}`
  )
}

Back in the message event listener, if the event type is transcript and the transcribed text exists, the result is logged to the console. This result contains the transcribed text, the language, and other information, as shown below.

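For instance, given the console.log template above, a final English result prints in this form (the spoken sentence here is just a hypothetical example):

final: (en) Hello, welcome to the live transcription demo.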

Step 3: Error handling

There are several ways errors can occur in our application. We check for these errors and take the necessary actions to handle them.


if (!gladiaKey) {
  console.error("You must provide a gladia key. Go to app.gladia.io")
  process.exit(1) // stop the application if no API key is configured
} else {
  console.log("using the gladia key : " + gladiaKey)
}

The application checks if the API key exists before proceeding with the rest of the code.

If the API key doesn’t exist, an error message gets logged to the console. Otherwise, a message containing the API key gets printed to the console.


else if (utterance.event === "error") {
  console.error(`[${utterance.code}] ${utterance.message}`)
  ws.close() // close the connection when the API reports an error
}

In the message event listener of the WebSocket connection, we check if the received event is of type error; if so, an error message is logged to the console and the connection is closed.


ws.on("error", (error) => {
  console.log("An error occurred:", error.message)
})

The ws object listens for an error event and logs an error message if any exists.

Access the complete code in the GitHub Repository.

Conclusion

The Gladia live transcription API presents a turnkey solution for audio and video transcription in enterprise applications. For developers looking to harness its potential, setting the API up correctly is key to getting the best results in your applications.

With the help of this guide, you now know how to set up real-time audio transcription with the Gladia API in a Node.js environment with WebSocket, integrate the API to get transcription results, and handle errors effectively.

Additional resources

  • Deep dive: Gladia's real-time transcription powered by optimized Whisper
  • Tutorial: Enhancing real-time transcription with WebSockets and Golang

About Gladia

At Gladia, we built an optimized version of Whisper in the form of an API, adapted to real-life professional use cases and distinguished by exceptional accuracy, speed, extended multilingual capabilities, and state-of-the-art features.

To learn more about Gladia’s approach to enhancing the Whisper transcription performance for companies, check out our new model, Whisper-Zero, or sign up for the API directly below.

