Real-time audio transcription API

Published on Oct 2024

We’re happy to announce the general availability of Gladia’s new real-time audio transcription and insights engine.

It's an easily integrated, multilingual voice API that combines speech recognition and generative AI to provide transcription, insights and assistance for contact centers, virtual meetings and editing platforms, in real time.

Gladia Real-Time transcribes audio at latency as low as 300 milliseconds, supports 100+ languages interchangeably and includes embedded custom vocabulary, named entity recognition and sentiment analysis.

Highly versatile in its applications, real-time transcription is especially valuable for contact center solutions, software providers, voice AI companies and virtual meeting recorders. We’re thrilled to deliver this upgraded and improved product to customers worldwide.

In this blog, we’ll dive into the hidden mechanisms behind real-time transcription, explore its key challenges and use cases, and explain how to get started with real-time transcription using Gladia’s API.

Understanding live transcription

In a nutshell, live transcription operates by capturing audio input from sources like microphones or streaming services, processing the audio using Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) technology, and providing a near-instant, continuous stream of transcribed text as the speaker talks.
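For illustration, here is a minimal Python sketch of that capture, transcribe, and stream loop. The `transcribe_chunk` function is a placeholder rather than Gladia's actual engine, and the WAV file stands in for a live microphone feed.

```python
# Minimal sketch of the live transcription loop: read audio in small
# chunks, run each chunk through an ASR step, and stream text out as it
# arrives. `transcribe_chunk` is a placeholder, not Gladia's engine.
import wave

CHUNK_MS = 100  # amount of audio captured per iteration


def transcribe_chunk(pcm_bytes: bytes) -> str:
    """Stand-in for the ASR + NLP step; a real engine returns rolling text."""
    return ""  # imagine "hello wor", then "hello world", ...


def stream_from_file(path: str):
    """Simulate a microphone by reading a WAV file chunk by chunk."""
    with wave.open(path, "rb") as wav:
        frames_per_chunk = int(wav.getframerate() * CHUNK_MS / 1000)
        while True:
            pcm = wav.readframes(frames_per_chunk)
            if not pcm:
                break
            yield pcm


if __name__ == "__main__":
    for chunk in stream_from_file("meeting.wav"):
        text = transcribe_chunk(chunk)
        if text:
            print(text, flush=True)  # near-instant, continuous text stream
```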

Transcribing speech in real time is rife with technical challenges and requires a hybrid ASR / NLP model architecture to yield accurate results consistently.

Gladia’s API is based on OpenAI’s Whisper ASR. Because the original version of the model doesn’t support real-time transcription or WebSockets, we reengineered Whisper to deliver top-tier transcription in real time while keeping its core functionality and quality intact. Today, the quality of Gladia’s proprietary transcription engine comes from a hybrid architecture, with optimization at every key stage of the end-to-end transcription process.

Speech recognition & NLP

First, we implement filtering or other pre-processing techniques to optimize the input audio for real-time processing.
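In practice, such pre-processing often means normalizing whatever the client sends into a predictable format. As a rough sketch (a common convention, not a description of Gladia's internal pipeline), the snippet below converts any input to 16 kHz, 16-bit mono PCM with ffmpeg before streaming:

```python
# One common pre-processing step before streaming: normalize arbitrary
# input audio to 16 kHz, 16-bit, mono PCM with ffmpeg. This shows the
# general idea only; Gladia's internal pipeline is not public.
import subprocess


def to_pcm16_mono_16k(src: str, dst: str) -> None:
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", src,        # any input format ffmpeg understands
            "-ac", "1",       # downmix to mono
            "-ar", "16000",   # resample to 16 kHz
            "-f", "s16le",    # raw signed 16-bit little-endian PCM
            dst,
        ],
        check=True,
    )


to_pcm16_mono_16k("interview.mp3", "interview.raw")
```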

Then, we need the system to accurately transcribe and understand speech. Our enhanced language detection, supporting 100+ languages, comes in handy here by allowing us to automatically determine the language or dialect relevant to your application. We use various NLP techniques to enhance the accuracy of transcription by considering context, grammar, and semantics, as well as adding word-level timestamp metadata if needed.

Our API also includes embedded custom vocabulary, letting you add entries to enhance the precision of transcription, especially for words or phrases that recur often in your audio file. All without compromising on latency. Further, Named Entity Recognition (NER) helps with identifying and extracting keywords and named entities such as organizations, names, locations, events, dates, and many more elements from audio files.
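Taken together, these options usually show up as a session configuration sent when the stream is opened. The field names below are illustrative only (check the developer documentation for the exact parameters), but they give a sense of how the features combine:

```python
# Hypothetical session configuration illustrating the features above.
# Field names are illustrative, not Gladia's exact schema.
session_config = {
    "sample_rate": 16000,                 # PCM input format
    "language_detection": "automatic",    # let the API pick among 100+ languages
    "custom_vocabulary": ["Gladia", "WebSocket", "end-pointing"],
    "word_timestamps": True,              # word-level timestamp metadata
    "named_entity_recognition": True,     # organizations, names, dates...
    "sentiment_analysis": True,
}
```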

Real-time processing

In a live transcription scenario, audio data is continuously generated as a user speaks. The ability to display the transcript as it’s being said with minimal perceptible delay is a key technical requirement for a satisfying end-user experience.

In ASR, the delay between the time a speaker utters a word or phrase and the time the ASR system produces the corresponding transcription result is known as latency.

The acceptable range for low latency is highly dependent on the specific needs of each application and end-user expectations. Our real-time latency is around 300 milliseconds, making it optimal for most contact center solutions, software providers and AI voice assistants that require real-time control and response.
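If you want to verify the latency your own application sees, a simple client-side probe is to timestamp each audio chunk when it is sent and measure the gap when the corresponding transcript event arrives. The chunk-to-event correlation below is illustrative:

```python
# Simple client-side latency probe: remember when each chunk was sent and
# measure the gap when the matching transcript event comes back.
import time

sent_at: dict[int, float] = {}  # chunk_id -> send timestamp


def on_chunk_sent(chunk_id: int) -> None:
    sent_at[chunk_id] = time.monotonic()


def on_transcript(chunk_id: int, text: str) -> None:
    latency_ms = (time.monotonic() - sent_at.pop(chunk_id)) * 1000
    print(f"{latency_ms:.0f} ms  {text}")
```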

To ensure a consistent, real-time flow of information, we rely on advanced streaming capabilities and use a combination of WebSocket and VAD technologies.

WebSocket is a protocol that facilitates bidirectional, real-time communication between a client (e.g. a web browser or application) and a server (where our API is hosted), ensuring consistent low-latency audio transmission and updates. The result: immediate access to live transcriptions for end users, with reduced network overhead and resource utilization on both the client and server sides. To learn more about setting up a WebSocket and using it with Gladia, check our Golang tutorial on the topic; examples in other programming languages are available in the Gladia repository on GitHub.
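To give a feel for the client side, here is a minimal sketch of a streaming WebSocket client in Python using the `websockets` library. The endpoint URL, chunk size and message format are placeholders; refer to the tutorial and documentation above for the actual protocol.

```python
# Minimal WebSocket streaming sketch with the `websockets` library
# (pip install websockets). URL and message format are placeholders.
import asyncio
import json

import websockets

WS_URL = "wss://example.com/v2/live"  # placeholder endpoint, not the real one
CHUNK_SIZE = 3200                     # ~100 ms of 16 kHz 16-bit mono PCM


async def stream(path: str) -> None:
    async with websockets.connect(WS_URL) as ws:

        async def sender() -> None:
            with open(path, "rb") as audio:
                while chunk := audio.read(CHUNK_SIZE):
                    await ws.send(chunk)      # binary audio frames go up
                    await asyncio.sleep(0.1)  # pace the file like a live source

        async def receiver() -> None:
            async for message in ws:          # transcript events come down
                print(json.loads(message))    # e.g. partial/final transcripts

        await asyncio.gather(sender(), receiver())


if __name__ == "__main__":
    asyncio.run(stream("interview.raw"))
```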

Voice Activity Detection (VAD) is a technology used to determine whether there is significant audio activity (speech) in an audio signal. It analyzes incoming audio data and identifies periods of speech and silence. End-pointing is an especially critical step in VAD, where the system identifies the moment when speech ends or transitions into silence or non-speech sounds, producing more accurate final results. By default, 300 milliseconds of silence triggers the transcription of the preceding speech, and customers can adjust this duration to match how speech flows in their audio.
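To illustrate the end-pointing mechanism, here is a naive, energy-based VAD sketch: roughly 300 ms of consecutive low-energy frames closes the current utterance. Production VADs rely on trained models and are far more robust; the threshold here is arbitrary.

```python
# Naive energy-based VAD to illustrate end-pointing: ~300 ms of
# consecutive low-energy frames marks the end of an utterance.
import array

FRAME_MS = 30                            # duration of each analysed frame
SILENCE_MS = 300                         # default end-pointing duration
SILENCE_FRAMES = SILENCE_MS // FRAME_MS
ENERGY_THRESHOLD = 500.0                 # RMS threshold, depends on your audio


def frame_rms(frame: bytes) -> float:
    """Root-mean-square energy of a frame of 16-bit signed PCM."""
    samples = array.array("h", frame)
    if not samples:
        return 0.0
    return (sum(s * s for s in samples) / len(samples)) ** 0.5


def detect_utterances(frames: list[bytes]):
    """Yield lists of frames, one list per detected utterance."""
    utterance, silent = [], 0
    for frame in frames:
        if frame_rms(frame) >= ENERGY_THRESHOLD:
            utterance.append(frame)
            silent = 0
        elif utterance:
            silent += 1
            if silent >= SILENCE_FRAMES:
                yield utterance          # endpoint reached, finalize segment
                utterance, silent = [], 0
    if utterance:
        yield utterance                  # flush the last segment
```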

Combining WebSockets with VAD enabled us to build an efficient and responsive live transcription machine, delivering great results in real-life professional use cases in terms of both accuracy and latency.

Important to know 💡

What is the difference between partials and finals?

Partial recognition, or ‘partials’, involves transcribing portions of spoken words or phrases as they are received, even before the speaker has finished speaking the entire word or sentence. Transcribing speech “as you go” in this way makes for lower-than-average latency, at the expense of accuracy.

In contrast, final recognition, or ‘finals’, occurs when the ASR system has enough information to transcribe a complete word or phrase. It waits for a clear endpoint before providing a transcription and is powered by a bigger model that “rewrites” the script retrospectively. The delay may be slightly longer, but still provides a near-instant experience for the user.

When to use each?

Gladia API uses a hybrid approach that combines both partial and final recognition. Our system transcribes partial segments for real-time feedback and switches to final recognition when it has enough context to transcribe with high accuracy.

As a rule of thumb, we generally recommend prioritizing finals for their greater accuracy. That said, partials can be incredibly useful for use cases where a real-time UI display is a must.
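In client code, the distinction typically appears as an event type on each incoming message. A common pattern (field names are illustrative, not Gladia's exact schema) is to overwrite the display line with partials and commit finals to the stored transcript:

```python
# Sketch of handling partial vs. final transcript events on the client.
# Event field names are illustrative.
final_transcript: list[str] = []


def handle_event(event: dict) -> None:
    text = event.get("transcription", "")
    if event.get("type") == "partial":
        # Partials are provisional: overwrite the current display line.
        print(f"\r{text}", end="", flush=True)
    elif event.get("type") == "final":
        # Finals are stable: commit them and move to a new line.
        final_transcript.append(text)
        print(f"\r{text}")
```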

Scalability and load balancing

Because the bidirectional WebSocket flow is constant, the underlying infrastructure needs to be running 100% of the time, which makes it more expensive to operate.

To draw an analogy, audio processed via batch, or asynchronous, transcription can be compared to a ZIP file: because the work can be compressed in time, its cost for an API provider is significantly lower. With this kind of file, the so-called ‘real-time factor’ of execution is very small, e.g. around 1/60 for a standard hour-long audio without diarization, meaning roughly one minute of compute per hour of audio. Audio sourced from live streaming scenarios, by contrast, is closer to a 1/1 factor, since processing has to keep pace with the speaker.

As such, the final key challenge of providing a live transcription API consists of finding ways to ease the load on the underlying infrastructure without imposing high costs on the client. To address this, a speech-to-text provider must design an internal infrastructure capable of scaling horizontally.

At Gladia, we implement special load-balancing strategies to distribute transcription requests across multiple servers and instances to handle high volumes of audio input – without making our clients bear an unreasonable cost. 
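As a toy illustration of the principle (not our production setup), a ‘least connections’ policy routes each new streaming session to the worker currently holding the fewest live WebSocket connections:

```python
# Toy "least connections" balancer: route each new streaming session to
# the worker with the fewest active WebSocket sessions. Illustrative only.
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    active_sessions: int = 0


@dataclass
class LeastConnectionsBalancer:
    workers: list[Worker] = field(default_factory=list)

    def acquire(self) -> Worker:
        worker = min(self.workers, key=lambda w: w.active_sessions)
        worker.active_sessions += 1
        return worker

    def release(self, worker: Worker) -> None:
        worker.active_sessions -= 1


balancer = LeastConnectionsBalancer([Worker("gpu-1"), Worker("gpu-2"), Worker("gpu-3")])
session_worker = balancer.acquire()  # new live session lands on the least-busy worker
```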

Use cases for live transcription

Complex as it may be on the technical side, live transcription is an incredibly valuable feature: it gives immediate access to speaker insights and enables a delightful user experience.

Real-time transcription is especially useful in scenarios where you need to react directly to what's being said and where very low latency or wait time is required. Conversational bots are another common application, as are real-time captions for conferences and video streams.

Here are some specific use cases we’ve worked with at Gladia so far:

  • Virtual and on-site meetings. Documenting time-sensitive meetings without having to wait for the transcript or generating real-time captions in international meetings.
  • Customer support and call centers. Transcribing customer inquiries and agent responses in real-time to assist customer service representatives in providing more accurate and efficient support and conducting quality assurance.
  • Voice AI. Real-time transcription can be highly useful for companies using speech-to-text and text-to-speech APIs in their product experiences, targeting businesses that require enhanced communication capabilities.
  • Healthcare. Transcription during both in-person and remote medical consultations, as well as for emergency call services, to make better use of medical personnel's valuable time. Can be used for medical conferences, too.
  • Finance. Providing the stakeholders with immediate access to up-to-date financial information in an industry where speed is key.
  • Media. Making use of the feature during live broadcasting and events for real-time subtitling and dubbing.

Getting started with Gladia live transcription API

To get started with live transcription, you can create a free account on app.gladia.io and consult our developer documentation for more detailed guidance on its implementation. 

About Gladia

At Gladia, we built an optimized version of Whisper in the form of an API, adapted to real-life professional use cases and distinguished by exceptional accuracy, speed, extended multilingual capabilities and state-of-the-art features, including speaker diarization and word-level timestamps.

Contact us

