Introducing Whisper-Zero

Published on Nov 27, 2024

Today, we're thrilled to release a new breakthrough ASR system, Whisper-Zero: a complete rework of Whisper combined with multiple state-of-the-art models, using over 1.5 million hours of diverse audio, including phone-quality and noisy data from real-life environments.

The biggest product milestone for Gladia to date, Whisper-Zero removes virtually all hallucinations from transcription, providing better accuracy, faster speed, enhanced language support, and more features to our users. All in a single production-ready transcription and audio intelligence API.

Our story with optimizing Whisper

Gladia’s core product has been based on the Whisper architecture since our inception. Released by OpenAI in 2022, the transformer-based Whisper model set a new standard in automatic speech recognition (ASR) for accuracy and multilingual capabilities. Despite its many advantages, the model came with usage limitations and hardware requirements that made it impractical for enterprise needs and scale.

In the months following Whisper's release, Gladia transformed the open-source version of the model into a production-grade transcription API for companies. Compared to the original, Gladia delivered better accuracy, extended multilingual support, and additional high-value features such as live streaming transcription, translation, speaker diarization, word timestamps, and code-switching (i.e., detecting a language change within an audio recording).

There was one pain point we had yet to solve: hallucinations, a phenomenon where an ASR system produces transcriptions that include words or phrases that were not present in the original audio.
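To make the definition concrete, here is a minimal sketch of how inserted text can be spotted when a trusted reference transcript is available. It is an illustration only, not Gladia's detection method: the function name `inserted_spans` and the sample sentences are hypothetical, and the approach assumes a simple word-level alignment using Python's standard `difflib`.

```python
from difflib import SequenceMatcher


def inserted_spans(reference: str, hypothesis: str) -> list[str]:
    """Return runs of hypothesis words that have no counterpart in the reference."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    spans = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" means the hypothesis contains words absent from the reference.
        if tag == "insert":
            spans.append(" ".join(hyp_words[j1:j2]))
    return spans


# Hypothetical example: the model appends a phrase that was never spoken.
reference = "thanks for joining the call today"
hypothesis = "thanks for joining the call today please subscribe to my channel"
print(inserted_spans(reference, hypothesis))
# -> ['please subscribe to my channel']
```

In practice a reference transcript is rarely available at inference time, so a check like this is mainly useful for offline evaluation on a labeled test set.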

Towards hallucination-free audio transcription

Because its decoder behaves like a GPT-style autoregressive language model, Whisper is notoriously prone to hallucinations, which result from internal factors (such as training data and model architecture) and external factors (such as complex input audio). It has even been reported that the latest version of the model, Whisper large-v3, released a few weeks back by OpenAI, is in fact more likely to hallucinate than the most accurate of the 'Whispers', large-v2.

Despite being described by the CEO of OpenAI as the "magic of AI", hallucinations are in reality a huge pain point for any company that relies on transcription to improve its operations and deliver a better user experience. By reducing the overall accuracy of transcription, they make it harder for companies to build ASR-powered apps on top of transcripts. This is especially true in use cases where data extracted from transcriptions feeds a database directly, as in automated CRM enrichment, or where the transcript is displayed in real time to the end user as live captions.

Gladia has committed to fixing this issue once and for all. In addition to upgrading the existing feature set, we have improved the model’s architecture to mitigate Whisper’s hallucination flaw. The resulting word error rate (WER), a metric used to assess the accuracy of speech recognition systems, is 10-15% better than that of both Whisper large v2 and v3.
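For readers unfamiliar with the metric, WER is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of words in the reference. The sketch below computes it with a standard word-level edit distance. The function name `word_error_rate` and the sample sentences are assumptions made for illustration, and the text normalization (lowercasing, whitespace splitting) is deliberately simplistic; it does not describe how Gladia benchmarks its models.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word out of six reference words.
ref_text = "the patient takes ten milligrams daily"
hyp_text = "the patient takes two milligrams daily"
print(round(word_error_rate(ref_text, hyp_text), 3))
# -> 0.167
```

Because insertions count as errors, hallucinated phrases raise WER directly, and WER can exceed 1.0 when the transcript contains many more words than the reference.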

Delivering the best version of enterprise Whisper

Moreover, Whisper-Zero has been optimized specifically for complex environments to address another Whisper limitation: the base model was trained on large volumes of data collected from the internet, which makes it a versatile yet generalist audio model that is statistically biased towards phrases that have nothing to do with professional audio data.

With the fine-tuning and prompt engineering done by Gladia, our customers in online meetings, media, call centers, and other domains can now enjoy better precision in real-life, non-sterile scenarios.

In addition, for this release we have put special emphasis on enhancing transcription accuracy in multilingual environments, with Whisper-Zero fine-tuned to recognize a wide variety of accents.

In a nutshell, today we’re offering the market the best enterprise-grade version of Whisper, which removes its biggest limitations, boosts performance, and enhances its capabilities with more features. You can now use it in the cloud, at enterprise scale, without those limitations.
