What startups should look for in a speech-to-text API
Published on Jan 22, 2025
The revolution in both LLMs and voice technology in recent years has opened up unprecedented opportunities for startups. From virtual meeting assistants to AI voice agents, speech-to-text (STT) capabilities are becoming central to modern applications. However, choosing the right STT API provider involves navigating a complex landscape of technical specifications, features, and trade-offs that can significantly impact your product's success.
Speech-to-text is a core component of your product, but likely not one you're ready or willing to build in-house: you'd rather focus your existing resources on your core, differentiating product. So you need a reliable, high-quality API that covers transcription and beyond at a reasonable cost, and keeps doing so as your volumes grow and your project takes off.
In this blog, we explore the critical factors that will help you make an informed decision on which provider to pick, whether you're launching your first MVP or consolidating your product-market fit (PMF), including:
Latency
Accuracy
Features
Language support
Security and regulation
Hosting
Looking for a more general primer on STT technology and what it takes to build a voice app? Check out this comprehensive guide instead.
Understanding the processing paradigms: Async vs. real-time latency
When building a voice platform or app, one of the first and most crucial decisions you'll face is choosing the type of transcription best suited to your product and use case: asynchronous (async) or real-time. The key distinction lies in how quickly the transcription is generated and how the audio is processed in relation to the speech. Imagine the difference between a court stenographer typing in real time and a transcriptionist working from a recording; both serve a purpose, and each comes with distinct advantages.
Real-time transcription delivers instant results with latencies under one second, making it ideal for applications that require immediate interaction. For instance, a customer service bot needs to understand and respond to queries instantly, much like a human agent would. Similarly, live captioning for virtual events demands immediate transcription to keep pace with speakers.
Async processing (also known as 'batch'), on the other hand, takes a more measured approach. While it might take minutes or hours to process audio files, it often delivers higher accuracy and is more cost-effective. This makes it perfect for applications like meeting summaries or podcast transcription services, where immediate results aren't crucial.
Modern technology has begun to bridge this gap, offering solutions that combine real-time response with batch-level accuracy. However, this comes with higher computational demands and, consequently, increased costs. Understanding this trade-off is crucial for planning your startup's resource allocation.
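To make the difference concrete, here is a minimal sketch of how the two paradigms typically differ from an integration standpoint. The endpoints, payload fields, and response shapes are hypothetical placeholders rather than any specific provider's API, and real-time APIs usually interleave sends and receives more flexibly than this simplified loop.

```python
import time

import requests    # pip install requests
import websockets  # pip install websockets

API_BASE = "https://api.example-stt.com"  # hypothetical provider endpoint
API_KEY = "YOUR_API_KEY"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}


def transcribe_async(audio_path: str) -> str:
    """Async (batch) flow: upload a recording, then poll until the job is done.
    Results can take minutes, but accuracy is typically higher and cost lower."""
    with open(audio_path, "rb") as f:
        job = requests.post(f"{API_BASE}/v1/transcriptions",
                            headers=HEADERS, files={"audio": f}).json()
    while True:
        result = requests.get(f"{API_BASE}/v1/transcriptions/{job['id']}",
                              headers=HEADERS).json()
        if result["status"] == "done":
            return result["text"]
        time.sleep(5)  # the job keeps running server-side while we wait


async def transcribe_realtime(audio_chunks) -> None:
    """Real-time flow: stream small audio chunks over a persistent connection
    and receive partial transcripts well under a second behind the speaker."""
    async with websockets.connect(f"wss://api.example-stt.com/v1/stream?key={API_KEY}") as ws:
        for chunk in audio_chunks:     # e.g. ~100 ms of PCM audio per chunk
            await ws.send(chunk)
            partial = await ws.recv()  # partial transcript so far
            print(partial)
```

In practice, the async flow is a couple of HTTP calls and some patience, while the real-time flow means managing a long-lived connection, audio chunking, and partial results inside your application code.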
A word on accuracy in speech-to-text
For businesses operating in customer service, such as contact centers, maintaining accuracy amidst varying audio quality and background noise is essential. A transcription solution must adapt to challenging environments such as low-quality calls, network disruptions, and diverse speaker accents.

Word Error Rate (WER) is a widely used metric for assessing the accuracy of ASR technologies. It provides a standardized way to compare different speech-to-text (STT) models and providers, helping organizations evaluate their options.
WER measures the percentage of words in the output that differ from the words in the reference or ground truth text. A lower WER indicates better performance of the system and vice versa. However, WER has notable limitations, particularly when applied to real-world enterprise scenarios. While designed to measure accuracy against an 'ideal' academic benchmark, WER often falls short in reflecting performance in practical use cases.
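In practice, WER is computed as the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of words in the reference. Here is a minimal illustration using the open-source jiwer package; the sentences are invented for the example.

```python
from jiwer import wer  # pip install jiwer

reference  = "please charge the order to the corporate account"
hypothesis = "please charge the order to the cooperate account"

# WER = (substitutions + deletions + insertions) / number of words in the reference
print(f"WER: {wer(reference, hypothesis):.2%}")  # one substitution over 8 words = 12.50%
```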
In professional environments, the critical factor is not overall transcription accuracy but the precision of key elements—such as names, addresses, or other specific data—used for downstream processes like CRM enrichment. Standardized benchmarks like WER rarely capture this nuance. Disregarding a model solely based on its WER score could mean overlooking a solution that, with customization, performs exceptionally well for your specific needs.
When choosing an ASR provider, impeccable accuracy often comes with trade-offs in speed and cost, and not every use case demands perfect transcription. Podcast editing, subtitling, and translation workflows typically prioritize accuracy over speed, whereas applications like summarization, where AI distills key insights from spoken content, may tolerate minor transcription errors as long as the core message remains clear. It's essential to go beyond standardized metrics like WER when assessing accuracy.
Instead, focus on how well the ASR system performs in real-world conditions, such as handling background noise, diverse speaking styles, and a variety of languages and accents. Using your own datasets during evaluation can provide a more accurate representation of how the system will function in your specific environment.
When selecting an ASR provider for your startup, it’s crucial to look beyond WER and evaluate how well a model addresses the unique demands of your use case. Balance your accuracy needs with the trade-offs in speed and cost while keeping your use case in mind. By evaluating ASR systems under realistic conditions and tailoring metrics to your requirements, you can ensure the chosen solution aligns with both your operational goals and end-user expectations.
From transcription to insights: Key extra features to consider
On top of the core functionality of transcribing audio to text, STT providers are developing additional features that make transcripts easier to digest and can provide insights from the audio data. Let’s take a closer look at some of the most popular features among LLM-based voice platforms.
Speaker Diarization
Speaker diarization might sound technical, but it's simply the ability to determine who said what in a conversation. Think of it as an invisible secretary taking notes in a meeting, accurately attributing each statement to its speaker. This feature is measured by the Diarization Error Rate (DER), which accounts for confusion between speakers, missed speech, and false detections. The importance of this feature becomes clear in scenarios like meeting transcription services or interview analysis tools. Without accurate speaker attribution, a transcript of a lively team discussion would read like a confusing monologue.
Gladia's partner, pyannoteAI, is an industry leader in this area, with diarization models that boast one of the highest precision levels and state-of-the-art solutions for voice AI, including overlapping speech detection.
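To see why this matters downstream, here is a minimal sketch of turning diarized output into a readable, speaker-attributed transcript. The utterance structure below is a hypothetical response shape, not any particular provider's schema.

```python
# Hypothetical diarized output: each utterance carries a speaker label and timestamps.
utterances = [
    {"speaker": 0, "start": 0.4, "end": 3.1, "text": "Shall we review the roadmap first?"},
    {"speaker": 1, "start": 3.3, "end": 6.8, "text": "Yes, let's start with the Q3 priorities."},
    {"speaker": 0, "start": 7.0, "end": 9.2, "text": "Fine by me."},
]

for u in utterances:
    # Render "[MM:SS] Speaker N: text" lines for a human-readable transcript.
    minutes, seconds = divmod(int(u["start"]), 60)
    print(f"[{minutes:02d}:{seconds:02d}] Speaker {u['speaker']}: {u['text']}")
```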
Custom Vocabulary
Every industry has its unique terminology. Medical professionals discuss "myocardial infarctions" rather than heart attacks, while tech professionals casually throw around terms like "API" and "SDK." Custom vocabulary features allow you to teach the STT system these specialized terms, significantly improving transcription accuracy in your specific domain.
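Most APIs expose this as an extra field on the transcription request. The sketch below assumes a hypothetical endpoint and a `custom_vocabulary` parameter; actual field names, and whether terms can carry a bias weight, vary by provider.

```python
import requests  # pip install requests

# Hypothetical request: pass domain-specific terms alongside the audio so the
# model is biased towards them during decoding.
payload = {
    "audio_url": "https://example.com/cardiology-consult.mp3",
    "custom_vocabulary": ["myocardial infarction", "troponin", "stent", "SDK"],
}
response = requests.post(
    "https://api.example-stt.com/v1/transcriptions",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=payload,
)
print(response.json())
```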
Named Entity Recognition (NER)
NER extracts and identifies key information from unstructured audio data, such as names of people, organizations, locations, and more.
A common challenge with unstructured data is that this critical information isn't readily accessible: it's buried within the transcript.

To solve this, Gladia developed a structured Key Data Extraction (KDE) approach. By leveraging the generative capabilities of its Whisper-based architecture, similar to LLMs, Gladia's KDE captures context to identify and extract relevant information directly. This process can be further enhanced with features like custom vocabulary and NER, allowing businesses to populate CRMs with key data quickly and efficiently.
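As a simple illustration of where this leads, the sketch below maps NER-style output to CRM fields. The entity structure is a generic example of what such payloads tend to look like, not Gladia's actual schema.

```python
# Hypothetical NER output: label + extracted text span for each entity.
entities = [
    {"label": "person",       "text": "Marie Dubois"},
    {"label": "organization", "text": "Acme Logistics"},
    {"label": "location",     "text": "12 Rue de Rivoli, Paris"},
    {"label": "person",       "text": "Marie Dubois"},  # duplicates are common
]

# Keep the first occurrence per label; a real pipeline would deduplicate,
# validate, and match against existing CRM records before writing anything.
crm_record = {}
for entity in entities:
    crm_record.setdefault(entity["label"], entity["text"])

print(crm_record)
# {'person': 'Marie Dubois', 'organization': 'Acme Logistics', 'location': '12 Rue de Rivoli, Paris'}
```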
Some other popular features included in most commercial APIs include:
Topic classification: Categorizes content into predefined topics for easier content indexing.
Sentiment analysis: Analyzes the sentiment behind audio recordings to improve customer experiences and sales performance.
Speech moderation: Flags inappropriate or offensive speech, such as hate speech, based on customizable parameters.
Depending on your use case, you might need some or all of these features alongside transcription. You can opt for a one-stop-shop provider that supports all these functionalities or choose a provider with the best core transcription capabilities and customize the additional features you need.
Below is an overview of the features Gladia’s API supports, offering flexibility and efficiency tailored to your business needs.
Language support
One of the biggest challenges in ASR transcription is supporting and detecting speech in several languages.
Transformer-based multilingual ASR systems rely on language detection as their ears. Deep learning models trained on vast amounts of multilingual audio data analyze the incoming speech, identify subtle variations in speech patterns, and tailor the processing to the detected language.
While some models can identify the language on the fly, others – including among the Big Tech providers – require you to specify the input language to transcribe it correctly. However, very few, if any, models have been equally exposed to all languages. Most ASR models lean heavily towards certain languages because of the datasets they were trained on, or because they assign uneven weight to certain parameters in the transcription process. Some of them perform well exclusively with English and the 30 most commonly represented languages in written media, with varying WER indicators across languages.
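In integration terms, the difference usually comes down to one request parameter. The field names below are hypothetical, but they illustrate the two modes you will typically encounter: automatic detection versus an explicitly declared input language.

```python
# Hypothetical request bodies; actual parameter names vary by provider.

# Mode 1: let the model identify the language on the fly.
auto_detect_request = {
    "audio_url": "https://example.com/multilingual-call.mp3",
    "detect_language": True,
}

# Mode 2: declare the input language up front (required by providers without
# automatic detection, and sometimes faster when you already know the language).
explicit_language_request = {
    "audio_url": "https://example.com/support-call-es.mp3",
    "language": "es",  # ISO 639-1 code for Spanish
}
```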
To address this issue, you can leverage pre-trained models and fine-tune them for specific languages and tasks such as translation, or look for a provider that offers such optimizations as part of its API package.
When selecting an STT provider, ensure it performs well across all relevant languages, accents, and dialects. Providers may claim broad language support, but real-world performance can vary, so thorough internal testing is essential.
If you want to learn more about how ASR models navigate languages – and which ones do it best – check our dedicated article.
Security and compliance
Given the highly confidential nature of enterprise audio data, it’s becoming increasingly important to verify how a provider approaches data privacy.
Ask your STT provider about any security-related certifications they hold, such as SOC 2 Type 1/Type 2, HIPAA, and ISO 27001 or ISO 27701, which validate that the company has appropriate security and compliance processes in place.
Furthermore, here are additional security-related techniques that can be applied to protect your audio data:
Encryption. Helps to protect sensitive or confidential information contained in audio files, such as customer data, trade secrets, or intellectual property.
Speech moderation. Automatically identifies and flags hate speech or other inappropriate and offensive verbal content based on predetermined parameters, internal protocols, or external regulations.
Anonymization of Personally Identifiable Information (PII). Also known as PII redaction, this detects, tags, and removes any personally identifying information, such as an address, card number, SSN, or phone number.

While self-hosting remains the safest option when it comes to data privacy, commercial providers now achieve comparable levels of security. Besides certifications and features, the hosting architecture options offered by your provider can further protect your data, as explained below.
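Before diving into hosting options, here is a deliberately simple sketch of what PII redaction looks like on a finished transcript. Real systems rely on trained NER models rather than regular expressions, and most providers expose redaction as a single request option, but the example shows the basic detect-tag-replace idea.

```python
import re

# Toy patterns for a few PII types; production redaction uses NER models,
# not regexes, and covers many more entity types (names, addresses, emails...).
# Order matters: the card pattern runs before the looser phone pattern.
PATTERNS = {
    "CARD":  re.compile(r"\b(?:\d[ -]?){12,15}\d\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\+?\d[\d\s\-]{7,}\d"),
}

def redact(transcript: str) -> str:
    """Replace detected PII spans with a placeholder tag."""
    for label, pattern in PATTERNS.items():
        transcript = pattern.sub(f"[{label}]", transcript)
    return transcript

print(redact("Sure, my card number is 4111 1111 1111 1111 and you can call me on +33 6 12 34 56 78."))
# -> "Sure, my card number is [CARD] and you can call me on [PHONE]."
```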
Hosting
If you want to embed the speech recognition system in your existing tech stack, you need to decide where the underlying network infrastructure should be located, and who you want to own it.
Cloud multi-tenant
All users share the same hardware and the same instance of the software, which is provided by a third-party provider that oversees installation, maintenance, software upgrades, and potential patches.
Cloud single-tenant
Cloud single-tenant is similar to multi-tenant, except that each client gets a dedicated cloud infrastructure, still managed by an external provider. That means each user has access to their own instance of the software.
On premise
Licensed software is hosted in client-controlled data centers, on an exclusive physical and virtual network that tends to be managed by the company's IT department or a third-party provider.
Air gap
Works pretty much like on-premise, except that no third-party providers can access the system since it’s completely isolated, even from the internet.
When choosing between on-premise and cloud-based ASR hosting, consider factors like scalability, cost, and security. While on-premise solutions offer control and potentially better latency, they come with high deployment and maintenance costs and limited scalability. Cloud-based options provide greater flexibility, lower upfront costs, and easy scaling, with security concerns easily addressed by reputable providers. The right choice depends on your business’s needs, growth potential, and resources.
Best practices for building voice apps and audio features
For the last two years, we’ve worked closely with hundreds of startups experimenting with the powerful combination of LLMs and ASR models to create pioneering voice platforms and apps. Time and again, they’ve shared how accurate transcription and advanced language modeling working hand-in-hand are critical to achieving exceptional results.
Drawing on our research and insights from real-world implementations, we’ve compiled practical strategies to help you maximize the potential of both technologies. Download our guide below to continue exploring the many possibilities speech and language models have to offer to startups.