From Speech to Knowledge: Gladia’s Audio Intelligence API

Published in April 2024

Gladia is proud to announce the general availability of its groundbreaking Speech-to-Text API, previously in alpha. The revamped enterprise-grade API supports transcription, speaker diarization, word-level timestamps, code-switching, and beta translation in 99 languages.

tl;dr

  • Gladia is proud to announce the general availability of its groundbreaking Speech-to-Text API, previously in alpha.
  • The enterprise-grade API supports transcription, speaker diarization, word-level timestamps, code-switching, and beta translation in 99 languages.
  • Following user feedback, we have prioritized accuracy while remaining one of the fastest solutions on the market (60 sec for 1h of audio). Our performance has been proven in real-life client use cases such as emergency services and virtual meetings.
  • Our tech is powered by state-of-the-art models, including the optimized Whisper Large-v2, currently being fine-tuned on 1.5 million hours of professional audio to accommodate a variety of topics, accents, and jargon. 
  • Our pricing is designed for scale at $0.00017/sec. No roundups, no starting costs, and pay-as-you-go flexibility. Generous free tier included.
  • We put privacy first. Our data handling and hosting are 100% GDPR-compliant.   
  • A full package of multilingual Audio Intelligence add-ons, including summarization, sentiment analysis, and keyword extraction, will follow in the upcoming weeks. 
  • Last but not least, we’re thrilled to unveil our brand new visual identity and website, reflective of our long-term AI Knowledge Infrastructure vision. 

From Speech-to-Text Alpha to Audio Intelligence

In February, driven by the ambition to fix the broken market of AI audio transcription, we released our Speech-to-Text (STT) alpha. Powered by a proprietary plug-and-play API, it was able to transcribe 1h of audio in less than a minute — making it one of the biggest breakthroughs in audio transcription tech to date.

The market responded. Our announcement tweet – presenting the Gladia API as 5x more accurate and 60x faster than the Big Tech providers – got half a million views in a matter of hours.

But that was only the beginning. In the weeks that followed, we enlisted the help of early adopters to build an Audio AI tool for a range of use cases like virtual meeting transcripts, podcast transcription, video subtitling, call center assistance, and more.

We realized then that of the three infamous dealbreakers in Automatic Speech Recognition (ASR) – accuracy, speed, and price – it is the quality of transcription that mattered the most for our users. So for this release, Gladia prioritized accuracy while remaining one of the fastest on the market (60s per 1h of audio) – and providing accessible pricing in a market that has long failed at making speech-to-text a commodity. 

Today, we’re thrilled to release an enterprise-grade version of our STT product: more robust, scalable, and versatile than ever. In the following weeks, we will release a complete package of Audio Intelligence add-ons – including summarization, topic classification, and emotion detection – to help you turn audio data into valuable business knowledge.

Gladia's Audio Intelligence API

Most accurate in real-life use cases

Gladia API utilizes advanced ASR technology to convert audio files, video files, or URLs into text. Powered by the most advanced AI models – including an optimized, fine-tuned version of OpenAI’s legendary Whisper – our transcription tech achieves remarkable accuracy across a broad range of use cases.

For this release, we prioritized internal tests on real-life datasets provided by our clients and annotated manually by our team. Moreover, we have begun fine-tuning Whisper Large-v2 on 1.5 million hours of professional audio to accommodate a variety of topics, accents, and jargon.
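To give a feel for how such an API is typically called, here is a minimal sketch of submitting a hosted audio file for transcription. The endpoint path, JSON field names, and option values below are illustrative assumptions, not the official API reference – consult Gladia's documentation for the real contract.

```python
# Hypothetical sketch of a transcription request.
# Endpoint path and field names are illustrative assumptions.
import json
import urllib.request

API_URL = "https://api.gladia.io/audio/text/audio-transcription/"  # assumed path

def build_request(api_key: str, audio_url: str) -> urllib.request.Request:
    """Assemble an authenticated JSON request for a hosted audio file."""
    payload = {
        "audio_url": audio_url,
        "toggle_diarization": True,  # speaker-attributed transcript
        "language_behaviour": "automatic multiple languages",  # code-switching
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"x-gladia-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

def transcribe(api_key: str, audio_url: str) -> dict:
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(api_key, audio_url)) as resp:
        return json.load(resp)
```

The same pattern applies to uploading local files, with a multipart form body instead of a JSON payload.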

Our goal: to achieve first-class performance in truly complex and diverse environments for speech-to-text.

Here are some notable examples.

  • Emergency call assistance: Factors like background noise, emotional distress, and varying accents or dialects make this type of audio very challenging. The urgency and time-sensitive nature of the calls can also lead to rapid speech, overlapping conversations, or fragmented speech patterns, further challenging the accuracy of transcription.
  • Multilingual virtual meetings: The presence of multiple languages being spoken simultaneously, different accents, and varying speech rates makes this kind of audio difficult to capture without errors. 

Designed to accommodate a variety of needs and contexts globally, our core cross-tier Transcription package includes the following features:

  • Speaker diarization - automatically partitions an audio recording into segments corresponding to different speakers
  • Word-level timestamps - associates a precise timestamp with each recognized word in the transcribed text output.
  • Code-switching - the ability to automatically detect the language, recognize speakers, and accurately transcribe the audio even as speakers switch languages throughout the conversation.
  • …as well as the standard supporting functions like noise reduction, custom vocabulary, and smart formatting.
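As a rough illustration of how these features surface in practice, the snippet below formats a hypothetical diarized, code-switched response into speaker-attributed lines. The field names (`utterances`, `speaker`, `start`, `end`, `text`) are assumptions for illustration, not Gladia's actual response schema.

```python
# Format a hypothetical diarized transcript into readable, speaker-tagged lines.
def format_transcript(result: dict) -> str:
    """Render each utterance as "[start-end] Speaker N: text"."""
    lines = []
    for utt in result["utterances"]:
        window = f"[{utt['start']:.1f}-{utt['end']:.1f}]"
        lines.append(f"{window} Speaker {utt['speaker']}: {utt['text']}")
    return "\n".join(lines)

# Example: two speakers switching between French and English.
sample = {
    "utterances": [
        {"speaker": 0, "start": 0.0, "end": 2.4, "text": "Bonjour tout le monde."},
        {"speaker": 1, "start": 2.6, "end": 5.1, "text": "Hi, shall we switch to English?"},
    ]
}
print(format_transcript(sample))
```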

In addition to exceptional accuracy, our API is designed to be scalable and versatile, able to process all kinds of files without unreasonable restrictions on input length and file size – unlike the Whisper API, which caps uploads at 25MB, Gladia can process long audio files well beyond that limit. We also support SRT and VTT outputs optimized for media captions.
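To make the captioning formats concrete, here is a sketch of how utterance timestamps map onto numbered SRT blocks. The input shape is a hypothetical example, not Gladia's actual response schema.

```python
# Convert (start, end, text) utterances into SRT caption blocks.
def to_srt_timestamp(seconds: float) -> str:
    """Render seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def to_srt(utterances: list) -> str:
    """Number each utterance and attach its time window, SRT-style."""
    blocks = []
    for i, u in enumerate(utterances, start=1):
        window = f"{to_srt_timestamp(u['start'])} --> {to_srt_timestamp(u['end'])}"
        blocks.append(f"{i}\n{window}\n{u['text']}\n")
    return "\n".join(blocks)

captions = to_srt([
    {"start": 0.0, "end": 2.4, "text": "Welcome to the show."},
    {"start": 2.6, "end": 5.1, "text": "Today we talk about speech AI."},
])
```

VTT output follows the same logic, with `.` instead of `,` in timestamps and a `WEBVTT` header.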

Deep-dive: Speaker Diarization

Since alpha, we’ve been fascinated with diarization – one of the most notoriously difficult features to nail in STT, especially in multilingual environments.

Today, we are confident to have designed one of the best diarization engines on the market, fuelled by the latest academic research and state-of-the-art models. 

As part of the proprietary AI-based split diarization, our team developed a special method to avoid transcript repetitions by seamlessly detecting unique vs. duplicate channels based on their similarity. Our API will automatically pick the right diarization method for your audio file and produce a high-quality speaker-based transcript. 
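The exact similarity measure behind our channel deduplication isn't described here, but as a rough intuition, near-identical channels can be flagged with a simple cosine-similarity check like this sketch (the threshold is an illustrative assumption):

```python
# Sketch: flag duplicate audio channels by waveform similarity,
# so only one of them needs to be transcribed.
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equally sized sample vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def is_duplicate_channel(left, right, threshold=0.98):
    """Treat two channels as duplicates when their waveforms are near-identical."""
    return cosine_similarity(left, right) >= threshold
```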

The new diarization model – implemented in academic collaboration with IRIT for the latest version of the feature – can recognize speakers accurately despite overlaps resulting from simultaneous speech, assigning dispersed speech fragments to unique individuals without repetitions. Best part – it’s included as part of our core STT offer!

Inference speed 

When it comes to transcription, speed and accuracy pull in opposite directions: improving one tends to compromise the other.

The degree of compromise, however, will depend on your provider. It’s about striking the right balance, which in this market has been hard for consumers to find. 

The cost factor is a result of the choice we make between speed and quality — but more on that later. 

The ASR conundrum: speed, accuracy, price

While testing the alpha product with our early adopters and going through various iterations with them, we learned that, above all else, they expect to get error-free and well-formatted transcripts, with some leeway allowed for speed. 

We believe a good goal to strive for is having a super-accurate transcription of 1 hour in under a minute. Using something that delivers results in 10 seconds but sacrificing quality is not desirable — but neither is waiting 30 minutes for a transcription of 1 hour, as is still the case for the Big Tech providers who, in addition, claim the most generous cut in return.

While largely content with the 1h per 60s milestone – which is among the most impressive performance indicators on the market – we’re committed to further improving latency in the upcoming weeks to bring it closer to the original 10s mark.
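The milestone above translates into a simple real-time factor calculation:

```python
# Speed-up relative to audio duration: how many seconds of audio
# are transcribed per second of processing time.
def speed_up_factor(audio_seconds: float, processing_seconds: float) -> float:
    return audio_seconds / processing_seconds

# 1 hour of audio processed in 60 seconds:
print(speed_up_factor(3600, 60))  # 60.0, i.e. 60x faster than real time
```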

Optimized for global teams

Catering to international companies, Gladia has optimized and fine-tuned a range of AI models to deliver multilingual capabilities. Unlike any other provider on the market, the API already supports 99 languages for transcription, diarization, and translation.

While currently in beta, Gladia's translation relies on deep learning algorithms and NLP techniques to translate transcribed speech to and from 99 languages. In comparison, the Whisper API – while being equally multilingual – can only translate to English.

And it’s not just translation that will enable us to serve a truly global client base. In fact, all the features included in our Audio Intelligence API are designed to be multilingual – unlike any other commercial STT provider.

Pricing built for growth

Finding the right spot in the 'Bermuda triangle' of price, accuracy, and speed is not easy. Improving accuracy with the help of more sophisticated models and larger datasets adds a significant strain on storage and GPU capacity, which inevitably drives up costs.

Thanks to a proprietary approach to model optimization – enabling us to fit a rather sophisticated, hybrid software architecture on the same amount of hardware – we have resolved the conundrum.

Faithful to the guiding conviction that transcription should be a commodity, we’ve implemented highly accessible pricing – among the most affordable on the market – with a free tier to give you a sweet taste of our product and power smaller-scale projects.

Top quality transcription at a price that’s best for you.
When defining our commercial offer, it was important for us to achieve fair and transparent pricing. While the vast majority of STT vendors charge per 15-second increment – resulting in roundups for those who consume less, especially with chatbots – we’re committed to charging precisely per second. As our user, you will not be confronted with any hidden costs.
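To make the rounding difference concrete, here is a small sketch comparing per-second billing at our $0.00017/sec rate against a hypothetical scheme that rounds up to 15-second blocks at the same nominal rate:

```python
# Compare per-second billing with hypothetical 15-second roundup billing.
import math

PRICE_PER_SECOND = 0.00017  # Gladia's stated rate

def per_second_cost(duration_s: float) -> float:
    """Bill exactly for the seconds consumed."""
    return duration_s * PRICE_PER_SECOND

def per_15s_cost(duration_s: float) -> float:
    """Hypothetical scheme: round up to 15-second blocks at the same nominal rate."""
    return math.ceil(duration_s / 15) * 15 * PRICE_PER_SECOND

# A short 4-second chatbot utterance:
short = per_second_cost(4)   # ~$0.00068
rounded = per_15s_cost(4)    # ~$0.00255, nearly 4x more for the same audio
```

For long recordings the two schemes converge, but for high-volume, short-utterance workloads the roundups compound quickly.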

On our pricing page, you can also access a dedicated calculator per volume of audio, pick an appropriate plan for your needs and scale the volume if needed with our pay-as-you-go billing system.

100% security

As an EU-based company, we take user privacy very seriously. Our data hosting is fully compliant with GDPR and North American regulations (including CCPA), and our portfolio of Audio Intelligence features will include the latest security add-ons like PII redaction to guarantee full compliance with internal protocols and regulations. 

We also support cloud, on-premise, and air-gap hosting to accommodate clients with high-security needs.

In short, you will be in full control of your data and have all the tools at hand to protect your users' anonymity too.

Built for all developers

Making cutting-edge AI accessible to any developer, whatever their level of AI expertise, has been a core part of our DNA since the beginning. 

Gladia API is compatible with all tech stacks and doesn’t require setup costs. We guarantee lower AI infrastructure costs for your company, combined with regular updates, to ensure high-level performance in the long run – even as you scale. 

A brand new version of our app provides a dedicated playground for you to test firsthand what the Gladia API is capable of. You can try it now, free of charge.

Gladia's new web app preview

Your Journey to Knowledge AI begins with Audio

Having delivered on the promise to build a production-ready Speech-to-Text API, we’re now ready for what’s coming next: the Audio Intelligence layer, arriving in the following months.

We’re committed to further perfecting the quality of our performance and ensuring multilingual support for all the upcoming Audio Intelligence features. 

In parallel, we’re working on a proprietary LLM, making it robust and adaptive to a range of use cases with a custom approach to data anonymization and fine-tuning.

And let us not forget that for Gladia, audio is but the first step. Because while the voice is the primary way we interact with the world, most of the valuable audio data in companies is either lost or unused. We wanted to help any company tap into the goldmine. And then move on to the next phase.

We believe the future of AI is multimodal and are laying a strong foundation for delivering next-gen knowledge infrastructure platforms to companies, connecting all their internal data and making it easily discoverable in real time.

Our new branding is reflective of this long-term ambition – and it’s rather magnificent, don’t you think?

Now go ahead and give it a try below! As always, you can drop your feature requests and share your experience on our Discord.

Contact us

