Speech-to-Text Benchmark Results — Word Error Rate (WER %, lower is better)

Methodology: each audio file was sent to every provider's production API with default settings. Transcriptions were normalized with the OpenAI Whisper text normalizer before WER computation. The benchmark covers 74 hours of audio across 7 datasets. Full methodology: https://github.com/gladiaio/normalization
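
To make the normalize-then-score step concrete, here is a minimal Python sketch. It does not use the OpenAI Whisper text normalizer: `simple_normalize` is a much cruder stand-in (lowercasing and punctuation stripping) written only for illustration, and all function names are hypothetical rather than taken from the published framework.

```python
import re
import string


def simple_normalize(text: str) -> str:
    """Stand-in normalizer (NOT the Whisper one): lowercase, drop ASCII punctuation, collapse spaces."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over words: substitutions, deletions and insertions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer_percent(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance divided by the reference word count."""
    ref = simple_normalize(reference).split()
    hyp = simple_normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference is empty after normalization")
    return 100.0 * word_edit_distance(ref, hyp) / len(ref)
```

Normalizing both sides first means that differences in casing or punctuation alone, such as "Hello, world!" versus "hello world", do not count as word errors.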

Columns: WER = Word Error Rate (%, lower is better) · Perfect = number of files with 0% WER · High WER = files with WER above 50% · RTFx = Real-Time Factor multiplier (higher = faster than real-time)
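
The per-dataset columns can be derived from per-file results roughly as in the sketch below. The `FileResult` type and its field names are hypothetical, and the RTFx formula (total audio duration divided by total processing time) is an assumption about how the multiplier is computed; the page only states that higher means faster than real time.

```python
from dataclasses import dataclass


@dataclass
class FileResult:
    wer_percent: float         # per-file WER, in percent
    audio_seconds: float       # duration of the audio file
    processing_seconds: float  # wall-clock time the provider needed (assumed measurement)


def summarize(results: list[FileResult]) -> dict:
    """Aggregate per-file results into the Perfect, High WER and RTFx columns."""
    perfect = sum(1 for r in results if r.wer_percent == 0.0)
    high_wer = sum(1 for r in results if r.wer_percent > 50.0)
    total_audio = sum(r.audio_seconds for r in results)
    total_processing = sum(r.processing_seconds for r in results)
    # Assumed definition: seconds of audio processed per second of wall-clock time.
    rtfx = total_audio / total_processing
    return {"perfect": perfect, "high_wer": high_wer, "rtfx": rtfx}
```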

Switchboard — Conversational Speech

Provider                 WER (%)  Perfect transcriptions / High WER files / RTFx (digits fused in source)
Gladia Solaria           35.8     2730
Mistral STT              50.1     2566
AssemblyAI Universal v3  56.0     2841
Speechmatics             56.0     3170
ElevenLabs Scribe v2     62.5     3160
Soniox v4                62.9     2790
AssemblyAI Universal v2  63.1     2081
Deepgram v3              65.2     2671

Common Voice 24 — Clean Audio (Multilingual)

Provider                 WER (%)  Perfect transcriptions / High WER files / RTFx (digits fused in source)
Speechmatics             3.8      106722
AssemblyAI Universal v3  3.9      104110
ElevenLabs Scribe v2     3.9      105936
Mistral STT              5.1      98426
AssemblyAI Universal v2  5.2      96821
Gladia Solaria           6.7      90421
Soniox v4                7.2      90271
Deepgram v3              7.9      80802

VoxPopuli — Formal Discourse (European Parliament)

Provider                 WER (%)  Perfect transcriptions / High WER files / RTFx (digits fused in source)
ElevenLabs Scribe v2     1.7      41805
AssemblyAI Universal v3  2.1      39602
Mistral STT              2.1      39405
Gladia Solaria           2.2      39303
AssemblyAI Universal v2  2.2      37703
Speechmatics             3.0      32606
Deepgram v3              3.2      36307

Earnings22 Full — Financial Calls (Long-form, single file)

Provider                 WER (%)  RTFx
ElevenLabs Scribe v2     9.4      35
Speechmatics             10.0     17
AssemblyAI Universal v3  11.0     71
AssemblyAI Universal v2  11.1     82
Mistral STT              11.6     135
Gladia Solaria           11.8     28
Deepgram v3              14.5     348

Earnings22 Cleaned AA — Financial Calls (Curated by Artificial Analysis)

Provider                 WER (%)  RTFx
Soniox v4                5.7      35
AssemblyAI Universal v2  6.9      59
AssemblyAI Universal v3  7.0      64
Mistral STT              7.5      57
Speechmatics             7.7      24
Gladia Solaria           7.9      39
ElevenLabs Scribe v2     7.9      32
Deepgram v3              12.7     234

Multilingual LibriSpeech — Audiobooks (5 languages, average)

Provider                 WER (%)  Perfect transcriptions / High WER files / RTFx (digits fused in source)
ElevenLabs Scribe v2     3.7      565315
AssemblyAI Universal v3  4.7      50831
Soniox v4                5.6      37836
Gladia Solaria           5.8      36733
AssemblyAI Universal v2  6.2      36923
Deepgram v3              7.5      27047

Multilingual LibriSpeech — WER by Language (%)

Provider                 German (DE)  Spanish (ES)  French (FR)  Italian (IT)  Portuguese (PT)
Gladia Solaria           5.0          4.0           4.8          9.9           5.3
AssemblyAI Universal v3  3.5          3.2           2.6          9.7           4.4
ElevenLabs Scribe v2     3.1          3.2           2.9          6.1           3.0
Soniox v4                5.4          4.4           5.0          8.8           4.3
AssemblyAI Universal v2  3.4          4.0           5.8          11.9          5.9
Deepgram v3              6.9          4.6           6.2          8.8           11.3
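
As a quick consistency check, the per-language figures can be averaged and compared with the "5 languages, average" WER reported for Multilingual LibriSpeech. The sketch below assumes that average is a plain unweighted mean of the five languages, which the page does not state explicitly, and reads each row as five one-decimal values.

```python
# Per-language WER (%) in the order DE, ES, FR, IT, PT, copied from the table above.
per_language = {
    "Gladia Solaria": [5.0, 4.0, 4.8, 9.9, 5.3],
    "AssemblyAI Universal v3": [3.5, 3.2, 2.6, 9.7, 4.4],
    "ElevenLabs Scribe v2": [3.1, 3.2, 2.9, 6.1, 3.0],
    "Soniox v4": [5.4, 4.4, 5.0, 8.8, 4.3],
    "AssemblyAI Universal v2": [3.4, 4.0, 5.8, 11.9, 5.9],
    "Deepgram v3": [6.9, 4.6, 6.2, 8.8, 11.3],
}

# Average WER (%) reported in the Multilingual LibriSpeech summary table.
reported = {
    "Gladia Solaria": 5.8,
    "AssemblyAI Universal v3": 4.7,
    "ElevenLabs Scribe v2": 3.7,
    "Soniox v4": 5.6,
    "AssemblyAI Universal v2": 6.2,
    "Deepgram v3": 7.5,
}

# Assumption: the summary figure is an unweighted mean over the five languages.
computed = {name: sum(values) / len(values) for name, values in per_language.items()}

for name, value in computed.items():
    print(f"{name}: computed {value:.2f}, reported {reported[name]}")
```

Under that assumption every provider lands within 0.1 of the reported average. The largest gap is Deepgram v3 (about 7.56 computed against 7.5 reported), which may simply reflect rounding of the per-language inputs.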

Pipecat STT Benchmark — Real-Time Streaming

Provider                 WER (%)  Perfect transcriptions / High WER files / RTFx (digits fused in source)
AssemblyAI Universal v3  2.0      53102
ElevenLabs Scribe v2     2.2      51208
AssemblyAI Universal v2  2.5      49401
Mistral STT              2.6      48505
Gladia Solaria           2.7      48204
Speechmatics             2.7      47600
Soniox v4                2.9      48002
Deepgram v3              3.1      44908

Speaker Diarization Benchmark — DIHARD III — Diarization Error Rate (DER %, lower is better)

Provider                       Broadcast  Meeting  Web Video  Socio Field  Court  Clinical  Restaurant  Socio Lab  CTS   Maptask  Simple Avg  Weighted Avg
Gladia (solaria-1)             9.4        29.9     44.4       12.3         3.9    13.3      41.3        5.5        7.7   4.5      17.2        16.6
NVIDIA NeMo Sortformer         10.3       33.0     43.5       13.0         24.1   14.4      50.9        8.6        14.1  8.2      22.0        20.4
pyannoteAI Community-1         10.5       35.8     48.7       17.9         11.6   23.8      49.9        13.9       12.3  10.2     23.5        23.0
Speechmatics                   17.2       55.6     55.6       28.9         15.0   24.9      58.4        18.6       20.1  23.4     31.8        30.1
AWS Transcribe                 16.4       51.4     60.3       25.2         16.7   27.3      63.1        20.2       31.2  22.9     33.5        33.8
Soniox STT-async-preview-v1    24.8       58.3     57.5       30.1         39.3   35.1      67.4        28.0       29.2  27.6     39.7        37.8
ElevenLabs Scribe-v1           25.6       50.5     63.4       29.7         23.1   47.7      57.4        30.3       22.9  45.2     39.6        39.5
AssemblyAI Universal           30.9       46.4     68.4       33.1         24.5   51.4      59.4        33.1       33.1  42.1     42.2        43.9
Deepgram v3                    27.0       59.7     83.0       35.5         25.6   44.8      75.2        32.2       35.5  45.9     46.4        46.9

OpenAI GPT-4o Transcribe: 26.4, 57.8, 64.1, 28.8, 30.0, 40.8, 59.7, 26.5, 34.8, 41.0, 42.8 (the source lists only 11 values for the 12 columns, so the column alignment of this row is unclear; in the source it appears between ElevenLabs Scribe-v1 and AssemblyAI Universal)
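
The two summary columns of the diarization table can be related to the per-domain scores as follows. This sketch assumes "Simple Avg" is the unweighted mean over the ten DIHARD III domains, which matches the Gladia row when it is read as twelve one-decimal values. The weights behind "Weighted Avg" are not given on the page, so `weighted_average` takes them as a caller-supplied, hypothetical input (for example hours of audio per domain).

```python
DOMAINS = ["Broadcast", "Meeting", "Web Video", "Socio Field", "Court",
           "Clinical", "Restaurant", "Socio Lab", "CTS", "Maptask"]

# Per-domain DER (%) for Gladia (solaria-1), copied from the table above.
gladia_der = [9.4, 29.9, 44.4, 12.3, 3.9, 13.3, 41.3, 5.5, 7.7, 4.5]


def simple_average(values: list[float]) -> float:
    """Unweighted mean: every domain counts equally."""
    return sum(values) / len(values)


def weighted_average(values: list[float], weights: list[float]) -> float:
    """Weighted mean; the weights are an assumption supplied by the caller."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Rounds to 17.2, the Simple Avg reported for Gladia (solaria-1).
print(round(simple_average(gladia_der), 1))
```

With equal weights the two functions agree. A weighted average below the simple one, as in the Gladia row (16.6 against 17.2), would be consistent with the lower-DER domains carrying more weight, although the page does not say how the weights were chosen.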

Open benchmark for speech-to-text

We evaluated Gladia Solaria against 8 leading providers across 7 datasets and 74 hours of audio. The full methodology is open-sourced so results can be independently reproduced.

OPEN METHODOLOGY

How we benchmark

7 evaluation datasets · 74+ hours of audio · 8 providers compared

Each audio file was sent to every provider's production API using default settings. No custom model tuning or prompt engineering was applied. All providers were tested on identical audio files.
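
The protocol described above amounts to a simple loop: every provider receives the same audio bytes, with nothing tuned per provider. The sketch below illustrates that shape only. `run_benchmark` and the provider callables are hypothetical; real provider SDK calls are deliberately not shown, and each callable is assumed to wrap one provider's production API with default settings.

```python
import time
from typing import Callable

# Hypothetical signature: takes raw audio bytes, returns the transcript text.
Transcriber = Callable[[bytes], str]


def run_benchmark(providers: dict[str, Transcriber],
                  audio_files: dict[str, bytes]) -> list[dict]:
    """Send every audio file to every provider and record transcript and elapsed time."""
    rows = []
    for file_id, audio in audio_files.items():
        for name, transcribe in providers.items():
            start = time.perf_counter()
            transcript = transcribe(audio)  # identical input for all providers
            elapsed = time.perf_counter() - start
            rows.append({
                "file": file_id,
                "provider": name,
                "transcript": transcript,
                "seconds": elapsed,
            })
    return rows
```

Keeping the raw transcripts and timings in one flat list lets normalization, WER scoring and speed metrics run afterwards without calling the APIs again.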

Transcription outputs were normalized using the OpenAI Whisper text normalizer before WER computation. Diarization Error Rate (DER) is measured on the DIHARD III challenge datasets using standard protocols.
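
For reference, DER is conventionally the sum of missed speech, false-alarm speech and speaker-confusion time, divided by the total reference speech time. The sketch below encodes only that textbook definition. It does not reproduce the scoring settings used for these results (collar, overlap handling), which are not specified here, and the function name is hypothetical.

```python
def der_percent(missed_seconds: float,
                false_alarm_seconds: float,
                confusion_seconds: float,
                reference_speech_seconds: float) -> float:
    """Diarization Error Rate in percent, from error durations in seconds."""
    if reference_speech_seconds <= 0:
        raise ValueError("reference speech duration must be positive")
    errors = missed_seconds + false_alarm_seconds + confusion_seconds
    return 100.0 * errors / reference_speech_seconds
```

Because false alarms are counted against the reference duration, DER can exceed 100% on recordings where a system labels much more speech than the reference contains.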

The full benchmarking framework is open-sourced to enable transparent, reproducible evaluation of speech recognition systems.

Transparent benchmarks, open source

Full methodology and evaluation framework available. Reproduce every result independently.