Name: Gladia Speech-to-Text Benchmark Results
Creator: Gladia
License: https://github.com/gladiaio/normalization

Speech-to-Text Benchmark Results — Word Error Rate (WER %, lower is better)

Methodology: Benchmarks use the same framework as the Solaria-1 campaign. Solaria-3 and Solaria-1 are compared against 9 STT providers. Each provider is tested on identical audio files through its production API with default settings. The real customer audio is a Gladia internal production dataset annotated by humans. Soniox and Pipecat STT Benchmark are excluded on some datasets pending data availability. Transcriptions are normalized with the OpenAI Whisper text normalizer before WER is computed. Diarization Error Rate (DER) is measured on DIHARD III using standard protocols. The evaluation covers 74+ hours of audio across 8 evaluation datasets. Full open benchmark methodology: https://github.com/gladiaio/normalization
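To make the normalize-then-score step concrete, here is a minimal Python sketch of word error rate computed as word-level edit distance divided by reference length. It is an illustration only, not the benchmark's actual code: the `normalize` function is a simplified, hypothetical stand-in for the OpenAI Whisper text normalizer, and the function names are assumptions.

```python
import re


def normalize(text: str) -> list[str]:
    """Simplified stand-in for a text normalizer (assumption, not the Whisper one):
    lowercase, replace punctuation with spaces, split on whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()


def word_edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over words (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """WER in percent for one reference/hypothesis pair."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    return 100.0 * word_edit_distance(ref, hyp) / len(ref)


print(wer("Hello, world!", "hello world"))  # 0.0
print(wer("a b c d", "a x c"))              # 50.0
```

Normalizing first means differences in casing and punctuation do not count as errors, which is why the first pair scores 0%.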

Columns: WER = Word Error Rate (%, lower is better) · Perfect transcriptions = number of files with 0% WER · High WER files = number of files with WER above 50% · RTFx = inverse real-time factor, audio duration divided by processing time (higher is faster; above 1 is faster than real time)
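The per-dataset columns can be derived from per-file results along these lines. This is a sketch under stated assumptions: the `FileResult` type and `summarize` function are hypothetical, and RTFx is taken here as total audio duration divided by total processing time, since the source does not say how it aggregates across files.

```python
from dataclasses import dataclass


@dataclass
class FileResult:
    wer: float                 # per-file WER in percent
    audio_seconds: float       # duration of the audio file
    processing_seconds: float  # wall-clock time the provider took


def summarize(results: list[FileResult]) -> dict[str, float]:
    """Count perfect and high-WER files and compute an aggregate RTFx."""
    perfect = sum(1 for r in results if r.wer == 0.0)
    high_wer = sum(1 for r in results if r.wer > 50.0)  # strictly above 50%
    total_audio = sum(r.audio_seconds for r in results)
    total_processing = sum(r.processing_seconds for r in results)
    # Assumption: RTFx aggregated as total audio time over total processing time.
    rtfx = total_audio / total_processing
    return {"perfect": perfect, "high_wer": high_wer, "rtfx": rtfx}


# Made-up values for illustration only, not benchmark data.
example = [
    FileResult(wer=0.0, audio_seconds=10.0, processing_seconds=2.0),
    FileResult(wer=60.0, audio_seconds=20.0, processing_seconds=4.0),
    FileResult(wer=50.0, audio_seconds=30.0, processing_seconds=4.0),
]
print(summarize(example))  # {'perfect': 1, 'high_wer': 1, 'rtfx': 6.0}
```

A file at exactly 50% WER is not counted as high WER here, following the "above 50%" wording.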

Real Customer Audio — English (Gladia Internal Dataset)

Provider	WER (%)
Solaria-3	9.6
ElevenLabs Scribe v2	9.9
AssemblyAI	10.0
Deepgram Nova-3	10.7
Mistral Voxtral	12.2
Solaria-1	12.9

Switchboard — Conversational Speech

Provider	WER (%)	Perfect transcriptions	High WER files	RTFx
Solaria-3	33.9	27	3	0
Solaria-1	37.3	27	3	0
AssemblyAI	42.3	28	4	1
Speechmatics	46.0	31	7	0
Mistral Voxtral	48.1	25	6	6
Deepgram Nova-3	49.8	26	7	1
ElevenLabs Scribe v2	55.2	31	6	0

Common Voice 24 — Clean Audio (Multilingual)

Provider	WER (%)	Perfect transcriptions	High WER files	RTFx
Speechmatics	3.8	1067	2	2
AssemblyAI	3.9	1041	1	0
ElevenLabs Scribe v2	3.9	1059	3	6
Mistral Voxtral	5.1	984	2	6
AssemblyAI Universal v2	5.2	968	2	1
Solaria-3	6.9	—	—	—
Soniox v4	7.2	902	7	1
Deepgram Nova-3	7.9	808	0	2
Solaria-1	8.2	904	2	1

VoxPopuli — Formal Discourse (European Parliament)

Provider	WER (%)	Perfect transcriptions	High WER files	RTFx
ElevenLabs Scribe v2	1.7	418	0	5
AssemblyAI	2.1	396	0	2
Mistral Voxtral	2.1	394	0	5
Solaria-1	2.2	393	0	3
AssemblyAI Universal v2	2.2	377	0	3
Solaria-3	2.9	—	—	—
Speechmatics	3.0	326	0	6
Deepgram Nova-3	3.2	363	0	7

Earnings22 Full — Financial Calls (Long-form, single file)

Provider	WER (%)	RTFx
ElevenLabs Scribe v2	9.4	35
Speechmatics	10.0	17
AssemblyAI Universal v3	11.0	71
AssemblyAI Universal v2	11.1	82
Mistral STT	11.6	135
Solaria-1	11.8	28
Deepgram v3	14.5	348

Earnings22 Cleaned AA — Financial Calls (Curated by Artificial Analysis)

Provider	WER (%)	RTFx
Solaria-3	6.4	—
AssemblyAI	6.9	64
ElevenLabs Scribe v2	7.7	32
Speechmatics	7.8	24
Mistral Voxtral	7.9	57
Solaria-1	8.1	39
Deepgram Nova-3	12.0	234

Multilingual LibriSpeech — Audiobooks (5 languages, average)

Provider	WER (%)	Perfect transcriptions	High WER files	RTFx
ElevenLabs Scribe v2	3.7	565	3	15
AssemblyAI	4.7	508	3	1
Soniox v4	5.6	378	3	6
Solaria-1	5.9	367	3	3
AssemblyAI Universal v2	6.2	369	2	3
Deepgram Nova-3	7.5	270	4	7
Solaria-3	8.0	—	—	—

Multilingual LibriSpeech — WER by Language (%)

Provider	German (DE)	Spanish (ES)	French (FR)	Italian (IT)	Portuguese (PT)
Solaria-1	5.0	4.0	4.8	9.9	5.3
AssemblyAI Universal v3	3.5	3.2	2.6	9.7	4.4
ElevenLabs Scribe v2	3.1	3.2	2.9	6.1	3.0
Soniox v4	5.4	4.4	5.0	8.8	4.3
AssemblyAI Universal v2	3.4	4.0	5.8	11.9	5.9
Deepgram v3	6.9	4.6	6.2	8.8	11.3

Pipecat STT Benchmark — Real-Time Streaming

Provider	WER (%)	Perfect transcriptions	RTFx
AssemblyAI Universal v3	2.0	531	2
ElevenLabs Scribe v2	2.2	512	8
AssemblyAI Universal v2	2.5	494	1
Mistral STT	2.6	485	5
Solaria-1	2.7	482	4
Speechmatics	2.7	476	0
Soniox v4	2.9	480	2
Deepgram v3	3.1	449	8

Speaker Diarization Benchmark — DIHARD III — Diarization Error Rate (DER %, lower is better)

Provider	Broadcast	Meeting	Web Video	Socio Field	Court	Clinical	Restaurant	Socio Lab	CTS	Maptask	Simple Avg	Weighted Avg
Solaria-1	9.4	29.9	44.4	12.3	3.9	13.3	41.3	5.5	7.7	4.5	17.2	16.6
NVIDIA NeMo Sortformer	10.3	33.0	43.5	13.0	24.1	14.4	50.9	8.6	14.1	8.2	22.0	20.4
pyannoteAI Community-1	10.5	35.8	48.7	17.9	11.6	23.8	49.9	13.9	12.3	10.2	23.5	23.0
Speechmatics	17.2	55.6	55.6	28.9	15.0	24.9	58.4	18.6	20.1	23.4	31.8	30.1
AWS Transcribe	16.4	51.4	60.3	25.2	16.7	27.3	63.1	20.2	31.2	22.9	33.5	33.8
Soniox STT-async-preview-v1	24.8	58.3	57.5	30.1	39.3	35.1	67.4	28.0	29.2	27.6	39.7	37.8
ElevenLabs Scribe-v1	25.6	50.5	63.4	29.7	23.1	47.7	57.4	30.3	22.9	45.2	39.6	39.5
OpenAI GPT-4o Transcribe	26.4	57.8	64.1	28.8	30.0	40.8	59.7	26.5	—	34.8	41.0	42.8
AssemblyAI Universal	30.9	46.4	68.4	33.1	24.5	51.4	59.4	33.1	33.1	42.1	42.2	43.9
Deepgram v3	27.0	59.7	83.0	35.5	25.6	44.8	75.2	32.2	35.5	45.9	46.4	46.9
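The two average columns can be illustrated with a short Python sketch. The simple average below is an unweighted mean over the domains that have a value; applied to the Solaria-1 and OpenAI GPT-4o Transcribe rows it reproduces the Simple Avg figures in the table, although the source does not confirm this is how they were computed. The weights behind the Weighted Avg column are not given, so `weighted_average` is shown only with hypothetical inputs.

```python
from typing import Optional, Sequence


def simple_average(der_by_domain: Sequence[Optional[float]]) -> float:
    """Unweighted mean over domains, skipping missing values (None)."""
    present = [v for v in der_by_domain if v is not None]
    return sum(present) / len(present)


def weighted_average(der_by_domain: Sequence[Optional[float]],
                     weights: Sequence[float]) -> float:
    """Weighted mean over domains, skipping missing values.
    The weights (for example, audio duration per domain) are an assumption."""
    pairs = [(v, w) for v, w in zip(der_by_domain, weights) if v is not None]
    total_weight = sum(w for _, w in pairs)
    return sum(v * w for v, w in pairs) / total_weight


# Domain order as in the table: Broadcast, Meeting, Web Video, Socio Field,
# Court, Clinical, Restaurant, Socio Lab, CTS, Maptask.
solaria_1 = [9.4, 29.9, 44.4, 12.3, 3.9, 13.3, 41.3, 5.5, 7.7, 4.5]
gpt_4o_transcribe = [26.4, 57.8, 64.1, 28.8, 30.0, 40.8, 59.7, 26.5, None, 34.8]

print(round(simple_average(solaria_1), 1))          # 17.2
print(round(simple_average(gpt_4o_transcribe), 1))  # 41.0

# Hypothetical weights, for illustration only.
print(weighted_average([10.0, 20.0], [1.0, 3.0]))   # 17.5
```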
