Automatic speech recognition (ASR) is a cornerstone of many business applications in domains ranging from call centers to smart device engineering. At their core, ASR models, also referred to as Speech-to-Text (STT), intelligently recognize human speech and convert it into a written format.
Modern ASR engines leverage a combination of groundbreaking technologies, including Natural Language Processing (NLP), artificial intelligence (AI), machine learning (ML), and Large Language Models (LLMs). While ASR relies on all of these technologies, it differs from each of them in fundamental ways.
Take NLP, for example, which is frequently confused with ASR. NLP aims to understand natural language patterns and, in generative applications, use them to create new content from the ground up, whereas ASR has a laser focus on transcribing the spoken word. It is therefore crucial to understand how speech recognition works so you can build better products and experiences for your customers and teams.
In this article, we’ll explore the inner workings of ASR models and compare legacy approaches with today's state-of-the-art sequence-to-sequence models. If you’re a developer, AI engineer, CTO, or CPO, you’ll discover a wealth of insights into ASR models like Whisper seq2seq and how they’re vital for transcriptions, captioning, and content creation in business environments.
How does speech recognition work?
When incorporating speech recognition technologies into your business and customer workflows, it helps to understand how they work. Armed with these insights, you’ll be better able to select an ASR model that meets your specific requirements. If you’re new to speech recognition, feel free to check out our introductory guide to speech-to-text AI before diving in.
The traditional speech recognition approach
Let’s take a look at how speech recognition has historically worked. Legacy ASR systems, first introduced in the 1970s, convert analog audio signals into digital form and process them with a decoder that assembles words and sentences based on the data sequence and context. This approach remained mainstream for roughly four decades, until the introduction of end-to-end ASR models.
The decoder in a traditional speech recognition system analyzes the input data in conjunction with multiple ML models:
Acoustic Models (AM): The decoder needs acoustic models, such as Hidden Markov Models (HMM) and Gaussian Mixture Models (GMM), that capture natural speech patterns to predict which sounds were actually spoken.
Language Models: Merely estimating the sounds (phonemes) is not enough. You also need a language model that can predict the right sequence of words based on a statistical analysis of the language.
Lexicon Models: A lexicon model maps words to their phonetic representations, which helps the system handle accents and distinguish similar-sounding phrases. These models all work in tandem to create the desired output (written text). An example of a lexicon model would be a Finite-State Transducer (FST) with a pronunciation dictionary mapping the word “SPEECH” to “S P IY CH.” Every word the system can recognize is represented in the FST as a sequence of such subword units.
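To make the lexicon idea concrete, here is a minimal Python sketch of a pronunciation dictionary lookup. The entries and the lookup_pronunciation helper are illustrative assumptions rather than part of any production toolkit; real systems compile much larger dictionaries into FSTs.

```python
# Toy pronunciation lexicon mapping words to ARPAbet-style phoneme sequences.
# The entries and helper below are purely illustrative.
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "recognition": ["R", "EH", "K", "AH", "G", "N", "IH", "SH", "AH", "N"],
}

def lookup_pronunciation(word: str) -> list[str]:
    """Return the phoneme sequence for a word, or raise if it is out of vocabulary."""
    try:
        return LEXICON[word.lower()]
    except KeyError:
        # Out-of-vocabulary words are a classic weakness of lexicon-based systems.
        raise KeyError(f"'{word}' is not in the pronunciation lexicon")

print(lookup_pronunciation("SPEECH"))  # ['S', 'P', 'IY', 'CH']
```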
As you might have guessed, this approach necessitates training multiple models independently of each other, which is both time-consuming and expensive. One of the most significant drawbacks, however, is the reliance on the lexicon model.
The success of traditional speech recognition models depends to a great extent on how well the lexicon models are crafted. Experienced phoneticians collaborate to create a custom set for the language at hand, and the process must be repeated for every single language the model aims to support. This makes implementation challenging, especially in dynamic business environments expanding into new markets.
Modern ASR models with end-to-end (E2E) deep learning
Modern ASR models take a disruptive approach to speech processing with end-to-end deep learning. One of the early approaches to building an end-to-end system was introduced in 2014 by researchers Alex Graves from Google DeepMind and Navdeep Jaitly from the University of Toronto.
Essentially, a single complex neural network in a modern ASR model replaces the multi-stage pipeline of legacy systems, which minimizes latency and drastically improves performance and accuracy.
This architecture also does away with independent language, acoustic, and lexicon models: the resulting modern ASR system functions as a single neural network (rather than multiple separate models, as in legacy systems), reducing development time and costs. In addition, these systems achieve unparalleled accuracy levels and support multiple languages owing to their advanced neural architecture.
Modern ASR models are trained on large datasets and often rely on self-supervised deep learning, since it would be impractical for human annotators to manually label such voluminous data. Engineers use large amounts of unlabeled data to build a foundation model, which is then fine-tuned to achieve the desired word error rate (WER). Models can also be benchmarked on this metric to compare performance.
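As a rough illustration of how WER is computed, here is a minimal sketch that counts word-level substitutions, deletions, and insertions with a standard edit-distance dynamic program. The function name and example sentences are our own; in practice you might reach for an established evaluation library instead.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein edit-distance dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ≈ 0.17
```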
A great example of a modern ASR model is OpenAI’s Whisper seq2seq model, trained on a huge dataset of over 680,000 hours of multilingual audio. This makes it more robust than models trained on smaller, more specialized datasets. The seq2seq model is essentially an ML architecture optimized for tasks that involve sequential data (like speech and text). It comprises two core components that take a sequence as input and generate a related sequence as output (a minimal usage sketch follows the breakdown below).
Encoder: The encoder uses neural networks to process the input sequence and create a fixed-size vector representation. The encoder will also capture context from the input sequence and pass it to the decoder.
Decoder: A ‘decoder’ module takes the encoder output vector and creates an output sequence. The decoder will predict the next tokens of the output sequence based on the received context and its own previous predictions.
Transformers: The Whisper seq2seq model uses an end-to-end transformer architecture to understand context and meaning from the input audio. The model first splits the input audio into small chunks before passing them to the encoder. The decoder predicts the text caption.
Recurrent Neural Networks (RNN): Sometimes both the encoder and the decoder use Recurrent Neural Networks (RNN), a specific type of neural network well-suited for sequence prediction. RNN cells can remember information about previously seen elements of the sequence through their internal memory and use it to determine current output. Unlike transformers, RNNs are sequential models. LSTM (Long Short-Term Memory) models, for example, use an RNN architecture.
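To see the seq2seq approach in action, here is a minimal transcription sketch using the open-source openai-whisper Python package. The model size and the audio file name are assumptions for illustration.

```python
# pip install -U openai-whisper   (ffmpeg must also be available on the system path)
import whisper

# Load a pretrained encoder-decoder checkpoint; "base" is a small multilingual model.
model = whisper.load_model("base")

# The file name is a placeholder for your own recording.
result = model.transcribe("meeting_recording.mp3")

print(result["text"])      # transcription generated by the decoder
print(result["language"])  # language detected by the model
```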
Legacy ASR models vs seq2seq architecture
We’ve summarized the key differences between the traditional and modern ASR approaches below for easy reference.
| Characteristics | Legacy ASR models | Seq2seq models |
| --- | --- | --- |
| Internal structure | Modular pipeline with acoustic models, lexicon models, and language models | End-to-end neural networks with an encoder-decoder architecture that directly convert speech to text |
| Accuracy | Limited accuracy; cannot reach human accuracy levels | High accuracy; can reach human-level accuracy and beyond |
| Versatility | Limited adaptability to diverse inputs (some accents can throw off transcriptions) | Highly adaptable to different accents and languages |
| Training data type | Needs labeled phonetic data to function properly | Works with unlabeled or lightly labeled data |
| Training technique | Each model needs to be independently trained | The entire model is trained in one go |
| Suitability | Best suited for simple speech-to-text functions | Best suited for complex and real-time speech-to-text applications |
| Speed | Generally slower due to multi-stage component interactions | Typically faster due to parallel processing |
| Error detection and correction | Limited error correction capabilities | Complex error correction and control mechanisms |
| Scalability | Limited scalability | Highly scalable |
| Language support | Support a limited number of languages | Can support a large number of languages |
| Context awareness | Limited context awareness; probabilistic language models analyze context to predict the next word with limited use of machine learning | Highly context-aware; use deep learning for comprehensive context analysis |
Which ASR models do we use today?
While legacy models suffer from drawbacks, they aren’t in any way rendered obsolete. In fact, modern ASR models build upon several fundamentals of traditional speech recognition, such as acoustic and language modeling. Furthermore, legacy systems are still deployed for some specific tasks.
Generally speaking, though, seq2seq models like Whisper excel at tasks that require natural language understanding, including speech recognition, translation, and caption generation. This is, in part, owing to their architecture, which gives them an advantage when dealing with sequences of varying lengths. That said, modern models need very large datasets and significant compute resources to train and run. They can also struggle with long-range dependencies and rare words.
How to select the right ASR model?
We know how challenging it can be to select an optimal ASR model for your specific needs. We’ve put together a few decisive factors you should consider to streamline the selection process.
Word error rate: The ideal for any voice recognition application is a zero error rate. In practice, however, every system makes mistakes, so make sure you factor in the precision and accuracy you need when selecting an ASR model. If your application needs uncompromising performance, choose a modern ASR system like Whisper seq2seq.
End goal: Consider the requirements of your end users when choosing a model. How will they use the product or service?
Operating environment: ASR performance depends to a great extent on background noise. If your end users will operate the product or application in noisy environments, you’ll need advanced noise suppression support in your model.
Audio properties: Consider the audio sample rate, bitrate, file format, channels, and duration when selecting a model (see the short snippet after this list for a quick way to inspect these).
Input audio type: Factor in how varied your input audio will be and the languages and dialects the model will need to support.
Performance: Every model performs differently, so you’ll need to evaluate it based on your specific benchmarks. If you need real-time speech-to-text conversions (such as in smart devices and wearables), choose a model with the lowest latency possible.
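If you’re unsure what your input audio looks like, Python’s built-in wave module offers a quick way to check its basic properties. This is a minimal sketch, assuming an uncompressed WAV file named input.wav; compressed formats would need a different library.

```python
import wave

# The file name is a placeholder; the wave module only reads uncompressed WAV files.
with wave.open("input.wav", "rb") as audio:
    channels = audio.getnchannels()
    sample_rate = audio.getframerate()
    duration = audio.getnframes() / sample_rate

print(f"channels: {channels}, sample rate: {sample_rate} Hz, duration: {duration:.1f} s")
```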
Conclusion
Automatic speech recognition plays a pivotal role in numerous business applications and knowing which kind of ASR system to use for a niche use case can greatly improve end-user experience.
Traditionally, legacy speech-to-text systems have used a combination of acoustic, lexicon, and language models to predict text. However, these suffer from accuracy constraints and require specialized expertise for phonetic training.
Modern ASR models are increasingly complex and can handle diverse input types, including multiple languages. A simplified end-to-end deep learning model ensures minimal latency, superior accuracy, and high scalability.
When choosing an ASR model for your business, consider its underlying architecture (such as RNN/transformer), input data modalities, and the desired word error rates.