Keeping LLMs accurate: Your guide to reducing hallucinations
Published on Nov 14, 2024
Over the last few years, Large Language Models (LLMs) have become accessible and transformative tools, powering everything from customer support and content generation to complex, industry-specific applications in healthcare, education, and finance.
However, as versatile as LLMs are, they have their limitations. For example, small context windows can limit long-form accuracy and a training bias toward English-language data restricts their effectiveness in multilingual scenarios. But one of the most pressing challenges today is hallucinations.
Let’s explore what hallucinations are, their types and causes, and actionable tips to prevent them.
What are LLM hallucinations?
LLM hallucinations refer to instances where a language model produces responses that sound plausible but are factually incorrect or entirely fictional. These hallucinations typically happen when the model lacks sufficient knowledge or context, causing it to fabricate information to fill perceived gaps. LLMs are designed to generate coherent text, so they will often respond with high-confidence statements, which can make even false outputs seem reliable.
In speech recognition, we encounter hallucinations too. According to recent research, Whisper ASR — a leading model in the field, developed by OpenAI — hallucinates around 1% of the time. This may seem like a small percentage, but in business use cases where 100% information fidelity is critical, it becomes significant.
Indeed, in some cases, a hallucination might not be a big deal. For example, a person casually using a public LLM could ask for a list of fun facts and receive a fictional anecdote. While misleading, it’s a relatively low-stakes error. Similarly, some creative projects may welcome some fictional elements.
However, hallucinations can be problematic when they affect users relying on LLM-powered products for accurate information. Imagine a customer using an LLM-integrated platform to handle complex financial transactions or receive medical guidance. A hallucination in these scenarios, such as incorrect advice on tax compliance or an inaccurate interpretation of medical symptoms, could lead to serious consequences.
The bottom line: addressing hallucinations is critical to advancing LLM applications and ensuring their reliability for end users.
Types of LLM hallucinations
Most of us have encountered a hallucination in our own experience with LLMs. But, as we’ve said, the impact of these hallucinations varies widely, with some being relatively harmless and others posing real risks.
Factual incorrectness: This type of hallucination involves misrepresenting factual information, such as presenting outdated business data or incorrect technical specifications. For instance, an LLM might provide a wrong answer about a software feature in a customer support chat. While minor errors may only inconvenience users, they can also damage trust if the model consistently provides misleading or outdated details in professional settings.
Misinterpretation and contextual errors: Misinterpretations occur when a model fails to understand the nuances of a user’s input or misclassifies information from its knowledge base. In the context of agent assistance or sales enablement tools, a model might misunderstand a customer's query or incorrectly classify a lead, resulting in unhelpful or irrelevant responses. Such errors can impact customer satisfaction, hinder sales, or cause inefficiencies when the model misinterprets crucial context.
Complete fabrications: In this case, the model generates entirely fictional content. For example, a public-facing LLM that invents statistics or legal precedents can mislead users and cause serious real-world harm: spreading misinformation, creating legal exposure, undermining public health, eroding trust in AI systems, and amplifying biases that deepen social divisions.
What causes hallucinations?
There are dozens of variables that can contribute to LLM hallucinations, from prompt phrasing and temperature settings to fine-tuning choices and knowledge cutoffs. But we’re just going to focus on three primary factors that have a particularly strong impact on the likelihood of hallucinations.
Limitations in training data
LLMs can only generate responses based on the data available during their training. Large models are typically trained on vast amounts of data from diverse sources, including the Internet. While this broadens their knowledge base, it also exposes them to a mix of high-quality information and low-quality or misleading data.
Consequently, when new information arises or specialized knowledge is required, models may rely on incomplete data or “guesswork,” leading to hallucinations. Even the most advanced models, without access to real-time data updates, can produce errors when handling current events or niche topics.
We also need to remember that the level of hallucination varies by language, especially for so-called low-resource languages, i.e., those least represented in the data used to train models.
Model architecture constraints
One inherent limitation of LLMs is their constrained context windows, which restrict the amount of information they can process at once. This makes it challenging for models to handle extended conversations or follow complex narratives, leading to errors in later parts of the dialogue. Additionally, without a memory component, LLMs cannot retain information from previous interactions, increasing the likelihood of hallucinations in multi-step queries.
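To make the constraint concrete, here is a minimal sketch of how an application might count tokens and trim older conversation turns to fit a fixed context window. It uses the tiktoken tokenizer; the 4,096-token budget and the plain-string message format are illustrative assumptions, not tied to any particular model.

```python
# Minimal sketch: trimming conversation history to fit a fixed context window.
# Uses the tiktoken tokenizer; the 4096-token budget is an illustrative value.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def trim_history(messages: list[str], max_tokens: int = 4096) -> list[str]:
    """Keep the most recent messages whose combined token count fits the budget."""
    kept, total = [], 0
    for message in reversed(messages):        # walk from newest to oldest
        n_tokens = len(enc.encode(message))
        if total + n_tokens > max_tokens:
            break                             # older turns are silently dropped
        kept.append(message)
        total += n_tokens
    return list(reversed(kept))               # restore chronological order
```

Anything outside the trimmed window is simply invisible to the model, which is one reason long, multi-step conversations drift or hallucinate details from earlier turns.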
Overfitting and bias amplification
Overfitting to specific data subsets or sources can cause LLMs to repeat certain patterns or assumptions. This can result in bias amplification, where hallucinations are not only incorrect but also skewed by social or cultural biases in the training data. For example, a model trained on limited perspectives might generate outputs that reinforce stereotypes or favor particular viewpoints, complicating its use in sensitive applications. It's similar to how some people come to believe conspiracy theories like a flat Earth: fed enough misleading counterexamples to established facts and guided down the wrong path, you can end up convinced of a flawed idea.
Additionally, the engineering behind the inference process matters. Some companies use smaller models to speed up the initial steps before passing the task to a larger model. If the smaller model makes a wrong decision early on, it can mislead the larger model, reinforcing the error.
Flawed or "broken" prompt
A poorly constructed prompt can steer the model in the wrong direction, influencing the inference process and leading to incorrect outputs. One example is DeepMind's demonstration that asking a model to repeat a word like "poem" over and over could make it leak memorized training data. This shows that a carefully crafted prompt can direct the model's chain of thought in a specific, often incorrect, direction, thereby contributing to hallucinations.
How can you detect LLM hallucinations?
One common approach is to use a retrieval-augmented generation (RAG) system, which combines the model's generative capabilities with a retrieval mechanism to cross-check facts. Another method involves employing a "chain of thought" approach, where the model is guided through a series of predefined steps or checkpoints to ensure that it stays on a logical path.
For instance, if you want the model to complete a task, you might first ask it to generate a plan of execution. The model outlines the necessary steps, and these steps serve as checkpoints. By defining these checkpoints in advance, you reduce the likelihood of the model deviating from the intended path. For example, if you ask the model to plan a trip from Brest to Metz, it might outline a route through Brest, Rennes, Le Mans, Paris, and then Metz, avoiding any incorrect detours like going through Marseille. Each step reinforces the original chain of thought, preventing the model from straying and generating hallucinated content. Although this approach doesn't directly detect hallucinations, it helps prevent them by keeping the model focused.
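As a rough illustration, this checkpoint idea can be implemented as a two-pass prompt: first ask the model for a numbered plan, then execute the task while feeding that plan back in. The call_llm function below is a hypothetical placeholder for whatever client you actually use; only the prompting pattern is the point.

```python
# Sketch of the plan-then-execute ("checkpoint") pattern described above.
# call_llm is a hypothetical stand-in for your LLM provider's API call.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with your LLM provider's API call")

def answer_with_checkpoints(task: str) -> str:
    # Pass 1: ask only for a plan; its steps become the checkpoints.
    plan = call_llm(
        f"Task: {task}\n"
        "List the numbered steps needed to complete this task. "
        "Do not perform the task yet."
    )
    # Pass 2: execute while restating the agreed plan, so the model
    # is less likely to drift away from it.
    return call_llm(
        f"Task: {task}\n"
        f"Agreed plan:\n{plan}\n"
        "Now carry out the task, following the plan step by step "
        "and flagging any step you cannot complete."
    )
```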
Another strategy for detecting hallucinations involves using systems that assess the truthfulness of the model’s output during training. There are benchmarks specifically designed to evaluate hallucinations, which involve comparing different candidate responses generated by the model and determining which one is most accurate.
For instance, you could present multiple candidate answers to the model and ask it to identify the most accurate one. By doing this, you can see how often the model produces incorrect answers, allowing you to measure the level of hallucination. Publicly available benchmarks often use a confusion matrix to evaluate the model's performance, counting false positives, true positives, false negatives, and true negatives, which together yield a score indicating the model's tendency to hallucinate.
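For a sense of how such a score might be computed, the sketch below tallies a confusion matrix from labeled judgments (did the model assert a claim, and was the claim true?) and derives a simple hallucination rate. The records and field names are made up for illustration, not taken from any real benchmark.

```python
# Sketch: scoring a model's tendency to hallucinate from labeled evaluations.
# The example data and field names are illustrative only.
from collections import Counter

evaluations = [
    {"model_asserted": True,  "claim_is_true": True},   # true positive
    {"model_asserted": True,  "claim_is_true": False},  # false positive (hallucination)
    {"model_asserted": False, "claim_is_true": False},  # true negative
    {"model_asserted": False, "claim_is_true": True},   # false negative
]

counts = Counter()
for e in evaluations:
    if e["model_asserted"]:
        counts["tp" if e["claim_is_true"] else "fp"] += 1
    else:
        counts["fn" if e["claim_is_true"] else "tn"] += 1

# False positives are the hallucinations: confident assertions that are wrong.
hallucination_rate = counts["fp"] / (counts["tp"] + counts["fp"])
precision = counts["tp"] / (counts["tp"] + counts["fp"])
print(f"hallucination rate: {hallucination_rate:.0%}, precision: {precision:.0%}")
```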
Hugging Face has also developed benchmarks to evaluate models based on criteria like faithfulness to the truth. These benchmarks help rank models according to their ability to avoid hallucinations, providing a clear picture of which models are more reliable. Although this process can be technical and involves a fair amount of statistical analysis, it is essential for understanding and mitigating the occurrence of hallucinations in LLMs.
Tips to prevent hallucinations
While hallucinations remain a challenge, several strategies can help reduce their frequency and impact. Here are some of the best ones, based on our own experience and customer feedback.
1. Select appropriate models and training data
Not all LLMs are created equal, and choosing between open-source and proprietary models is a critical decision. Proprietary models, like OpenAI's GPT-4 or Google's PaLM, are often easier to deploy through APIs and may offer higher performance and support, though they come with licensing fees that can scale quickly with usage. Open-source models like Falcon LLM, StarCoder, and Meta's LLaMA offer flexibility, transparency, and savings on licensing, but require more in-house expertise and infrastructure investment to host and fine-tune effectively.
For certain applications—especially in fields like healthcare or finance—selecting a model trained on relevant, high-quality data can be essential for reducing hallucinations. Smaller, domain-specific models or well-maintained proprietary options may perform better than broad, general-purpose models in these cases, as they can be tailored to provide more reliable outputs for specialized needs.
2. Implement retrieval-augmented generation (RAG)
As mentioned before, RAG is a powerful approach to ensure LLM responses are accurate and grounded in real-time, trusted information. By integrating retrieval mechanisms that cross-reference relevant data as responses are generated, RAG minimizes hallucinations and enhances the reliability of outputs. This technique can be especially useful in dynamic use cases, such as real-time transcription, where dependable information is crucial.
In a practical application, such as in a corporate setting where specific policies are being queried, even if the system has been trained on relevant documents, it’s advisable to continually remind the system of the key instructions or information. One common approach is to break down documents into smaller, manageable chunks that are relevant to the query at hand. By doing this, you avoid overwhelming the model with too much context, which helps to mitigate potential hallucinations.
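A stripped-down version of that chunk-and-retrieve step might look like the sketch below. Real systems typically use embeddings and a vector store; here a naive word-overlap score stands in for the retriever, and build_prompt shows how only the top-ranked chunks are handed to the model along with an instruction to answer from them alone.

```python
# Sketch of retrieval-augmented generation (RAG) with document chunking.
# A naive word-overlap score stands in for a real embedding-based retriever.

def chunk(text: str, max_words: int = 120) -> list[str]:
    """Split a document into small, manageable chunks."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def score(query: str, passage: str) -> int:
    """Crude relevance score: number of query words found in the passage."""
    q = set(query.lower().split())
    return sum(1 for w in passage.lower().split() if w in q)

def build_prompt(query: str, documents: list[str], top_k: int = 3) -> str:
    chunks = [c for doc in documents for c in chunk(doc)]
    best = sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_k]
    context = "\n---\n".join(best)
    # Only the most relevant chunks go into the prompt, and the model is
    # explicitly told to answer from them rather than from memory.
    return (
        "Answer using ONLY the context below. If the answer is not there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```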
4. Use the right prompting techniques
If the prompt is overly complex or lengthy compared to the data, the model may generate hallucinations. For instance, a one-sentence data input paired with a 50-sentence prompt can cause the model to overemphasize the structure of the prompt rather than the content of the data. Achieving a balance where the prompt and data are proportionate ensures that the model gives proper weight to the data.
Additionally, it’s important to avoid repeating patterns within your prompt, as these can mislead the model and contribute to hallucinations. Another practical tip is to pay attention to punctuation. For instance, if a sentence lacks a period at the end, the model might continue generating text based on the assumption that the sentence hasn’t ended.
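These hygiene checks are easy to automate. The sketch below is one way to flag a prompt whose instructions dwarf the data, repeat themselves, or whose data ends without terminal punctuation before it is sent to a model; the thresholds are arbitrary values chosen only to show the idea.

```python
# Sketch: simple pre-flight checks on a prompt before sending it to a model.
# The thresholds below are arbitrary, illustrative values.

def prompt_warnings(instructions: str, data: str) -> list[str]:
    warnings = []
    # 1. Keep instructions and data roughly proportionate.
    if len(instructions.split()) > 5 * max(len(data.split()), 1):
        warnings.append("Instructions are much longer than the data they refer to.")
    # 2. Avoid repeated sentences, which can push the model into looping patterns.
    sentences = [s.strip() for s in instructions.split(".") if s.strip()]
    if len(sentences) != len(set(sentences)):
        warnings.append("The prompt repeats itself.")
    # 3. End the data with terminal punctuation so the model treats it as complete.
    if data and data.rstrip()[-1] not in ".!?":
        warnings.append("Data does not end with terminal punctuation.")
    return warnings
```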
4. Monitor and mitigate with human oversight
Deploying LLMs isn’t a “one-and-done” exercise; it requires ongoing human oversight and specialized expertise to keep outputs reliable. High-impact use cases need regular auditing to check for consistency and accuracy. This often involves roles like data scientists to analyze and fine-tune model performance, prompt engineers to craft optimized prompts, and compliance experts to ensure adherence to industry standards.
User feedback is also invaluable, as it enables teams to dynamically adjust model responses and refine LLM behavior based on real-world usage. This constant monitoring and feedback loop helps LLMs evolve to meet specific reliability standards and deliver dependable results over time.
Wrap-up
Mitigating and reducing hallucinations in LLMs begins with selecting the right model and training it on a diverse and expansive dataset. Models, even smaller ones, perform more reliably when exposed to a broad range of information during training, making them less prone to generating fabricated or inaccurate responses.
Another effective approach is using techniques like RAG, which not only integrates real-time retrieval mechanisms to ground the model's responses in factual data but can also validate the truthfulness of its output.
Finally, LLMs require ongoing human oversight, specialized expertise, and user feedback to keep outputs reliable.
Learn more about Gladia
Gladia provides a speech-to-text and audio intelligence API for building virtual meetings, note-taking apps, call center platforms, and media products, offering transcription, translation, and insights powered by best-in-class ASR, LLMs, and GenAI models. Our latest model, Whisper-Zero, is designed specifically to remove ASR hallucinations, delivering top-tier enterprise performance for your apps.