Optimizing LLMs for accuracy is hard. You need to know how to start the optimization process, what techniques to use, and finally, what level of accuracy is good enough for your specific needs and use case.
In this blog post, we’ll cover some key techniques and best practices you can apply to improve the accuracy of an LLM, namely: prompt engineering, fine-tuning, and retrieval-augmented generation (RAG).
Let’s start with prompt engineering.
Prompt engineering
Prompts are the input that guides an LLM's output and task execution. Different types of prompts—zero-shot, few-shot, and chain-of-thought (CoT)—allow for tailored model behavior and influence output quality based on task complexity.
Zero-shot prompting
Zero-shot prompts rely on the model's pre-trained knowledge without providing specific examples. This approach works well for straightforward tasks, for example:
Prompt: Complete the sentence by filling in the [mask]:
Sentence: The weather today is [mask].
Output: Sunny
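As a concrete sketch, here is how that zero-shot prompt could be sent through the OpenAI Python SDK. The model name is an illustrative assumption; any chat-completion endpoint works the same way.

```python
# Minimal zero-shot call: no examples, just the task description.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {
            "role": "user",
            "content": (
                "Complete the sentence by filling in the [mask]:\n"
                "Sentence: The weather today is [mask]."
            ),
        }
    ],
)
print(response.choices[0].message.content)
```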
Few-shot prompting
Few-shot prompts include one or more examples to guide the model, improving performance for complex tasks. When a single example is used, it's called one-shot prompting, for example:
Prompt:
I love this product! // Positive
The service was terrible! // Negative
What an amazing experience! // Positive
That restaurant was awful! //
Output:
Negative
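The same few-shot pattern can be expressed in code by packing the labeled examples directly into the prompt. Below is a minimal sketch, again assuming the OpenAI Python SDK and an illustrative model name.

```python
# Few-shot sentiment classification: the labeled examples travel inside the prompt itself.
from openai import OpenAI

client = OpenAI()

examples = [
    ("I love this product!", "Positive"),
    ("The service was terrible!", "Negative"),
    ("What an amazing experience!", "Positive"),
]
query = "That restaurant was awful!"

# Pair each example with its label, then append the new input with the label left blank.
prompt_lines = [f"{text} // {label}" for text, label in examples]
prompt_lines.append(f"{query} //")
prompt = "\n".join(prompt_lines)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # expected: "Negative"
```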
Chain-of-thought (CoT)
CoT prompts guide the model to break tasks into intermediate reasoning steps, enhancing performance in problem-solving and multi-step calculations. While most effective with few-shot prompts, zero-shot CoT can also be applied to encourage step-by-step reasoning, for example:
Prompt (with CoT):
A farmer has a total of 80 apples. He wants to divide them into baskets such that each basket has an equal number of apples. If he places 12 apples in each basket, he will have 8 apples left over. How many baskets can he fill? Let's think step by step.
Output:
First, we know the farmer has 80 apples in total. If he puts 12 apples in each basket and has 8 apples left over, we can subtract the leftover apples from the total: 80 apples - 8 apples = 72 apples.
Now, we divide the remaining apples by the number of apples per basket:
72 apples / 12 apples per basket = 6 baskets.
So, the farmer can fill 6 baskets completely with 12 apples each.
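Zero-shot CoT is often as simple as appending a step-by-step cue to the question, with no worked examples. Below is a minimal sketch, assuming the OpenAI Python SDK and an illustrative model name.

```python
# Zero-shot chain-of-thought: a step-by-step cue replaces worked examples.
from openai import OpenAI

client = OpenAI()

question = (
    "A farmer has a total of 80 apples. He wants to divide them into baskets such that "
    "each basket has an equal number of apples. If he places 12 apples in each basket, "
    "he will have 8 apples left over. How many baskets can he fill?"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": question + " Let's think step by step."}],
)
print(response.choices[0].message.content)
```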
To wrap up, zero-shot prompting requires no examples, relying on the model's pre-trained knowledge. Few-shot prompting improves understanding with a few examples, while chain-of-thought prompting enhances logical flow by guiding step-by-step reasoning. Choose the technique based on your task, goals, and the model’s capabilities.
Fine-tuning
Fine-tuning lets you take a pre-trained model and tailor it to your specific needs. Instead of building a model from scratch using pre-training, you start with a model already skilled in general language understanding and refine it with task-specific data.
During fine-tuning, the model's architecture remains unchanged, but its internal weights are adjusted to better fit the new dataset or domain. For instance:
- Medical applications: Models like Med-PaLM are fine-tuned with medical data, including research papers and health queries, enabling them to handle specialized tasks in healthcare.
- Programming: Code LLaMA is optimized for coding, offering powerful features like autocompletion, debugging, and multi-language code translation.
- Speech recognition: Fine-tuned models enhance automatic speech recognition (ASR) systems like Whisper, helping them tackle domain-specific terminology and complex language structures in fields like healthcare or low-resource languages.
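To make the mechanics concrete, here is a minimal fine-tuning sketch using the Hugging Face transformers and datasets libraries. The base model (distilgpt2), the two-example toy dataset, and the hyperparameters are illustrative assumptions, not the actual recipe behind models like Med-PaLM or Code LLaMA.

```python
# A minimal fine-tuning sketch: the architecture stays the same, only the weights are updated.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "distilgpt2"  # a small base model, chosen only to keep the example light
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Toy domain-specific corpus; in practice this would be your task- or domain-specific data.
texts = [
    "Q: What is hypertension? A: Persistently elevated blood pressure.",
    "Q: What does ECG stand for? A: Electrocardiogram.",
]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetuned-model",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        report_to="none",
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()          # adjusts the pre-trained weights to the new data
trainer.save_model("finetuned-model")
```

In practice you would train on thousands of examples, hold out a validation set, and often use parameter-efficient methods such as LoRA to keep compute costs down, but the overall flow is the same.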
Fine-tuning bridges the gap between general-purpose models and the specific demands of your use case. It allows you to leverage pre-trained knowledge while tailoring the model for specialized tasks — from healthcare diagnostics to coding or improving speech recognition systems.
By fine-tuning, you ensure the model delivers more accurate, relevant, and context-aware results, aligning closely with your unique objectives.
Retrieval-augmented generation (RAG)
RAG enhances LLM accuracy by integrating real-time retrieval of external data into the prompt. By accessing up-to-date information from sources like customer documentation, web pages, or third-party applications, RAG enables LLMs to deliver highly accurate, context-aware responses.
This approach ensures that your model remains relevant and reliable, no matter how dynamic or specialized your queries are.
Here is how the retrieval process works (a minimal code sketch follows these steps):
- User prompt: The user submits a query for the LLM to answer. RAG converts the query into a vector representation called an embedding, which encodes the semantic properties of the query's text in a form the model can compare against other text.
- Semantic search: RAG then performs a similarity search, comparing the query embedding against the embeddings stored in a vector database of external knowledge. The database holds this knowledge as chunks, each an embedded segment of source data for a particular domain. The search computes similarity metrics to find the chunks closest to the query embedding, and the most relevant chunks are fetched to give the LLM the context associated with the user's query.
- Prompt: The LLM takes the retrieved context and the user's query as input, combined with the configured prompt, which provides the instructions on how to generate a response.
- Post-processing: The LLM processes this input according to the prompt and returns a response.
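The sketch below traces those four steps with a tiny in-memory stand-in for the vector database. It assumes the sentence-transformers library for embeddings and the OpenAI Python SDK for generation; the model names, documents, and prompt template are all illustrative.

```python
# A toy end-to-end RAG flow: embed, search, assemble the prompt, generate.
import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model
client = OpenAI()

# Stand-in for a vector database: each chunk is a segment of external knowledge.
chunks = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Premium support is available 24/7 via chat for enterprise customers.",
    "Shipping to EU countries typically takes 3-5 business days.",
]
chunk_embeddings = embedder.encode(chunks, normalize_embeddings=True)

# 1. User prompt: embed the query.
query = "How long do I have to return an item?"
query_embedding = embedder.encode([query], normalize_embeddings=True)[0]

# 2. Semantic search: cosine similarity (dot product of normalized vectors) picks the closest chunk.
scores = chunk_embeddings @ query_embedding
context = chunks[int(np.argmax(scores))]

# 3. Prompt: combine the retrieved context, the instructions, and the user's query.
prompt = (
    "Answer the question using only the context below.\n"
    f"Context: {context}\n"
    f"Question: {query}"
)

# 4. Post-processing: the LLM generates the final response.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

In a real system the in-memory list would be replaced by a proper vector database, and the corpus would be chunked and indexed ahead of time rather than embedded at query time.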
Sourcing reliable external data through techniques such as web scraping, API integration, and document indexing helps organizations ensure that the retrieved information is both current and accurate.
The difference between prompt engineering, fine-tuning, and RAG
Prompt engineering, RAG, and fine-tuning are all techniques to enhance LLMs’ output and increase its accuracy and relevance. They differ, however, in certain ways. Below is a brief overview of the main differences between the techniques.
| | Prompt engineering | Fine-tuning | RAG |
| --- | --- | --- | --- |
| Adaptation | Prompt engineering is an evolving, creative process that involves experimenting with different prompt structures and examples. | After the fine-tuning phase for a specific task, LLMs become static. | RAG is an evolving system that can learn from additional sources over time. |
| Data training | Focuses on changing how you ask the model a question or give it instructions. | Re-trains the parameters of a model to optimize performance with new data for a specific task. | RAG adds information from external sources related to a specific topic, without changing the model's internal parameters. |
| Versatility | Can be adapted for various use cases, such as text generation and data analysis. You can also tune prompts for specific industries like healthcare and finance. | If a model hasn't been fine-tuned for a domain-specific task, it doesn't have sufficient knowledge to handle related queries. | RAG can augment the LLM with any information source related to any domain without re-training the model on a new dataset. |
| Catastrophic forgetting | Involves crafting specific prompts that guide the model's output. It does not directly involve retraining the model itself, which is where catastrophic forgetting usually occurs. | Fine-tuning an LLM for a new task can lead to forgetting or losing previous knowledge learned during the pre-training phase. | Since RAG does not change the model's internal parameters, LLMs retain their pre-training knowledge. |
| Computational requirements | Typically doesn't require significant computational resources. | Fine-tuning a model requires extensive computational resources and the use of GPUs. | RAG-powered models can be resource-intensive. |
LLM optimization isn’t a linear process
Enhancing LLM performance comes down to choosing the right technique—or combination of techniques—for your specific goals. Different techniques address different issues, and you need to choose the right approach based on your needs.
RAG, fine-tuning, and prompt engineering each offer unique benefits, and they’re not mutually exclusive.
You might begin with RAG for real-time context and later fine-tune the model for a highly specialized task. In some cases, prompt engineering or function calling alone may meet your needs.
The key is to embrace an iterative approach of testing, learning, and refining to achieve the best results.