If there is anything organisations should look at while implementing AI, it is RAG. RAG stands for Retrieval Augmented Generation. In human language: having AI generate answers to questions based on specific data, namely your organisation's data.
In practice, RAG enables your employees and customers to access your organisation's data, and retrieve valuable insights from it, at a speed and in ways previously unimaginable. We all know those expert senior employees everybody goes to for answers to tricky questions. Answers that, without RAG-enabled AI, are hidden in an overload of documents.
So, in short, with RAG, you make your organisation’s data accessible via an LLM, like ChatGPT, and enable users to converse with it.
But that would be too much of a simplification, one that doesn't do justice to what happens behind the scenes to make a RAG model work as it should.
The building blocks of a RAG model
In this blog, I will walk you through the process that a RAG model uses to convert a user's prompt into a valuable answer. A RAG model consists of the following building blocks:
- Query Translation
- Routing
- Query Construction
- Embedding and Indexing
- Retrieval
- Generation
One step that is essential for a RAG model to work is not mentioned in this list: Data Preprocessing. It's the process of converting your organisation's data into a (vector-based) format that the LLM can use to generate answers. For that topic, I created a separate blog:
Query Translation
You might think that a user's prompt is inserted directly into the LLM to have it provide an answer, but that will most likely not give the best result. Just think about what happens when you get a (difficult) question. The first obstacle is the quality of the question. Is it clear? Does it give you directions for finding the answer? Next, there are different strategies you can follow to come up with the best answer. You could take a step back and look at the bigger picture, split the question into smaller questions, or look at it from different angles.
That is precisely what Query Translation does. Query Translation is all about transforming or augmenting the prompt to retrieve, via Semantic or Hybrid search, the correct documents that enable the LLM to provide the best possible answer.
Strategies to transform a user’s prompt into something more usable are, for instance:
- Step-back questions (Step-back prompting)
- Rewriting the question (RAG-Fusion, Multi Query)
- Sub-questions (Least-to-most prompting)
I will zoom in on Multi Query to show how Query Translation helps the RAG model provide more accurate answers. With Multi Query, the system creates multiple reframed versions of the prompt. For every reframed prompt, we retrieve the relevant documents. Doing so increases the chance of retrieving vital documents that the single original prompt might not have surfaced.
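To make that concrete, here is a minimal sketch of the Multi Query idea in Python. The `llm` and `retrieve` functions are placeholders for your actual model call and Vectorstore lookup; the point is that we generate several reframings and take the union of the documents they each retrieve.

```python
def multi_query_retrieve(question: str, llm, retrieve, n_variants: int = 3) -> list[str]:
    """Retrieve documents for several reframed versions of one question.

    `llm(prompt)` and `retrieve(query)` are placeholders for your own
    model call and Vectorstore lookup.
    """
    prompt = (
        f"Rewrite the following question in {n_variants} different ways, "
        f"one per line:\n{question}"
    )
    variants = [question] + llm(prompt).splitlines()

    # Take the union of documents over all variants, deduplicated but order-preserving.
    seen, documents = set(), []
    for query in variants:
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                documents.append(doc)
    return documents
```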
Routing
With the new prompts generated in the Query Translation step, it is time to select the data source from which we retrieve the documents needed to answer them. And just like you would look for the best source based on the question asked, so does a RAG model.
The Routing function evaluates the prompts and then points the RAG model to the right data source to retrieve the documents.
There are two basic techniques for that:
- Logical Routing, where we give the LLM knowledge of the available data sources and let it figure out which data source to use based on the prompts,
- Semantic Routing, where we have predefined embedded prompts and determine the similarity between the user's prompt and the embedded prompts. Whichever embedded prompt is most similar then determines the data source used.
These examples show that Routing doesn't have to point directly to a data source; it can also route the query to a prompt that then handles the retrieval of documents.
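To illustrate Semantic Routing, the sketch below compares the embedding of the user's prompt with the embeddings of a few predefined route prompts and picks the closest one. The route names and descriptions are made up for illustration, and `embed` is a placeholder for your embedding model.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Predefined route prompts, each tied to a data source (hypothetical names).
ROUTES = {
    "hr_policies": "Questions about leave, contracts and HR policies.",
    "engineering": "Questions about architecture, code and technical design.",
}

def semantic_route(user_prompt: str, embed) -> str:
    """Return the name of the data source whose route prompt is most similar."""
    query_vec = embed(user_prompt)
    return max(ROUTES, key=lambda name: cosine(query_vec, embed(ROUTES[name])))
```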
Query Construction
Once our routing function points us to a specific data source, we need to translate the generic, natural-language prompt into the domain-specific language of that data source. You could call it the language translator: translating natural language into the query language that a relational (SQL) database, a GraphDB or a Vectorstore understands.
As you probably guessed, this is a data-source-specific operation, with each source having its own language and containing different fields that enable document retrieval.
As I described in my previous post, saving Metadata can be very helpful when retrieving data from a Vectorstore. Combining that with semantic search in Hybrid Search is one of the best practices.
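Here is a minimal sketch of what Query Construction can produce for a Vectorstore that supports Metadata filters next to semantic search. The field names (`department`, `year`) and the structure below are assumptions for illustration; in a real system, the LLM generates this structured query from the natural-language prompt.

```python
from dataclasses import dataclass, field

@dataclass
class StructuredQuery:
    """Domain-specific query for a (hypothetical) Vectorstore with hybrid search."""
    semantic_query: str                                    # text used for the embedding search
    metadata_filter: dict = field(default_factory=dict)    # exact-match Metadata filters
    top_k: int = 5

# Natural language: "What did the HR department decide about remote work in 2023?"
query = StructuredQuery(
    semantic_query="decision about remote work",
    metadata_filter={"department": "HR", "year": 2023},
)
```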
Embedding and Indexing
Vectorstores are the preferred storage location for unstructured data, enabling easy access by LLMs. For LLMs to find the most relevant documents, different techniques are used, but we can split them up into two categories:
- Statistical methods, which look at word frequency in a document to determine whether it relates to the prompt.
- Embedding, where we split documents into chunks and store each chunk as a vector.
Embedding models have limited context windows, and using a larger context window makes using your LLM more expensive, so it makes sense to split a document into chunks.
The easiest way, but probably not the best, is using a fixed chunk size without looking at the content. This could result in similar content being split across chunks or two subjects ending up in the same chunk.
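For reference, fixed-size chunking is only a few lines of code. The sketch below splits on a character count with some overlap, which is exactly where the "split mid-subject" problem comes from.

```python
def fixed_size_chunks(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into chunks of `size` characters, with `overlap` characters shared."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```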
Embedding based on Metadata
You could use the Metadata to perform the chunking, splitting the document based on chapters, titles, and other relevant Metadata. This ensures that chunks represent the structure of the content much better and keeps similar content together.
While with a fixed chunk size the LLM might retrieve chunks with content that is partially irrelevant to the prompt, using the Metadata to chunk the content enables the LLM to do a focused retrieval and get much more relevant data.
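Here is a minimal sketch of that structure-aware chunking: split a markdown document on its headings and keep the heading as Metadata on each chunk. Real splitters do this more robustly; this version assumes the document uses `#`-style headings.

```python
import re

def chunk_by_heading(markdown: str) -> list[dict]:
    """Split a markdown document on headings; keep the heading as chunk Metadata."""
    chunks, current = [], {"metadata": {"heading": None}, "text": ""}
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):        # a new heading starts a new chunk
            if current["text"].strip():
                chunks.append(current)
            current = {"metadata": {"heading": line.lstrip("# ").strip()}, "text": ""}
        else:
            current["text"] += line + "\n"
    if current["text"].strip():
        chunks.append(current)
    return chunks
```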
Advanced embedding
You can't judge a book by its cover, but you can judge it based on a detailed summary. That's one of the options for embedding: you use an LLM to create summaries of documents, store the summaries in your Vectorstore, and store the original documents in a document store while making sure you connect the two. Then, when a summary in the Vectorstore is selected, you can retrieve the original document for the LLM to use when answering the prompt.
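A sketch of that summary-based setup, with `llm` and `embed` as placeholders and two simple in-memory stores standing in for the Vectorstore and document store: the summary gets embedded and indexed, the full document lives in the document store, and a shared id connects the two.

```python
import uuid

vector_index = []     # list of (vector, summary, doc_id) -- stand-in for a Vectorstore
document_store = {}   # doc_id -> original document      -- stand-in for a document store

def index_document(document: str, llm, embed) -> None:
    """Embed an LLM-generated summary; keep the full document in a separate store."""
    doc_id = str(uuid.uuid4())
    summary = llm(f"Summarise the following document:\n{document}")
    vector_index.append((embed(summary), summary, doc_id))
    document_store[doc_id] = document

def fetch_original(doc_id: str) -> str:
    """Once a summary is selected, hand the full document to the LLM instead."""
    return document_store[doc_id]
```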
Retrieval
With our Vectorstore containing vast amounts of data that could be relevant to answering the prompt, we need to be selective in our retrieval. We do that by specifying the number of relevant documents we want to retrieve. The relevance of a document is based on a nearest-neighbour search around the query, which is itself represented as a vector. Vectors near our query vector are selected until we reach the maximum number.
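In code, that selection comes down to a nearest-neighbour search with a cap on the number of results, often called `k` or `top_k`. The sketch below assumes an in-memory index of (vector, text) pairs and a placeholder `embed` function.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, guarded against zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def retrieve_top_k(query: str, index, embed, k: int = 4) -> list[str]:
    """Return the `k` chunks whose vectors lie closest to the query vector.

    `index` is a list of (vector, text) pairs; `embed` is your embedding model.
    """
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]
```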
Generation
This is where it all comes together: the data we embedded in the Vectorstore, the Query Translation that transformed the initial prompt into prompts that more accurately retrieve the right documents, and our retrieval settings. Now the LLM can generate an answer based on those documents.
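The final step is, in essence, one carefully constructed prompt: the retrieved chunks become the context, and the LLM is instructed to answer only from that context. A minimal sketch, again with `llm` as a placeholder for your model call:

```python
def generate_answer(question: str, documents: list[str], llm) -> str:
    """Answer the question using only the retrieved documents as context."""
    context = "\n\n".join(documents)
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)
```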
Conclusion
With organisations creating more and more content, RAG-enabled AI systems can unlock the value of that data in a previously unimaginable way.
Setting up a RAG model isn’t for the faint-hearted, and if your current system isn’t delivering as promised, perhaps this blog has given you some insight into all the moving parts behind the scenes where things can go wrong.
There are many similarities between how AI and humans tackle answering questions.
If you are eager to gain more in-depth (technical) knowledge on how to set up a RAG model, a good starting point can be:
The free course at DeepLearning.AI:
https://www.deeplearning.ai/short-courses/preprocessing-unstructured-data-for-llm-applications
Or this YouTube video series from LangChain that describes how to set up your RAG model: