RAG models enable organisations to turn a generic LLM into a specialist AI assistant, unlocking the value of enterprise data for employees and customers.
With a RAG-enabled AI system, organisations can finally take real advantage of all the data they accumulate. Instead of just having data sit on file servers or SharePoint sites that nobody uses, RAG enables users to talk to that data and derive valuable insight.
As with every data-driven application, the garbage-in, garbage-out principle also applies to your LLM. That is why the time and effort spent preparing your data so it best supports the LLM in answering prompts more than pays off.
Data preparation
Your data consists of all kinds of documents from different locations and applications: PDFs, Word, PowerPoint and Excel files, and images, as well as content from applications like email, Slack, SharePoint, and many others. All these document types need to be processed, standardised and stored in a central vector database so that your RAG system is able to retrieve them.
When these documents are processed, it is vital to keep as much information as possible. By that, we don’t mean just the words, but also whether they appear in a header, mark the start of a chapter, or are part of a document summary.
Hybrid Search
Just as you would have a method for looking up relevant documentation to answer a question, so can the LLM that searches the vector database. This is called Hybrid Search: it combines semantic search, which looks for content similar to the prompt, with metadata filtering to make retrieval more efficient.
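As an illustration, here is a minimal sketch of such a hybrid query using the open-source Chroma vector database (an assumption for this example; any vector store with metadata filtering works the same way). The collection name, documents and metadata fields are made up:

```python
import chromadb

# In-memory client for illustration; a real system would use a persistent store.
client = chromadb.Client()
collection = client.create_collection(name="enterprise_docs")

# Each chunk is stored with its text plus metadata derived during preprocessing.
collection.add(
    ids=["doc1-chunk1", "doc2-chunk1"],
    documents=[
        "Quarterly revenue grew 12% driven by the EMEA region.",
        "The onboarding guide explains how to request a laptop.",
    ],
    metadatas=[
        {"title": "Q3 Financial Report", "filetype": "pdf"},
        {"title": "Employee Onboarding Guide", "filetype": "docx"},
    ],
)

# Hybrid search: semantic similarity on the prompt, narrowed by a metadata filter.
results = collection.query(
    query_texts=["How did revenue develop last quarter?"],
    where={"filetype": "pdf"},  # metadata filter
    n_results=3,
)
print(results["documents"])
```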
Let’s say you get a question and need to find relevant documents to answer it. You might use a process with the following steps:
- First, scan the titles of the documents.
- Select the documents that cover the right subject.
- If available, read the summaries.
- Then, scan the selected documents to find the parts most relevant to the question.
- Only then do you read the specific content in the documents.
You can only do this if document titles, chapter names, and other metadata are available to give a good indication of how relevant each document is to the prompt you need to answer.
For AI, it is no different. If you want the RAG-powered LLM to be efficient and capable of answering questions about your data, it is vital that you use a process that:
- Reads the documents,
- Derives metadata,
- Standardises different content into one format.
For example, Word documents offer good insight into the structure of the text; that structure is embedded when the document is created. PDFs, however, often don’t provide any clues about the text structure.
In practice, this means you need to normalise different documents and partition them into a common format, using or creating a method specific to each document type.
This data preprocessing will eventually feed the vector database that the LLM uses. A best practice is to normalise the preprocessing output to one file type; JSON would be a good candidate for that.
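As a sketch of what that could look like, one open-source option is the unstructured library, which partitions many file types into a common list of elements that serialise naturally to JSON. The library choice and file names are assumptions for illustration, and the exact calls may differ between versions:

```python
import json
from unstructured.partition.auto import partition

# Partition a document into typed elements (Title, NarrativeText, Table, ...).
# partition() picks a file-type-specific strategy for Word, HTML, PDF, etc.
elements = partition(filename="quarterly_report.pdf")  # placeholder file name

# Normalise every element to the same JSON structure: text, type, and metadata
# such as the page number or the file it came from.
normalised = [element.to_dict() for element in elements]

with open("quarterly_report.json", "w") as f:
    json.dump(normalised, f, indent=2)
```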
Chunking
Chunking is the process of splitting large text documents into smaller chunks that are then saved in the vector database the LLM accesses to generate an answer.
Why chunking:
- Your LLM might have a limited context window,
- Using a larger context window makes using your LLM more expensive.
The easiest approach, though probably not the best, is to use a fixed chunk size without looking at the content. This can result in related content being split across chunks or two subjects ending up in the same chunk.
You could use the metadata to perform the chunking, splitting the document based on chapters, titles, and other relevant metadata. This ensures that chunks represent the structure of the content much better and keep similar content together.
With a fixed chunk size, the LLM might retrieve chunks whose content is partially irrelevant to the prompt; chunking on metadata enables the LLM to do a focused retrieval and get back much more relevant data, as in the sketch below.
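A minimal sketch of such metadata-driven chunking, assuming elements shaped like the JSON output from the preprocessing example above (each with a type and a text field): a new chunk starts at every title element or when a size limit is reached.

```python
def chunk_by_titles(elements, max_chars=1000):
    """Group elements into chunks, starting a new chunk at every Title element."""
    chunks, current, size = [], [], 0
    for el in elements:  # elements as produced by the preprocessing step
        is_title = el["type"] == "Title"
        # Start a new chunk on a title, or when the size limit would be exceeded.
        if current and (is_title or size + len(el["text"]) > max_chars):
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(el["text"])
        size += len(el["text"])
    if current:
        chunks.append(" ".join(current))
    return chunks

# Example: chunks = chunk_by_titles(normalised), using the elements from above.
```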

Rules-based parsers vs Visual Information
So far, we have looked at documents that provide some kind of inherent formatting information. Documents like Word or an HTML file can be preprocessed using rules-based parsers.
Other documents, like PDFs and images, don’t provide that metadata or formatting information. Document Image Analysis (DIA) techniques are used for these kinds of documents to transform visual information into information that can be stored in the vector database. Let’s look at two of these DIA methods:
- Document Layout Detection: uses an object detection model to draw and label bounding boxes around elements on a document image and then uses OCR to extract the text.
- Vision Transformers: take a document image as input and produce a structured text representation as output without needing OCR.
Each model type has advantages and disadvantages that you need to consider.
Document layout models have the advantage of decomposing documents in a more structured way and can keep track of where a specific piece of text came from. Being structured also makes them less flexible, and they require two calls: object detection and OCR.
A vision transformer model has the advantage of being very flexible, which makes it well suited for documents with irregular formats and easy to adapt to new ones. Being generative, the model can hallucinate or repeat content, and it is also computationally expensive.
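To make the two-call structure of the layout-detection approach concrete, here is a rough sketch: an object detection step followed by OCR on each detected region. The detect_layout() helper is hypothetical (a real system would call a trained layout model), and OCR is done here with the pytesseract library; both choices are assumptions for illustration.

```python
from PIL import Image
import pytesseract


def detect_layout(image):
    """Hypothetical layout model: returns labelled bounding boxes.
    A real implementation would run an object detection model trained
    on document layouts and return its predictions."""
    return [
        {"label": "Title", "box": (50, 40, 950, 120)},
        {"label": "NarrativeText", "box": (50, 160, 950, 700)},
    ]


def parse_page(path):
    image = Image.open(path)
    elements = []
    # Call 1: object detection to find and label regions on the page image.
    for region in detect_layout(image):
        crop = image.crop(region["box"])
        # Call 2: OCR on each region to extract its text.
        text = pytesseract.image_to_string(crop)
        elements.append({"type": region["label"], "text": text.strip()})
    return elements
```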
The conversion of tables
Tables hold valuable information in a dense format, so reading, converting, and storing the content in the vector database is crucial. Again, we see a divide between documents that provide table structure information (Word or HTML) needed to extract the data and those that don’t.
For documents that do not provide that table information, we use techniques like table transformers, vision transformers and OCR post-processing, which eventually produce an HTML output that contains the table structure.
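As a small sketch of what you can do with that HTML output, the pandas library (an assumption for this example) can parse it into a DataFrame so you can store both a flat text version for embedding and the original HTML to preserve the structure; the table content is made up:

```python
from io import StringIO

import pandas as pd

# Example HTML table, e.g. as produced by a table extraction model.
table_html = """
<table>
  <tr><th>Region</th><th>Revenue</th></tr>
  <tr><td>EMEA</td><td>1.2M</td></tr>
  <tr><td>APAC</td><td>0.9M</td></tr>
</table>
"""

# Parse the HTML into a DataFrame to validate the structure...
df = pd.read_html(StringIO(table_html))[0]

# ...and store both a flat text version (for embedding) and the HTML
# (to preserve the table structure) as metadata for the chunk.
table_element = {
    "type": "Table",
    "text": df.to_string(index=False),
    "metadata": {"text_as_html": table_html.strip()},
}
```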
Conclusion
As they say, garbage in, garbage out: spending time preprocessing your organisational data into a format that enables the LLM to find the correct data is crucial. There is no one-size-fits-all solution; you must discover which extraction techniques work best for your use case.
If you have made it all the way to the end of this blog, you are probably interested in a more detailed overview of preprocessing your data for your organisation’s RAG-powered LLM application.
In that case, I strongly suggest you follow this free course from DeepLearning.AI. I am not affiliated with or sponsored by DeepLearning.AI, but I find their courses helpful.
https://www.deeplearning.ai/short-courses/preprocessing-unstructured-data-for-llm-applications