26 February 2025
In this tutorial, you will use IBM's Docling together with open source IBM Granite vision, text embedding and generative AI models to create a RAG system. These models are available through various open source frameworks and hosting providers; here, we use Replicate to connect to the IBM Granite vision and generative AI models and Hugging Face to connect to the embeddings model.
Retrieval-augmented generation (RAG) is a technique used with large language models (LLMs) to connect the model with a knowledge base of information outside the data the LLM has been trained on without having to perform fine-tuning. Traditional RAG is limited to text-based use cases such as text summarization and chatbots.
Multimodal RAG can use multimodal LLMs (MLLMs) to process information from multiple types of data to be included as part of the external knowledge base used in RAG. Multimodal data can include text, images, audio, video or other forms. Popular multimodal LLMs include Google’s Gemini, Meta’s Llama 3.2 and OpenAI’s GPT-4 and GPT-4o.
For this recipe, you will use an IBM Granite model capable of processing different modalities. You will create an AI system to answer real-time user queries from unstructured data in a PDF.
Welcome to this Granite tutorial, where you'll learn how to harness the power of advanced tools to build an AI-powered multimodal RAG pipeline. The tutorial will guide you through the following processes:
This tutorial uses three cutting-edge technologies:
By the end of this tutorial, you will accomplish the following:
This tutorial is designed for AI developers, researchers and enthusiasts looking to enhance their knowledge of document management and advanced natural language processing (NLP) techniques. The tutorial is also available as a Jupyter Notebook in the IBM Granite Community's Granite Snack Cookbook GitHub repository.
Ensure you are running Python 3.10, 3.11 or 3.12 in a freshly created virtual environment.
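If you are working in a Jupyter Notebook, a minimal setup cell might look like the following; the package list is an assumption and may differ from the notebook's own requirements.

```python
# Install the libraries used in this tutorial (package list is an assumption,
# not the notebook's exact requirements).
%pip install docling transformers pillow replicate langchain langchain_community langchain_huggingface langchain_milvus
```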
To see some logging information, we can configure the INFO log level.
NOTE: It is okay to skip running this cell.
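A minimal sketch of such a logging cell:

```python
import logging

# Emit INFO-level messages from the libraries used in this tutorial.
logging.basicConfig(level=logging.INFO)
```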
Specify the embeddings model to use for generating text embedding vectors. Here we will use one of the Granite Embedding models.
To use a different embeddings model, replace this code cell with one from this Embeddings Model recipe.
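As a sketch, assuming the 30M-parameter English Granite embedding model on Hugging Face:

```python
from langchain_huggingface import HuggingFaceEmbeddings

# Granite embedding model hosted on Hugging Face (model ID is an assumption;
# any embeddings model supported by LangChain can be substituted).
embeddings_model = HuggingFaceEmbeddings(
    model_name="ibm-granite/granite-embedding-30m-english",
)
```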
Specify the MLLM to use for image understanding. We will use the Granite vision model. Here we use the Replicate LangChain client to connect to a Granite model from the ibm-granite org on Replicate.
To get set up with Replicate, see Getting Started with Replicate. To connect to a model on a provider other than Replicate, substitute this code cell with one from the LLM component recipe.
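A sketch of this connection, assuming the Granite vision model is published under the ibm-granite org on Replicate (the model slug and parameters are assumptions):

```python
import os
from langchain_community.llms import Replicate

# Replicate reads the API token from this environment variable.
os.environ["REPLICATE_API_TOKEN"] = "<your-replicate-api-token>"

# Granite vision model for image understanding (model slug is an assumption).
vision_model = Replicate(
    model="ibm-granite/granite-vision-3.2-2b",
    model_kwargs={"max_tokens": 1000},
)
```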
Specify the language model to use for the RAG generation operation.
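For example, a Granite instruct model on Replicate could handle generation (the model slug is an assumption):

```python
# Granite instruct model used to generate the final RAG answers.
model = Replicate(
    model="ibm-granite/granite-3.2-8b-instruct",
    model_kwargs={"max_tokens": 1000},
)
```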
In this example, from a set of source documents, we use Docling to convert the documents into text and images. The text is then split into chunks. The images are processed by the MLLM to generate image summaries.
Docling will download the PDF documents and process them so we can obtain the text and images the documents contain. In the PDF, there are various data types, including text, tables, graphs and images.
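A simplified conversion sketch; the source URL is a placeholder, and enabling picture image generation lets us retrieve the embedded images later:

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Placeholder source document; substitute your own PDF URLs or file paths.
sources = ["https://example.com/annual-report.pdf"]

# Keep rendered images of the pictures so the vision model can describe them later.
pipeline_options = PdfPipelineOptions(generate_picture_images=True)
converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

# Map each source to its converted Docling document.
conversions = {source: converter.convert(source=source).document for source in sources}
```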
With the documents processed, we then further process the text elements in the documents. We chunk them into sizes appropriate for the embeddings model we are using. A list of LangChain documents is created from the text chunks.
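A chunking sketch using Docling's HybridChunker, sized with the same tokenizer as the embeddings model (a simplification: it chunks the whole document text rather than only the text elements handled separately below):

```python
from docling.chunking import HybridChunker
from langchain_core.documents import Document

# Chunk to sizes that fit the embeddings model's context window.
chunker = HybridChunker(tokenizer="ibm-granite/granite-embedding-30m-english")

texts = []
for source, document in conversions.items():
    for chunk in chunker.chunk(document):
        texts.append(Document(page_content=chunk.text, metadata={"source": source}))
```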
Next, we process any tables in the documents. We convert the table data to markdown format for passing into the language model. A list of LangChain documents is created from the tables' markdown renderings.
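A sketch of the table handling:

```python
# Render each table as markdown and wrap it in a LangChain document.
tables = []
for source, document in conversions.items():
    for table in document.tables:
        table_md = table.export_to_markdown(doc=document)
        tables.append(Document(page_content=table_md, metadata={"source": source}))
```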
Finally we process any images in the documents. Here we use the vision language model to understand the content of an image. In this example, we are interested in any textual information in the image. You might want to experiment with different prompt text to see how it might improve the results.
NOTE: Processing the images can take a very long time depending upon the number of images and the service running the vision language model.
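A sketch of the image summarization step; for clarity it calls the Replicate Python client directly rather than the LangChain wrapper, and the model slug, input field names and prompt text are assumptions:

```python
import base64
import io

import replicate

# Hypothetical prompt asking the vision model to describe textual content in each image.
image_prompt = "If the image contains text, explain the image using that text."

pictures = []
for source, document in conversions.items():
    for picture in document.pictures:
        pil_image = picture.get_image(document)
        if pil_image is None:
            continue
        # Encode the picture as a base64 data URI for the Replicate API.
        buffer = io.BytesIO()
        pil_image.save(buffer, format="PNG")
        data_uri = "data:image/png;base64," + base64.b64encode(buffer.getvalue()).decode()
        # Ask the Granite vision model to summarize the image.
        output = replicate.run(
            "ibm-granite/granite-vision-3.2-2b",
            input={"prompt": image_prompt, "image": data_uri},
        )
        summary = "".join(output)
        pictures.append(Document(page_content=summary, metadata={"source": source}))
```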
We can then display the LangChain documents created from the input documents.
Using the embeddings model, we load the documents created from the text chunks, table renderings and generated image summaries into a vector database. Creating this vector database allows us to easily conduct a semantic similarity search across our documents.
NOTE: Population of the vector database can take some time depending on your embedding model and service.
Specify the database to use for storing and retrieving embedding vectors.
To connect to a vector database other than Milvus, replace this code cell with one from this Vector Store recipe.
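For example, Milvus Lite can run against a local database file:

```python
import tempfile

from langchain_milvus import Milvus

# Milvus Lite stores the collection in a local file (the path here is just an example).
db_file = tempfile.NamedTemporaryFile(prefix="milvus_", suffix=".db", delete=False).name
vector_db = Milvus(
    embedding_function=embeddings_model,
    connection_args={"uri": db_file},
    auto_id=True,
)
```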
We now add all the LangChain documents for the text, tables and image descriptions to the vector database.
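A sketch of loading everything into the store:

```python
# Combine the text chunks, table renderings and image summaries, then index them.
documents = texts + tables + pictures
ids = vector_db.add_documents(documents)
print(f"{len(ids)} documents added to the vector database")
```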
Now that we have successfully converted our documents and vectorized them, we can set up our RAG pipeline.
Here we test the vector database by searching for chunks with relevant information to our query in the vector space. We display the documents associated with the retrieved image description.
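A sketch of such a search; the query string is just an example:

```python
query = "What does the text in the images describe?"  # example query; try your own

# Retrieve the chunks closest to the query in the embedding space.
for doc in vector_db.similarity_search(query, k=3):
    print(doc.metadata.get("source"))
    print(doc.page_content[:200])
    print()
```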
Feel free to try different queries.
The returned document should be responsive to the query. Let's go ahead and construct our RAG pipeline.
First we create the prompts for Granite to perform the RAG query. We use the Granite chat template and supply the placeholder values that the LangChain RAG pipeline will replace.
`{context}` will hold the retrieved chunks, like those shown in the previous search, and feed them to the model as document context for answering our question.
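One way to build such a prompt, following the pattern used in other Granite recipes, is to render the model's chat template with its tokenizer and leave the LangChain placeholders in place (the model ID and the chat template's support for a documents argument are assumptions):

```python
from transformers import AutoTokenizer
from langchain_core.prompts import PromptTemplate

# Render the Granite chat template with {input} and {context} left as LangChain placeholders.
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.2-8b-instruct")
prompt_str = tokenizer.apply_chat_template(
    conversation=[{"role": "user", "content": "{input}"}],
    documents=[{"title": "placeholder", "text": "{context}"}],
    add_generation_prompt=True,
    tokenize=False,
)
prompt_template = PromptTemplate.from_template(template=prompt_str)
```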
Next, we construct the RAG pipeline by using the Granite prompt templates previously created.
The pipeline uses the query to retrieve relevant documents from the vector database and passes them to the model as context for answering the query.
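A sketch of the pipeline wiring with LangChain's retrieval helpers:

```python
from langchain.chains.retrieval import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# Stuff the retrieved documents into the {context} placeholder and generate with Granite.
combine_docs_chain = create_stuff_documents_chain(llm=model, prompt=prompt_template)
rag_chain = create_retrieval_chain(
    retriever=vector_db.as_retriever(),
    combine_docs_chain=combine_docs_chain,
)

# Retrieve, build the prompt and generate the answer for our earlier query.
output = rag_chain.invoke({"input": query})
print(output["answer"])
```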
Awesome! We have created an AI application that can successfully leverage knowledge from the source documents' text and images.