
Build a document-based question answering system by using Docling with Granite 3.1

18 December 2024

Authors

Ash Minhas

Manager, Technical Content | AI Advocate

IBM

Anna Gutowska

AI Engineer, Developer Advocate

IBM

Erika Russi

Data Scientist

IBM

In this tutorial, you will use IBM® Docling and the open source Granite™ 3.1 model to perform document question answering for various file types.

What is Docling?

Docling is an IBM open source toolkit for parsing documents and exporting them to preferred formats. Input file formats include PDF, DOCX, PPTX, XLSX, images, HTML, AsciiDoc and Markdown. These documents can be exported to Markdown or JSON. Docling also provides OCR (optical character recognition) support for scanned documents. Use cases include scanning medical records, banking documents and even travel documents for quicker processing.
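For example, converting a single file to Markdown takes only a few lines. The following minimal sketch uses the same DocumentConverter class that we configure more fully later in this tutorial (the file name is illustrative):

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")  # Hypothetical input file
print(result.document.export_to_markdown())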

RAG and large context windows

Retrieval augmented generation (RAG) is an architecture for connecting large language models (LLMs) with external knowledge bases without fine-tuning or retraining. Text is embedded and stored in a vector database, then retrieved at query time so that the pretrained model can return relevant information for natural language processing (NLP) and machine learning tasks.
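The following minimal sketch illustrates this embed, store and retrieve flow with the same components that we use later in this tutorial. It assumes that Ollama is running locally with the embedding model pulled (see Step 1):

from langchain_ollama import OllamaEmbeddings
from langchain_community.vectorstores import FAISS

# Toy corpus standing in for real document chunks
texts = ["Docling parses documents.", "Granite 3.1 generates answers."]
vectorstore = FAISS.from_texts(texts, OllamaEmbeddings(model="nomic-embed-text"))
print(vectorstore.similarity_search("What parses documents?", k=1))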

When an LLM has a larger context window, the generative AI model can process more information at once. This means that combining RAG with a large-context model lets us pass more of the relevant retrieved information to the model in a single prompt. The LLM we use in this tutorial is the IBM Granite-3.1-8B-Instruct model, which supports a context window of 128K tokens. We will access the model locally by using Ollama rather than through a remote API. This model is also available on Hugging Face.
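As a quick smoke test, you can prompt the local Granite model in a few lines. This assumes that Ollama is running and the model has been pulled, as described in Step 1:

from langchain_ollama import OllamaLLM

llm = OllamaLLM(model="granite3.1-dense:8b")
print(llm.invoke("In one sentence, what is a context window?"))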

Steps

This tutorial can be found on our GitHub in the form of a Jupyter Notebook. Jupyter Notebooks are widely used within data science to combine code, text, images and data visualizations into a well-formed analysis.

Step 1. Set up your environment

We first need to set up our environment by fulfilling some prerequisites.

  • Install the latest version of Ollama to run models locally.
  • Pull the latest Granite 3.1 model and an embedding model by running the following commands.
!ollama pull granite3.1-dense:8b
!ollama pull nomic-embed-text
  • Install and import the necessary libraries and modules. We use Docling v2 in this tutorial.
# Install required packages
!pip install -q "langchain>=0.1.0" "langchain-community>=0.0.13" "langchain-core>=0.1.17" \
"langchain-ollama>=0.0.1" "pdfminer.six>=20221105" "markdown>=3.5.2" "docling>=2.0.0" \
"beautifulsoup4>=4.12.0" "unstructured>=0.12.0" "chromadb>=0.4.22" "faiss-cpu>=1.7.4" # Required imports
import os
import tempfile
import shutil
from pathlib import Path
from IPython.display import Markdown, display

# Docling imports
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractCliOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption, WordFormatOption, SimplePipeline

# LangChain imports
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings, OllamaLLM
from langchain_community.vectorstores import FAISS
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

Step 2. Document format detection

We will work with various document formats in this tutorial. Let's create a helper function to detect document formats by using the file extension.

def get_document_format(file_path) -> InputFormat:
    """Determine the document format based on file extension"""
    try:
        extension = os.path.splitext(str(file_path))[1].lower()
        format_map = {
            '.pdf': InputFormat.PDF,
            '.docx': InputFormat.DOCX,
            '.doc': InputFormat.DOCX,
            '.pptx': InputFormat.PPTX,
            '.html': InputFormat.HTML,
            '.htm': InputFormat.HTML
        }
        # Returns None for unsupported extensions
        return format_map.get(extension)
    except Exception as e:
        print(f"Error in get_document_format: {e}")
        return None

Step 3. Document conversion

Next, we can use the DocumentConverter class to create a function that converts any supported document to markdown. This function identifies text, data tables, document images and captions by using Docling. The function takes a file as input, processes it by using Docling's advanced document handling, converts it to markdown and saves the result in a Markdown file. Both scanned and text-based documents are supported, and document structure is preserved. Key components of this function are:

  • PdfPipelineOptions : Configures how PDFs are processed.
  • TesseractCliOcrOptions : Sets up OCR for scanned documents.
  • DocumentConverter : Handles the actual conversion process.

def convert_document_to_markdown(doc_path) -> str:
    """Convert document to markdown using simplified pipeline"""
    try:
        # Convert to absolute path string
        input_path = os.path.abspath(str(doc_path))
        print(f"Converting document: {doc_path}")

        # Create temporary directory for processing
        with tempfile.TemporaryDirectory() as temp_dir:
            # Copy input file to temp directory
            temp_input = os.path.join(temp_dir, os.path.basename(input_path))
            shutil.copy2(input_path, temp_input)

            # Configure pipeline options
            pipeline_options = PdfPipelineOptions()
            pipeline_options.do_ocr = False # Disable OCR temporarily
            pipeline_options.do_table_structure = True

            # Create converter with minimal options
            converter = DocumentConverter(
                allowed_formats=[
                    InputFormat.PDF,
                    InputFormat.DOCX,
                    InputFormat.HTML,
                    InputFormat.PPTX,
                ],
                format_options={
                    InputFormat.PDF: PdfFormatOption(
                        pipeline_options=pipeline_options,
                    ),
                    InputFormat.DOCX: WordFormatOption(
                        pipeline_cls=SimplePipeline
                    )
                }
            )

            # Convert document
            print("Starting conversion...")
            conv_result = converter.convert(temp_input)

            if not conv_result or not conv_result.document:
                raise ValueError(f"Failed to convert document: {doc_path}")

            # Export to markdown
            print("Exporting to markdown...")
            md = conv_result.document.export_to_markdown()

            # Create output path
            output_dir = os.path.dirname(input_path)
            base_name = os.path.splitext(os.path.basename(input_path))[0]
            md_path = os.path.join(output_dir, f"{base_name}_converted.md")

            # Write markdown file
            print(f"Writing markdown to: {base_name}_converted.md")
            with open(md_path, "w", encoding="utf-8") as fp:
                fp.write(md)

            return md_path
    except Exception as e:
        raise RuntimeError(f"Error converting document {doc_path}: {e}") from e

Step 4. QA chain setup

The QA chain is the heart of our system. It combines several components:

Document loading:

  • Loads the markdown file that we created.
  • Splits it into manageable chunks for processing.

Text splitting:

  • Breaks down the document into smaller pieces.
  • Maintains context with overlap between chunks.
  • Ensures efficient processing by the language model.

Vector store:

  • Creates embeddings for each text chunk.
  • Stores them in a FAISS index for fast retrieval.
  • Enables semantic search capabilities.

Language model:

  • Uses Ollama for both embeddings and text generation.
  • Maintains conversation history.
  • Generates contextual responses.

The following setup_qa_chain function sets up this entire pipeline.

def setup_qa_chain(
    markdown_path: Path,
    embeddings_model_name: str = "nomic-embed-text:latest",
    model_name: str = "granite3.1-dense:8b"
):
    """Set up the QA chain for document processing"""
    # Load and split the document
    loader = UnstructuredMarkdownLoader(str(markdown_path))
    documents = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
    )
    texts = text_splitter.split_documents(documents)

    # Create embeddings and vector store
    embeddings = OllamaEmbeddings(model=embeddings_model_name)
    vectorstore = FAISS.from_documents(texts, embeddings)

    # Initialize LLM
    llm = OllamaLLM(
        model=model_name,
        temperature=0
    )

    # Set up conversation memory
    memory = ConversationBufferMemory(
        memory_key="chat_history",
        output_key="answer",
        return_messages=True
    )

    # Create the chain
    qa_chain = ConversationalRetrievalChain.from_llm(
        llm=llm,
        retriever=vectorstore.as_retriever(search_kwargs={"k": 10}),
        memory=memory,
        return_source_documents=True
    )

    return qa_chain
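
Two design choices are worth noting. Setting temperature=0 makes the model's answers deterministic, which suits factual question answering, and search_kwargs={"k": 10} retrieves the 10 most similar chunks for each query; increasing k supplies more context at the cost of a longer prompt.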

Step 5. Set up question-answering interface

Finally, let's create a simple interface for asking questions. This function takes in the chain and user query as parameters.

def ask_question(qa_chain, question: str):
    """Ask a question and display the answer"""
    result = qa_chain.invoke({"question": question})
    display(Markdown(f"**Question:** {question}\n\n**Answer:** {result['answer']}"))

Step 6. Perform question-answering

Let's put it all together and iterate over our questions for a specific document. The path to this document is stored in doc_path and can be any document that you want to test. For our sample document, check out our GitHub. The system maintains conversation history and can handle follow-up questions.

# Process a document
doc_path = Path("ibmredbook.pdf") # Replace with your document path
# Check format and process
doc_format = get_document_format(doc_path)
if doc_format:
    md_path = convert_document_to_markdown(doc_path)
    qa_chain = setup_qa_chain(md_path)

    # Example questions
    questions = [
        "What is the main topic of this document?",
        "What are the key points discussed?",
        "Can you summarize the conclusions?"
    ]

    for question in questions:
        ask_question(qa_chain, question)
else:
    print(f"Unsupported document format: {doc_path.suffix}")

Output:

Converting document: ibmredbook.pdf
Starting conversion...
Exporting to markdown...
Writing markdown to: ibmredbook_converted.md

Question: What is the main topic of this document?

Answer: The main topics covered in this document are building and managing containers using Red Hat OpenShift, deploying applications, and security aspects related to containerization on IBM Power systems. The document also includes an introduction to Red Hat OpenShift, its benefits, core concepts, and implementation on IBM Power. Additionally, it discusses multi-architecture containerization, monitoring tools and techniques, log management, performance tuning, and optimization.

Question: What are the key points discussed?

Answer: This document primarily focuses on implementing Red Hat OpenShift Container Platform on IBM Power systems for managing containers in a hybrid cloud environment. Key points covered include:

Introduction to Red Hat OpenShift: It is an enterprise Kubernetes platform that extends Kubernetes with additional features and tools, enhancing productivity and security for businesses using container technology at scale.

Benefits of Using Red Hat OpenShift for Container Orchestration: The document highlights the advantages of employing Red Hat OpenShift for managing containers, such as its comprehensive solution for hybrid cloud environments, including a container runtime, networking, monitoring, a container registry, authentication, and authorization.

Minimum IBM Power Requirements: Red Hat OpenShift Container Platform 4.15 can be installed on IBM Power 9 or IBM Power 10 processor-based systems.

Deploying Red Hat OpenShift on IBM Power Systems: This involves tailoring the networking infrastructure to leverage the robust capabilities and unique architecture of Power Systems, optimizing network performance for high throughput and low latency, ensuring network security and compliance, and managing network configurations for enterprise-level deployments.

Optimizing Network Performance: The document emphasizes the importance of faster storage, particularly for etcd on control plane nodes, as Red Hat OpenShift Container Platform is sensitive to disk performance.

Multi-Architecture Containerization: The text discusses key concepts in multi-architecture containerization and provides guidelines for implementing it using IBM Power control planes. It also addresses challenges and solutions related to multi-architecture containerization.

Security Aspects: Enterprise-grade security is mentioned as a crucial aspect of Red Hat OpenShift on IBM Power systems, although specific details are not provided in the given context.

Monitoring Tools and Log Management: The document does not explicitly mention monitoring tools or log management; however, it can be inferred that these aspects are covered within the broader context of managing containers using Red Hat OpenShift on IBM Power systems.

Performance Tuning and Optimization: While specific tuning and optimization techniques are not detailed in the provided context, the document implies that performance considerations should be taken into account during deployment and configuration.

Question: Can you summarize the conclusions?

Answer: The document discusses the implementation of Red Hat OpenShift Container Platform on IBM Power systems for managing containers in a hybrid cloud environment. Here are the main conclusions drawn from this document:

Benefits: Red Hat OpenShift provides a comprehensive solution for hybrid cloud environments, encompassing essential components such as a container runtime, networking, monitoring, a container registry, authentication, and authorization. It extends Kubernetes with additional features and tools to enhance productivity and security, making it an ideal choice for businesses looking to leverage container technology at scale.

Minimum Requirements: Red Hat OpenShift Container Platform 4.15 can be installed on IBM Power 9 or IBM Power 10 processor-based systems. For comprehensive guidance and further information on installation and configuration, refer to the IBM Redbooks publication Implementing, Tuning, and Optimizing Workloads with Red Hat OpenShift on IBM Power (SG24-8537) and Red Hat OpenShift Documentation.

Deployment Process: Deploying Red Hat OpenShift on IBM Power Systems involves tailoring the networking infrastructure to fully leverage the robust capabilities and unique architecture of Power Systems. This includes optimizing network performance for high throughput and low latency, ensuring network security and compliance, and managing network configurations to meet enterprise-level demands.

Network Performance Optimization: Faster storage is recommended, particularly for etcd on control plane nodes. On many cloud platforms, storage size and IOPS scale together, so you might need to over-allocate storage volume to obtain sufficient performance.

Multi-Architecture Containerization: Red Hat OpenShift supports multiple architectures (x86 and IBM Power) with RHOS 4.14 or later, simplifying the management of your Red Hat OpenShift environment on both x86 and IBM Power servers.

Security Aspects: The integration of Red Hat OpenShift running on IBM Power servers with existing infrastructure involves strategic networking solutions that bridge on-premises systems with your new cloud environment. This enables organizations to leverage the strengths of both infrastructures for enhanced flexibility, scalability, and resilience while ensuring network security and compliance.

Performance Tuning: The document does not provide specific details about performance tuning; however, it is mentioned that optimizing network performance for high throughput and low latency is essential. For comprehensive guidance on performance tuning, refer to the IBM Redbooks publication Implementing, Tuning, and Optimizing Workloads with Red Hat OpenShift on IBM Power (SG24-8537) and Red Hat OpenShift Documentation.

In summary, this document highlights that implementing Red Hat OpenShift Container Platform on IBM Power systems offers a robust foundation for developing, deploying, and scaling cloud-native applications in a hybrid cloud environment. It emphasizes the importance of optimizing network performance, ensuring security, and leveraging multi-architecture containerization capabilities to create an efficient and flexible solution for managing containers.

Great! The system was able to retrieve relevant information from the document to answer questions. Feel free to test this system with any of your own files and questions!

Conclusion

Using Docling and Granite 3.1, you built a document question answering system that is compatible with various file types. As a next step, this methodology can be extended into a chatbot with an interactive UI. There are many opportunities to adapt this tutorial to your specific use cases.
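
As a minimal starting point, an interactive loop can wrap the existing chain. The following is a sketch of the idea rather than a full chatbot UI; it reuses the qa_chain built in Step 4:

# A simple chat loop over the existing qa_chain
def chat(qa_chain):
    while True:
        question = input("You (press Enter to quit): ").strip()
        if not question:
            break
        result = qa_chain.invoke({"question": question})
        print(f"Assistant: {result['answer']}\n")

# chat(qa_chain)  # Uncomment to start an interactive session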
