Use Llama 3.2-90b-vision-instruct for multimodal AI queries in Python with watsonx

25 September 2024

Authors
Anna Gutowska AI Advocate, IBM
Erika Russi Data Scientist, IBM
N.A. B. Lead, AI Advocacy, IBM

In this tutorial, you will discover how to apply the Meta Llama 3.2-90b-vision-instruct model, now available on watsonx.ai, to computer vision tasks such as image captioning and visual question answering.

Overview of multimodal AI
Multimodal versus unimodal AI models

Many of us are familiar with unimodal AI applications. A popular unimodal AI tool is ChatGPT. Chatbots like ChatGPT use natural language processing (NLP) to understand user questions and automate responses in real time. These unimodal large language models (LLMs) are limited to a single type of input: text.

Multimodal artificial intelligence (AI) relies on machine learning models built on neural networks. These neural networks process and integrate information from multiple data types using deep learning techniques. The modalities that a generative AI (gen AI) model can consume and produce include text, images, video and audio.

Multimodal AI systems have many real-world use cases, ranging from medical image diagnosis in healthcare settings using computer vision to speech recognition in translation applications. The major advantage of multimodal architectures is the ability to combine and reason over several types of data at once, which can benefit a wide range of domains.

Multimodal AI: how it works

Multimodal AI entails three elements:

Input module

The input module is built upon multiple unimodal neural networks for preprocessing different data types. Here, the data is prepared for the machine learning algorithms applied in the fusion module.

Fusion module

The combining, aligning and processing of data from each modality occurs in this module. Several fusion techniques are commonly used. In early fusion, raw data from all input types is combined before encoding. In mid fusion, data from different modalities is merged at intermediate preprocessing stages. In late fusion, each modality is first processed by its own model in the input module, and the resulting outputs are consolidated afterward.
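
To make late fusion concrete, here is a toy sketch in plain Python. The encoders, dimensions and inputs are invented for illustration only; production multimodal models learn these representations with deep neural networks.

import numpy as np

# Toy late-fusion sketch: each modality is encoded by its own (stand-in) encoder,
# and the resulting embeddings are concatenated into one joint representation.
def encode_text(text):
    # Stand-in text encoder: hash characters into a fixed-length vector
    vec = np.zeros(8)
    for i, ch in enumerate(text):
        vec[i % 8] += ord(ch)
    return vec

def encode_image(pixels):
    # Stand-in image encoder: simple per-image summary statistics
    return np.array([pixels.mean(), pixels.std(), pixels.min(), pixels.max()])

def late_fusion(text, pixels):
    # Late fusion: combine independently produced embeddings
    return np.concatenate([encode_text(text), encode_image(pixels)])

fused = late_fusion("a city street at dusk", np.random.rand(64, 64, 3))
print(fused.shape)  # (12,) -- the joint representation handed to an output module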

Output module

The output module generates results in the desired output format by making sense of the data produced in the fusion module. These outputs can take on various forms such as text, image or a combination of formats.

Steps
Step 1. Set up your environment

While you can choose from several tools, this tutorial is best suited for a Jupyter Notebook. Jupyter Notebooks are widely used within data science to combine code with various data sources like text, images and data visualizations.

This tutorial walks you through how to set up an IBM account to use a Jupyter Notebook.

  1. Log in to watsonx.ai using your IBM Cloud account.

  2. Create a watsonx.ai project.

    You can get your project ID from within your project. Click the Manage tab. Then, copy the project ID from the Details section of the General page. You need this ID for this tutorial.

  3. Create a Jupyter Notebook.

    This step will open a Notebook environment where you can copy the code from this tutorial and run it on your own. Alternatively, you can download this notebook to your local system and upload it to your watsonx.ai project as an asset. This Jupyter Notebook, along with the datasets used, can be found on GitHub.

Step 2. Set up a Watson Machine Learning (WML) service instance and API key

For this tutorial, we suggest using the Llama 3.2-90b-vision-instruct model with watsonx.ai to achieve similar results. You are free to use any AI model that supports multimodal learning. There are several multimodal AI models to choose from, including OpenAI’s GPT-4V(ision) and DALL-E 3 as well as Google’s Gemini. Ensure you are using the appropriate API if working with other models, as this tutorial is designed for watsonx.ai.

  1. Create a Watson Machine Learning service instance (select your appropriate region and choose the Lite plan, which is a free instance).
  2. Generate an API Key in WML.
  3. Associate the WML service to the project you created in watsonx.ai.
Step 3. Install and import relevant libraries and set up your credentials

We'll need a few libraries and modules for this tutorial. Make sure to import the following ones; if they're not installed, you can resolve this with a quick pip install.

#installations
%pip install requests pillow | tail -n 1
#imports
import requests
import base64

from PIL import Image

In this tutorial, the API requests will require Bearer authentication. To obtain your Bearer token, please run the following commands in your terminal and insert your watsonx API key where indicated. The token will begin with "Bearer " and will be followed by a long string of characters. For more detailed instructions, please reference the official documentation.

Note that this token expires an hour after generation. This means you will need to run the final command again once the token expires to continue with the tutorial.

# Install the IBM Cloud CLI (macOS installer shown; see the documentation for other platforms):
# curl -fsSL https://clis.cloud.ibm.com/install/osx | sh

# Log in with your watsonx API key:
# ibmcloud login --apikey YOUR_API_KEY_HERE

# Print your IAM OAuth token; copy the value that begins with "Bearer ":
# ibmcloud iam oauth-tokens
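
If you would rather stay in Python, the same access token can be requested from IBM Cloud's IAM token endpoint. The helper below is a sketch under that assumption; get_bearer_token is not part of the original walkthrough, and you will still need to refresh the token once it expires.

import requests

def get_bearer_token(api_key):
    # Exchange an IBM Cloud API key for an IAM access token
    response = requests.post(
        "https://iam.cloud.ibm.com/identity/token",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        data={
            "grant_type": "urn:ibm:params:oauth:grant-type:apikey",
            "apikey": api_key,
        },
    )
    response.raise_for_status()
    return "Bearer " + response.json()["access_token"]

# bearer_token = get_bearer_token("YOUR_API_KEY_HERE")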

Once you copy your bearer token from your terminal, paste it into the following code block along with your project ID where indicated. Ensure that your bearer_token begins with the word "Bearer " and is not just the long string of characters on its own.

credentials = {
    "url": "https://us-south.ml.cloud.ibm.com/ml/v1/text/chat?version=2023-05-29",
    "project_id": "YOUR_PROJECT_ID_HERE",
    "bearer_token": "Bearer YOUR_BEARER_TOKEN_HERE"
}
Step 4. Encode images

In this tutorial, we will be working with several images for multimodal AI applications such as image captioning and object detection. The images we will be using can be accessed using the following URLs. We can store these URLs in a list to iteratively encode them.

url_image_1 = 'https://assets.ibm.com/is/image/ibm/hv6b0935?$original$'
url_image_2 = 'https://assets.ibm.com/is/image/ibm/c30a2d57-a62b-4bb3-818895bfe2fc7bf8?$original$'
url_image_3 = 'https://assets.ibm.com/is/image/ibm/nt170969?$original$'
url_image_4 = 'https://assets.ibm.com/is/image/ibm/fb123b45-6530-4dd9-a758-10a7ec234d9d?$original$'

image_urls = [url_image_1, url_image_2, url_image_3, url_image_4]

To gain a better understanding of our data input, let's display the images.

for idx, url in enumerate(image_urls, start=1):
    print(f'url_image_{idx}')
    display(Image.open(requests.get(url, stream=True).raw))

Output:

The labels url_image_1 through url_image_4 print above each rendered image: a busy city street, a woman running outdoors, a flooded area and a nutrition label.

To encode these images in a way that is digestible for the LLM, we base64-encode the raw image bytes and then decode the result into a UTF-8 string.

encoded_images = []

for url in image_urls:
    # Download each image and base64-encode its bytes as a UTF-8 string
    encoded_images.append(base64.b64encode(requests.get(url).content).decode("utf-8"))
Step 5. Set up the API request

Now that our images can be passed to the LLM, let's set up a POST request to the watsonx API. The system prompt remains the same for each iteration of the API call, so we can store it in the variable system_prompt. This prompt gives the LLM guidelines for generating appropriate responses.

system_prompt = '''You always answer the questions with markdown formatting using GitHub syntax. The markdown formatting you support: headings, bold, italic, links, tables, lists, code blocks, and blockquotes.
You must omit that you answer the questions with markdown. Any HTML tags must be wrapped in block quotes, for example ```<html>```. You will be penalized for not rendering code in block quotes.
When returning code blocks, specify language.\n\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information. Please produce concise responses.
'''

Let's create a function, augment_api_request_body, that takes the user query and image as parameters and builds the body of the API request. We will use this function in each iteration.

def augment_api_request_body(user_query, image):
    # Build the chat request body: system prompt, user text and base64-encoded image
    body = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": [
                {"type": "text", "text": user_query},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image}"}}
            ]}
        ],
        "project_id": credentials.get("project_id"),
        "model_id": "meta-llama/llama-3-2-90b-vision-instruct",
        "decoding_method": "greedy",
        "repetition_penalty": 1,
        "max_tokens": 500
    }
    return body

Next, we can establish the headers of our API requests. This will remain unchanged throughout the tutorial. The headers provide the API with the request's metadata.

headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
    "Authorization": credentials.get("bearer_token")
}
Step 6. Image captioning

Now, we can loop through our images to see the text descriptions produced by the model in response to the query, "What is happening in this image?"

user_query = "What is happening in this image?"

for image in encoded_images:
    request_body = augment_api_request_body(user_query, image)
    response = requests.post(
        credentials.get("url"),
        headers=headers,
        json=request_body
    )
    if response.status_code != 200:
        raise Exception("Non-200 response: " + str(response.text))
    data = response.json()
    print(data['choices'][0]['message']['content'])

Output:

The image depicts a bustling city street, with a busy road and sidewalks lined with tall buildings, trees, and streetlights. The street is filled with cars, taxis, and pedestrians, creating a vibrant and dynamic atmosphere. The scene is set against a backdrop of towering skyscrapers and bustling city life, capturing the energy and activity of urban living.

This image shows a woman running in the street. The woman is wearing a yellow hoodie, black capri leggings, and black sneakers. She has a white headphone around her neck and her brown hair is in a ponytail. The woman appears to be running in the street, with her right leg extended behind her and her left leg bent in front of her. Her arms are bent at the elbows, with her right arm extended behind her and her left arm extended in front of her. In the background, there is a large white building with a row of windows and doors. The building appears to be an industrial or commercial structure, possibly a warehouse or office building. The street in front of the building is empty, with no other people or vehicles visible. The overall atmosphere of the image suggests that the woman is engaged in some form of physical activity or exercise, possibly jogging or running for fitness or recreation.

The image depicts a flooded area, with water covering the ground and surrounding buildings. The water is dark brown and appears to be deep, with some areas reaching up to the roofs of the buildings. There are several buildings visible in the image, including what appears to be a house, a barn, and some smaller structures. The buildings are all partially submerged in the water, with some of them appearing to be damaged or destroyed. In the background, there are fields and crops that are also flooded. The fields are covered in water, and the crops are bent over or lying flat on the ground. There are also some trees and other vegetation visible in the background, but they appear to be struggling to survive in the flooded conditions. Overall, the image suggests that a severe flood has occurred in this area, causing significant damage to the buildings and crops. The floodwaters appear to be deep and widespread, and it is likely that the area will take some time to recover from the disaster.

This image shows a close-up of a nutrition label on a food product, with a person's finger pointing to the label. The label is white with black text and lists various nutritional information, including serving size, calories, fat content, cholesterol, sodium, carbohydrates, dietary fiber, and vitamins. The label also includes a table with nutritional values based on a 2,000 calorie diet. The background of the image is dark gray, suggesting that it may be a product photo or advertisement for the food item. Overall, the image appears to be intended to inform consumers about the nutritional content of the product and help them make informed purchasing decisions.

The Llama 3.2-90b-vision-instruct model was able to successfully caption each image in significant detail.

Step 7. Object detection

Now that we have showcased the model's ability to perform image-to-text conversion in the previous step, let's ask the model some questions that require object detection. Our system prompt will remain the same as in the previous section. The difference now will be in the user query. Regarding the second image depicting the woman running outdoors, we will be asking the model, "How many cars are in this image?"

image = encoded_images[1]
user_query = "How many cars are in this image?"
request_body = augment_api_request_body(user_query, image)
response = requests.post(
    credentials.get("url"),
    headers=headers,
    json=request_body
)
if response.status_code != 200:
    raise Exception("Non-200 response: " + str(response.text))
data = response.json()
print(data['choices'][0]['message']['content'])

Output:

There is only one car in this image.

The model correctly identified the single vehicle in the image. Now, let's ask about the damage depicted in the image of flooding.

image = encoded_images[2]
user_query = "How severe is the damage in this image?"
body = augment_api_request_body(user_query, image)
response = requests.post(
    credentials.get("url"),
    headers=headers,
    json=body
)
if response.status_code != 200:
    raise Exception("Non-200 response: " + str(response.text))
data = response.json()
print(data['choices'][0]['message']['content'])

Output:

The image depicts a severe flood scenario, with water covering the entire area up to the rooftops of the buildings. The water level is high enough to submerge the lower floors of the buildings, causing significant damage to the structures and their contents. The floodwaters also appear to be contaminated with debris and sediment, which could further exacerbate the damage. Overall, the damage in this image appears to be catastrophic, with the potential for long-term consequences for the affected community.

This response highlights the value that multimodal AI holds for domains like insurance. The model was able to assess the severity of the damage caused by the flooding. This could be a powerful tool for improving insurance claim processing times.

Next, let's ask the model how much sodium content is in the nutrition label image.

image = encoded_images[3]
user_query = "How much sodium is in this product?"
body = augment_api_request_body(user_query, image)
response = requests.post(
    credentials.get("url"),
    headers=headers,
    json=body
)
if response.status_code != 200:
    raise Exception("Non-200 response: " + str(response.text))
data = response.json()
print(data['choices'][0]['message']['content'])

Output:

**Sodium Content:**

The product contains **640mg of sodium**.

Great! The model was able to discern objects within the images in response to the user queries. We encourage you to try out more queries to further explore the model's performance.
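
If you want to try additional queries, one option is to wrap the request pattern used above in a small helper. The ask_about_image function below is a hypothetical convenience wrapper built from the same code shown in this tutorial, not part of the watsonx API.

def ask_about_image(user_query, image):
    # Reuse the same request pattern as the previous steps
    request_body = augment_api_request_body(user_query, image)
    response = requests.post(
        credentials.get("url"),
        headers=headers,
        json=request_body
    )
    if response.status_code != 200:
        raise Exception("Non-200 response: " + str(response.text))
    return response.json()['choices'][0]['message']['content']

# Example: a follow-up question about the first image
print(ask_about_image("What time of day does this appear to be?", encoded_images[0]))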

Summary

In this tutorial, you used the Llama 3.2-90b-vision-instruct model to perform multimodal operations including image captioning and visual question answering. For more use cases of this model, we encourage you to check out the official documentation page, where you will find more information on the model’s parameters and capabilities. The outputs above show the model's ability to extract and reason over information from combined image and text input.

Related solutions

IBM® watsonx.ai™

Train, validate, tune and deploy generative AI, foundation models and machine learning capabilities with ease and build AI applications in a fraction of the time with a fraction of the data.

IBM Consulting™ services

Redefine how you work with AI for business. IBM Consulting™ is working with global clients and partners to co-create what’s next in AI. Our diverse, global team of more than 20,000 AI experts can help you quickly and confidently design and scale cutting-edge AI solutions and automation across your business.

IBM's AI solutions

IBM’s artificial intelligence solutions help you build the future of your business. These include: IBM® watsonx™, our AI and data platform and portfolio of AI-powered assistants; IBM® Granite™, our family of open-sourced, high-performing and cost-efficient models trained on trusted enterprise data; IBM Consulting, our AI services to redesign workflows; and our hybrid cloud offerings that enable AI-ready infrastructure to better scale AI.

Resources

Meta’s Llama 3.2 models now available on watsonx, including multimodal 11B and 90B models (News)

What is multimodal AI? (Related topic)

Multimodal AI for Healthcare and Life Sciences (Research)

Multimodal Machine Learning in Image-Based and Clinical Biomedicine: Survey and Prospects (Research)
