Generative Search

Generative search combines generative AI models with search capabilities, extending the search and retrieval capabilities of search solutions with the content generation and summarization capabilities of large language models (LLMs). The conceptual architecture of generative search is shown in the diagram below.

Generative AI architecture patterns

Client documents, including Word documents, PDFs, web data, and other structured and unstructured text, are ingested into a Content Analysis component and prepared for semantic searching.
A user submits a question using the Content Analysis user interface.
The Content Analysis component uses the user's question to identify the documents and passages of information that are most relevant to it. Depending on the question, the component sends it, along with the relevant documents and passages, to an LLM to generate a human-like response.
The LLM returns the generated response, which is then passed to the user.
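
The flow above can be sketched in a few lines of Python. This is a minimal illustration, not the Watson Discovery or watsonx.ai API: the `Passage` type, the `retrieve`, `build_prompt`, and `answer` functions, and the keyword-overlap ranking are all hypothetical stand-ins for the semantic search performed by the Content Analysis component, and `generate` is any callable that sends a prompt to an LLM and returns its text.

```python
import re
from dataclasses import dataclass


@dataclass
class Passage:
    doc_id: str
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, passages: list[Passage], top_k: int = 2) -> list[Passage]:
    """Toy keyword-overlap ranking (an assumption; real systems use semantic search)."""
    terms = tokenize(question)
    scored = [(len(terms & tokenize(p.text)), p) for p in passages]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [p for score, p in ranked[:top_k] if score > 0]


def build_prompt(question: str, context: list[Passage]) -> str:
    """Combine the question with the retrieved passages into one LLM prompt."""
    sources = "\n".join(f"[{p.doc_id}] {p.text}" for p in context)
    return (
        "Answer the question using only the passages below.\n"
        f"Passages:\n{sources}\n"
        f"Question: {question}\nAnswer:"
    )


def answer(question: str, passages: list[Passage], generate) -> str:
    """Retrieve relevant passages, prompt the LLM, and return its response."""
    context = retrieve(question, passages)
    if not context:
        return "No relevant documents were found."
    return generate(build_prompt(question, context))
```

Passing `generate` as a parameter keeps the retrieval logic independent of any particular model, so the same pipeline can be exercised with a stub function in tests and with a deployed LLM in production.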

IBM Product Architecture

The mapping of the IBM watsonx family of products to the generative search pattern is shown in the diagram below. IBM Watson Discovery is used to provide the document ingestion, document understanding, and content analysis and insights of the Content Analysis component. IBM watsonx.ai is used to select, tune, test, and deploy the large language model used to extend Watson Discovery.

On-premises / private deployments

Some clients do not have watsonx.ai available in their local region, or may have security concerns or regulatory requirements that prevent them from using the watsonx.ai SaaS solution. For these clients, we offer watsonx.ai as a set of containerized services that can be deployed on Red Hat OpenShift running within the clients' data centers, or within a virtual private cloud (VPC) within a cloud-service provider's infrastructure.

Architecture Decisions and Considerations

Many factors go into choosing a model that will work well for your project.

The model's license may restrict how it can be used. For example, a model's license may prevent it from being used as part of a commercial application.

The data set used to train the model has a direct impact on how well the model works for a specific application and significantly affects the risk that the model may generate nonsensical, offensive, or simply unwanted responses. Similarly, models trained on copyrighted or private data may open their users to legal liability. IBM provides full training data transparency and indemnification from legal claims arising from its models.

The size of the model (the number of parameters it contains) and the size of its context window (how long a passage of text the model can accept) affect model performance, throughput, and resource requirements, and by extension cost. While it's tempting to adopt a "bigger is better" philosophy and choose a 20 billion parameter model, the resource requirements and the improvement (if any) in accuracy may not justify it. Recent studies have shown that smaller models can significantly outperform larger ones for some solutions.
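
As a rough illustration of how parameter count drives resource requirements, the sketch below estimates the memory needed just to hold a model's weights. It assumes 16-bit weights (2 bytes per parameter) and ignores activations, the attention cache, and runtime overhead, so real deployments need more. The function name and the 3 billion parameter comparison model are hypothetical.

```python
def weight_memory_gb(num_parameters: float, bytes_per_parameter: int = 2) -> float:
    """Approximate memory for model weights only, in gigabytes (10^9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


# Compare a hypothetical 3 billion parameter model with a 20 billion parameter one.
for name, params in [("3B", 3e9), ("20B", 20e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB at 16-bit precision")
```

Under these assumptions the 20 billion parameter model needs about 40 GB for weights alone, against about 6 GB for the smaller model, which is why any accuracy gain has to be weighed against hardware cost.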

Any fine-tuning applied to a model can affect its suitability for a task. For example, IBM offers two versions of the Granite model: one tuned for general chat applications, and another tuned to follow instructions.

Other considerations when choosing a model include:

Selection of model parameters, e.g., the model temperature, to balance the creation of human-like text and factual responses. Setting the model temperature to a low value will generate consistent but potentially uninteresting or overly terse responses, while setting the temperature to a high value will introduce more variety into the responses but will add unpredictability in the response length and content.
Selection and implementation of model guardrails to guard against ineffective or offensive results.
The language of the client data and user prompts must also be taken into account. The majority of LLMs are trained on English language text and can often translate between English and other languages with varying levels of expertise. Applications requiring multi-lingual or localized language support may require the use of multiple models trained in each of the supported languages, or implementation of a translation step to translate multi-lingual inputs into English or another 'base' language.
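
To make the temperature trade-off concrete, the following sketch applies temperature scaling to a softmax over token scores, which is the usual way temperature enters sampling. The logits are made-up values for three candidate tokens, and the function name is illustrative rather than part of any product API.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring token, so sampling becomes more repeatable; higher temperatures flatten the distribution, so less likely tokens are chosen more often.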

Some clients may wish to have a feedback mechanism to continuously refine the system's responses. Over time, this feedback can be used both to improve the Watson Discovery configuration (e.g., by refining search parameters) and to fine-tune the LLM.
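
One simple form such a feedback loop could take is collecting helpful / not helpful votes per query and flagging the queries that consistently score poorly for review. The sketch below assumes feedback arrives as `(query, helpful)` pairs; the function name, the minimum vote count, and the threshold are illustrative choices, not part of any IBM product.

```python
from collections import defaultdict


def low_rated_queries(feedback, min_votes=3, threshold=0.5):
    """Return queries whose share of helpful votes falls below the threshold."""
    votes = defaultdict(list)
    for query, helpful in feedback:
        votes[query].append(helpful)
    return sorted(
        query
        for query, marks in votes.items()
        # Ignore queries with too few votes to be meaningful.
        if len(marks) >= min_votes and sum(marks) / len(marks) < threshold
    )
```

Queries flagged this way are candidates for tuning the search configuration or for inclusion in a fine-tuning data set, depending on whether retrieval or generation is at fault.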

Clients in regulated industries may wish to include an additional layer of monitoring to adhere to regulations, and provide transparency to users about how their queries are being handled and the role of AI in generating responses.

As with any integrated system, continuously monitor performance, user satisfaction, and any potential issues. Be prepared to scale the infrastructure as user demand grows.

Contributors

Bryan Kyle, Manav Gupta, Mihai Criveti, Chris Kirby

Updated: November 30, 2023