AI Query Optimizer

The Db2 AI Query Optimizer is an evolution of the traditional Db2 cost-based optimizer. The traditional optimizer is infused with AI to enhance the key functions of the optimizer that are suited to using AI techniques. This AI infusion can result in improved and stabilized performance while also simplifying the tuning of the optimizer.

Optimizer Evolution

The traditional Db2 cost-based optimizer uses an estimate of the number of rows that are processed by an access plan operator to then accurately cost that operator. This estimate is called the cardinality estimate. The cardinality estimate is the most important input to the optimizer cost model, and its accuracy depends on the statistics that are collected from the database by the RUNSTATS command. The default set of statistics collected for the cardinality estimate are independent column statistics that do not capture correlation. Alternatively, collecting column group statistics can minimize errors for queries containing multiple equality predicates. However, all relevant combinations of column groups need to be collected, and even if available, the independence assumption is still applied to other types of predicates, such as range predicates.

The AI Query Optimizer uses a learned model for estimating cardinality to improve cardinality estimation. This model can be built automatically based on sampling the data in the tables to learn the underlying correlations of the data and predict accurate cardinality estimates. An Artificial Neural Network (ANN) model is used to learn and approximate selectivity functions for basic BETWEEN predicates that cover one or more columns in a set of predicates. A model is built for each table, learning the distribution of each column and the joint distribution of a combination of columns. All other predicates are expressed in this form for internal optimizer interactions with the learned model. For example, equality predicates are represented as range predicates where the lower bound and the upper bound are the same. A single range predicate uses the upper bound or lower bound of the column domain of values to compose the BETWEEN predicate. Other predicate types such as IN and OR predicates also have their estimates derived from the basic BETWEEN predicate form.

The current capability of the AI Query Optimizer covers cardinality estimation with predicates on a single table. Commonly used predicate patterns are handled including equality, range, IN and OR predicates involving a column and a constant. Predicates with expressions or parameter markers are not handled by the AI Query Optimizer and are instead handled by the traditional optimizer algorithms. The seamless integration of AI Query Optimizer models allows for any predicate that is not covered to use the traditional optimizer capabilities. The new models work in conjunction with the traditional optimizer functionality, allowing the system to handle queries of the same complexity handled by the traditional optimizer. For example, while predicates with parameter markers of the form c1 >= ? are not handled in the current implementation of the model, the traditional optimizer covers this aspect. Other supported predicates can still derive benefit from the model.

Automatic discovery and training of models

Model discovery and training is performed automatically as part of automatic table RUNSTATS operations for a database. The discovery operation mines the data by using a sample to find strongly correlated columns that identifies and ranks columns that will benefit the most from the model. This functionality helps keep the training time low by focusing on strongly correlated columns. Cardinality estimation for predicates on columns that are not included in the model are handled independently using statistics.

Model training is performed after statistics are collected, automatically generating training and validation query sets from a sample of data. The generated queries that have various combinations of BETWEEN predicates on the columns included in the model are run against the sample. The predicate pattern and the selectivity estimate for each query is encoded and used to train and build the model. Trained models are stored in system catalogs, and the existence of trained models for particular tables can be determined by querying the SYSCAT.AIOPT_TABLECARDMODELS catalog view.

Entries in the catalogs are distinguished by a model schema, name, and version, where the model schema is always set to "SYSIBM". The first time a model is defined for a given table, a unique model name is created and associated with that model. Subsequently trained model versions on the same table will share the same name of the first model version. When a model is trained, a new version for that model is created. The current version of a model is version 0, which is considered the "active" version for that model and is the model instance used for inference. Only one previous version of the model is stored, meaning that each model can have at most 2 versions. The REVERT clause of the ALTER MODEL statement can be used to swap the current, active version with the previous stored version.

The retraining of models is determined using the same table data change analysis that is performed by automatic table RUNSTATS operations. When a statistics collection is required, it also drives a re-train of the model.

As with statistics, the optimizer uses a catalog cache to manage models during query optimization. This management results in higher catalog cache requirements. You can tune the value of the catalogcache_sz database configuration parameter to accommodate the additional memory needed to manage the models.

Model discovery and training occurs for objects that have automatic asynchronous statistics collection except for system catalogs, views, and nicknames.

Model discovery and training activities are registered in the statistics log. This includes whether they are successful or not and some metadata of the model. For example, for a single table model, the set of columns discovered and included in the trained model are included in the log records. As with other entries in the statistics log, you can query the logs using the SYSPROC.PD_GET_DIAG_HIST table function to view details about the discovery and training activities. Relevant entries look like this:

DISCOVER: TABLE CARDINALITY MODEL : <Object name with schema> : AT "2022-03-11-12.06.49.325975" : BY "Asynchronous" : start
…
DISCOVER: TABLE CARDINALITY MODEL : <Object name with schema> : AT "2022-03-11-12.06.49.327990" : BY "Asynchronous" : success
…
TRAIN   : TABLE CARDINALITY MODEL : <Object name with schema> : AT "2022-03-11-12.06.49.329230" : BY "Asynchronous" : start
 …
TRAIN   : TABLE CARDINALITY MODEL : <Object name with schema> : AT "2022-03-11-12.06.52.367035" : BY "Asynchronous" : success

Model discovery and training policies

An AI Query Optimizer policy (a defined set of rules or guidelines) can be used to specify which tables are allowed to have table cardinality models automatically discovered and defined on them. For example, if table T does not have a model defined on it and the policy disables models for T, the policy prevents automatic RUNSTATS from defining discovered models on T, but still allows automatic RUNSTATS to collect statistics on it. If a discovered model was previously defined on T prior to applying a policy to exclude it, subsequent automatic RUNSTATS collections on T continue to retrain the model, but not run any further discovery. To remove re-training from consideration, the ALTER MODEL statement can be used to disable further training.

You can use the RUNSTATS policy to specify the automated behavior of automatic table RUNSTATS operations for a database. Since model discovery and training is performed as part of automatic table RUNSTATS operations, the RUNSTATS policy is also used to specify the automated behavior of model discovery and training. A policy excluding table T from automatic RUNSTATS consideration also excludes model discovery and training on the table.

The automatic maintenance routines are used to get and set the AI Query Optimizer policy by specifying the AUTO_AI_OPT policy type. These routines include AUTOMAINT_GET_POLICY, AUTOMAINT_GET_POLICYFILE, AUTOMAINT_SET_POLICY, and AUTOMAINT_SET_POLICYFILE.

Predicate selectivity estimation prediction

If a model exists, the optimizer calls the model prediction interface. The optimizer supplies to the model prediction interface a combination of all supported predicates, expressed in terms of BETWEEN predicates, for each applicable column encoded by the model equivalent to how the training queries were encoded. The prediction returned represents the combined predicate selectivity. When this prediction is returned to the optimizer, any predicate that is not currently supported by the model, for example predicates with expressions, is handled independent of the model prediction using the available statistics.

Models are only used on equality, range, BETWEEN, IN and OR predicates referencing a single table. This also applies to OR predicate sub-terms that qualify these requirements. Join predicates, parameter markers, and expressions are not supported.

The following example shows what is and is not supported:


SELECT * FROM T1, T2 
WHERE 
   T1.C1 = ‘abc’ AND                   -- Equality predicates supported
   T1.C6 IN (5, 3, 205) AND            -- IN predicates supported
   T1.C2 BETWEEN 5 AND 10 AND          -- BETWEEN predicates supported
   T2.C3 <= 120 AND                    -- Range predicates supported
   ((T1.C4 > 5 AND T1.C5 < 20) OR  
     (T1.C4 < 2 AND T1.C5 = 100)) AND  -- OR predicates with eligible sub-term predicates supported

   T1.C0 = T2.C0 AND T1.C7 > T2.C7 AND -- Join predicate not supported
   T1.C3 = ? AND                       -- Parameter markers not supported
   MOD(T1.C4, 10) = 1                  -- Expressions not supported

The optimizer uses the model to estimate the combined selectivity of the supported predicates, referred to as the model selectivity. The estimate for the remaining unsupported predicates is computed using the table statistics collected by the RUNSTATS operation, referred to as the non-model selectivity. The optimizer combines the model and non-model selectivity estimates independently when computing the cardinality estimate.

Explain optimizer model usage

You can use the explain facility to review how the optimizer used the model to estimate the cardinality at each operator in the access plan. You can also review the model information associated with any of the database objects used in the access plan. Model information associated with each predicate applied in the access plan is included in the EXPLAIN_PREDICATE table and in the FF_SOURCE_SCHEMA, FF_SOURCE_NAME, and FF_SOURCE_TYPE columns. The “Predicates” information in each operator includes a description of each predicate, including the filter factor estimate. If the filter factor of any individual predicate is computed using a model, the model used to estimate the predicate is shown as the “Filter Factor Source”:


Predicates:
----------
2) Sargable Predicate, 
        Comparison Operator: 	    Less Than or Equal (<=)
        Subquery Input Required:    No
        Filter Factor: 	            0.934924
        Filter Factor Source:       SYSIBM.SQL240506160304427566
        Predicate Text: 
        --------------
        (Q1.C3 <= 120)

The “Table Cardinality Model Predicates” information shows the list of predicates and the model used to compute the combined selectivity for that list of predicates:


Table Cardinality Model Predicates:
-----------------------------------
Model: 	SYSIBM.SQL240506160304427566
Predicates:
        1) (Q1.C1 = ’abc’)
        2) (Q1.C3 <= 120)

The model information is included in the EXPLAIN_OBJECT table and in the MODEL_SCHEMA and MODEL_NAME columns for each object referenced in the access plan that has an associated model. The “Objects Used in Access Plan” information shows the associated model, including the columns included in the model:


Objects Used in Access Plan:
---------------------------
Schema:   MYSCHEMA
Name:     T1
Type:     Table
...
        Model Schema: 	SYSIBM
        Model Name: 	SQL240506160304427566
        Columns in model:
                C1
                C3