Cardinality underestimation in the DB2 Query Compiler due to overloaded dimension tables
In a star schema the fact table links to a number of dimension tables.
Each row in the fact table has a column with an identifier mapping to a specific row in one of the dimensions.
This column is used to join back to the dimension table in SQL statements.
When the DB2 SQL Query Compiler, a.k.a. the optimizer, estimates the size of the result set for a join, it considers the cardinality of the join columns in each of the tables being joined.
The cardinality of a column is defined as the number of distinct values for that column within a certain table ( not to be confused with the number of rows in the table ).
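In SQL terms, the cardinality of a column c1 in a table t1 ( hypothetical names, for illustration only ) is what the following query returns; runstats collects this value and stores it as COLCARD in the catalog:
select count(distinct c1) from t1;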
A predicate that is applied against a result set will filter out some of the rows. The degree to which this happens is called the filter factor. A lower filter factor means that fewer rows will match the predicate criteria, hence more rows are filtered out of the result set.
For a join predicate, the generic formula to determine the filter factor is 1 / max( colcard( left hand side ), colcard( right hand side ) )
So if, for instance, in the predicate A = B, A has a column cardinality of 5 and B has 10, then the join filter factor will be 1 / max( 5, 10 ) = 1 / 10
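The estimated number of rows coming out of the join is then, in its simplest form ( ignoring any other predicates ), the product of the estimated input sizes and this filter factor:
estimated join cardinality = rows( A side ) * rows( B side ) * 1 / max( colcard( A ), colcard( B ) )
We will see this formula at work in the example below.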
In a traditional star schema a number of dimension tables, in particular time related dimension tables, might get pre-populated with a larger amount of data than is currently in use.
For example, a fact table can contain sales data ranging over a few quarters, but the related dimension table might have date values going as far into the future as the year 2100.
This makes sense from an ease of use point of view, as populating the fact table doesn't need to take into account whether there is a corresponding row in the dimension table.
However, the combination of the generic join predicate filter factor formula and this overloading of dimension table data can lead to a problem in the estimation produced by the DB2 Query Compiler.
I have tried to illustrate this in the following simple example:
create table fact ( dim1 int, dim2 int, dim3 int, a int, b int , c int );
create table dim1 ( col1 int, col2 date );
create table dim2 ( col1 int, col2 int );
create table dim3 ( col1 int, col2 int );
!perl -e 'for($i=0; $i<100000; $i++) { printf("%d,%d,%d,%d,%d,%d\n", $i % 100 ,$i % 1000 ,$i % 1000,$i,$i,$i ) }' > ins ;
load from ins of del insert into fact;
!rm ins;
!echo "connect to star;" > ins;
!perl -e 'for($i=0; $i<10000; $i++) { printf("insert into dim1 values ( %d,current date + %d days );\n",$i,$i ) }' >> ins ;
!db2 +o -tf ins;
!rm ins;
!perl -e 'for($i=0; $i<10000; $i++) { printf("%d,%d\n",$i,$i ) }' > ins ;
load from ins of del insert into dim2;
load from ins of del insert into dim3;
!rm ins;
create unique index idx1 on dim1 ( col1 );
create unique index idx2 on dim2 ( col1 );
create unique index idx3 on dim3 ( col1 );
create index idx_fact on fact ( dim1, dim2, dim3 );
runstats on table fact with distribution and sampled detailed indexes all;
runstats on table dim1 with distribution and sampled detailed indexes all;
runstats on table dim2 with distribution and sampled detailed indexes all;
runstats on table dim3 with distribution and sampled detailed indexes all;
As we can see, the dimension table "dim1" is populated with date values ranging from today to almost 10,000 days into the future.
The fact table however only references a limited range, up to 100 days from now ( because of the modulo 100 operation used when generating the data, only dimension key values 0 through 99 appear in the fact table ).
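As a quick sanity check ( the BENNYV schema name is taken from the access plans shown below ), the column cardinalities that runstats collected can be queried from the catalog:
select substr(tabname,1,10) tabname, substr(colname,1,10) colname, colcard
from syscat.columns
where tabschema = 'BENNYV' and tabname in ('FACT','DIM1') and colname in ('DIM1','COL1');
This should report a colcard of 100 for FACT.DIM1 and 10000 for DIM1.COL1.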
The example statement joins the fact table with the dimension tables :
select a,b,c from fact , dim1, dim2, dim3 where
fact.dim1 = dim1.col1 and
fact.dim2 = dim2.col1 and
fact.dim3 = dim3.col1 and
dim1.col2 between current date and current date + 50 days;
To compare the actual number of rows returned with the estimate produced by the optimizer, we can use the db2caem utility:
( sel is a file containing the select statement )
db2caem -d star -sf sel
This produces a subdirectory containing a file called db2caem.exfmt.1. This is the same type of output as normally produced by the db2exfmt utility, but it crucially now also contains the actual runtime number of rows passing through each plan operator.
Total Cost: 10711.3
Query Degree: 1
                                          Rows
                                       Rows Actual
                                         RETURN
                                          ( 1)
                                          Cost
                                           I/O
                                            |
                                           400
                                          51000
                                         ^NLJOIN
                                          ( 2)
                                         10711.3
                                           NA
                            /---------------+---------------\
                           400                              1
                          51000                             1
                         ^NLJOIN                         IXSCAN
                          ( 3)                            ( 8)
                         10261.3                         9.34627
                           NA                              NA
               /------------+-------------\                 |
              400                         1                -1
             51000                        1                NA
            ^HSJOIN                    IXSCAN         INDEX: BENNYV
             ( 4)                       ( 7)              IDX3
            9811.18                    9.34627             Q1
              NA                         NA
      /--------+--------\                 |
   100000              40                -1
   100000              51                NA
   TBSCAN          TBSCAN         INDEX: BENNYV
    ( 5)              ( 6)              IDX2
   8024.9          1071.16               Q2
     NA                NA
      |                 |
   100000             10000
     NA                NA
TABLE: BENNYV     TABLE: BENNYV
    FACT              DIM1
     Q4                Q3
As can be seen, the estimated number of rows after the join between DIM1 and FACT is 400, but in reality the number of rows is far higher: 51000
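It is easy to reconstruct where the 400 comes from by plugging the collected statistics into the generic formula ( my arithmetic, based on the estimates visible in the plan ):
estimate = rows( FACT ) * estimated rows( DIM1 after date predicate ) * join filter factor
         = 100000 * 40 * 1 / max( 100, 10000 )
         = 400
In reality, every fact row references one of the first 100 days, 51 of those days fall within the date range, and each day value occurs in 1000 fact rows, giving 51 * 1000 = 51000 rows.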
This is off by more than a factor of 100, which is significant. Some degree of error is common and simply down to the fact that estimates are being made, but margins of error this large are problematic.
This is a problem as the optimizer will use that number 400 further up in the plan, where it can lead to wrong plan choices, suboptimal execution plans and ultimately bad SQL performance.
The best solution for this problem is to give the optimizer more information about the join. The way this is done is by creating a view that recreates the join and collecting statistical information about the join result.
This is commonly referred to as a statistical view, and here we can create one using:
create view sv1 as ( select dim1.* from dim1, fact where fact.dim1 = dim1.col1 );
alter view sv1 enable query optimization;
runstats on view sv1 with distribution;
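As a quick check ( a sketch; the SYSSTAT catalog views are the standard place where these statistics land ), we can verify the statistics collected on the view:
select card from sysstat.tables where tabschema = 'BENNYV' and tabname = 'SV1';
This should report 100000, confirming that every fact row finds a matching dim1 row in the view's join result.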
The optimizer will recognize that the join in the statement matches the join in the statistical view, and after matching it will augment the generic formula with the statistics it has about the join result, including the distribution of the col2 dates within that result.
It is important to make the statistical view generic ( no local predicates ), so it can be of benefit to many different statements using the same join.
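For example, a hypothetical variation of our statement with a different date range would be matched against the same view:
select a,b,c from fact , dim1, dim2, dim3 where
fact.dim1 = dim1.col1 and
fact.dim2 = dim2.col1 and
fact.dim3 = dim3.col1 and
dim1.col2 between current date + 60 days and current date + 90 days;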
Using the same db2caem command we can now see what the estimates have become :
Total Cost: 12434
Query Degree: 1
                                          Rows
                                       Rows Actual
                                         RETURN
                                          ( 1)
                                          Cost
                                           I/O
                                            |
                                         50981.3
                                          51000
                                         ^HSJOIN
                                          ( 2)
                                          12434
                                           NA
                            /---------------+---------------\
                         50981.3                          10000
                          51000                           10000
                         ^HSJOIN                         TBSCAN
                          ( 3)                            ( 8)
                         11160.1                         752.45
                           NA                              NA
               /------------+-------------\                 |
            50981.3                     10000             10000
             51000                      10000              NA
            ^HSJOIN                    TBSCAN         TABLE: BENNYV
             ( 4)                       ( 7)              DIM3
            9886.04                    752.45              Q1
              NA                         NA
      /--------+--------\                 |
   100000              40               10000
   100000              51                NA
   TBSCAN          TBSCAN         TABLE: BENNYV
    ( 5)              ( 6)              DIM2
   8024.9          1071.16               Q2
     NA                NA
      |                 |
   100000             10000
     NA                NA
TABLE: BENNYV     TABLE: BENNYV
    FACT              DIM1
     Q4                Q3
The estimates are clearly accurate now. We can see that this has led to a higher cost for the execution plan, but it is a common misconception that a higher cost will lead to a slower plan.
The increase in cost here merely reflects the fact that our estimates are now close to reality, whereas we had a severe underestimation at first.
We can also see that the change in estimation has led to a change in join method higher up in the access plan ( from Nested Loop joins to Hash Joins ).
When investigating a poor access plan one might not immediately detect this problem, but the use of db2caem can help expose severe underestimations, and as we saw, statistical views can easily correct them.
Please leave a comment if there are any questions or feedback.