Cardinality underestimation in the DB2 Query Compiler due to overloaded dimension tables
In a star schema the fact table links to a number of dimension tables.
Each row in the fact table has a column with an identifier mapping to a specific row in one of the dimensions.
This column is used to join back to the dimension table in SQL statements.
When the DB2 SQL Query Compiler, a.k.a. the optimizer, estimates the size of the result set for a join, it considers the cardinality of the join columns in each of the tables being joined.
The cardinality of a column is defined as the number of distinct values for that column within a certain table ( not to be confused with the number of rows in the table ).
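In SQL terms, the cardinality of a column c1 in a table t1 ( hypothetical names, for illustration only ) is what the following query returns; runstats collects this value and stores it as COLCARD in the catalog:
select count(distinct c1) from t1;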
A predicate that is applied against a result set will filter out some of the rows. The degree to which this happens is called the filter factor. A lower filter factor means that fewer rows will match the predicate criteria, hence more rows are filtered out of the result set.
For a join predicate, the generic formula to determine the filter factor is 1 / max( colcard( left hand side ), colcard( right hand side ) )
So if, for instance, in the predicate A = B, A has a column cardinality of 5 and B has 10, then the join filter factor will be 1 / max( 5, 10 ) = 1 / 10
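The estimated number of rows coming out of the join is then, in its simplest form ( ignoring any other predicates ), the product of the estimated input sizes and this filter factor:
estimated join cardinality = rows( A side ) * rows( B side ) * 1 / max( colcard( A ), colcard( B ) )
We will see this formula at work in the example below.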
In a traditional star schema a number of dimension tables, in particular time related dimension tables, might get pre-populated with a larger amount of data than is currently in use.
For example, a fact table can contain sales data ranging over a few quarters, but the related dimension table might have date values going as far into the future as the year 2100.
This makes sense from an ease of use point of view, as populating the fact table doesn't need to take into account whether there is a corresponding row in the dimension table.
However, the combination of the generic join predicate filter factor formula and this overloading of dimension table data can lead to a problem in the estimation produced by the DB2 Query Compiler.
I have tried to illustrate this in the following simple example:
create table fact ( dim1 int, dim2 int, dim3 int, a int, b int , c int );
create table dim1 ( col1 int, col2 date );
create table dim2 ( col1 int, col2 int );
create table dim3 ( col1 int, col2 int );
!perl -e 'for($i=0; $i<100000; $i++) { printf("%d,%d,%d,%d,%d,%d\n", $i % 100 ,$i % 1000 ,$i % 1000,$i,$i,$i ) }' > ins ;
load from ins of del insert into fact;
!rm ins;
!echo "connect to star;" > ins;
!perl -e 'for($i=0; $i<10000; $i++) { printf("insert into dim1 values ( %d,current date + %d days );\n",$i,$i ) }' >> ins ;
!db2 +o -tf ins;
!rm ins;
!perl -e 'for($i=0; $i<10000; $i++) { printf("%d,%d\n",$i,$i ) }' > ins ;
load from ins of del insert into dim2;
load from ins of del insert into dim3;
!rm ins;
create unique index idx1 on dim1 ( col1 );
create unique index idx2 on dim2 ( col1 );
create unique index idx3 on dim3 ( col1 );
create index idx_fact on fact ( dim1, dim2, dim3 );
runstats on table fact with distribution and sampled detailed indexes all;
runstats on table dim1 with distribution and sampled detailed indexes all;
runstats on table dim2 with distribution and sampled detailed indexes all;
runstats on table dim3 with distribution and sampled detailed indexes all;
As we can see, the dimension table "dim1" is populated with date values ranging from today to almost 10,000 days into the future.
The fact table however only references a limited range, up to 100 days from now ( because of the modulo 100 operation used when generating the data, only dimension key values 0 through 99 appear in the fact table ).
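As a quick sanity check ( the BENNYV schema name is taken from the access plans shown below ), the column cardinalities that runstats collected can be queried from the catalog:
select substr(tabname,1,10) tabname, substr(colname,1,10) colname, colcard
from syscat.columns
where tabschema = 'BENNYV' and tabname in ('FACT','DIM1') and colname in ('DIM1','COL1');
This should report a colcard of 100 for FACT.DIM1 and 10000 for DIM1.COL1.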
The example statement joins the fact table with the dimension tables :
select a,b,c from fact , dim1, dim2, dim3 where
fact.dim1 = dim1.col1 and
fact.dim2 = dim2.col1 and
fact.dim3 = dim3.col1 and
dim1.col2 between current date and current date + 50 days;
To compare the actual number of rows returned with the estimate produced by the optimizer, we can use the db2caem utility:
( sel is a file containing the select statement )
db2caem -d star -sf sel
This produces a subdirectory containing a file called db2caem.exfmt.1. This is the same type of output as normally produced by the db2exfmt utility, but it crucially now also contains the actual runtime number of rows passing through each plan operator.
Total Cost: 10711.3
Query Degree: 1
                                          Rows
                                       Rows Actual
                                         RETURN
                                          ( 1)
                                          Cost
                                           I/O
                                            |
                                           400
                                          51000
                                         ^NLJOIN
                                          ( 2)
                                         10711.3
                                           NA
                            /---------------+---------------\
                           400                              1
                          51000                             1
                         ^NLJOIN                         IXSCAN
                          ( 3)                            ( 8)
                         10261.3                         9.34627
                           NA                              NA
               /------------+-------------\                 |
              400                         1                -1
             51000                        1                NA
            ^HSJOIN                    IXSCAN         INDEX: BENNYV
             ( 4)                       ( 7)              IDX3
            9811.18                    9.34627             Q1
              NA                         NA
      /--------+--------\                 |
   100000              40                -1
   100000              51                NA
   TBSCAN          TBSCAN         INDEX: BENNYV
    ( 5)              ( 6)              IDX2
   8024.9          1071.16               Q2
     NA                NA
      |                 |
   100000             10000
     NA                NA
TABLE: BENNYV     TABLE: BENNYV
    FACT              DIM1
     Q4                Q3
As can be seen, the estimated number of rows after the join between DIM1 and FACT is 400, but in reality the number of rows is far higher: 51000
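It is easy to reconstruct where the 400 comes from by plugging the collected statistics into the generic formula ( my arithmetic, based on the estimates visible in the plan ):
estimate = rows( FACT ) * estimated rows( DIM1 after date predicate ) * join filter factor
         = 100000 * 40 * 1 / max( 100, 10000 )
         = 400
In reality, every fact row references one of the first 100 days, 51 of those days fall within the date range, and each day value occurs in 1000 fact rows, giving 51 * 1000 = 51000 rows.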
This is off by more than a factor of 100, which is significant. Some degree of error is common and simply down to the fact that estimates are being made, but margins of error this large are problematic.
This is a problem as the optimizer will use that number 400 further up in the plan, where it can lead to wrong plan choices, suboptimal execution plans and ultimately bad SQL performance.
The best solution for this problem is to give the optimizer more information about the join. The way this is done is by creating a view that recreates the join and collecting statistical information about the join result.
This is commonly referred to as a statistical view, and here we can create one using:
create view sv1 as ( select dim1.* from dim1, fact where fact.dim1 = dim1.col1 );
alter view sv1 enable query optimization;
runstats on view sv1 with distribution;
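As a quick check ( a sketch; the SYSSTAT catalog views are the standard place where these statistics land ), we can verify the statistics collected on the view:
select card from sysstat.tables where tabschema = 'BENNYV' and tabname = 'SV1';
This should report 100000, confirming that every fact row finds a matching dim1 row in the view's join result.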
The optimizer will recognize that the join in the statement matches the join in the statistical view, and after matching it will augment the generic formula with the statistics it has about the join result, including the distribution of the col2 dates within that result.
It is important to make the statistical view generic ( no local predicates ), so it can be of benefit to many different statements using the same join.
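For example, a hypothetical variation of our statement with a different date range would be matched against the same view:
select a,b,c from fact , dim1, dim2, dim3 where
fact.dim1 = dim1.col1 and
fact.dim2 = dim2.col1 and
fact.dim3 = dim3.col1 and
dim1.col2 between current date + 60 days and current date + 90 days;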
Using the same db2caem command we can now see what the estimates have become :
Total Cost: 12434
Query Degree: 1
                                          Rows
                                       Rows Actual
                                         RETURN
                                          ( 1)
                                          Cost
                                           I/O
                                            |
                                         50981.3
                                          51000
                                         ^HSJOIN
                                          ( 2)
                                          12434
                                           NA
                            /---------------+---------------\
                         50981.3                          10000
                          51000                           10000
                         ^HSJOIN                         TBSCAN
                          ( 3)                            ( 8)
                         11160.1                         752.45
                           NA                              NA
               /------------+-------------\                 |
            50981.3                     10000             10000
             51000                      10000              NA
            ^HSJOIN                    TBSCAN         TABLE: BENNYV
             ( 4)                       ( 7)              DIM3
            9886.04                    752.45              Q1
              NA                         NA
      /--------+--------\                 |
   100000              40               10000
   100000              51                NA
   TBSCAN          TBSCAN         TABLE: BENNYV
    ( 5)              ( 6)              DIM2
   8024.9          1071.16               Q2
     NA                NA
      |                 |
   100000             10000
     NA                NA
TABLE: BENNYV     TABLE: BENNYV
    FACT              DIM1
     Q4                Q3
The estimates are clearly accurate now. We can see that this has led to a higher cost for the execution plan, but it is a common misconception that a higher cost will lead to a slower plan.
The increase in cost here merely reflects the fact that our estimates are now close to reality, whereas we had a severe underestimation at first.
We can also see that the change in estimation has led to a change in join method higher up in the access plan ( from Nested Loop joins to Hash Joins ).
When investigating a poor access plan one might not immediately detect this problem, but the use of db2caem can help expose severe underestimations, and as we saw, statistical views can easily correct them.
Please leave a comment if there are any questions or feedback.