Technical Blog Post
Abstract
One value problem in DB2 query optimization
Body
Column cardinality affects filer factor estimation of the predicate when distribution stats is not collected.
If your query includes a predicate like "WHERE COL != <value> " for a flag type column which typically has low cardinality, you may encounter a sudden plan change depending on the environment. Typically, COLCARD=1 is tricky, hence I call it one value problem in this blog entry.
For example, when COLCARD = 2 for column COL2, meaning number of distinct values in the COL2 is 2, filter Factor of (Q2.COL2 <> 0) = 0.5.
This is because ( filter factor of "COL2 = <value>" ) = 1/(COLCARD) = 1/2 = 0.5, as the result, ( filter factor of "COL2 != <value>" ) = 1 - 1/(COLCARD) = 1 - 1/2 = 0.5
-Initial case: COLCARD= 2 and Filter Factor of (Q2.COL2 <> 0) = 0.5
UPDATE SYSSTAT.COLUMNS
SET COLCARD=2,
...
WHERE COLNAME = 'COL2' AND TABNAME = 'B' AND TABSCHEMA = 'B ';
Predicates:
----------
5) Sargable Predicate,
Comparison Operator: Not Equal (<>)
Subquery Input Required: No
Filter Factor: 0.5
Predicate Text:
--------------
(Q2.COL2 <> 0)
Similarly, if COLCARD varies from 2 to 10, FF (Filter factor) varies accordingly as follows.
COLCARD FF of COL2 = <value> FF of COL2 != <value>
------- --------------------- ---------------------
2 1/2 ( = 0.5 ) 1 - 1/2 ( = 0.5)
3 1/3 ( = 0.3333) 1 - 1/3 ( = 0.6667)
4 1/4 ( = 0.25) 1 - 1/4 ( = 0.76)
5 1/5 ( = 0.2) 1 - 1/5 ( = 0.8)
10 1/10( = 0.1) 1 - 1/10( = 0.9)
20 1/20( = 0.05) 1 - 1/20( = 0.95)
However, if COLCARD of COL2 = 1, DB2 does not apply the formula of ( 1 - 1/(COLCARD) ) for "COL2 !"= <value>" predicate, because FF = 0 makes cost comparison meaningless.
In stead, DB2 chooses 1/CARD for the "COL2 !"= <value>" predicate as a positive small number less than 0.5, where CARD if the table cardinality, i.e. number of rows in the table.
-Plan change: COLCARD=1 and Filter Factor of (Q2.COL2 <> 0) = 1e-06
UPDATE SYSSTAT.TABLES
SET CARD=1000000,
...
WHERE TABNAME = 'B' AND TABSCHEMA = 'B ';
UPDATE SYSSTAT.COLUMNS
SET COLCARD=1,
...
WHERE COLNAME = 'COL2' AND TABNAME = 'B' AND TABSCHEMA = 'B ';
Predicates:
----------
5) Sargable Predicate,
Comparison Operator: Not Equal (<>)
Subquery Input Required: No
Filter Factor: 1e-06
Predicate Text:
--------------
(Q2.COL2 <> 0)
This is an assumption based on available stats, however, changing FF from 0.5 to 0.000001 is fairly drastic and may cause plan change.
Here's a sample scenario to demonstrate that just updating a row caused the plan change from HSJOIN to NLJOIN which may be sub-optimal.
The query text is:
$ cat s1.sql
select x.key from b.b x where x.col1 = 1 and x.col2 != 0 and x.key in ( select y.key from a.a y where y.col1 = 1 )
;
The initial plan with COLCARD=1 for COl2 is like
<HSJOIN><IXSCAN TABLE='B.B'/><IXSCAN TABLE='A.A'/></HSJOIN>
The plan after chainging to COLCARD=1 for COl2 by an UPDATE is like:
<NLJOIN><IXSCAN TABLE='B.B'/><IXSCAN TABLE='A.A'/></NLJOIN>
The command used for the test.
db2 update db cfg for sample using AUTO_SAMPLING off AUTO_STATS_VIEWS off AUTO_STMT_STATS off AUTO_RUNSTATS off AUTO_TBL_MAINT off AUTO_MAINT off
db2 terminate
db2stop force
db2start
db2 connect to sample
db2 -tvf exp1.sql
db2 "drop table a.a"
db2 "create table a.a ( key integer not null, col1 smallint not null) in userspace1"
db2 "create unique index a.ix0a on a.a (key)"
db2 "create index a.ix1a on a.a (col1)"
db2 "drop table b.b"
db2 "create table b.b ( key integer not null, col1 smallint not null, col2 smallint not null) in userspace1"
db2 "create unique index b.ix0b on b.b (key)"
db2 "create index b.ix1b on b.b (col1)"
db2 "load from expa.del of del insert into a.a"
db2 "load from expb.del of del insert into b.b"
db2 "runstats on table a.a and indexes all"
db2 "runstats on table b.b and indexes all"
db2look -d sample -e -z b -t b -m -o db2look.1.out
mkdir s1caemout
db2caem -d sample -sf s1.sql -o s1caemout
db2 "update b.b set col2=1 where key = 1"
db2 "runstats on table b.b and indexes all"
db2look -d sample -e -z b -t b -m -o db2look.2.out
mkdir s1caemout2
db2caem -d sample -sf s1.sql -o s1caemout2
The test data files are generated with following export commands:
$ cat exp1.sql
export to expa.del of del
with w1 as (
select row_number() over( order by a.tabname ) number from syscat.columns a, syscat.columns b
fetch first 1000000 rows only
)
, w2 as (
select
number key
,mod(number,1000) col1
from w1
)
select key, col1 from w2 order by col1, key;
;
export to expb.del of del
with w1 as (
select row_number() over( order by a.tabname ) number from syscat.columns a, syscat.columns b
fetch first 1000000 rows only
)
, w2 as (
select
number key
,mod(number,100) col1
,case number
when 1 then 0
else 1
end col2
from w1
)
select key, col1, col2 from w2 order by col1, key, col2
;
The access plan graph with Rows Actual for after initial runstats. DB2 estimated number of rows from FETCH(3) = 5000 where Actual number of rows = 10000
Access Plan:
-----------
Total Cost: 1583.12
Query Degree: 1
Rows
Rows Actual
RETURN
( 1)
Cost
I/O
|
5
999
^HSJOIN
( 2)
1583.12
NA
/---------+----------\
5000 1000
10000 1000
FETCH FETCH
( 3) ( 5)
1385.76 137.25
NA NA
/---+---\ /---+---\
10000 1e+06 1000 1e+06
10000 NA 1000 NA
IXSCAN TABLE: B IXSCAN TABLE: A
( 4) B ( 6) A
780.87 Q2 86.361 Q1
NA NA
| |
1e+06 1e+06
NA NA
INDEX: B INDEX: A
IX1B IX1A
Q2 Q1
3) FETCH : (Fetch)
Filter Factor: 0.5
Predicate Text:
--------------
(Q2.COL2 <> 0)
Output Streams:
--------------
4) To Operator #2
Estimated number of rows: 5000
Actual number of rows: 10000
- After running "update b.b set col2=1 where key = 1" which made COL2 all value 1 and running runstats. The plan was changed because the filter factor was changed from 0.5 to 0.000001.
The big difference between estimated number of rows and actual number of rows was in FETCH(3) due to the filter factor 0.000001 whereas all the rows satisfy (Q2.COL2 <> 0).
Access Plan:
-----------
Total Cost: 1402.79
Query Degree: 1
Rows
Rows Actual
RETURN
( 1)
Cost
I/O
|
1e-05
1000
^NLJOIN
( 2)
1402.79
NA
/---------+----------\
0.01 0.001
10000 0.1
FETCH FETCH
( 3) ( 5)
1385.76 23.9671
NA NA
/---+---\ /---+----\
10000 1e+06 1 1e+06
10000 NA 1 NA
IXSCAN TABLE: B IXSCAN TABLE: A
( 4) B ( 6) A
780.87 Q2 16.5197 Q1
NA NA
| |
1e+06 1e+06
NA NA
INDEX: B INDEX: A
IX1B IX0A
Q2 Q1
3) FETCH : (Fetch)
Filter Factor: 1e-06
Predicate Text:
--------------
(Q2.COL2 <> 0)
Output Streams:
--------------
4) To Operator #2
Estimated number of rows: 0.01
Actual number of rows: 10000
- The solution is to collect DISTRUBITION stats either by manually or by AUTO_RUNSTATS configuration parameter. Distributions stats avoids the one value problem.
Here are questions and answers about the above steps for your reference.
Q1. Why the subquery "select x.key from b.b x where x.col1 = 1 and x.col2 != 0 and x.key in ( select y.key from a.a y where y.col1 = 1 )" results the joins.
A1. Because, DB2 rewrites the query to join as follows, then chooses optimal the plan. Optimized statement is the result of query rewrite.
Original Statement:
------------------
select
x.key
from
b.b x
where
x.col1 = 1 and
x.col2 != 0 and
x.key in
(select
y.key
from
a.a y
where
y.col1 = 1
)
Optimized Statement:
-------------------
SELECT
Q2.KEY AS "KEY"
FROM
"A ".A AS Q1,
"B ".B AS Q2
WHERE
(Q1.COL1 = 1) AND
(Q2.KEY = Q1.KEY) AND
(Q2.COL2 <> 0) AND
(Q2.COL1 = 1)
Q2. What are the characteristics of the two input data files ?
A2. data for table a.a: key column is a sequential number from 1 to 1000000, col1 is modulo 1000 of key, ordered by col1
data for table b.b: key column is a sequential number from 1 to 1000000, col1 is modulo 100 of key, col2 is 1 except key = 1, ordered by col1.
So, 999999 rows had col2=1, col2=0 only for key=1 initially.
Q3. Why index A.IX1A on COL1 was used for table A.A in HSJOIN and why index A.IX0A on KEY was used for table A.A in NLJOIN ?
A3. These indexes were chosen for cheaper cost for each join.
Q4. Is there a way to disable subquery to join rewrite answered in A1?
A4. Yes, SUBQ2JOIN OPTION in OPTGUIDELINES allows you to do.
Q5. Does DISTRIBUTION stats always avoid the problem ?
A5. Yes, it does, if it is used in optimization, but, queries which do NOT respect distribution stats will still hit the problem. They are:
- The predicate in question uses parameter markers or host variables, without reopt enabled. e.x. ( COL2 != ? )
- Statement concentrator enabled databases. i.e. STMT_CONC = LITERALS
Note, these type of queries are optimized for query compilation time. This is a trade off, less query compilation time, less optimization.
UID
ibm11140454