Summary tables

DB2® Warehouse uses DB2 summary tables to improve the performance of queries issued to cube models and cubes. A summary table is a special type of a materialized query table (MQT) that specifically includes summary data.

Because the Optimization Advisor always recommends MQTs with summarized data, the term summary table is used in the DB2 Warehouse documentation to describe the recommended MQTs.

You can complete expensive calculations and joins for your queries ahead of time and store that data in a summary table. When you run queries that can use the precomputed data, DB2 will reroute the queries to the summary table. A query does not need to match the precomputed calculations exactly. If you use simple analytics like SUM and COUNT, DB2 can dynamically aggregate the results from the precomputed data. Many different queries can be satisfied by one summary table. Using summary tables can dramatically improve query performance for queries that access commonly used data or that involve aggregated data over one or more dimensions or tables.

Figure 1 shows a cube model based on a star schema with a Sales facts object and Product, Time, and Retailer dimensions. The facts object has measures and attributes, and each dimension has a set of attributes and is joined to the facts object by a facts-to-dimension join.

Figure 1. Cube model. Cube model with a Sales facts object and Product, Time, and Retailer dimensions

Cube model based on a star schema with a Sales facts object, a Time dimension, a Product dimension, and a Retailer dimension

The hierarchy for each dimension in the cube model is shown in Figure 2. The highlighted boxes connected by the thick dark line across the bottom of the hierarchies represent the data that actually exists in the base tables. Sales data is stored at the Day level, Retailer site level, and Product level. Data above the base level in the hierarchy must be aggregated. If you query a base table for sales data from a particular month, DB2 must dynamically add the daily sales data to return the monthly sales figures. For example, you can use the following query to see the sales data for each product type, in each region, by each month in 2004:

SELECT PRODUCT_TYPE_CODE, REGION_EN, CURRENT_MONTH, SUM(SALE_TOTAL)
FROM TIME, PRODUCT, RETAILER, REGION, SALES
WHERE SALES.ORDER_DAY_KEY=TIME.DAY_KEY
AND   SALES.RETAILER_SITE_KEY=RETAILER.RETAILER_SITE_KEY
AND   SALES.PRODUCT_KEY=PRODUCT.PRODUCT_KEY 
AND   RETAILER.COUNTRY_KEY=REGION.COUNTRY_KEY
AND   CURRENT_YEAR=2004
GROUP BY PRODUCT_TYPE_CODE, CURRENT_MONTH

The thin line connecting the Type-Region-Month slice in Figure 2 represents the slice that the query accesses. Type-Region-Month is a slice of the cube model and includes one level from each hierarchy. You can define summary tables to satisfy queries at or above a particular slice. A summary table can be built for the Type-Region-Month slice that is accessed by the query. Any other queries that access data at or above that slice including All Time, Year, Quarter, All Retailers, All Products, and Line can be satisfied by the summary table with some additional aggregating. However, if you query more detailed data below the slice, such as Day or Retailer, DB2 cannot use the summary table for this more granular query.

Figure 2. Query slice. Product, Retailer, and Time hierarchies. Shows the Type-Region-Month slice and that the base data exists in the Product-Retailer Site-Day slice.

Three hierarchies, Product, Retailer, and Time, are shown with each of their hierarchy levels.

In Figure 3, the dotted line defines the Type-Country-Month slice. A summary table built for the Type-Country-Month slice can satisfy any query that accesses data at or above the slice. All of the data that can be satisfied by a summary table built for the Type-Country-Month slice is included in the top set of highlighted boxes.

Figure 3. Summary table slice. Product, Retailer, and Time hierarchies. The highlighted data can be satisfied by a summary table built at the Type-Country-Month slice.

The dimensions from the previous figure are shown.

The rewriter in the DB2 SQL compiler knows about existing summary tables, and can automatically rewrite queries to read from the summary table instead of the base tables. Rewritten queries are typically much faster because the summary tables are usually much smaller than the base tables and contain pre-aggregated data. Users continue to write queries against the base tables. DB2 will decide when to use a summary table for a particular query and will rewrite the user's query to access the summary tables instead, as shown in Figure 4. The rewritten query accesses a summary table that contains pre-aggregated data. A summary table is often significantly smaller, and therefore significantly faster, than the base tables and returns the same results as the base tables.

You can use the DB2 EXPLAIN facility to see if the query was rerouted, and if applicable, which table it was rerouted to.

Figure 4. Query rewrite. The DB2 process for rewriting a query

Diagram of the DB2 query rewrite process.

The query to see the sales data for each product type, in each region, by each month in 2004, can be rewritten to use the summary table built for the Type-Country-Month slice. The original query is here:

SELECT PRODUCT_TYPE_CODE, REGION_EN, CURRENT_MONTH, SUM(SALE_TOTAL)
FROM TIME, PRODUCT, RETAILER, REGION, SALES
WHERE SALES.ORDER_DAY_KEY=TIME.DAY_KEY
AND   SALES.RETAILER_SITE_KEY=RETAILER.RETAILER_SITE_KEY
AND   SALES.PRODUCT_KEY=PRODUCT.PRODUCT_KEY 
AND   RETAILER.COUNTRY_KEY=REGION.COUNTRY_KEY
AND   CURRENT_YEAR=2004
GROUP BY PRODUCT_TYPE_CODE, CURRENT_MONTH

The rewritten query is here:

SELECT PRODUCT_TYPE_CODE, REGION_EN, CURRENT_MONTH, SUM(SALE_TOTAL)
FROM SUMMARYTABLE1 
WHERE CURRENT_YEAR = '2004'
GROUP BY PRODUCT_TYPE_CODE, CURRENT_MONTH;

The rewritten query is much simpler and quicker for DB2 to complete because the data is pre-aggregated and many of the table joins are precomputed so DB2 accesses one small table instead of six tables, including a large fact table. The savings with summary tables can be tremendous, especially for schemas that have large fact tables. For example, a fact table with 1 billion rows might be pre-aggregated to a summary table with only 1 million rows, and the calculations involved in this aggregation occur only once instead of each time a query is issued. A summary table that is 1000 times smaller is much faster than accessing the large base tables.

Figure 5 shows the summary table for the Type-Country-Month slice. DB2 needs to calculate data for Region from the higher level Country instead of from the lower level Retailer, so the summary table has fewer rows than the base tables because there are fewer countries than retailers. DB2 does not need to perform any additional calculations to return sales data by Month and Type because the data is already aggregated at these levels. This query is satisfied entirely by the data in the summary table that joins the tables used in the query ahead of time and the joins do not need to be performed at the time the query is issued. For more complex queries, the performance gains can be dramatic.

Figure 5. Summary table. Example of the summary table that is created for the Type-Country-Month slice

In some cases, a query might access an attribute that is related to an attribute that is included in the summary table. The DB2 optimizer can use functional dependencies and constraints to dynamically join the summary table with the appropriate dimension table.

When the Optimization Advisor recommends a summary table, all of the measures in the cube model are represented. In the cube model in Figure 1, the SalesFacts object has only five measures including Sales, Total cost, Quantity, Gross profit, and Gross margin, which are all included in the summary table. Sometimes the Advisor can represent a set of measures using a smaller number of columns. For example, multiple measures may reference common expressions. The Optimization Advisor does not need to include all of the related attributes that are defined for a level in the summary table because DB2 can exploit semantic information such as constraints to determine how to join from the summary table back to the dimension tables to retrieve necessary information.

Feedback | Information roadmap