After you have determined the grain of your model, you
identify the dimensions that are true to that grain. You create columns,
hierarchies, and cases for snowflaking.
The following metadata is collected when you identify dimension
tables:
- Dimension names
- Business definitions
- Hierarchies
- Handling dimension changes
- Load frequency and statistics
- Usage statistics
- Archive rules and statistics
- Purge rules and statistics
- Data quality and accuracy
- Primary and foreign keys and how the keys are generated
- Data source information
- Facts
Identify dimensions
Identify
the dimensions that are true to the grain of your model.
Dimension
tables contain columns that describe the fact records in the fact
table. Some of these columns provide descriptive information. Other
columns specify how the data in the fact table is summarized to provide
useful information. Dimension tables contain hierarchies that help
to summarize data. Dimension tables are smaller, denormalized lookup
tables that contain descriptive columns that you reference when you
define queries.
To learn more about dimension tables, see Dimension tables and entities.
To identify dimensions, you perform
the following steps:
- Use the grain definition to locate possible dimensions.
- List all of the dimensions that are associated with this grain.
Define the level of detail to include in that dimension.
- Create surrogate keys for the primary keys of the dimensions.
To learn more about surrogate keys, see Surrogate keys.
Identify shared dimensions
Identify any shared dimensions that are available, instead of redesigning
the dimensions. Identify whether a dimension that is used exists inside
the enterprise data warehouse or dimensional model. To learn more
about shared dimensions, see
Shared dimensions.
To identify a shared dimension, perform the following
steps:
- Identify whether a dimension exists. If the dimension exists,
then you must use that dimension. If the dimension exists, then you
do not identify dimensional attributes for that particular dimension.
Share a dimension with an existing dimension even if the shared dimension
contains only a subset of attributes from the primary dimension.
- If the dimension does not exist, create a dimension that you use
across the enterprise. If the dimensional model requires a dimension
that does not exist, design a new dimension. When you design this
dimension, interact with subject matter experts to find out how they
plan to use the dimension.
Identify dimensional columns and
hierarchies
After you identify the dimensions, fill the
dimensions with columns. Use the descriptive columns to define the
restriction criteria for queries.
Consider the following points
when you design your dimensions:
- Preserve the source system key
- Use a separate entry in the dimension table to preserve the natural
source system key of the entity being used in the source system.
- Use surrogate keys
- Use a surrogate key for the primary key of a dimension. You do
not analyze the surrogate key. To learn more about surrogate keys, see Surrogate keys.
- Create descriptive, unique columns in your model
The columns of a dimension reflect the potential areas of interest that you can use
to aggregate data or to create constraints and report breaks.
Create columns that are descriptive and easily understood.
Define columns that can contain a NULL value when a column does not
apply to a specific item or its value is unknown.
Define unique
column names within the model. If you have duplicate names in different
dimension tables, create a distinction. For example, if you have multiple
columns that are named Address Type Code, rename
one column Beneficiary Address Type Code and
another column Premium Address Type Code.
Properly document all columns.
- Handling codes
- If you use codes in your dimension, provide documentation of the
code. For example, if you identify branches by a branch code, where
each code represents a branch name, then include both the code and
the name.
After you have defined the columns, you can
define the hierarchies of the dimension. A hierarchy is
a cascaded series of many-to-one relationships. A hierarchy contains
different levels, each corresponding to a dimension attribute. To
learn more about hierarchies, see Hierarchies.
Identify the granularity of date
and time dimensions
Identify the granularity of the date and
time dimensions. The date and time dimensions help determine the granularity
of the overall dimensional model and the level of information that
is stored in the model. If you define an incorrect grain in either
the date or time dimension, you may omit important dimensions. For
example, if you design a dimensional model for an order management
system and design the grain of the date dimension as Quarter, you may
miss many other dimensions (such as Time and Employee). You can include
those other dimensions if the model defines the grain of the date
dimension been at the day level.
- Date dimension
- Since all dimensional models are based on units of time, every data mart
has a date dimension. For example, you may want to measure how the
business performs over a period of time. A dimensional model may contain several
date dimensions.
The date dimension typically does not have an OLTP
source system that is connected to the dimension. You can develop
the date dimension before you design the dimensional model. To build
a date dimension, perform the following steps:
- Identify all of the columns that you need.
- Use SQL statements to populate the date dimension table for columns
such as Date, Day, Month, and Year. You may need to manually enter
data for special columns. For example, if you are tracking non-business
days (such as holidays), you should enter Christmas or Boxing Day manually. You should also manually
enter dates such as fiscal date, fiscal month, fiscal quarter, and
fiscal years.
Date data is often stored in a separate date dimension
(instead of a Date column in a fact table) for two reasons:
- There are several date attributes that the SQL date functions
do not support. These functions include fiscal periods, holidays,
seasons, weekdays, weekends, and national events. When you create
a date dimension, the business can look at the performance indicators
of their business across various fiscal and date-related attributes.
The performance indicators are not apparent if you use a SQL date
or time column inside the fact table.
- It is much easier to drag the columns from a date table instead
of using complex SQL functions to create the logic for the reports.
- Time dimension
You can handle time in dimensional modeling in two ways:
- Time of day as a separate dimension
- Time of day as a fact
You should handle time data in a dimension table and not
a fact inside the fact table. You should create a time dimension if
you must support the roll-up of time periods into more summarized
groupings for reporting and analysis. For example, create a time dimension
for the following summarized groupings:
- Hours
- Business-specific time groupings (weekday mornings, late-night
shifts, or late-evenings)
You should also create a time dimension if you want to represent
different hierarchies for the time that you are measuring. For example, create
different hierarchies for Standard and Military time.
However,
if you do not roll up or filter time-of-day groups, express time as
a fact inside the fact table. In this case, time is treated as a simple numeric
fact, in the timestamp data type.
Note: Expressing time as a fact inside
the fact table makes the reporting and summarization process difficult
if you need to analyze data that was stored over a period of time.
You can easily summarize and report on data if you create a separate
time dimension.
Identify slowly changing dimensions
You must identify slowly changing dimensions and determine how
you will handle the changing data.
A slowly changing dimension is a dimension whose attributes for a record change slowly over
time. For example, you may need to track employee transfers within
the company.
The workbench supports the following types of slowly
changing dimensions:
- Type 0
- No changes are made to the dimension.
- Type 1
- New data overwrites the old data. Historical changes are not tracked.
- Type 2
- With this method, two separate entries are created. The original
record and the new record are both available in the table. The new
row gets its own primary key (surrogate key).
- Type 4
- Separate tables are created to store some or all historical data.
Only one table contains the current data, and when updates occur,
the old data is moved into a historical table.
- Type 6
- Type 1, Type 2, and Type 3 (storing data in separate columns in
the same table) are combined. Typically, the old data is moved into
separate records and columns, and the new data is placed in the original
records.
Identify cases for snowflaked dimensions
You create a snowflake schema deign if you normalize and expand
the dimension tables in a star schema. A dimension table is snowflaked
when the low-cardinality attributes in the dimension have been removed
to separate normalized tables and these normalized tables are then
joined back into the original dimension table.
Note: Snowflaking
in the dimensional model environment is not recommended. Snowflaking
makes the dimensional model harder to understand. Snowflaking can
also cause decreased performance, because you must join more tables.
You must only snowflake a dimension in a dimensional model in
two cases:
- The dimension table contains two or more sets of attributes that
define information at different grains.
- You populate the sets of attributes of the same dimension table via
different source systems.
Do not snowflake hierarchies of one dimension table into separate
tables. Hierarchies should belong to the dimension table only. Do
not snowflake the hierarchies. Multiple hierarchies can belong to
the same dimension if the dimension has been designed at the lowest
possible detail.
If hierarchies are split into separate tables,
performance is impacted, because more joins are required. In some
situations, you may snowflake the hierarchies of a main table. When
you use an aggregate of the fact table, only snowflake dimensions
with hierarchies to avoid joins to large dimension tables. For example,
if you have brand information that you want to separate from a product
dimension table, create a Brand snowflaked dimension that contains
a single row for each brand, using fewer rows than the Product dimension table.
Note: If you use too many tables in a snowflake schema, the design may
become too complex. Dimensional modeling aims to create a simple,
understandable model. If you have more tables, you need to create
more joins. Performance is decreased when you create more joins.