Surrogate keys join the dimension tables to
the fact table. Surrogate keys serve as an important means of identifying
each instance or entity inside of a dimension table.
Reasons for using surrogate keys
- Data tables in various OLTP source systems may use different keys
for the same entity
- It may also be possible that a single key is being used by different
instances of the same entity. This means that different customers
might be represented using the same key across different OLTP systems.
This can be a major problem when trying to consolidate information
from various source systems. Or for companies trying to create/modify
data warehouses after mergers and acquisitions. Existing systems that
provide historical data might have used a different numbering system
than a current OLTP system. Moreover, systems developed independently
may not use the same keys, or they may use keys that conflict with
data in the systems of other divisions. This situation may not cause
problems when each department independently reports summary data,
but can when trying to achieve an enterprise-wide view of the data.
This means that we cannot rely on using the natural primary keys
of the source system as dimension primary keys because there is no
guarantee that the natural primary keys will be unique for each instance.
A surrogate key uniquely identifies each entity in the dimension table,
regardless of its natural source key. This is primarily because a
surrogate key generates a simple integer value for every new entity.
- Surrogate keys provide the means to maintain data warehouse information
when dimensions change
- Surrogate keys are necessary to handle changes in dimension table
attributes.
- Natural OLTP system keys may change or be reused in the source
data systems
- Some systems have reuse keys belonging to obsolete data or for
data that has been purged. However, the key may still be in use in
historical data in the data warehouse, and the same key cannot be
used to identify different entities.
The design, implementation,
and administration of surrogate keys is the responsibility of the
data warehouse team. Surrogate keys are maintained in the data preparation
area during the data transformation process.
- Improve performance of queries
- The narrow integer surrogate keys mean a thinner fact table. The
thinner the fact table, the better the performance.
- Handle exception cases
- If requirements need to be determined or do not apply, use a surrogate
key.
- Changes or realignment of measures should be carried in a separate
column in the table
- If data must be reviewed or summarized, regardless of the number
of times it appears in a table, you should use that column as part
of the surrogate key.
Avoid globally unique identifiers as surrogate keys
Globally unique identifiers are known to work well in the source
OLTP systems, but they are difficult to use when it comes to data
warehouses. This is primarily because of two reasons:
- Globally unique identifiers use a significant amount of space
compared to their integer counterparts. Globally unique identifiers
take about 16 bytes each, where an integer takes about 4 bytes.
- Indexes on globally unique identifiers columns are relatively
slower than indexes on integer keys because globally unique identifiers
are four times larger.