Dimensional modeling: Surrogate keys

Surrogate keys join the dimension tables to the fact table. Surrogate keys serve as an important means of identifying each instance or entity inside of a dimension table.

Reasons for using surrogate keys

Data tables in various OLTP source systems may use different keys for the same entity

It may also be possible that a single key is being used by different instances of the same entity. This means that different customers might be represented using the same key across different OLTP systems.

This can be a major problem when trying to consolidate information from various source systems. Or for companies trying to create/modify data warehouses after mergers and acquisitions. Existing systems that provide historical data might have used a different numbering system than a current OLTP system. Moreover, systems developed independently may not use the same keys, or they may use keys that conflict with data in the systems of other divisions. This situation may not cause problems when each department independently reports summary data, but can when trying to achieve an enterprise-wide view of the data.

This means that we cannot rely on using the natural primary keys of the source system as dimension primary keys because there is no guarantee that the natural primary keys will be unique for each instance. A surrogate key uniquely identifies each entity in the dimension table, regardless of its natural source key. This is primarily because a surrogate key generates a simple integer value for every new entity.

Surrogate keys provide the means to maintain data warehouse information when dimensions change

Surrogate keys are necessary to handle changes in dimension table attributes.

Natural OLTP system keys may change or be reused in the source data systems

Some systems have reuse keys belonging to obsolete data or for data that has been purged. However, the key may still be in use in historical data in the data warehouse, and the same key cannot be used to identify different entities.

The design, implementation, and administration of surrogate keys is the responsibility of the data warehouse team. Surrogate keys are maintained in the data preparation area during the data transformation process.

Improve performance of queries

The narrow integer surrogate keys mean a thinner fact table. The thinner the fact table, the better the performance.

Handle exception cases

If requirements need to be determined or do not apply, use a surrogate key.

Changes or realignment of measures should be carried in a separate column in the table

If data must be reviewed or summarized, regardless of the number of times it appears in a table, you should use that column as part of the surrogate key.

Avoid globally unique identifiers as surrogate keys

Globally unique identifiers are known to work well in the source OLTP systems, but they are difficult to use when it comes to data warehouses. This is primarily because of two reasons:

Globally unique identifiers use a significant amount of space compared to their integer counterparts. Globally unique identifiers take about 16 bytes each, where an integer takes about 4 bytes.
Indexes on globally unique identifiers columns are relatively slower than indexes on integer keys because globally unique identifiers are four times larger.