Data mining — Data preparation steps

Data preparation starts at the end of the data understanding phase, when the relevant data has been identified and its content is known.

This data is usually not ready for immediate analysis for the following reasons:

Data might not be clean and therefore not suitable for further analysis. In particular, data might be incomplete, wrong, or inconsistent. There are various techniques and tools to cleanse data, but cleansing is not part of this tool.
Data might be distributed across many tables, and values might be recorded at a granularity that does not suit the business purpose at hand. For example, point-of-sale data contains individual sales transactions, while the business goal is to perform product profitability analysis. The data must therefore be reorganized so that products become the focus of analysis.
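The point-of-sale example can be sketched in a few lines of Python. This is only an illustration of the change in granularity, not how the tool performs it; the record layout (product, quantity, unit_price, unit_cost) and the function name profit_by_product are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical point-of-sale records: one entry per sales transaction line.
transactions = [
    {"product": "espresso", "quantity": 2, "unit_price": 3.0, "unit_cost": 1.0},
    {"product": "tea", "quantity": 1, "unit_price": 2.5, "unit_cost": 0.5},
    {"product": "espresso", "quantity": 1, "unit_price": 3.0, "unit_cost": 1.0},
]


def profit_by_product(rows):
    """Regroup transaction-level rows into one summary entry per product."""
    summary = defaultdict(lambda: {"units": 0, "revenue": 0.0, "profit": 0.0})
    for row in rows:
        entry = summary[row["product"]]
        entry["units"] += row["quantity"]
        entry["revenue"] += row["quantity"] * row["unit_price"]
        entry["profit"] += row["quantity"] * (row["unit_price"] - row["unit_cost"])
    return dict(summary)


# After regrouping, the product (not the transaction) is the unit of analysis.
product_summary = profit_by_product(transactions)
```

Each key of product_summary is a product, so later analysis steps see one record per product instead of one record per sale.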

This reorganization task is often called data preparation. Data preparation consists of the following major steps:

Defining a data preparation input model

The first step is to define a data preparation input model. This means locating the relevant data in the database and relating it. This task is usually performed by a database administrator (DBA) or a data warehouse administrator, because it requires knowledge of the database model.

In this step, the DBA defines semantic concepts such as hierarchies. The relevant tables are joined so that the data transformation tasks can be defined by using these semantic concepts.
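As a rough analogy, joining the relevant tables once and exposing a hierarchy level by name could look like the following sketch. It uses Python's built-in sqlite3 module rather than the product's own input model editor, and all table, column, and view names (sales, product, product_group, sales_input_model) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical source tables: a two-level product hierarchy plus sales facts.
CREATE TABLE product_group (group_id INTEGER PRIMARY KEY, group_name TEXT);
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT,
    group_id INTEGER REFERENCES product_group(group_id)
);
CREATE TABLE sales (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES product(product_id),
    amount REAL
);
INSERT INTO product_group VALUES (1, 'Hot drinks'), (2, 'Snacks');
INSERT INTO product VALUES (10, 'Espresso', 1), (11, 'Tea', 1), (20, 'Cookie', 2);
INSERT INTO sales VALUES (100, 10, 3.0), (101, 11, 2.5), (102, 20, 1.5), (103, 10, 3.0);

-- The joins are written once, by someone who knows the database model.
CREATE VIEW sales_input_model AS
SELECT s.sale_id, s.amount, p.product_name, g.group_name
FROM sales AS s
JOIN product AS p ON p.product_id = s.product_id
JOIN product_group AS g ON g.group_id = p.group_id;
""")

# Later steps can refer to the hierarchy level by name, without repeating joins.
rows_by_group = conn.execute(
    "SELECT group_name, SUM(amount) FROM sales_input_model "
    "GROUP BY group_name ORDER BY group_name"
).fetchall()
```

The view plays the role of the joined input: whoever defines the later transformations only needs to know that a product belongs to a product group, not which keys connect the tables.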

If an OLAP model in the form of a Cubing Services model is available, this step can be skipped because the cube model can be imported as the input model.

Defining a data preparation profile

The second step is to define a data preparation profile. This means determining the focus of analysis and specifying the relevant properties that the data transformation is to compute. Because the profile definition can be based on the semantic concepts that were defined in the previous step, the mining analyst can easily perform it.

At the end of this step, a single logical table is defined. This logical table is the starting point for subsequent data mining analysis. You can create this table by generating a data flow or an SQL script. The table that results from the data flow or the SQL script is then used as the table source in a mining flow.
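To make the idea of a single logical table concrete, the sketch below runs a hand-written SQL script that produces one row per unit of analysis. It is not the script that the tool generates; the focus (customers), the computed properties, and all table and column names are assumptions chosen for illustration, and sqlite3 stands in for the actual database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical transaction-level input.
CREATE TABLE purchases (customer_id INTEGER, product TEXT, amount REAL);
INSERT INTO purchases VALUES
    (1, 'Espresso', 3.0),
    (1, 'Cookie', 1.5),
    (2, 'Tea', 2.5),
    (1, 'Espresso', 3.0);
""")

# Hypothetical profile: the focus of analysis is the customer, and three
# properties are computed for each customer.
PROFILE_SCRIPT = """
CREATE TABLE customer_profile AS
SELECT customer_id,
       COUNT(*)                AS purchase_count,
       SUM(amount)             AS total_amount,
       COUNT(DISTINCT product) AS distinct_products
FROM purchases
GROUP BY customer_id;
"""
conn.executescript(PROFILE_SCRIPT)

# The resulting table has exactly one row per customer and could serve as
# the single input table for a mining step.
profile_rows = conn.execute(
    "SELECT customer_id, purchase_count, total_amount, distinct_products "
    "FROM customer_profile ORDER BY customer_id"
).fetchall()
```

Keeping the script as a separate string mirrors the idea that the table definition is an artifact in its own right, which can be stored, reviewed, and rerun when the source data changes.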

Figure 1. Data preparation overview

The picture above illustrates the different phases of the data mining process.