Data preparation starts at the end of the data understanding
phase, when the relevant data is understood and its content is known.
This data is usually not ready for immediate analysis for the following
reasons:
- Data might not be clean and therefore not suitable for further
analysis. In particular, data might be incomplete, wrong, or inconsistent.
There are various techniques and tools to cleanse data, but cleansing
is not part of this tool.
- Data might be distributed across many tables, and values might be
recorded at an inconvenient granularity for the business purpose at
hand. For example, point-of-sale data contains individual sales transactions
while the business goal is to perform product profitability analysis.
In this case, the transactions must be reorganized, for example by
aggregating them per product, so that products become the focus of
analysis.
This reorganization task is often called data preparation.
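As an illustration of the point-of-sale example, the following Python sketch regroups transaction-level rows into one record per product. The rows, the column names, and the `summarize_by_product` helper are hypothetical and are not part of the tool; they only show the change in granularity.

```python
from collections import defaultdict

# Hypothetical point-of-sale rows: one row per individual sales transaction.
# All names and values are invented for this illustration.
transactions = [
    {"transaction_id": 1, "product": "espresso", "quantity": 2, "revenue": 6.0, "cost": 2.0},
    {"transaction_id": 2, "product": "muffin", "quantity": 1, "revenue": 2.5, "cost": 1.0},
    {"transaction_id": 3, "product": "espresso", "quantity": 1, "revenue": 3.0, "cost": 1.0},
]


def summarize_by_product(rows):
    """Regroup transaction-level rows so that each product becomes one record."""
    totals = defaultdict(lambda: {"units": 0, "revenue": 0.0, "cost": 0.0})
    for row in rows:
        entry = totals[row["product"]]
        entry["units"] += row["quantity"]
        entry["revenue"] += row["revenue"]
        entry["cost"] += row["cost"]
    # Derive profit once all transactions of a product have been summed.
    return {
        product: {**entry, "profit": entry["revenue"] - entry["cost"]}
        for product, entry in totals.items()
    }


product_table = summarize_by_product(transactions)
for product, record in sorted(product_table.items()):
    print(product, record)
```

Each key of `product_table` is a product, so the product, not the transaction, is now the unit of analysis.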
Data preparation consists of the following major steps:
- Defining a data preparation input model
- The first step is to define a data preparation input model. This
means locating the relevant data in the database and relating it. This
task is usually performed by a database administrator (DBA) or a data
warehouse administrator, because it requires knowledge of the database
model.
In this step, the DBA defines semantic concepts such as hierarchies.
The relevant tables are joined so that the data transformation tasks
can be defined by using these semantic concepts.
If an OLAP model in the form of a Cubing Services model is available,
this step can be skipped because the cube model can be imported as
the input model.
- Defining a data preparation profile
- The second step is to define a data preparation profile. This
means determining the focus of analysis and specifying the relevant
properties that are to be computed by the data transformation. Because
the profile definition can be based on the semantic concepts that
were defined in the previous step, it can easily be performed by the
mining analyst.
At the end of this step, a single logical table
is defined. This logical table is the starting point for subsequent
data mining analysis. You can create this table by generating a data
flow or an SQL script. The table that the data flow or the SQL script
produces is then used as the table source in a mining flow.
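The two steps can be sketched outside the tool as follows. The sketch uses Python with an in-memory SQLite database as a stand-in. The `input_model` and `profile` dictionaries, the `generate_sql` helper, and all table and column names are assumptions made for illustration; they do not reflect the tool's actual model format or the SQL that it generates.

```python
import sqlite3

# Step 1 (hypothetical input model): where the relevant data lives, how the
# tables relate, and which semantic concepts (here one hierarchy) exist.
input_model = {
    "fact_table": "sales",
    "joins": [("product", "sales.product_id = product.product_id")],
    "hierarchies": {
        "product": {"category": "product.category", "name": "product.name"},
    },
}

# Step 2 (hypothetical profile): the focus of analysis, expressed as a level
# of a hierarchy from step 1, and the properties to compute per focus value.
profile = {
    "focus": ("product", "name"),
    "properties": {
        "transaction_count": "COUNT(*)",
        "total_revenue": "SUM(sales.amount)",
    },
}


def generate_sql(model, prof, target="product_profile"):
    """Build a script that creates the single logical table for the profile."""
    hierarchy, level = prof["focus"]
    focus_column = model["hierarchies"][hierarchy][level]
    columns = [f"{focus_column} AS focus"]
    columns += [f"{expr} AS {name}" for name, expr in prof["properties"].items()]
    joins = " ".join(f"JOIN {table} ON {cond}" for table, cond in model["joins"])
    return (
        f"CREATE TABLE {target} AS "
        f"SELECT {', '.join(columns)} "
        f"FROM {model['fact_table']} {joins} "
        f"GROUP BY {focus_column}"
    )


# Invented sample data so that the generated script can be run end to end.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL);
    INSERT INTO product VALUES (1, 'espresso', 'drinks'), (2, 'muffin', 'bakery');
    INSERT INTO sales VALUES (1, 1, 3.0), (2, 2, 2.5), (3, 1, 3.0);
    """
)

script = generate_sql(input_model, profile)
conn.execute(script)

# The resulting table has one row per focus value and could serve as the
# table source for later analysis.
rows = conn.execute(
    "SELECT focus, transaction_count, total_revenue "
    "FROM product_profile ORDER BY focus"
).fetchall()
print(script)
print(rows)
```

Changing the focus to `("product", "category")` would regroup the same data by category without touching the input model, which mirrors the division of work between the administrator and the mining analyst described above.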
Figure 1. Data preparation overview