Data preparation in the mining process

A typical data mining project goes through several phases. These phases are described in detail in the Cross-Industry Standard Process for Data Mining (CRISP-DM).
In a nutshell, the life cycle of a data mining project according to CRISP-DM includes the following phases:
Business understanding
To identify the business goals and to determine how to measure success.
Data understanding
To select relevant data and to understand it, that is, to understand the semantics of the tables and columns and to know the data distributions.
Data preparation
To cleanse the selected data and to transform it, for example, by joining and aggregating tables, so that it is suitable for data mining analysis.
Modeling
To run the data mining algorithms.
Evaluation
To look at mining models, understand influencing factors, and assess model accuracy.
Deployment
To score, that is, to apply the data mining model to new data.
For more information about CRISP-DM, visit this Web site:
 http://www.crisp-dm.org/
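
The data preparation phase described above (cleansing, joining, and aggregating) can be illustrated with a minimal sketch. The table contents, the column names, and the `prepare` function are hypothetical and serve only as an illustration. Replacing missing values with the mean is one possible cleansing choice that is assumed here; it is not prescribed by CRISP-DM.

```python
from collections import defaultdict

# Hypothetical sample data; table and column names are illustrative only.
customers = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": None},  # missing value that must be cleansed
    {"customer_id": 3, "age": 51},
]
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.5},
    {"customer_id": 2, "amount": 12.0},
    {"customer_id": 3, "amount": 80.0},
    {"customer_id": 3, "amount": 5.0},
]


def prepare(customers, transactions):
    """Return one row per customer, ready to be used as mining input."""
    # Cleanse: compute the mean of the known ages to fill in missing ones
    # (mean imputation is an assumption made for this sketch).
    known_ages = [c["age"] for c in customers if c["age"] is not None]
    mean_age = sum(known_ages) / len(known_ages) if known_ages else None

    # Aggregate: total amount and number of transactions per customer.
    totals = defaultdict(float)
    counts = defaultdict(int)
    for t in transactions:
        totals[t["customer_id"]] += t["amount"]
        counts[t["customer_id"]] += 1

    # Join: attach the aggregated values to each customer record.
    prepared = []
    for c in customers:
        cid = c["customer_id"]
        prepared.append({
            "customer_id": cid,
            "age": c["age"] if c["age"] is not None else mean_age,
            "total_amount": totals.get(cid, 0.0),
            "num_transactions": counts.get(cid, 0),
        })
    return prepared


rows = prepare(customers, transactions)
for row in rows:
    print(row)
```

The result contains exactly one row per customer, which is the shape that most mining algorithms expect as input. In a data warehouse, the same steps would typically be expressed as joins and aggregations in the database itself.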

Although modeling is mathematically the most complicated step in the mining process, data preparation usually requires the most effort in a data mining project. Experience shows that data preparation takes about 40-70% of the project time. Many regard this as a major pain point, and it frequently slows down data mining projects.

One reason for the high effort is the wide range of skills that the task requires. As a consequence, data preparation is performed jointly by a data warehouse administrator and a mining analyst. This gives rise to another problem: delays caused by the communication overhead between the people involved.

