A typical data mining project passes through several phases.
These phases are described in detail in the Cross Industry Standard
Process for Data Mining (CRISP-DM).
In brief, the life cycle of a data mining project according to
CRISP-DM consists of the following phases:
- Business understanding
  - Identify the business goals and determine how to measure
    success.
- Data understanding
  - Select the relevant data and understand it, that is, understand
    the semantics of the tables and columns and know the data
    distributions.
- Data preparation
  - Cleanse the selected data and transform it, for example, by
    joining and aggregating, so that it is suitable for data mining
    analysis.
- Modeling
  - Run the data mining algorithms.
- Evaluation
  - Examine the mining models, understand the influencing factors,
    and assess model accuracy.
- Deployment
  - Score, that is, apply the data mining model to new data.
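As a rough illustration of the data preparation phase listed above, the
following sketch joins two tables and aggregates the result into one
row per customer, which is the kind of flat input table that mining
algorithms commonly expect. The table names, column names, and sample
rows are hypothetical and serve only as an example. Python's built-in
sqlite3 module stands in for the actual warehouse database.

```python
import sqlite3


def build_sample_database():
    """Create an in-memory database with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, age INTEGER)"
    )
    conn.execute(
        "CREATE TABLE transactions (customer_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, 34), (2, 51), (3, 27)],
    )
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?)",
        [(1, 20.0), (1, 35.5), (2, 10.0)],
    )
    return conn


def prepare_mining_input(conn):
    """Join customers to their transactions and aggregate per customer."""
    query = """
        SELECT c.customer_id,
               c.age,
               COUNT(t.amount)            AS num_transactions,
               COALESCE(SUM(t.amount), 0) AS total_amount
        FROM customers AS c
        LEFT JOIN transactions AS t
               ON t.customer_id = c.customer_id
        GROUP BY c.customer_id, c.age
        ORDER BY c.customer_id
    """
    # The LEFT JOIN keeps customers that have no transactions at all;
    # COALESCE turns their missing sum into 0.
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    connection = build_sample_database()
    for row in prepare_mining_input(connection):
        print(row)
```

Keeping customers without transactions is a choice made for this
sketch. Whether such rows belong in the mining input depends on the
business goal that was identified in the first phase.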
For more information about CRISP-DM, visit this Web site:
http://www.crisp-dm.org/
Although modeling is mathematically the most complicated step in
the mining process, data preparation usually requires the most effort
in a data mining project. Experience shows that data preparation takes
about 40-70% of the time in a data mining project. Many see this as a
major pain point, and it frequently slows down data mining projects.
One of the reasons for the high effort is the wide range of skills
that this task requires.
- On the one hand, deep knowledge of the database design and of
the SQL language is needed. If ETL tooling is used, in-depth
knowledge of the ETL tool is also required.
- On the other hand, defining the right transformations requires
knowledge of the semantics of the data and of how the data relates
to the business scenario. However, even experienced mining analysts
need a lot of time to write SQL code or to configure ETL operators.
As a consequence, the data preparation task is performed jointly
by a data warehouse administrator and a mining analyst. This gives
rise to another problem: delays caused by the communication overhead
between the people involved.