Background of data transformation

Data transformation comprises data imputation, data splitting, and transformations through standardization and normalization.

Data imputation

Many analytic algorithms require that the data set does not have missing attribute values. However, real-world data sets often have missing attribute values. Data imputation is one of the most popular approaches to handling them.

Data imputation replaces missing values with reasonably guessed, usable values so that you can run the algorithm despite missing attribute values. Reasonably guessed, usable values are, for example, modes (the most frequent values) for discrete attributes, or means or medians for continuous attributes.
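
The following sketch shows one way to apply such imputation with pandas in Python. The data frame df and the column names are illustrative assumptions, not part of any product interface.

    # Sketch: mode and mean imputation (illustrative data).
    import pandas as pd

    df = pd.DataFrame({
        "color": ["red", "blue", None, "red"],   # discrete attribute
        "price": [10.0, None, 12.0, 14.0],       # continuous attribute
    })

    # Discrete attribute: replace missing values with the mode
    # (the most frequent value).
    df["color"] = df["color"].fillna(df["color"].mode().iloc[0])

    # Continuous attribute: replace missing values with the mean;
    # use .median() instead if the distribution is skewed.
    df["price"] = df["price"].fillna(df["price"].mean())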

Other approaches are available, but they have limitations. For example:

Ignoring
Instances with missing values for one or more attributes are skipped.
The ignoring approach is not practical when instances with missing values comprise a large fraction of the data set. Even if the number of missing values is minimal, those values might have a negative impact on the quality of the analysis result.
Codomain extension
missing is considered an extra value that is added to the codomain of an attribute.
The codomain extension approach is applicable only to discrete attributes, for which the term missing, or any other term that is used, can be considered an actual value. This approach might produce good results with predictive modeling algorithms if the distribution of missing values in the training set that is used to create the model is comparable to the distribution in the data set to which the model is applied. A minimal sketch that contrasts the ignoring and codomain extension approaches follows this list.
Internal processing
Missing values are handled internally by algorithm-specific techniques.
This approach might produce superior results, but typically adds considerable computational complexity.
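
The following sketch contrasts the ignoring and codomain extension approaches with pandas in Python. The data frame df and its column names are illustrative assumptions, not part of any product interface.

    # Sketch: ignoring vs. codomain extension (illustrative data).
    import pandas as pd

    df = pd.DataFrame({
        "color": ["red", None, "blue"],   # discrete attribute
        "size":  ["S", "M", None],        # discrete attribute
    })

    # Ignoring: skip every instance (row) that has a missing value.
    ignored = df.dropna()

    # Codomain extension: treat "missing" as an extra value that is
    # added to the codomain of each discrete attribute.
    extended = df.fillna("missing")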

Data splitting

Before you use the stored procedures for classification or regression, you might want to split the input table into the following disjoint data sets, as shown in the sketch after this list:

  • One data set to train the prediction model
  • One data set to test the prediction model
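
The following sketch shows one way to produce such a disjoint split with pandas in Python. The 80/20 ratio, the random seed, and the data frame df are illustrative assumptions; they do not describe the behavior of the stored procedures.

    # Sketch: random train/test split (illustrative data and ratio).
    import pandas as pd

    df = pd.DataFrame({"x": range(10), "y": [0, 1] * 5})

    # Sample 80% of the instances for training; the remaining 20%
    # form the disjoint test set.
    train = df.sample(frac=0.8, random_state=42)
    test = df.drop(train.index)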

Standardization and normalization

Standardization and normalization are methods of data preprocessing. They are applied to numeric attributes so that attributes that are measured on different scales have similar means and standard deviations. Standardization typically rescales an attribute to a mean of 0 and a standard deviation of 1, while normalization typically rescales values to a fixed range, such as [0, 1].
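
The following sketch shows z-score standardization and min-max normalization for a single numeric attribute with pandas in Python. The sample values are illustrative assumptions.

    # Sketch: standardization and normalization (illustrative data).
    import pandas as pd

    values = pd.Series([10.0, 20.0, 30.0, 40.0])

    # Standardization: rescale to mean 0 and standard deviation 1.
    standardized = (values - values.mean()) / values.std()

    # Normalization: rescale to the range [0, 1].
    normalized = (values - values.min()) / (values.max() - values.min())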