Background of data transformation
Data transformation comprises data imputation, data splitting, and transformations through standardization and normalization.
Data imputation
Many analytic algorithms require that the data set contain no missing attribute values. However, real-world data sets often have missing values. Data imputation is one of the most popular approaches to handling them.
Data imputation replaces missing values with reasonably guessed, usable values so that you can run the algorithm despite missing attribute values. Reasonable replacement values are, for example, the mode (the most frequent value) for a discrete attribute, or the mean or median for a continuous attribute.
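As an illustration (not tied to any particular product), the mode, mean, and median strategies could be sketched as follows, assuming that missing entries are represented as None:

```python
from statistics import mean, median, mode

def impute(values, strategy):
    """Replace None entries with a value derived from the observed ones."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":        # continuous attributes
        fill = mean(observed)
    elif strategy == "median":    # continuous attributes, robust to outliers
        fill = median(observed)
    else:                         # "mode": discrete attributes
        fill = mode(observed)
    return [fill if v is None else v for v in values]
```

For example, `impute([1, None, 3], "mean")` fills the gap with the mean of the observed values, 2.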
Other approaches are available, but they have limitations. Examples are:
- Ignoring
- Instances with missing values of one or more attributes are skipped.
- Codomain extension
- A missing value is treated as an extra value that is added to the codomain of an attribute.
- Internal processing
- Internal missing values are handled by algorithm-specific techniques.
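The first two alternatives could be sketched like this (a minimal illustration, again assuming that missing entries are represented as None; the token "MISSING" is a hypothetical placeholder):

```python
def drop_incomplete(rows):
    # "Ignoring": skip instances with missing values of one or more attributes
    return [r for r in rows if None not in r]

def extend_codomain(values, missing_token="MISSING"):
    # "Codomain extension": treat missing as an extra value in the attribute's codomain
    return [missing_token if v is None else v for v in values]
```

Ignoring can discard a large share of the data when missing values are frequent, which is one of the limitations mentioned above.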
Data splitting
Before you use the stored procedures for classification or regression, you might want to split your input table into the following disjoint data sets:
- One data set to train the prediction model
- One data set to test the prediction model
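A simple random split could look like the following sketch (the function name, the 70/30 default, and the fixed seed are illustrative assumptions, not part of any product API):

```python
import random

def train_test_split(rows, test_fraction=0.3, seed=0):
    """Randomly partition rows into disjoint training and test sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)       # fixed seed for reproducibility
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]           # train set, test set
```

Because the two slices never overlap, every instance is used either for training or for testing, but not both.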
Standardization and normalization
Standardization and normalization are methods of data preprocessing. They are applied to numeric attributes so that the attributes have comparable scales: standardization rescales an attribute to zero mean and unit standard deviation, and normalization rescales it to a fixed range, such as [0, 1].
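The two transformations could be sketched as follows (a minimal illustration; the function names and the default [0, 1] range are assumptions):

```python
from statistics import mean, stdev

def standardize(values):
    # z-score: rescale to zero mean and unit standard deviation
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def normalize(values, lo=0.0, hi=1.0):
    # min-max scaling: rescale to the range [lo, hi]
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]
```

After standardization, every numeric attribute has the same mean (0) and standard deviation (1), which keeps attributes with large raw magnitudes from dominating distance-based algorithms.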