Stratified samples

If you specify a value other than NULL or an empty string for the parameter <stratSampleColumn>, a stratified sample is created for the training data set. The sample is drawn from the complement of the test data set, that is, from the records that are not in the test data set. In this sample, the values of the Stratified Sample column occur with approximately the same frequency.
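The following Python sketch illustrates the idea of a stratified sample. It is not the implementation that the procedures use. The function name stratified_sample, the column name churn, and the sample data are hypothetical, and the sketch assumes that each value is downsampled to the size of the rarest value so that all values occur equally often.

```python
import random
from collections import defaultdict

def stratified_sample(records, column, seed=0):
    """Return a sample in which every value of `column` occurs equally often.

    Assumption for this sketch: each group is downsampled to the size of
    the smallest group. The actual procedures might balance differently.
    """
    rng = random.Random(seed)

    # Group the records by the value of the stratification column.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[column]].append(rec)

    # The rarest value determines how many records are drawn per value.
    per_value = min(len(group) for group in groups.values())

    sample = []
    for group in groups.values():
        sample.extend(rng.sample(group, per_value))
    rng.shuffle(sample)
    return sample

# Hypothetical training data: 10 records with "yes", 990 records with "no".
training = [{"id": i, "churn": "yes" if i < 10 else "no"} for i in range(1000)]

sample = stratified_sample(training, "churn")
counts = {v: sum(1 for r in sample if r["churn"] == v) for v in ("yes", "no")}
print(counts)
```

In this sketch, the sample contains the same number of records for each value of the stratification column, although the original data is heavily unbalanced.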

Creating a stratified sample can be useful if you use classification and the distribution of the values of the target column is unbalanced. For example, a specific value might occur with a frequency of 99%. A classification model that always predicts this value is wrong in only 1% of the cases and therefore appears to be a good model. However, such a model never predicts any of the rare values, which is usually not what you want. Performing the training run on a stratified sample helps because the rare values are then as frequent in the training data as the dominant value.
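The 99% example can be reproduced with a short Python sketch. The function name majority_baseline and the labels are hypothetical. The sketch only shows why a high percentage of correct predictions can be misleading on unbalanced data.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    value, count = Counter(labels).most_common(1)[0]
    return value, count / len(labels)

# Hypothetical target column: 99% "no", 1% "yes".
labels = ["no"] * 99 + ["yes"] * 1

value, accuracy = majority_baseline(labels)

# Share of the rare "yes" records that the constant prediction identifies.
hits = sum(1 for y in labels if y == "yes" and value == "yes")
recall_rare = hits / labels.count("yes")

print(value, accuracy, recall_rare)
```

The constant prediction is correct for 99% of the records but does not identify a single record with the rare value.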

Apply models that are computed on a stratified sample with care. The confidence values that the classification model computes are not correct on a data set that does not have the same characteristics as the stratified sample. Correcting these confidence values requires expert knowledge in data mining. Therefore, it is recommended that you use the Easy Mining procedures on a non-stratified sample. The procedures automatically take into account the differences in the frequencies of the target values.
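To illustrate how strongly the confidence values can be off, the following Python sketch applies a commonly used adjustment for a two-valued target column. It is not the method that the Easy Mining procedures use. The function name correct_confidence and all numbers are hypothetical, and the sketch assumes that the stratified sample and the real data differ only in the frequencies of the target values.

```python
def correct_confidence(p_sample, prior_sample, prior_actual):
    """Rescale a confidence value from the sample frequencies to the real ones.

    p_sample:     confidence for the rare value, computed by a model that
                  was trained on the stratified sample
    prior_sample: frequency of the rare value in the stratified sample
    prior_actual: frequency of the rare value in the real data
    Assumption: only the frequencies of the target values differ.
    """
    pos = p_sample * prior_actual / prior_sample
    neg = (1.0 - p_sample) * (1.0 - prior_actual) / (1.0 - prior_sample)
    return pos / (pos + neg)

# Hypothetical case: the model reports a confidence of 0.9 for the rare
# value. The rare value makes up 50% of the stratified sample but only
# 1% of the real data.
corrected = correct_confidence(0.9, 0.5, 0.01)
print(round(corrected, 3))
```

Under these assumptions, a confidence of 0.9 on the stratified sample corresponds to a confidence of less than 0.1 on the real data. If the frequencies are the same in both data sets, the confidence value does not change.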

If you do not specify the Stratified Sample column, the training data set contains all records of the complement of the test data set, that is, all records that are not in the test data set.