Cross-Validation

This panel is activated only if the objective is to predict a target. The options on this panel control whether to use cross-validation when calculating the nearest neighbors.

Cross-validation divides the sample into a number of subsamples, or folds. Nearest neighbor models are then generated, excluding the data from each fold in turn. The first model is based on all of the cases except those in the first fold, the second model is based on all of the cases except those in the second fold, and so on. For each model, the error is estimated by applying the model to the fold that was excluded when generating it. The "best" number of nearest neighbors is the one that produces the lowest error across folds.
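
The procedure described above can be sketched in a few lines of Python. This is only an illustration, not the product's implementation: the function names (knn_predict, choose_k), the squared Euclidean distance, the majority-vote classifier, and the misclassification rate used as the error measure are all assumptions made for the example.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Predict the majority label among the k training cases nearest to query."""
    # train is a list of (features, label) pairs; distance is squared Euclidean
    # (an assumption for this sketch).
    nearest = sorted(
        train,
        key=lambda case: sum((a - b) ** 2 for a, b in zip(case[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def choose_k(cases, folds, candidate_ks):
    """Return (best_k, error_by_k), where folds[i] is the fold number of cases[i]."""
    error_by_k = {}
    for k in candidate_ks:
        errors = 0
        for fold in sorted(set(folds)):
            # Build the model from every case except those in the current fold...
            train = [c for c, f in zip(cases, folds) if f != fold]
            # ...and estimate its error on the fold that was left out.
            held_out = [c for c, f in zip(cases, folds) if f == fold]
            errors += sum(knn_predict(train, x, k) != y for x, y in held_out)
        error_by_k[k] = errors / len(cases)
    # The "best" k has the lowest error across folds (ties go to the first candidate).
    best_k = min(candidate_ks, key=lambda k: error_by_k[k])
    return best_k, error_by_k

# Made-up example data: two well-separated groups, four folds.
cases = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"), ((0.3, 0.3), "a"),
    ((5.0, 5.0), "b"), ((5.1, 5.2), "b"), ((5.2, 5.1), "b"), ((5.3, 5.3), "b"),
]
folds = [1, 2, 3, 4, 1, 2, 3, 4]
best_k, error_by_k = choose_k(cases, folds, candidate_ks=[1, 3, 5])
```

Each case is held out exactly once, so the error for a given number of neighbors is the proportion of cases misclassified by a model that never saw them.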

Cross-Validation Folds. V-fold cross-validation is used to determine the "best" number of neighbors. For performance reasons, it is not available in conjunction with feature selection.

  • Randomly assign cases to folds. Specify the number of folds that should be used for cross-validation. The procedure randomly assigns cases to folds, numbered from 1 to V, the number of folds.
  • Set random seed. When estimating the accuracy of a model based on a random percentage, this option allows you to duplicate the same results in another session. By specifying the starting value used by the random number generator, you can ensure the same records are assigned each time the node is executed. Enter the desired seed value. If this option is not selected, a different sample will be generated each time the node is executed.
  • Use field to assign cases. Specify a numeric field that assigns each case in the active dataset to a fold. The field must be numeric and take values from 1 to V. An error occurs if any value in this range is missing from the field, or missing within any split when split models are in effect.
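
As a rough illustration of the two assignment options, the following Python sketch assigns cases to folds at random with an optional seed, and checks a user-supplied fold field. The function names are hypothetical, and the balancing of fold sizes is an assumption of the sketch; the text above does not say how the procedure distributes cases among folds.

```python
import random

def assign_folds_randomly(n_cases, n_folds, seed=None):
    """Randomly assign each case a fold number from 1 to n_folds."""
    # A fixed seed reproduces the same assignment; seed=None differs per run.
    rng = random.Random(seed)
    # Cycle through 1..V so fold sizes differ by at most one (an assumption
    # of this sketch), then shuffle the assignments.
    folds = [(i % n_folds) + 1 for i in range(n_cases)]
    rng.shuffle(folds)
    return folds

def validate_fold_field(values, n_folds):
    """Raise ValueError unless the field uses every fold number from 1 to n_folds."""
    allowed = set(range(1, n_folds + 1))
    present = set(values)
    if not present <= allowed:
        raise ValueError(f"fold values outside 1..{n_folds}: {sorted(present - allowed)}")
    if present != allowed:
        raise ValueError(f"folds with no cases: {sorted(allowed - present)}")

# The same seed gives the same assignment on every run.
first = assign_folds_randomly(10, 3, seed=42)
second = assign_folds_randomly(10, 3, seed=42)

# A fold field that covers every value from 1 to V passes the check.
validate_fold_field([1, 2, 3, 1, 2, 3], 3)
```

When split models are in effect, a check like validate_fold_field would need to be applied to each split separately, since every split must contain all fold values.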