Overview (TWOSTEP CLUSTER command)

TWOSTEP CLUSTER groups observations into clusters based on a nearness criterion. It uses a hierarchical agglomerative clustering procedure in which individual cases are successively combined to form clusters whose centers are far apart. The algorithm is designed to cluster large numbers of cases: it makes one pass through the data to find the cluster centers and a second pass to assign cluster memberships. In addition to requiring few data passes, the procedure lets you set the amount of memory used by the clustering algorithm.

Basic Features

Cluster Features (CF) Tree. TWOSTEP CLUSTER clusters observations by building a data structure called the CF tree, which contains the cluster centers. The CF tree is grown during the first stage of clustering and values are added to its leaves if they are close to the cluster center of a particular leaf.
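As a rough illustration of how values are added to a leaf, the following Python sketch keeps a flat list of leaf entries rather than a full tree. Each entry stores a count, a per-variable sum, and a per-variable sum of squares, from which its center can be derived. The names `LeafEntry`, `insert_case`, and `threshold`, and the use of Euclidean distance to the center, are assumptions made for this example; they are not part of TWOSTEP CLUSTER.

```python
import math
from dataclasses import dataclass, field


@dataclass
class LeafEntry:
    """Summary of the cases absorbed by one leaf entry (hypothetical structure)."""
    n: int = 0
    linear_sum: list = field(default_factory=list)
    square_sum: list = field(default_factory=list)

    def add(self, case):
        # Lazily size the running sums on the first case.
        if self.n == 0:
            self.linear_sum = [0.0] * len(case)
            self.square_sum = [0.0] * len(case)
        self.n += 1
        for k, x in enumerate(case):
            self.linear_sum[k] += x
            self.square_sum[k] += x * x

    def center(self):
        return [s / self.n for s in self.linear_sum]


def insert_case(entries, case, threshold):
    """Add the case to the nearest entry if it lies within `threshold` of that
    entry's center; otherwise start a new entry. Returns the entry index."""
    best_index, best_dist = None, math.inf
    for i, entry in enumerate(entries):
        dist = math.dist(case, entry.center())
        if dist < best_dist:
            best_index, best_dist = i, dist
    if best_index is not None and best_dist <= threshold:
        entries[best_index].add(case)
        return best_index
    new_entry = LeafEntry()
    new_entry.add(case)
    entries.append(new_entry)
    return len(entries) - 1


entries = []
for case in [(1.0, 1.0), (1.2, 0.8), (9.0, 9.0), (1.1, 1.1)]:
    insert_case(entries, case, threshold=1.0)
```

With this input, the three cases near (1, 1) share one entry and the case at (9, 9) starts a second one. Because only counts and sums are stored, the individual cases do not need to be kept in memory.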

Distance Measure. Two distance measures are offered: the traditional Euclidean distance and the likelihood distance. Euclidean distance is available only when no categorical variables are specified. The likelihood distance is especially useful when categorical variables are used. The likelihood function is computed using the normal density for continuous variables and the multinomial probability mass function for categorical variables. All variables are treated as independent.
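To make the likelihood computation concrete, the Python sketch below evaluates the log-likelihood of one case under a single cluster. It sums a normal log-density term for each continuous variable and a log category probability for each categorical variable, which is what the independence assumption implies. The function name `case_log_likelihood` and the layout of the cluster parameters are hypothetical. The sketch does not show how TWOSTEP CLUSTER estimates those parameters or how it turns likelihoods into a distance between clusters.

```python
import math


def case_log_likelihood(continuous, categorical, means, variances, category_probs):
    """Log-likelihood of one case under one cluster, treating variables as independent.

    continuous      -- values of the continuous variables
    categorical     -- category labels of the categorical variables
    means/variances -- per-variable normal parameters of the cluster (assumed known)
    category_probs  -- one dict per categorical variable, mapping label -> probability
    """
    total = 0.0
    # Normal log-density for each continuous variable.
    for x, mu, var in zip(continuous, means, variances):
        total += -0.5 * math.log(2.0 * math.pi * var) - (x - mu) ** 2 / (2.0 * var)
    # Log probability of the observed category for each categorical variable.
    for label, probs in zip(categorical, category_probs):
        total += math.log(probs[label])
    return total


# Illustrative cluster parameters (made up for this example).
cluster = {
    "means": [0.0, 10.0],
    "variances": [1.0, 4.0],
    "category_probs": [{"a": 0.7, "b": 0.3}],
}
near = case_log_likelihood([0.1, 10.5], ["a"], **cluster)
far = case_log_likelihood([3.0, 2.0], ["b"], **cluster)
```

A case close to the cluster's means and in its most common category (`near`) receives a higher log-likelihood than one that is far away on every variable (`far`).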

Tuning the Algorithm. You can control the values of algorithm-tuning parameters with the CRITERIA subcommand.

Noise Handling. The clustering algorithm can optionally retain any outliers that do not fit in the CF tree. If possible, these values will be placed in the CF tree after it is completed. Otherwise, TWOSTEP CLUSTER will discard them after preclustering.

Missing Values. TWOSTEP CLUSTER deletes listwise any record that has a missing value in one of its fields.
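Listwise deletion means that a record is dropped as soon as any one of its analysis fields is missing. A minimal Python sketch of the idea follows, using `None` to stand for a missing value; the helper `drop_incomplete` and the sample records are hypothetical.

```python
def drop_incomplete(records, fields):
    """Keep only the records in which every listed field is present (not None)."""
    return [r for r in records if all(r.get(f) is not None for f in fields)]


records = [
    {"age": 34, "income": 52000.0, "region": "north"},
    {"age": None, "income": 48000.0, "region": "south"},
    {"age": 51, "income": 61000.0, "region": None},
]
complete = drop_incomplete(records, ["age", "income", "region"])
```

Only the first record survives here, since each of the other two lacks one field.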

Number of Clusters. The NUMCLUSTERS subcommand specifies the number of clusters into which the data will be partitioned. Alternatively, you can have TWOSTEP CLUSTER select the number of clusters automatically.

Optional Output. You can specify output to an XML file with the OUTFILE subcommand. The cluster membership for each case used can be saved to the active dataset with the SAVE subcommand.

Weights. TWOSTEP CLUSTER ignores any specification on the WEIGHT command.

Basic Specification

  • The minimum specification is a list of variables, either categorical or continuous, to be clustered and at least one of the following subcommands: OUTFILE, PRINT, or SAVE.
  • The number of clusters may be specified with the NUMCLUSTERS subcommand.
  • Unless the NOSTANDARDIZE subcommand is given, TWOSTEP CLUSTER will standardize all continuous variables.
  • If DISTANCE is Euclidean, TWOSTEP CLUSTER will accept only continuous variables.
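The text above does not say how continuous variables are standardized. One common approach, shown here purely as an assumption, is to convert each variable to z-scores by subtracting its mean and dividing by its sample standard deviation. The function name `standardize` is hypothetical.

```python
import statistics


def standardize(values):
    """Return z-scores: subtract the mean, divide by the sample standard deviation.

    Assumption for illustration only; the exact method used by
    TWOSTEP CLUSTER is not given in this section.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


z = standardize([2.0, 4.0, 6.0, 8.0])
```

After this transformation each variable has mean 0 and standard deviation 1, so variables measured on large scales do not dominate the distance computation.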

Subcommand Order

  • The subcommands can be specified in any order.

Syntax Rules

  • Minimum syntax: a variable must be specified.
  • Empty subcommands are silently ignored.
  • Variables listed in the CONTINUOUS subcommand must be numeric.
  • If a subcommand is specified more than once, TWOSTEP CLUSTER ignores all but the last specification.