Creating Decision Trees
The Decision Tree procedure creates a tree-based classification model. It classifies cases into groups or predicts values of a dependent (target) variable based on values of independent (predictor) variables. The procedure provides validation tools for exploratory and confirmatory classification analysis.
The procedure can be used for:
Segmentation. Identify persons who are likely to be members of a particular group.
Stratification. Assign cases into one of several categories, such as high-, medium-, and low-risk groups.
Prediction. Create rules and use them to predict future events, such as the likelihood that someone will default on a loan or the potential resale value of a vehicle or home.
Data reduction and variable screening. Select a useful subset of predictors from a large set of variables for use in building a formal parametric model.
Interaction identification. Identify relationships that pertain only to specific subgroups and specify these in a formal parametric model.
Category merging and discretizing continuous variables. Recode group predictor categories and continuous variables with minimal loss of information.
Example. A bank wants to categorize credit applicants according to whether or not they represent a reasonable credit risk. Based on various factors, including the known credit ratings of past customers, you can build a model to predict if future customers are likely to default on their loans.
A tree-based analysis provides some attractive features:
- It allows you to identify homogeneous groups with high or low risk.
- It makes it easy to construct rules for making predictions about individual cases.
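The bank example above can be sketched with a tree-based classifier. This is a hedged illustration using scikit-learn's `DecisionTreeClassifier` (a CRT-style tree, not the SPSS procedure itself), and the applicant records are made up for demonstration:

```python
# Hypothetical sketch of the credit-risk example using scikit-learn,
# not the SPSS Decision Tree procedure itself.
from sklearn.tree import DecisionTreeClassifier

# Made-up applicant records: [age, income_thousands, num_prior_defaults]
X = [
    [25, 30, 2], [40, 80, 0], [35, 60, 0], [22, 20, 3],
    [50, 90, 0], [30, 25, 1], [45, 70, 0], [28, 22, 2],
]
# Dependent (target) variable: 1 = defaulted, 0 = repaid
y = [1, 0, 0, 1, 0, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each terminal node groups applicants with similar risk, and the
# learned splits double as plain-language prediction rules.
high_risk = tree.predict([[24, 21, 2]])[0]  # applicant with prior defaults
low_risk = tree.predict([[48, 85, 0]])[0]   # applicant with none
```

In this toy data the defaults are perfectly separated by the prior-defaults predictor, so a shallow tree recovers a single clean rule; real credit data would of course be noisier.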
Data Considerations
Data. The dependent and independent variables can be:
- Nominal. A variable can be treated as nominal when its values represent categories with no intrinsic ranking (for example, the department of the company in which an employee works). Examples of nominal variables include region, postal code, and religious affiliation.
- Ordinal. A variable can be treated as ordinal when its values represent categories with some intrinsic ranking (for example, levels of service satisfaction from highly dissatisfied to highly satisfied). Examples of ordinal variables include attitude scores representing degree of satisfaction or confidence and preference rating scores.
- Scale. A variable can be treated as scale (continuous) when its values represent ordered categories with a meaningful metric, so that distance comparisons between values are appropriate. Examples of scale variables include age in years and income in thousands of dollars.
Frequency weights. If weighting is in effect, fractional weights are rounded to the closest integer, so cases with a weight value of less than 0.5 are assigned a weight of 0 and are therefore excluded from the analysis.
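The rounding rule can be sketched in a few lines of Python. Round-half-up is assumed here (consistent with weights below 0.5 becoming 0 and 0.5 itself becoming 1); note that Python's built-in `round()` uses banker's rounding instead:

```python
# Sketch of the documented rounding rule for fractional frequency weights:
# weights round to the closest integer, so values below 0.5 become 0 and
# those cases drop out of the analysis. Round-half-up is assumed.
import math

def effective_weight(w):
    """Round a fractional case weight to the closest integer."""
    return math.floor(w + 0.5)

weights = [0.3, 0.5, 0.49, 1.2, 2.7]
rounded = [effective_weight(w) for w in weights]
print(rounded)  # [0, 1, 0, 1, 3]

# Cases whose rounded weight is 0 are excluded from the analysis.
included = [w for w in rounded if w > 0]
```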
Assumptions. This procedure assumes that the appropriate measurement level has been assigned to all analysis variables, and some features assume that all values of the dependent variable included in the analysis have defined value labels.
- Measurement level. Measurement level affects the tree computations, so all variables should be assigned the appropriate measurement level. By default, numeric variables are assumed to be scale and string variables are assumed to be nominal, which may not accurately reflect the true measurement level. An icon next to each variable in the variable list identifies the variable type.
Icon | Measurement level |
---|---|
*(scale icon)* | Scale |
*(nominal icon)* | Nominal |
*(ordinal icon)* | Ordinal |
You can temporarily change the measurement level for a variable by right-clicking the variable in the source variable list and selecting a measurement level from the pop-up menu.
- Value labels. The dialog box interface for this procedure assumes that either all nonmissing values of a categorical (nominal, ordinal) dependent variable have defined value labels or none of them do. Some features are not available unless at least two nonmissing values of the categorical dependent variable have value labels. If at least two nonmissing values have defined value labels, any cases with other values that do not have value labels will be excluded from the analysis.
You can use Define Variable Properties to assist you in the process of defining both measurement level and value labels.
To Obtain Decision Trees
This feature requires the Decision Trees option.
- From the menus choose:
- Select a dependent variable.
- Select one or more independent variables.
- Select a growing method.
Optionally, you can:
- Change the measurement level for any variable in the source list.
- Force the first variable in the independent variables list into the model as the first split variable.
- Select an influence variable that defines how much influence a case has on the tree-growing process. Cases with lower influence values have less influence; cases with higher values have more. Influence variable values must be positive.
- Validate the tree.
- Customize the tree-growing criteria.
- Save terminal node numbers, predicted values, and predicted probabilities as variables.
- Save the model in XML (PMML) format.
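The saved-variables options have close analogues in scikit-learn, shown here as a hedged sketch (not SPSS output): `apply()` returns each case's terminal node number, `predict()` the predicted value, and `predict_proba()` the predicted probabilities.

```python
# Hedged sketch of the "save as variables" options using scikit-learn:
# terminal node numbers, predicted values, and predicted probabilities.
from sklearn.tree import DecisionTreeClassifier

X = [[0], [1], [2], [3]]   # made-up single-predictor data
y = [0, 0, 1, 1]
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

node_numbers = tree.apply(X)          # terminal (leaf) node id per case
predicted = tree.predict(X)           # predicted category per case
probabilities = tree.predict_proba(X) # per-category probabilities per case
```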
Fields with unknown measurement level
The Measurement Level alert is displayed when the measurement level for one or more variables (fields) in the dataset is unknown. Since measurement level affects the computation of results for this procedure, all variables must have a defined measurement level.
Scan Data. Reads the data in the active dataset and assigns a default measurement level to any fields with a currently unknown measurement level. If the dataset is large, this may take some time.
Assign Manually. Opens a dialog that lists all fields with an unknown measurement level. You can use this dialog to assign measurement level to those fields. You can also assign measurement level in Variable View of the Data Editor.
Since measurement level is important for this procedure, you cannot access the dialog to run this procedure until all fields have a defined measurement level.
Changing Measurement Level
- Right-click the variable in the source list.
- Select a measurement level from the pop-up menu.
This changes the measurement level temporarily for use in the Decision Tree procedure.
To permanently change the level of measurement for a variable, see Variable Measurement Level.
Growing Methods
The available growing methods are:
CHAID. Chi-squared Automatic Interaction Detection. At each step, CHAID chooses the independent (predictor) variable that has the strongest interaction with the dependent variable. Categories of each predictor are merged if they are not significantly different with respect to the dependent variable.
Exhaustive CHAID. A modification of CHAID that examines all possible splits for each predictor.
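The chi-square test behind CHAID's category merging can be illustrated as follows. This is a simplification of one step, not the full CHAID algorithm, and uses `scipy.stats.chi2_contingency` on made-up counts: two predictor categories are merge candidates when their outcome distributions do not differ significantly.

```python
# Illustrative sketch of CHAID-style category merging (one simplified
# step, not the full algorithm): merge two predictor categories when a
# chi-square test finds no significant difference in the dependent
# variable's distribution across them.
from scipy.stats import chi2_contingency

# Rows = two predictor categories; columns = counts of the dependent
# variable's outcomes (made-up data).
similar = [[30, 70], [28, 72]]    # nearly identical outcome mix
distinct = [[30, 70], [70, 30]]   # clearly different outcome mix

def merge_candidates(table, alpha=0.05):
    """True if the two categories are not significantly different."""
    _, p, _, _ = chi2_contingency(table)
    return p > alpha

print(merge_candidates(similar))   # similar pair -> merge
print(merge_candidates(distinct))  # different pair -> keep separate
```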
CRT. Classification and Regression Trees. CRT splits the data into segments that are as homogeneous as possible with respect to the dependent variable. A terminal node in which all cases have the same value for the dependent variable is a homogeneous, "pure" node.
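The homogeneity idea behind CRT is commonly measured with Gini impurity, which is 0 for a "pure" node and grows as a node becomes more mixed. A minimal sketch (CRT supports other impurity measures as well):

```python
# Minimal sketch of node homogeneity via Gini impurity: 0.0 for a pure
# node (all cases share one value of the dependent variable), larger for
# mixed nodes.
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

pure_node = ["default"] * 10                   # homogeneous terminal node
mixed_node = ["default"] * 5 + ["repaid"] * 5  # maximally mixed (2 classes)

print(gini_impurity(pure_node))   # 0.0
print(gini_impurity(mixed_node))  # 0.5
```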
QUEST. Quick, Unbiased, Efficient Statistical Tree. A method that is fast and avoids other methods' bias in favor of predictors with many categories. QUEST can be specified only if the dependent variable is nominal.
There are benefits and limitations with each method, including:
Feature | CHAID* | CRT | QUEST |
---|---|---|---|
Chi-square-based** | X | | |
Surrogate independent (predictor) variables | | X | X |
Tree pruning | | X | X |
Multiway node splitting | X | | |
Binary node splitting | | X | X |
Influence variables | X | X | |
Prior probabilities | | X | X |
Misclassification costs | X | X | X |
Fast calculation | X | | X |
*Includes Exhaustive CHAID.
**QUEST also uses a chi-square measure for nominal independent variables.