Evaluation

The stepwise selection methods guarantee that your model contains only "statistically significant" predictors, but they do not guarantee that the model is actually good at predicting the target. To assess predictive performance, you need to analyze scored records.

Figure 1. Cox nugget: Settings tab
  1. Place the model nugget on the canvas and attach it to the source node. Open the nugget and click the Settings tab.
  2. Select Time field and specify tenure. Each record will be scored at its length of tenure.
  3. Select Append all probabilities.

    This creates scores using 0.5 as the cutoff for whether a customer churns; if a customer's propensity to churn is greater than 0.5, that customer is scored as a churner. There is nothing magical about this number, and a different cutoff may yield more desirable results. An Evaluation node offers one way to think about choosing a cutoff.

    Figure 2. Evaluation node: Plot tab
  4. Attach an Evaluation node to the model nugget; on the Plot tab, select Include best line.
  5. Click the Options tab.
    Figure 3. Evaluation node: Options tab
  6. Select User defined score and type '$CP-1-1' as the expression. This is a model-generated field that corresponds to the propensity to churn.
  7. Click Run.
    Figure 4. Gains chart

    The cumulative gains chart shows the percentage of the overall number of cases in a given category "gained" by targeting a percentage of the total number of cases. For example, one point on the curve is at (10%, 15%), meaning that if you score a dataset with the model and sort all of the cases by predicted propensity to churn, you would expect the top 10% to contain approximately 15% of all of the cases that actually take the category 1 (churners). Likewise, the top 60% contains approximately 79.2% of the churners. If you select 100% of the scored dataset, you obtain all of the churners in the dataset.

    The diagonal line is the "baseline" curve; if you select 20% of the records from the scored dataset at random, you would expect to "gain" approximately 20% of all of the records that actually belong to category 1. The farther above the baseline a curve lies, the greater the gain. The "best" line shows the curve for a "perfect" model that assigns a higher churn propensity score to every churner than to any non-churner. You can use the cumulative gains chart to help choose a classification cutoff by choosing a percentage that corresponds to a desirable gain, and then mapping that percentage to the appropriate cutoff value.
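
The three curves described above can be sketched in a few lines of Python. This is only an illustration of how a cumulative gains value is computed, not how the Evaluation node is implemented; the function name `cumulative_gains` and the ten-record toy data are hypothetical and are not taken from the tutorial's dataset.

```python
def cumulative_gains(scores, actual, fractions):
    """Share of all churners captured in the top `fraction` of records.

    scores:    predicted propensities (higher = more likely to churn)
    actual:    1 for a churner, 0 for a non-churner
    fractions: fractions of the dataset to target, e.g. [0.2, 0.5, 1.0]
    """
    # Order the true outcomes by descending score, as the gains chart does.
    pairs = sorted(zip(scores, actual), key=lambda p: p[0], reverse=True)
    ranked = [a for _, a in pairs]
    total_positives = sum(ranked)
    gains = []
    for f in fractions:
        k = round(f * len(ranked))  # number of records targeted
        gains.append(sum(ranked[:k]) / total_positives)
    return gains


# Hypothetical toy data (assumption): 10 customers, 4 of whom churn.
scores = [0.91, 0.80, 0.75, 0.62, 0.55, 0.40, 0.33, 0.21, 0.15, 0.05]
actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
fractions = [0.2, 0.5, 1.0]

model = cumulative_gains(scores, actual, fractions)
# A "perfect" model ranks every churner first, so rank by the true label.
best = cumulative_gains(actual, actual, fractions)
# Random selection is expected to gain the same share as the fraction targeted.
baseline = list(fractions)

print("model:   ", model)     # [0.5, 0.75, 1.0]
print("best:    ", best)      # [0.5, 1.0, 1.0]
print("baseline:", baseline)  # [0.2, 0.5, 1.0]
```

In this toy example, the top 50% of records captures 75% of the churners, which lies above the 50% baseline and below the 100% reached by the "best" line.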

    What constitutes a "desirable" gain depends on the costs of the two kinds of misclassification. With churn as the positive class, classifying a churner as a non-churner is a false negative (conventionally a Type II error), and classifying a non-churner as a churner is a false positive (conventionally a Type I error). If customer retention is the primary concern, then you want to lower the false negative rate; on the cumulative gains chart, this might correspond to increased customer care for customers in the top 60% of predicted propensity of 1, which captures 79.2% of the possible churners but costs time and resources that could be spent acquiring new customers. If lowering the cost of maintaining your current customer base is the priority, then you want to lower the false positive rate. On the chart, this might correspond to increased customer care for the top 20%, which captures 32.5% of the churners. Usually, both are important concerns, so you have to choose a decision rule for classifying customers that gives the best mix of sensitivity and specificity.
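
One way to make this trade-off concrete is to attach a cost to each kind of misclassification and compare candidate cutoffs. The sketch below assumes made-up costs and toy data; the function `misclassification_cost`, the cost values, and the candidate cutoffs are all hypothetical and would have to come from your own business case.

```python
def misclassification_cost(scores, actual, cutoff, cost_missed_churner, cost_false_alarm):
    """Total cost when records scoring above `cutoff` are classified as churners."""
    cost = 0.0
    for score, is_churner in zip(scores, actual):
        predicted = 1 if score > cutoff else 0
        if is_churner == 1 and predicted == 0:
            cost += cost_missed_churner  # churner classified as non-churner
        elif is_churner == 0 and predicted == 1:
            cost += cost_false_alarm     # non-churner classified as churner
    return cost


# Hypothetical toy data (assumption): 10 customers, 4 of whom churn.
scores = [0.91, 0.80, 0.75, 0.62, 0.55, 0.40, 0.33, 0.21, 0.15, 0.05]
actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
candidates = [0.1, 0.3, 0.5, 0.7]

# Retention-focused scenario: missing a churner costs 5x a wasted contact.
retention_costs = {c: misclassification_cost(scores, actual, c, 5.0, 1.0) for c in candidates}
retention_cutoff = min(retention_costs, key=retention_costs.get)

# Cost-focused scenario: a wasted contact costs 5x a missed churner.
budget_costs = {c: misclassification_cost(scores, actual, c, 1.0, 5.0) for c in candidates}
budget_cutoff = min(budget_costs, key=budget_costs.get)

print(retention_cutoff)  # 0.3
print(budget_cutoff)     # 0.7
```

When missed churners are expensive, the cheaper rule uses a lower cutoff and flags more customers; when unnecessary customer care is expensive, a higher cutoff wins.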

    Figure 5. Sort node: Settings tab
  8. Suppose that you have decided that 45.6% is a desirable gain, which corresponds to taking the top 30% of records. To find an appropriate classification cutoff, attach a Sort node to the model nugget.
  9. On the Settings tab, choose to sort by $CP-1-1 in descending order and click OK.
    Figure 6. Table
  10. Attach a Table node to the Sort node.
  11. Open the Table node and click Run.

Scrolling down the output, you see that the value of $CP-1-1 is 0.248 for the 300th record, which marks the end of the top 30% of records. Using 0.248 as a classification cutoff should result in approximately 30% of the customers being scored as churners, capturing approximately 45.6% of the actual churners.
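
The sort-and-look-up procedure in steps 8 through 11 amounts to reading the score of the last record inside the chosen top fraction. A minimal Python sketch of that idea follows; the function name `cutoff_for_top_fraction` and the toy scores are hypothetical, and the sketch stands in for the Sort and Table nodes rather than reproducing them.

```python
def cutoff_for_top_fraction(scores, fraction):
    """Score of the last record inside the top `fraction`, ranked by descending score."""
    ranked = sorted(scores, reverse=True)
    k = round(fraction * len(ranked))  # e.g. 30% of 10 records -> the 3rd record
    if k < 1:
        raise ValueError("fraction selects no records")
    return ranked[k - 1]


# Hypothetical toy propensities (assumption), standing in for the $CP-1-1 field.
scores = [0.91, 0.80, 0.75, 0.62, 0.55, 0.40, 0.33, 0.21, 0.15, 0.05]

cutoff = cutoff_for_top_fraction(scores, 0.3)
# Use >= so that the record that defines the cutoff is itself flagged.
flagged = [s for s in scores if s >= cutoff]

print(cutoff)        # 0.75
print(len(flagged))  # 3
```

If several records share the cutoff score, slightly more than the requested fraction is flagged, which is why the resulting share of customers is only approximate.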