What is a confusion matrix?

Published: 19 January 2024
Contributors: Jacob Murel Ph.D., Eda Kavlakoglu

 

The confusion matrix helps assess classification model performance in machine learning by comparing predicted values against actual values for a dataset.

A confusion matrix (or error matrix) is a visualization method for classifier algorithm results. More specifically, it is a table that breaks down the number of ground truth instances of a given class against the number of predicted instances of that class. Confusion matrices are one of several evaluation tools for measuring the performance of a classification model, and they can be used to calculate a number of other performance metrics, such as precision and recall.

Confusion matrices can be used with any classifier algorithm, such as Naïve Bayes, logistic regression models, decision trees, and so forth. Because of their wide applicability in data science and machine learning models, many packages and libraries come preloaded with functions for creating confusion matrices, such as scikit-learn's sklearn.metrics module for Python.
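For instance, a minimal sketch of generating a confusion matrix with scikit-learn might look like the following; the label lists are invented purely for illustration.

```python
# A minimal sketch with scikit-learn; the label lists below are invented
# purely for illustration.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # ground truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]  # classifier predictions

# By default, rows correspond to actual classes and columns to predicted
# classes, with class labels sorted in ascending order (0 first, then 1).
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[4 1]
#  [1 4]]
```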


The confusion matrix layout

In a confusion matrix, columns represent the predicted values for a given class while rows represent the actual values (i.e. ground truth), although the reverse arrangement also appears in research. This grid structure is a convenient tool for visualizing model classification accuracy by displaying the number of correct and incorrect predictions for all classes alongside one another.

A standard confusion matrix template for a binary classifier may look like this:

                        Predicted positive      Predicted negative
    Actual positive     True positive (TP)      False negative (FN)
    Actual negative     False positive (FP)     True negative (TN)

The top-left box shows the number of true positives (TP): the correct predictions for the positive class. The box beneath it shows the false positives (FP): negative-class instances incorrectly identified as positive, also called type I errors in statistics. The top-right box shows the number of false negatives (FN): actual positive instances erroneously predicted as negative. Finally, the bottom-right box displays the number of true negatives (TN): actual negative-class instances accurately predicted as negative. Summing these four values gives the model's total number of predictions.1
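In code, these four counts can be read straight off the matrix. Below is a sketch using scikit-learn with invented labels; note that scikit-learn orders classes in ascending order by default, so passing labels=[1, 0] reproduces the layout described above, with TP in the top-left box.

```python
# A sketch of pulling TP, FP, FN and TN out of a binary confusion matrix;
# the counts come from invented labels. Passing labels=[1, 0] puts the
# positive class first, matching the template described above (TP top-left).
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn, fp, tn = cm.ravel()  # rows: actual [1, 0]; columns: predicted [1, 0]
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")  # TP=3, FN=1, FP=1, TN=3
print(tp + fn + fp + tn)  # 8, the total number of predictions
```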

Of course, this template is for a rudimentary binary classification problem. The confusion matrix can visualize results for multiclass classification problems as well. For example, imagine that we are developing a species classification model as part of a marine life conservation program. The model predicts fish species. A confusion matrix for such a multiclass classification problem may look like this:

The diagonal boxes all indicate correct predictions (true positives for each class). The other boxes provide the counts of false positives, false negatives, and true negatives, depending on which class one chooses to focus on.
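As a sketch of what such a matrix looks like in practice, the snippet below builds a small multiclass confusion matrix with scikit-learn; the species names and label lists are invented for illustration.

```python
# A sketch of a multiclass confusion matrix for the hypothetical fish-species
# classifier; the species names and labels are invented for illustration.
from sklearn.metrics import confusion_matrix

species = ["cod", "haddock", "mackerel"]
y_true = ["cod", "cod", "haddock", "mackerel", "haddock", "cod", "mackerel", "haddock"]
y_pred = ["cod", "haddock", "haddock", "mackerel", "haddock", "cod", "cod", "mackerel"]

# The diagonal holds correct predictions for each species; off-diagonal
# cells show which species the model confuses with which.
cm = confusion_matrix(y_true, y_pred, labels=species)
print(cm)
# [[2 1 0]
#  [0 2 1]
#  [1 0 1]]
```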

Using the confusion matrix for model evaluation

Given its readily accessible visualization of classifier predictive results, the confusion matrix is useful for calculating other model evaluation metrics. Values can simply be pulled from the matrix and plugged into a number of equations for measuring model performance.

Accuracy

Accuracy is the proportion of all predictions that a model classifies correctly, that is, (TP + TN) divided by the total number of predictions. On its own, however, accuracy is not a wholly informative evaluation metric for classifiers. For instance, imagine we run a classifier on a data set of 100 instances. The model’s confusion matrix shows only one false negative and no false positives; the model correctly classifies every other data instance. Thus the model has an accuracy of 99%. Though ostensibly desirable, high accuracy is not in itself indicative of excellent model performance. For instance, say our model aims to classify highly contagious diseases. That 1% misclassification poses an enormous risk. Thus, other evaluation metrics can be used to provide a better picture of classification algorithm performance.
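A worked version of that 100-instance example, with the split between true positives and true negatives assumed purely for illustration:

```python
# A worked version of the 100-instance example: one false negative, no false
# positives, everything else correct. The split between true positives and
# true negatives is assumed for illustration.
tp, tn, fp, fn = 49, 50, 0, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.99
```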

Precision and recall

Precision is the proportion of positive class predictions that actually belong to the class in question.2 Another way of understanding precision is that it measures the likelihood that a randomly chosen instance predicted to belong to a certain class actually belongs to that class.3 Precision may also be called positive predicted value (PPV). It is represented by the equation:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall denotes the percentage of actual class instances detected by a model.4 In other words, it indicates the proportion of actual instances of a given class that the model correctly predicts as belonging to that class.5 Recall is also known as sensitivity or true positive rate (TPR) and is represented by the equation:

$$\text{Recall} = \frac{TP}{TP + FN}$$
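A sketch of both calculations from confusion matrix counts, cross-checked against scikit-learn's built-in scorers; the labels are invented for illustration.

```python
# A sketch computing precision and recall from confusion matrix counts and
# cross-checking against scikit-learn's scorers; the labels are invented.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # TP / (TP + FP)
recall = tp / (tp + fn)     # TP / (TP + FN)
print(precision, recall)    # 0.75 0.75

# The built-in scorers return the same values.
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```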

F1 score

Precision and recall can share an inverse relationship at times. As a model increases recall by returning more actual class instances (i.e. true positives), the model will inevitably misclassify non-instances (i.e. false positives) as well, thereby decreasing precision.6 The F1 score attempts to combine precision and recall to resolve this tradeoff.

The F1 score, also called the F-score, F-measure, or the harmonic mean of precision and recall, combines precision and recall to represent a model’s total class-wise accuracy. Using these two values, one can calculate the F1 score with the following equation, where P denotes precision (PPV) and R denotes recall (sensitivity):

$$F1 = \frac{2PR}{P + R}$$
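A sketch of the calculation, continuing the illustrative precision and recall values from the example above:

```python
# A sketch of the F1 calculation, continuing the illustrative precision and
# recall values (both 0.75) from the example above.
from sklearn.metrics import f1_score

p, r = 0.75, 0.75
f1 = 2 * p * r / (p + r)
print(f1)  # 0.75

# Equivalent call on the same invented labels used earlier.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(f1_score(y_true, y_pred))  # 0.75
```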

The F1 score is particularly useful for imbalanced datasets, in which the precision-recall tradeoff can be most apparent. For example, say we have a classifier predicting the likelihood of a rare disease. A model that predicts no one in our test dataset has the disease may have perfect precision (by convention, since it makes no positive predictions and thus no false positives) yet zero recall. Meanwhile, a model that predicts everyone in our dataset has the disease would return perfect recall but precision equal to the percentage of people who actually have the disease (e.g. 0.00001% if only one in every ten million people have the disease). The F1 score is a means of balancing these two values to obtain a more holistic view of a classifier’s performance.7

Some researchers criticize the use of the F1 score as a performance metric. Such arguments typically claim that the F1 score gives equal weight to precision and recall, which may not be equally important performance metrics for all datasets.8 In response, researchers have proffered modified variants of the F1 score.9

Conditional measures

Conditional measures describe a model’s accuracy in detecting instances or non-instances of a certain class. Recall, also known as true positive rate (TPR) or sensitivity, is one such measure, indicating the proportion of actual class instances that the model correctly identifies. Specificity, or true negative rate (TNR), is the other conditional measure. It measures the proportion of correct negative predictions out of all actual non-instances of a given class. One can compute specificity with the equation:10

$$\text{Specificity} = \frac{TN}{TN + FP}$$
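scikit-learn has no dedicated specificity scorer, but the value can be derived directly from the matrix counts, as in the sketch below (labels again invented for illustration).

```python
# A sketch of specificity (true negative rate) derived from confusion matrix
# counts; scikit-learn has no dedicated specificity scorer. Labels invented.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)  # TN / (TN + FP)
print(specificity)            # 0.8333...
```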

False positive rate

Specificity helps calculate a model’s false positive rate (FPR). Other classifier evaluation visualizations, notably the ROC curve and AUC, utilize FPR. FPR is the probability that a model will falsely classify a non-instance of a certain class as part of that class. Thus, per its name, it represents the rate at which a model returns false positives, known as type I errors in statistics:

$$FPR = \frac{FP}{FP + TN} = 1 - \text{Specificity}$$

While type I errors refer to false positives, type II errors denote false negatives: actual instances of a given class erroneously classified as not part of that class. Per its name, the false negative rate (FNR) denotes the probability that a model erroneously classifies an actual class instance as not part of that class. Much as FPR corresponds to specificity, FNR corresponds to sensitivity:

$$FNR = \frac{FN}{FN + TP} = 1 - \text{Recall}$$
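Both rates follow directly from the matrix counts, as in this sketch with the same invented labels used in the earlier examples.

```python
# A sketch of the false positive and false negative rates, derived from the
# same invented labels used in the earlier examples.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

fpr = fp / (fp + tn)  # equivalently, 1 - specificity
fnr = fn / (fn + tp)  # equivalently, 1 - recall (sensitivity)
print(fpr, fnr)       # 0.1666... 0.25
```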

Note that FNR is often not reported in the literature because it requires knowing the total number of actual instances for a given class, which can remain unknown in unseen test datasets.11

Unconditional metrics

Unconditional metrics are those that represent the chances of a specific class occurring or not occurring according to the model. Precision, or positive predicted value (PPV), is one unconditional metric. As mentioned, it measures the likelihood that an instance predicted to belong to a certain class actually belongs to it. The other unconditional metric, negative predicted value (NPV), is the probability that an instance predicted not to belong to a class truly does not belong to it. Essentially, both unconditional metrics attempt to answer whether a randomly chosen instance belongs to a specific class or not, given the model’s prediction. One can compute NPV with the equation:12

$$NPV = \frac{TN}{TN + FN}$$
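A sketch of the NPV calculation from confusion matrix counts, using the same invented labels as before:

```python
# A sketch of negative predicted value (NPV) from confusion matrix counts;
# labels are invented for illustration.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
npv = tn / (tn + fn)  # TN / (TN + FN)
print(npv)            # 0.8333...
```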


Footnotes

1 Kai Ming Ting, “Confusion matrix,” Encyclopedia of Machine Learning and Data Mining, Springer, 2018.

2 Ethan Zhang and Yi Zhang, “Precision,” Encyclopedia of Database Systems, Springer, 2018.

3 Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2016.

4 Ethan Zhang and Yi Zhang, “Recall,” Encyclopedia of Database Systems, Springer, 2018.

5 Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2016.

6 Ben Carterette, “Precision and Recall,” Encyclopedia of Database Systems, Springer, 2018.

7 Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, MIT Press, 2016, https://www.deeplearningbook.org/ (link resides outside ibm.com). Kevin Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.

8 David Hand and Peter Christen, "A note on using the F-measure for evaluating record linkage algorithms," Statistics and Computing, Vol. 28, 2018, pp. 539–547, https://link.springer.com/article/10.1007/s11222-017-9746-6 (link resides outside ibm.com).

9 David Hand, Peter Christen, and Nishadi Kirielle, "F*: an interpretable transformation of the F-measure," Machine Learning, Vol. 110, 2021, pp. 451–456, https://link.springer.com/article/10.1007/s10994-021-05964-1 (link resides outside ibm.com). Davide Chicco and Giuseppe Jurman, "The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation," BMC Genomics, Vol. 21, 2020, https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-019-6413-7 (link resides outside ibm.com).

10 Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2016.

11 Allen Downey, Think Stats, 2nd edition, O’Reilly, 2014.

12 Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2016.