Published: 31 July 2024
Contributors: Jacob Murel Ph.D., Eda Kavlakoglu
Classification models are a type of predictive modeling that organizes data into predefined classes according to feature values.
Classification models are a type of machine learning model that divides data points into predefined groups called classes. Classifiers learn class characteristics from labeled input data and then assign classes to new data points according to those learned characteristics.1 Classification algorithms are widely used in data science to forecast patterns and predict outcomes. They have an array of real-world use cases, such as classifying patients by potential health risk and filtering spam email.
Classification tasks can be binary or multiclass. In binary classification problems, a model predicts between two classes. For example, a spam filter classifies emails as spam or not spam. Multiclass classification problems classify data among more than two class labels. For instance, an image classifier might sort images of pets into many class labels, such as dog, cat, llama and platypus.
Some sources, especially online, refer to classification as a form of supervised machine learning. But classifiers do not fall exclusively within the domain of supervised learning. Unsupervised learning approaches to classification problems have been a key focus of recent research.
Of course, each machine learning classification algorithm differs in its internal operations. All nevertheless adhere to a general two-step data classification process:
Learning. In supervised learning, a human annotator assigns each data point in the training dataset a class label. These points are defined by a number of input variables (or independent variables), which might be numerical values, text strings, image features and so forth. In mathematical terms, the model treats each data point as a tuple x, an ordered sequence of feature values represented as x = (x1, x2, x3, …, xn). The model uses each data point's features along with its class label to work out which features characterize each class. By mapping the features of the training data to their class labels, the model learns the general features (or variables) associated with each class label.
Classification. The second step in classification tasks is classification itself. In this phase, users deploy the model on a test set of unseen data. Previously unused data is ideal for evaluating the model's classifications because it helps avoid overfitting. The model uses its learned prediction function y = f(x) to classify the unseen data into distinct classes according to each sample's features. Users then evaluate model accuracy according to the number of correctly predicted test data samples.2
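As a minimal illustration of this two-step process, the sketch below uses scikit-learn with a synthetic dataset and a logistic regression classifier; the dataset and model choice are illustrative assumptions, not requirements of the process itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic labeled data standing in for an annotated training set
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out unseen data for the classification (evaluation) step
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Learning step: the model fits a prediction function y = f(x) from features to labels
model = LogisticRegression()
model.fit(X_train, y_train)

# Classification step: predict class labels for the unseen test data
y_pred = model.predict(X_test)
print(accuracy_score(y_test, y_pred))
```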
Classification models output two types of predictions: discrete and continuous.
Discrete. Discrete predictions are the predicted class labels for each data point. For example, we can use a predictor to classify medical patients as diabetic or nondiabetic based on health data. The classes diabetic and nondiabetic are the discrete categorical predictions.
Continuous. Classifiers assign class predictions as continuous probabilities called confidence scores. These probabilities are values between 0 and 1 that can be interpreted as percentages. Our model might classify a patient as diabetic with a probability of 0.82. This means that the model believes the patient has an 82% chance of being diabetic and an 18% chance of being nondiabetic.
Researchers typically evaluate models on their discrete predictions, while using thresholds on the continuous predictions to decide which labels to assign. A classifier ignores any prediction that falls under a certain threshold. For instance, if our diabetes predictor has a threshold of 0.4 (40%) and classifies a patient as diabetic with a probability of 0.35 (35%), then the model ignores that label and does not assign the patient to the diabetic class.3
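To make the thresholding step concrete, here is a brief sketch assuming NumPy arrays and the illustrative 0.4 cutoff from the example above; the score values are made up for demonstration.

```python
import numpy as np

# Continuous predictions (confidence scores) for the positive "diabetic" class,
# for example taken from model.predict_proba(X_test)[:, 1]
scores = np.array([0.82, 0.35, 0.57, 0.12])

threshold = 0.4  # illustrative cutoff

# Discrete predictions: assign the positive class only when the score
# meets or exceeds the threshold
labels = (scores >= threshold).astype(int)
print(labels)  # [1 0 1 0]
```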
Researchers and developers use a wide array of evaluation metrics for classification models. The chosen metric depends on the specific classification task, but all measure how accurately a learner (that is, a classifier) predicts class membership.
Classification metrics utilize a common terminology. True positives (TP) are those data samples the model correctly predicts in their respective class. False positives (FP) are those negative-class instances incorrectly identified as positive cases. False negatives (FN) are actual positive instances erroneously predicted as negative. True negatives (TN) are the actual negative class instances the model accurately classifies as negative.
Precision. Also called positive predictive value (PPV), precision is the proportion of positive class predictions that actually belong to the class in question. For example, in a spam filter, precision is the proportion of messages the model classifies as spam that are, in fact, spam. It is represented by the equation:4
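\[ \text{Precision} = \frac{TP}{TP + FP} \]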
Recall. Also called sensitivity or true positive rate (TPR), recall denotes the percentage of actual class instances that a model detects. Returning to the spam filter, recall indicates how many actual spam messages the model classifies as spam. It is represented by the equation:5
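\[ \text{Recall} = \frac{TP}{TP + FN} \]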
F1 score. Precision and recall can share an inverse relationship: as a classifier returns more true positives (increasing recall), it often misclassifies more non-instances as positives (false positives), decreasing precision. The F1 score aims to balance this tradeoff. F1 (or F-score) combines precision and recall into a single measure of a model's class-wise accuracy. It is represented by the equation:6
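\[ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN} \]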
Data visualization tools help illustrate findings in data analysis. Data scientists and machine learning researchers use two primary tools for visualizing classifier performance:
Confusion matrix. The confusion matrix is a table whose columns represent the predicted values of a given class while rows represent the actual values, or vice versa. The top-left box holds the number of true positives; the box beneath it holds the false positives; the top-right box holds the number of false negatives; the bottom-right box holds the number of true negatives. Totaling these values gives the model's total number of predictions.7 A confusion matrix for a binary classifier might look like this:
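\[
\begin{array}{c|cc}
 & \text{Predicted positive} & \text{Predicted negative} \\
\hline
\text{Actual positive} & TP & FN \\
\text{Actual negative} & FP & TN \\
\end{array}
\]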
ROC curve. A receiver operating characteristic (ROC) curve visualizes the tradeoff between true positives and false positives. The chart plots the true positive rate against the false positive rate at each threshold used in model classification. The area under the curve (AUC) statistic summarizes the ROC curve. AUC measures how likely a randomly selected positive instance receives a higher confidence score than a randomly selected negative instance. AUC values range from 0 to 1. A value of 0 signifies that the model scores all negatives higher than all positives, while 1 means that the model scores every positive higher than every negative.8
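As an illustrative sketch of computing both quantities, the following assumes scikit-learn's roc_curve and roc_auc_score with placeholder labels and confidence scores.

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Placeholder true labels and model confidence scores for the positive class
y_true = [0, 0, 1, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

# ROC curve: true positive rate vs. false positive rate at each threshold
fpr, tpr, thresholds = roc_curve(y_true, y_scores)

# AUC: probability that a random positive outranks a random negative
print(roc_auc_score(y_true, y_scores))
```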
There are many different types of classification algorithms. While they have overlapping use cases, some are more suited to particular applications than others. Here is an overview of three popular machine learning algorithms for classification. All three can be readily implemented in Python using the scikit-learn library.
Online sources often juxtapose machine learning classification and regression as distinct tasks, but this is an oversimplification. Logistic regression is a probabilistic classifier derived from linear regression models. Linear regression uses one or more independent variables to predict the value of a dependent variable, which can be any continuous real number. Logistic regression modifies linear regression so that the output value (the dependent variable) is limited to a value between 0 and 1. It does this by applying a logit (or log odds) transformation to the standard linear regression formula:9
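\[ \ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n \]

Solving for p gives the familiar sigmoid form, p = 1 / (1 + e^{-(β0 + β1x1 + … + βnxn)}), which keeps the output between 0 and 1.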
Logistic regression models are useful for binary classification of multivariate regression problems. Common applications are fraud detection and biomedical predictions. For instance, logistic regression has been implemented to help predict patient mortality induced by trauma and coronary heart disease.10
Naïve Bayes classifiers (also styled Naive Bayes) are a popular type of classifier based on Bayes' theorem. One of their chief differences from other classifiers is that they calculate posterior probabilities for class predictions. This means that a Naïve Bayes classifier updates its initial class predictions (called prior probabilities) with each new piece of data. For example, say we want to estimate a patient's probability of developing diabetes. The patient's medical data, such as blood pressure, age and blood sugar levels, serve as the independent variables. To make a prediction, a Bayesian classifier combines the current believed prevalence of diabetes across a population (the prior probability) with the probability of the patient's medical data values appearing in someone with diabetes (the conditional probability). Naïve Bayes classifiers follow the Bayes' rule equation:11
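\[ P(C \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid C)\, P(C)}{P(\mathbf{x})} \]

Here, P(C) is the prior probability of class C, P(x | C) is the conditional probability of observing the data values x in that class, and P(C | x) is the posterior probability used for the class prediction.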
Naïve Bayes is known as a generative classifier. This means that, given an observation's variable values, the Bayesian classifier calculates which class is most likely to have generated the observation. Natural language processing (NLP) researchers have widely applied Naïve Bayes to text classification tasks, such as sentiment analysis. Using a bag-of-words model, in which each word constitutes a variable, a Naïve Bayes classifier for sentiment analysis predicts whether a positive or negative class produced the text in question.12
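A minimal sketch of this approach, assuming scikit-learn's CountVectorizer for the bag-of-words representation and MultinomialNB as the classifier, with a tiny made-up dataset for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative training set: each review is labeled positive (1) or negative (0)
texts = ["great movie, loved it", "terrible plot, boring", "wonderful acting", "awful and dull"]
labels = [1, 0, 1, 0]

# Bag-of-words model: each word count becomes a variable
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# Multinomial Naive Bayes combines prior class probabilities with
# per-word conditional probabilities to score each class
clf = MultinomialNB()
clf.fit(X, labels)

# Predict the sentiment class most likely to have generated a new review
print(clf.predict(vectorizer.transform(["loved the acting"])))
```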
K-nearest neighbors (KNN) maps data points onto a multidimensional space, where points with similar feature values cluster into groups, or classes. To classify a new data sample, the classifier simply looks at the k points nearest to the new test input x, counts how many members of each class make up that neighbor subset and returns that proportion as the class estimate for the new data point. In other words, the model assigns a new data point to whichever class comprises the majority of that point's neighbors. KNN models typically measure the distance between data points with Euclidean distance:13
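\[ d(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_{i=1}^{n} (x_i - x'_i)^2} \]

As a brief sketch of KNN in practice, the following assumes scikit-learn's KNeighborsClassifier and the built-in Iris dataset, both chosen purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load a small labeled dataset and hold out unseen test data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# k = 5: each test point is assigned the majority class among its
# 5 nearest training points under Euclidean distance (the default metric)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print(knn.score(X_test, y_test))  # fraction of correctly classified test samples
```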
Approximate nearest neighbor (ANN) is a variant of KNN. In high-dimensional data spaces, finding a given data point's exact neighbors can become computationally expensive. Dimensionality reduction is one way to address this; ANN is another. Rather than find a given data point's exact nearest neighbor, ANN finds an approximate nearest neighbor within a given distance. Recent research has shown promising results for ANN in the context of multilabel classification.14
Note that many of these techniques can be negatively affected by outliers. Fortunately, a number of regularization techniques can help mitigate these effects. There are also many other classification algorithms in machine learning, including decision trees, random forests, gradient boosting and support vector machines (SVMs).
1 Chris Drummond, “Classification,” Encyclopedia of Machine Learning and Data Mining, Springer, 2017.
2 Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd edition, Morgan Kaufmann, 2012.
3 Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2016.
4 Ethan Zhang and Yi Zhang, “Precision,” Encyclopedia of Database Systems, Springer, 2018.
5 Ethan Zhang and Yi Zhang, “Recall,” Encyclopedia of Database Systems, Springer, 2018.
6 Ben Carterette, “Precision and Recall,” Encyclopedia of Database Systems, Springer, 2018.
7 Kai Ming Ting, “Confusion matrix,” Encyclopedia of Machine Learning and Data Mining, Springer, 2017.
8 Peter Flach, “ROC Analysis,” Encyclopedia of Machine Learning and Data Mining, Springer, 2017.
9 Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2016. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor, An Introduction to Statistical Learning with Applications in Python, Springer, 2023.
10 Lisa X. Deng, Abigail May Khan, David Drajpuch, Stephanie Fuller, Jonathan Ludmir, Christopher E. Mascio, Sara L. Partington, Ayesha Qadeer, Lynda Tobin, Adrienne H. Kovacs, and Yuli Y. Kim, "Prevalence and Correlates of Post-traumatic Stress Disorder in Adults With Congenital Heart Disease," The American Journal of Cardiology, Vol. 117, No. 5, 2016, pp. 853-857, https://www.sciencedirect.com/science/article/abs/pii/S0002914915023590 (link resides outside of ibm.com).
11 Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2016. William Bolstad and James Curran, Introduction to Bayesian Statistics, 3rd edition, Wiley, 2016.
12 Daniel Jurafsky and James Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 3rd edition, 2023.
13 Max Kuhn and Kjell Johnson, Applied Predictive Modeling, Springer, 2016. Kevin Murphy, Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
14 Ville Hyvönen, Elias Jääsaari, Teemu Roos, “A Multilabel Classification Framework for Approximate Nearest Neighbor Search,” Journal of Machine Learning Research, Vol. 25, No. 46, 2024, pp. 1−51, https://www.jmlr.org/papers/v25/23-0286.html (link resides outside of ibm.com).