The Ultimate Guide for Evaluating Classification Models

Classification is the art of predicting categorical class labels of data points. This article discusses the most important evaluation methods you need to know.

Sjoerd Vink
6 min read · Feb 5, 2022

Classification is a specific problem set in machine learning where algorithms are used to predict categorical response values. These categories are the labels of classes in the data set, and each data point carries one of these labels. The usual process consists of cleaning the data and splitting it into training and testing data. Next, a model is trained on the training data and predicts the labels of the testing data. Based on these predicted labels, several evaluation metrics can be calculated. This article covers the evaluation metrics you need to know in order to build top notch models.

Confusion matrix

The confusion matrix is the basis for all the other evaluation metrics. It visualizes the performance of an algorithm based on the positive and negative predicted values. Each row of the matrix represents the instances of an actual class and each column represents the instances of a predicted class. This gives a clear overview of the true positives/negatives and false positives/negatives. An example of a confusion matrix for binary classification is shown below. For a multi-class classification problem, you simply add rows and columns to the confusion matrix.

Confusion matrix for classification model evaluation
Source: Wikipedia
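
To make this concrete, here is a minimal sketch using scikit-learn; the y_true and y_pred arrays are just made-up example labels.

```python
# A minimal sketch: building a confusion matrix with scikit-learn.
# The label arrays y_true and y_pred are invented examples.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual class labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # labels predicted by the model

# Rows correspond to actual classes, columns to predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()          # unpacking works for the binary case
print(cm)
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
```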

Accuracy and error

Next up are accuracy and error, two metrics that are often used to evaluate performance. Accuracy refers to the ability of the classification model to predict class labels correctly, and is calculated by dividing the number of correctly predicted values by the total number of values. Error refers to the inability of the classification model to predict class labels correctly, and is calculated by dividing the number of incorrectly predicted values by the total number of values. This results in the following formulas:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Error = (FP + FN) / (TP + TN + FP + FN)

Keep in mind that the optimal metric depends heavily on your problem set, especially in the case of imbalanced classes. Take for example a supervised outlier classification problem, where there are very few outliers in the total data set. Even if the model predicts that no data point is an outlier, the accuracy will still be high (and the error low), because the correctly predicted majority class outweighs the misclassified outliers.
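
To illustrate, here is a small sketch with invented numbers: a model that never predicts an outlier still scores 99% accuracy on a data set with 990 normal points and 10 outliers.

```python
# Imbalanced-class caveat: 990 normal points, 10 outliers (invented data).
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0] * 990 + [1] * 10)   # 1 = outlier
y_pred = np.zeros(1000, dtype=int)        # model that never predicts an outlier

# Accuracy looks great even though every single outlier was missed.
print(accuracy_score(y_true, y_pred))      # 0.99
print(1 - accuracy_score(y_true, y_pred))  # error: 0.01
```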

Precision and recall

Precision refers to the fraction of positive predictions that are actually positive. Recall, on the other hand, refers to the fraction of actual positives that the model manages to identify; it is also called the true positive rate. Both formulas are as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)

There is usually a tradeoff between precision and recall: pushing one higher tends to push the other lower. Just as with accuracy and error, which of the two matters most depends on your problem. Consider for example a cancer department in a hospital, where the goal is to diagnose as many sick people as possible. In other words, the goal is to make the recall as high as possible, because you want as few false negatives as possible. In other settings, the goal can be to make the precision as high as possible.
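
As a quick sketch (again with invented data), scikit-learn's precision_score and recall_score make these numbers easy to compute:

```python
# A minimal sketch of precision and recall with scikit-learn.
# Invented data: 990 normal points (0) and 10 outliers (1).
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0] * 990 + [1] * 10)
y_pred = y_true.copy()
y_pred[990:995] = 0   # miss 5 real outliers  -> 5 false negatives
y_pred[0:5] = 1       # flag 5 normal points  -> 5 false positives

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 5 / 10 = 0.5
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 5 / 10 = 0.5
```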

Multi-class classification

You are now probably wondering: how do I apply this to classification problems that aren't binary, i.e. problems with more than two categories? The idea here is to average the precision and recall over all the categories. This can be done with either macro averaging or micro averaging.

Macro averaging first calculates the precision and recall for every label. In the case of a 3-class classification problem, this gives three precision values and three recall values. The macro-average is then simply the mean of these values.

Macro precision = (Precision₁ + Precision₂ + Precision₃) / 3
Macro recall = (Recall₁ + Recall₂ + Recall₃) / 3

Micro averaging treats every label as its own binary classification problem, so each label has its own confusion matrix with four quadrants. These confusion matrices are then merged into one pooled matrix with four quadrants. The precision and recall are calculated on that pooled confusion matrix, resulting in the micro-averaged scores.

Micro precision = (TP₁ + TP₂ + TP₃) / (TP₁ + TP₂ + TP₃ + FP₁ + FP₂ + FP₃)
Micro recall = (TP₁ + TP₂ + TP₃) / (TP₁ + TP₂ + TP₃ + FN₁ + FN₂ + FN₃)
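
Here is a minimal sketch of both averaging strategies on an invented 3-class problem, using scikit-learn's average parameter:

```python
# Macro vs. micro averaging on an invented 3-class problem.
from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 2]
y_pred = [0, 1, 0, 1, 2, 2, 2, 0, 2, 2]

# Macro: compute the metric per class, then take the unweighted mean.
print(precision_score(y_true, y_pred, average="macro"))
print(recall_score(y_true, y_pred, average="macro"))

# Micro: pool the per-class confusion matrices first, then compute once.
print(precision_score(y_true, y_pred, average="micro"))
print(recall_score(y_true, y_pred, average="micro"))
```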

F-measure

The F1 score combines precision and recall into a single number: their harmonic mean. The score lies between 0 and 1, and both values carry equal weight. It can be used whenever a single metric is required and the balance between precision and recall is not particularly important.

F1 = 2 · (Precision · Recall) / (Precision + Recall)

In certain problem sets, precision or recall can be more important than the other. As just discussed, there is always a tradeoff between the two. The F-beta score comes to the rescue: it is used in problem sets where one of the two should weigh heavier, depending on the beta value. With β > 1 recall counts more heavily, with β < 1 precision does. The formula is as follows:

Fβ = (1 + β²) · (Precision · Recall) / (β² · Precision + Recall)
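
A small sketch of both scores with scikit-learn, on invented labels:

```python
# F1 and F-beta with scikit-learn (invented labels).
from sklearn.metrics import f1_score, fbeta_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print(f1_score(y_true, y_pred))               # equal weight for precision and recall
print(fbeta_score(y_true, y_pred, beta=2.0))  # beta > 1: recall weighs heavier
print(fbeta_score(y_true, y_pred, beta=0.5))  # beta < 1: precision weighs heavier
```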

Loss functions

A loss function is a statistical method that measures how far a particular iteration of the model is from the actual values, i.e. how far an estimated value is from its true value. There are many well-known loss functions for regression tasks, like mean absolute error and sum of squared errors, but there are also loss functions for classification tasks. The two main ones are binary cross-entropy and categorical cross-entropy.

Binary and categorical cross-entropy

Cross-entropy measures the difference between two probability distributions, in this case the predicted class probabilities and the true labels. Binary cross-entropy is the loss function used in binary classification tasks. Categorical cross-entropy is the loss function used in multi-class classification tasks; it is essentially the same as binary cross-entropy, only with more classes, and it requires the class labels to be one-hot encoded. The binary formula is shown below. In an ideal situation, a model has a log loss of 0.

Binary cross-entropy = −(1/N) · Σᵢ [ yᵢ · log(pᵢ) + (1 − yᵢ) · log(1 − pᵢ) ]
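
A minimal sketch of binary cross-entropy with scikit-learn's log_loss, using invented labels and probabilities:

```python
# Binary cross-entropy (log loss) with scikit-learn (invented values).
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1, 0]
y_prob = [0.9, 0.1, 0.8, 0.35, 0.2]   # predicted probability of class 1

# Lower is better; a perfect model would reach a log loss of 0.
print(log_loss(y_true, y_prob))
```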

Both binary and categorical cross-entropy are more advanced topics and out of the scope of this article. For more in-depth information, you should read this article and this article.

AUC - ROC

Many classification models return a probability score for each class rather than a hard label. As a machine learning engineer, you can set the threshold above which a data point is assigned to a certain category.

It is important to find the optimal threshold for your specific application. AUC-ROC is a performance measurement for classification models across various thresholds, giving an overview of the effect of the threshold on the model's sensitivity and specificity.

ROC stands for 'receiver operating characteristic' and refers to the probability curve that helps you pick the best threshold for your problem. The figure below illustrates the ROC curve. The Y-axis is the true positive rate (the recall) and the X-axis is the false positive rate. For each threshold, a point can be placed on the graph showing the resulting true positive rate and false positive rate.

ROC curve. Source: Wikimedia

AUC stands for 'area under the curve' and measures the degree of separability between the classes. It can be used to compare models and decide which one is better for your problem: the more area under the curve, the better the model.
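
To wrap up, here is a minimal sketch that computes the ROC curve and the AUC with scikit-learn, on invented scores:

```python
# ROC curve and AUC with scikit-learn (invented labels and scores).
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]  # model probability scores

# One (FPR, TPR) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(thresholds, fpr, tpr)))

# Area under that curve: 1.0 is a perfect ranking, 0.5 is random guessing.
print(roc_auc_score(y_true, y_score))
```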

Conclusion

There are a lot of different evaluation metrics for classification models. This article discussed the most used ones in the machine learning domain. There are a ton of metrics that aren’t discussed in this article. However, with these metrics in your toolbox, you are able to grasp more advanced techniques relatively quickly.
