Measuring Machine Learning Model Performance

nithyav2k16
Aug 26, 2019

Updated: Feb 11, 2020

By now you understand the basics of machine learning. The next step is finding out whether the model you trained is any good. In other words, how can you assess the quality of a model after you have trained it?

Is our machine learning model any good?

Well, it depends. First of all, you have to define performance, and that definition depends on the context in which you want to use the model. In some cases, the accuracy of the model is the most important thing. In other cases, computation time is a good indicator of performance, and in still other cases the interpretability of the model is essential. There are also three different tasks in machine learning: classification, regression, and clustering. Each of these has its own performance measures. We'll discuss each of them in more detail in this post.

[Figure: Measuring machine learning model performance]

Classification

In classification systems, accuracy and error are the basic measures of performance. They reflect how often the system is right or wrong, so if accuracy goes up, the error rate goes down. Accuracy is the number of correctly classified instances divided by the total number of classified instances. The error rate is one minus the accuracy.

Remember the classification of squares we discussed before? Each square had a set of features. For example, a square could be small and have dotted lines. Suppose each square has a label as well: this time it is either colored or not colored, positive or negative, so we are dealing with a binary classification problem. You develop a classification system that labels two of five squares as positive and three as negative.

In reality, however, the model classified only three squares correctly. One positive square was wrongly classified as negative, and one negative square was wrongly classified as positive. The accuracy of this model is three divided by five, or 60%: the number of correctly classified squares divided by the total number of squares. As the next example shows, accuracy does not always tell the whole story.
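The arithmetic for the squares can be checked with a minimal sketch. The label lists below are an assumption: the text only gives the counts, so the order of the squares is made up for illustration, and the helper name `accuracy` is hypothetical.

```python
# 1 = colored (positive), 0 = not colored (negative).
# Assumed ordering of the five squares; only the counts come from the text.
actual    = [1, 1, 0, 0, 0]
predicted = [1, 0, 1, 0, 0]

def accuracy(actual, predicted):
    """Fraction of instances whose predicted label matches the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

acc = accuracy(actual, predicted)
error_rate = 1 - acc
print(f"accuracy = {acc:.2f}, error rate = {error_rate:.2f}")
# prints: accuracy = 0.60, error rate = 0.40
```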

Think of a classifier that predicts whether patients have a very rare heart disease. A classifier that simply predicts that every patient is not sick will still have a high accuracy: it misclassifies the one patient in a hundred who has the disease and correctly classifies the other 99% of patients as not having it, resulting in an accuracy of 99%. However, you can intuitively feel that this classifier is bogus and should not be used as a predictor.

Confusion Matrix

This is why it is always interesting to calculate the confusion matrix. The confusion matrix consists of rows and columns that both contain all available labels. Each cell in the confusion matrix contains the frequency of instances that are classified in a certain way. For now, we will focus on binary classifiers, where we call one class positive and the other negative. In the layout used here, rows correspond to the actual class and columns to the predicted class. The upper left corner of the matrix contains the frequency of true positives: instances that are correctly classified as positive. The lower right corner contains the frequency of true negatives: instances that are correctly classified as negative. The upper right corner contains the frequency of false negatives: instances that are classified as negative but are in fact positive. Finally, the lower left corner contains the frequency of false positives: instances that are classified as positive but are in fact negative. From these four counts, the accuracy as well as new ratios like precision and recall can be calculated.
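To make the layout concrete, here is a small sketch that counts the four cells from two label lists. The function name `confusion_matrix` and the example labels are hypothetical, and the code assumes labels are encoded as 1 (positive) and 0 (negative).

```python
def confusion_matrix(actual, predicted):
    """Return [[TP, FN], [FP, TN]]: rows are actual classes, columns are predictions."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            tp += 1      # correctly classified as positive
        elif a == 1 and p == 0:
            fn += 1      # classified as negative, in fact positive
        elif a == 0 and p == 1:
            fp += 1      # classified as positive, in fact negative
        else:
            tn += 1      # correctly classified as negative
    return [[tp, fn], [fp, tn]]

# Hypothetical labels, made up for illustration only.
actual    = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]

matrix = confusion_matrix(actual, predicted)
print(matrix)  # prints: [[2, 1], [1, 4]]
```

Other tools may order the rows and columns differently, so it is worth checking the convention before reading values off a matrix.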

Precision is the ratio you get when you divide the true positives by the sum of true positives and false positives. Recall is the ratio you get when you divide the true positives by the sum of true positives and false negatives.

For the problem of classifying squares, we can create the following confusion matrix. One positive square was correctly classified as positive: this is the true positive. Two negative squares were correctly classified as negative: these are the true negatives. One positive square was wrongly classified as negative, a false negative, and one negative square was wrongly classified as positive, a false positive. In this confusion matrix, accuracy is the sum of the numbers on the diagonal divided by the sum of all the numbers. The precision and recall can be measured as shown in the picture below.

[Figure: Performance measures]
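As a rough check of the squares example, the three ratios can be computed directly from the four counts given in the text. The variable names are only illustrative.

```python
# Counts for the squares example, taken from the text.
tp, fn = 1, 1   # actual positives: one found, one missed
fp, tn = 1, 2   # actual negatives: one false alarm, two correct

accuracy = (tp + tn) / (tp + fn + fp + tn)   # diagonal over total
precision = tp / (tp + fp)                   # of the predicted positives, how many are right
recall = tp / (tp + fn)                      # of the actual positives, how many were found

print(f"accuracy = {accuracy:.2f}, precision = {precision:.2f}, recall = {recall:.2f}")
# prints: accuracy = 0.60, precision = 0.50, recall = 0.50
```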

You can also build a confusion matrix for the disease classifier. Out of 100 people, 99 are healthy and one is sick. Our classifier, however, labels them all as not sick. In the resulting confusion matrix, the accuracy is 99%, but the recall is 0%, and the precision is not even defined because there are no positive predictions.
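The disease example can be reproduced with a short sketch. Returning `None` for an undefined ratio is a design choice made here for illustration, and `precision_recall` is a hypothetical helper; some tools report 0 or raise a warning instead.

```python
def precision_recall(tp, fn, fp, tn):
    """Return (precision, recall); a ratio is None when its denominator is zero."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return precision, recall

# 100 people, one sick, and a classifier that labels everyone as not sick.
tp, fn, fp, tn = 0, 1, 0, 99

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision, recall = precision_recall(tp, fn, fp, tn)

print(accuracy)    # prints: 0.99
print(precision)   # prints: None (no positive predictions were made)
print(recall)      # prints: 0.0
```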

Regression: RMSE

Now let's discuss regression. To compare the performance of regression algorithms, you can use the root mean squared error, or RMSE. It is calculated by first taking the sum of the squared differences between the predicted and real values. Next, you divide this sum by the number of values, and you wrap it up by taking the square root. In the following plot, you can see that the RMSE is strongly related to the mean distance between the points and the regression line. If these distances are small, the RMSE will be small. If they are large, the RMSE will be large.

[Figure: Root Mean Squared Error]
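The three steps of the RMSE calculation (square the differences, average them, take the square root) can be sketched as follows. The function name `rmse` and the sample values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equally long lists of numbers."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    mean_squared_error = sum(squared_errors) / len(squared_errors)
    return math.sqrt(mean_squared_error)

# Hypothetical real and predicted values.
actual    = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

# Squared errors are 1, 0, 1 and 4, so the mean is 1.5.
print(f"{rmse(actual, predicted):.4f}")  # prints: 1.2247
```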

Finally, there are also different ways to see whether a clustering algorithm did a good job. Here, you have no label information whatsoever, so you'll have to rely on distance metrics between points.

Clustering

More specifically, a performance measure for clustering is typically composed of two elements. On the one hand, there is the similarity within each cluster: how alike the points in the same cluster are. On the other hand, there is the similarity between clusters: how alike the points in two different clusters are. You want the first, the similarity within each cluster, to be high, while the second, the similarity between different clusters, should be low.

There are many different techniques to measure these two concepts. For within-cluster similarity, there are the within-cluster sum of squares (WSS) and the cluster diameter, among many others. The smaller the sum of squares and the smaller the diameter, the more similar the points within the same cluster are.

For between-cluster similarity, you can use measures like the between-cluster sum of squares or the intercluster distance. The higher they are, the less similar the clusters are.
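The two sums of squares can be sketched in plain Python. This assumes the common centroid-based definitions: WSS adds up squared distances from each point to its own cluster centroid, and the between-cluster sum of squares adds up the squared distance from each cluster centroid to the overall centroid, weighted by cluster size. All function names and the sample clusters are hypothetical.

```python
def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def within_sum_of_squares(clusters):
    total = 0.0
    for cluster in clusters:
        center = centroid(cluster)
        total += sum(squared_distance(p, center) for p in cluster)
    return total

def between_sum_of_squares(clusters):
    overall = centroid([p for cluster in clusters for p in cluster])
    return sum(len(c) * squared_distance(centroid(c), overall) for c in clusters)

# Two hypothetical, well-separated clusters of 2D points.
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 0.0), (10.0, 2.0)],
]

print(within_sum_of_squares(clusters))   # prints: 4.0
print(between_sum_of_squares(clusters))  # prints: 100.0
```

A small WSS next to a large between-cluster sum of squares, as in this toy data, is the pattern described above: tight clusters that lie far apart.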

These are pretty technical measures that involve some fancy equations. One performance index for comparing clustering algorithms is the Dunn index. Specifically, the Dunn index is the ratio of the minimum intercluster distance to the maximum cluster diameter found in the clustering, so higher values indicate compact, well-separated clusters. There are tons of performance measures out there, and these are just a few of them. The most important thing is to think critically about the models you're training and to interpret the performance measures carefully. This is something you'll get good at with experience.
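The Dunn index described above can be sketched as follows. Intercluster distance and diameter can be defined in several ways, so this example makes two assumptions: the distance between two clusters is the smallest Euclidean distance between any pair of their points, and the diameter of a cluster is the largest distance between any two of its points. The function names and sample data are hypothetical.

```python
import math
from itertools import combinations

def diameter(cluster):
    """Largest pairwise distance inside one cluster (0.0 for a single point)."""
    return max((math.dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)

def intercluster_distance(c1, c2):
    """Smallest distance between a point in c1 and a point in c2."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def dunn_index(clusters):
    min_separation = min(
        intercluster_distance(a, b) for a, b in combinations(clusters, 2)
    )
    max_diameter = max(diameter(c) for c in clusters)
    return min_separation / max_diameter

# Two hypothetical clusters: each has diameter 2, and they are 10 apart.
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 0.0), (10.0, 2.0)],
]

print(dunn_index(clusters))  # prints: 5.0
```

Note that this sketch divides by zero if every cluster contains a single point, since all diameters are then 0. It also needs at least two clusters.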
