Notes on ML model evaluation

Posted February 11, 2020


This post is a first-pass attempt to organize my thoughts on evaluating machine learning models, for my security analytics class. It touches on the “evaluation” phase of CRISP-DM.


[CRISP-DM process diagram, by Kenneth Jensen, own work, CC BY-SA 3.0]

There are two broad uses of statistical models: (1) machine learning, and (2) “traditional statistics”. Machine learning focuses foremost on building models that will be used to make predictions. “Traditional statistics” focuses on using models to extract insights from a dataset – e.g., t-tests, ANOVAs, and ANCOVAs to test for differences between group means on some outcome variable. The key point is that both “uses” rely on the same models. It is possible to extract data insights from machine learning models, and it is possible to make predictions using “traditional statistics” models – the same algorithms produce the same models either way.

Machine learning is sometimes criticized for producing “black box” models – e.g., a neural network from which it seems impossible to understand “why” it makes a certain prediction. Some machine learning applications must (arguably, and sometimes legally) use only “interpretable” models, so that illegal or unethical bias in the predictions can be ruled out.

Such models should use algorithms that produce interpretable coefficients, decision paths, or other per-variable metrics for exogenous variables – for example, decision trees, naive Bayes, or logistic regression. Poor choices would include k-nearest-neighbors (predictions depend on other points in the training dataset, not on exogenous factors of the datapoint), neural networks (the complexity of their layers resists human evaluation), and, mostly, SVMs.
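To make “interpretable coefficients” concrete, here is a minimal sketch with a logistic regression on a toy dataset (the feature names are made up for illustration):

```python
# Sketch: inspecting an "interpretable" model's per-variable coefficients.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# Each coefficient says how a one-unit increase in that feature shifts
# the log-odds of the positive class -- directly human-readable.
for name, coef in zip(["f0", "f1", "f2", "f3"], model.coef_[0]):
    print(f"{name}: {coef:+.3f}")
```

A k-NN or neural network offers no comparable per-feature read-out, which is the contrast the paragraph above is drawing.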

Note, though, that there exists “partial dependence” analysis, which is a backdoor way to learn how changes in the values of one feature at a time influence predictions. Pretty cool.

“Traditional statistics” model interpretation

Traditional statistical packages – R, SAS, SPSS, etc. – provide nice print-out summaries of regression models, which allow easy interpretation of individual model variables. Scikit-learn has a standard API for train-test-splitting data, fitting modeling algorithms, and extracting overall model evaluation metrics, but it has no such nice summary view of a model’s parameters. Enter statsmodels, which provides model summary print-outs comparable to R’s.

[Image: statsmodels regression summary output, from a Stack Overflow post]

Gamut of ML model evaluation metrics for binary classifiers

Most classifiers make probability predictions. These can be converted to nominal predictions using a cutoff threshold. Nominal predictions can be displayed using a confusion matrix. A set of confusion matrices for a given classifier can be obtained by creating one confusion matrix for each possible probability cutoff threshold.
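The threshold-then-tabulate step looks like this (toy labels and probabilities, threshold of 0.5 chosen arbitrarily):

```python
# Sketch: turning probability predictions into nominal labels with a
# cutoff threshold, then tabulating a confusion matrix.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true  = np.array([1, 1, 0, 0, 1, 0])
y_proba = np.array([0.9, 0.6, 0.4, 0.2, 0.3, 0.8])

threshold = 0.5
y_pred = (y_proba >= threshold).astype(int)

# Rows are actual classes, columns are predicted: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```

Sweeping `threshold` over every distinct probability value yields the full set of confusion matrices described above.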

Evaluation metrics such as the following can be derived from a given confusion matrix. These all have synonyms from various fields of use.

  • TPR (a.k.a recall, sensitivity, “probability of detection”)
  • FPR (1 - specificity)
  • PPV (precision)
  • F1 (harmonic mean of TPR and PPV)
  • Accuracy
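Each of these falls out of the four confusion-matrix cells directly (the cell counts here are made up for illustration):

```python
# Sketch: deriving the listed metrics by hand from confusion-matrix cells.
tp, fp, fn, tn = 40, 10, 5, 45

tpr = tp / (tp + fn)            # recall / sensitivity / prob. of detection
fpr = fp / (fp + tn)            # 1 - specificity
ppv = tp / (tp + fp)            # precision
f1  = 2 * ppv * tpr / (ppv + tpr)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(tpr, fpr, ppv, round(f1, 3), accuracy)
```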

A Type I error is a False Positive, and therefore hurts (lowers) specificity, and hurts (raises) FPR. A Type II error is a False Negative, and therefore hurts (lowers) TPR.


An F-score (F1-score) is the harmonic mean of TPR and PPV. The harmonic mean is an alternative to the arithmetic mean, and is more appropriate when averaging things that are proportions. A harmonic mean is the… deep breath… inverse of the average of the inverses of the given measures. [:exhale:] It’s a happy medium between “how many of the actual yes-es did we say were yes-es” and “how often were we right when we said something was a yes.”
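Worked numerically, with precision and recall values picked arbitrarily to show the effect:

```python
# Sketch: harmonic mean as "inverse of the average of the inverses",
# compared with the arithmetic mean, for precision = 0.9 and recall = 0.5.
precision, recall = 0.9, 0.5

harmonic   = 2 / (1 / precision + 1 / recall)   # this is exactly F1
arithmetic = (precision + recall) / 2

print(harmonic, arithmetic)
```

Notice that the harmonic mean (about 0.64) is pulled toward the weaker of the two numbers, while the arithmetic mean (0.70) is not; that is why F1 punishes a classifier that is lopsidedly good at only one of precision or recall.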


The threshold-by-threshold metrics above can be visualized across all thresholds at once with:

  • ROC curves
  • Precision-Recall curves
  • Lift curves

ROC curves

The set of all TPR and FPR metrics derived from the complete set of confusion matrices for a given classifier can be plotted on a ROC curve – TPR on the y-axis, and FPR on the x-axis.

ROC Curves do not depend on class priors. If your testing data does not have the same distribution as the population from which it was drawn, plots such as cumulative gain charts and lift charts, which plot predictions that have been sorted by probability scores, can be deceptive. ROC curves, however, plot two metrics which consider actual-positives and actual-negatives separately. Therefore, ROC curves will not change based on the ratio of classes.

A ROC curve also plots a “p-coin random guess” line – a line plotting the skill of a coin-flip classifier model. This model draws a random probability between 0 and 1 for each test data point, and if that probability is above a given threshold, it predicts a 1. If a threshold of, say, .75 is used, then the model has a 25% chance of classifying each test case as a 1, and a 75% chance of classifying it as a 0. Therefore, 25% of the actually-positive cases will be predicted to be yeses (TPR of .25), and 75% of the actually-negative cases will be classified as no’s (FPR of .25).

Note that this conclusion does not depend on the balance of actual yes-es to actual no’s, since the classes are considered separately for these metrics. When the complete set of TPR-FPR pairs is calculated from all possible thresholds, plotting the resultant points on a ROC curve yields a line drawn from [0,0] to [1,1]. This is called a p-coin because such a classifier can also be thought of as flipping a maybe-weighted coin: a fair coin gives each datapoint a .5 chance of being classified as a 1, while a weighted coin that comes up 1 nine times out of ten gives each test case a .9 chance of being called a 1.

A given model’s ROC curve had better beat the random coin flip at all of its possible thresholds. If it doesn’t, it would be better to always predict the opposite of whatever the model predicts. From such a classifier, turn away!
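A quick simulation makes the p-coin claim tangible: score every case with a uniform random number and the resulting AUC lands near the diagonal’s 0.5 (sample size and seed here are arbitrary):

```python
# Sketch: a "p-coin" classifier that scores each case with a uniform
# random draw. Its ROC curve hugs the diagonal, so its AUC is near 0.5.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true   = rng.integers(0, 2, size=20_000)   # actual labels
y_random = rng.uniform(size=20_000)          # coin-flip "probabilities"

print(round(roc_auc_score(y_true, y_random), 3))  # close to 0.5
```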

An area-under-the-curve (AUC) can be calculated from a ROC curve to give a single metric that assesses the performance of a classifier across all possible cutoff thresholds. It is equivalent to the proportion of the [[0,1],[0,1]] ROC-curve plot space that the classifier’s curve “fills” down to the bottom-right. Higher AUC == good.
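In scikit-learn this is one call on the held-out probabilities (dataset and model choices below are arbitrary):

```python
# Sketch: ROC AUC for a fitted model, to compare against the 0.5 chance level.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Score with probabilities, not hard labels -- AUC sweeps all thresholds.
proba = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
print(round(roc_auc_score(y_te, proba), 3))
```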

Precision-Recall curves

These are like ROC curves, except that they plot different metrics – TPR (recall) and PPV (precision). Together, these two metrics consider only 3 of the 4 cells of a confusion matrix – the “true negative”s are ignored. The thought here is that in application, things relating to the “actual yes”-es and the “yes-prediction”s are more interesting. Note that these are the two metrics that make up an F1 score, so the AUC of a precision-recall curve (commonly reported as average precision) is related to – though not the same as – an averaging of F1 scores across all possible cutoff thresholds for a given model.
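A sketch of computing the curve and its single-number summary; the class imbalance (`weights=[0.9]`) is chosen arbitrarily, since PR curves are most interesting on imbalanced data:

```python
# Sketch: precision-recall pairs across all thresholds, plus average
# precision as the usual single-number PR-curve summary.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

proba = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_te, proba)
print(round(average_precision_score(y_te, proba), 3))
```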

A no-skill line on a precision-recall curve is not a line drawn from [0,0] to [1,1]. Rather, for a set of random guesses, the precision – the proportion of the time that a “yes”-guess is correct – is always equal to the proportion of actual-yes values in the set of points from which the random sample was drawn. In other words, regardless of how often you guess “yes”, if you’re doing so completely randomly, you’re going to be right at the rate that there are actual “yes”-es in the sample you are guessing against. Recall, on the other hand, scales linearly with the likelihood that a no-skill classifier guesses “yes”. Recall that recall measures “of the actual-positive test cases, how many were correctly classified?” – so the more often you guess “yes”, the more of the actual “yes”-es you’ll get right. See this explained with proper stats notation in this Stack Overflow post.
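This is easy to verify empirically (the 20% prevalence and the thresholds below are arbitrary choices):

```python
# Sketch: for random scores, precision at any threshold hovers around the
# positive-class prevalence, while recall tracks how often you guess "yes".
import numpy as np

rng = np.random.default_rng(0)
y_true   = (rng.uniform(size=50_000) < 0.2).astype(int)  # 20% actual yeses
y_random = rng.uniform(size=50_000)                      # no-skill scores

for threshold in (0.25, 0.5, 0.75):
    pred = y_random >= threshold
    precision = y_true[pred].mean()       # stays near 0.2 at every threshold
    recall = pred[y_true == 1].mean()     # tracks 1 - threshold
    print(threshold, round(precision, 3), round(recall, 3))
```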

Lift curves

A lift curve sorts test cases by predicted probability and plots how many times more actual positives the top-ranked fraction contains than a random sample of the same size would – i.e., the model’s “lift” over no-skill guessing.


Model evaluations using Python scikit-learn
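As a minimal end-to-end sketch of the scikit-learn evaluation workflow this section is about (dataset and model choices are arbitrary placeholders):

```python
# Sketch: split, fit, predict, then pull several of the metrics
# discussed above in one pass.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))    # precision/recall/F1
print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))  # threshold-free AUC
```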


David Eargle is an Assistant Professor at the University of Colorado Boulder in the Leeds School of Business. He earned his Ph.D. degree in Information Systems from the University of Pittsburgh. His research interests include human-computer interaction and information security. He has coauthored several articles in these areas using neurophysiological and other methodologies in outlets such as the Journal of the Association for Information Systems, the European Journal of Information Systems, the International Conference on Information Systems, and the Hawaii International Conference on System Sciences, along with the Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI).
