Comparing Logistic Regression, Neural Networks, and LightGBM on synthetic data
Binary classification is one of the most common tasks in applied machine learning: fraud detection, credit default prediction, medical diagnosis, churn modeling. In each case the model outputs a probability that an observation belongs to the positive class, and the business decides on a threshold to act on. Choosing the right validation metrics is critical, especially when the classes are not balanced.
Before committing to a decision threshold, it is worth evaluating how well the model ranks and scores observations across all possible cut-offs. Threshold-free metrics answer the question: does the model assign higher probabilities to actual positives than to actual negatives? They should be the first thing you look at, because if the model cannot rank well, no choice of threshold will save it.
Given a threshold $T$, an observation with score $X > T$ is classified as positive. As $T$ varies, we trace out the Receiver Operating Characteristic (ROC) curve, plotting TPR$(T)$ against FPR$(T)$. The Area Under this curve (AUC) has a clean probabilistic interpretation: it equals the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative one:

$$\text{AUC} = \Pr(X_{\text{pos}} > X_{\text{neg}})$$
Equivalently, AUC is the normalized Mann-Whitney U statistic: $\text{AUC} = U / (P \cdot N)$. The Gini coefficient relates to AUC as $G = 2\,\text{AUC} - 1$ and is the standard discriminatory power metric in credit risk modeling (Basel II/III).
A random classifier scores AUC = 0.5; a perfect one scores 1.0. AUC is threshold-independent, which is both its strength (it summarizes all operating points) and its weakness (it weighs regions of the ROC curve that may be operationally irrelevant).
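The pairwise-ranking interpretation can be checked directly against the Mann-Whitney formula. This is a minimal numpy sketch; the function name `auc_mann_whitney` and the simulated scores are illustrative, not from the text.

```python
import numpy as np

def auc_mann_whitney(y_true, scores):
    """AUC via the Mann-Whitney U statistic: the fraction of
    (positive, negative) pairs in which the positive instance is
    ranked higher, with ties counted as half a concordant pair."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score against every negative score
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    u = greater + 0.5 * ties
    return u / (len(pos) * len(neg))  # AUC = U / (P * N)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
s = y + rng.normal(0, 1, size=500)  # scores correlated with the label

auc = auc_mann_whitney(y, s)
gini = 2 * auc - 1  # Gini coefficient used in credit risk
```

A model whose scores perfectly separate the classes gets AUC exactly 1.0 under this definition, regardless of the scores' absolute scale.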
When the positive class is rare, the ROC curve can be overly optimistic because FPR stays low even with many false positives (the denominator is the large negative class). The precision-recall (PR) curve plots precision against recall and is more informative for imbalanced problems. A model that looks strong on ROC may reveal poor precision at useful recall levels on the PR curve.
The area under the PR curve (AUPR) summarizes this trade-off in a single number:

$$\text{AUPR} = \int_0^1 \text{Precision}(r)\, dr$$
A random classifier achieves AUPR equal to the base rate (the proportion of positives), so unlike AUROC (where 0.5 is random) the baseline shifts with class balance. This makes AUPR particularly sensitive to performance on the minority class: a model must maintain high precision even as recall increases, which is hard when positives are rare. For imbalanced problems, a drop from AUROC 0.95 to AUPR 0.40 is common and reveals that the model produces many false positives at useful recall levels. AUPR should be the primary ranking metric when the positive class is the one you care about.
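The shifting baseline is easy to demonstrate. Below is a numpy sketch of the average-precision approximation to AUPR (the step-wise sum rather than a trapezoidal integral); the function name `average_precision` is illustrative. With random scores on a 5% base rate, AUPR lands near 0.05 rather than 0.5.

```python
import numpy as np

def average_precision(y_true, scores):
    """AUPR via the average-precision step approximation:
    the mean of precision@k over the ranks k where a true
    positive appears."""
    order = np.argsort(-scores)          # rank by score, descending
    y_sorted = y_true[order]
    cum_tp = np.cumsum(y_sorted)
    ranks = np.arange(1, len(y_sorted) + 1)
    precision_at_k = cum_tp / ranks
    return precision_at_k[y_sorted == 1].mean()

rng = np.random.default_rng(1)
n, base_rate = 20_000, 0.05
y = (rng.random(n) < base_rate).astype(int)

ap_random = average_precision(y, rng.random(n))     # ~ base rate, not 0.5
ap_perfect = average_precision(y, y.astype(float))  # 1.0
```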
Discrimination (ranking) is not the same as calibration (probability accuracy). A model can rank well (high AUC) while systematically over- or under-estimating probabilities. The calibration curve bins predicted probabilities and plots them against observed event rates. A perfectly calibrated model lies on the diagonal. In regulated industries like banking, well-calibrated probabilities feed directly into expected loss calculations and capital requirements, making calibration as important as discrimination.
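The binning procedure behind a calibration curve is a few lines of numpy. This sketch (the function name `calibration_curve` mirrors the concept, not any particular library) generates outcomes from the stated probabilities, so the curve should hug the diagonal.

```python
import numpy as np

def calibration_curve(y_true, probs, n_bins=10):
    """Bin predictions into equal-width probability bins and return,
    per non-empty bin, the mean predicted probability and the
    observed event rate."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(probs, bins) - 1, 0, n_bins - 1)
    mean_pred, event_rate = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            mean_pred.append(probs[mask].mean())
            event_rate.append(y_true[mask].mean())
    return np.array(mean_pred), np.array(event_rate)

rng = np.random.default_rng(2)
p = rng.random(50_000)
y = (rng.random(50_000) < p).astype(int)  # outcomes drawn from p: perfectly calibrated

mean_pred, event_rate = calibration_curve(y, p)
```

A miscalibrated model (e.g. one whose scores were trained on a rebalanced sample) would show systematic gaps between the two returned arrays.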
The Brier score measures the mean squared error between predicted probabilities $\hat{p}_i$ and actual outcomes $y_i \in \{0, 1\}$:

$$\text{BS} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^2$$
The Brier score captures both discrimination and calibration in a single number. It can be decomposed into a reliability term (calibration error), a resolution term (how much predictions vary across groups), and an uncertainty term (base rate entropy). Lower is better; a model that always predicts the base rate achieves the uncertainty-only baseline.
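The three-term decomposition (often attributed to Murphy) can be verified numerically. The sketch below uses forecasts restricted to bin midpoints so the decomposition is exact; with continuous forecasts a within-bin variance remainder appears. Function and variable names are illustrative.

```python
import numpy as np

def brier_decomposition(y_true, probs, n_bins=10):
    """Decompose the Brier score as reliability - resolution + uncertainty
    over equal-width probability bins."""
    brier = np.mean((probs - y_true) ** 2)
    base = y_true.mean()
    uncertainty = base * (1 - base)          # base-rate variance term
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(probs, bins) - 1, 0, n_bins - 1)
    reliability = resolution = 0.0
    n = len(y_true)
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            o_b = y_true[mask].mean()        # observed rate in bin
            p_b = probs[mask].mean()         # mean forecast in bin
            w = mask.sum() / n
            reliability += w * (p_b - o_b) ** 2   # calibration error
            resolution += w * (o_b - base) ** 2   # spread across bins
    return brier, reliability, resolution, uncertainty

rng = np.random.default_rng(3)
n = 20_000
probs = (rng.integers(0, 10, n) + 0.5) / 10      # forecasts at bin midpoints
y = (rng.random(n) < probs).astype(int)          # calibrated by construction

bs, rel, res, unc = brier_decomposition(y, probs)
```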
The Kolmogorov-Smirnov statistic measures the maximum vertical distance between the cumulative distribution functions of the positive and negative class scores. It identifies the threshold at which the model achieves the greatest separation between classes. KS is widely used in credit scoring as a quick summary of discriminatory power, though it only captures performance at a single operating point.
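A minimal numpy sketch of the KS statistic, evaluating the two empirical CDFs at every observed score and taking the largest gap; the function name `ks_statistic` and the Gaussian score distributions are illustrative assumptions.

```python
import numpy as np

def ks_statistic(y_true, scores):
    """Maximum vertical gap between the empirical score CDFs of the
    negative and positive classes, with the threshold that attains it."""
    thresholds = np.unique(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    # Empirical CDF of each class evaluated at every candidate threshold
    cdf_pos = np.searchsorted(np.sort(pos), thresholds, side="right") / len(pos)
    cdf_neg = np.searchsorted(np.sort(neg), thresholds, side="right") / len(neg)
    gaps = np.abs(cdf_neg - cdf_pos)
    best = np.argmax(gaps)
    return gaps[best], thresholds[best]

rng = np.random.default_rng(4)
neg = rng.normal(0.0, 1.0, 5000)
pos = rng.normal(1.5, 1.0, 5000)   # positives score higher on average
scores = np.concatenate([neg, pos])
y = np.concatenate([np.zeros(5000, int), np.ones(5000, int)])

ks, best_threshold = ks_statistic(y, scores)
```

For two unit-variance Gaussians separated by 1.5, the theoretical maximum gap is about 0.55, attained midway between the class means.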
Once you are satisfied with the model's ranking ability, the next step is choosing a probability cut-off and evaluating performance at that specific operating point. For a given threshold, every prediction falls into one of four cells:
| | Predicted 1 | Predicted 0 |
|---|---|---|
| True 1 | True Positive (TP) | False Negative (FN) |
| True 0 | False Positive (FP) | True Negative (TN) |
From these four counts we derive the rates that drive model selection:

- True positive rate (recall, sensitivity): TPR = TP / (TP + FN)
- False positive rate: FPR = FP / (FP + TN)
- Precision (positive predictive value): TP / (TP + FP)
- True negative rate (specificity): TN / (TN + FP)
Which cell matters most depends on the business context. In credit risk, a false negative (approving a borrower who defaults) is far more expensive than a false positive (declining a good applicant). In medical screening, missing a true positive can be life-threatening. The choice of metric should reflect these asymmetric costs.
Accuracy is the fraction of correct predictions:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Accuracy is intuitive but dangerously misleading on imbalanced data. If only 1% of transactions are fraudulent, a model that predicts "not fraud" for every observation achieves 99% accuracy while catching zero fraud. This is why accuracy should never be the primary metric for rare-event problems.
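The all-negative baseline is worth demonstrating once. A short numpy sketch with a simulated 1% fraud rate (the figures are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
y = (rng.random(n) < 0.01).astype(int)   # ~1% of transactions are fraud

pred = np.zeros(n, dtype=int)            # "never fraud" baseline model

accuracy = (pred == y).mean()            # ~0.99, looks excellent
recall = pred[y == 1].mean()             # 0.0: catches no fraud at all
```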
F1 captures the trade-off between precision and recall at a specific operating point. It is the harmonic mean of the two, giving equal weight to both:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Unlike accuracy, F1 ignores true negatives entirely, making it more sensitive to model performance on the minority class. The harmonic mean ensures that a model scoring poorly on either precision or recall cannot achieve a high F1. Note that F1 is a point on the PR curve: it tells you how well the model performs at your chosen threshold, while AUPR tells you how well it performs across all thresholds.
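Computing the three quantities from the confusion counts makes the harmonic-mean relationship concrete; the helper name `precision_recall_f1` and the toy vectors are illustrative.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and their harmonic mean (F1) from 0/1 predictions.
    Note that TN never enters any of the three quantities."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])  # 3 TP, 1 FP, 1 FN

p, r, f1 = precision_recall_f1(y_true, y_pred)
```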
MCC is one of the few threshold-dependent metrics that remains informative regardless of class balance. It is essentially the Pearson correlation between the observed and predicted binary vectors, and is related to the $\chi^2$ statistic of the $2 \times 2$ confusion table:

$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
MCC ranges from $-1$ (total disagreement) through $0$ (random) to $+1$ (perfect). Unlike F1, it accounts for all four quadrants of the confusion matrix. A model that achieves high accuracy by always predicting the majority class will score MCC $\approx$ 0, immediately flagging the problem. For rare-event modeling, MCC and the PR curve together give the most honest picture of model performance.
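The majority-class failure mode shows up immediately in code. This numpy sketch (function name `mcc` and the 1% simulation are illustrative) follows the common convention of returning 0 when a marginal count is zero, which is exactly the constant-predictor case:

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient from the 2x2 confusion counts.
    Returns 0 when any marginal is empty (e.g. a constant predictor)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

rng = np.random.default_rng(6)
y = (rng.random(10_000) < 0.01).astype(int)   # 1% positives
always_negative = np.zeros(10_000, dtype=int)

acc = (always_negative == y).mean()   # ~0.99 accuracy...
score = mcc(y, always_negative)       # ...but MCC = 0 flags the problem
```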
Many real-world classification problems involve rare events: loan defaults (1 to 5%), fraud (0.1 to 1%), equipment failure, rare diseases. When the event rate drops below a few percent, standard metrics and training procedures break down in predictable ways: accuracy is inflated by the majority class, the ROC curve looks optimistic because the large negative class dilutes the FPR, and a model can post a high AUROC while producing many false positives at any useful recall level. In this regime, the PR curve, AUPR, and MCC give a far more honest picture.
We generate 4000 observations from the make_moons distribution with Gaussian noise ($\sigma = 0.3$), standardize the features, and split 60/40 into training and test sets. While this synthetic dataset has balanced classes, the non-linear decision boundary and overlapping regions make it a useful testbed for comparing model flexibility. Three models are fit to the same data:

- Logistic regression, a linear baseline
- A feed-forward neural network
- A LightGBM gradient-boosted tree ensemble
Toggle between models and train/test splits below to see how each one performs across all the metrics discussed above.
| Metric | Train | Test |
|---|---|---|