Comparing Logistic Regression, Neural Networks, and LightGBM on synthetic data
Binary classification is one of the most common tasks in applied machine learning: fraud detection, credit default prediction, medical diagnosis, churn modeling. In each case the model outputs a probability that an observation belongs to the positive class, and the business decides on a threshold to act on. Choosing the right validation metrics is critical, especially when the classes are not balanced.
Before committing to a decision threshold, it is worth evaluating how well the model ranks and scores observations across all possible cut-offs. Threshold-free metrics answer the question: does the model assign higher probabilities to actual positives than to actual negatives? They should be the first thing you look at, because if the model cannot rank well, no choice of threshold will save it.
Given a threshold $T$, an observation with score $X > T$ is classified as positive. As $T$ varies, we trace out the Receiver Operating Characteristic (ROC) curve, plotting TPR$(T)$ against FPR$(T)$. The Area Under this curve (AUC) has a clean probabilistic interpretation: it equals the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative one:

$$\text{AUC} = \Pr(X_{\text{pos}} > X_{\text{neg}})$$
Equivalently, AUC is the normalized Mann-Whitney U statistic: $\text{AUC} = U / (P \cdot N)$. The Gini coefficient relates to AUC as $G = 2\,\text{AUC} - 1$ and is the standard discriminatory power metric in credit risk modeling (Basel II/III).
A random classifier scores AUC = 0.5; a perfect one scores 1.0. AUC is threshold-independent, which is both its strength (it summarizes all operating points) and its weakness (it weighs regions of the ROC curve that may be operationally irrelevant).
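The pairwise-ranking interpretation can be checked directly against the Mann-Whitney formula. This is a minimal numpy sketch; the function name `auc_mann_whitney` and the simulated scores are illustrative, not from the text.

```python
import numpy as np

def auc_mann_whitney(y_true, scores):
    """AUC via the Mann-Whitney U statistic: the fraction of
    (positive, negative) pairs in which the positive instance is
    ranked higher, with ties counted as half a concordant pair."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score against every negative score
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    u = greater + 0.5 * ties
    return u / (len(pos) * len(neg))  # AUC = U / (P * N)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
s = y + rng.normal(0, 1, size=500)  # scores correlated with the label

auc = auc_mann_whitney(y, s)
gini = 2 * auc - 1  # Gini coefficient used in credit risk
```

A model whose scores perfectly separate the classes gets AUC exactly 1.0 under this definition, regardless of the scores' absolute scale.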
When the positive class is rare, the ROC curve can be overly optimistic because FPR stays low even with many false positives (the denominator is the large negative class). The precision-recall (PR) curve plots precision against recall and is more informative for imbalanced problems. A model that looks strong on ROC may reveal poor precision at useful recall levels on the PR curve.
The area under the PR curve (AUPR) summarizes this trade-off in a single number:

$$\text{AUPR} = \int_0^1 \text{Precision}(r)\, dr$$
A random classifier achieves AUPR equal to the base rate (the proportion of positives), so unlike AUROC (where 0.5 is random) the baseline shifts with class balance. This makes AUPR particularly sensitive to performance on the minority class: a model must maintain high precision even as recall increases, which is hard when positives are rare. For imbalanced problems, a drop from AUROC 0.95 to AUPR 0.40 is common and reveals that the model produces many false positives at useful recall levels. AUPR should be the primary ranking metric when the positive class is the one you care about.
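The shifting baseline is easy to demonstrate. Below is a numpy sketch of the average-precision approximation to AUPR (the step-wise sum rather than a trapezoidal integral); the function name `average_precision` is illustrative. With random scores on a 5% base rate, AUPR lands near 0.05 rather than 0.5.

```python
import numpy as np

def average_precision(y_true, scores):
    """AUPR via the average-precision step approximation:
    the mean of precision@k over the ranks k where a true
    positive appears."""
    order = np.argsort(-scores)          # rank by score, descending
    y_sorted = y_true[order]
    cum_tp = np.cumsum(y_sorted)
    ranks = np.arange(1, len(y_sorted) + 1)
    precision_at_k = cum_tp / ranks
    return precision_at_k[y_sorted == 1].mean()

rng = np.random.default_rng(1)
n, base_rate = 20_000, 0.05
y = (rng.random(n) < base_rate).astype(int)

ap_random = average_precision(y, rng.random(n))     # ~ base rate, not 0.5
ap_perfect = average_precision(y, y.astype(float))  # 1.0
```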
Discrimination (ranking) is not the same as calibration (probability accuracy). A model can rank well (high AUC) while systematically over- or under-estimating probabilities. The calibration curve bins predicted probabilities and plots them against observed event rates. A perfectly calibrated model lies on the diagonal. In regulated industries like banking, well-calibrated probabilities feed directly into expected loss calculations and capital requirements, making calibration as important as discrimination.
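The binning procedure behind a calibration curve is a few lines of numpy. This sketch (the function name `calibration_curve` mirrors the concept, not any particular library) generates outcomes from the stated probabilities, so the curve should hug the diagonal.

```python
import numpy as np

def calibration_curve(y_true, probs, n_bins=10):
    """Bin predictions into equal-width probability bins and return,
    per non-empty bin, the mean predicted probability and the
    observed event rate."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(probs, bins) - 1, 0, n_bins - 1)
    mean_pred, event_rate = [], []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            mean_pred.append(probs[mask].mean())
            event_rate.append(y_true[mask].mean())
    return np.array(mean_pred), np.array(event_rate)

rng = np.random.default_rng(2)
p = rng.random(50_000)
y = (rng.random(50_000) < p).astype(int)  # outcomes drawn from p: perfectly calibrated

mean_pred, event_rate = calibration_curve(y, p)
```

A miscalibrated model (e.g. one whose scores were trained on a rebalanced sample) would show systematic gaps between the two returned arrays.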
The Brier score measures the mean squared error between predicted probabilities $\hat{p}_i$ and actual outcomes $y_i \in \{0, 1\}$:

$$\text{BS} = \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)^2$$
The Brier score captures both discrimination and calibration in a single number. It can be decomposed into a reliability term (calibration error), a resolution term (how much predictions vary across groups), and an uncertainty term (base rate entropy). Lower is better; a model that always predicts the base rate achieves the uncertainty-only baseline.
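The three-term decomposition (often attributed to Murphy) can be verified numerically. The sketch below uses forecasts restricted to bin midpoints so the decomposition is exact; with continuous forecasts a within-bin variance remainder appears. Function and variable names are illustrative.

```python
import numpy as np

def brier_decomposition(y_true, probs, n_bins=10):
    """Decompose the Brier score as reliability - resolution + uncertainty
    over equal-width probability bins."""
    brier = np.mean((probs - y_true) ** 2)
    base = y_true.mean()
    uncertainty = base * (1 - base)          # base-rate variance term
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(probs, bins) - 1, 0, n_bins - 1)
    reliability = resolution = 0.0
    n = len(y_true)
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            o_b = y_true[mask].mean()        # observed rate in bin
            p_b = probs[mask].mean()         # mean forecast in bin
            w = mask.sum() / n
            reliability += w * (p_b - o_b) ** 2   # calibration error
            resolution += w * (o_b - base) ** 2   # spread across bins
    return brier, reliability, resolution, uncertainty

rng = np.random.default_rng(3)
n = 20_000
probs = (rng.integers(0, 10, n) + 0.5) / 10      # forecasts at bin midpoints
y = (rng.random(n) < probs).astype(int)          # calibrated by construction

bs, rel, res, unc = brier_decomposition(y, probs)
```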
The Kolmogorov-Smirnov statistic measures the maximum vertical distance between the cumulative distribution functions of the positive and negative class scores. It identifies the threshold at which the model achieves the greatest separation between classes. KS is widely used in credit scoring as a quick summary of discriminatory power, though it only captures performance at a single operating point.
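A minimal numpy sketch of the KS statistic, evaluating the two empirical CDFs at every observed score and taking the largest gap; the function name `ks_statistic` and the Gaussian score distributions are illustrative assumptions.

```python
import numpy as np

def ks_statistic(y_true, scores):
    """Maximum vertical gap between the empirical score CDFs of the
    negative and positive classes, with the threshold that attains it."""
    thresholds = np.unique(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    # Empirical CDF of each class evaluated at every candidate threshold
    cdf_pos = np.searchsorted(np.sort(pos), thresholds, side="right") / len(pos)
    cdf_neg = np.searchsorted(np.sort(neg), thresholds, side="right") / len(neg)
    gaps = np.abs(cdf_neg - cdf_pos)
    best = np.argmax(gaps)
    return gaps[best], thresholds[best]

rng = np.random.default_rng(4)
neg = rng.normal(0.0, 1.0, 5000)
pos = rng.normal(1.5, 1.0, 5000)   # positives score higher on average
scores = np.concatenate([neg, pos])
y = np.concatenate([np.zeros(5000, int), np.ones(5000, int)])

ks, best_threshold = ks_statistic(y, scores)
```

For two unit-variance Gaussians separated by 1.5, the theoretical maximum gap is about 0.55, attained midway between the class means.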
Once you are satisfied with the model's ranking ability, the next step is choosing a probability cut-off and evaluating performance at that specific operating point. For a given threshold, every prediction falls into one of four cells:
| | Predicted 1 | Predicted 0 |
|---|---|---|
| True 1 | True Positive (TP) | False Negative (FN) |
| True 0 | False Positive (FP) | True Negative (TN) |
From these four counts we derive the rates that drive model selection:

- True positive rate (recall, sensitivity): TPR = TP / (TP + FN)
- False positive rate: FPR = FP / (FP + TN)
- Precision (positive predictive value): TP / (TP + FP)
- True negative rate (specificity): TN / (TN + FP)
Which cell matters most depends on the business context. In credit risk, a false negative (approving a borrower who defaults) is far more expensive than a false positive (declining a good applicant). In medical screening, missing a true positive can be life-threatening. The choice of metric should reflect these asymmetric costs.
Accuracy is the fraction of correct predictions:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Accuracy is intuitive but dangerously misleading on imbalanced data. If only 1% of transactions are fraudulent, a model that predicts "not fraud" for every observation achieves 99% accuracy while catching zero fraud. This is why accuracy should never be the primary metric for rare-event problems.
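The all-negative baseline is worth demonstrating once. A short numpy sketch with a simulated 1% fraud rate (the figures are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
y = (rng.random(n) < 0.01).astype(int)   # ~1% of transactions are fraud

pred = np.zeros(n, dtype=int)            # "never fraud" baseline model

accuracy = (pred == y).mean()            # ~0.99, looks excellent
recall = pred[y == 1].mean()             # 0.0: catches no fraud at all
```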
F1 captures the trade-off between precision and recall at a specific operating point. It is the harmonic mean of the two, giving equal weight to both:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Unlike accuracy, F1 ignores true negatives entirely, making it more sensitive to model performance on the minority class. The harmonic mean ensures that a model scoring poorly on either precision or recall cannot achieve a high F1. Note that F1 is a point on the PR curve: it tells you how well the model performs at your chosen threshold, while AUPR tells you how well it performs across all thresholds.
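Computing the three quantities from the confusion counts makes the harmonic-mean relationship concrete; the helper name `precision_recall_f1` and the toy vectors are illustrative.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and their harmonic mean (F1) from 0/1 predictions.
    Note that TN never enters any of the three quantities."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])  # 3 TP, 1 FP, 1 FN

p, r, f1 = precision_recall_f1(y_true, y_pred)
```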
MCC is one of the few threshold-dependent metrics that remains informative regardless of class balance. It is essentially the Pearson correlation between the observed and predicted binary vectors, and is related to the $\chi^2$ statistic of the $2 \times 2$ confusion table:

$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$
MCC ranges from $-1$ (total disagreement) through $0$ (random) to $+1$ (perfect). Unlike F1, it accounts for all four quadrants of the confusion matrix. A model that achieves high accuracy by always predicting the majority class will score MCC $\approx$ 0, immediately flagging the problem. For rare-event modeling, MCC and the PR curve together give the most honest picture of model performance.
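The majority-class failure mode shows up immediately in code. This numpy sketch (function name `mcc` and the 1% simulation are illustrative) follows the common convention of returning 0 when a marginal count is zero, which is exactly the constant-predictor case:

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient from the 2x2 confusion counts.
    Returns 0 when any marginal is empty (e.g. a constant predictor)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

rng = np.random.default_rng(6)
y = (rng.random(10_000) < 0.01).astype(int)   # 1% positives
always_negative = np.zeros(10_000, dtype=int)

acc = (always_negative == y).mean()   # ~0.99 accuracy...
score = mcc(y, always_negative)       # ...but MCC = 0 flags the problem
```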
Many real-world classification problems involve rare events: loan defaults (1 to 5%), fraud (0.1 to 1%), equipment failure, rare diseases. When the event rate drops below a few percent, standard metrics and training procedures break down in predictable ways: accuracy is inflated by the majority class, the ROC curve looks optimistic because the large negative class dilutes the FPR, and a model can post a high AUROC while producing many false positives at any useful recall level. In this regime, the PR curve, AUPR, and MCC give a far more honest picture.
We generate 4000 observations from the make_moons distribution with Gaussian noise ($\sigma = 0.3$), standardize the features, and split 60/40 into training and test sets. While this synthetic dataset has balanced classes, the non-linear decision boundary and overlapping regions make it a useful testbed for comparing model flexibility. Three models are fit to the same data:

- Logistic regression, a linear baseline
- A feed-forward neural network
- A LightGBM gradient-boosted tree ensemble
Toggle between models and train/test splits below to see how each one performs across all the metrics discussed above.
| Metric | Train | Test |
|---|---|---|