Machine learning analysis of 284,807 credit card transactions to detect the 0.17% that are fraudulent. Comparing supervised and unsupervised approaches on highly imbalanced data.
The central challenge of this dataset is extreme class imbalance — for every fraudulent transaction, there are roughly 578 legitimate ones.
Only 492 out of 284,807 transactions are fraudulent (0.173%). This means a model that simply labels every transaction as legitimate would be "99.8% accurate" while catching zero fraud. Accuracy is therefore a misleading metric for this problem. Instead, we evaluate models using precision (of the transactions flagged as fraud, how many actually are?), recall (of all actual fraud cases, how many did we catch?), and AUPRC (Area Under the Precision-Recall Curve) — a single number summarizing the precision-recall tradeoff where higher is better.
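The accuracy trap is easy to verify with a few lines of arithmetic. This sketch scores the trivial "label everything legitimate" classifier using the dataset counts above (precision is left out because it is undefined when nothing is flagged):

```python
# Why "accuracy" misleads on imbalanced data: score the trivial
# classifier that labels every transaction as legitimate.
total, frauds = 284_807, 492           # dataset counts from above

tn = total - frauds                    # every legit transaction "correct"
fn = frauds                            # every fraud is missed
tp = 0                                 # nothing is ever flagged

accuracy = (tp + tn) / total
recall = tp / (tp + fn)                # fraction of actual fraud caught

print(f"accuracy = {accuracy:.4f}")    # ≈ 0.9983 — looks impressive
print(f"recall   = {recall:.4f}")      # 0.0000 — catches zero fraud
```

Recall (and AUPRC, which aggregates it across thresholds) exposes immediately what accuracy hides.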
Before building any models, we explore the data to understand how fraudulent transactions differ from legitimate ones in amount, timing, and feature patterns.
We trained three supervised models on SMOTE-balanced data, each with different strengths, plus an unsupervised Isolation Forest for anomaly detection.
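SMOTE (Synthetic Minority Over-sampling Technique) rebalances the training set by synthesizing new fraud examples on the line segment between an existing fraud example and one of its nearest fraud neighbors. The real pipeline would use `imblearn.over_sampling.SMOTE`; this is a minimal sketch of just the interpolation step, with made-up feature vectors:

```python
import random

def smote_point(x, neighbor, rng=random):
    """Synthesize one minority-class sample on the segment between
    a fraud example and one of its nearest fraud neighbors."""
    t = rng.random()                   # interpolation factor in [0, 1)
    return [xi + t * (ni - xi) for xi, ni in zip(x, neighbor)]

# Two hypothetical fraud examples in a 3-feature space.
a, b = [0.0, 1.0, 2.0], [1.0, 3.0, 2.0]
synthetic = smote_point(a, b)

# The synthetic point lies between the two originals, feature-wise.
assert all(min(ai, bi) <= si <= max(ai, bi)
           for ai, bi, si in zip(a, b, synthetic))
```

Repeating this for many fraud/neighbor pairs grows the minority class until the two classes are balanced, without duplicating rows verbatim.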
Logistic Regression finds a linear decision boundary between classes — it serves as a fast, interpretable baseline. Random Forest combines many decision trees through majority voting, handling non-linear patterns while resisting overfitting. XGBoost builds trees sequentially, each correcting the previous one's mistakes — it is considered state-of-the-art for tabular data. Isolation Forest takes a different approach entirely: instead of learning from labeled fraud examples, it learns what "normal" looks like and flags anything unusual as anomalous.
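The Isolation Forest idea is the easiest to demo in isolation: fit on unlabeled data, and points that random splits isolate quickly get flagged as anomalies. A sketch using scikit-learn's `IsolationForest` on synthetic 2-D data (a tight cluster plus one planted outlier standing in for the real features):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# "Normal" transactions: a tight cluster; one obvious outlier appended.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outlier = np.array([[8.0, 8.0]])
X = np.vstack([normal, outlier])

# Fit without labels; unusual points are isolated in fewer random
# splits and receive lower anomaly scores.
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
pred = iso.predict(X)                  # -1 = anomaly, +1 = normal

print(pred[-1])                        # the planted outlier is flagged: -1
```

Note the contrast with the supervised models: no fraud labels are used during fitting, which is exactly why its recall in the table below is so much lower.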
| Model | ROC-AUC | AUPRC | Fraud Caught | False Alarms | Missed Fraud | Precision | Recall |
|---|---|---|---|---|---|---|---|
| Random Forest | 0.9779 | 0.8553 | 79 / 98 (80.6%) | 14 | 19 | 84.95% | 80.61% |
| XGBoost | 0.9761 | 0.8477 | 86 / 98 (87.8%) | 142 | 12 | 37.72% | 87.76% |
| Logistic Regression | 0.9706 | 0.7281 | 90 / 98 (91.8%) | 1,534 | 8 | 5.54% | 91.84% |
| Isolation Forest (unsupervised) | N/A | N/A | 34 / 98 (34.7%) | 70 | 64 | 32.69% | 34.69% |
Rather than evaluating models at a single threshold, these curves show how each model performs across all possible classification thresholds.
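The mechanics of such a curve are simple: sweep a threshold over the model's scores and recompute precision and recall at each cut. This sketch does exactly that on invented toy scores and labels:

```python
def precision_recall_at(threshold, scores, labels):
    """Precision/recall when flagging every score >= threshold as fraud."""
    flagged = [l for s, l in zip(scores, labels) if s >= threshold]
    tp = sum(flagged)                  # frauds among flagged transactions
    precision = tp / len(flagged) if flagged else 1.0
    recall = tp / sum(labels)          # frauds caught out of all frauds
    return precision, recall

# Toy scores (higher = more fraud-like) and true labels (1 = fraud).
scores = [0.95, 0.90, 0.60, 0.40, 0.30, 0.10]
labels = [1,    1,    0,    1,    0,    0   ]

for t in (0.8, 0.5, 0.2):
    p, r = precision_recall_at(t, scores, labels)
    print(f"threshold {t}: precision={p:.2f} recall={r:.2f}")
```

Lowering the threshold catches more fraud (recall rises) at the cost of more false alarms (precision falls); AUPRC summarizes the whole sweep in one number.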
Even though most features are anonymized via PCA, we can still identify which ones the models rely on most for detecting fraud.
Since the original features (merchant category, geographic distance, purchase frequency, etc.) have been transformed into anonymous principal components, we cannot directly interpret what V14 or V17 "mean." However, feature importance scores reveal which transformed features carry the strongest fraud signal. If both tree-based models (Random Forest and XGBoost) independently agree on the same features, that gives us confidence that these capture genuine fraud patterns rather than noise.
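On synthetic data, it is easy to see how a tree ensemble recovers which column carries the signal. In this sketch only column 2 (a hypothetical stand-in for a component like V14) is informative, and the fitted Random Forest's `feature_importances_` ranks it first:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins for PCA components: only column 2 carries
# the class signal; the other four columns are pure noise.
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 5))
X[:, 2] += 2.0 * y                     # inject class signal into one column

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
top = int(np.argmax(rf.feature_importances_))
print(top)                             # the informative column: 2
```

Running the same check with XGBoost's importance scores and comparing the rankings is the cross-model agreement test described above.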