Credit Card Fraud Detection

Machine learning analysis of 284,807 credit card transactions to detect the 0.17% that are fraudulent. Comparing supervised and unsupervised approaches on highly imbalanced data.

Dataset: Kaggle — ULB Machine Learning Group · European cardholders, Sept 2013
- Total Transactions: 284,807
- Fraudulent: 492 (0.173%)
- Best Model (by AUPRC): Random Forest
- Best AUPRC Score: 0.8553

The Imbalance Problem

The central challenge of this dataset is extreme class imbalance — for every fraudulent transaction, there are roughly 578 legitimate ones.

Only 492 out of 284,807 transactions are fraudulent (0.173%). This means a model that simply labels every transaction as legitimate would be "99.8% accurate" while catching zero fraud. Accuracy is therefore a misleading metric for this problem. Instead, we evaluate models using precision (of the transactions flagged as fraud, how many actually are?), recall (of all actual fraud cases, how many did we catch?), and AUPRC (Area Under the Precision-Recall Curve) — a single number summarizing the precision-recall tradeoff where higher is better.
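The accuracy trap is easy to demonstrate. The sketch below uses a toy label vector with the same ~0.17% positive rate (the data and scores are illustrative, not the project's actual predictions):

```python
# Illustration: why accuracy misleads at a ~0.17% fraud rate.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, average_precision_score

rng = np.random.default_rng(0)
y_true = np.zeros(10_000, dtype=int)
y_true[:17] = 1                              # 17 fraud in 10,000 ~ the dataset's imbalance

# A "model" that flags nothing as fraud is still ~99.8% accurate...
y_all_legit = np.zeros_like(y_true)
print(accuracy_score(y_true, y_all_legit))   # 0.9983
print(recall_score(y_true, y_all_legit))     # 0.0 -- catches zero fraud

# ...so we rank transactions by fraud score and use AUPRC instead.
y_scores = rng.random(10_000)                # random scores -> AUPRC near the 0.17% base rate
print(average_precision_score(y_true, y_scores))
```

A random scorer's AUPRC sits near the fraud base rate, which is why AUPRC separates models far more sharply than accuracy or even ROC-AUC on this data.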

Class Distribution

The doughnut chart shows the massive imbalance: 284,315 legitimate transactions vs. only 492 fraudulent ones. The fraud slice is so small it is barely visible at this scale.

Addressing Imbalance with SMOTE

SMOTE (Synthetic Minority Oversampling Technique) addresses this by generating new synthetic fraud examples. It picks a real fraud transaction, finds its nearest neighbors among other fraud cases, and creates a new example between them. This gives the model enough fraud examples to learn meaningful patterns. SMOTE is applied only to the training data — the test set stays untouched so it reflects the real-world class distribution.
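The interpolation step can be sketched in a few lines. This is a minimal illustration of the SMOTE idea using NumPy and scikit-learn's `NearestNeighbors`; the analysis itself would use a library implementation such as imbalanced-learn's `SMOTE`, and all names here are illustrative:

```python
# Minimal sketch of the SMOTE idea (not the imbalanced-learn implementation).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_fraud, n_new, k=5, seed=0):
    """Create n_new synthetic fraud rows by interpolating between a real
    fraud case and one of its k nearest neighbors among other fraud cases."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_fraud)  # +1: each point is its own neighbor
    _, idx = nn.kneighbors(X_fraud)
    base = rng.integers(0, len(X_fraud), n_new)            # pick real fraud rows
    neigh = idx[base, rng.integers(1, k + 1, n_new)]       # pick one of their k neighbors
    gap = rng.random((n_new, 1))                           # random point on the segment
    return X_fraud[base] + gap * (X_fraud[neigh] - X_fraud[base])

# Applied to (hypothetical) training fraud rows only -- never the test set.
X_fraud_train = np.random.default_rng(1).normal(size=(50, 3))
X_synth = smote_sketch(X_fraud_train, n_new=200)
print(X_synth.shape)   # (200, 3)
```

Because each synthetic row is a convex combination of two real fraud rows, it always lies on a segment between observed fraud cases rather than in arbitrary feature space.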

Fraud Patterns

Before building any models, we explore the data to understand how fraudulent transactions differ from legitimate ones in amount, timing, and feature patterns.

Transaction Amount Distribution

Fraudulent transactions tend to be smaller. The median fraud amount is $9.25, much lower than the $22.00 median for legitimate transactions. The majority of fraud occurs under $100, while the largest fraud is around $2,125 compared to $25,691 for legitimate transactions. This makes intuitive sense: fraudsters often test with small amounts first to see if a stolen card works, and smaller transactions are less likely to trigger manual review.

Fraud Rate by Hour of Day

The fraud rate spikes during off-peak hours, peaking sharply at 2 AM (1.7%) with elevated rates through 3–5 AM (highlighted in red). Legitimate transactions drop off during nighttime hours, reflecting normal consumer behavior, but the proportion of transactions that turn out to be fraudulent is much higher during these quiet periods. This suggests fraudsters may prefer hours when monitoring is reduced and response times are slower.
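The hour-of-day feature behind this chart can be derived from the dataset's `Time` column, which records seconds elapsed since the first transaction (the data spans two days). The frame below is a tiny hypothetical stand-in for the real one:

```python
# Deriving the hour-of-day feature from the `Time` column
# (seconds since the first transaction); df is a toy stand-in.
import pandas as pd

df = pd.DataFrame({
    "Time": [0, 3600, 7_200, 93_600],   # 0s, 1h, 2h, 26h after the first transaction
    "Class": [0, 0, 1, 0],              # 1 = fraud
})
df["Hour"] = (df["Time"] // 3600) % 24  # wrap at 24h: the data spans two days

# Fraud rate per hour = share of that hour's transactions labelled fraud.
fraud_rate = df.groupby("Hour")["Class"].mean()
print(fraud_rate)
```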

Feature Correlation with Fraud

Since most features are anonymized via PCA, we can't interpret them directly — but we can identify which ones are most statistically related to fraud. V17 and V14 have the strongest negative correlations: when these values decrease, fraud becomes more likely. V11 and V4 have positive correlations: higher values suggest fraud. Most individual correlations are relatively weak (below |0.3|), meaning no single feature reliably indicates fraud on its own. This is exactly why we need machine learning — models can combine multiple weak signals into a strong overall prediction, like diagnosing an illness from a combination of symptoms rather than any single one.
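These per-feature correlations are straightforward to compute with pandas. The frame below is a synthetic toy with the signal directions planted by hand (real data would have columns `V1`–`V28`, `Amount`, and `Class`):

```python
# How per-feature correlation with the fraud label can be computed
# (toy frame with planted signals; illustrative only).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
df = pd.DataFrame({
    "V14": -2.0 * y + rng.normal(size=1000),  # planted negative signal
    "V4":   1.5 * y + rng.normal(size=1000),  # planted positive signal
    "Class": y,
})
corr = df.corr()["Class"].drop("Class").sort_values()
print(corr)   # V14 most negative, V4 most positive
```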

Model Comparison

We trained three supervised models on SMOTE-balanced data, each with different strengths, plus an unsupervised Isolation Forest for anomaly detection.

Logistic Regression finds a linear decision boundary between classes — it serves as a fast, interpretable baseline. Random Forest combines many decision trees through majority voting, handling non-linear patterns while resisting overfitting. XGBoost builds trees sequentially, each correcting the previous one's mistakes — it is considered state-of-the-art for tabular data. Isolation Forest takes a different approach entirely: instead of learning from labeled fraud examples, it learns what "normal" looks like and flags anything unusual as anomalous.

AUPRC & ROC-AUC Scores

All three models achieve strong ROC-AUC scores (above 0.97), but their AUPRC scores differ more meaningfully. AUPRC is the more honest metric for imbalanced data because it focuses on how well the model identifies the rare fraud class. Random Forest leads with 0.8553.

Fraud Detection Results (98 fraud in test set)

This chart reveals the fundamental tradeoff in fraud detection: catching more fraud comes at the cost of more false alarms. Note the log scale — Logistic Regression's 1,534 false alarms dwarfs Random Forest's 14. In practice, investigating a false alarm costs time and resources, but missing a real fraud case means direct financial loss.

Detailed Model Performance

| Model | ROC-AUC | AUPRC | Fraud Caught | False Alarms | Missed Fraud | Precision | Recall |
|---|---|---|---|---|---|---|---|
| Random Forest | 0.9779 | 0.8553 | 79 / 98 (80.6%) | 14 | 19 | 84.95% | 80.61% |
| XGBoost | 0.9761 | 0.8477 | 86 / 98 (87.8%) | 142 | 12 | 37.72% | 87.76% |
| Logistic Regression | 0.9706 | 0.7281 | 90 / 98 (91.8%) | 1,534 | 8 | 5.54% | 91.84% |
| Isolation Forest (unsupervised) | N/A | — | 34 / 98 (34.7%) | 70 | 64 | 32.69% | 34.69% |
Random Forest achieves the best precision-recall balance: 84.95% precision means that when it flags a transaction as fraud, it is correct 85 times out of 100, with only 14 false alarms total. Logistic Regression catches the most fraud (91.8%) but with extremely low precision (5.54%), meaning it generates roughly 17 false alarms for every real fraud it catches. The Isolation Forest, working without any fraud labels, still manages to catch about a third of fraud cases — demonstrating that anomaly detection can find fraudulent patterns purely from what "normal" looks like.

Performance Curves

Rather than evaluating models at a single threshold, these curves show how each model performs across all possible classification thresholds.

ROC Curves

The ROC curve plots the True Positive Rate (fraud caught) against the False Positive Rate (false alarms) at every possible threshold. A perfect model hugs the top-left corner; the dashed diagonal represents random guessing (AUC = 0.5). All three models perform well here, but ROC-AUC can be overly optimistic when classes are highly imbalanced — a model can appear strong on this metric while still producing many false positives in absolute terms.

Precision-Recall Curves

The Precision-Recall curve provides a more honest picture for imbalanced data. It focuses specifically on the fraud class: Recall (x-axis) shows how many fraud cases the model found, while Precision (y-axis) shows how accurate its fraud flags are. The dashed baseline represents random flagging (precision equal to the 0.17% fraud rate). Random Forest maintains high precision even at higher recall levels, which is why it achieves the best AUPRC.
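Both curves come from the same ranked scores. The sketch below uses ten hand-written labels and scores (not the project's actual test-set predictions, which would come from `predict_proba`):

```python
# Computing both curves from one set of ranked scores (toy values).
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, auc

y_true  = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([.1, .2, .15, .05, .3, .4, .8, .35, .25, .9])

fpr, tpr, _ = roc_curve(y_true, y_score)          # sweep every threshold
prec, rec, _ = precision_recall_curve(y_true, y_score)

print(auc(fpr, tpr))   # ROC-AUC
print(auc(rec, prec))  # area under the PR curve
```

One misranked positive (score 0.25, below three negatives) is enough to pull the ROC-AUC below 1.0, and it hits the PR curve's precision even harder — the asymmetry the section describes.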

Threshold Optimization (Random Forest)

By default, models use a 0.5 probability threshold: if the predicted fraud probability exceeds 50%, the transaction is flagged. But this default is rarely optimal. As the threshold moves left (lower), recall increases (more fraud caught) but precision drops (more false alarms). As it moves right (higher), precision improves but recall falls. The F1 score (yellow line) balances both metrics and peaks at a threshold of 0.72, achieving 95.0% precision and 77.6% recall. In practice, the right threshold depends on the relative cost of false positives vs. false negatives — a decision best made with business stakeholders.
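A threshold sweep like the one behind this chart can be sketched as follows, using toy probabilities in place of the Random Forest's actual outputs:

```python
# Sweeping the decision threshold and scoring F1 at each point (toy data).
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0] * 90 + [1] * 10)
rng = np.random.default_rng(0)
y_prob = np.concatenate([rng.uniform(0, .6, 90),   # legit: mostly low scores
                         rng.uniform(.4, 1, 10)])  # fraud: mostly high scores

thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_true, (y_prob >= t).astype(int), zero_division=0)
       for t in thresholds]
best = thresholds[int(np.argmax(f1s))]
print(best, max(f1s))   # the F1-optimal threshold is usually not 0.5
```

In production, F1 would typically be replaced by a cost-weighted objective reflecting the actual price of a false alarm versus a missed fraud.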

Feature Importance

Even though most features are anonymized via PCA, we can still identify which ones the models rely on most for detecting fraud.

Since the original features (merchant category, geographic distance, purchase frequency, etc.) have been transformed into anonymous principal components, we cannot directly interpret what V14 or V17 "mean." However, feature importance scores reveal which transformed features carry the strongest fraud signal. If both models independently agree on the same features, that gives us confidence that these capture genuine fraud patterns rather than noise.

Random Forest

XGBoost

V14 dominates in both models — in XGBoost it accounts for over 55% of the total importance. V10, V4, V17, and V12 also rank highly across both. These are the same features that showed the strongest correlations with fraud in the exploratory analysis above, confirming they capture genuine patterns. The Amount and Hour engineered features also appear in XGBoost's rankings, confirming that our feature engineering added useful information despite their weak individual correlations. The fact that a small number of features dominate suggests that fraud has a distinct and recognizable statistical signature in this transaction data.
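Importance rankings like these come straight from the fitted estimators. The sketch below fits a small forest on a toy frame with one planted signal (real columns would be `V1`–`V28` plus `Amount` and `Hour`):

```python
# Reading feature importances from a fitted forest (toy fit, planted signal).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["V14", "V10", "noise"])
y = (X["V14"] + 0.5 * X["V10"] + 0.1 * rng.normal(size=500) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
imp = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(imp)   # the planted V14 signal should dominate; "noise" near zero
```

XGBoost exposes the equivalent ranking through its own `feature_importances_` attribute, which is what makes the side-by-side comparison above possible.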

Key Takeaways