Credit Card Fraud Detection

Machine learning analysis of 284,807 credit card transactions to detect the 0.17% that are fraudulent. Comparing supervised and unsupervised approaches on highly imbalanced data.

Dataset: Kaggle — ULB Machine Learning Group · European cardholders, Sept 2013
- Total Transactions: 284,807
- Fraudulent: 492 (0.173%)
- Best Model (by AUPRC): Random Forest
- Best AUPRC Score: 0.8553

The Imbalance Problem

The central challenge of this dataset is extreme class imbalance — for every fraudulent transaction, there are roughly 578 legitimate ones.

Only 492 out of 284,807 transactions are fraudulent (0.173%). This means a model that simply labels every transaction as legitimate would be "99.8% accurate" while catching zero fraud. Accuracy is therefore a misleading metric for this problem. Instead, we evaluate models using precision (of the transactions flagged as fraud, how many actually are?), recall (of all actual fraud cases, how many did we catch?), and AUPRC (Area Under the Precision-Recall Curve) — a single number summarizing the precision-recall tradeoff where higher is better.
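The accuracy trap is easy to demonstrate. The sketch below uses a toy label vector with the same ~0.17% positive rate (the data and scores are illustrative, not the project's actual predictions):

```python
# Illustration: why accuracy misleads at a ~0.17% fraud rate.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score, average_precision_score

rng = np.random.default_rng(0)
y_true = np.zeros(10_000, dtype=int)
y_true[:17] = 1                              # 17 fraud in 10,000 ~ the dataset's imbalance

# A "model" that flags nothing as fraud is still ~99.8% accurate...
y_all_legit = np.zeros_like(y_true)
print(accuracy_score(y_true, y_all_legit))   # 0.9983
print(recall_score(y_true, y_all_legit))     # 0.0 -- catches zero fraud

# ...so we rank transactions by fraud score and use AUPRC instead.
y_scores = rng.random(10_000)                # random scores -> AUPRC near the 0.17% base rate
print(average_precision_score(y_true, y_scores))
```

A random scorer's AUPRC sits near the fraud base rate, which is why AUPRC separates models far more sharply than accuracy or even ROC-AUC on this data.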

Class Distribution

The doughnut chart shows the massive imbalance: 284,315 legitimate transactions vs. only 492 fraudulent ones. The fraud slice is so small it is barely visible at this scale.

Addressing Imbalance with SMOTE

SMOTE (Synthetic Minority Oversampling Technique) addresses this by generating new synthetic fraud examples. It picks a real fraud transaction, finds its nearest neighbors among other fraud cases, and creates a new example between them. This gives the model enough fraud examples to learn meaningful patterns. SMOTE is applied only to the training data — the test set stays untouched so it reflects the real-world class distribution.
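The interpolation step can be sketched in a few lines. This is a minimal illustration of the SMOTE idea using NumPy and scikit-learn's `NearestNeighbors`; the analysis itself would use a library implementation such as imbalanced-learn's `SMOTE`, and all names here are illustrative:

```python
# Minimal sketch of the SMOTE idea (not the imbalanced-learn implementation).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_fraud, n_new, k=5, seed=0):
    """Create n_new synthetic fraud rows by interpolating between a real
    fraud case and one of its k nearest neighbors among other fraud cases."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_fraud)  # +1: each point is its own neighbor
    _, idx = nn.kneighbors(X_fraud)
    base = rng.integers(0, len(X_fraud), n_new)            # pick real fraud rows
    neigh = idx[base, rng.integers(1, k + 1, n_new)]       # pick one of their k neighbors
    gap = rng.random((n_new, 1))                           # random point on the segment
    return X_fraud[base] + gap * (X_fraud[neigh] - X_fraud[base])

# Applied to (hypothetical) training fraud rows only -- never the test set.
X_fraud_train = np.random.default_rng(1).normal(size=(50, 3))
X_synth = smote_sketch(X_fraud_train, n_new=200)
print(X_synth.shape)   # (200, 3)
```

Because each synthetic row is a convex combination of two real fraud rows, it always lies on a segment between observed fraud cases rather than in arbitrary feature space.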

Fraud Patterns

Before building any models, we explore the data to understand how fraudulent transactions differ from legitimate ones in amount, timing, and feature patterns.

Transaction Amount Distribution

Fraudulent transactions tend to be smaller. The median fraud amount is $9.25, much lower than the $22.00 median for legitimate transactions. The majority of fraud occurs under $100, while the largest fraud is around $2,125 compared to $25,691 for legitimate transactions. This makes intuitive sense: fraudsters often test with small amounts first to see if a stolen card works, and smaller transactions are less likely to trigger manual review.

Fraud Rate by Hour of Day

The fraud rate spikes during off-peak hours, peaking sharply at 2 AM (1.7%) with elevated rates through 3–5 AM (highlighted in red). Legitimate transactions drop off during nighttime hours, reflecting normal consumer behavior, but the proportion of transactions that turn out to be fraudulent is much higher during these quiet periods. This suggests fraudsters may prefer hours when monitoring is reduced and response times are slower.
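The hour-of-day feature behind this chart can be derived from the dataset's `Time` column, which records seconds elapsed since the first transaction (the data spans two days). The frame below is a tiny hypothetical stand-in for the real one:

```python
# Deriving the hour-of-day feature from the `Time` column
# (seconds since the first transaction); df is a toy stand-in.
import pandas as pd

df = pd.DataFrame({
    "Time": [0, 3600, 7_200, 93_600],   # 0s, 1h, 2h, 26h after the first transaction
    "Class": [0, 0, 1, 0],              # 1 = fraud
})
df["Hour"] = (df["Time"] // 3600) % 24  # wrap at 24h: the data spans two days

# Fraud rate per hour = share of that hour's transactions labelled fraud.
fraud_rate = df.groupby("Hour")["Class"].mean()
print(fraud_rate)
```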

Feature Correlation with Fraud

Since most features are anonymized via PCA, we can't interpret them directly — but we can identify which ones are most statistically related to fraud. V17 and V14 have the strongest negative correlations: when these values decrease, fraud becomes more likely. V11 and V4 have positive correlations: higher values suggest fraud. Most individual correlations are relatively weak (below |0.3|), meaning no single feature reliably indicates fraud on its own. This is exactly why we need machine learning — models can combine multiple weak signals into a strong overall prediction, like diagnosing an illness from a combination of symptoms rather than any single one.
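These per-feature correlations are straightforward to compute with pandas. The frame below is a synthetic toy with the signal directions planted by hand (real data would have columns `V1`–`V28`, `Amount`, and `Class`):

```python
# How per-feature correlation with the fraud label can be computed
# (toy frame with planted signals; illustrative only).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 1000)
df = pd.DataFrame({
    "V14": -2.0 * y + rng.normal(size=1000),  # planted negative signal
    "V4":   1.5 * y + rng.normal(size=1000),  # planted positive signal
    "Class": y,
})
corr = df.corr()["Class"].drop("Class").sort_values()
print(corr)   # V14 most negative, V4 most positive
```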

Model Comparison

We trained three supervised models on SMOTE-balanced data, each with different strengths, plus an unsupervised Isolation Forest for anomaly detection.

Logistic Regression finds a linear decision boundary between classes — it serves as a fast, interpretable baseline. Random Forest combines many decision trees through majority voting, handling non-linear patterns while resisting overfitting. XGBoost builds trees sequentially, each correcting the previous one's mistakes — it is considered state-of-the-art for tabular data. Isolation Forest takes a different approach entirely: instead of learning from labeled fraud examples, it learns what "normal" looks like and flags anything unusual as anomalous.

AUPRC & ROC-AUC Scores

All three models achieve strong ROC-AUC scores (above 0.97), but their AUPRC scores differ more meaningfully. AUPRC is the more honest metric for imbalanced data because it focuses on how well the model identifies the rare fraud class. Random Forest leads with 0.8553.

Fraud Detection Results (98 fraud in test set)

This chart reveals the fundamental tradeoff in fraud detection: catching more fraud comes at the cost of more false alarms. Note the log scale — Logistic Regression's 1,534 false alarms dwarfs Random Forest's 14. In practice, investigating a false alarm costs time and resources, but missing a real fraud case means direct financial loss.

Detailed Model Performance

| Model | ROC-AUC | AUPRC | Fraud Caught | False Alarms | Missed Fraud | Precision | Recall |
|---|---|---|---|---|---|---|---|
| Random Forest | 0.9779 | 0.8553 | 79 / 98 (80.6%) | 14 | 19 | 84.95% | 80.61% |
| XGBoost | 0.9761 | 0.8477 | 86 / 98 (87.8%) | 142 | 12 | 37.72% | 87.76% |
| Logistic Regression | 0.9706 | 0.7281 | 90 / 98 (91.8%) | 1,534 | 8 | 5.54% | 91.84% |
| Isolation Forest (unsupervised) | N/A | — | 34 / 98 (34.7%) | 70 | 64 | 32.69% | 34.69% |
Random Forest achieves the best precision-recall balance: 84.95% precision means that when it flags a transaction as fraud, it is correct 85 times out of 100, with only 14 false alarms total. Logistic Regression catches the most fraud (91.8%) but with extremely low precision (5.54%), meaning it generates roughly 17 false alarms for every real fraud it catches. The Isolation Forest, working without any fraud labels, still manages to catch about a third of fraud cases — demonstrating that anomaly detection can find fraudulent patterns purely from what "normal" looks like.

Performance Curves

Rather than evaluating models at a single threshold, these curves show how each model performs across all possible classification thresholds.

ROC Curves

The ROC curve plots the True Positive Rate (fraud caught) against the False Positive Rate (false alarms) at every possible threshold. A perfect model hugs the top-left corner; the dashed diagonal represents random guessing (AUC = 0.5). All three models perform well here, but ROC-AUC can be overly optimistic when classes are highly imbalanced — a model can appear strong on this metric while still producing many false positives in absolute terms.

Precision-Recall Curves

The Precision-Recall curve provides a more honest picture for imbalanced data. It focuses specifically on the fraud class: Recall (x-axis) shows how many fraud cases the model found, while Precision (y-axis) shows how accurate its fraud flags are. The dashed baseline represents random flagging (precision equal to the 0.17% fraud rate). Random Forest maintains high precision even at higher recall levels, which is why it achieves the best AUPRC.
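Both curves come from the same ranked scores. The sketch below uses ten hand-written labels and scores (not the project's actual test-set predictions, which would come from `predict_proba`):

```python
# Computing both curves from one set of ranked scores (toy values).
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, auc

y_true  = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([.1, .2, .15, .05, .3, .4, .8, .35, .25, .9])

fpr, tpr, _ = roc_curve(y_true, y_score)          # sweep every threshold
prec, rec, _ = precision_recall_curve(y_true, y_score)

print(auc(fpr, tpr))   # ROC-AUC
print(auc(rec, prec))  # area under the PR curve
```

One misranked positive (score 0.25, below three negatives) is enough to pull the ROC-AUC below 1.0, and it hits the PR curve's precision even harder — the asymmetry the section describes.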

Threshold Optimization (Random Forest)

By default, models use a 0.5 probability threshold: if the predicted fraud probability exceeds 50%, the transaction is flagged. But this default is rarely optimal. As the threshold moves left (lower), recall increases (more fraud caught) but precision drops (more false alarms). As it moves right (higher), precision improves but recall falls. The F1 score (yellow line) balances both metrics and peaks at a threshold of 0.72, achieving 95.0% precision and 77.6% recall. In practice, the right threshold depends on the relative cost of false positives vs. false negatives — a decision best made with business stakeholders.
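A threshold sweep like the one behind this chart can be sketched as follows, using toy probabilities in place of the Random Forest's actual outputs:

```python
# Sweeping the decision threshold and scoring F1 at each point (toy data).
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0] * 90 + [1] * 10)
rng = np.random.default_rng(0)
y_prob = np.concatenate([rng.uniform(0, .6, 90),   # legit: mostly low scores
                         rng.uniform(.4, 1, 10)])  # fraud: mostly high scores

thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_true, (y_prob >= t).astype(int), zero_division=0)
       for t in thresholds]
best = thresholds[int(np.argmax(f1s))]
print(best, max(f1s))   # the F1-optimal threshold is usually not 0.5
```

In production, F1 would typically be replaced by a cost-weighted objective reflecting the actual price of a false alarm versus a missed fraud.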

Feature Importance

Even though most features are anonymized via PCA, we can still identify which ones the models rely on most for detecting fraud.

Since the original features (merchant category, geographic distance, purchase frequency, etc.) have been transformed into anonymous principal components, we cannot directly interpret what V14 or V17 "mean." However, feature importance scores reveal which transformed features carry the strongest fraud signal. If both models independently agree on the same features, that gives us confidence that these capture genuine fraud patterns rather than noise.

Random Forest

XGBoost

V14 dominates in both models — in XGBoost it accounts for over 55% of the total importance. V10, V4, V17, and V12 also rank highly across both. These are the same features that showed the strongest correlations with fraud in the exploratory analysis above, confirming they capture genuine patterns. The Amount and Hour engineered features also appear in XGBoost's rankings, confirming that our feature engineering added useful information despite their weak individual correlations. The fact that a small number of features dominate suggests that fraud has a distinct and recognizable statistical signature in this transaction data.
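Importance rankings like these come straight from the fitted estimators. The sketch below fits a small forest on a toy frame with one planted signal (real columns would be `V1`–`V28` plus `Amount` and `Hour`):

```python
# Reading feature importances from a fitted forest (toy fit, planted signal).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["V14", "V10", "noise"])
y = (X["V14"] + 0.5 * X["V10"] + 0.1 * rng.normal(size=500) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
imp = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(imp)   # the planted V14 signal should dominate; "noise" near zero
```

XGBoost exposes the equivalent ranking through its own `feature_importances_` attribute, which is what makes the side-by-side comparison above possible.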

Key Takeaways