Welcome to another article on Python, data analysis, data mining, and fun tools!
When we shop online, many of us like to pay by credit card. But credit card fraud happens all around us, and network security is becoming a vital part of our lives.
To tackle this problem, we can use machine learning to build an anomaly detection system: if a transaction looks suspicious, the operation is stopped.
In this article, I will walk through an end-to-end model training workflow, from data acquisition to final model selection. Feel free to follow, like, and support!
About the data
The data used in this article is the Kaggle dataset at https://www.kaggle.com/mlg-ulb/creditcardfraud , which contains real bank transactions made by European cardholders in 2013. For confidentiality, most features have been transformed with PCA: the dataset has 30 feature columns (the PCA components V1-V28, plus Time and Amount) and 1 Class column.
Import the necessary libraries
Here I import all the necessary libraries. Since the credit card features are already a PCA-transformed version, we do not need to perform feature selection again. Otherwise, techniques such as RFE, RFECV, SelectKBest, and VIF scores are recommended for finding features that fit the model, as sketched below.
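For reference, here is a minimal feature-selection sketch. It is not part of this article's pipeline (the PCA features make it unnecessary), and it assumes a feature matrix X and a target vector y are already defined:

from sklearn.feature_selection import SelectKBest, RFE, f_classif
from sklearn.linear_model import LogisticRegression

# Keep the 10 features with the highest ANOVA F-scores
X_kbest = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# Or recursively eliminate features using a simple estimator
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)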
#Packages related to general operating system & warnings
import os
import warnings
warnings.filterwarnings('ignore')

#Packages related to data importing, manipulation, exploratory data analysis, data understanding
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from termcolor import colored as cl # text customization

#Packages related to data visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#Setting plot sizes and type of plot
plt.rc("font", size=14)
plt.rcParams['axes.grid'] = True
plt.figure(figsize=(6,3))
plt.gray()
from matplotlib.backends.backend_pdf import PdfPages

#Packages related to preprocessing, model selection, and modeling
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import metrics
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.preprocessing import PolynomialFeatures, KBinsDiscretizer, FunctionTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, LabelBinarizer, OrdinalEncoder
import statsmodels.formula.api as smf
import statsmodels.tsa as tsa
from sklearn.linear_model import LogisticRegression, LinearRegression, ElasticNet, Lasso, Ridge
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz
from sklearn.ensemble import BaggingClassifier, BaggingRegressor, RandomForestClassifier, RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor, AdaBoostClassifier, AdaBoostRegressor
from sklearn.svm import LinearSVC, LinearSVR, SVC, SVR
from xgboost import XGBClassifier

#Packages related to model evaluation
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix
Import dataset
Importing the dataset is very simple: just use the pandas module in Python and run the command below. The dataset itself can be downloaded from the Kaggle link above.
data=pd.read_csv("creditcard.csv")
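To confirm the import worked, you can peek at the dimensions and the first rows (for the Kaggle file, 284807 rows and 31 columns are expected):

print(data.shape) # (284807, 31): Time, V1-V28, Amount, Class
data.head()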
Data processing and understanding
Looking at these data, you may notice that the dataset is imbalanced: the vast majority of transactions are normal, and only a small percentage are fraudulent.
Let's examine the data distribution.
Total_transactions = len(data)
normal = len(data[data.Class == 0])
fraudulent = len(data[data.Class == 1])
fraud_percentage = round(fraudulent/Total_transactions*100, 2) # share of all transactions
print(cl('Total number of Transactions are {}'.format(Total_transactions), attrs = ['bold']))
print(cl('Number of Normal Transactions are {}'.format(normal), attrs = ['bold']))
print(cl('Number of fraudulent Transactions are {}'.format(fraudulent), attrs = ['bold']))
print(cl('Percentage of fraud Transactions is {}'.format(fraud_percentage), attrs = ['bold']))
We can also use the following code to check for null values.
data.info()
According to the non-null counts for each column, we have no null values. In addition, you can try applying the feature selection methods mentioned earlier to check whether the results improve.
Notice that 28 of the features are PCA-transformed, but the field "Amount" is original. Checking its minimum and maximum values, I found a very large range that could skew our results.
In this case, I standardize it as follows.
# Standardize 'Amount' to zero mean and unit variance
sc = StandardScaler()
amount = data['Amount'].values
data['Amount'] = sc.fit_transform(amount.reshape(-1, 1))
We also have one more variable, Time, which may be an external determinant, so we drop it from our modeling process.
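A one-line drop does this (a minimal sketch, using the in-place style from the scaling step above):

# Drop the Time column; axis=1 means we drop a column, not a row
data.drop(['Time'], axis=1, inplace=True)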
We can also check for duplicate rows. The dataset has 284807 rows before any duplicates are deleted.
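A quick way to count them before removal (a minimal sketch):

# Number of rows that are exact duplicates of an earlier row
print(data.duplicated().sum())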
Remove duplicates
data.drop_duplicates(inplace=True)
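Checking the row count after removal shows how many rows were dropped (the exact count depends on the preprocessing steps above):

print(len(data)) # rows remaining after deduplication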
So the dataset contained about 9000 duplicate transactions, which are now removed.
Separation of training and testing
Before splitting into training and test sets, we need to define the independent and dependent variables. The independent variables (the features) are conventionally called X, and the dependent variable (the target) is called y.
X = data.drop('Class', axis = 1).values
y = data['Class'].values
Now let's split the training and test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 1)
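Because the classes are so skewed, an optional refinement is to stratify the split so the training and test sets keep the same fraud ratio (a variant, not what the run in this article used):

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.25, random_state = 1, stratify = y
)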
Either way, we now have separate training and test sets.
Build model
We will try several different machine learning models. Defining a model is easy: one line of code defines it, and one more line fits it to our data. We can also tune these models by searching over different hyperparameter values, as sketched below.
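For example, here is a small grid search over decision tree settings (a minimal sketch; GridSearchCV is imported above, and this particular parameter grid is an illustrative assumption):

# Search a few depth/criterion combinations, scoring by F1 (more informative than accuracy here)
param_grid = {'max_depth': [3, 4, 5], 'criterion': ['gini', 'entropy']}
grid = GridSearchCV(DecisionTreeClassifier(), param_grid, scoring = 'f1', cv = 3)
grid.fit(X_train, y_train)
print(grid.best_params_)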
1) Decision tree
DT = DecisionTreeClassifier(max_depth = 4, criterion = 'entropy')
DT.fit(X_train, y_train)
dt_yhat = DT.predict(X_test)
Let's look at the accuracy of the decision tree model.
print('Accuracy score of the Decision Tree model is {}'.format(accuracy_score(y_test, dt_yhat)))

Accuracy score of the Decision Tree model is 0.999288989494457
View the F1 score of the decision tree model.
print('F1 score of the Decision Tree model is {}'.format(f1_score(y_test, dt_yhat)))

F1 score of the Decision Tree model is 0.776255707762557
To view the confusion matrix:
confusion_matrix(y_test, dt_yhat, labels = [0, 1])
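In scikit-learn's convention, the first row of the returned 2x2 array holds true negatives and false positives, and the second row false negatives and true positives. Unpacking it makes that explicit (a minimal sketch):

# Flatten the 2x2 matrix into its four cells
tn, fp, fn, tp = confusion_matrix(y_test, dt_yhat, labels = [0, 1]).ravel()
print('TN={} FP={} FN={} TP={}'.format(tn, fp, fn, tp))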
2) Random forest
rf = RandomForestClassifier(max_depth = 4)
rf.fit(X_train, y_train)
rf_yhat = rf.predict(X_test)
Let's look at the accuracy of the random forest model.
print('Accuracy score of the Random Forest model is {}'.format(accuracy_score(y_test, rf_yhat)))

Accuracy score of the Random Forest model is 0.9993615415868594
View the F1 score of the random forest model.
print('F1 score of the Random Forest model is {}'.format(f1_score(y_test, rf_yhat)))

F1 score of the Random Forest model is 0.7843137254901961
3) XGBoost
xgb = XGBClassifier(max_depth = 4)
xgb.fit(X_train, y_train)
xgb_yhat = xgb.predict(X_test)
Let's look at the accuracy of the XGBoost model.
print('Accuracy score of the XGBoost model is {}'.format(accuracy_score(y_test, xgb_yhat)))

Accuracy score of the XGBoost model is 0.9995211561901445
View the F1 score of the XGBoost model.
print('F1 score of the XGBoost model is {}'.format(f1_score(y_test, xgb_yhat)))

F1 score of the XGBoost model is 0.8421052631578947
Conclusion
We just achieved 99.95% accuracy in credit card fraud detection. This figure is not surprising, because our data is almost entirely one class, so a model that favors the majority class scores high on accuracy by default.
According to the F1 scores, XGBoost is the winner in our case. The only thing to note here is the data we used for model training: the features are a PCA-transformed version.