kaggle Project: Heart Disease Prediction Classification Based on Random Forest Models

Kaggle case: predictive classification of cardiac patients based on a random forest

Hello, my name is Peter~

A kaggle case shared today: predictive classification of cardiac patients based on the RandomForest model. The knowledge points involved in this paper mainly include:

Data preprocessing and type transformation
Establishment and Interpretation of Random Forest Model
Drawing and Interpretation of Partially Dependent Diagram PDP
Use and Interpretation of AutoML Machine Learning SHAP Library (Personal Promotion)

Reading Guide

Of all the applications of machine-learning, diagnosing any serious disease using a black box is always going to be a hard sell. If the output from a model is the particular course of treatment (potentially with side-effects), or surgery, or the absence of treatment, people are going to want to know why.

In all applications of machine learning, it is always difficult to diagnose any serious disease using black boxes. If the output of the model is a specific course of treatment (which may have side effects), surgery, or efficacy, people will wonder why.

This dataset gives a number of variables along with a target condition of having or not having heart disease. Below, the data is first used in a simple random forest model, and then the model is investigated using ML explainability tools and techniques.

The dataset provides many variables and target conditions for people with or without heart disease. Next, the data is used for a simple random forest model, which is then studied using ML interpretability tools and techniques.

Original notebook address: https://www.kaggle.com/tentotheminus9/what-causes-heart-disease-explaining-the-model And you can refer to the original literature if you are interested in it.

Import Library

This case involves several libraries in different directions:

Data Preprocessing
A variety of visual drawings; Especially the visualization of shap, the use of model explanatory modules (this library will be written later)
Random Forest Model
Model evaluation, etc.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz 
from sklearn.metrics import roc_curve, auc 
from sklearn.metrics import classification_report 
from sklearn.metrics import confusion_matrix 
from sklearn.model_selection import train_test_split 
import eli5 
from eli5.sklearn import PermutationImportance
import shap 
from pdpbox import pdp, info_plots 
np.random.seed(123) 

pd.options.mode.chained_assignment = None

Data Exploration EDA

1. Importing data

2. Missing Values

The data is perfect, there are no missing values!

Field Meaning

Here we will focus on the meaning of each field. Peter's recently exported datasets have slightly different field names from those in the original notebook. We have made a one-to-one correspondence for you. Here are the specific Chinese meanings:

Age: age
Sex sex 1=male 0=female
Type of cp chest pain; 4 Value scenarios
- 1: Typical angina pectoris
- 2: Atypical angina
- 3: Non-angina pectoris
- 4: Asymptomatic
trestbps resting blood pressure
chol Serum Cholesterol
fbs fasting blood glucose >120mg/dl:1=true; 0=false
restecg resting electrocardiogram (values 0,1,2)
Maximum heart rate achieved by thalach
Exercise-induced angina (1=yes;0=no)
ST value caused by oldpeak relative to resting exercise (ST value is related to location on electrocardiogram)
slope of ST Segment of slope Motion Peak
- 1:upsloping tilt up
- 2:flat flat
- 3:downsloping down
ca The number of major vessels(0-3)
thal A blood disorder called thalassemia, a blood disease called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
target was not ill (0=no; 1=yes)

The original English meaning of notebook;

Below is the correspondence that Peter organized. The current version is the standard in this article:

Field Conversion

Transform Encoding

Convert some fields one by one. Take the sex field for example: 0 becomes female and 1 becomes male in the data

# 1,sex

df["sex"][df["sex"] == 0] = "female"
df["sex"][df["sex"] == 1] = "male"

Field Type Conversion

# Specify data type
df["sex"] = df["sex"].astype("object")
df["cp"] = df["cp"].astype("object")
df["fbs"] = df["fbs"].astype("object")
df["restecg"] = df["restecg"].astype("object")
df["exang"] = df["exang"].astype("object")
df["slope"] = df["slope"].astype("object")
df["thal"] = df["thal"].astype("object")

Generate dummy variable

# Generate dummy variable
df = pd.get_dummies(df,drop_first=True)
df

Random Forest

Fractional data

# Generating feature and dependent variable datasets
X = df.drop("target",1)
y = df["target"]

# The slicing ratio is 8:2
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=10)

X_train

modeling

rf = RandomForestClassifier(max_depth=5)
rf.fit(X_train, y_train)

Three key attributes

Three important attributes of a random forest:

Viewing trees in the forest: estimators_
Out-of-pocket accuracy score: oob_score_, Must be oob_ The scope parameter is only available when True is selected
Importance of variables: feature_importances_

Decision Tree Visualization

Here we select the visualization of the second tree:

# View the second tree
estimator = rf.estimators_[1]

# All attributes
feature_names = [i for i in X_train.columns]
#print(feature_names)

# Specify data type
y_train_str = y_train.astype('str')
# 0-no 1-disease
y_train_str[y_train_str == '0'] = 'no disease'
y_train_str[y_train_str == '1'] = 'disease'
# Value of training data
y_train_str = y_train_str.values
y_train_str[:5]

The specific code for the drawing is:

# Drawing Display

export_graphviz(
    estimator,   # Pass in the second tree
    out_file='tree.dot',   # Export file name
    feature_names = feature_names,  # Property Name
    class_names = y_train_str,  # Final categorical data
    rounded = True, 
    proportion = True, 
    label='root',
    precision = 2, 
    filled = True
)

from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

from IPython.display import Image
Image(filename = 'tree.png')

The visualization of decision trees allows us to see the specific classification process, but it does not solve which features or attributes are important. Later, we will explore the characteristic importance of some attributes.

Model Score Validation

Regarding the confusion matrix and using specificity and sensitivity to describe the performance of the classifier:

# model prediction
y_predict = rf.predict(X_test)
y_pred_quant = rf.predict_proba(X_test)[:,1]
y_pred_bin = rf.predict(X_test)

# Confusion Matrix
confusion_matrix = confusion_matrix(y_test,y_pred_bin)
confusion_matrix

# Calculate sensitivity and specificity 
total=sum(sum(confusion_matrix))
sensitivity = confusion_matrix[0,0]/(confusion_matrix[0,0]+confusion_matrix[1,0])
specificity = confusion_matrix[1,1]/(confusion_matrix[1,1]+confusion_matrix[0,1])

Draw ROC curve

fpr, tpr, thresholds = roc_curve(y_test, y_pred_quant)

fig, ax = plt.subplots()
ax.plot(fpr, tpr)

ax.plot([0,1],[0,1],
        transform = ax.transAxes,
        ls = "--",
        c = ".3"
       )

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])

plt.rcParams['font.size'] = 12

# Title
plt.title('ROC Curve')
# Names of the two axes
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
# Gridlines
plt.grid(True)

ROC curve values in this case:

auc(fpr, tpr)
# Result
0.9076923076923078

According to the general ROC curve evaluation criteria, the performance of the case is good:

0.90 - 1.00 = excellent
0.80 - 0.90 = good
0.70 - 0.80 = fair
0.60 - 0.70 = poor
0.50 - 0.60 = fail

Supplementary Knowledge Points: Evaluation Index of Classifier

Consider a two-class situation where the categories are 1 and 0, and we treat 1 and 0 as positive and negative classes, respectively. Based on the actual and predicted results, there are four final results, as shown in the table below:

Common evaluation criteria:

1. ACC:classification accuracy, describing the classification accuracy of the classifier

The formula is ACC=(TP+TN)/(TP+FP+FN+TN)

2,BER: balanced error rate
The formula is: BER=1/2*(FPR+FN/(FN+TP))

3. TPR:true positive rate, describing the proportion of all positive cases identified to all positive cases
The formula is TPR=TP/ (TP+FN)

4. FPR:false positive rate, which describes the proportion of all negative cases in which negative cases are identified as being correct
The formula is FPR= FP / (FP + TN)

5. TNR:true negative rate, describing the proportion of all negative cases identified
The formula is: TNR= TN / (FP + TN)

6,PPV: Positive predictive value
The formula is PPV=TP/ (TP + FP)

7,NPV: Negative predictive value
Calculation formula: NPV=TN / (FN + TN)

TPR is sensitivity and TNR is specificity.

Classic graphics from Wikipedia:

Interpretability

Permutation Importance

The following is an explanation of the results of the machine learning model. The first thing to look at is the importance of each variable to the model. Permutation Importance for Key Considerations:

Partial dependence plots (PDP)

One-dimensional PDP

Partial Dependencies are used to explain the relationship between a feature and a target value y, usually by drawing a Partial Dependence Plot(PDP). That is, the value of PDP in X1 is the average value predicted by the original model after replacing the first variable in the training set with X1.

Important: View the relationship between a single feature and a target value

Field ca

base_features = df.columns.values.tolist()
base_features.remove("target")

feat_name = 'ca'  # ca-num_major_vessels original
pdp_dist = pdp.pdp_isolate(
    model=rf,  # Model
    dataset=X_test,  # Test Set
    model_features=base_features,  # The characteristic variable; Remove Target Value 
    feature=feat_name  # Specify a single field
)

pdp.pdp_plot(pdp_dist, feat_name)  # Two parameters passed in
plt.show()

As the ca field increases, the risk of disease decreases. The ca field means the number of blood vessels (num_major_vessels), which means that as the number of blood vessels increases, the prevalence decreases.

Field age

feat_name = 'age'

pdp_dist = pdp.pdp_isolate(
    model=rf, 
    dataset=X_test, 
    model_features=base_features, 
    feature=feat_name)

pdp.pdp_plot(pdp_dist, feat_name)
plt.show()

For the age field, the original description:

That's a bit odd. The higher the age, the lower the chance of heart disease? Althought the blue confidence regions show that this might not be true (the red baseline is within the blue zone).

Translator: This is a bit strange. The older you get, the less likely you are to have heart disease? Although the blue confidence interval indicates that this may not be true (the red baseline is in the blue area)

Field oldpeak

feat_name = 'oldpeak'

pdp_dist = pdp.pdp_isolate(
    model=rf, 
    dataset=X_test, 
    model_features=base_features, 
    feature=feat_name)

pdp.pdp_plot(pdp_dist, feat_name)
plt.show()

The oldpeak field also showed that the higher the value, the lower the risk.

This variable is called the "ST depression induced by relative rest exercise". Under normal conditions, the higher the value, the higher the risk. However, the above image shows the opposite result.

The authors conclude that the depression amount may also be related to the slope type. The original abstract is as follows, so the author draws a 2D-PDP graph

Perhaps it's not just the depression amount that's important, but the interaction with the slope type? Let's check with a 2D PDP

2D-PDP Diagram

Looking at slope_upsloping, slope_ The relationship between flat and oldpeak:

inter1  =  pdp.pdp_interact(
    model=rf,  # Model
    dataset=X_test,  # Feature dataset
    model_features=base_features,  # Features
    features=['slope_upsloping', 'oldpeak'])

pdp.pdp_interact_plot(
    pdp_interact_out=inter1, 
    feature_names=['slope_upsloping', 'oldpeak'], 
    plot_type='contour')
plt.show()

## ------------

inter1  =  pdp.pdp_interact(
    model=rf, 
    dataset=X_test,
    model_features=base_features, 
    features=['slope_flat', 'oldpeak']
)

pdp.pdp_interact_plot(
    pdp_interact_out=inter1, 
    feature_names=['slope_flat', 'oldpeak'], 
    plot_type='contour')
plt.show()

From the two graphs, we can see that when the oldpeak value is low, the prevalence is higher (yellow), which is a strange phenomenon. The author then made the following visualization explorations of SHAP: analysis of individual variables.

SHAP Visualization

For an introduction to SHAP, refer to the article: https://zhuanlan.zhihu.com/p/83412330 and https://blog.csdn.net/sinat_26917383/article/details/115400327

SHAP is a model interpretation package developed by Python that can interpret the output of any machine learning model. Some of the functions used by SHAP are as follows:

Explainer

An explainer needs to be created prior to model interpretation in SHAP, which supports many types of explainers, such as deep, gradient, kernel, linear, tree, sampling. In this case, let's take Tree as an example:

# Incoming Random Forest Model rf
explainer = shap.TreeExplainer(rf)  
# Data passed in eigenvalues in explainer to calculate shap values
shap_values = explainer.shap_values(X_test)  
shap_values

Feature Importance

Take the mean of the absolute value of each feature's SHAP value as the importance of that feature and get a standard bar graph (multi-class produces stacked bar graphs:

CONCLUSION: The highest SHAP value in the ca field can be visually observed.

summary_plot

The summary plot plots the SHAP values of each feature for each sample, which provides a better understanding of the overall pattern and allows for the discovery of predicted outliers.

Each row represents a feature, and the horizontal coordinate is the SHAP value
A point represents a sample, and a color represents the value of the eigenvalue (high red, low blue)

individual difference

To examine the impact of different characteristic attributes of a single patient on its outcome, the original text describes:

Next, let's pick out individual patients and see how the different variables are affecting their outcomes

def heart_disease_risk_factors(model, patient):

    explainer = shap.TreeExplainer(model)  # Set up explainer
    shap_values = explainer.shap_values(patient)  # Calculate shape values
    shap.initjs()
    return shap.force_plot(
      explainer.expected_value[1],
      shap_values[1], 
      patient)

Results from two patients show that:

P1: The prediction accuracy is as high as 29% (baseline is 57%), more factors are concentrated in ca, thal_ Fixed_ Blue parts such as defect, oldpeak, etc.
P3: Prediction accuracy is as high as 82%, more influential factors are sel_male=0, thalach=143, etc.

By comparing different patients, we can see the predictive rate and major influencing factors among different patients.

dependence_plot

To understand how a single feature affects the output of a model, we can compare the SHAP value of the feature with that of all samples in the dataset:

Visual Exploration of Diversity

The following figure explores predictions and influencing factors for multiple patients. With the interaction of jupyter notebook, we were able to observe the impact of different feature attributes on the first 50 patients:

shap_values = explainer.shap_values(X_train.iloc[:50])
shap.force_plot(explainer.expected_value[1], 
                shap_values[1], 
                X_test.iloc[:50])

Keywords: Machine Learning kaggle

Added by bob2588 on Mon, 07 Feb 2022 20:19:22 +0200

Programming VIP