Decision tree visualization and model evaluation (SE, SP and ACC)

Contents

2.1 decision tree

2.1.1 sample data set

2.1.2 decision tree visualization

2.2 model classification performance prediction

2.2.1 model stability

2.2.2 SE, SP and ACC classification performance prediction

Code implementation

2.1 decision tree

2.1.1 sample data set

The data set used to build this decision tree contains 351 samples in total: 283 male and 68 female. It is split into a training set of 245 samples and a test set of 106 samples, so the test set accounts for about 30% of the data.
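A minimal sketch of the split, assuming the feature matrix X and label vector y have already been loaded (the full script appears in the Code implementation section):

from sklearn.model_selection import train_test_split

# 70/30 split; with 351 samples this yields 245 training and 106 test samples
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.3)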

2.1.2 decision tree visualization
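(The rendered tree is produced by the visualization code at the end of the Code implementation section, which exports the fitted tree to tree.dot and writes tree.png.)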

2.2 model classification performance prediction

2.2.1 model stability

A performance measure is an evaluation criterion for the generalization ability of a model and reflects the requirements of the task; different performance measures often lead to different evaluation results.

Step 1: call score() with the test-set features and labels; it returns the classification accuracy of the model on the test samples.
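A minimal sketch, assuming a classifier clf already fitted on the training split:

score_ = clf.score(Xtest, Ytest)  # mean accuracy on the held-out test set
print('Test set score:', score_)  # 0.8019 in the run reported here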

Step 2: after obtaining the score, run ten-fold cross-validation to check the stability of the model, using the cross_val_score function from sklearn. Pass in the data features and labels and set cv=10 to run ten folds; the mean returned score is also 0.8019, so the model appears stable.
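A minimal sketch of this step, reusing the fitted clf:

from sklearn.model_selection import cross_val_score

# ten-fold cross-validation over the full data set; the mean here is also ~0.8019
score = cross_val_score(clf, X, y, cv=10).mean()
print('Ten-fold cross-validation:', score)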

Item                          Score
Test set                      0.8019
Ten-fold cross-validation     0.8019
Tuned tree depth              0.8063

Table 2.2.1 Model accuracy scores

Step 3: tune the hyperparameters, mainly the maximum depth of the decision tree. To see whether the model is overfitting or underfitting, compare the performance on the training set with the cross-validated performance (see the sketch below). As Table 2.2.1 shows, the best score after tuning the depth is 0.8063, and Figure 2.2.1 shows a tendency to overfit.
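A compact sketch of the depth sweep; the full loop, which also draws Figure 2.2.1, appears in the Code implementation section:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

tr, te = [], []
for depth in range(1, 11):
    clf = DecisionTreeClassifier(random_state=25, max_depth=depth, criterion="entropy")
    clf.fit(Xtrain, Ytrain)
    tr.append(clf.score(Xtrain, Ytrain))                 # training accuracy
    te.append(cross_val_score(clf, X, y, cv=10).mean())  # cross-validated accuracy
# a widening gap between tr and te at larger depths signals overfitting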

2.2.2 SE, SP and ACC classification performance prediction

In this prediction task, the decision tree's predict() function is applied to the 106 test samples and returns a predicted label for each. From the true labels and the predictions on the test set, the counts TP, TN, FP and FN are computed, and from them the evaluation metrics SE, SP and ACC.

The definitions are as follows, where 1 denotes the positive class and 0 the negative class:

TP is the number of positive samples predicted correctly; TN is the number of negative samples predicted correctly; FP is the number of negative samples wrongly predicted as positive; FN is the number of positive samples wrongly predicted as negative.

Sensitivity (SE) = TP / (TP + FN)  # TPR

Specificity (SP) = TN / (TN + FP)  # TNR = 1 - FPR

Accuracy (ACC) = (TP + TN) / (TP + FP + TN + FN)
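As a worked check, here is one hypothetical confusion matrix consistent with the results in Table 4 below (the actual counts are not reported in the source):

TP, FN, TN, FP = 81, 5, 6, 14          # hypothetical counts over the 106 test samples
SE = TP / (TP + FN)                    # 81/86  ≈ 0.942
SP = TN / (TN + FP)                    # 6/20   = 0.300
ACC = (TP + TN) / (TP + FP + TN + FN)  # 87/106 ≈ 0.821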

The prediction performance of the model is shown in Table 4.

Sensitivity (SE)    Specificity (SP)    Accuracy (ACC)
0.942               0.300               0.821

Table 4 Performance evaluation of decision tree classification

From the table: the accuracy ACC shows that 82.1% of all test samples are predicted correctly; the sensitivity SE shows that the model predicts boys (positive samples) with an accuracy as high as 94.2%; the specificity SP shows that the model classifies girls (negative samples) with an accuracy of only 30%. The likely cause is that there are too few girls (negative samples) in the training data, so the model is poorly trained for that class and its accuracy on it is low.
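A common mitigation for such class imbalance, not part of the original experiment, is to weight classes inversely to their frequency when fitting the tree; in sklearn this is a single argument:

from sklearn.tree import DecisionTreeClassifier

# class_weight="balanced" reweights samples inversely to class frequency
clf = DecisionTreeClassifier(random_state=25, class_weight="balanced")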

Code implementation

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import math
'''
/**************************task1**************************/
2. Implement a decision tree that splits on information gain ratio.
   Classify samples by the features (favorite color, likes sports,
   likes literature) and compute the model's prediction performance,
   including SE sensitivity = TP/(TP+FN), SP specificity = TN/(TN+FP)
   and ACC = right/all = (TP+TN)/(TP+FP+TN+FN); illustrate the results.
1. Tree-construction steps (data processing, classification diagram)
2. Prediction performance
/**************************task1**************************/
'''
import pandas as pd
import numpy as np
data = pd.read_csv('data_favorite.txt', header=0, sep=' ')
# data.dropna(how='any', inplace=True)  # optionally drop rows with missing values
# Encode the non-numeric 'color' column as integer codes
data["color"] = pd.factorize(data["color"])[0].astype(np.uint16)
# print(data)

# Separate the feature columns from the label column "sex"
X = data.iloc[:, data.columns != "sex"]
y = data.iloc[:, data.columns == "sex"]

# Convert the pandas objects to numpy arrays; flatten y to a 1-D label
# vector, which is the shape sklearn estimators expect
X = np.array(X)
y = np.array(y).ravel()

# Split the data 70/30; the split is random, so the sample order is shuffled
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, y, test_size=0.3)
# print('Xtrain:',Xtrain)
# print('Xtest:', Xtest)
# print('Ytrain:',Ytrain)
# print('Ytest:', Ytest)
print(len(Xtrain))
print(len(Xtest))
# Re-index the test and training sets (only needed when the splits are
# pandas DataFrames; here they are numpy arrays, so this is commented out)
'''
for i in [Xtrain, Xtest, Ytrain, Ytest]:
    i.index = range(i.shape[0])
'''

# Train the model; its score and stability are evaluated below
clf = DecisionTreeClassifier(random_state=25)
clf = clf.fit(Xtrain, Ytrain)

# Compute the evaluation metrics from the true and predicted labels
def performance(labelArr, predictArr):
    # labelArr[i] is the true class, predictArr[i] the predicted class;
    # class labels are 1 (positive) and 0 (negative)
    TP = 0.; TN = 0.; FP = 0.; FN = 0.
    for i in range(len(labelArr)):
        if labelArr[i] == 1 and predictArr[i] == 1:
            TP += 1.
        elif labelArr[i] == 1 and predictArr[i] == 0:
            FN += 1.
        elif labelArr[i] == 0 and predictArr[i] == 1:
            FP += 1.
        elif labelArr[i] == 0 and predictArr[i] == 0:
            TN += 1.
    SE = TP / (TP + FN)  # Sensitivity = TP/P  and P = TP + FN
    SP = TN / (FP + TN) # Specificity = TN/N  and N = TN + FP
    # MCC = (TP * TN - FP * FN) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    ACC = (TP + TN) / (TP + TN + FP + FN)
    return SE, SP, ACC
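
# Sanity check added for illustration (not in the original script):
# labels [1, 1, 0, 0] vs. predictions [1, 0, 0, 1] give TP=FN=TN=FP=1,
# hence SE = SP = ACC = 0.5
assert performance([1, 1, 0, 0], [1, 0, 0, 1]) == (0.5, 0.5, 0.5)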

predict_label = clf.predict(Xtest)
print(predict_label)
# print(type(predict_label))
print('Ytest:', Ytest)
# print(type(Ytest))

print(performance(Ytest, predict_label))  # compare true test-set labels with predictions
score_ = clf.score(Xtest, Ytest)
print('Test set score:', score_)  # mean accuracy on the test set

# Ten-fold cross-validation
score = cross_val_score(clf, X, y, cv=10).mean()
print('Ten-fold cross-validation:', score)

# Tune the hyperparameters
# Start with max_depth; to check for over- or underfitting, compare
# training-set performance with cross-validated performance
tr = []
te = []
for i in range(10):
    clf = DecisionTreeClassifier(random_state=25, max_depth=i+1,
                                 criterion="entropy")
    clf = clf.fit(Xtrain, Ytrain)
    score_tr = clf.score(Xtrain,Ytrain)
    score_te = cross_val_score(clf,X,y,cv=10).mean()
    tr.append(score_tr)
    te.append(score_te)
print(max(te))  # best cross-validated accuracy (0.8063 in Table 2.2.1)
plt.plot(range(1, 11), tr, color="red", label="train")
plt.plot(range(1, 11), te, color="blue", label="test")
plt.xticks(range(1, 11))
plt.legend()
plt.title('Train-Test Accuracy')
plt.savefig("decision_tree_train_test_accuracy.png")  # Figure 2.2.1
plt.show()
# Visualize the decision tree
from sklearn import tree
tree.plot_tree(clf)
plt.show()

# Refit with the best depth found in the tuning loop above
best_depth = te.index(max(te)) + 1
clf = DecisionTreeClassifier(random_state=25, max_depth=best_depth, criterion="entropy")
clf = clf.fit(Xtrain, Ytrain)

tree.export_graphviz(clf, out_file='tree.dot')
data_feature_names = ['color', 'sports', 'literature']
# Render the tree to DOT with feature names, class names and styling
dot_data = tree.export_graphviz(clf,
                                out_file=None,
                                feature_names=data_feature_names,
                                class_names=['girl', 'boy'],
                                filled=True,
                                rounded=True,
                                special_characters=True)
import pydotplus
graph = pydotplus.graph_from_dot_data(dot_data)

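# Color the two child branches of each node with different colors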
colors = ('turquoise', 'orange')
import collections
edges = collections.defaultdict(list)

for edge in graph.get_edge_list():
    edges[edge.get_source()].append(int(edge.get_destination()))

for edge in edges:
    edges[edge].sort()
    for i in range(2):
        dest = graph.get_node(str(edges[edge][i]))[0]
        dest.set_fillcolor(colors[i])

graph.write_png('tree.png')
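
# Note (environment assumption): pydotplus renders via the Graphviz system
# binaries, so Graphviz must be installed and on the PATH for write_png()
# to succeed.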
