Machine learning DAY04
Cross validation
Because a single random split of the data set is uncertain, the randomly selected samples may happen to fall into a special subset, which makes the reliability of the trained model's predictions questionable. It is therefore necessary to perform cross validation: divide all samples in the sample space into n parts, train the model on different combinations of training folds, and output a metric score on a different test fold each time. sklearn provides cross validation APIs:
```python
import sklearn.model_selection as ms
ms.cross_val_score(model, input_set, output_set,
                   cv=fold_count, scoring=metric_name)  # -> array of metric values
```
Case: using cross validation, output the accuracy of the classifier:
```python
import sklearn.naive_bayes as nb

# Divide training set and test set
train_x, test_x, train_y, test_y = \
    ms.train_test_split(x, y, test_size=0.25, random_state=7)
# Naive Bayes classifier
model = nb.GaussianNB()
# Cross validation accuracy
ac = ms.cross_val_score(model, train_x, train_y,
                        cv=5, scoring='accuracy')
print(ac.mean())
# Train the model with the training set
model.fit(train_x, train_y)
```
Cross validation metrics
- accuracy: number of correctly classified samples / total number of samples
- precision_weighted: for each category, the number of samples correctly predicted as that category divided by the total number of samples predicted as that category
- recall_weighted: for each category, the number of samples correctly predicted as that category divided by the actual number of samples belonging to that category
- f1_weighted: 2 × precision × recall / (precision + recall)
During cross validation, each fold computes the precision, recall, or f1 score of every category, takes the support-weighted average of the per-category values as that fold's evaluation score, and finally the scores of all folds are returned to the caller as an array.
```python
# Cross validation accuracy
ac = ms.cross_val_score(model, train_x, train_y,
                        cv=5, scoring='accuracy')
print(ac.mean())
# Precision
pw = ms.cross_val_score(model, train_x, train_y,
                        cv=5, scoring='precision_weighted')
print(pw.mean())
# Recall
rw = ms.cross_val_score(model, train_x, train_y,
                        cv=5, scoring='recall_weighted')
print(rw.mean())
# f1 score
fw = ms.cross_val_score(model, train_x, train_y,
                        cv=5, scoring='f1_weighted')
print(fw.mean())
```
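To make the 'weighted' averaging concrete, here is a minimal sketch with made-up labels showing that the weighted score is the per-category score averaged with each category's support (its share of the true samples) as the weight:

```python
import numpy as np
import sklearn.metrics as sm

# Made-up labels, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])

# Per-category precision, then a support-weighted average
p_per_class = sm.precision_score(y_true, y_pred, average=None)
support = np.bincount(y_true)
print((p_per_class * support / support.sum()).sum())
# Matches sklearn's built-in weighted average
print(sm.precision_score(y_true, y_pred, average='weighted'))
```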
Confusion matrix
Each row and each column of the matrix corresponds to a category in the sample output: rows represent the actual category and columns represent the predicted category.
 | Category A | Category B | Category C |
---|---|---|---|
Category A | 5 | 0 | 0 |
Category B | 0 | 6 | 0 |
Category C | 0 | 0 | 7 |
The matrix above is an ideal confusion matrix. A less ideal confusion matrix looks like this:
 | Category A | Category B | Category C |
---|---|---|---|
Category A | 3 | 1 | 1 |
Category B | 0 | 4 | 2 |
Category C | 0 | 0 | 7 |
Precision (per category) = the value on the main diagonal / the sum of the column containing that value (all samples predicted as that category)

Recall (per category) = the value on the main diagonal / the sum of the row containing that value (all actual samples of that category)
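A minimal sketch applying these two formulas to the second matrix above:

```python
import numpy as np

# Rows = actual category, columns = predicted category
cm = np.array([[3, 1, 1],
               [0, 4, 2],
               [0, 0, 7]])
diag = np.diag(cm)
print(diag / cm.sum(axis=0))  # precision per category: [1.    0.8   0.7  ]
print(diag / cm.sum(axis=1))  # recall per category:    [0.6   0.667 1.   ]
```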
API for obtaining the confusion matrix of model classification results:
```python
import sklearn.metrics as sm
sm.confusion_matrix(actual_output, predicted_output)  # -> confusion matrix
```
Case: output the confusion matrix of classification results.
```python
# Output the confusion matrix and draw it as an image
cm = sm.confusion_matrix(test_y, pred_test_y)
print(cm)
mp.figure('Confusion Matrix', facecolor='lightgray')
mp.title('Confusion Matrix', fontsize=20)
mp.xlabel('Predicted Class', fontsize=14)
mp.ylabel('True Class', fontsize=14)
mp.xticks(np.unique(pred_test_y))
mp.yticks(np.unique(test_y))
mp.tick_params(labelsize=10)
mp.imshow(cm, interpolation='nearest', cmap='jet')
mp.show()
```
Classification Report
sklearn.metrics also provides a classification report API. Beyond what the confusion matrix shows, it reports the precision, recall, and f1 score of every category, making it easy to analyze which categories the model handles poorly.
```python
# Get the classification report
cr = sm.classification_report(actual_output, predicted_output)
```
Case: output classification report:
```python
# Get the classification report
cr = sm.classification_report(test_y, pred_test_y)
print(cr)
```
Decision tree classification
A decision tree classifier routes a sample to the leaf node matching its features and then classifies it by voting among the training samples that reached that leaf. The sample file records common car features together with each car's grade; these data are used to train a model based on the decision tree classification algorithm to predict a car's grade. The feature columns are listed in the table below, followed by a minimal single-tree sketch.
Car price | Maintenance cost | Number of doors | Passenger capacity | Trunk size | Safety | Car grade |
---|---|---|---|---|---|---|
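A minimal sketch of a single decision tree on toy data (the encoded features and labels here are made up, not taken from car.txt), just to show the fit-then-route-to-a-leaf idea:

```python
import numpy as np
import sklearn.tree as st

# Made-up encoded features and labels, for illustration only
x = np.array([[0, 1], [1, 1], [0, 0], [1, 0]])
y = np.array([0, 1, 0, 1])
model = st.DecisionTreeClassifier(max_depth=2)
model.fit(x, y)
print(model.predict([[1, 1]]))  # -> [1]
```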
Case: train a model based on the decision tree classification algorithm to predict car grades.
- Read the text data, label-encode each column, train a model based on the random forest classifier, and evaluate it with cross validation.
```python
import numpy as np
import sklearn.preprocessing as sp
import sklearn.ensemble as se
import sklearn.model_selection as ms

data = np.loadtxt('../data/car.txt', delimiter=',', dtype='U10')
data = data.T
encoders = []
train_x, train_y = [], []
for row in range(len(data)):
    encoder = sp.LabelEncoder()
    if row < len(data) - 1:
        train_x.append(encoder.fit_transform(data[row]))
    else:
        train_y = encoder.fit_transform(data[row])
    encoders.append(encoder)
train_x = np.array(train_x).T
# Random forest classifier
model = se.RandomForestClassifier(max_depth=6, n_estimators=200,
                                  random_state=7)
print(ms.cross_val_score(model, train_x, train_y,
                         cv=4, scoring='f1_weighted').mean())
model.fit(train_x, train_y)
```
- Build a custom test set, evaluate it with the trained model, and output the results.
```python
data = [
    ['high', 'med', '5more', '4', 'big', 'low', 'unacc'],
    ['high', 'high', '4', '4', 'med', 'med', 'acc'],
    ['low', 'low', '2', '4', 'small', 'high', 'good'],
    ['low', 'med', '3', '4', 'med', 'high', 'vgood']]
data = np.array(data).T
test_x, test_y = [], []
for row in range(len(data)):
    encoder = encoders[row]
    if row < len(data) - 1:
        test_x.append(encoder.transform(data[row]))
    else:
        test_y = encoder.transform(data[row])
test_x = np.array(test_x).T
pred_test_y = model.predict(test_x)
print((pred_test_y == test_y).sum() / pred_test_y.size)
print(encoders[-1].inverse_transform(test_y))
print(encoders[-1].inverse_transform(pred_test_y))
```
Validation curve
Validation curve: model performance = f(hyperparameter)
API required for validation curve:
```python
train_scores, test_scores = ms.validation_curve(
    model,                               # model
    input_set, output_set,
    param_name='n_estimators',           # hyperparameter name
    param_range=np.arange(50, 550, 50),  # hyperparameter value sequence
    cv=5)                                # fold count
```
Structure of train_scores:
Hyperparameter value | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 |
---|---|---|---|---|---|
50 | 0.91823444 | 0.91968162 | 0.92619392 | 0.91244573 | 0.91040462 |
100 | 0.91968162 | 0.91823444 | 0.91244573 | 0.92619392 | 0.91244573 |
... | ... | ... | ... | ... | ... |
test_scores has the same structure as train_scores.
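A minimal sketch of the resulting shapes, assuming the validation_curve call above has been run on real data (one row per hyperparameter value, one column per fold):

```python
# With param_range = np.arange(50, 550, 50) (10 values) and cv=5:
print(train_scores.shape)  # (10, 5)
print(test_scores.shape)   # (10, 5)
```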
Case: in the car rating case, use the validation curve to select better hyperparameter values.
```python
# Get the validation curve for n_estimators
model = se.RandomForestClassifier(max_depth=6, random_state=7)
n_estimators = np.arange(50, 550, 50)
train_scores, test_scores = ms.validation_curve(
    model, train_x, train_y,
    param_name='n_estimators', param_range=n_estimators, cv=5)
print(train_scores, test_scores)
train_means1 = train_scores.mean(axis=1)
for param, score in zip(n_estimators, train_means1):
    print(param, '->', score)
mp.figure('n_estimators', facecolor='lightgray')
mp.title('n_estimators', fontsize=20)
mp.xlabel('n_estimators', fontsize=14)
mp.ylabel('F1 Score', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.plot(n_estimators, train_means1, 'o-',
        c='dodgerblue', label='Training')
mp.legend()
mp.show()
```
```python
# Get the validation curve for max_depth
model = se.RandomForestClassifier(n_estimators=200, random_state=7)
max_depth = np.arange(1, 11)
train_scores, test_scores = ms.validation_curve(
    model, train_x, train_y,
    param_name='max_depth', param_range=max_depth, cv=5)
train_means2 = train_scores.mean(axis=1)
for param, score in zip(max_depth, train_means2):
    print(param, '->', score)
mp.figure('max_depth', facecolor='lightgray')
mp.title('max_depth', fontsize=20)
mp.xlabel('max_depth', fontsize=14)
mp.ylabel('F1 Score', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.plot(max_depth, train_means2, 'o-',
        c='dodgerblue', label='Training')
mp.legend()
mp.show()
```
Learning curve
Learning curve: model performance = f (training set size)
API required for learning curve:
```python
_, train_scores, test_scores = ms.learning_curve(
    model,                        # model
    input_set, output_set,
    train_sizes=[0.9, 0.8, 0.7],  # training set size sequence
    cv=5)                         # fold count
```
train_scores has the same per-fold structure as for the validation curve, with one row per training set size.
Case: in the car rating case, use the learning curve to select a good training set size.
```python
# Get the learning curve
model = se.RandomForestClassifier(max_depth=9, n_estimators=200,
                                  random_state=7)
train_sizes = np.linspace(0.1, 1, 10)
_, train_scores, test_scores = ms.learning_curve(
    model, x, y, train_sizes=train_sizes, cv=5)
test_means = test_scores.mean(axis=1)
for size, score in zip(train_sizes, test_means):
    print(size, '->', score)
mp.figure('Learning Curve', facecolor='lightgray')
mp.title('Learning Curve', fontsize=20)
mp.xlabel('train_size', fontsize=14)
mp.ylabel('F1 Score', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.plot(train_sizes, test_means, 'o-',
        c='dodgerblue', label='Cross Validation')
mp.legend()
mp.show()
```
Case: predict workers' wage income.
Read adult.txt and, according to the form of each feature column, select an appropriate encoder to train a model that predicts workers' wage level.
- Define a custom label encoder. For columns of digit strings, this encoder preserves the numeric meaning of the feature values.
```python
class DigitEncoder():
    def fit_transform(self, y):
        return y.astype(int)

    def transform(self, y):
        return y.astype(int)

    def inverse_transform(self, y):
        return y.astype(str)
```
- Read the file, sort out the sample data, and label-encode each column in the sample matrix.
```python
num_less, num_more, max_each = 0, 0, 7500
data = []
txt = np.loadtxt('../data/adult.txt', dtype='U20', delimiter=', ')
for row in txt:
    if ' ?' in row:
        continue
    elif str(row[-1]) == '<=50K':
        num_less += 1
        data.append(row)
    elif str(row[-1]) == '>50K':
        num_more += 1
        data.append(row)
data = np.array(data).T
encoders, x = [], []
for row in range(len(data)):
    if str(data[row, 0]).isdigit():
        encoder = DigitEncoder()
    else:
        encoder = sp.LabelEncoder()
    if row < len(data) - 1:
        x.append(encoder.fit_transform(data[row]))
    else:
        y = encoder.fit_transform(data[row])
    encoders.append(encoder)
```
- Divide the training set and test set, build a learning model based on the naive Bayes classification algorithm, output the cross validation score, and validate against the test set.
```python
x = np.array(x).T
train_x, test_x, train_y, test_y = ms.train_test_split(
    x, y, test_size=0.25, random_state=5)
model = nb.GaussianNB()
print(ms.cross_val_score(model, x, y,
                         cv=10, scoring='f1_weighted').mean())
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
print((pred_test_y == test_y).sum() / pred_test_y.size)
```
- Simulate the sample data and predict the income level.
```python
data = [['39', 'State-gov', '77516', 'Bachelors', '13',
         'Never-married', 'Adm-clerical', 'Not-in-family',
         'White', 'Male', '2174', '0', '40', 'United-States']]
data = np.array(data).T
x = []
for row in range(len(data)):
    encoder = encoders[row]
    x.append(encoder.transform(data[row]))
x = np.array(x).T
pred_y = model.predict(x)
print(encoders[-1].inverse_transform(pred_y))
```
Support vector machine (SVM)
Principle of support vector machine
- Seeking the optimal classification boundary (see the formulation after this list):
  - Correct: most samples can be classified correctly.
  - Generalized: the margin between the boundary and the support vectors is maximized.
  - Fair: the boundary is equidistant from the support vectors of each class.
  - Simple: the boundary is linear — a line or a split hyperplane.
- Dimension-raising transformation based on kernel functions: a feature transformation called a kernel function adds new features, so that a problem that is linearly inseparable in the low-dimensional space becomes linearly separable in a higher-dimensional space.
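For reference, the standard hard-margin formulation behind the "optimal classification boundary" bullet (textbook SVM theory, not spelled out in the original notes): the boundary is the hyperplane $w^T x + b = 0$, and training maximizes the margin between the two classes:

$$
\max_{w,b}\ \frac{2}{\lVert w \rVert} \quad \text{subject to} \quad y_i\,(w^T x_i + b) \ge 1 \ \text{ for all } i
$$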
Linear kernel function: 'linear'. It does not raise the dimension through a kernel function; it only seeks a linear classification boundary in the original dimensional space.
API related to SVM classification based on linear kernel function:
```python
import sklearn.svm as svm
model = svm.SVC(kernel='linear')
model.fit(train_x, train_y)
```
Case: classify multiple2.txt with a linear-kernel SVM.
```python
import numpy as np
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm
import matplotlib.pyplot as mp

data = np.loadtxt('../data/multiple2.txt', delimiter=',', dtype='f8')
x = data[:, :-1]
y = data[:, -1]
train_x, test_x, train_y, test_y = \
    ms.train_test_split(x, y, test_size=0.25, random_state=5)
# SVM classifier based on the linear kernel
model = svm.SVC(kernel='linear')
model.fit(train_x, train_y)
# Predict over a grid to draw the classification regions
n = 500
l, r = x[:, 0].min() - 1, x[:, 0].max() + 1
b, t = x[:, 1].min() - 1, x[:, 1].max() + 1
grid_x = np.meshgrid(np.linspace(l, r, n), np.linspace(b, t, n))
flat_x = np.column_stack((grid_x[0].ravel(), grid_x[1].ravel()))
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
pred_test_y = model.predict(test_x)
cr = sm.classification_report(test_y, pred_test_y)
print(cr)
mp.figure('SVM Linear Classification', facecolor='lightgray')
mp.title('SVM Linear Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(test_x[:, 0], test_x[:, 1], c=test_y, cmap='brg', s=80)
mp.show()
```
Polynomial kernel function: 'poly'. It raises the original sample features to higher powers (and their cross terms) through a polynomial function, for example:
$$
y = x_1 + x_2 \\
y = x_1^2 + 2x_1x_2 + x_2^2 \\
y = x_1^3 + 3x_1^2x_2 + 3x_1x_2^2 + x_2^3
$$
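A minimal sketch of the degree-2 expansion using sklearn.preprocessing.PolynomialFeatures, only to make the added terms visible; SVC's 'poly' kernel computes these implicitly via the kernel trick and never materializes the columns:

```python
import numpy as np
import sklearn.preprocessing as sp

x = np.array([[2.0, 3.0]])
poly = sp.PolynomialFeatures(degree=2, include_bias=False)
# Columns: x1, x2, x1^2, x1*x2, x2^2
print(poly.fit_transform(x))  # [[2. 3. 4. 6. 9.]]
```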
Case: train on multiple2.txt with the polynomial kernel.

```python
# SVM classifier based on the polynomial kernel
model = svm.SVC(kernel='poly', degree=3)
model.fit(train_x, train_y)
```
Radial basis function kernel: 'rbf'. It raises the dimension by transforming the original sample features through a Gaussian (normal) distribution function.
Case: train on multiple2.txt with the radial basis function kernel.
```python
# SVM classifier based on the radial basis function kernel
# C: penalty coefficient of the error term (larger C means less regularization)
# gamma: kernel coefficient controlling the width of the Gaussian
model = svm.SVC(kernel='rbf', C=600, gamma=0.01)
model.fit(train_x, train_y)
```
Sample category equalization
Class-weight balancing gives higher weights to samples of under-represented classes and lower weights to samples of over-represented classes, so that every class contributes comparably to the classification model, improving model performance. (Up-sampling or down-sampling the data are alternatives.)
API related to sample category equalization:
```python
model = svm.SVC(kernel='linear', class_weight='balanced')
model.fit(train_x, train_y)
```
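For intuition, sklearn documents class_weight='balanced' as assigning each class the weight n_samples / (n_classes × class_count); a minimal sketch with made-up labels:

```python
import numpy as np

y = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # made-up imbalanced labels
n_samples, n_classes = len(y), len(np.unique(y))
weights = n_samples / (n_classes * np.bincount(y))
print(weights)  # [0.667 2.0] -> the rare class gets the larger weight
```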
Case: modify the linear-kernel SVM case to read imbalance.txt and train the model with sample category equalization.
```python
...
data = np.loadtxt('../data/imbalance.txt', delimiter=',', dtype='f8')
x = data[:, :-1]
y = data[:, -1]
train_x, test_x, train_y, test_y = \
    ms.train_test_split(x, y, test_size=0.25, random_state=5)
# Linear-kernel SVM classifier with balanced class weights
model = svm.SVC(kernel='linear', class_weight='balanced')
model.fit(train_x, train_y)
...
```
Confidence probability
The credibility of a predicted category can be quantified by the sample's distance from the class boundary: the closer a sample lies to the boundary, the lower the confidence probability; the farther away, the higher the confidence probability.
API for obtaining the confidence probability of each sample:
```python
# Pass the hyperparameter probability=True when creating the model
model = svm.SVC(kernel='rbf', C=600, gamma=0.01, probability=True)
pred_y = model.predict(input_samples)
# Calling model.predict_proba(samples) returns the confidence
# probability matrix of the samples
probs = model.predict_proba(input_samples)
```
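To see the boundary-distance intuition directly, one can compare decision_function (the signed distance to the boundary for a binary SVC) with predict_proba; a minimal sketch, assuming the trained rbf model and test_x from the case above:

```python
# Samples with small |decision_function| sit near the boundary and
# typically get predict_proba values near 0.5; distant samples
# approach 0 or 1.
d = model.decision_function(test_x[:3])
p = model.predict_proba(test_x[:3])
print(d)
print(p)
```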
The format of the confidence probability matrix is as follows:
 | Category 1 | Category 2 |
---|---|---|
Sample 1 | 0.8 | 0.2 |
Sample 2 | 0.9 | 0.1 |
Sample 3 | 0.5 | 0.5 |
Case: modify the radial basis function kernel SVM case to add test samples, output the confidence probability of each test sample, and annotate the samples on the plot.
```python
# Arrange the test samples
prob_x = np.array([
    [2, 1.5],
    [8, 9],
    [4.8, 5.2],
    [4, 4],
    [2.5, 7],
    [7.6, 2],
    [5.4, 5.9]])
pred_prob_y = model.predict(prob_x)
probs = model.predict_proba(prob_x)
print(probs)
# Draw each test sample and annotate it with its confidence probabilities
mp.scatter(prob_x[:, 0], prob_x[:, 1], c=pred_prob_y,
           cmap='jet_r', s=80, marker='D')
for i in range(len(probs)):
    mp.annotate(
        '{}% {}%'.format(
            round(probs[i, 0] * 100, 2),
            round(probs[i, 1] * 100, 2)),
        xy=(prob_x[i, 0], prob_x[i, 1]),
        xytext=(12, -12),
        textcoords='offset points',
        horizontalalignment='left',
        verticalalignment='top',
        fontsize=9,
        bbox={'boxstyle': 'round,pad=0.6',
              'fc': 'orange', 'alpha': 0.8})
```