Machine learning DAY04
Cross validation
Because a single random split of the data set is uncertain, the randomly selected samples may happen to fall into a special subset, which makes the reliability of the trained model's predictions questionable. It is therefore necessary to perform cross validation: divide all samples in the sample space into n parts, train the model on different combinations of training folds, and output a metric score on a different test fold each time. sklearn provides cross validation APIs:
```python
import sklearn.model_selection as ms
ms.cross_val_score(model, input_set, output_set,
                   cv=fold_count, scoring=metric_name)  # -> array of metric values
```
Case: using cross validation, output the accuracy of the classifier:
```python
import sklearn.naive_bayes as nb

# Divide training set and test set
train_x, test_x, train_y, test_y = \
    ms.train_test_split(x, y, test_size=0.25, random_state=7)
# Naive Bayes classifier
model = nb.GaussianNB()
# Cross validation accuracy
ac = ms.cross_val_score(model, train_x, train_y,
                        cv=5, scoring='accuracy')
print(ac.mean())
# Train the model with the training set
model.fit(train_x, train_y)
```
Cross validation metrics
- accuracy: number of correctly classified samples / total number of samples
- precision_weighted: for each category, the number of samples correctly predicted as that category divided by the total number of samples predicted as that category
- recall_weighted: for each category, the number of samples correctly predicted as that category divided by the actual number of samples belonging to that category
- f1_weighted: 2 × precision × recall / (precision + recall)
During cross validation, each fold computes the precision, recall, or f1 score of every category, takes the support-weighted average of the per-category values as that fold's evaluation score, and finally the scores of all folds are returned to the caller as an array.
```python
# Cross validation accuracy
ac = ms.cross_val_score(model, train_x, train_y,
                        cv=5, scoring='accuracy')
print(ac.mean())
# Precision
pw = ms.cross_val_score(model, train_x, train_y,
                        cv=5, scoring='precision_weighted')
print(pw.mean())
# Recall
rw = ms.cross_val_score(model, train_x, train_y,
                        cv=5, scoring='recall_weighted')
print(rw.mean())
# f1 score
fw = ms.cross_val_score(model, train_x, train_y,
                        cv=5, scoring='f1_weighted')
print(fw.mean())
```
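To make the 'weighted' averaging concrete, here is a minimal sketch with made-up labels showing that the weighted score is the per-category score averaged with each category's support (its share of the true samples) as the weight:

```python
import numpy as np
import sklearn.metrics as sm

# Made-up labels, for illustration only
y_true = np.array([0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2])

# Per-category precision, then a support-weighted average
p_per_class = sm.precision_score(y_true, y_pred, average=None)
support = np.bincount(y_true)
print((p_per_class * support / support.sum()).sum())
# Matches sklearn's built-in weighted average
print(sm.precision_score(y_true, y_pred, average='weighted'))
```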
Confusion matrix
Each row and each column of the matrix corresponds to a category in the sample output: rows represent the actual category and columns represent the predicted category.
 | Category A | Category B | Category C |
---|---|---|---|
Category A | 5 | 0 | 0 |
Category B | 0 | 6 | 0 |
Category C | 0 | 0 | 7 |
The matrix above is an ideal confusion matrix. A less ideal confusion matrix looks like this:
 | Category A | Category B | Category C |
---|---|---|---|
Category A | 3 | 1 | 1 |
Category B | 0 | 4 | 2 |
Category C | 0 | 0 | 7 |
Precision (per category) = the value on the main diagonal / the sum of the column containing that value (all samples predicted as that category)

Recall (per category) = the value on the main diagonal / the sum of the row containing that value (all actual samples of that category)
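A minimal sketch applying these two formulas to the second matrix above:

```python
import numpy as np

# Rows = actual category, columns = predicted category
cm = np.array([[3, 1, 1],
               [0, 4, 2],
               [0, 0, 7]])
diag = np.diag(cm)
print(diag / cm.sum(axis=0))  # precision per category: [1.    0.8   0.7  ]
print(diag / cm.sum(axis=1))  # recall per category:    [0.6   0.667 1.   ]
```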
API for obtaining the confusion matrix of model classification results:
```python
import sklearn.metrics as sm
sm.confusion_matrix(actual_output, predicted_output)  # -> confusion matrix
```
Case: output the confusion matrix of classification results.
```python
# Output the confusion matrix and draw it as an image
cm = sm.confusion_matrix(test_y, pred_test_y)
print(cm)
mp.figure('Confusion Matrix', facecolor='lightgray')
mp.title('Confusion Matrix', fontsize=20)
mp.xlabel('Predicted Class', fontsize=14)
mp.ylabel('True Class', fontsize=14)
mp.xticks(np.unique(pred_test_y))
mp.yticks(np.unique(test_y))
mp.tick_params(labelsize=10)
mp.imshow(cm, interpolation='nearest', cmap='jet')
mp.show()
```
Classification Report
sklearn.metrics also provides a classification report API. Beyond what the confusion matrix shows, it reports the precision, recall, and f1 score of every category, making it easy to analyze which categories the model handles poorly.
```python
# Get the classification report
cr = sm.classification_report(actual_output, predicted_output)
```
Case: output classification report:
```python
# Get the classification report
cr = sm.classification_report(test_y, pred_test_y)
print(cr)
```
Decision tree classification
A decision tree classifier routes a sample to the leaf node matching its features and then classifies it by voting among the training samples that reached that leaf. The sample file records common car features together with each car's grade; these data are used to train a model based on the decision tree classification algorithm to predict a car's grade. The feature columns are listed in the table below, followed by a minimal single-tree sketch.
Car price | Maintenance cost | Number of doors | Passenger capacity | Trunk size | Safety | Car grade |
---|---|---|---|---|---|---|
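A minimal sketch of a single decision tree on toy data (the encoded features and labels here are made up, not taken from car.txt), just to show the fit-then-route-to-a-leaf idea:

```python
import numpy as np
import sklearn.tree as st

# Made-up encoded features and labels, for illustration only
x = np.array([[0, 1], [1, 1], [0, 0], [1, 0]])
y = np.array([0, 1, 0, 1])
model = st.DecisionTreeClassifier(max_depth=2)
model.fit(x, y)
print(model.predict([[1, 1]]))  # -> [1]
```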
Case: train a model based on the decision tree classification algorithm to predict car grades.
- Read the text data, label-encode each column, train a model based on the random forest classifier, and evaluate it with cross validation.
```python
import numpy as np
import sklearn.preprocessing as sp
import sklearn.ensemble as se
import sklearn.model_selection as ms

data = np.loadtxt('../data/car.txt', delimiter=',', dtype='U10')
data = data.T
encoders = []
train_x, train_y = [], []
for row in range(len(data)):
    encoder = sp.LabelEncoder()
    if row < len(data) - 1:
        train_x.append(encoder.fit_transform(data[row]))
    else:
        train_y = encoder.fit_transform(data[row])
    encoders.append(encoder)
train_x = np.array(train_x).T
# Random forest classifier
model = se.RandomForestClassifier(max_depth=6, n_estimators=200,
                                  random_state=7)
print(ms.cross_val_score(model, train_x, train_y,
                         cv=4, scoring='f1_weighted').mean())
model.fit(train_x, train_y)
```
- Build a custom test set, evaluate it with the trained model, and output the results.
```python
data = [
    ['high', 'med', '5more', '4', 'big', 'low', 'unacc'],
    ['high', 'high', '4', '4', 'med', 'med', 'acc'],
    ['low', 'low', '2', '4', 'small', 'high', 'good'],
    ['low', 'med', '3', '4', 'med', 'high', 'vgood']]
data = np.array(data).T
test_x, test_y = [], []
for row in range(len(data)):
    encoder = encoders[row]
    if row < len(data) - 1:
        test_x.append(encoder.transform(data[row]))
    else:
        test_y = encoder.transform(data[row])
test_x = np.array(test_x).T
pred_test_y = model.predict(test_x)
print((pred_test_y == test_y).sum() / pred_test_y.size)
print(encoders[-1].inverse_transform(test_y))
print(encoders[-1].inverse_transform(pred_test_y))
```
Validation curve
Validation curve: model performance = f(hyperparameter)
API required for validation curve:
```python
train_scores, test_scores = ms.validation_curve(
    model,                               # model
    input_set, output_set,
    param_name='n_estimators',           # hyperparameter name
    param_range=np.arange(50, 550, 50),  # hyperparameter value sequence
    cv=5)                                # fold count
```
Structure of train_scores:
Hyperparameter value | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 |
---|---|---|---|---|---|
50 | 0.91823444 | 0.91968162 | 0.92619392 | 0.91244573 | 0.91040462 |
100 | 0.91968162 | 0.91823444 | 0.91244573 | 0.92619392 | 0.91244573 |
... | ... | ... | ... | ... | ... |
test_scores has the same structure as train_scores.
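A minimal sketch of the resulting shapes, assuming the validation_curve call above has been run on real data (one row per hyperparameter value, one column per fold):

```python
# With param_range = np.arange(50, 550, 50) (10 values) and cv=5:
print(train_scores.shape)  # (10, 5)
print(test_scores.shape)   # (10, 5)
```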
Case: in the car rating case, use the validation curve to select better hyperparameter values.
```python
# Get the validation curve for n_estimators
model = se.RandomForestClassifier(max_depth=6, random_state=7)
n_estimators = np.arange(50, 550, 50)
train_scores, test_scores = ms.validation_curve(
    model, train_x, train_y,
    param_name='n_estimators', param_range=n_estimators, cv=5)
print(train_scores, test_scores)
train_means1 = train_scores.mean(axis=1)
for param, score in zip(n_estimators, train_means1):
    print(param, '->', score)
mp.figure('n_estimators', facecolor='lightgray')
mp.title('n_estimators', fontsize=20)
mp.xlabel('n_estimators', fontsize=14)
mp.ylabel('F1 Score', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.plot(n_estimators, train_means1, 'o-',
        c='dodgerblue', label='Training')
mp.legend()
mp.show()
```
```python
# Get the validation curve for max_depth
model = se.RandomForestClassifier(n_estimators=200, random_state=7)
max_depth = np.arange(1, 11)
train_scores, test_scores = ms.validation_curve(
    model, train_x, train_y,
    param_name='max_depth', param_range=max_depth, cv=5)
train_means2 = train_scores.mean(axis=1)
for param, score in zip(max_depth, train_means2):
    print(param, '->', score)
mp.figure('max_depth', facecolor='lightgray')
mp.title('max_depth', fontsize=20)
mp.xlabel('max_depth', fontsize=14)
mp.ylabel('F1 Score', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.plot(max_depth, train_means2, 'o-',
        c='dodgerblue', label='Training')
mp.legend()
mp.show()
```
Learning curve
Learning curve: model performance = f (training set size)
API required for learning curve:
```python
_, train_scores, test_scores = ms.learning_curve(
    model,                        # model
    input_set, output_set,
    train_sizes=[0.9, 0.8, 0.7],  # training set size sequence
    cv=5)                         # fold count
```
train_scores has the same per-fold structure as for the validation curve, with one row per training set size.
Case: in the car rating case, use the learning curve to select a good training set size.
```python
# Get the learning curve
model = se.RandomForestClassifier(max_depth=9, n_estimators=200,
                                  random_state=7)
train_sizes = np.linspace(0.1, 1, 10)
_, train_scores, test_scores = ms.learning_curve(
    model, x, y, train_sizes=train_sizes, cv=5)
test_means = test_scores.mean(axis=1)
for size, score in zip(train_sizes, test_means):
    print(size, '->', score)
mp.figure('Learning Curve', facecolor='lightgray')
mp.title('Learning Curve', fontsize=20)
mp.xlabel('train_size', fontsize=14)
mp.ylabel('F1 Score', fontsize=14)
mp.tick_params(labelsize=10)
mp.grid(linestyle=':')
mp.plot(train_sizes, test_means, 'o-',
        c='dodgerblue', label='Cross Validation')
mp.legend()
mp.show()
```
Case: predict workers' wage income.
Read adult.txt and, according to the form of each feature column, select an appropriate encoder to train a model that predicts workers' wage level.
- Define a custom label encoder. For columns of digit strings, this encoder preserves the numeric meaning of the feature values.
```python
class DigitEncoder():
    def fit_transform(self, y):
        return y.astype(int)

    def transform(self, y):
        return y.astype(int)

    def inverse_transform(self, y):
        return y.astype(str)
```
- Read the file, sort out the sample data, and label-encode each column in the sample matrix.
```python
num_less, num_more, max_each = 0, 0, 7500
data = []
txt = np.loadtxt('../data/adult.txt', dtype='U20', delimiter=', ')
for row in txt:
    if ' ?' in row:
        continue
    elif str(row[-1]) == '<=50K':
        num_less += 1
        data.append(row)
    elif str(row[-1]) == '>50K':
        num_more += 1
        data.append(row)
data = np.array(data).T
encoders, x = [], []
for row in range(len(data)):
    if str(data[row, 0]).isdigit():
        encoder = DigitEncoder()
    else:
        encoder = sp.LabelEncoder()
    if row < len(data) - 1:
        x.append(encoder.fit_transform(data[row]))
    else:
        y = encoder.fit_transform(data[row])
    encoders.append(encoder)
```
- Divide the training set and test set, build a learning model based on the naive Bayes classification algorithm, output the cross validation score, and validate against the test set.
```python
x = np.array(x).T
train_x, test_x, train_y, test_y = ms.train_test_split(
    x, y, test_size=0.25, random_state=5)
model = nb.GaussianNB()
print(ms.cross_val_score(model, x, y,
                         cv=10, scoring='f1_weighted').mean())
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
print((pred_test_y == test_y).sum() / pred_test_y.size)
```
- Simulate the sample data and predict the income level.
```python
data = [['39', 'State-gov', '77516', 'Bachelors', '13',
         'Never-married', 'Adm-clerical', 'Not-in-family',
         'White', 'Male', '2174', '0', '40', 'United-States']]
data = np.array(data).T
x = []
for row in range(len(data)):
    encoder = encoders[row]
    x.append(encoder.transform(data[row]))
x = np.array(x).T
pred_y = model.predict(x)
print(encoders[-1].inverse_transform(pred_y))
```
Support vector machine (SVM)
Principle of support vector machine
- Seeking the optimal classification boundary (see the formulation after this list):
  - Correct: most samples can be classified correctly.
  - Generalized: the margin between the boundary and the support vectors is maximized.
  - Fair: the boundary is equidistant from the support vectors of each class.
  - Simple: the boundary is linear — a line or a split hyperplane.
- Dimension-raising transformation based on kernel functions: a feature transformation called a kernel function adds new features, so that a problem that is linearly inseparable in the low-dimensional space becomes linearly separable in a higher-dimensional space.
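For reference, the standard hard-margin formulation behind the "optimal classification boundary" bullet (textbook SVM theory, not spelled out in the original notes): the boundary is the hyperplane $w^T x + b = 0$, and training maximizes the margin between the two classes:

$$
\max_{w,b}\ \frac{2}{\lVert w \rVert} \quad \text{subject to} \quad y_i\,(w^T x_i + b) \ge 1 \ \text{ for all } i
$$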
Linear kernel function: 'linear'. It does not raise the dimension through a kernel function; it only seeks a linear classification boundary in the original dimensional space.
API related to SVM classification based on linear kernel function:
```python
import sklearn.svm as svm
model = svm.SVC(kernel='linear')
model.fit(train_x, train_y)
```
Case: classify multiple2.txt with a linear-kernel SVM.
```python
import numpy as np
import sklearn.model_selection as ms
import sklearn.svm as svm
import sklearn.metrics as sm
import matplotlib.pyplot as mp

data = np.loadtxt('../data/multiple2.txt', delimiter=',', dtype='f8')
x = data[:, :-1]
y = data[:, -1]
train_x, test_x, train_y, test_y = \
    ms.train_test_split(x, y, test_size=0.25, random_state=5)
# SVM classifier based on the linear kernel
model = svm.SVC(kernel='linear')
model.fit(train_x, train_y)
# Predict over a grid to draw the classification regions
n = 500
l, r = x[:, 0].min() - 1, x[:, 0].max() + 1
b, t = x[:, 1].min() - 1, x[:, 1].max() + 1
grid_x = np.meshgrid(np.linspace(l, r, n), np.linspace(b, t, n))
flat_x = np.column_stack((grid_x[0].ravel(), grid_x[1].ravel()))
flat_y = model.predict(flat_x)
grid_y = flat_y.reshape(grid_x[0].shape)
pred_test_y = model.predict(test_x)
cr = sm.classification_report(test_y, pred_test_y)
print(cr)
mp.figure('SVM Linear Classification', facecolor='lightgray')
mp.title('SVM Linear Classification', fontsize=20)
mp.xlabel('x', fontsize=14)
mp.ylabel('y', fontsize=14)
mp.tick_params(labelsize=10)
mp.pcolormesh(grid_x[0], grid_x[1], grid_y, cmap='gray')
mp.scatter(test_x[:, 0], test_x[:, 1], c=test_y, cmap='brg', s=80)
mp.show()
```
Polynomial kernel function: 'poly'. It raises the original sample features to higher powers (and their cross terms) through a polynomial function, for example:
$$
y = x_1 + x_2 \\
y = x_1^2 + 2x_1x_2 + x_2^2 \\
y = x_1^3 + 3x_1^2x_2 + 3x_1x_2^2 + x_2^3
$$
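A minimal sketch of the degree-2 expansion using sklearn.preprocessing.PolynomialFeatures, only to make the added terms visible; SVC's 'poly' kernel computes these implicitly via the kernel trick and never materializes the columns:

```python
import numpy as np
import sklearn.preprocessing as sp

x = np.array([[2.0, 3.0]])
poly = sp.PolynomialFeatures(degree=2, include_bias=False)
# Columns: x1, x2, x1^2, x1*x2, x2^2
print(poly.fit_transform(x))  # [[2. 3. 4. 6. 9.]]
```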
Case: train on multiple2.txt with the polynomial kernel.

```python
# SVM classifier based on the polynomial kernel
model = svm.SVC(kernel='poly', degree=3)
model.fit(train_x, train_y)
```
Radial basis function kernel: 'rbf'. It raises the dimension by transforming the original sample features through a Gaussian (normal) distribution function.
Case: train on multiple2.txt with the radial basis function kernel.
```python
# SVM classifier based on the radial basis function kernel
# C: penalty coefficient of the error term (larger C means less regularization)
# gamma: kernel coefficient controlling the width of the Gaussian
model = svm.SVC(kernel='rbf', C=600, gamma=0.01)
model.fit(train_x, train_y)
```
Sample category equalization
Class-weight balancing gives higher weights to samples of under-represented classes and lower weights to samples of over-represented classes, so that every class contributes comparably to the classification model, improving model performance. (Up-sampling or down-sampling the data are alternatives.)
API related to sample category equalization:
```python
model = svm.SVC(kernel='linear', class_weight='balanced')
model.fit(train_x, train_y)
```
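For intuition, sklearn documents class_weight='balanced' as assigning each class the weight n_samples / (n_classes × class_count); a minimal sketch with made-up labels:

```python
import numpy as np

y = np.array([0, 0, 0, 0, 0, 0, 1, 1])  # made-up imbalanced labels
n_samples, n_classes = len(y), len(np.unique(y))
weights = n_samples / (n_classes * np.bincount(y))
print(weights)  # [0.667 2.0] -> the rare class gets the larger weight
```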
Case: modify the linear-kernel SVM case to read imbalance.txt and train the model with sample category equalization.
```python
...
data = np.loadtxt('../data/imbalance.txt', delimiter=',', dtype='f8')
x = data[:, :-1]
y = data[:, -1]
train_x, test_x, train_y, test_y = \
    ms.train_test_split(x, y, test_size=0.25, random_state=5)
# Linear-kernel SVM classifier with balanced class weights
model = svm.SVC(kernel='linear', class_weight='balanced')
model.fit(train_x, train_y)
...
```
Confidence probability
The credibility of a predicted category can be quantified by the sample's distance from the class boundary: the closer a sample lies to the boundary, the lower the confidence probability; the farther away, the higher the confidence probability.
API for obtaining the confidence probability of each sample:
```python
# Pass the hyperparameter probability=True when creating the model
model = svm.SVC(kernel='rbf', C=600, gamma=0.01, probability=True)
pred_y = model.predict(input_samples)
# Calling model.predict_proba(samples) returns the confidence
# probability matrix of the samples
probs = model.predict_proba(input_samples)
```
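To see the boundary-distance intuition directly, one can compare decision_function (the signed distance to the boundary for a binary SVC) with predict_proba; a minimal sketch, assuming the trained rbf model and test_x from the case above:

```python
# Samples with small |decision_function| sit near the boundary and
# typically get predict_proba values near 0.5; distant samples
# approach 0 or 1.
d = model.decision_function(test_x[:3])
p = model.predict_proba(test_x[:3])
print(d)
print(p)
```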
The format of the confidence probability matrix is as follows:
 | Category 1 | Category 2 |
---|---|---|
Sample 1 | 0.8 | 0.2 |
Sample 2 | 0.9 | 0.1 |
Sample 3 | 0.5 | 0.5 |
Case: modify the radial basis function kernel SVM case to add test samples, output the confidence probability of each test sample, and annotate the samples on the plot.
```python
# Arrange the test samples
prob_x = np.array([
    [2, 1.5],
    [8, 9],
    [4.8, 5.2],
    [4, 4],
    [2.5, 7],
    [7.6, 2],
    [5.4, 5.9]])
pred_prob_y = model.predict(prob_x)
probs = model.predict_proba(prob_x)
print(probs)
# Draw each test sample and annotate it with its confidence probabilities
mp.scatter(prob_x[:, 0], prob_x[:, 1], c=pred_prob_y,
           cmap='jet_r', s=80, marker='D')
for i in range(len(probs)):
    mp.annotate(
        '{}% {}%'.format(
            round(probs[i, 0] * 100, 2),
            round(probs[i, 1] * 100, 2)),
        xy=(prob_x[i, 0], prob_x[i, 1]),
        xytext=(12, -12),
        textcoords='offset points',
        horizontalalignment='left',
        verticalalignment='top',
        fontsize=9,
        bbox={'boxstyle': 'round,pad=0.6',
              'fc': 'orange', 'alpha': 0.8})
```