Binary classifier, confusion matrix, precision, recall, ROC curve, PR curve, multiclass classifier, error analysis

Handwritten Digit Recognition, Machine Learning "Classification" Learning Notes - From Géron's "Hands-On Machine Learning"

"hello word" in the field of image recognition

MNIST

Fetch the MNIST dataset: 70,000 handwritten digit images, each 28x28 pixels with grayscale values from 0 to 255.


Datasets loaded by Scikit-Learn usually come as dictionary-like structures.


(The dataset has to be downloaded the first time, which takes a while. data_home sets the cache directory, so if you specify the path in advance the data is stored there and will not be downloaded again.)

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784',version=1,data_home='./datasets',as_frame=False)
mnist.keys()
dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])

Note: the call mnist = fetch_openml('mnist_784', version=1, data_home='./datasets', as_frame=False) adds as_frame=False, which is not in the book; without it, the data cannot be used by the array-indexing code that follows.


The notebook on GitHub explains why:
Warning: since Scikit-Learn 0.24, fetch_openml() returns a Pandas DataFrame by default. To avoid this and keep the same code as in the book, we use as_frame=False.

So the difference in data format comes from this version change.

X,y = mnist["data"],mnist["target"]
X.shape
(70000, 784)
y.shape
(70000,)

Display an image using imshow() from Matplotlib:

import matplotlib as mpl
import matplotlib.pyplot as plt
some_digit = X[0]
some_digit_image = some_digit.reshape(28,28)

plt.imshow(some_digit_image,cmap="binary")
plt.axis("off")
plt.show()

y[0]

'5'

Here the label is a string; machine learning algorithms expect numeric labels, so convert it:

import numpy as np
y = y.astype(np.uint8)
# Divide training and test sets
X_train, X_test, y_train, y_test = X[:60000],X[60000:],y[:60000],y[60000:]

Training a Binary Classifier

Start with a binary classification problem: distinguish the digit 5 from everything that is not a 5.
Use the stochastic gradient descent (SGD) classifier; its advantage is that it can efficiently handle very large datasets.
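One reason SGD copes with large datasets is that it learns from instances (or small batches) one at a time. A minimal sketch of incremental training, assuming the X_train and y_train arrays above; the batch size and the y_train_5_demo name are illustrative only:

import numpy as np
from sklearn.linear_model import SGDClassifier

y_train_5_demo = (y_train == 5)              # same 5-vs-rest labels as built in the next cell
inc_clf = SGDClassifier(random_state=27)
classes = np.unique(y_train_5_demo)          # classes must be known on the first partial_fit call
for start in range(0, len(X_train), 10000):  # illustrative batch size
    batch = slice(start, start + 10000)
    inc_clf.partial_fit(X_train[batch], y_train_5_demo[batch], classes=classes)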

y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)

from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state = 27) # 27 is a random number seed, just define it here
sgd_clf.fit(X_train,y_train_5)
SGDClassifier(random_state=27)
sgd_clf.predict([some_digit])#Here is the number 5 shown in the picture above
array([ True])

Performance Measurement

Measuring accuracy using cross-validation

Evaluate the SGDClassifier model with the cross_val_score() function, using K-fold cross-validation (three folds):

from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf,X_train,y_train_5,cv=3,scoring="accuracy")
array([0.9436 , 0.95535, 0.9681 ])

Accuracy above 93% on every cross-validation fold looks very good, but a classifier that simply predicts "not 5" for every image would still reach about 90% accuracy, because only about 10% of the images are 5s, as shown below.

from sklearn.base import BaseEstimator

class Never5Classifier(BaseEstimator):
    def fit(self,X,y=None):
        return self
    def predict(self,X):
        return np.zeros((len(X),1),dtype=bool)  # predict "not 5" for every instance
    
never_5_clf = Never5Classifier()
cross_val_score(never_5_clf,X_train,y_train_5,cv=3,scoring="accuracy")
array([0.91125, 0.90855, 0.90915])

This shows that accuracy is usually not a good primary performance metric for a classifier, especially on skewed datasets; other metrics are needed to judge whether a model is good or bad.

Confusion Matrix

The confusion matrix counts how many times instances of class A were classified as class B.

from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf,X_train,y_train_5,cv=3)

from sklearn.metrics import confusion_matrix
confusion_matrix(y_train_5, y_train_pred)
array([[52775,  1804],
       [  855,  4566]], dtype=int64)

The matrix above is the confusion matrix: 52775 non-5 images were correctly classified as non-5 (true negatives, TN), 1804 non-5 images were wrongly classified as 5 (false positives, FP), 855 images of 5 were wrongly classified as non-5 (false negatives, FN), and 4566 images of 5 were correctly classified as 5 (true positives, TP). The predictions are clearly far from perfect, even though the accuracy reported above was around 93%; this is another reason not to rely on accuracy alone to measure the quality of a model.
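For a binary problem the four counts can be read straight out of the matrix; a minimal sketch (the lowercase variable names are just for illustration):

# confusion_matrix returns [[TN, FP], [FN, TP]] for a binary classifier
tn, fp, fn, tp = confusion_matrix(y_train_5, y_train_pred).ravel()
print(tn, fp, fn, tp)   # 52775 1804 855 4566 for the run shown above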

Precision and Recall

$Precision = \frac{TP}{TP+FP}$ is the accuracy of the positive predictions.

$Recall = \frac{TP}{TP+FN}$ is the proportion of positive instances that the classifier correctly detects.
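As a sanity check, both formulas can be evaluated directly from the confusion-matrix counts extracted above (a sketch; the values match the run shown earlier):

precision_manual = tp / (tp + fp)   # 4566 / (4566 + 1804) ≈ 0.717
recall_manual = tp / (tp + fn)      # 4566 / (4566 + 855) ≈ 0.842
print(precision_manual, recall_manual)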

from sklearn.metrics import precision_score, recall_score
print(precision_score(y_train_5, y_train_pred))
recall_score(y_train_5, y_train_pred)
0.7167974882260597
0.8422800221361373

What do these precision and recall values mean? When the classifier claims an image is a 5, it is right only about 72% of the time, and it only detects about 84% of the actual 5s.


At this point we need a single metric that combines precision and recall to evaluate the model; this is where the F1 score comes in. The F1 score is the harmonic mean of precision and recall.


$F_1 = \frac{2}{\frac{1}{Precision}+\frac{1}{Recall}} = 2\times\frac{Precision\times Recall}{Precision+Recall} = \frac{TP}{TP+\frac{FN+FP}{2}}$
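The harmonic mean can be checked by hand with the manual precision and recall computed above (a sketch; it should match f1_score below):

f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)
print(f1_manual)   # ≈ 0.774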

from sklearn.metrics import f1_score
f1_score(y_train_5,y_train_pred)
0.7744890170469002

Precision/Recall Tradeoff

Scikit-Learn does not let you set the decision threshold directly, but it gives access to the decision scores it uses to make predictions, as shown below.

y_scores = sgd_clf.decision_function([some_digit])
y_scores
array([1066.49326077])
threshold = 0
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred
array([ True])
threshold = 1100
y_some_digit_pred = (y_scores > threshold)
y_some_digit_pred
array([False])

This demonstrates the decision threshold used by SGDClassifier: raising the threshold reduces recall.
Here is how to get decision scores for all instances in the training set:

y_scores = cross_val_predict(sgd_clf,X_train,y_train_5,cv=3,method="decision_function")
# Compute precision and recall for all possible thresholds
from sklearn.metrics import precision_recall_curve
precisions,recalls,thresholds = precision_recall_curve(y_train_5,y_scores)

# Plot precision and recall as functions of the decision threshold
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.legend(loc="center right", fontsize=16) 
    plt.xlabel("Threshold", fontsize=16)        
    plt.grid(True)                              
    plt.axis([-50000, 50000, 0, 1])     

plot_precision_recall_vs_threshold(precisions,recalls,thresholds)
plt.show()

# Plot precision against recall
def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])
    plt.grid(True)
recall_90_precision = recalls[np.argmax(precisions >= 0.90)]
plt.figure(figsize=(8, 6))
plot_precision_vs_recall(precisions, recalls)
plt.plot([recall_90_precision, recall_90_precision], [0., 0.9], "r:")
plt.plot([0.0, recall_90_precision], [0.9, 0.9], "r:")
plt.plot([recall_90_precision], [0.9], "ro")
plt.show()

# Find the lowest threshold that gives at least 90% precision

threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]
y_train_pred_90 = (y_scores >= threshold_90_precision)
precision_score(y_train_5,y_train_pred_90)
0.9001457725947521
recall_score(y_train_5,y_train_pred_90)
0.6834532374100719

So if someone tells you they need 99% precision, your first question should be: "At what recall?" That question often catches people off guard.

ROC Curve

The ROC curve plots the true positive rate (TPR, i.e. recall) against the false positive rate (FPR), which is the proportion of negative instances incorrectly classified as positive; FPR = 1 - TNR.


In other words, the ROC curve plots sensitivity (recall) against 1 - specificity.
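In terms of confusion-matrix counts, the two rates can be written out explicitly; a small sketch reusing the tn/fp/fn/tp values extracted earlier (default threshold), which gives one point on the ROC curve:

tpr_default = tp / (tp + fn)   # true positive rate = recall
fpr_default = fp / (fp + tn)   # false positive rate = 1 - TNR
print(tpr_default, fpr_default)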

from sklearn.metrics import roc_curve
fpr,tpr,threshold = roc_curve(y_train_5,y_scores)

def plot_roc_curve(fpr,tpr,label = None):
    plt.plot(fpr,tpr,linewidth=2,label=label)
    plt.plot([0,1],[0,1],'k--')
    plt.axis([0, 1, 0, 1])                                    
    plt.xlabel('False Positive Rate (Fall-Out)', fontsize=16) 
    plt.ylabel('True Positive Rate (Recall)', fontsize=16)    
    plt.grid(True)   

plot_roc_curve(fpr,tpr)
plt.show()

Another way to compare classifiers is to measure the area under the curve (AUC): a perfect classifier has AUC = 1, while a purely random classifier has AUC = 0.5.

from sklearn.metrics import roc_auc_score
roc_auc_score(y_train_5,y_scores)
0.9604387033143528

Choosing Between the ROC Curve and the PR Curve

Prefer the PR curve whenever the positive class is rare, or when you care more about false positives than false negatives; otherwise, use the ROC curve.
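As a single-number summary of the PR curve (roughly analogous to the ROC AUC), Scikit-Learn provides the average precision; a minimal sketch using the SGD decision scores computed earlier:

from sklearn.metrics import average_precision_score
average_precision_score(y_train_5, y_scores)   # summarizes the precision-recall curve in one number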

Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=27)
y_probas_forest =cross_val_predict(forest_clf,X_train,y_train_5,cv=3,method="predict_proba")
y_scores_forest = y_probas_forest[:,1]
fpr_forest,tpr_forest,thresholds_forest = roc_curve(y_train_5,y_scores_forest)

plt.plot(fpr,tpr,"b:",label="SGD")
plot_roc_curve(fpr_forest,tpr_forest,"Random forest")
plt.legend(loc="lower right")
plt.show()


roc_auc_score(y_train_5,y_scores_forest)
0.9983414796223264

RandomForestClassifier's ROC curve looks much better than SGDClassifier's: it comes closer to the top-left corner, so its ROC AUC score is much higher.

Multiclass Classifier

Multiclass strategies fall into one-versus-one (OvO) and one-versus-the-rest (OvR). SVC uses OvO here, which means the multiclass classifier actually trains 45 binary classifiers, one for every pair of digits.
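For N = 10 classes, OvO trains N(N-1)/2 = 45 binary classifiers. This can be verified with OneVsOneClassifier; a minimal sketch that fits on a small slice of the data just to keep it fast (the slice size is illustrative):

from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

ovo_clf = OneVsOneClassifier(SVC())
ovo_clf.fit(X_train[:1000], y_train[:1000])   # small slice, illustration only
len(ovo_clf.estimators_)                      # 45 = 10 * 9 / 2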

from sklearn.svm import SVC

svm_clf = SVC()
svm_clf.fit(X_train,y_train)
svm_clf.predict([some_digit])
array([5], dtype=uint8)
some_digit_scores = svm_clf.decision_function([some_digit])
some_digit_scores
array([[ 1.72501977,  2.72809088,  7.2510018 ,  8.3076379 , -0.31087254,
         9.3132482 ,  1.70975103,  2.76765202,  6.23049537,  4.84771048]])

The highest score indeed corresponds to class 5:

np.argmax(some_digit_scores)
5

When a classifier is trained, the list of target classes is stored in its classes_ attribute, sorted by value:

svm_clf.classes_
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint8)

If you want to force Scikit-Learn to use the one-versus-one or one-versus-the-rest strategy, you can use the OneVsOneClassifier or OneVsRestClassifier class.

from sklearn.multiclass import OneVsRestClassifier

ovr_clf = OneVsRestClassifier(SVC())
ovr_clf.fit(X_train,y_train)
ovr_clf.predict([some_digit])
array([5], dtype=uint8)
len(ovr_clf.estimators_)
10

Training SGDClassifier or RandomForestClassifier on the multiclass problem works the same way:

sgd_clf.fit(X_train,y_train)
sgd_clf.predict([some_digit])
#SGD classifiers can classify instances into multiple classes directly, so there is no need to choose between OvO and OvR
array([3], dtype=uint8)
sgd_clf.decision_function([some_digit])
array([[-16594.39761568, -22903.10175344, -15146.89058029,
          1185.04960985, -20053.1928768 ,    508.90204236,
        -23168.38978204, -19229.31273118, -10995.42427777,
         -5902.26098972]])

To evaluate this classifier, use cross-validation as usual:

cross_val_score(sgd_clf,X_train,y_train,cv=3,scoring="accuracy")
array([0.8714 , 0.8818 , 0.86235])

Simply scaling the inputs further improves accuracy:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
cross_val_score(sgd_clf,X_train_scaled,y_train,cv=3,scoring="accuracy")
array([0.90025, 0.89075, 0.901  ])

Error Analysis

One way to improve the model further is to analyze the types of errors it makes; start by looking at the confusion matrix:

y_train_pred = cross_val_predict(sgd_clf,X_train_scaled,y_train,cv=3)
conf_mx = confusion_matrix(y_train,y_train_pred)
conf_mx
array([[5572,    0,   23,    6,    9,   48,   36,    6,  222,    1],
       [   0, 6399,   39,   21,    4,   44,    4,    7,  214,   10],
       [  27,   27, 5243,   90,   71,   24,   65,   36,  368,    7],
       [  22,   17,  117, 5217,    2,  209,   26,   39,  411,   71],
       [  10,   14,   48,    8, 5190,   12,   35,   24,  338,  163],
       [  26,   15,   29,  167,   54, 4449,   73,   14,  536,   58],
       [  30,   15,   46,    2,   44,   96, 5547,    3,  134,    1],
       [  20,   11,   50,   25,   49,   12,    3, 5692,  192,  211],
       [  16,   65,   51,   89,    3,  126,   24,   10, 5425,   42],
       [  21,   18,   30,   61,  116,   36,    1,  178,  382, 5106]],
      dtype=int64)

The raw numbers are hard to read, so display the matrix as an image with matshow() from Matplotlib:

plt.matshow(conf_mx,cmap=plt.cm.gray)
plt.show()

To compare error rates, divide each value in the confusion matrix by the number of images in the corresponding class:

row_sums = conf_mx.sum(axis=1,keepdims = True)
norm_conf_mx = conf_mx / row_sums

Fill the diagonal with zeros to keep only the errors, then redraw the result:

np.fill_diagonal(norm_conf_mx,0)
plt.matshow(norm_conf_mx,cmap=plt.cm.gray)
plt.show()

Analyzing confusion matrices often helps you gain insight into how to improve your classifier.
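For example, the largest entries in norm_conf_mx point to the most frequently confused digit pairs; a minimal sketch (the variable names are illustrative):

# Rank the (true class, predicted class) pairs by normalized error rate
flat_order = np.argsort(norm_conf_mx, axis=None)[::-1]
rows, cols = np.unravel_index(flat_order[:3], norm_conf_mx.shape)
for true_cls, pred_cls in zip(rows, cols):
    print(f"{true_cls} misclassified as {pred_cls}: {norm_conf_mx[true_cls, pred_cls]:.2%}")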
