Task
Train a handwritten digit classifier on the MNIST dataset.
Prepare dataset
The dataset is a 70000 × 784 array: 70,000 images, each with 784 pixels, which can be displayed by reshaping into 28 × 28 images. All the training here uses the flat 784-length vectors. Note that this representation obviously throws away the two-dimensional structure of the image, which is one reason convolutional neural networks work so well.
```python
# The old version used mnist = fetch_mldata('MNIST original')
from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', as_frame=False)  # as_frame=False keeps the data as NumPy arrays
X, y = mnist["data"], mnist["target"]
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

mnist['data']    # 70000 x 784 array of black-and-white images, each 28 x 28
mnist['target']  # string labels: ['5', '0', '4', ...]
```
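As a quick sketch of the reshaping mentioned above, displaying one image (index 0 chosen arbitrarily for illustration):

```python
import matplotlib.pyplot as plt

some_digit_image = X[0].reshape(28, 28)  # 784-length vector -> 28 x 28 grid
plt.imshow(some_digit_image, cmap=plt.cm.binary)
plt.axis("off")
plt.show()
```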
Training a binary classifier
A simple linear model trained with stochastic gradient descent is used to judge whether a digit is a 5 or not.
```python
import numpy as np

# Shuffle the training set so folds are not ordered by digit
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

# Binary targets: is the digit a 5 or not?
y_train_5 = (y_train == '5')  # True for all 5s, False for all other digits
y_test_5 = (y_test == '5')

# Classifying with a linear model means thresholding a linear function of the inputs
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

some_digit = X_train[0]        # one 784-length image vector
sgd_clf.predict([some_digit])  # predict expects 784-length feature vectors
```
Performance evaluation
- Cross validation
The most basic evaluation is cross-validation with "accuracy", the proportion of all predictions that are correct. The problem with this metric is that a classifier that simply guesses "not 5" for everything already reaches about 90% accuracy, so we need better evaluation methods.
```python
# Cross validation
from sklearn.model_selection import cross_val_score

# Classification problem, so use accuracy as the score
scores = cross_val_score(sgd_clf, X_train, y_train_5, scoring="accuracy", cv=3)
scores
```
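To see the problem concretely, here is a minimal sketch (the `Never5Classifier` name is ours, not from these notes) of a do-nothing classifier that still scores about 90% accuracy:

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.model_selection import cross_val_score

class Never5Classifier(BaseEstimator):
    """Ignores the data and always predicts 'not 5'."""
    def fit(self, X, y=None):
        return self
    def predict(self, X):
        return np.zeros((len(X),), dtype=bool)

# Roughly 0.90 accuracy, because only about 10% of the images are 5s
cross_val_score(Never5Classifier(), X_train, y_train_5, cv=3, scoring="accuracy")
```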
- Precision
The proportion of samples flagged as the target that really are the target, TP / (TP + FP)
- Recall
The proportion of all actual targets that are found, TP / (TP + FN)
```python
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score

# Out-of-fold predictions on the training set
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

print(precision_score(y_train_5, y_train_pred))  # of everything predicted as 5, the fraction that really is 5
recall_score(y_train_5, y_train_pred)            # of all true 5s, the fraction that was found
```
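Both quantities can also be read off the binary confusion matrix; a small illustrative sketch:

```python
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
confusion_matrix(y_train_5, y_train_pred)
```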
- Obviously, we want both precision and recall to be high, but they trade off against each other like a seesaw: raising the decision threshold increases precision and lowers recall, and vice versa
```python
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve

# Get raw decision scores instead of hard predictions
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
```
Therefore, we need to select an appropriate threshold (precision_recall_curve gives the precision and recall at every threshold, i.e. at every confidence score the classifier can output). Sometimes we want high precision (e.g. face recognition for access control), and sometimes high recall (e.g. quality inspection, where we would rather flag too many parts than miss a defective one).
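For example, a minimal sketch of picking a threshold from these arrays (the 90% precision target is an arbitrary number for illustration):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Lowest threshold whose precision reaches the (arbitrary) 90% target
threshold_90 = thresholds[np.argmax(precisions >= 0.90)]
y_train_pred_90 = (y_scores >= threshold_90)

print(precision_score(y_train_5, y_train_pred_90))  # >= 0.90 by construction
print(recall_score(y_train_5, y_train_pred_90))     # correspondingly lower recall
```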
- PR curve
The curve of precision against recall; a good classifier's curve should be as close to the upper-right corner as possible
```python
plt.plot(precisions, recalls)
```
- F1 score
The harmonic mean of precision and recall, F1 = 2 · P · R / (P + R), can be used as a single detection score
```python
from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred)
```
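Equivalently, as a quick check, the harmonic mean computed by hand:

```python
from sklearn.metrics import precision_score, recall_score

p = precision_score(y_train_5, y_train_pred)
r = recall_score(y_train_5, y_train_pred)
2 * p * r / (p + r)  # matches f1_score(y_train_5, y_train_pred)
```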
- ROC curve
FPR (false positive rate) is the fraction of negatives wrongly classified as positive
TNR (true negative rate) is the fraction of negatives correctly classified; TNR = 1 − FPR
The ROC curve plots TPR (recall) against FPR, and should be as close to the upper-left corner as possible
Prefer the PR curve when positive examples are rare, or when you care more about false positives than false negatives (i.e. a false alarm is costly). Use the ROC curve in other cases.
```python
from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')  # the diagonal is a random classifier
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')

plot_roc_curve(fpr, tpr)

roc_auc_score(y_train_5, y_scores)  # area under the curve: 1.0 is perfect, 0.5 is random
```
Multiclass classification
Some algorithms (such as random forest or naive Bayes classifiers) can handle multiple classes directly. Others (such as SVM classifiers or linear classifiers) are strictly binary classifiers. There are, however, strategies that let you perform multiclass classification with binary classifiers.
One strategy is to train one binary classifier per class and pick the class whose classifier gives the highest score: one-versus-all (OvA)
The other is to train one binary classifier for every pair of classes and let them vote: one-versus-one (OvO). This requires training many more classifiers, N × (N − 1) / 2 for N classes
Some algorithms (such as SVM classifiers) scale poorly with training-set size, so OvO suits them better: training many classifiers on small subsets is faster than training a few on the full set. For most binary classifiers, however, OvA is the better choice.
Let's train a binary classifier directly on the multiclass labels; scikit-learn applies OvA automatically (except for SVM classifiers, where it uses OvO)
```python
sgd_clf.fit(X_train, y_train)  # y_train, not y_train_5
```
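Under the hood scikit-learn has trained ten binary classifiers; a quick sketch of inspecting the per-class scores for one image (`some_digit` from the earlier block):

```python
import numpy as np

some_digit_scores = sgd_clf.decision_function([some_digit])
some_digit_scores              # ten scores, one per OvA classifier
np.argmax(some_digit_scores)   # the class with the highest score wins
```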
You can also force the OvO or OvA strategy with OneVsOneClassifier or OneVsRestClassifier
```python
from sklearn.multiclass import OneVsOneClassifier

ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
ovo_clf.fit(X_train, y_train)
len(ovo_clf.estimators_)  # 45 = 10 * 9 / 2 pairwise classifiers
```
Random forest classifiers handle multiple classes directly, so the OvA/OvO question does not arise for them
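A minimal sketch (not in the original notes) of the same task with a random forest, which predicts all ten classes natively:

```python
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
forest_clf.fit(X_train, y_train)
forest_clf.predict_proba([some_digit])  # one probability per class, no OvA/OvO wrapper
```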
Error analysis for multiclass classification
In the confusion matrix, entry a[i][j] is the number of instances of class i that were classified as class j
```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

# Simple feature scaling (standardization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))

y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
conf_mx = confusion_matrix(y_train, y_train_pred)

plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()
```
This picture shows the classifier works reasonably well, but a little processing makes the errors much easier to see.
First divide each value of the confusion matrix by the total number of instances in the corresponding class, then fill the diagonal with zeros (the diagonal values are so large they drown out the errors)
```python
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums   # error rate per true class
np.fill_diagonal(norm_conf_mx, 0)   # hide the (large) correct counts

plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()
```
You can see which numbers are easily misclassified
- Individual-case analysis: we can pull out the misclassified digits and inspect them to see which cases go wrong, and whether some feature can be added so that training learns to distinguish those cases, as in the sketch below.
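For instance, a minimal sketch of pulling out one kind of error to look at (the choice of classes 3 and 5 is illustrative; they are a commonly confused pair):

```python
import matplotlib.pyplot as plt

# Images whose true class is 3 but which were predicted as 5
cl_a, cl_b = '3', '5'
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]

plt.figure(figsize=(6, 6))
for i, image in enumerate(X_ab[:16]):
    plt.subplot(4, 4, i + 1)
    plt.imshow(image.reshape(28, 28), cmap=plt.cm.binary)
    plt.axis("off")
plt.show()
```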
Multilabel classification
A picture may have several labels. Say a classifier is trained to recognize three faces, Alice, Bob and Charlie; when given a picture containing Alice and Charlie, it should output [1, 0, 1]
```python
from sklearn.neighbors import KNeighborsClassifier

y_train_int = y_train.astype(np.uint8)  # labels are strings, convert for numeric tests
y_train_large = (y_train_int >= 7)      # label 1: is the digit large (7, 8 or 9)?
y_train_odd = (y_train_int % 2 == 1)    # label 2: is the digit odd?
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
knn_clf.predict([some_digit])  # two boolean predictions: [is large, is odd]
```
That is, the target becomes an array with several label columns; everything else stays the same
We use the average F1 score over all labels as the evaluation score, either unweighted or weighted by the number of instances with each label (average="weighted").
```python
y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
f1_score(y_multilabel, y_train_knn_pred, average="macro")     # unweighted mean of per-label F1
f1_score(y_multilabel, y_train_knn_pred, average="weighted")  # weighted by label frequency
```
Multioutput classification
A generalization of the multilabel problem in which each label can take more than two values
For example, input a noisy image and output a clean one: each pixel is one label, and its class is an intensity value from 0 to 255
```python
# Add random pixel noise; the targets are the original clean images
noise_train = np.random.randint(0, 100, (len(X_train), 784))
noise_test = np.random.randint(0, 100, (len(X_test), 784))
X_train_mod = X_train + noise_train
X_test_mod = X_test + noise_test
y_train_mod = X_train
y_test_mod = X_test

knn_clf.fit(X_train_mod, y_train_mod)

def plot_digit(data):
    plt.imshow(data.reshape(28, 28), cmap=plt.cm.binary)
    plt.axis("off")

some_index = 0  # any test image
clean_digit = knn_clf.predict([X_test_mod[some_index]])
plot_digit(clean_digit)
```