Machine Learning Notes 6: Ensemble Learning Implemented from Scratch in Python


In supervised machine learning, the goal is to learn a single stable model that performs well in every respect, but in practice this is rarely achievable. Often we can only obtain several models that each do well in some respects (weak learners). As the Chinese proverb goes, three cobblers together can match Zhuge Liang. Ensemble learning combines several such weak supervised models to obtain a stronger, more comprehensive one.

1. Import data

import pandas as pd

# Despite its name, this helper reads a CSV file
def read_xlsx(csv_path):
    data = pd.read_csv(csv_path)
    return data

2. Divide training set and test set

For convenience, sklearn's train_test_split is called directly here. For a from-scratch implementation, see my previous articles.

from sklearn.model_selection import train_test_split
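# data is the DataFrame returned by read_xlsx; the last column is assumed to hold the label
x = data.iloc[:, :-1]   # features: every column except the last
y = data.iloc[:, -1]    # label: the last column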
x_train, x_test, y_train, y_test = train_test_split(x, y)
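
For readers who want a rough idea of what such a split does, here is a minimal sketch. It is not sklearn's implementation: the function name simple_split and the seed parameter are made up for this illustration, and the 25% test share simply mirrors the default of train_test_split.

```python
import numpy as np

def simple_split(x, y, test_size=0.25, seed=None):
    """Shuffle the rows, then cut off a test portion (illustrative helper)."""
    x = np.asarray(x)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    # Random order of all row indices, without repetition
    order = rng.permutation(len(x))
    n_test = int(np.ceil(len(x) * test_size))
    test_idx, train_idx = order[:n_test], order[n_test:]
    # Same return order as train_test_split: x_train, x_test, y_train, y_test
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]
```

Features and labels are indexed with the same shuffled indices, so every row keeps its own label.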

3. Ensemble learning

3.1 Max voting

Max voting is typically used for classification problems. Several models each predict a label for every data point, each prediction counts as one "vote", and the label predicted by the majority of the models becomes the final prediction.

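import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB

def voting(x_train, x_test, y_train):
    model1 = DecisionTreeClassifier()
    model2 = KNeighborsClassifier()
    model3 = MultinomialNB()
    # Fit each base model on the same training set
    model1.fit(x_train, y_train)
    model2.fit(x_train, y_train)
    model3.fit(x_train, y_train)

    a = model1.predict(x_test)
    b = model2.predict(x_test)
    c = model3.predict(x_test)
    labels = []
    for i in range(len(x_test)):
        # Votes of the three models for sample i (np.bincount needs non-negative integer labels)
        ypred = [a[i], b[i], c[i]]
        counts = np.bincount(ypred)
        label = np.argmax(counts)   # the most frequent label wins
        labels.append(label)
    return labels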
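
To make the vote counting concrete, here is a small worked sketch on made-up predictions. The helper majority_vote and the toy labels are assumptions for illustration only. It uses the same np.bincount / np.argmax idea, so it also assumes non-negative integer labels.

```python
import numpy as np

def majority_vote(predictions):
    """predictions: shape (n_models, n_samples), non-negative integer labels."""
    predictions = np.asarray(predictions)
    # For each sample (one column), count the votes per label and keep the top one
    return np.array([np.argmax(np.bincount(col)) for col in predictions.T])

# Made-up predictions of three models for four samples
preds = [
    [0, 1, 2, 1],  # model 1
    [0, 2, 2, 0],  # model 2
    [1, 1, 0, 2],  # model 3
]
print(majority_vote(preds))  # [0 1 2 0]
```

The last sample is a three-way tie. np.argmax returns the first maximum, so ties go to the smallest label; whether that is acceptable depends on the problem.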

Besides max voting, there are also averaging and weighted averaging.

Averaging is similar to max voting: each model makes a prediction for every data point, and the mean of those predictions is taken as the final prediction. Averaging can be used for predictions in regression problems or for the predicted probabilities in classification problems.

Weighted averaging is an extension of averaging. Each model is assigned a weight that reflects how much its prediction should count. For example, if two of your colleagues are critics and the others have no experience in the field, the answers of those two colleagues are given more weight than the rest.
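
The two averaging schemes can be sketched with NumPy as follows. The probabilities and weights below are invented for illustration and do not come from any trained model.

```python
import numpy as np

# Made-up class-1 probabilities from three models for two samples
probs = np.array([
    [0.2, 0.8],  # model 1
    [0.4, 0.6],  # model 2
    [0.9, 0.1],  # model 3
])

# Simple averaging: every model counts equally
simple_avg = probs.mean(axis=0)        # approximately [0.5, 0.5]

# Weighted averaging: model 1 is trusted twice as much as the other two
weights = [0.5, 0.25, 0.25]
weighted_avg = np.average(probs, axis=0, weights=weights)  # approximately [0.425, 0.575]
```

np.average divides by the sum of the weights, so the weights do not have to add up to 1.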

3.2 Bagging

In bagging, the bootstrap method draws n data sets from the full data set by sampling with replacement, and one model is trained on each of them. The final prediction combines the outputs of the n models: a vote over their predictions for classification problems, and the average of their predictions for regression problems.
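
To see what sampling with replacement does to a data set, here is a minimal sketch. The helper name bootstrap_indices, the fixed seed and the sample size of 1000 are arbitrary choices for this illustration.

```python
import numpy as np

def bootstrap_indices(n_rows, n_draws, rng):
    # Draw row indices uniformly at random WITH replacement:
    # the same row may be drawn several times, others not at all
    return rng.integers(0, n_rows, size=n_draws)

rng = np.random.default_rng(0)
idx = bootstrap_indices(1000, 1000, rng)

# Share of distinct rows in the sample; for large n this is typically
# close to 1 - 1/e (about 0.632)
unique_fraction = len(np.unique(idx)) / 1000
print(unique_fraction)
```

Because each bootstrap sample leaves out a different part of the data, the models trained on them differ from one another, which is what makes combining their outputs worthwhile.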

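import numpy as np

# Sampling with replacement
def random_sampling(x, y, m):
    x = np.array(x)
    y = np.array(y)
    # Draw m row indices with replacement, so a row can be picked more than once
    a = np.random.choice(len(x), size=m, replace=True)
    subset = x[a]
    label = y[a]
    return subset, label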

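from pandas import DataFrame
from sklearn.tree import DecisionTreeClassifier

# Train a model on each subset
def model(x_train, y_train, x_test):
    tree = DecisionTreeClassifier()
    yclass = []
    for i in range(20):
        # Draw a bootstrap sample, refit the tree on it, and predict the test set
        subset, label = random_sampling(x_train, y_train, 150)
        tree.fit(subset, label)
        a = tree.predict(x_test)
        a = list(a)
        yclass.append(a)
    data = DataFrame(yclass)   # one row per model, one column per test sample
    ypred = []
    for col in data.columns:
        # Majority vote over the 20 models (a plain mean would only suit regression or 0/1 labels)
        ypred.append(data[col].mode().iloc[0])
    return ypred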

4. Computing accuracy

def accuracy(ypred, y_test):
    correct = 0
    y_test = list(y_test)
    for x in range(len(y_test)):
        if ypred[x] == y_test[x]:
            correct += 1
    accuracy = (correct / float(len(y_test))) * 100.0
    print("Accuracy:", accuracy, "%")
    return accuracy


Ensemble learning can improve on the performance of the individual weak classifiers.

Keywords: Python Algorithm Machine Learning Decision Tree

Added by simmsy on Wed, 19 Jan 2022 19:45:47 +0200