Ensemble Learning and Bagging

Ensemble Learning

Strictly speaking, ensemble learning is not a machine learning algorithm in its own right; it introduces no fundamentally new learning algorithm. The overall idea of ensemble learning is to complete the final task with multiple basic (weak) learners. Where it used to be a single algorithm facing a problem alone, it is now a group of algorithms (the same one repeated, or different ones) tackling the problem together. Academically this is called ensemble learning, or a multi-classifier system.
Ensemble learning can be roughly divided into three approaches:

  • Bagging

  • Boosting

  • Stacking

Here the topic is split into three parts. First we introduce the idea behind the Bagging algorithm together with examples; Boosting and Stacking will each be introduced later.

Bagging

Now imagine a simple classification problem. Previously, when we used KNN, a decision tree, an SVM, or another algorithm for prediction, only one base learner was produced, and the final decision depended entirely on that single learner. Now suppose we draw samples with replacement (bootstrap sampling) and run the learning algorithm 10 times: we obtain 10 different base learners. However we end up using them later, the first step is simply to generate them.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split,GridSearchCV
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from math import floor
# Get the data. The data here follow the Titanic data. You can download the data from the link in this article
# https://blog.csdn.net/qq_40725653/article/details/117922390?spm=1001.2014.3001.5501
dataset = pd.read_csv(r"E:\Jupyter\Pytorch\ML_Algorithm\Decision_tree\titanic\train.csv")
dataset = dataset.drop(columns=['Name','SibSp','Parch','Ticket','Fare','Cabin','Embarked'])
# Simple data processing to obtain features and labels
features = dataset.iloc[:,-3:]
target = dataset.loc[:,["Survived"]]
# Missing value processing, filled with average value
for line in features.columns:
    if features[line].isnull().any():
        features[line].fillna(value=floor(round(features[line].mean(),1)),inplace=True)

from sklearn.feature_extraction import DictVectorizer     
# Train/test split
x_train,x_test,y_train,y_test = train_test_split(features,target,test_size=0.2,random_state=1)
# Feature engineering: one-hot encode the dict records
transfer = DictVectorizer(sparse=False)
x_train = transfer.fit_transform(x_train.to_dict(orient="records"))
x_test = transfer.transform(x_test.to_dict(orient="records"))  # only transform the test set with the already fitted vectorizer
# Set up the bagging model (bootstrap_features=True resamples features; 10-fold cross validation is applied via GridSearchCV below)
cif = BaggingClassifier(base_estimator=DecisionTreeClassifier(),bootstrap=False,bootstrap_features=True,random_state=2)
# You can change the basic algorithm here
#cif = BaggingClassifier(base_estimator=SVC(),bootstrap=False,bootstrap_features=True,random_state=2)
params = {'n_estimators':[10,50,100,150,200]}
cif = GridSearchCV(estimator=cif,param_grid=params,cv=10)
# train
y_train = np.array(y_train).ravel()
y_test = np.array(y_test).ravel()
cif.fit(x_train,y_train)

# forecast
y_pred = cif.predict(x_test)
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test,y_pred))

Voting

Through the above approach we can generate a variety of base models. Suppose that, for the same classification problem, a decision tree, a random forest, and an SVM have all been trained. That gives three outputs, but in the end we only need one label. How do we combine multiple outputs into a single final label? With a combination strategy.

Combination strategy 1: averaging. For tasks with numerical outputs, such as regression, the most common strategy is averaging.
1: Simple averaging
$H(x)=\dfrac{1}{T}\sum_{i=1}^{T}h_i(x)$
2: Weighted averaging
$H(x)=\sum_{i=1}^{T}w_ih_i(x)$
where $w_i$ is the weight of the individual learner $h_i$, usually required to satisfy $w_i\ge 0$ and $\sum_{i=1}^{T}w_i=1$; the weights are generally learned from the training data. A similar mechanism exists in the voting approach and will be explained in detail in the Boosting article.
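To make the two averaging strategies concrete, here is a minimal numpy sketch. The predictions and weights are made-up numbers for illustration only; in practice $h_i(x)$ would be the outputs of the trained base regressors.

import numpy as np

# Predictions of T = 3 hypothetical regressors on the same 4 samples (made-up numbers)
h = np.array([[2.1, 3.0, 5.2, 0.9],
              [1.9, 3.4, 4.8, 1.1],
              [2.3, 2.8, 5.0, 1.0]])

# Simple averaging: H(x) = (1/T) * sum_i h_i(x)
H_simple = h.mean(axis=0)

# Weighted averaging: H(x) = sum_i w_i * h_i(x), with w_i >= 0 and sum(w_i) = 1
w = np.array([0.5, 0.3, 0.2])   # assumed weights, e.g. learned from training data
H_weighted = (w[:, None] * h).sum(axis=0)

print(H_simple)
print(H_weighted)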

Combination strategy 2: voting. For classification problems, a voting scheme predicts one label from the set of class labels as the final output.
In the Bagging algorithm equal voting is used: every generated learner is given the same weight, each base learner casts one vote of equal weight, and the final class is decided by the overall tally. Two voting schemes are common: hard voting and soft voting.
1: Hard voting: the class that receives the most votes wins.
2: Soft voting: each learner must output class probabilities; summing weight * predicted probability over all learners gives the ensemble probability for each class, and the class with the largest probability wins. This is equivalent to combining the posterior probabilities $P(c_i|x)$ ($c_i$: predicted class, $x$: sample).
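To make the difference between the two schemes concrete, here is a small hand computation with made-up probabilities and equal weights, showing that hard and soft voting can disagree on the same set of classifiers:

import numpy as np

# Made-up class probabilities from T = 3 classifiers for one sample with 2 classes
proba = np.array([[0.45, 0.55],   # classifier 1 leans towards class 1
                  [0.48, 0.52],   # classifier 2 leans towards class 1
                  [0.90, 0.10]])  # classifier 3 is very confident about class 0
w = np.array([1/3, 1/3, 1/3])     # equal weights, as in plain Bagging

# Hard voting: each classifier casts one vote for its most probable class
hard_votes = proba.argmax(axis=1)              # [1, 1, 0]
hard_label = np.bincount(hard_votes).argmax()  # class 1 wins 2 votes to 1

# Soft voting: weighted average of the probabilities, then take the argmax
soft_proba = (w[:, None] * proba).sum(axis=0)  # [0.61, 0.39]
soft_label = soft_proba.argmax()               # class 0 wins

print(hard_label, soft_label)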

# The imports from the previous section are reused here
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import plot_roc_curve
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Reuse the train/test split prepared above
train_set, train_target, test_set, test_target = x_train, y_train, x_test, y_test
# Generate decision tree, SVM, and random forest learners
tree = DecisionTreeClassifier(criterion="gini",max_depth=3,random_state=1)
tree.fit(train_set,train_target)

forest = RandomForestClassifier(n_estimators=100,criterion="gini",max_depth=3,random_state=2)
forest.fit(train_set,train_target)

svc = SVC(C=1.0,kernel="rbf",degree=3,random_state=3,probability=True)
svc.fit(train_set,train_target)

from sklearn.ensemble import VotingClassifier
# Build a voting classifier; the keyword argument voting='soft'/'hard' switches between soft and hard voting
vote = VotingClassifier(estimators=[('dt',tree),('rf',forest),('svc',svc)],voting="soft")
vote.fit(train_set,train_target)

clf = [tree,forest,svc,vote]
for model in clf:
    model.fit(train_set,train_target)
    y_pred = model.predict(test_set)
    score = accuracy_score(test_target,y_pred)
    print(model.__class__.__name__,score)
# Plot their ROC curves as well
ax = plt.gca()
ax.grid(linestyle="dashed",alpha=0.6)
svc_display = plot_roc_curve(svc,test_set,test_target,ax=ax,color="red")
forest_display = plot_roc_curve(forest,test_set,test_target,ax=ax)
tree_display = plot_roc_curve(tree,test_set,test_target,ax=ax)
plt.show()

Note: soft voting requires probability estimates. The decision trees and random forests we commonly use compute class probabilities by default, but SVM classification is based on distances to the decision boundary, so we need to set probability=True to enable probability estimation.
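As a quick check (a minimal sketch reusing the svc fitted above; this is standard scikit-learn behaviour), predict_proba is only available because probability=True was passed to the constructor:

# svc was constructed with probability=True above, so this works
print(svc.predict_proba(test_set)[:3])  # class probabilities for the first 3 test samples
# An SVC built without probability=True would raise an AttributeError on predict_proba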

Random Forest (a Bagging variant)

Random forest is an extended variant of Bagging. It uses decision trees as base learners and introduces random attribute selection during tree training (at each split, k features are randomly chosen from the m available features and only those are considered), so that the individual decision trees differ from one another. The final prediction is then made by the whole forest of decision trees together.
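The "random attribute selection" step can be sketched in a few lines of numpy. This is an illustrative toy rather than the scikit-learn implementation; k and m are simply the symbols used above, and k = sqrt(m) is a common default choice.

import numpy as np

rng = np.random.default_rng(0)
m = 8                        # total number of features
k = int(np.sqrt(m))          # number of features considered at each split

# At every split of every tree, only a random subset of k features is searched
candidate_features = rng.choice(m, size=k, replace=False)
print(candidate_features)    # indices of the features considered at this split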

# Same as the previous code import package
# Get the data. The data here follow the Titanic data. You can download the data from the link in this article
# https://blog.csdn.net/qq_40725653/article/details/117922390?spm=1001.2014.3001.5501
dataset = pd.read_csv(r"E:\Jupyter\Pytorch\ML_Algorithm\Decision_tree\titanic\train.csv")
dataset = dataset.drop(columns=['Name','SibSp','Parch','Ticket','Fare','Cabin','Embarked'])
# Simple data processing to obtain features and labels
features = dataset.iloc[:,-3:]
target = dataset.loc[:,["Survived"]]
# Missing value processing, filled with average value
for line in features.columns:
    if features[line].isnull().any():
        features[line].fillna(value=floor(round(features[line].mean(),1)),inplace=True)

from sklearn.feature_extraction import DictVectorizer     
# Train/test split
x_train,x_test,y_train,y_test = train_test_split(features,target,test_size=0.2,random_state=1)
# Feature engineering: one-hot encode the dict records
transfer = DictVectorizer(sparse=False)
x_train = transfer.fit_transform(x_train.to_dict(orient="records"))
x_test = transfer.transform(x_test.to_dict(orient="records"))  # only transform the test set with the already fitted vectorizer
estimator = RandomForestClassifier(max_features="auto")
param_grid = {"n_estimators":[10,30,50,80,100,120,150],"max_depth":[2,4,6,8]}
estimator = GridSearchCV(estimator,param_grid=param_grid,cv=10)
# Train the model
estimator.fit(x_train,np.array(y_train).ravel())
# Model evaluation
score = estimator.score(x_test,y_test)
print(score)
print(estimator.best_estimator_)

Bagging summary

1: Samples (or features) are resampled for each learner to train on
2: Every learner gets an equal voting weight
3: Learners are trained in parallel; there is no strong dependence between them
4: Bagging is mainly used to reduce variance and overfitting, improving the generalization performance of the model (see the sketch below)
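To illustrate point 4, here is a minimal sketch (reusing the x_train / y_train prepared above) that compares the cross-validated accuracy of a single unpruned decision tree with a bagged ensemble of such trees. The exact numbers depend on the data, but the bagged version typically shows a higher mean and a smaller spread across folds:

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

y = np.array(y_train).ravel()

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=0),
                                 n_estimators=100,random_state=0)

tree_scores = cross_val_score(single_tree,x_train,y,cv=10)
bag_scores = cross_val_score(bagged_trees,x_train,y,cv=10)

print("single tree:  mean=%.3f std=%.3f" % (tree_scores.mean(),tree_scores.std()))
print("bagged trees: mean=%.3f std=%.3f" % (bag_scores.mean(),bag_scores.std()))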
