Ensemble Learning 6 -- Blending and Stacking

GitHub repository: DataScicence
Ensemble Learning 5 - XGBoost: principle and parameter tuning
Ensemble Learning 4 - Forward stagewise algorithm and GBDT: principle and cases
Ensemble Learning 3 - Boosting: principle and cases
Ensemble Learning 2 - Bagging: principle and cases
Ensemble Learning 1 - Voting: principle and cases

Blending

Principle

Steps:

  • Divide the data into three parts: Train_data, Validate_data, Test_data
  • First-layer models:
    • Train multiple base models on Train_data to obtain models $M^k$
    • Feed Validate_data and Test_data into each model to obtain the prediction labels $A^k$ and $B^k$
  • Second-layer model:
    • Use the validation-set predictions $A^k$ as input and the true labels as output to train the meta-model $N$
    • Feed $B^k$ into model $N$ to obtain the final prediction label $T$ on the test set
  • Evaluate the model's performance

Examples

import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns

Import data

from sklearn.datasets import load_iris
data = load_iris()
X = data.data[:,1:3]
y = data.target
print(X.shape,y.shape)
(150, 2) (150,)

Data splitting

from sklearn.model_selection import train_test_split

X_train_,X_test,y_train_,y_test = train_test_split(X,y,test_size = 0.2)

X_train,X_val,y_train,y_val = train_test_split(X_train_,y_train_)
print(X_train.shape,X_val.shape,X_test.shape)
(90, 2) (30, 2) (30, 2)

First-layer classifiers

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Model M^k
clfs = [SVC(probability = True),DecisionTreeClassifier(),KNeighborsClassifier()]
val_features = np.zeros((X_val.shape[0],len(clfs)))  # A^k
test_features = np.zeros((X_test.shape[0],len(clfs)))  #B^k

for i,clf in enumerate(clfs):
    clf.fit(X_train,y_train)
    # Use the predicted probability of class 1 as the meta-feature
    # (a simplification: only one of the three class probabilities is kept)
    val_feature = clf.predict_proba(X_val)[:, 1]
    test_feature = clf.predict_proba(X_test)[:,1]
    val_features[:,i] = val_feature
    test_features[:,i] = test_feature

Second-layer classifier

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

#Model N
lr.fit(val_features,y_val)
# Evaluate the meta-model on the test-set meta-features
from sklearn.model_selection import cross_val_score
cross_val_score(lr,test_features,y_test,cv=5)
array([0.83333333, 0.66666667, 0.66666667, 0.83333333, 0.66666667])
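
The steps above also call for a final prediction label $T$ on the test set, which the example stops short of producing. A minimal sketch, reusing the lr meta-model and test_features built above (the choice of accuracy_score as the metric is an assumption):

from sklearn.metrics import accuracy_score

# Feed B^k (test_features) through the trained meta-model N (lr)
# to obtain the final test-set prediction T
T = lr.predict(test_features)
print('Blending test accuracy:', accuracy_score(y_test, T))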

Draw decision boundary

Note that the Blending decision boundary shown here is not exact: the meta-model lr really takes the first-layer predictions as input, not the raw features, but for plotting each classifier (including lr) is refit directly on X and y.

from mlxtend.plotting import plot_decision_regions
import matplotlib.gridspec as gridspec
import itertools
clfs.append(lr)
gs = gridspec.GridSpec(2, 2)
fig = plt.figure(figsize=(10,8))
for clf, lab, grd in zip(clfs, 
                         ['SVC', 
                          'DT', 
                          'KNN',
                          'Blending'],
                          itertools.product([0, 1], repeat=2)):
    clf.fit(X, y)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X, y=y, clf=clf)
    plt.title(lab)
plt.show()

Stacking

Principle

In Blending, the meta-model is trained on only part of the data (the validation set), so the data is not fully used. Stacking instead uses cross validation so that all of the training data contributes to the meta-features.

Steps:

  • Split the data into a training set Train_data and a test set Test_data
  • First-layer models
    • For each of the m base models, run K-fold cross validation on Train_data: stack the predictions on the K validation folds to get $A^k$ (K results stacked, the same length as the training data), and average the K predictions on the test set to get $B^k$ (the same length as the test set); see the sketch after this list
    • Stack the results of the m models together to form $A^{m\times k}$ and $B^{m\times k}$
  • Second-layer model
    • Use $A^{m\times k}$ as input to train the meta-model and obtain model $N$
    • Feed $B^{m\times k}$ into model $N$ to obtain the final prediction on the test set
  • Evaluate the model's performance
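
The StackingCVClassifier used below hides the K-fold bookkeeping. As a minimal hand-rolled sketch of the procedure described above (not exactly what StackingCVClassifier does internally; the variable names A, B, base_models and meta are illustrative), reusing X_train_, y_train_, X_test, y_test from the Blending section:

import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

base_models = [SVC(probability=True), DecisionTreeClassifier(), KNeighborsClassifier()]
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# A: out-of-fold predictions on the training data (one column per base model)
# B: test-set predictions averaged over the K folds (one column per base model)
A = np.zeros((X_train_.shape[0], len(base_models)))
B = np.zeros((X_test.shape[0], len(base_models)))

for i, model in enumerate(base_models):
    fold_test_preds = np.zeros((X_test.shape[0], kf.get_n_splits()))
    for j, (tr_idx, val_idx) in enumerate(kf.split(X_train_)):
        model.fit(X_train_[tr_idx], y_train_[tr_idx])
        A[val_idx, i] = model.predict(X_train_[val_idx])   # out-of-fold part of A^k
        fold_test_preds[:, j] = model.predict(X_test)       # this fold's test prediction
    B[:, i] = fold_test_preds.mean(axis=1)                  # average the K results -> B^k

# Second layer: train the meta-model N on A, then predict from B
meta = LogisticRegression()
meta.fit(A, y_train_)
print('Manual stacking test accuracy:', meta.score(B, y_test))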

Examples

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB 
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingCVClassifier

Model training

clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier()
clf3 = GaussianNB()
lr = LogisticRegression()


clf_stacking = StackingCVClassifier(classifiers=[clf1,clf2,clf3],
                                   meta_classifier=lr,
                                   cv=5)
for clf, label in zip([clf1, clf2, clf3, clf_stacking], ['KNN', 'Random Forest', 'Naive Bayes','StackingClassifier']):
    scores = cross_val_score(clf, X, y, cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
Accuracy: 0.91 (+/- 0.01) [KNN]
Accuracy: 0.95 (+/- 0.01) [Random Forest]
Accuracy: 0.91 (+/- 0.02) [Naive Bayes]
Accuracy: 0.96 (+/- 0.02) [StackingClassifier]

Decision boundary

from mlxtend.plotting import plot_decision_regions
from matplotlib import gridspec
import itertools

gs = gridspec.GridSpec(2,2)
fig =  plt.figure(figsize=(10,8))

for clf,lab,grd in zip([clf1,clf2,clf3,clf_stacking],
                      ['KNN', 
                          'Random Forest', 
                          'Naive Bayes',
                          'StackingCVClassifier'],
                       itertools.product([0,1],repeat=2)
                      ):
    clf.fit(X,y)
    ax = plt.subplot(gs[grd[0],grd[1]])
    fig = plot_decision_regions(X=X,y=y,clf=clf)
    plt.title(lab)
plt.show()

Grid search tuning of the stacking model

from sklearn.model_selection import GridSearchCV

params = {'kneighborsclassifier__n_neighbors': [1, 5],
          'randomforestclassifier__n_estimators': [10, 50],
          'meta_classifier__C': [0.1, 10.0]}

grid = GridSearchCV(estimator=clf_stacking, 
                    param_grid=params, 
                    cv=5,
                    refit=True)
grid.fit(X, y)
print('Best parameters: %s' % grid.best_params_)
print('Accuracy: %.2f' % grid.best_score_)
Best parameters: {'kneighborsclassifier__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 50}
Accuracy: 0.95
clf_stacking.score(X,y)
0.9866666666666667
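
Because refit=True, the GridSearchCV object is refit on all of X with the best parameter combination once the search finishes, so it can be used directly for prediction. A short usage sketch (note that, like clf_stacking.score(X, y) above, scoring on the full training data gives an optimistic estimate compared with the cross-validated accuracy):

best_stack = grid.best_estimator_      # the refit StackingCVClassifier
print(best_stack.score(X, y))          # accuracy on the training data (optimistic)
print(grid.predict(X[:5]))             # GridSearchCV delegates predict to the refit model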

Base models using different feature subsets

from sklearn.pipeline import make_pipeline
from mlxtend.feature_selection import ColumnSelector
iris = load_iris()
X = iris.data
y = iris.target

pipe1 = make_pipeline(ColumnSelector(cols=(0, 2)),  # Select columns 0 and 2
                      LogisticRegression())
pipe2 = make_pipeline(ColumnSelector(cols=(1, 2, 3)),  # Select columns 1, 2 and 3
                      LogisticRegression())

sclf = StackingCVClassifier(classifiers=[pipe1, pipe2], 
                            meta_classifier=LogisticRegression(),
                            random_state=42)

sclf.fit(X, y)
StackingCVClassifier(classifiers=[Pipeline(steps=[('columnselector',
                                                   ColumnSelector(cols=(0, 2))),
                                                  ('logisticregression',
                                                   LogisticRegression())]),
                                  Pipeline(steps=[('columnselector',
                                                   ColumnSelector(cols=(1, 2,
                                                                        3))),
                                                  ('logisticregression',
                                                   LogisticRegression())])],
                     meta_classifier=LogisticRegression(), random_state=42)
from sklearn.model_selection import cross_val_score
cross_val_score(sclf,X,y,cv=5)
array([0.96666667, 0.96666667, 0.9       , 0.96666667, 1.        ])
