The most upvoted stacking (model ensembling) approach for the Kaggle Titanic competition (with code)

I've heard that many top competitors learn from Kaggle and distill what they pick up into their own competition playbook.

I started entering big-data competitions in July this year and have now taken part in almost ten, all on structured (tabular) data. I can still finish near the top in smaller competitions, but larger ones are harder. The problem is that I have no system, so I plan to work through Kaggle systematically: when time permits I will translate and organize the better notebooks and summarize the useful ideas, in the hope of doing better in future competitions.

Focus of this article

Taking the Titanic competition data on Kaggle as an example, this article introduces how to build a stacking ensemble. The original kernel has more than 5,000 upvotes on Kaggle. Original link: kaggle Titanic

After studying it carefully, the following ideas are worth borrowing:

Feature Engineering

  1. Categorical features can be turned into a two-class indicator based on whether the value is missing (e.g. Has_Cabin), or their missing values can be filled with the most frequent category (e.g. Embarked)
  2. A method for filling missing values of the numerical feature Age
  3. Equal-width and equal-frequency discretization of numerical features (see the short sketch after this list)
  4. Text processing: extracting information (titles) from the passengers' names
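
For item 3, here is a minimal sketch (illustrative values only, not from the notebook) contrasting the two binning styles in pandas:

import pandas as pd

fares = pd.Series([5, 8, 10, 15, 40, 80, 120, 500])  # made-up fares for illustration
equal_width = pd.cut(fares, 4)    # equidistant: 4 bins of equal width over the value range
equal_freq  = pd.qcut(fares, 4)   # equal-frequency: 4 bins (quartiles) with roughly equal counts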

Model

  1. A SklearnHelper class is created to reduce repetitive code
  2. The first layer consists of five base models, and the second layer is XGBoost (some articles use linear regression instead)
  3. Each layer-1 base model uses k-fold cross-validation to produce out-of-fold predictions for train and test; these predictions are then concatenated to build new datasets train_new and test_new, while the labels remain the original labels (a minimal shape sketch follows this list)
  4. Finally, the layer-2 model is trained on the new data to produce the final predictions
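
As a rough illustration of item 3 (a sketch only, not the notebook's code; the real construction appears later with get_oof and np.concatenate), the stacked training set simply has one column of out-of-fold predictions per base model:

import numpy as np

n_train, n_models = 891, 5                                       # Titanic train size and 5 base models (assumed here)
oof_columns = [np.zeros((n_train, 1)) for _ in range(n_models)]  # placeholder out-of-fold prediction columns
train_new = np.concatenate(oof_columns, axis=1)                  # shape (891, 5): one column per base model
# The second-level model is then trained on train_new against the original Survived labels.
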
# Import Toolkit
import pandas as pd
import numpy as np
import re
import sklearn
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

import warnings
warnings.filterwarnings('ignore')

# 5 basic models
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, 
                              GradientBoostingClassifier, ExtraTreesClassifier)
from sklearn.svm import SVC
# KFold was moved to sklearn.model_selection in newer scikit-learn versions
from sklearn.model_selection import KFold

Feature Engineering and processing

Like most other kernels, we first explore the data at hand, identify possible feature engineering, and convert the text features to numerical ones.

# Import training data and test data
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

# Store passenger ID
PassengerId = test['PassengerId']

train.head(3)

There is no doubt that we will want to extract information from the categorical features.

Feature Engineering

# Combined training data and test data
full_data = [train, test]

# Name length
train['Name_length'] = train['Name'].apply(len)
test['Name_length'] = test['Name'].apply(len)
# Does the passenger have a cabin number on board
train['Has_Cabin'] = train["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
test['Has_Cabin'] = test["Cabin"].apply(lambda x: 0 if type(x) == float else 1)

# Combine SibSp and Parch features
for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
# Is the passenger travelling alone
for dataset in full_data:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
# If the boarding port is empty, it is filled with class S
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
# Fill missing fare values with the median
for dataset in full_data:
    dataset['Fare'] = dataset['Fare'].fillna(train['Fare'].median())
# Ticket prices are discretized by Quantile    
train['CategoricalFare'] = pd.qcut(train['Fare'], 4)
# Age missing value filling
for dataset in full_data:
    age_avg = dataset['Age'].mean()
    age_std = dataset['Age'].std()
    age_null_count = dataset['Age'].isnull().sum()
    age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null_count)
    dataset.loc[np.isnan(dataset['Age']), 'Age'] = age_null_random_list
    dataset['Age'] = dataset['Age'].astype(int)
# Age is discretized into 5 equal-width bins
train['CategoricalAge'] = pd.cut(train['Age'], 5)
# Define a function to extract the title from a passenger's name
def get_title(name):
    # e.g. get_title('Braund, Mr. Owen Harris') returns 'Mr'
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    # If the title exists, extract and return it.
    if title_search:
        return title_search.group(1)
    return ""
# Create a new feature, called title
for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)
# Group uncommon titles into a single 'Rare' category and map the remaining variants to standard titles
for dataset in full_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')

    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

for dataset in full_data:
    # Gender numerical code
    dataset['Sex'] = dataset['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
    
    # Encode the title value
    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
    
    # Numerical code of boarding port
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
    
    # Fare discretization
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] 						        = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] 							        = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
    
    # Age discretization
    dataset.loc[ dataset['Age'] <= 16, 'Age'] 					       = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4 ;
# feature selection 
drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp']
# Delete the original text feature
train = train.drop(drop_elements, axis = 1)
# Drop the 2 helper features (CategoricalAge, CategoricalFare) that were only created to inspect the bin boundaries
train = train.drop(['CategoricalAge', 'CategoricalFare'], axis = 1)
test  = test.drop(drop_elements, axis = 1)

We have now cleaned the data, extracted the relevant information, and dropped the text columns; all remaining features are numeric and suitable for machine learning models. Before continuing, let's draw some simple correlation and distribution plots.

Visualization

train.head(3)

Pearson correlation heat map

Generate a correlation heat map of the features to see how strongly each feature is correlated with the others.

colormap = plt.cm.RdBu
plt.figure(figsize=(14,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(train.astype(float).corr(),linewidths=0.1,vmax=1.0, 
            square=True, cmap=colormap, linecolor='white', annot=True)

What the plot tells us

One thing the Pearson correlation plot tells us is that not many features are strongly correlated with one another. From the point of view of feeding these features into a learning model, this is good: it means there is not much redundant data in our training set, and each feature carries some unique information. The two most correlated features here are FamilySize and Parch (parents and children). For the purposes of this exercise, I will keep both.

Pairplots

Finally, let's generate some pair plots to observe the distribution of the features.

g = sns.pairplot(train[[u'Survived', u'Pclass', u'Sex', u'Age', u'Parch', u'Fare', u'Embarked',
       u'FamilySize', u'Title']], hue='Survived', palette='seismic', height=1.2,
       diag_kind='kde', diag_kws=dict(fill=True), plot_kws=dict(s=10))
g.set(xticklabels=[])

Ensembling and Stacking Models

# Some parameters
ntrain = train.shape[0]
ntest = test.shape[0]
SEED = 0 # for reproducibility (used as random_state for the base models)
NFOLDS = 5 # number of folds for out-of-fold predictions
kf = KFold(n_splits=NFOLDS)  # newer KFold API; the actual splits are generated in get_oof via kf.split()

# Create a machine learning extension class
class SklearnHelper(object):
    def __init__(self, clf, seed=0, params=None):
        params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        return self.clf.predict(x)
    
    def fit(self,x,y):
        return self.clf.fit(x,y)
    
    def feature_importances(self,x,y):
        print(self.clf.fit(x,y).feature_importances_)
    
# (Placeholder) A similar class could be written to extend the XGBoost classifier; it is not needed here

For readers who already know this, bear with me; for those who have not created classes or objects in Python before, let me explain what the code above does. When creating the base classifiers, I will only use models that already exist in the sklearn library, so the class only wraps those.

__init__: the standard Python constructor of the class. It means that when you create an object (classifier), you must provide clf (the sklearn classifier you want), seed (the random seed), and params (the classifier's parameters).

The rest of the code consists of the class's methods, which simply call the corresponding methods that already exist in the sklearn classifier. In essence, we create a wrapper class around the various sklearn classifiers so that, when we implement multiple learners, we avoid writing the same code over and over (a short usage sketch follows).
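
For example, a minimal usage sketch of the wrapper (the parameter values here are placeholders, not the ones used later in this article):

example_params = {'n_estimators': 100, 'max_depth': 4}  # placeholder values for illustration
example_rf = SklearnHelper(clf=RandomForestClassifier, seed=SEED, params=example_params)
# example_rf.train(x_train, y_train)      # fit the wrapped model (x_train/y_train are built further below)
# example_preds = example_rf.predict(x_test)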

Out-of-Fold Predictions

Now, as mentioned above, stacking uses the predictions of the base classifiers as input for training a second-level model. However, one cannot simply train the base models on the full training data, generate predictions for it, and feed those predictions into the second-level training: the base models would have already "seen" the labels of the rows they are predicting, so the second-level model risks overfitting to those predictions. Instead, we use out-of-fold predictions, where each training row is predicted by a model that did not see it during training.

def get_oof(clf, x_train, y_train, x_test):
    """Out-of-fold predictions for the training set; fold-averaged predictions for the test set."""
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))

    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]

        # Train on the in-fold data, then predict the held-out fold and the full test set
        clf.train(x_tr, y_tr)

        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)

    # Average the test-set predictions over all folds
    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)
# (This code is very similar to page 89 of the book; it turns out much of that material comes from Kaggle.)

Generating the first-level models

Now let's prepare five learning models as our first-level classifiers. They can all be invoked conveniently via the sklearn library, as listed below:

  1. Random Forest classifier
  2. Extra Trees classifier
  3. AdaBoost classifier
  4. Gradient Boosting classifier
  5. Support Vector Machine

Parameters

For completeness, here is a brief summary of the parameters:

n_jobs: the number of cores used for training. If set to -1, all available cores are used

n_estimators: the number of classifiers in the learning model

max_depth: the maximum depth of the tree, i.e. how far nodes can be expanded. Note that setting this too high risks overfitting

verbose: controls whether to output any text during the learning process. A value of 0 suppresses all text, while a value of 3 outputs the tree learning process at each iteration.

Please check the full description on the official Sklearn website. Here, you will find many other useful parameters that you can use.

# Parameters for each of the classifiers
# Random Forest parameters
rf_params = {
    'n_jobs': -1,
    'n_estimators': 500,
     'warm_start': True, 
     #'max_features': 0.2,
    'max_depth': 6,
    'min_samples_leaf': 2,
    'max_features' : 'sqrt',
    'verbose': 0
}

# Extra Trees Parameters
et_params = {
    'n_jobs': -1,
    'n_estimators':500,
    #'max_features': 0.5,
    'max_depth': 8,
    'min_samples_leaf': 2,
    'verbose': 0
}

# AdaBoost parameters
ada_params = {
    'n_estimators': 500,
    'learning_rate' : 0.75
}

# Gradient Boosting parameters
gb_params = {
    'n_estimators': 500,
     #'max_features': 0.2,
    'max_depth': 5,
    'min_samples_leaf': 2,
    'verbose': 0
}

# Support Vector Classifier parameters
svc_params = {
    'kernel' : 'linear',
    'C' : 0.025
    }

Furthermore, having talked about objects and classes in the OOP framework, let us now create five objects representing our five learning models via the SklearnHelper class we defined earlier.

# Create 5 objects that represent our 5 models
rf = SklearnHelper(clf=RandomForestClassifier, seed=SEED, params=rf_params)
et = SklearnHelper(clf=ExtraTreesClassifier, seed=SEED, params=et_params)
ada = SklearnHelper(clf=AdaBoostClassifier, seed=SEED, params=ada_params)
gb = SklearnHelper(clf=GradientBoostingClassifier, seed=SEED, params=gb_params)
svc = SklearnHelper(clf=SVC, seed=SEED, params=svc_params)

Building arrays from the training and test sets

After preparing the first-level base models, we can now prepare the training and test inputs by converting the original DataFrames into NumPy arrays to feed into our classifiers, as follows:

# Create Numpy arrays of train, test and target ( Survived) dataframes to feed into our models
y_train = train['Survived'].ravel()
train = train.drop(['Survived'], axis=1)
x_train = train.values # Creates an array of the train data
x_test = test.values # Creates an array of the test data

Output of the first-level predictions

Now we feed the training and test data into the five base classifiers and use the out-of-fold prediction function defined earlier to generate our first-level predictions. Allow a few minutes for the following code block to run.

# Create training set and test set predictions 
et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test) # Extra Trees
rf_oof_train, rf_oof_test = get_oof(rf,x_train, y_train, x_test) # Random Forest
ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test) # AdaBoost 
gb_oof_train, gb_oof_test = get_oof(gb,x_train, y_train, x_test) # Gradient Boost
svc_oof_train, svc_oof_test = get_oof(svc,x_train, y_train, x_test) # Support Vector Classifier

print("Training is complete")

Feature importance generated from different classifiers

Now, having trained the first-level classifiers, we can take advantage of a very convenient feature of sklearn models: outputting the importances of the various features in the training set with one very simple line of code.

According to the sklearn documentation, most tree-based classifiers expose a built-in attribute, .feature_importances_, that returns the feature importances. We will access this very useful attribute through the feature_importances method of our SklearnHelper class defined earlier and plot the importances.

rf_feature = rf.feature_importances(x_train,y_train)
et_feature = et.feature_importances(x_train, y_train)
ada_feature = ada.feature_importances(x_train, y_train)
gb_feature = gb.feature_importances(x_train,y_train)
rf_features = [0.10474135,  0.21837029,  0.04432652,  0.02249159,  0.05432591,  0.02854371
  ,0.07570305,  0.01088129 , 0.24247496,  0.13685733 , 0.06128402]
et_features = [ 0.12165657,  0.37098307  ,0.03129623 , 0.01591611 , 0.05525811 , 0.028157
  ,0.04589793 , 0.02030357 , 0.17289562 , 0.04853517,  0.08910063]
ada_features = [0.028 ,   0.008  ,      0.012   ,     0.05866667,   0.032 ,       0.008
  ,0.04666667 ,  0.     ,      0.05733333,   0.73866667,   0.01066667]
gb_features = [ 0.06796144 , 0.03889349 , 0.07237845 , 0.02628645 , 0.11194395,  0.04778854
  ,0.05965792 , 0.02774745,  0.07462718,  0.4593142 ,  0.01340093]
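
Note that the feature_importances method of the helper class only prints the values rather than returning them, which is why the lists above were copied by hand from the printed output. A small alternative one could use instead (a sketch, not part of the original kernel) is to read the importances off the fitted estimator directly:

def get_feature_importances(helper, x, y):
    # Fit the wrapped sklearn classifier and return its feature_importances_ array
    helper.fit(x, y)
    return helper.clf.feature_importances_

# e.g. (hypothetical): rf_features = get_feature_importances(rf, x_train, y_train)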

Create a dataframe

cols = train.columns.values
# Create a dataframe with features
feature_dataframe = pd.DataFrame( {'features': cols,
     'Random Forest feature importances': rf_features,
     'Extra Trees  feature importances': et_features,
      'AdaBoost feature importances': ada_features,
    'Gradient Boost feature importances': gb_features
    })

Plot the feature importances as interactive scatter plots

# Scatter plots of the feature importances, one per first-level model
importance_plots = [
    ('Random Forest feature importances', 'Random Forest Feature Importance'),
    ('Extra Trees  feature importances', 'Extra Trees Feature Importance'),
    ('AdaBoost feature importances', 'AdaBoost Feature Importance'),
    ('Gradient Boost feature importances', 'Gradient Boosting Feature Importance'),
]

for column, plot_title in importance_plots:
    trace = go.Scatter(
        y = feature_dataframe[column].values,
        x = feature_dataframe['features'].values,
        mode='markers',
        marker=dict(
            sizemode = 'diameter',
            sizeref = 1,
            size = 25,
            color = feature_dataframe[column].values,
            colorscale='Portland',
            showscale=True
        ),
        text = feature_dataframe['features'].values
    )
    layout = go.Layout(
        autosize= True,
        title= plot_title,
        hovermode= 'closest',
        yaxis=dict(
            title= 'Feature Importance',
            ticklen= 5,
            gridwidth= 2
        ),
        showlegend= False
    )
    fig = go.Figure(data=[trace], layout=layout)
    py.iplot(fig, filename='scatter2010')


Now let's compute the mean of all the feature importances and store it as a new column in the feature-importance DataFrame.

# mean value

feature_dataframe['mean'] = feature_dataframe.mean(axis=1, numeric_only=True) # row-wise mean over the numeric importance columns
feature_dataframe.head(3)

Feature importance bar chart

y = feature_dataframe['mean'].values
x = feature_dataframe['features'].values
data = [go.Bar(
            x= x,
             y= y,
            width = 0.5,
            marker=dict(
               color = feature_dataframe['mean'].values,
            colorscale='Portland',
            showscale=True,
            reversescale = False
            ),
            opacity=0.6
        )]

layout= go.Layout(
    autosize= True,
    title= 'Barplots of Mean Feature Importance',
    hovermode= 'closest',
#     xaxis= dict(
#         title= 'Pop',
#         ticklen= 5,
#         zeroline= False,
#         gridwidth= 2,
#     ),
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='bar-direct-labels')

Second-level predictions

First-level output as new features

Now that we have obtained our first-level predictions, we can think of them as essentially a new set of features to be used as training data for the next classifier. As per the code below, the new columns contain the first-level predictions from the earlier classifiers, and we train the next classifier on top of them.

base_predictions_train = pd.DataFrame( {'RandomForest': rf_oof_train.ravel(),
     'ExtraTrees': et_oof_train.ravel(),
     'AdaBoost': ada_oof_train.ravel(),
      'GradientBoost': gb_oof_train.ravel()
    })
base_predictions_train.head()

Correlation heat map

data = [
    go.Heatmap(
        z= base_predictions_train.astype(float).corr().values ,
        x=base_predictions_train.columns.values,
        y= base_predictions_train.columns.values,
          colorscale='Viridis',
            showscale=True,
            reversescale = True
    )
]
py.iplot(data, filename='labelled-heatmap')

x_train = np.concatenate(( et_oof_train, rf_oof_train, ada_oof_train, gb_oof_train, svc_oof_train), axis=1)
x_test = np.concatenate(( et_oof_test, rf_oof_test, ada_oof_test, gb_oof_test, svc_oof_test), axis=1)

Second-level learning model: XGBoost

Here we choose the very well-known boosted-tree library XGBoost, an optimized, scalable implementation of gradient boosted trees.

gbm = xgb.XGBClassifier(
    #learning_rate = 0.02,
 n_estimators= 2000,
 max_depth= 4,
 min_child_weight= 2,
 #gamma=1,
 gamma=0.9,                        
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread= -1,
 scale_pos_weight=1).fit(x_train, y_train)
predictions = gbm.predict(x_test)
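
As a quick sanity check (not part of the original notebook), one could cross-validate the second-level model on the stacked out-of-fold training features before submitting:

from sklearn.model_selection import cross_val_score

# cross_val_score clones the estimator, so the already-fitted gbm can be passed directly
cv_scores = cross_val_score(gbm, x_train, y_train, cv=5, scoring='accuracy')
print("Second-level CV accuracy: %.4f (+/- %.4f)" % (cv_scores.mean(), cv_scores.std()))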

Generate submission file

# Generate Submission File 
StackingSubmission = pd.DataFrame({ 'PassengerId': PassengerId,
                            'Survived': predictions })
StackingSubmission.to_csv("StackingSubmission.csv", index=False)

My level is limited, so please bear with anything that isn't explained clearly. If you found this helpful, likes and bookmarks are welcome!

