Kaggle's classic data analysis project: Titanic survival prediction! 1. Data overview and visualization, 2. Data preprocessing, 3. Model training, 4. Model optimization (parameter tuning)

Author: Chen Kai, Datawhale member, Sun Yat-sen University

Recently, many readers have left messages hoping for a complete data analysis project to practice on. After collecting recommendations from Datawhale members over the past few days, Titanic survival prediction, the most classic introductory data analysis project, seems the most suitable for beginners. More advanced data analysis projects will be shared later. If you already have some experience, we recommend:

1. Open source project "hands on data analysis":

https://github.com/datawhalechina/hands-on-data-analysis

  2. DCIC 2020 algorithm analysis competition: DCIC is one of the few classic events in China that opens up real government data, providing a good opportunity for hands-on practice and academic research.

https://mp.weixin.qq.com/s/-fzQIlZRig0hqSm7GeI_Bw

The full text is as follows:

Using the Titanic survival prediction task, this article walks through the complete workflow in four steps: 1. data exploration (visualization), 2. data preprocessing, 3. model training, and 4. model parameter tuning:

![](https://imgconvert.csdnimg.cn/aHR0cHM6Ly9tbWJpei5xcGljLmNuL21tYml6X3BuZy92STluWWU5NGZzR3J6SVRicUdpYm83WEZ2cXpvMXE3eThpYmljaFRpY2ZObDZsRk5nbDdpYjF1TGRSdUFpYWcxWEM2bTlJZzlVeFliQWNTZlpyZTFjSUV0NDVjZy82NDA?x-oss-
process=image/format,png)

# 1. Data overview and visualization

## 1.1 Data overview

First, we import the libraries we need and read the training and test data:

The dataset consists of two files, train.csv and test.csv. You can get the download link by replying "data set" in the Datawhale public account, or download the files directly from the Kaggle competition page.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    train_data = pd.read_csv("input/train.csv", index_col=0)
    test_data = pd.read_csv("input/test.csv", index_col=0)
    train_data.head()

![](https://imgconvert.csdnimg.cn/aHR0cHM6Ly9tbWJpei5xcGljLmNuL21tYml6X3BuZy92STluWWU5NGZzRnFwM1pON2ptM05ydEpxZUpycTBuTXRiSHhzZVpJa3dXcG5vR242WFl0VkpvQ0dvUktOWTMwaWJVcnM4WWJNTzJuSHRFcjZDUDBoSlEvNjQw?x-oss-
process=image/format,png)

    train_data.describe()

![](https://imgconvert.csdnimg.cn/aHR0cHM6Ly9tbWJpei5xcGljLmNuL21tYml6X3BuZy92STluWWU5NGZzRnFwM1pON2ptM05ydEpxZUpycTBuTUYwYk9LbkdGd2FRZks3VXJFMlJ0enhXaEJtZ0V1RlVpY1hDbjY1Q0lsbWFLS1VtbEFrOW5NMEEvNjQw?x-oss-
process=image/format,png)
Through the describe() function we can quickly see which columns are numeric and which are strings. String columns need to be converted to numbers for modeling, for example via 0-1 encoding. Note also that some numeric columns should not be used as-is: the Pclass feature, as its name suggests, identifies the cabin class and only takes values in [1, 2, 3]. It should not be fed into a classification model as a plain number; instead it should be converted into a one-hot encoding that distinguishes the different passenger classes. This will be done in the data preprocessing step.
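As a quick preview of that step (a minimal sketch, not the final preprocessing code; the real encoding happens in section 2.3 on the combined dataset), one-hot encoding Pclass with pandas could look like this:

    # a minimal sketch: one-hot encode Pclass so the model sees three separate
    # indicator columns instead of an ordinal-looking number (the actual
    # preprocessing in section 2.3 does this on the combined dataset)
    pclass_dummies = pd.get_dummies(train_data["Pclass"].astype(str), prefix="Pclass")
    print(pclass_dummies.head())  # columns: Pclass_1, Pclass_2, Pclass_3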

Next, let's look at the columns containing null values; these are what we will need to deal with later:

    train_data.isnull().sum().sort_values(ascending=False).head(4)

The output is:

    Cabin       687
    Age         177
    Embarked      2
    Fare          0
    dtype: int64

## 1.2 Data visualization

To make this article a bit more visual, we can draw a few plots to show the data. Readers who want to go straight to data preprocessing can skip this part; most of it comes from a notebook on the Kaggle website.

### 1.2.1 Gender and survival

First of all, we should remember the touching "women first" strategy in the film:

    sns.barplot(x="Sex", y="Survived", data=train_data)

![](https://imgconvert.csdnimg.cn/aHR0cHM6Ly9tbWJpei5xcGljLmNuL21tYml6X3BuZy92STluWWU5NGZzRnFwM1pON2ptM05ydEpxZUpycTBuTWRwNGJsOHlacjA0Q01XaWJ2MVEyMHQzczBkOWljNExpYlc3bkNMV2FHb0V3cUZpYkpZME1TRDJWdHcvNjQw?x-oss-
process=image/format,png)

Here we can see that the survival rate of women is much higher than that of men, which is in line with the plot of the film.

### 1.2.2 Passenger class (social class) and survival rate

We can also guess that passengers in different cabin classes should have different survival rates:

    # the visualization code below comes from a Kaggle notebook that uses the
    # names `train` and `test`; work on copies so the originals stay untouched
    train = train_data.copy()
    test = test_data.copy()

    #draw a bar plot of survival by Pclass
    sns.barplot(x="Pclass", y="Survived", data=train)
    
    #print percentage of people by Pclass that survived
    print("Percentage of Pclass = 1 who survived:", train["Survived"][train["Pclass"] == 1].value_counts(normalize = True)[1]*100)
    
    print("Percentage of Pclass = 2 who survived:", train["Survived"][train["Pclass"] == 2].value_counts(normalize = True)[1]*100)
    
    print("Percentage of Pclass = 3 who survived:", train["Survived"][train["Pclass"] == 3].value_counts(normalize = True)[1]*100)

![](https://imgconvert.csdnimg.cn/aHR0cHM6Ly9tbWJpei5xcGljLmNuL21tYml6X3BuZy92STluWWU5NGZzRnFwM1pON2ptM05ydEpxZUpycTBuTUVNYklvV3hhZHR5a2JpY2NhZWliOFNGSXlaZmEwYksyMzQzMHZEM2ZRUGFZWGJCajV6R3U3ZDZ3LzY0MA?x-oss-
process=image/format,png)

    Percentage of Pclass = 1 who survived: 62.96296296296296
    Percentage of Pclass = 2 who survived: 47.28260869565217
    Percentage of Pclass = 3 who survived: 24.236252545824847

The result is quite realistic: a better cabin class naturally comes with a higher survival rate; otherwise, why pay the extra money? People are not equal even in the face of life and death.

As predicted, people with higher socioeconomic class had a higher rate of
survival. (62.9% vs. 47.3% vs. 24.2%)

### 1.2.3 Number of family members and survival rate

    #draw a bar plot for SibSp vs. survival
    sns.barplot(x="SibSp", y="Survived", data=train)
    
    #I won't be printing individual percent values for all of these.
    print("Percentage of SibSp = 0 who survived:", train["Survived"][train["SibSp"] == 0].value_counts(normalize = True)[1]*100)
    
    print("Percentage of SibSp = 1 who survived:", train["Survived"][train["SibSp"] == 1].value_counts(normalize = True)[1]*100)
    
    print("Percentage of SibSp = 2 who survived:", train["Survived"][train["SibSp"] == 2].value_counts(normalize = True)[1]*100)
    Percentage of SibSp = 0 who survived: 34.53947368421053
    Percentage of SibSp = 1 who survived: 53.588516746411486
    Percentage of SibSp = 2 who survived: 46.42857142857143

![](https://imgconvert.csdnimg.cn/aHR0cHM6Ly9tbWJpei5xcGljLmNuL21tYml6X3BuZy92STluWWU5NGZzRnFwM1pON2ptM05ydEpxZUpycTBuTWJCaWNranpyWDFiVXp6TnZuUjBDNm9CZk1EUkp0M2hGNkh0aWJmVWJGa0FaaDM5OG1tb0JSOFlBLzY0MA?x-oss-
process=image/format,png)

Here we can see that passengers with one sibling or spouse aboard generally have a higher survival rate, so go encourage your parents to give you a sibling~

In general, it's clear that people with more siblings or spouses aboard were
less likely to survive. However, contrary to expectations, people with no
siblings or spouses were less likely to survive than those with one or
two. (34.5% vs. 53.6% vs. 46.4%)

    #draw a bar plot for Parch vs. survival
    sns.barplot(x="Parch", y="Survived", data=train)
    plt.show()

![](https://imgconvert.csdnimg.cn/aHR0cHM6Ly9tbWJpei5xcGljLmNuL21tYml6X3BuZy92STluWWU5NGZzRnFwM1pON2ptM05ydEpxZUpycTBuTWlibVpGRGpCUnFSZlk1TmlhVWlhaWFDVG16d0xNaEx0WWR3QUNqWGFrYmVWWnFockFTTUN4bksxNFEvNjQw?x-oss-
process=image/format,png)

It seems that people traveling alone have a lower survival rate; it brings tears to your eyes just thinking about it.

People with less than four parents or children aboard are more likely to
survive than those with four or more. Again, people traveling alone are less
likely to survive than those with 1-3 parents or children.

### 1.2.4 Age and survival

    #sort the ages into logical categories
    train["Age"] = train["Age"].fillna(-0.5)
    test["Age"] = test["Age"].fillna(-0.5)
    bins = [-1, 0, 5, 12, 18, 24, 35, 60, np.inf]
    labels = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Senior']
    train['AgeGroup'] = pd.cut(train["Age"], bins, labels = labels)
    test['AgeGroup'] = pd.cut(test["Age"], bins, labels = labels)
    
    #draw a bar plot of Age vs. survival
    sns.barplot(x="AgeGroup", y="Survived", data=train)
    plt.show()

![](https://imgconvert.csdnimg.cn/aHR0cHM6Ly9tbWJpei5xcGljLmNuL21tYml6X3BuZy92STluWWU5NGZzRnFwM1pON2ptM05ydEpxZUpycTBuTWh2STFTbVd5b0FxRFVKQ0tMYk1oSFNTMWljN2cyQ3p0aWFNS0JaSmxIVXhKMzZWeW80OE5RWGJ3LzY0MA?x-oss-
process=image/format,png)
This chart is drawn with the pandas cut() method, which bins the data into intervals. The obvious conclusion: the survival rate of babies is remarkably high (I suspect a large part of the reason is that they don't take up much space).

### 1.2.5 Presence of a cabin record and survival rate

This is a strange indicator. According to the author:

I think the idea here is that people with recorded cabin numbers are of
higher socioeconomic class, and thus more likely to survive.

Well, let's see:

    test["CabinBool"] = (test["Cabin"].notnull().astype('int'))
    
    #calculate percentages of CabinBool vs. survived
    print("Percentage of CabinBool = 1 who survived:", train["Survived"][train["CabinBool"] == 1].value_counts(normalize = True)[1]*100)
    
    print("Percentage of CabinBool = 0 who survived:", train["Survived"][train["CabinBool"] == 0].value_counts(normalize = True)[1]*100)
    #draw a bar plot of CabinBool vs. survival
    sns.barplot(x="CabinBool", y="Survived", data=train)
    plt.show()
>     Percentage of CabinBool = 1 who survived: 66.66666666666666
>     Percentage of CabinBool = 0 who survived: 29.985443959243085
>  
[/code]

![](https://imgconvert.csdnimg.cn/aHR0cHM6Ly9tbWJpei5xcGljLmNuL21tYml6X3BuZy92STluWWU5NGZzRnFwM1pON2ptM05ydEpxZUpycTBuTVVHNjhNU2d4QXVhSFdHUEl3RkNzbGhRY1R2SmV0YldJeXd4VElOcWRkaWJpY3Z0aWMyenRCWk1Qdy82NDA?x-oss-
process=image/format,png)  

A wild idea, but the result turns out surprisingly well~

### 1.2.6 Correlation heatmap

We can also draw a nice correlation heatmap of the data, although it is of limited use:

![](https://imgconvert.csdnimg.cn/aHR0cHM6Ly9tbWJpei5xcGljLmNuL21tYml6X3BuZy92STluWWU5NGZzRnFwM1pON2ptM05ydEpxZUpycTBuTWZGYXlpYXVwZU9taWFFajkxZ1JyZzRheHoxRnI4NGhpY1JpYVR3ZHU4aDZkVXlaWDFpYVI3ZEJpYjRody82NDA?x-oss-
process=image/format,png)  
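The plotting code for the heatmap is not included in the original article; a minimal sketch with seaborn, assuming we only correlate the numeric columns of train_data, could be:

    # a minimal sketch (the original plotting code is not shown): correlation
    # heatmap over the numeric columns of the training data
    corr = train_data.select_dtypes(include="number").corr()
    plt.figure(figsize=(8, 6))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
    plt.show()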

#  2. Data preprocessing

## 2.1 Concatenate the datasets

First, pop the Survived column out of the training data: it is the target we need to predict, and it exists only in train_data, not in test_data. Then we concatenate the training set and the test set so that preprocessing can be applied to both at once. Of course, in practice we would never see the test data, but in a competition it is convenient to process them together:

    y_train = train_data.pop("Survived")
    data_all = pd.concat((train_data, test_data), axis=0)

## 2.2 Process the Name feature and extract the Title

Scanning the columns from left to right, the Name feature stands out. Many people would simply drop it, but a careful look shows that each name contains a title such as "Mr.", "Mrs.", or "Miss". Anyone who has studied even a little English knows that the title reflects social status and marital status to some extent. We can map this feature as follows:

    title = pd.DataFrame()
    title["Title"] = data_all["Name"].map(lambda name:name.split(",")[1].split(".")[0].strip())
    # title.head()
    Title_Dictionary = {
        "Capt":       "Officer",
        "Col":        "Officer",
        "Major":      "Officer",
        "Jonkheer":   "Royalty",
        "Don":        "Royalty",
        "Sir" :       "Royalty",
        "Dr":         "Officer",
        "Rev":        "Officer",
        "the Countess":"Royalty",
        "Dona":       "Royalty",
        "Mme":        "Mrs",
        "Mlle":       "Miss",
        "Ms":         "Mrs",
        "Mr" :        "Mr",
        "Mrs" :       "Mrs",
        "Miss" :      "Miss",
        "Master" :    "Master",
        "Lady" :      "Royalty"
    }
    title[ 'Title' ] = title.Title.map(Title_Dictionary)
    title = pd.get_dummies(title.Title)
    # title.head()
    data_all = pd.concat((data_all, title), axis=1)
    data_all.pop("Name")
    data_all.head()

What does the code above do? Many of the raw titles, such as "Don", "Sir", and "Jonkheer", appear very rarely (fewer than ten occurrences each), so we group titles with similar meanings into a single category to make things easier for the model. We then use get_dummies to convert these categories into one-hot vectors. The result looks like this:

![](https://imgconvert.csdnimg.cn/aHR0cHM6Ly9tbWJpei5xcGljLmNuL21tYml6X3BuZy92STluWWU5NGZzRnFwM1pON2ptM05ydEpxZUpycTBuTVF5aWNFdVdRbkw5cERxTjZzaWFUZGljY0JlU2p4eXRLaWNGUDI3M3VWaWF2S0o3WDNTcllNaWJKdHdEZy82NDA?x-oss-
process=image/format,png)

## 2.3 Extract other features

The Ticket feature is troublesome, so let's be lazy and drop it for now. The Cabin feature, on the other hand, should be quite useful: the distance from a passenger's cabin to the escape routes varies with its location on the ship. We simply extract the deck letters (A, B, C, D, ...) as features and ignore the numbers in values like C85 and C123 (which indicate the exact position within a deck). Since the escape routes may be spread across several decks, extracting those numbers as well might work even better, but for simplicity we won't go into that here:

    data_all["Cabin"].fillna("NA", inplace=True)
    data_all["Cabin"] = data_all["Cabin"].map(lambda s:s[0])
    data_all.pop("Ticket")

As mentioned earlier, Pclass is better treated as a one-hot feature. We first convert it to a string and then encode it; while we're at it, we one-hot encode the other categorical features as well:

    data_all["Pclass"] = data_all["Pclass"].astype(str)
    feature_dummies = pd.get_dummies(data_all[["Pclass", "Sex", "Embarked", "Cabin"]])
    # feature_dummies.head()
    data_all.drop(["Pclass", "Sex", "Embarked", "Cabin"], inplace=True, axis=1)
    data_all = pd.concat((data_all, feature_dummies), axis=1)
    data_all.head()

So we expanded the feature set from the original 11 columns to 27 columns. Oh, no, we forgot to fill in the missing values. It's not too late to do it now:

    mean_cols = data_all.mean()
    data_all = data_all.fillna(mean_cols)

Here, the mean is used to fill in the remaining missing values. Since Age is a numerical feature, mean filling is reasonable; Embarked has only two missing values, so how it gets filled hardly matters~

## 2.4 Re-separate the training set and the test set

Before building models, don't forget that we concatenated the training and test sets. Remember the index_col we set when we first read the data? It comes in handy here:

    train_df = data_all.loc[train_data.index]
    test_df = data_all.loc[test_data.index]
    print(train_df.shape, test_df.shape)

The printed result is (891, 27) (418, 27), which matches the sizes of the original training and test sets. That concludes our rough data preprocessing; now let's build some models~

# 3. Model training

## 3.1 Random Forest

First, import what we need from sklearn:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    import sklearn

Then set different maximum tree depths for parameter tuning:

    %matplotlib inline
    depth_ = [1, 2, 3, 4, 5, 6, 7, 8]
    scores = []
    for depth in depth_:
        clf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
        test_score = cross_val_score(clf, train_df, y_train, cv=10, scoring="precision")
        scores.append(np.mean(test_score))
    plt.plot(depth_, scores)

![](https://imgconvert.csdnimg.cn/aHR0cHM6Ly9tbWJpei5xcGljLmNuL21tYml6X3BuZy92STluWWU5NGZzRnFwM1pON2ptM05ydEpxZUpycTBuTURwM0dkZzZwaWJMcEFDUWUzdG9YRzZ4VmVlWW9MT2V5b1RVa2syU0tsT2ZUS3dOelEzeFJNWFEvNjQw?x-oss-
process=image/format,png)
We obtain a curve that roughly shows how the score varies with the maximum tree depth: 6 looks best, with a cross-validated precision of about 0.84. Of course, we could keep tuning other parameters for better results (see the sketch below), but next we will move on to other models.
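As a sketch of what tuning several parameters at once might look like (this is not in the original article, and the parameter grid below is just an illustrative assumption), GridSearchCV can search them jointly:

    from sklearn.model_selection import GridSearchCV

    # a hedged sketch, not the author's code: jointly search a small,
    # illustrative parameter grid for the random forest
    param_grid = {
        "n_estimators": [100, 200],
        "max_depth": [4, 6, 8],
        "min_samples_leaf": [1, 3, 5],
    }
    grid = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid,
        cv=10,
        scoring="precision",
    )
    grid.fit(train_df, y_train)
    print(grid.best_params_, grid.best_score_)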

## 3.2 Gradient Boosting Classifier

The code is similar to the above:

    from sklearn.ensemble import GradientBoostingClassifier
    depth_ = [1, 2, 3, 4, 5, 6, 7, 8]
    scores = []
    for depth in depth_:
        clf = GradientBoostingClassifier(n_estimators=100, max_depth=depth, random_state=0)
        test_score = cross_val_score(clf, train_df, y_train, cv=10, scoring="precision")
        scores.append(np.mean(test_score))
    plt.plot(depth_, scores)

![](https://imgconvert.csdnimg.cn/aHR0cHM6Ly9tbWJpei5xcGljLmNuL21tYml6X3BuZy92STluWWU5NGZzRnFwM1pON2ptM05ydEpxZUpycTBuTUJxRnl5cW55aEF2WjkzZm1TOExFR1BCaWNsU2FLMU44RFFBakFzaWNWaHJieERWbHdac3JOTldBLzY0MA?x-oss-
process=image/format,png)

The highest score here appears to be close to 0.82.

## 3.3 Bagging

Bagging trains many small classifiers, each on a random subset of the data, and then combines their outputs by majority vote:

    from sklearn.ensemble import BaggingClassifier
    params = [1, 10, 15, 20, 25, 30, 40]
    test_scores = []
    
    for param in params:
        clf = BaggingClassifier(n_estimators=param)
        test_score = cross_val_score(clf, train_df, y_train, cv=10, scoring="precision")
        test_scores.append(np.mean(test_score))
    plt.plot(params, test_scores)

The results are unstable and bad:

![](https://imgconvert.csdnimg.cn/aHR0cHM6Ly9tbWJpei5xcGljLmNuL21tYml6X3BuZy92STluWWU5NGZzRnFwM1pON2ptM05ydEpxZUpycTBuTU1mSmliNTF5NzdRYUdRRm91VlBWMTZPRXNzaHd3aWFIUmxMUWRNWWlhOGY2TmpQZU5tZFVFQjNuUS82NDA?x-oss-
process=image/format,png)

## 3.4 RidgeClassifier

Without further ado, let's just try them one by one:

    from sklearn.linear_model import RidgeClassifier
    alphas = np.logspace(-3, 2, 50)
    test_scores = []
    
    for alpha in alphas:
        clf = RidgeClassifier(alpha)
        test_score = cross_val_score(clf, train_df, y_train, cv=10, scoring="precision")
        test_scores.append(np.mean(test_score))
    plt.plot(alphas, test_scores)

![](https://imgconvert.csdnimg.cn/aHR0cHM6Ly9tbWJpei5xcGljLmNuL21tYml6X3BuZy92STluWWU5NGZzRnFwM1pON2ptM05ydEpxZUpycTBuTU1RWHlmY2RMRzBpY2Rab1hHWlN4M2RsUFJ4NzNhdlJaM0lTMHlpYU5aUmYwWjVpY3VZdm9CbUVHUS82NDA?x-oss-
process=image/format,png)

## 3.5 RidgeClassifier + Bagging

    ridge = RidgeClassifier(alpha=5)
    params = [1, 10, 15, 20, 25, 30, 40]
    test_scores = []
    
    for param in params:
        clf = BaggingClassifier(n_estimators=param, base_estimator=ridge)
        test_score = cross_val_score(clf, train_df, y_train, cv=10, scoring="precision")
        test_scores.append(np.mean(test_score))
    plt.plot(params, test_scores)

![](https://imgconvert.csdnimg.cn/aHR0cHM6Ly9tbWJpei5xcGljLmNuL21tYml6X3BuZy92STluWWU5NGZzRnFwM1pON2ptM05ydEpxZUpycTBuTUE3UFZNNjFHTmozWjFtdWJSMkRsQThSOHVBNXdXdVN1bWRJTUtBeGROUFBZSGQ1d3RtbUNnZy82NDA?x-oss-
process=image/format,png)

The result is slightly better than the Bagging strategy using the default model.

## 3.6 XGBClassifier

    from xgboost import XGBClassifier
    params = [1, 2, 3, 4, 5, 6]
    test_scores = []
    for param in params:
        clf = XGBClassifier(max_depth=param)
        test_score = cross_val_score(clf, train_df, y_train, cv=10, scoring="precision")
        test_scores.append(np.mean(test_score))
    plt.plot(params, test_scores)

![](https://imgconvert.csdnimg.cn/aHR0cHM6Ly9tbWJpei5xcGljLmNuL21tYml6X3BuZy92STluWWU5NGZzRnFwM1pON2ptM05ydEpxZUpycTBuTVlpY3VXQWJYUlBqYzh3bkhrMFpMaWJ2MEp0cEVXMXpNUmljeFFxSGNxaWJqanppY1ppYnkzUmx6N20wQS82NDA?x-oss-
process=image/format,png)

## 3.7 Neural network

First, we build a simple neural network with Keras:

    import tensorflow as tf
    import keras
    from keras.models import Sequential
    from keras.layers import *
    
    # note: this Adam optimizer is constructed but never passed to compile()
    # below, which uses the string 'adam' (i.e. default settings) instead
    tf.keras.optimizers.Adam(
        learning_rate=0.003, beta_1=0.9, beta_2=0.999, epsilon=1e-07, amsgrad=False,
        name='Adam',
    )
    model = Sequential()
    model.add(Dense(32, input_dim=train_df.shape[1], kernel_initializer='uniform', activation='relu'))
    model.add(Dense(32, kernel_initializer='uniform', activation='relu'))
    model.add(Dropout(0.4))
    model.add(Dense(32, kernel_initializer='uniform', activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

Then fit the model on train_df:

    history = model.fit(np.array(train_df), np.array(y_train), epochs=20, batch_size=50, validation_split = 0.2)

The results of the last round are:

    Epoch 20/20
    712/712 [==============================] - 0s 43us/step - loss: 0.4831 - accuracy: 0.7978 - val_loss: 0.3633 - val_accuracy: 0.8715

We can see that the experimental results are still good. Let's take a look at the model architecture:

    model.summary()
    Model: "sequential_1"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    dense_1 (Dense)              (None, 32)                896
    _________________________________________________________________
    dense_2 (Dense)              (None, 32)                1056
    _________________________________________________________________
    dropout_1 (Dropout)          (None, 32)                0
    _________________________________________________________________
    dense_3 (Dense)              (None, 32)                1056
    _________________________________________________________________
    dense_4 (Dense)              (None, 1)                 33
    =================================================================
    Total params: 3,041
    Trainable params: 3,041
    Non-trainable params: 0
    _________________________________________________________________

Evaluate the model (here on the training data):

    scores = model.evaluate(train_df, y_train, batch_size=32)
    print(scores)
    891/891 [==============================] - 0s 18us/step
    [0.4208374666645872, 0.8316498398780823]

You can see that the accuracy is similar to the best result from the random forest.

# 4. Model optimization (parameter tuning)

Finally, we can try to squeeze out a better score by stacking: training a second-level model on the predictions of the better models above.

First, gather the better models obtained earlier (each with only one parameter tuned):

    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier, AdaBoostClassifier
    from xgboost import XGBClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import RidgeClassifier
    import sklearn
    
    classifier_num = 5
    clf = [0 for i in range(classifier_num)]
    clf[0] = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=0)
    clf[1] = GradientBoostingClassifier(n_estimators=100, max_depth=4, random_state=0)
    clf[2] = RidgeClassifier(5)
    clf[3] = BaggingClassifier(n_estimators=15, base_estimator=clf[2])
    clf[4] = XGBClassifier(max_depth=2)
    
    from sklearn.model_selection import train_test_split
    X_train, X_test, Y_train, Y_test = train_test_split(train_df, y_train, random_state=0)
    
    predictFrame = pd.DataFrame()
    for model in clf:
        model.fit(X_train, Y_train)
        predictFrame[str(model)[:13]] = model.predict(X_test)
    predictFrame.head()

![](https://imgconvert.csdnimg.cn/aHR0cHM6Ly9tbWJpei5xcGljLmNuL21tYml6X3BuZy92STluWWU5NGZzRnFwM1pON2ptM05ydEpxZUpycTBuTWFzdjhGaWJjSHRxZUZTOGtuRW1WZ3VtYUhlaWFjY1AwMHFOZ0ZFWGpXcklBQUJWRGpuYlp5YmVBLzY0MA?x-oss-
process=image/format,png)

The column names don't matter as long as they don't collide~ Next, feed these predictions into a second-level classifier. I didn't try anything else and simply used a random forest:

    %matplotlib inline
    depth_ = [1, 2, 3, 4, 5, 6, 7, 8]
    scores = []
    for depth in depth_:
        clf_ = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=0)
        test_score = cross_val_score(clf_, predictFrame, Y_test, cv=10, scoring="precision")
        scores.append(np.mean(test_score))
    plt.plot(depth_, scores)

![](https://imgconvert.csdnimg.cn/aHR0cHM6Ly9tbWJpei5xcGljLmNuL21tYml6X3BuZy92STluWWU5NGZzRnFwM1pON2ptM05ydEpxZUpycTBuTVIxa25ZajYwbzhaVGM3VGV5M2liQkV4U0hmeDFzckw2WDgzenY4VU8yUjNPbTI1dEQ5dFpGZEEvNjQw?x-oss-
process=image/format,png)

OK, set max_depth to 2 and then produce the overall result:

    finalFrame = pd.DataFrame()
    XFrame = pd.DataFrame()
    for model in clf:
        model.fit(train_df, y_train)
        XFrame[str(model)[:13]] = model.predict(train_df)
        finalFrame[str(model)[:13]] = model.predict(test_df)
    final_clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
    final_clf.fit(XFrame, y_train)
    result = final_clf.predict(finalFrame)

Concatenate the result with PassengerId into a DataFrame and write out the submission (a sketch of this step is shown below). The score is slightly better than the earlier random forest result without stacking. However, we only used a few simple machine learning algorithms and did very little parameter tuning, so there is still plenty of room for improvement; readers can optimize further on their own.
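The submission code itself is not shown in the article; a minimal sketch, assuming test_df.index still holds the PassengerId (we read the CSVs with index_col=0), could be:

    # a minimal sketch (not shown in the original article): write the submission
    # file; test_df.index is PassengerId because the CSVs were read with index_col=0
    submission = pd.DataFrame({
        "PassengerId": test_df.index,
        "Survived": result.astype(int),
    })
    submission.to_csv("submission.csv", index=False)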

The PDF of this article and the dataset can be downloaded by replying "_Titanic" in the public account backend.
