Kaggle: An Introductory Titanic Hands-On

Preface

To record my learning process, I have roughly organized the analysis workflow below. I used Jupyter Notebook, which I prefer, then exported the notebook to Markdown and posted it on CSDN to share.
This is only a simple analysis, so it is fairly basic. If anything is wrong, please bear with me; you are also welcome to exchange opinions.

Imports

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib
matplotlib.rcParams['font.family'] = 'STSong'  # a Chinese font, for CJK labels in plots
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler


import warnings
warnings.filterwarnings('ignore')

Data loading

df_train = pd.read_csv('../titanic_dir/titanic_data/train.csv')
df_test = pd.read_csv('../titanic_dir/titanic_data/test.csv')
df_all = pd.concat([df_train, df_test])  # stack train and test so preprocessing is applied consistently
df_all.reset_index(drop=True,inplace=True)

print('df_all'+'*'*20)
print(df_all.columns)

print('df_train'+'*'*20)
print(df_train.columns)

print('df_test'+'*'*20)
print(df_test.columns)
df_all********************
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
df_train********************
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
df_test********************
Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
df_all
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1304 | 1305 | NaN | 3 | Spector, Mr. Woolf | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S |
| 1305 | 1306 | NaN | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C |
| 1306 | 1307 | NaN | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S |
| 1307 | 1308 | NaN | 3 | Ware, Mr. Frederick | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S |
| 1308 | 1309 | NaN | 3 | Peter, Master. Michael J | male | NaN | 1 | 1 | 2668 | 22.3583 | NaN | C |

1309 rows × 12 columns

df_all.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB

Age and Cabin have high missing rates.

Embarked and Fare are missing only a handful of values.

Feature correlation

Pearson correlation formula
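For reference, the Pearson correlation coefficient between two variables $X$ and $Y$ is

$$r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$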

df_all.corr('pearson')
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| PassengerId | 1.000000 | -0.005007 | -0.038354 | 0.028814 | -0.055224 | 0.008942 | 0.031428 |
| Survived | -0.005007 | 1.000000 | -0.338481 | -0.077221 | -0.035322 | 0.081629 | 0.257307 |
| Pclass | -0.038354 | -0.338481 | 1.000000 | -0.408106 | 0.060832 | 0.018322 | -0.558629 |
| Age | 0.028814 | -0.077221 | -0.408106 | 1.000000 | -0.243699 | -0.150917 | 0.178740 |
| SibSp | -0.055224 | -0.035322 | 0.060832 | -0.243699 | 1.000000 | 0.373587 | 0.160238 |
| Parch | 0.008942 | 0.081629 | 0.018322 | -0.150917 | 0.373587 | 1.000000 | 0.221539 |
| Fare | 0.031428 | 0.257307 | -0.558629 | 0.178740 | 0.160238 | 0.221539 | 1.000000 |

Data filling

df_all
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1304 | 1305 | NaN | 3 | Spector, Mr. Woolf | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S |
| 1305 | 1306 | NaN | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C |
| 1306 | 1307 | NaN | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S |
| 1307 | 1308 | NaN | 3 | Ware, Mr. Frederick | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S |
| 1308 | 1309 | NaN | 3 | Peter, Master. Michael J | male | NaN | 1 | 1 | 2668 | 22.3583 | NaN | C |

1309 rows × 12 columns

Fare

df_all[df_all['Fare'].isnull()]
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1043 | 1044 | NaN | 3 | Storey, Mr. Thomas | male | 60.5 | 0 | 0 | 3701 | NaN | NaN | S |

Fare has only one missing value, and it comes from the test set (which is why Survived is also empty for that row). According to the correlation table, Age, SibSp and Parch each have a correlation with Fare above 0.1, and none of them is missing in this row. However, Age is a continuous value and hard to group on; since there is only one missing value, Age is dropped from the grouping. Because sex is generally informative, Sex is added instead. So the fill value used here is the median Fare after grouping by the three attributes SibSp, Parch and Sex.

df_all.groupby(['Parch', 'SibSp','Sex'])['Fare'].median()
# df_all.groupby(['Parch', 'Age', 'SibSp','Sex'])['Fare'].median().to_clipboard()
Parch  SibSp  Sex   
0      0      female     10.50000
              male        8.05000
       1      female     26.00000
              male       26.00000
       2      female     23.25000
              male       23.25000
       3      female     18.42500
              male       18.00000
1      0      female     39.40000
              male       33.00000
       1      female     26.00000
              male       23.00000
       2      female     23.00000
              male       25.25000
       3      female     25.46670
              male       21.55000
       4      male       34.40625
2      0      female     22.35830
              male       30.75000
       1      female     41.57920
              male       41.57920
       2      female    148.37500
              male      148.37500
       3      female    263.00000
              male       27.90000
       4      female     31.27500
              male       31.38750
       5      female     46.90000
              male       46.90000
       8      female     69.55000
              male       69.55000
3      0      female     29.12915
       1      female     34.37500
              male      148.37500
       2      female     18.75000
4      0      female     23.27085
       1      female    145.45000
              male      145.45000
5      0      female     34.40625
       1      female     31.33125
              male       31.33125
6      1      female     46.90000
              male       46.90000
9      1      female     69.55000
              male       69.55000
Name: Fare, dtype: float64

The group Parch = 0, SibSp = 0, Sex = male has a median Fare of 8.05, so that value is used to fill:

df_all['Fare'] = df_all['Fare'].fillna(8.05000)
df_all[df_all['Fare'].isnull()]
Empty DataFrame
Columns: [PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked]
Index: []
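For reference, the same fill can be done without hard-coding the number, using a grouped transform (a minimal sketch, equivalent to looking up the group median above):

# Fill each missing Fare with the median Fare of its (Parch, SibSp, Sex) group
df_all['Fare'] = (df_all.groupby(['Parch', 'SibSp', 'Sex'])['Fare']
                        .transform(lambda s: s.fillna(s.median())))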

Age

263 rows are missing Age.

According to the Pearson correlation table, Fare is taken here as the numeric field related to Age.

Embarked and Cabin can be considered for later optimization.

df_all['Age'].isnull().value_counts()
False    1046
True      263
Name: Age, dtype: int64

Observe the Age field together with Fare.

Take a look at the distribution of Fare first:

plt.figure(figsize=(15,10))
plt.hist(df_all['Fare'])
plt.show()

Most Fare values fall in the 0-50 range.

# Quantile (equal-frequency) binning into 10 bins
df_all['Fare_cut'] = pd.qcut(df_all['Fare'],10)
plt.figure(figsize=(15,10))
df_all['Fare_cut'].value_counts().plot(kind='bar',rot=30)

<AxesSubplot:>

Fare is split into 10 quantile bins, so each bin holds roughly the same number of passengers.

df_all.groupby(['Fare_cut','Sex']).agg({'Age':'median'})
| Fare_cut | Sex | Age |
|---|---|---|
| (-0.001, 7.57] | female | 20.25 |
| (-0.001, 7.57] | male | 25.00 |
| (7.57, 7.854] | female | 22.00 |
| (7.57, 7.854] | male | 25.00 |
| (7.854, 8.05] | female | 24.00 |
| (7.854, 8.05] | male | 28.00 |
| (8.05, 10.5] | female | 22.50 |
| (8.05, 10.5] | male | 24.50 |
| (10.5, 14.454] | female | 27.00 |
| (10.5, 14.454] | male | 30.00 |
| (14.454, 21.558] | female | 24.00 |
| (14.454, 21.558] | male | 26.00 |
| (21.558, 26.82] | female | 29.00 |
| (21.558, 26.82] | male | 36.00 |
| (26.82, 41.579] | female | 24.00 |
| (26.82, 41.579] | male | 29.00 |
| (41.579, 78.02] | female | 35.00 |
| (41.579, 78.02] | male | 32.00 |
| (78.02, 512.329] | female | 35.00 |
| (78.02, 512.329] | male | 37.00 |

After grouping, use transform to fill each missing Age with the median Age of its group (transform is really handy; readers are encouraged to look it up and learn more).

df_all['Age'] = df_all.groupby(['Fare_cut','Sex'])['Age'].transform(lambda x:x.fillna(x.median()))
df_all['Age'].isnull().value_counts()
False    1309
Name: Age, dtype: int64

The difference between apply and transform in groupby

Both apply and transform operate on the grouped data, one group at a time. First, look at what the groups actually contain:

list(df_all.groupby(['Fare_cut','Sex']))
[((Interval(-0.001, 7.57, closed='right'), 'female'),
        PassengerId  Survived  Pclass  \
  19             20       1.0       3   
  235           236       0.0       3   
  367           368       1.0       3   
  376           377       1.0       3   
  649           650       1.0       3   
  654           655       0.0       3   
  780           781       1.0       3   
  786           787       1.0       3   
  875           876       1.0       3   
  892           893       NaN       3   
  899           900       NaN       3   
  910           911       NaN       3   
  1004         1005       NaN       3   
  1182         1183       NaN       3   
  1238         1239       NaN       3   

In groupby, the object passed to apply can be a Series or a DataFrame.

The function given to transform only ever operates on one Series (column) at a time.

The following apply counts within each group directly:

df_all.groupby(['Fare_cut','Sex'])[['Age','Fare']].apply(lambda x:x.count())
| Fare_cut | Sex | Age | Fare |
|---|---|---|---|
| (-0.001, 7.57] | female | 15 | 15 |
| (-0.001, 7.57] | male | 116 | 116 |
| (7.57, 7.854] | female | 50 | 50 |
| (7.57, 7.854] | male | 94 | 94 |
| (7.854, 8.05] | female | 22 | 22 |
| (7.854, 8.05] | male | 125 | 125 |
| (8.05, 10.5] | female | 30 | 30 |
| (8.05, 10.5] | male | 78 | 78 |
| (10.5, 14.454] | female | 49 | 49 |
| (10.5, 14.454] | male | 79 | 79 |
| (14.454, 21.558] | female | 61 | 61 |
| (14.454, 21.558] | male | 66 | 66 |
| (21.558, 26.82] | female | 52 | 52 |
| (21.558, 26.82] | male | 79 | 79 |
| (26.82, 41.579] | female | 52 | 52 |
| (26.82, 41.579] | male | 82 | 82 |
| (41.579, 78.02] | female | 53 | 53 |
| (41.579, 78.02] | male | 75 | 75 |
| (78.02, 512.329] | female | 82 | 82 |
| (78.02, 512.329] | male | 49 | 49 |

In the following apply, the x passed in is a DataFrame. To operate on a single column, use x.iloc[:, 0]:

df_all.groupby(['Fare_cut','Sex'])[['Age','Fare']].apply(lambda x:x.iloc[:,0].count()+x.iloc[:,1].count())
# df_all.groupby(['Fare_cut','Sex'])[['Age','Fare']].apply(lambda x:print(type(x)))
Fare_cut          Sex   
(-0.001, 7.57]    female     30
                  male      232
(7.57, 7.854]     female    100
                  male      188
(7.854, 8.05]     female     44
                  male      250
(8.05, 10.5]      female     60
                  male      156
(10.5, 14.454]    female     98
                  male      158
(14.454, 21.558]  female    122
                  male      132
(21.558, 26.82]   female    104
                  male      158
(26.82, 41.579]   female    104
                  male      164
(41.579, 78.02]   female    106
                  male      150
(78.02, 512.329]  female    164
                  male       98
dtype: int64
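To make the contrast concrete, here is a tiny sketch on a toy frame (hypothetical data, not from the Titanic set):

# apply returns one value per group; transform broadcasts back to every input row
demo = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1.0, 3.0, 5.0]})
print(demo.groupby('g')['v'].apply(lambda s: s.mean()))      # a -> 2.0, b -> 5.0
print(demo.groupby('g')['v'].transform(lambda s: s.mean()))  # [2.0, 2.0, 5.0]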

Embarked

df_all[df_all['Embarked'].isnull()]
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Fare_cut |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 62 | 1.0 | 1 | Icard, Miss. Amelie | female | 38.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN | (78.02, 512.329] |
| 829 | 830 | 1.0 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN | (78.02, 512.329] |

df_all.groupby(['Fare_cut','Sex','Embarked'])['Embarked'].count()
Fare_cut          Sex     Embarked
(-0.001, 7.57]    female  C             7
                          Q             3
                          S             5
                  male    C            42
                          Q             4
                          S            70
(7.57, 7.854]     female  C             0
                          Q            34
                          S            16
                  male    C             0
                          Q            38
                          S            56
(7.854, 8.05]     female  C             0
                          Q             8
                          S            14
                  male    C             6
                          Q             2
                          S           117
(8.05, 10.5]      female  C             1
                          Q             1
                          S            28
                  male    C             3
                          Q             2
                          S            73
(10.5, 14.454]    female  C            12
                          Q             2
                          S            35
                  male    C            11
                          Q             4
                          S            64
(14.454, 21.558]  female  C            13
                          Q             6
                          S            42
                  male    C            15
                          Q             4
                          S            47
(21.558, 26.82]   female  C             3
                          Q             3
                          S            46
                  male    C             8
                          Q             3
                          S            68
(26.82, 41.579]   female  C            15
                          Q             1
                          S            36
                  male    C            27
                          Q             5
                          S            50
(41.579, 78.02]   female  C            20
                          Q             0
                          S            33
                  male    C            17
                          Q             0
                          S            58
(78.02, 512.329]  female  C            42
                          Q             2
                          S            36
                  male    C            28
                          Q             1
                          S            20
Name: Embarked, dtype: int64

The two missing rows are both women with a fare of 80.
Matching them to the corresponding group: the fare falls in the bin (78.02, 512.329], and among women in that bin the most frequent port of embarkation is C (42 of 80), so C is used as the fill value.

# Fill the missing values in Embarked with C
df_all['Embarked'] = df_all['Embarked'].fillna('C')
df_all[df_all['Embarked'].isnull()]
Empty DataFrame
Columns: [PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked, Fare_cut]
Index: []

Cabin

df_all['Cabin'].isnull().value_counts()
True     1014
False     295
Name: Cabin, dtype: int64

The Cabin attribute has many missing values. Here, 'M' (for missing) is used to fill them temporarily, and every value is then replaced by its initial letter (the deck).

df_all['Cabin'] = df_all['Cabin'].fillna('M')
# The missing value is replaced by M
df_all['Cabin_new'] = df_all['Cabin'].apply(lambda x:x[0])
df_all['Cabin_new'].value_counts()
M    1014
C      94
B      65
D      46
E      41
A      22
F      21
G       5
T       1
Name: Cabin_new, dtype: int64

Missing-value filling is now complete.

df_all.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   PassengerId  1309 non-null   int64   
 1   Survived     891 non-null    float64 
 2   Pclass       1309 non-null   int64   
 3   Name         1309 non-null   object  
 4   Sex          1309 non-null   object  
 5   Age          1309 non-null   float64 
 6   SibSp        1309 non-null   int64   
 7   Parch        1309 non-null   int64   
 8   Ticket       1309 non-null   object  
 9   Fare         1309 non-null   float64 
 10  Cabin        1309 non-null   object  
 11  Embarked     1309 non-null   object  
 12  Fare_cut     1309 non-null   category
 13  Cabin_new    1309 non-null   object  
dtypes: category(1), float64(3), int64(4), object(6)
memory usage: 135.1+ KB
df_all_copy = df_all.copy()
# Data backup

Survival analysis

df_all
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Fare_cut | Cabin_new |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | M | S | (-0.001, 7.57] | M |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | (41.579, 78.02] | C |
| 2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | M | S | (7.854, 8.05] | M |
| 3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | (41.579, 78.02] | C |
| 4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | M | S | (7.854, 8.05] | M |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1304 | 1305 | NaN | 3 | Spector, Mr. Woolf | male | 28.0 | 0 | 0 | A.5. 3236 | 8.0500 | M | S | (7.854, 8.05] | M |
| 1305 | 1306 | NaN | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C | (78.02, 512.329] | C |
| 1306 | 1307 | NaN | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | M | S | (-0.001, 7.57] | M |
| 1307 | 1308 | NaN | 3 | Ware, Mr. Frederick | male | 28.0 | 0 | 0 | 359309 | 8.0500 | M | S | (7.854, 8.05] | M |
| 1308 | 1309 | NaN | 3 | Peter, Master. Michael J | male | 36.0 | 1 | 1 | 2668 | 22.3583 | M | C | (21.558, 26.82] | M |

1309 rows × 14 columns

survived_sum = df_train['Survived'].value_counts().sum()
df_train['Survived'].value_counts()/ survived_sum
0    0.616162
1    0.383838
Name: Survived, dtype: float64
(df_train['Survived'].value_counts()/ survived_sum).plot(kind='bar')
<AxesSubplot:>

About 38% of the passengers in the training set survived.
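Incidentally, the same proportions can be computed directly (a one-line sketch):

df_train['Survived'].value_counts(normalize=True)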

Feature Engineering

Fare

del df_all['Fare_cut']
df_all[:2]
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Cabin_new |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | M | S | M |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | C |

When filling in the missing Age values, Fare was divided into 10 intervals and a new attribute Fare_cut was created; it is deleted here.

Cabin

df_all['Cabin'] = df_all['Cabin_new']
del df_all['Cabin_new']
df_all[:2]
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | M | S |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | C |

Replace Cabin with the value of the newly built attribute Cabin_new, then drop the helper column.

Age

No further processing for now.

New feature --- Family_Size

df_all['Family_Size'] = df_all['SibSp'] + df_all['Parch'] + 1

Family_Size is SibSp plus Parch plus 1 (the +1 counts the passenger themselves).

df_all[:2]
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Family_Size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | M | S | 2 |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | C | 2 |

New feature --- Title

df_all['Name'].str.split(', ', expand=True)
| | 0 | 1 |
|---|---|---|
| 0 | Braund | Mr. Owen Harris |
| 1 | Cumings | Mrs. John Bradley (Florence Briggs Thayer) |
| 2 | Heikkinen | Miss. Laina |
| 3 | Futrelle | Mrs. Jacques Heath (Lily May Peel) |
| 4 | Allen | Mr. William Henry |
| ... | ... | ... |
| 1304 | Spector | Mr. Woolf |
| 1305 | Oliva y Ocana | Dona. Fermina |
| 1306 | Saether | Mr. Simon Sivertsen |
| 1307 | Ware | Mr. Frederick |
| 1308 | Peter | Master. Michael J |

1309 rows × 2 columns

Extract the title from the Name field:

df_all['Title'] = df_all['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]
df_all['Title'].value_counts()
Mr              757
Miss            260
Mrs             197
Master           61
Dr                8
Rev               8
Col               4
Major             2
Mlle              2
Ms                2
Lady              1
Capt              1
Jonkheer          1
Mme               1
Dona              1
Sir               1
the Countess      1
Don               1
Name: Title, dtype: int64
plt.subplots(figsize=(20, 10))
sns.barplot(x=df_all['Title'].value_counts().index, y=df_all['Title'].value_counts().values)
plt.title('Number of prefixes')
plt.show()

Here, I put the common titles Mr, Miss, Mrs and Ms into one category, and all the others into a second category.

df_all['Title'].replace(['Mr','Miss','Mrs','Ms'],'cate1',inplace=True)
df_all['Title'] = df_all['Title'].apply(lambda x:'cate1' if x=='cate1' else 'cate2')
df_all['Title'].value_counts()
cate1    1216
cate2      93
Name: Title, dtype: int64

cate1 stands for Mr, Miss, Mrs and Ms, and cate2 stands for others
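The same mapping can also be written in a single step; a small sketch equivalent to the replace-then-apply above:

df_all['Title'] = df_all['Title'].apply(lambda x: 'cate1' if x in ('Mr', 'Miss', 'Mrs', 'Ms') else 'cate2')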

df_all[:3]
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Family_Size | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | M | S | 2 | cate1 |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C | C | 2 | cate1 |
| 2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | M | S | 1 | cate1 |

Delete features --- PassengerId, Name, Ticket

df_all.drop(columns=['PassengerId','Name','Ticket'],inplace=True)
df_all[:3]
| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | Family_Size | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | M | S | 2 | cate1 |
| 1 | 1.0 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | C | 2 | cate1 |
| 2 | 1.0 | 3 | female | 26.0 | 0 | 0 | 7.9250 | M | S | 1 | cate1 |

Categorical feature encoding

Nominal (unordered categorical) variables are suited to OneHotEncoder.

Ordinal (ordered categorical) variables are suited to LabelEncoder.

df_all
| | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | Family_Size | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | M | S | 2 | cate1 |
| 1 | 1.0 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | C | 2 | cate1 |
| 2 | 1.0 | 3 | female | 26.0 | 0 | 0 | 7.9250 | M | S | 1 | cate1 |
| 3 | 1.0 | 1 | female | 35.0 | 1 | 0 | 53.1000 | C | S | 2 | cate1 |
| 4 | 0.0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | M | S | 1 | cate1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1304 | NaN | 3 | male | 28.0 | 0 | 0 | 8.0500 | M | S | 1 | cate1 |
| 1305 | NaN | 1 | female | 39.0 | 0 | 0 | 108.9000 | C | C | 1 | cate2 |
| 1306 | NaN | 3 | male | 38.5 | 0 | 0 | 7.2500 | M | S | 1 | cate1 |
| 1307 | NaN | 3 | male | 28.0 | 0 | 0 | 8.0500 | M | S | 1 | cate1 |
| 1308 | NaN | 3 | male | 36.0 | 1 | 1 | 22.3583 | M | C | 3 | cate2 |

1309 rows × 11 columns

The four features to encode here --- Sex, Cabin, Embarked, Title --- are all nominal, so one-hot encoding is used.

OneHotEncoder

cat_features_list = ['Sex', 'Cabin', 'Embarked', 'Title']
df_all_encode = pd.DataFrame()
for feature in cat_features_list:
    # One-hot encode this column; .toarray() densifies the sparse matrix sklearn returns
    data_encode = OneHotEncoder().fit_transform(df_all[feature].values.reshape(-1, 1)).toarray()
    value_count = df_all[feature].unique().size
    new_columns = ['{}_{}'.format(feature, i) for i in range(1, value_count + 1)]
    print(new_columns)
    df_encode = pd.DataFrame(data_encode, columns=new_columns)
#     print(df_encode)
    df_all_encode = pd.concat([df_all_encode, df_encode], axis=1)

['Sex_1', 'Sex_2']
['Cabin_1', 'Cabin_2', 'Cabin_3', 'Cabin_4', 'Cabin_5', 'Cabin_6', 'Cabin_7', 'Cabin_8', 'Cabin_9']
['Embarked_1', 'Embarked_2', 'Embarked_3']
['Title_1', 'Title_2']
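For comparison, pandas can build the same kind of dummy columns in one call; a small sketch (column names come from the category values rather than positional indices):

df_dummies = pd.get_dummies(df_all[['Sex', 'Cabin', 'Embarked', 'Title']])
# e.g. columns like Sex_female, Sex_male, Cabin_A, ..., Embarked_S, Title_cate2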

The data encoded by OneHotEncoder is as follows

df_all_encode
| | Sex_1 | Sex_2 | Cabin_1 | Cabin_2 | Cabin_3 | Cabin_4 | Cabin_5 | Cabin_6 | Cabin_7 | Cabin_8 | Cabin_9 | Embarked_1 | Embarked_2 | Embarked_3 | Title_1 | Title_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1304 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 1305 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1306 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 1307 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 1308 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |

1309 rows × 16 columns

Concatenate the OneHotEncoder output with df_all, and delete the four original features: ['Sex', 'Cabin', 'Embarked', 'Title'].

df_all = pd.concat([df_all,df_all_encode],axis=1)
df_all.drop(columns=['Sex','Cabin','Embarked','Title'],inplace=True)
df_all[:3]
| | Survived | Pclass | Age | SibSp | Parch | Fare | Family_Size | Sex_1 | Sex_2 | Cabin_1 | ... | Cabin_5 | Cabin_6 | Cabin_7 | Cabin_8 | Cabin_9 | Embarked_1 | Embarked_2 | Embarked_3 | Title_1 | Title_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 3 | 22.0 | 1 | 0 | 7.2500 | 2 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 1 | 1.0 | 1 | 38.0 | 1 | 0 | 71.2833 | 2 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 3 | 26.0 | 0 | 0 | 7.9250 | 1 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |

3 rows × 23 columns

df_all.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 23 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Survived     891 non-null    float64
 1   Pclass       1309 non-null   int64  
 2   Age          1309 non-null   float64
 3   SibSp        1309 non-null   int64  
 4   Parch        1309 non-null   int64  
 5   Fare         1309 non-null   float64
 6   Family_Size  1309 non-null   int64  
 7   Sex_1        1309 non-null   float64
 8   Sex_2        1309 non-null   float64
 9   Cabin_1      1309 non-null   float64
 10  Cabin_2      1309 non-null   float64
 11  Cabin_3      1309 non-null   float64
 12  Cabin_4      1309 non-null   float64
 13  Cabin_5      1309 non-null   float64
 14  Cabin_6      1309 non-null   float64
 15  Cabin_7      1309 non-null   float64
 16  Cabin_8      1309 non-null   float64
 17  Cabin_9      1309 non-null   float64
 18  Embarked_1   1309 non-null   float64
 19  Embarked_2   1309 non-null   float64
 20  Embarked_3   1309 non-null   float64
 21  Title_1      1309 non-null   float64
 22  Title_2      1309 non-null   float64
dtypes: float64(19), int64(4)
memory usage: 235.3 KB

Data segmentation

df_train = df_all.loc[:890].copy()   # first 891 rows came from the training set
df_test = df_all.loc[891:].copy()    # remaining 418 rows came from the test set

del df_test['Survived']
y_train = df_train['Survived'].values
del df_train['Survived']
df_train.shape
(891, 22)
df_test.shape
(418, 22)
y_train.shape
(891,)
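The positional slice works because the training rows were concatenated first. A sketch of a split that does not depend on row order, keyed on the Survived column itself:

# Alternative split (sketch): rows with a Survived label are training data
mask = df_all['Survived'].notna()
df_train = df_all[mask].copy()
df_test = df_all[~mask].drop(columns='Survived')
y_train = df_train.pop('Survived').values  # pop drops the column and returns it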

Standardization

x_train = StandardScaler().fit_transform(df_train)
x_test = StandardScaler().fit_transform(df_test)

print('x_train shape: {}'.format(x_train.shape))
print('y_train shape: {}'.format(y_train.shape))
print('x_test shape: {}'.format(x_test.shape))
x_train shape: (891, 22)
y_train shape: (891,)
x_test shape: (418, 22)
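One caveat: fitting a second StandardScaler on the test set gives the two sets slightly different scaling parameters. The usual practice is to fit once on the training data and reuse it, as in this sketch:

scaler = StandardScaler().fit(df_train)
x_train = scaler.transform(df_train)
x_test = scaler.transform(df_test)   # scaled with the training set's mean and std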

Models

Decision tree

df_result = pd.DataFrame()
from sklearn import tree 
decision_tree_model = tree.DecisionTreeClassifier()
decision_tree_model.fit(x_train,y_train)
y_predict_with_dtree = decision_tree_model.predict(x_test)

df_result['y_predict_with_dtree'] = y_predict_with_dtree
df_result
| | y_predict_with_dtree |
|---|---|
| 0 | 0.0 |
| 1 | 0.0 |
| 2 | 1.0 |
| 3 | 1.0 |
| 4 | 1.0 |
| ... | ... |
| 413 | 0.0 |
| 414 | 1.0 |
| 415 | 0.0 |
| 416 | 0.0 |
| 417 | 1.0 |

418 rows × 1 columns

Logistic regression

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression().fit(x_train,y_train)
y_predict_with_logisticReg = lr.predict(x_test)

df_result['y_predict_with_logisticReg'] = y_predict_with_logisticReg
df_result
| | y_predict_with_dtree | y_predict_with_logisticReg |
|---|---|---|
| 0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 |
| 4 | 1.0 | 1.0 |
| ... | ... | ... |
| 413 | 0.0 | 0.0 |
| 414 | 1.0 | 1.0 |
| 415 | 0.0 | 0.0 |
| 416 | 0.0 | 0.0 |
| 417 | 1.0 | 0.0 |

418 rows × 2 columns

Support vector machine

from sklearn import svm
svm_model = svm.SVC()
svm_model.fit(x_train,y_train)
y_predict_with_svm = svm_model.predict(x_test)

df_result['y_predict_with_svm'] = y_predict_with_svm
df_result
| | y_predict_with_dtree | y_predict_with_logisticReg | y_predict_with_svm |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 |
| 2 | 1.0 | 0.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 |
| 4 | 1.0 | 1.0 | 0.0 |
| ... | ... | ... | ... |
| 413 | 0.0 | 0.0 | 0.0 |
| 414 | 1.0 | 1.0 | 1.0 |
| 415 | 0.0 | 0.0 | 0.0 |
| 416 | 0.0 | 0.0 | 0.0 |
| 417 | 1.0 | 0.0 | 1.0 |

418 rows × 3 columns

KNN

from sklearn import neighbors
knnmodel = neighbors.KNeighborsClassifier(n_neighbors=2) # n_neighbors is the number of neighbors used for voting, not the number of classes
knnmodel.fit(x_train,y_train)

y_predict_with_knn = knnmodel.predict(x_test)
df_result['y_predict_with_knn'] = y_predict_with_knn
df_result
| | y_predict_with_dtree | y_predict_with_logisticReg | y_predict_with_svm | y_predict_with_knn |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1.0 | 1.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... |
| 413 | 0.0 | 0.0 | 0.0 | 0.0 |
| 414 | 1.0 | 1.0 | 1.0 | 1.0 |
| 415 | 0.0 | 0.0 | 0.0 | 0.0 |
| 416 | 0.0 | 0.0 | 0.0 | 0.0 |
| 417 | 1.0 | 0.0 | 1.0 | 1.0 |

418 rows × 4 columns

Random forest

# RandomForestClassifier was already imported at the top
model_randomforest = RandomForestClassifier().fit(x_train,y_train)
y_predict_with_random_forest = model_randomforest.predict(x_test)
df_result['y_predict_with_random_forest'] = y_predict_with_random_forest
df_result
| | y_predict_with_dtree | y_predict_with_logisticReg | y_predict_with_svm | y_predict_with_knn | y_predict_with_random_forest |
|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... |
| 413 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 414 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 415 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 416 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 417 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |

418 rows × 5 columns

Result verification

To quickly measure prediction accuracy locally, I used a 100%-accurate submission shared by experienced Kaggle users as the verification data.

With it, the accuracies of the five models come out as follows; random forest scores highest at 0.7871.

df_check = pd.read_csv(r'../titanic_dir/titanic_data/correct_submission_titanic.csv')
df_check = df_check['Survived']
# df_check
for column in df_result:
    df_concat = pd.concat([df_result[column],df_check],axis=1)
    # predict_tag is 1 where the prediction matches the reference label, else 0
    df_concat['predict_tag'] = df_concat.apply(lambda x: 1 if x[0]==x[1] else 0,axis=1)
    right_rate = df_concat['predict_tag'].sum()/df_concat['predict_tag'].count()
    print(column,'The accuracy is:')
    print(np.round(right_rate,4))
y_predict_with_dtree The accuracy is:
0.7057
y_predict_with_logisticReg The accuracy is:
0.7703
y_predict_with_svm The accuracy is:
0.7656
y_predict_with_knn The accuracy is:
0.7656
y_predict_with_random_forest The accuracy is:
0.7871
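The same numbers can be obtained with sklearn's built-in metric; a short sketch:

from sklearn.metrics import accuracy_score
for column in df_result:
    print(column, np.round(accuracy_score(df_check, df_result[column]), 4))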

Review and optimization ideas

1

When filling in the missing values, Cabin was only handled briefly, without a good analysis of how it should be treated. I suspect the different cabin locations have a large effect on the probability of survival, and this attribute has a very high missing rate, so the choice of missing-value treatment should strongly affect the results. Later optimization should spend more time thinking about this.

2

The five models were simply called with default parameters, without any parameter selection. In later optimization, searching suitable parameters for each model before predicting should give much better results, for example as in the sketch below.
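A minimal parameter search with GridSearchCV (the grid below is illustrative, not tuned):

from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [100, 300], 'max_depth': [4, 6, None]}
search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)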
