Kaggle Titanic (1)

Question:
On April 15, 1912, the Titanic sank. There were not enough lifeboats for everyone on board, and 1502 of the 2224 passengers and crew died. Although survival involved an element of luck, some groups of people appear to have been more likely to survive than others.

Build a predictive model that answers the question "what kinds of people were more likely to survive?" using passenger data (name, age, gender, socio-economic class, etc.).

Available datasets:

  • Training set (train.csv)
  • Test set (test.csv)

Solution 1:

Score: 0.78468
Leaderboard: 1700/14296 (11.89%)

The solution proceeds as follows:

1, Data exploration and feature overview

The analysis relies on pandas, a NumPy-based tool created for data analysis tasks, together with scikit-learn.

#Import related libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split 
from sklearn.ensemble import RandomForestClassifier
# Training data
data_train=pd.read_csv("/kaggle/input/titanic/train.csv")
# test data
data_test = pd.read_csv("/kaggle/input/titanic/test.csv")
# Check the row counts and missing values for each column
print(data_train.info())
print(data_test.info())
  1. This is a binary classification problem: Survived is 1 for survived and 0 for did not survive. There are 1309 records in total (891 in the training set and 418 in the test set)
  2. The training set provides 11 features: 6 numerical (PassengerId, Pclass, Age, SibSp, Parch, Fare) and 5 text (Name, Sex, Ticket, Cabin, Embarked)
  3. The prediction target is numerical (Survived)
  4. Compared with the training set, the test set lacks the Survived column
  5. PassengerId, Name and Ticket are essentially unique per passenger and carry little predictive value, so they are not included in the subsequent analysis; Cabin has too many missing values, so this feature is not considered either
  6. Seven useful features remain: Pclass (passenger class), Sex and Embarked (port of embarkation) are clearly categorical; Age, SibSp (number of siblings/spouses aboard) and Parch (number of parents/children aboard) are discrete numeric features that can also be treated as categorical; Fare is numerical
  7. Age and Embarked have missing values that need to be handled (these observations can be verified with the short sketch below)
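
For reference, these observations can be checked with a few quick pandas calls (a minimal sketch, not part of the original solution):

# Missing values per column in the training and test sets
print(data_train.isnull().sum())
print(data_test.isnull().sum())
# Survival rate by gender and by passenger class
print(data_train.groupby('Sex')['Survived'].mean())
print(data_train.groupby('Pclass')['Survived'].mean())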

2, Feature Engineering

The 7 features currently intended to be used:

  • Numerical: Pclass (passenger class), Age, SibSp (number of siblings/spouses aboard), Parch (number of parents/children aboard), Fare
  • Text: Sex, Embarked (port of embarkation)

1. Handling missing data

1) Fill in missing Age values with a random forest regression. The code is as follows:

from sklearn.ensemble import RandomForestRegressor

def set_missing_age(df):
    # Use the numerical features to train a random forest regressor for Age
    age_df = df[['Age','Fare','Parch','SibSp','Pclass']]
    # Split passengers into those with a known age and those with an unknown age
    known_age = age_df[age_df.Age.notnull()].values
    unknown_age = age_df[age_df.Age.isnull()].values

    # Target y: the known ages
    y = known_age[:,0]
    # Feature matrix x
    x = known_age[:,1:]

    # Fit a random forest regressor
    rfr = RandomForestRegressor(random_state=0,n_estimators=2000,n_jobs=-1)
    rfr.fit(x,y)

    # Predict the missing ages with the trained model
    predictedAges = rfr.predict(unknown_age[:,1:])

    # Fill the missing values in the original dataframe
    df.loc[(df.Age.isnull()),'Age'] = predictedAges

    return df

# Fill missing Age values
data_train = set_missing_age(data_train)

2) Embarked has only two missing values, so those two rows are simply dropped

#Drop the two rows with missing Embarked values
data = data_train.drop(data_train[data_train.Embarked.isnull()].index)

2. One-hot encode (factorize) the Embarked (port of embarkation), Sex and Pclass features, and standardize the numerical features Age and Fare

import sklearn.preprocessing as preprocessing
# Feature factorization (one-hot encoding)
def set_numeralization(data):
    # One-hot encode the categorical attributes Embarked, Sex and Pclass
    dummies_Embarked = pd.get_dummies(data['Embarked'], prefix='Embarked')
    dummies_Sex = pd.get_dummies(data['Sex'], prefix='Sex')
    dummies_Pclass = pd.get_dummies(data['Pclass'], prefix='Pclass')

    # Concatenate the new dummy columns
    df = pd.concat([data, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)
    # Drop the original columns
    df.drop(['Pclass', 'Sex', 'Embarked'], axis=1, inplace=True)
    return df

# Feature normalization (standardization of Age and Fare)
def set_normalization(df):
    scaler = preprocessing.StandardScaler()
    df['Age_scaled'] = scaler.fit_transform(df['Age'].values.reshape(-1,1))
    df['Fare_scaled'] = scaler.fit_transform(df['Fare'].values.reshape(-1,1))
    return df

# Feature engineering
data = set_numeralization(data)
data = set_normalization(data)

Feature engineering for the test set, code as follows:

# Fill the single missing Fare in the test set with the median
data_test['Fare'].fillna(data_test['Fare'].median(),inplace=True)
# Fill missing Age values with the random forest model
data_test = set_missing_age(data_test)
# Drop any rows with missing Embarked (the test set has none)
data_test = data_test.drop(data_test[data_test.Embarked.isnull()].index)
data_test = set_numeralization(data_test)
data_test = set_normalization(data_test)

3. Check that the training and test sets are complete after feature engineering

print(data_test.info())
print(data.info())

3, Model building and prediction

1) A support vector machine is used for prediction; the leaderboard accuracy is 0.78229

from sklearn import svm

y = data["Survived"]

features = ["Pclass_1", "Pclass_2", "Pclass_3", "Sex_male", "Sex_female", "SibSp", "Parch", "Age", "Fare", "Embarked_C", "Embarked_Q","Embarked_S"]
X = pd.get_dummies(data[features])
X_test = pd.get_dummies(data_test[features])

from sklearn.preprocessing import StandardScaler
# Fit the scaler on the training features and apply it to both the training and test features
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
X_test_scaled = scaler.transform(X_test)

#Convert back to DataFrames to keep the column names and inspect the standardized data
X_scaled = pd.DataFrame(X_scaled, columns=features)
#X_scaled.head()
X_test_scaled = pd.DataFrame(X_test_scaled, columns=features)
#X_test_scaled.head()

model = svm.SVC(C=3, kernel='rbf', gamma=0.1)
model.fit(X_scaled, y)
predictions = model.predict(X_test_scaled)

output = pd.DataFrame({'PassengerId': data_test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

2) The hyperparameters are tuned with a grid search and the model is used for prediction again; the accuracy improves to 0.78468

#Optimize hyperparameters C, kernel and gamma through grid search
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
model = SVC()

C=[1,2,5,10,20,50]
kernel = ['rbf', 'sigmoid']
#gamma = [0.001,0.01,0.1,1,10,100]
gamma = [0.0195, 0.039, 0.0783, 0.156, 0.313, 0.625, 1.25]
Hyperparameter = dict(C=C, kernel=kernel, gamma=gamma)  # Pack the hyperparameter ranges into a dictionary

#Grid search; the best CV score is 0.8302
grs = GridSearchCV(model, param_grid=Hyperparameter, cv = 10, n_jobs=1, return_train_score = False)

grs.fit(np.array(X_scaled), np.array(y))

#Output optimal hyperparameter
print("Best parameters " + str(grs.best_params_))
#print(f'Best parameters: {grs.best_params_}')
#print(f'Best score: {grs.best_score_}')
gpd = pd.DataFrame(grs.cv_results_)
print("Estimated accuracy of this model for unseen data:{0:1.4f}".format(gpd['mean_test_score'][grs.best_index_]))

#The optimized SVM is used for prediction
predictions = grs.predict(X_test_scaled)
output = pd.DataFrame({'PassengerId': data_test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

3) The hyperparameters are tuned with a random search and the model is used for prediction again

#Random search over the same parameter ranges as the grid search. The CV score is unstable: it reaches 0.8302 only about once in 10 runs
#Random search is mainly useful for cheaply exploring the parameter space
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
model = SVC()

C=[1,2,5,10,20,50]
kernel = ['rbf', 'sigmoid']
#gamma = [0.001,0.01,0.1,1,10,100]
gamma = [0.0195, 0.039, 0.0783, 0.156, 0.313, 0.625, 1.25]
Hyperparameter = dict(C=C, kernel=kernel, gamma=gamma)  # Pack the hyperparameter ranges into a dictionary
#Hyperparameter = {"C": stats.uniform(500, 1500),"gamma": stats.uniform(0, 1),'kernel': ('linear', 'rbf')}

random = RandomizedSearchCV(estimator = model, param_distributions = Hyperparameter, cv = 10, random_state=42, n_jobs = -1)
random.fit(np.array(X_scaled), np.array(y))
print(f'Best parameters: {random.best_params_}')
print(f'Best score: {random.best_score_}')

predictions = random.predict(X_test_scaled)
output = pd.DataFrame({'PassengerId': data_test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

4) Hyperparameter tuning with a genetic algorithm (GA)

The result of the genetic algorithm (GA) is also random. After many runs, the model with the highest hold-out accuracy (0.848314606741573) is selected for prediction; the final leaderboard accuracy is 0.78468.

#Genetic algorithm (GA, via TPOT) hyperparameter optimization
from tpot import TPOTClassifier 
from sklearn.model_selection import train_test_split 

tpot_config = {
    'sklearn.svm.SVC': {
    'C': [1,2,5,10,20,50],
    'kernel': ['rbf', 'sigmoid'],
    'gamma': [0.0195, 0.039, 0.0783, 0.156, 0.313, 0.625, 1.25]
    }
}

# Split the training data into a training part and a hold-out part for TPOT
X_train, X_holdout, y_train, y_holdout = train_test_split(X_scaled, y, test_size=0.2)
#Without config_dict the search is unrestricted; the best pipeline then turns out to be a random forest classifier
#tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, n_jobs=-1) 
#Restrict the search to SVM
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, n_jobs=-1, config_dict=tpot_config) 
tpot.fit(X_train, y_train) 
print(tpot.score(X_holdout, y_holdout)) 

#Repeat the optimization until the following results are obtained
# Best pipeline: SVC(SVC(input_matrix, C=20, gamma=0.625, kernel=sigmoid), C=2, gamma=0.039, kernel=rbf)
#     0.848314606741573
predictions = tpot.predict(X_test_scaled)

output = pd.DataFrame({'PassengerId': data_test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

5) Hyperparameter tuning with Bayesian optimization

from skopt import BayesSearchCV  # BayesSearchCV comes from scikit-optimize, not scikit-learn
from sklearn.svm import SVC
model = SVC()

C=[1,2,5,10,20,50]
kernel = ['rbf', 'sigmoid']
#gamma = [0.001,0.01,0.1,1,10,100]
gamma = [0.0195, 0.039, 0.0783, 0.156, 0.313, 0.625, 1.25]
Hyperparameter = dict(C=C, kernel=kernel, gamma=gamma)
#Hyperparameter = {"C": Real(1e-6, 1e+6, prior='log-uniform'), "gamma": Real(1e-6, 1e+1, prior='log-uniform'), "kernel": Categorical(['linear', 'rbf']),}

bayesian = BayesSearchCV(estimator = SVC(), search_spaces = Hyperparameter, cv = 10, random_state=42, n_jobs = -1)
bayesian.fit(np.array(X_scaled), np.array(y))
print(f'Best parameters: {bayesian.best_params_}')
print(f'Best score: {bayesian.best_score_}')

predictions = bayesian.predict(X_test_scaled)
output = pd.DataFrame({'PassengerId': data_test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Best parameters: OrderedDict([('C', 5), ('gamma', 0.0783), ('kernel', 'rbf')])
Best score: 0.8301966292134833
Leaderboard score: 0.78468

Summary

Through feature engineering, a direct model prediction (0.78229) and hyperparameter optimization (0.78468), the accuracy improved by roughly 0.24 percentage points. Only a single model is explored here; combining multiple models will be explored in the next part.
