# Kaggle Titanic (1)

Question:
On April 15, 1912, the Titanic sank. There were not enough lifeboats for everyone on board, and 1502 of the 2224 passengers and crew died. Although luck played a part in survival, some groups of people seem to have been more likely to survive than others.

Build a prediction model to answer the question "what kind of people are more likely to survive?", using passenger data (name, age, gender, socio-economic class, etc.).

Available datasets:

• Training set (train.csv)
• Test set (test.csv)

Solution 1:

score: 0.78468

The specific steps are as follows:

# 1, Data overview and characteristics

Note: pandas is a NumPy-based tool built for data analysis tasks.

```python
# Import related libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
```
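```python
# Training data (file name as listed under "Available datasets"; adjust the path as needed)
data_train = pd.read_csv('train.csv')
# Test data
data_test = pd.read_csv('test.csv')
# Check the data volume and missing values of each column
print(data_train.info())
print(data_test.info())
```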
1. This is a binary classification problem, with 1 for survived and 0 for did not survive, and 1309 records in total (891 in the training set and 418 in the test set)
2. The training set provides a total of 11 features, including 6 numerical features (PassengerId, Pclass, Age, SibSp, Parch, Fare) and 5 text features (Name, Sex, Ticket, Cabin, and Embarked, the port of embarkation)
3. The prediction target (Survived) is numerical
4. Compared with the training set, the test set lacks the Survived column, which is the label to be predicted; its Cabin values are also mostly missing
5. PassengerId (passenger ID), Name (name) and Ticket (ticket information) mainly identify individual passengers. These three features carry little meaning for the model and are left out of the subsequent analysis; Cabin has a lot of missing data, so this feature is not considered either
6. There are seven useful features left: Pclass (passenger class), Sex (gender) and Embarked (port of embarkation) are obviously categorical data, while Age (age), SibSp (number of siblings and spouses aboard) and Parch (number of parents and children aboard) are implicitly categorical data; Fare is numerical data
7. Age and Embarked have missing values that need to be handled
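
The observations above are read off `info()`. A per-column count of missing values can make the same points easier to see. The following is only a minimal sketch on a small made-up frame (the name `sample` and its values are illustrative assumptions, not rows from train.csv):

```python
import numpy as np
import pandas as pd

# Tiny made-up sample that mimics a few columns of train.csv (values are illustrative only)
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Number and share of missing values per column, largest first
missing = sample.isnull().sum().sort_values(ascending=False)
missing_ratio = (missing / len(sample)).round(2)
print(pd.DataFrame({"missing": missing, "ratio": missing_ratio}))
```

Applied to `data_train` and `data_test`, the same two lines would show which columns need filling and which are too sparse to use.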

# 2, Feature engineering

The 7 features to be used:

• Numerical type: Pclass (passenger class), Age (age), SibSp (number of siblings and spouses aboard), Parch (number of parents and children aboard), Fare (fare)
• Text type: Sex, Embarked

## 1. Processing missing data

1) Use random forest prediction to supplement Age data. The code is as follows:

```python
from sklearn.ensemble import RandomForestRegressor

def set_missing_age(df):
    # Take the numerical features and feed them to the random forest for training
    age_df = df[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
    # Split passengers into those with a known age and those without
    known_age = age_df[age_df.Age.notnull()].values
    unknown_age = age_df[age_df.Age.isnull()].values

    # Target y: the Age column
    y = known_age[:, 0]
    # Features x: the remaining columns
    x = known_age[:, 1:]

    # Fit the random forest
    rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
    rfr.fit(x, y)

    # Use the trained model to predict the unknown ages
    predictedAges = rfr.predict(unknown_age[:, 1:])

    # Fill the predictions into the original data
    df.loc[df.Age.isnull(), 'Age'] = predictedAges

    return df

# Fill missing Age values
data_train = set_missing_age(data_train)
```
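
To see the fill pattern in isolation, here is a small sketch of the same idea on synthetic data. The frame `toy`, the 20% masking rate and the reduced `n_estimators=50` are assumptions chosen only to keep the example quick; they are not part of the original solution.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the passenger table (not real Titanic data)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Age": rng.uniform(1, 70, n),
    "Fare": rng.uniform(5, 100, n),
    "Parch": rng.integers(0, 3, n),
    "SibSp": rng.integers(0, 3, n),
    "Pclass": rng.integers(1, 4, n),
})
original_age = toy["Age"].copy()
# Hide 20% of the ages to simulate missing values
toy.loc[toy.sample(frac=0.2, random_state=0).index, "Age"] = np.nan

missing = toy["Age"].isnull()
predictors = ["Fare", "Parch", "SibSp", "Pclass"]

# Fit on rows with a known age, then predict the rows without one
rfr = RandomForestRegressor(n_estimators=50, random_state=0)
rfr.fit(toy.loc[~missing, predictors], toy.loc[~missing, "Age"])
toy.loc[missing, "Age"] = rfr.predict(toy.loc[missing, predictors])

# Count of ages still missing after the fill
print(toy["Age"].isnull().sum())
```

Only the rows selected by the boolean mask are overwritten; the known ages stay untouched.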

2) Only two rows are missing the port of embarkation, so these two rows are simply deleted

```python
# Delete the two rows whose port of embarkation is missing
data = data_train.drop(data_train[data_train.Embarked.isnull()].index)
```

## 2. One-hot encoding of Embarked (port of embarkation), Sex (gender) and Pclass (passenger class); standardization of the numerical features Age and Fare

```python
import sklearn.preprocessing as preprocessing

# One-hot encoding of the categorical features
def set_numeralization(data):
    # Encode the categorical attributes Embarked, Sex and Pclass as 0/1 indicator columns
    # (dtype=int keeps integer indicators on recent pandas versions, which default to bool)
    dummies_Embarked = pd.get_dummies(data['Embarked'], prefix='Embarked', dtype=int)
    dummies_Sex = pd.get_dummies(data['Sex'], prefix='Sex', dtype=int)
    dummies_Pclass = pd.get_dummies(data['Pclass'], prefix='Pclass', dtype=int)

    # Put the new attributes together
    df = pd.concat([data, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)
    # Remove the old attributes
    df.drop(['Pclass', 'Sex', 'Embarked'], axis=1, inplace=True)
    return df

# Feature standardization
def set_normalization(df):
    # Scale Age and Fare to zero mean and unit variance, each with its own scaler
    df['Age_scaled'] = preprocessing.StandardScaler().fit_transform(df[['Age']]).ravel()
    df['Fare_scaled'] = preprocessing.StandardScaler().fit_transform(df[['Fare']]).ravel()
    return df

# Feature engineering
data = set_numeralization(data)
data = set_normalization(data)
```
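
What the two transformations produce is easiest to see on a few rows. The sketch below uses a made-up frame (`toy` and its values are assumptions for illustration) and passes the `columns` argument of `pd.get_dummies` as a compact alternative to concatenating the dummy frames by hand.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows, only to show the shape of the transformation
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
    "Age": [20.0, 30.0, 40.0, 50.0],
})

# One-hot encoding: one 0/1 indicator column per category value
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=int)
print(list(encoded.columns))

# Standardization: subtract the mean and divide by the standard deviation
scaler = StandardScaler()
encoded["Age_scaled"] = scaler.fit_transform(encoded[["Age"]]).ravel()
print(encoded["Age_scaled"].round(3).tolist())
```

After scaling, the column has mean 0 and (population) standard deviation 1, which matters for distance-based models such as an RBF-kernel SVM.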

Feature engineering for the test set, with the code as follows:

```python
# Fill missing Fare values in the test set with the median
data_test['Fare'] = data_test['Fare'].fillna(data_test['Fare'].median())
data_test = set_missing_age(data_test)
data_test = data_test.drop(data_test[data_test.Embarked.isnull()].index)
data_test = set_numeralization(data_test)
data_test = set_normalization(data_test)
```

## 3. Check whether the training set and test set are complete after feature engineering

```python
print(data_test.info())
print(data.info())
```
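
Besides `info()`, it can help to confirm that both frames expose the same feature columns, since `get_dummies` only creates columns for category values that actually occur in the frame it is given. A sketch with hypothetical frames (`train_raw`, `test_raw` and their values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frames: the test sample happens to contain no 'Q' port
train_raw = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Fare": [7.25, 71.28, 8.05]})
test_raw = pd.DataFrame({"Embarked": ["S", "C"], "Fare": [7.83, 9.00]})

train_X = pd.get_dummies(train_raw, columns=["Embarked"], dtype=int)
test_X = pd.get_dummies(test_raw, columns=["Embarked"], dtype=int)

# Columns present in the training frame but absent from the test frame
missing_cols = sorted(set(train_X.columns) - set(test_X.columns))
print(missing_cols)

# Reindex the test frame to the training columns, filling absent indicators with 0
test_X = test_X.reindex(columns=train_X.columns, fill_value=0)
print(test_X.isnull().sum().sum())
```

If `missing_cols` is empty for the real data, the two frames are already aligned and the reindex step changes nothing.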

# 3, Build the model and predict

1) Predicting with a support vector machine (SVM), the accuracy is 0.78229

```python
#from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.preprocessing import StandardScaler

y = data["Survived"]

features = ["Pclass_1", "Pclass_2", "Pclass_3", "Sex_male", "Sex_female", "SibSp", "Parch", "Age", "Fare", "Embarked_C", "Embarked_Q","Embarked_S"]
X = pd.get_dummies(data[features])
X_test = pd.get_dummies(data_test[features])

# Fit the scaler on the training features only, then apply it to both sets
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_test_scaled = scaler.transform(X_test)

# Convert the model input back to DataFrames to inspect the standardized data
X_scaled = pd.DataFrame(X_scaled, columns=features)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=features)

model = svm.SVC(C=3, kernel='rbf', gamma=0.1)
model.fit(X_scaled, y)
predictions = model.predict(X_test_scaled)

output = pd.DataFrame({'PassengerId': data_test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
```
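
The scale-then-fit steps can also be bundled into a scikit-learn `Pipeline`, so the scaler is always fitted on the training rows and reused unchanged at prediction time. The following is a sketch on synthetic data from `make_classification`, not the Titanic features; the names `X_demo`, `y_demo` and `clf` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary-classification data standing in for the Titanic features
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# The pipeline fits the scaler on the training rows only and reuses it at predict time
clf = make_pipeline(StandardScaler(), SVC(C=3, kernel="rbf", gamma=0.1))
clf.fit(X_tr, y_tr)

print(f"held-out accuracy: {clf.score(X_te, y_te):.3f}")
```

A pipeline like this can be passed directly to the search classes used in the following steps, in which case the scaler is refitted inside each cross-validation fold.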

2) Optimizing the model parameters with grid search and predicting again, the accuracy is 0.78468

```python
# Optimize the hyperparameters C, kernel and gamma through grid search
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
model = SVC()

C=[1,2,5,10,20,50]
kernel = ['rbf', 'sigmoid']
#gamma = [0.001,0.01,0.1,1,10,100]
gamma = [0.0195, 0.039, 0.0783, 0.156, 0.313, 0.625, 1.25]
Hyperparameter = dict(C=C, kernel=kernel, gamma=gamma)  # Wrap the hyperparameter ranges into a dictionary

#Grid search, cv score 0.8302
grs = GridSearchCV(model, param_grid=Hyperparameter, cv = 10, n_jobs=1, return_train_score = False)

grs.fit(np.array(X_scaled), np.array(y))

#Output optimal hyperparameter
print("Best parameters " + str(grs.best_params_))
#print(f'Best parameters: {grs.best_params_}')
#print(f'Best score: {grs.best_score_}')
gpd = pd.DataFrame(grs.cv_results_)
print("Estimated accuracy of this model for unseen data:{0:1.4f}".format(gpd['mean_test_score'][grs.best_index_]))

#The optimized SVM is used for prediction
predictions = grs.predict(X_test_scaled)
output = pd.DataFrame({'PassengerId': data_test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
```
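
The `cv_results_` table used above holds one row per parameter combination, and `best_score_` is simply the `mean_test_score` of the best-ranked row. A small sketch on synthetic data shows this relationship; the tiny grid and the names `search` and `results` are assumptions chosen so that it runs quickly.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data; the grid is deliberately tiny so the sketch runs quickly
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
param_grid = {"C": [1, 10], "gamma": [0.01, 0.1], "kernel": ["rbf"]}

search = GridSearchCV(SVC(), param_grid=param_grid, cv=5)
search.fit(X_demo, y_demo)

# cv_results_ holds one row per parameter combination
results = pd.DataFrame(search.cv_results_)
cols = ["param_C", "param_gamma", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))

# best_score_ is the mean_test_score of the best-ranked row
print(np.isclose(search.best_score_, results.loc[search.best_index_, "mean_test_score"]))
```

Looking at `std_test_score` next to the mean helps judge whether the gap between the top combinations is larger than the fold-to-fold noise.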

3) Optimizing the model parameters with random search and predicting again

```python
# Random search. With the same parameter ranges as the grid search, the cv score is unstable:
# the 0.8302 accuracy is reached only about once in 10 runs.
# Random search can be used to explore the parameter range.
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
model = SVC()

C=[1,2,5,10,20,50]
kernel = ['rbf', 'sigmoid']
#gamma = [0.001,0.01,0.1,1,10,100]
gamma = [0.0195, 0.039, 0.0783, 0.156, 0.313, 0.625, 1.25]
Hyperparameter = dict(C=C, kernel=kernel, gamma=gamma)  # Wrap the hyperparameter ranges into a dictionary
#Hyperparameter = {"C": stats.uniform(500, 1500),"gamma": stats.uniform(0, 1),'kernel': ('linear', 'rbf')}

random = RandomizedSearchCV(estimator = model, param_distributions = Hyperparameter, cv = 10, random_state=42, n_jobs = -1)
random.fit(np.array(X_scaled), np.array(y))
print(f'Best parameters: {random.best_params_}')
print(f'Best score: {random.best_score_}')

predictions = random.predict(X_test_scaled)
output = pd.DataFrame({'PassengerId': data_test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
```
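
Random search is more useful for exploring a range when it samples from continuous distributions instead of fixed lists, as the commented-out `stats.uniform` line above hints. The sketch below uses `scipy.stats.loguniform` on synthetic data; the bounds simply span the lists used above and, like `n_iter=15`, are assumptions rather than tuned values.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic data standing in for the scaled Titanic features
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Continuous log-uniform ranges spanning the lists used above (the bounds are assumptions)
param_distributions = {
    "C": loguniform(1, 50),
    "gamma": loguniform(0.0195, 1.25),
    "kernel": ["rbf", "sigmoid"],
}

# n_iter sets how many random combinations are tried; random_state makes the run repeatable
search = RandomizedSearchCV(SVC(), param_distributions=param_distributions,
                            n_iter=15, cv=5, random_state=42)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 4))
```

Fixing `random_state` makes a single run reproducible; dropping it gives a different sample of combinations on each run, which is the instability described above.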

4) Genetic algorithm (GA) parameter optimization

The result of the genetic algorithm (GA) is also random. After many runs, the model with the highest accuracy on the held-out 20% split of the training data (0.848314606741573) is selected for prediction, and the final accuracy is 0.78468

```python
# Genetic algorithm (GA) optimization of hyperparameters
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split

# Restrict the TPOT search space to SVC with the same parameter ranges as before
tpot_config = {
    'sklearn.svm.SVC': {
        'C': [1,2,5,10,20,50],
        'kernel': ['rbf', 'sigmoid'],
        'gamma': [0.0195, 0.039, 0.0783, 0.156, 0.313, 0.625, 1.25]
    }
}

# Hold out 20% of the training data to score the optimized pipeline
X = X_scaled
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
#The optimization model is not limited, and the final result is the optimal random forest classifier
#tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, n_jobs=-1)
#The optimization model is limited to SVM
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, n_jobs=-1, config_dict=tpot_config)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

#Repeat the optimization until the following results are obtained
# Best pipeline: SVC(SVC(input_matrix, C=20, gamma=0.625, kernel=sigmoid), C=2, gamma=0.039, kernel=rbf)
#     0.848314606741573
predictions = tpot.predict(X_test_scaled)

output = pd.DataFrame({'PassengerId': data_test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
```

5) Parameter optimization with Bayesian optimization

```python
# BayesSearchCV is provided by scikit-optimize (skopt), not by sklearn.model_selection
from skopt import BayesSearchCV
from sklearn.svm import SVC
model = SVC()

C=[1,2,5,10,20,50]
kernel = ['rbf', 'sigmoid']
#gamma = [0.001,0.01,0.1,1,10,100]
gamma = [0.0195, 0.039, 0.0783, 0.156, 0.313, 0.625, 1.25]
Hyperparameter = dict(C=C, kernel=kernel, gamma=gamma)
#Hyperparameter = {"C": Real(1e-6, 1e+6, prior='log-uniform'), "gamma": Real(1e-6, 1e+1, prior='log-uniform'), "kernel": Categorical(['linear', 'rbf']),}

bayesian = BayesSearchCV(estimator = SVC(), search_spaces = Hyperparameter, cv = 10, random_state=42, n_jobs = -1)
bayesian.fit(np.array(X_scaled), np.array(y))
print(f'Best parameters: {bayesian.best_params_}')
print(f'Best score: {bayesian.best_score_}')

predictions = bayesian.predict(X_test_scaled)
output = pd.DataFrame({'PassengerId': data_test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
```