Kaggle Titanic (1)
Question:
On April 15, 1912, the Titanic sank. There were not enough lifeboats for everyone on board, and 1502 of the 2224 passengers and crew died. Although luck played a part in survival, some groups of people appear to have been more likely to survive than others.
Build a prediction model to answer the question "what kind of people were more likely to survive?", using passenger data (name, age, gender, socio-economic class, etc.).
Available datasets:
- Training set (train.csv)
- Test set (test.csv)
Solution 1:
score: 0.78468
Leaderboard: 1700/14296(11.89%)
The solution is detailed below:
1, Data overview and characteristics
Note: pandas is a NumPy-based library built for data analysis tasks.
# Import related libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
# Training data
data_train = pd.read_csv("/kaggle/input/titanic/train.csv")
# Test data
data_test = pd.read_csv("/kaggle/input/titanic/test.csv")
# Check the row count and missing values of each column
# (info() prints its report itself and returns None, so print() is not needed)
data_train.info()
data_test.info()
- This is a binary classification problem, with survival coded as 1 and non-survival as 0, over 1309 records in total (891 in the training set and 418 in the test set)
- The training set provides 11 features: 6 numerical (PassengerId, Pclass, Age, SibSp, Parch, Fare) and 5 text (Name, Sex, Ticket, Cabin, and Embarked, the port of embarkation)
- The prediction target (Survived) is numerical
- Compared with the training set, the test set lacks the Survived column (the target to predict); its Cabin column is present but mostly empty
- PassengerId, Name and Ticket are (nearly) unique to each passenger, so these three carry little predictive value here and are left out of the subsequent analysis; Cabin has a large number of missing values, so it is not used either
- Seven useful features remain: Pclass (passenger class), Sex (gender) and Embarked (port of embarkation) are clearly categorical; Age, SibSp (number of siblings and spouses aboard) and Parch (number of parents and children aboard) can be treated as implicitly categorical; Fare is numerical
- Age and Embarked have missing values that need to be handled
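The missing-value check described in the list above can be sketched as follows. This is a minimal illustration on a small made-up DataFrame (the rows are hypothetical, not taken from train.csv); on the real data the same `isnull().sum()` call would be applied to `data_train` and `data_test`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows that mimic a few Titanic columns (not real data)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count the missing values in each column
missing_counts = toy.isnull().sum()

# Keep only the columns that actually need processing
columns_to_fix = missing_counts[missing_counts > 0].index.tolist()

print(missing_counts)
print(columns_to_fix)
```

Compared with reading the `info()` report by eye, this gives the number of gaps per column directly, which makes it easier to decide between imputing (many gaps) and dropping rows (very few gaps).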
2, Feature engineering
The 7 features to be used:
- Numerical type: Pclass (passenger class), Age, SibSp (number of siblings and spouses aboard), Parch (number of parents and children aboard), Fare
- Text type: Sex, Embarked
1. Processing missing data
1) Use a random forest regressor to impute the missing Age values. The code is as follows:
from sklearn.ensemble import RandomForestRegressor

def set_missing_age(df):
    # Take the numerical features and feed them to the random forest
    age_df = df[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
    # Split passengers into known age and unknown age
    known_age = age_df[age_df.Age.notnull()].values
    unknown_age = age_df[age_df.Age.isnull()].values
    # Target y: the Age column
    y = known_age[:, 0]
    # Feature matrix x: the remaining columns
    x = known_age[:, 1:]
    # Fit the random forest
    rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
    rfr.fit(x, y)
    # Predict the unknown ages with the trained model
    predictedAges = rfr.predict(unknown_age[:, 1:])
    # Fill in the missing values in the original data
    df.loc[(df.Age.isnull()), 'Age'] = predictedAges
    return df

# Fill the missing Age values
data_train = set_missing_age(data_train)
2) Only two rows are missing Embarked (port of embarkation), so those two rows are simply dropped
# Drop the two rows whose Embarked value is missing
data = data_train.drop(data_train[data_train.Embarked.isnull()].index)
2. One-hot encode the categorical features Embarked (port of embarkation), Sex (gender) and Pclass (passenger class), and standardize the numerical features Age and Fare
import sklearn.preprocessing as preprocessing

# One-hot encoding of the categorical features
def set_numeralization(data):
    # Encode the categorical attributes Embarked, Sex and Pclass
    dummies_Embarked = pd.get_dummies(data['Embarked'], prefix='Embarked')
    dummies_Sex = pd.get_dummies(data['Sex'], prefix='Sex')
    dummies_Pclass = pd.get_dummies(data['Pclass'], prefix='Pclass')
    # Put the new attributes together
    df = pd.concat([data, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)
    # Remove the original attributes
    df.drop(['Pclass', 'Sex', 'Embarked'], axis=1, inplace=True)
    return df

# Feature standardization
def set_normalization(df):
    scaler = preprocessing.StandardScaler()
    # fit_transform fits and scales in one step; ravel() flattens the (n, 1) result
    df['Age_scaled'] = scaler.fit_transform(df['Age'].values.reshape(-1, 1)).ravel()
    df['Fare_scaled'] = scaler.fit_transform(df['Fare'].values.reshape(-1, 1)).ravel()
    return df

# Feature engineering
data = set_numeralization(data)
data = set_normalization(data)
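As a small worked check of what the standardization step computes: `StandardScaler` subtracts the column mean and divides by the population standard deviation (the `ddof=0` variant, which is NumPy's default). The sketch below compares it with the formula written out by hand; the ages are made-up values used only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical ages, used only to illustrate the formula
ages = np.array([22.0, 38.0, 26.0, 35.0, 54.0]).reshape(-1, 1)

# Standardize with scikit-learn
scaled_by_sklearn = StandardScaler().fit_transform(ages)

# Same computation by hand: z = (x - mean) / std, with the population std (ddof=0)
scaled_by_hand = (ages - ages.mean()) / ages.std()

print(np.allclose(scaled_by_sklearn, scaled_by_hand))
```

After this transformation each scaled column has mean 0 and unit variance, which matters for an RBF-kernel SVM because the kernel depends on distances between feature vectors.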
Test set feature engineering, code as follows:
# Fill the missing Fare value with the median
data_test['Fare'] = data_test['Fare'].fillna(data_test['Fare'].median())
data_test = set_missing_age(data_test)
data_test = data_test.drop(data_test[data_test.Embarked.isnull()].index)
data_test = set_numeralization(data_test)
data_test = set_normalization(data_test)
3. Check whether the training set and test set are complete after feature engineering
data_test.info()
data.info()
3, Build the model and predict
1) Predict with a support vector machine (SVM); the leaderboard accuracy is 0.78229
#from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.preprocessing import StandardScaler

y = data["Survived"]
features = ["Pclass_1", "Pclass_2", "Pclass_3", "Sex_male", "Sex_female", "SibSp", "Parch",
            "Age", "Fare", "Embarked_C", "Embarked_Q", "Embarked_S"]
X = pd.get_dummies(data[features])
X_test = pd.get_dummies(data_test[features])

# Fit the scaler once on the training features, then apply it to both sets
scaler = StandardScaler()
X_scaled = scaler.fit(X).transform(X)
X_test_scaled = scaler.transform(X_test)

# Convert the model inputs back to DataFrames to inspect the standardized data
X_scaled = pd.DataFrame(X_scaled, columns=features)
#X_scaled.head()
X_test_scaled = pd.DataFrame(X_test_scaled, columns=features)
#X_test_scaled.head()

model = svm.SVC(C=3, kernel='rbf', gamma=0.1)
model.fit(X_scaled, y)
predictions = model.predict(X_test_scaled)

output = pd.DataFrame({'PassengerId': data_test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")
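The code above standardizes the features before fitting the SVM. One alternative way to arrange the same two steps, sketched here on made-up data, is a scikit-learn pipeline: it fits the scaler on the training rows only and reuses those statistics at prediction time. The arrays and the names ending in `_demo` are hypothetical stand-ins, not the Titanic features.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the training and test feature matrices
X_train_demo = rng.normal(size=(100, 3))
y_train_demo = (X_train_demo[:, 0] + X_train_demo[:, 1] > 0).astype(int)
X_test_demo = rng.normal(size=(20, 3))

# The scaler is fitted on the training rows only, then reused for the test rows
pipeline = make_pipeline(StandardScaler(), SVC(C=3, kernel="rbf", gamma=0.1))
pipeline.fit(X_train_demo, y_train_demo)

demo_predictions = pipeline.predict(X_test_demo)
print(demo_predictions.shape)
```

A pipeline can also be passed directly to a cross-validated search, in which case the scaler is refitted inside each fold rather than once on the whole training set.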
2) Tune the hyperparameters with grid search and predict again; the leaderboard accuracy is 0.78468
# Optimize the hyperparameters C, kernel and gamma through grid search
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

model = SVC()
C = [1, 2, 5, 10, 20, 50]
kernel = ['rbf', 'sigmoid']
#gamma = [0.001,0.01,0.1,1,10,100]
gamma = [0.0195, 0.039, 0.0783, 0.156, 0.313, 0.625, 1.25]
# Wrap the hyperparameter ranges into a dictionary
Hyperparameter = dict(C=C, kernel=kernel, gamma=gamma)

# Grid search, cv score 0.8302
grs = GridSearchCV(model, param_grid=Hyperparameter, cv=10, n_jobs=1, return_train_score=False)
grs.fit(np.array(X_scaled), np.array(y))

# Output the best hyperparameters
print("Best parameters " + str(grs.best_params_))
#print(f'Best parameters: {grs.best_params_}')
#print(f'Best score: {grs.best_score_}')
gpd = pd.DataFrame(grs.cv_results_)
print("Estimated accuracy of this model for unseen data:{0:1.4f}".format(gpd['mean_test_score'][grs.best_index_]))

# Predict with the tuned SVM
predictions = grs.predict(X_test_scaled)
output = pd.DataFrame({'PassengerId': data_test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")
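To get a feel for the cost of this grid search, the number of candidate settings can be counted with `ParameterGrid`. The sketch below reuses the value lists from the code above; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "C": [1, 2, 5, 10, 20, 50],
    "kernel": ["rbf", "sigmoid"],
    "gamma": [0.0195, 0.039, 0.0783, 0.156, 0.313, 0.625, 1.25],
}

# Every combination of C, kernel and gamma is one candidate: 6 * 2 * 7
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is trained once per cross-validation fold
n_folds = 10
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

With 84 candidates and 10 folds the search trains 840 models during cross-validation, which is still cheap on 889 rows but grows multiplicatively with every value added to a list.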
3) Tune the hyperparameters with random search and predict again
# Random search. With the same parameter lists as the grid search, the cv score is unstable:
# it reaches 0.8302 only about once in 10 runs
# Random search is better suited to exploring a wider range
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

model = SVC()
C = [1, 2, 5, 10, 20, 50]
kernel = ['rbf', 'sigmoid']
#gamma = [0.001,0.01,0.1,1,10,100]
gamma = [0.0195, 0.039, 0.0783, 0.156, 0.313, 0.625, 1.25]
# Wrap the hyperparameter ranges into a dictionary
Hyperparameter = dict(C=C, kernel=kernel, gamma=gamma)
#Hyperparameter = {"C": stats.uniform(500, 1500),"gamma": stats.uniform(0, 1),'kernel': ('linear', 'rbf')}

# Named random_search so that it does not shadow Python's random module
random_search = RandomizedSearchCV(estimator=model, param_distributions=Hyperparameter, cv=10, random_state=42, n_jobs=-1)
random_search.fit(np.array(X_scaled), np.array(y))
print(f'Best parameters: {random_search.best_params_}')
print(f'Best score: {random_search.best_score_}')

predictions = random_search.predict(X_test_scaled)
output = pd.DataFrame({'PassengerId': data_test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")
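One plausible reason for the unstable cv score noted in the comments: `RandomizedSearchCV` evaluates only `n_iter` candidates, and its default is `n_iter=10`, which the code above does not override. The sketch below uses `ParameterSampler` to show how small a share of the 84-combination grid that is. It is an illustration of the sampling step only, not of the full search, and the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_grid = {
    "C": [1, 2, 5, 10, 20, 50],
    "kernel": ["rbf", "sigmoid"],
    "gamma": [0.0195, 0.039, 0.0783, 0.156, 0.313, 0.625, 1.25],
}

# Draw 10 candidates, the same number RandomizedSearchCV tries by default
sampled = list(ParameterSampler(param_grid, n_iter=10, random_state=42))

# Size of the full grid that the grid search would cover
total = len(ParameterGrid(param_grid))
coverage = len(sampled) / total

print(len(sampled), total, round(coverage, 3))
```

Because only about one candidate in eight is tried, a run finds the grid-search optimum only when that particular combination happens to be drawn; raising `n_iter` trades run time for a better chance of hitting it.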
4) Hyperparameter tuning with a genetic algorithm (GA)
The result of the genetic algorithm (GA) is also random. After several runs, the model with the highest accuracy on the held-out 20% split of the training data (0.848314606741573) is selected for prediction, and the final leaderboard accuracy is 0.78468
# Genetic algorithm (GA) optimization of the hyperparameters
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split

tpot_config = {
    'sklearn.svm.SVC': {
        'C': [1, 2, 5, 10, 20, 50],
        'kernel': ['rbf', 'sigmoid'],
        'gamma': [0.0195, 0.039, 0.0783, 0.156, 0.313, 0.625, 1.25]
    }
}

# Hold out 20% of the training data to score the evolved pipeline
X = X_scaled
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

# Without restricting the model, the final result is a random forest classifier
#tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, n_jobs=-1)
# Restrict the search to SVM
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, n_jobs=-1, config_dict=tpot_config)
tpot.fit(X_train, y_train)
print(tpot.score(X_val, y_val))

# Repeat the optimization until the following result is obtained
# Best pipeline: SVC(SVC(input_matrix, C=20, gamma=0.625, kernel=sigmoid), C=2, gamma=0.039, kernel=rbf)
# 0.848314606741573
predictions = tpot.predict(X_test_scaled)
output = pd.DataFrame({'PassengerId': data_test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")
5) Hyperparameter tuning with Bayesian optimization
# BayesSearchCV is provided by scikit-optimize (skopt), not by scikit-learn
from skopt import BayesSearchCV
from sklearn.svm import SVC

model = SVC()
C = [1, 2, 5, 10, 20, 50]
kernel = ['rbf', 'sigmoid']
#gamma = [0.001,0.01,0.1,1,10,100]
gamma = [0.0195, 0.039, 0.0783, 0.156, 0.313, 0.625, 1.25]
Hyperparameter = dict(C=C, kernel=kernel, gamma=gamma)
# The continuous alternative below also needs: from skopt.space import Real, Categorical
#Hyperparameter = {"C": Real(1e-6, 1e+6, prior='log-uniform'), "gamma": Real(1e-6, 1e+1, prior='log-uniform'), "kernel": Categorical(['linear', 'rbf']),}

bayesian = BayesSearchCV(estimator=SVC(), search_spaces=Hyperparameter, cv=10, random_state=42, n_jobs=-1)
bayesian.fit(np.array(X_scaled), np.array(y))
print(f'Best parameters: {bayesian.best_params_}')
print(f'Best score: {bayesian.best_score_}')

predictions = bayesian.predict(X_test_scaled)
output = pd.DataFrame({'PassengerId': data_test.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")
Best parameters: OrderedDict([('C', 5), ('gamma', 0.0783), ('kernel', 'rbf')])
Best score: 0.8301966292134833
Leaderboard score: 0.78468
Summary
Going from feature engineering plus direct model prediction (0.78229) to hyperparameter optimization (0.78468) improved the accuracy by about 0.2 percentage points. Only a single model (SVM) was explored here; other models will be explored next.
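The size of that improvement can be checked with a little arithmetic. The sketch below uses the two leaderboard scores quoted above; the conversion to a number of passengers assumes the score is computed over all 418 test rows, which is an assumption and not something stated in this post.

```python
baseline_score = 0.78229  # SVM with hand-picked hyperparameters
tuned_score = 0.78468     # SVM after hyperparameter search

# Difference expressed in percentage points
gain_points = (tuned_score - baseline_score) * 100

# Assumption: the score is the accuracy over all 418 test rows
n_test_rows = 418
extra_correct = round((tuned_score - baseline_score) * n_test_rows)

print(f"{gain_points:.2f} percentage points, about {extra_correct} extra correct prediction(s)")
```

Under that assumption the gain corresponds to roughly one additional passenger classified correctly, which helps explain why the different search strategies all land on the same leaderboard score.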