Kaggle House Price Prediction Competition: Data Processing and Feature Selection



House price prediction is an introductory competition on Kaggle. You are given 79 features describing each house and must predict its sale price from them. The evaluation metric is the root mean squared error (RMSE) between the logarithm of the predicted price and the logarithm of the actual price:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log \hat{y}_i - \log y_i\right)^2}$$
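As a sanity check, the metric can be computed by hand. A minimal sketch, assuming y_true and y_pred are arrays of actual and predicted prices (both names are illustrative):

import numpy as np

def rmse_log(y_true, y_pred):
    # RMSE of the log-transformed prices, matching the competition metric
    return np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2))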

1. Exploratory Data Analysis

First, read the data with the pandas module:

import pandas as pd
import numpy as np  # needed later for np.mean, np.log1p and np.exp

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

Display the first few rows of the training set and test set:

train.head()

test.head()
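Beyond the first rows, a quick sketch using standard pandas calls shows the mix of column types and the extent of the missing values:

train.dtypes.value_counts()  # how many object (string) vs. numeric columns
train.isnull().sum().sort_values(ascending=False).head(10)  # columns with the most nulls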


The features include both discrete and continuous values, and many columns contain missing values. Because there are so many features, we visualize SalePrice against just the year of construction.

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10,8))
sns.boxplot(x=train.YearBuilt, y=train.SalePrice)


Judging by the year of construction alone, the general trend is that older houses are cheaper and newer ones are more expensive.

2. Data Cleaning

Outlier Removal

First, plot SalePrice against the above-ground living area (GrLivArea) to see whether the relationship is linear.

plt.figure(figsize=(12,6))
plt.scatter(x=train.GrLivArea, y=train.SalePrice)  # look for a linear relationship
plt.xlabel("GrLivArea", fontsize=13)
plt.ylabel("SalePrice", fontsize=13)
plt.ylim(0,800000)


The scatter plot shows that, broadly, the larger the living area, the higher the price. However, the two points in the lower right corner violate this relationship and need to be removed.

# Remove data with GrLivArea > 4000 and SalePrice < 300000
train.drop(train[(train["GrLivArea"]>4000)&(train["SalePrice"]<300000)].index,inplace=True)
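To confirm the removal, the same scatter plot can be redrawn:

plt.figure(figsize=(12,6))
plt.scatter(x=train.GrLivArea, y=train.SalePrice)  # the two bottom-right outliers should now be gone
plt.xlabel("GrLivArea", fontsize=13)
plt.ylabel("SalePrice", fontsize=13)
plt.show()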

Data Processing

For convenience, the training set and test set are processed together.

full = pd.concat([train,test],ignore_index=True)

First remove the Id column, which is useless for prediction.

full.drop("Id",axis=1,inplace=True)
print(full.shape)  # (2917, 80)

The combined training and test sets contain 2917 samples. Each sample has 79 features; because the training rows also carry the SalePrice label, 80 columns are shown here.

Null Value Filling and Deleting

As noted before, many features have missing values, which we now need to handle. First, look at which features contain nulls, sorted from fewest to most.

miss = full.isnull().sum()  # count the nulls in each column
miss[miss>0].sort_values(ascending=True)  # sort from fewest to most
GarageArea         1
SaleType           1
KitchenQual        1
BsmtFinSF1         1
BsmtFinSF2         1
GarageCars         1
TotalBsmtSF        1
Exterior2nd        1
Exterior1st        1
BsmtUnfSF          1
Electrical         1
Functional         2
Utilities          2
BsmtHalfBath       2
BsmtFullBath       2
MSZoning           4
MasVnrArea        23
MasVnrType        24
BsmtFinType1      79
BsmtFinType2      80
BsmtQual          81
BsmtCond          82
BsmtExposure      82
GarageType       157
GarageYrBlt      159
GarageFinish     159
GarageCond       159
GarageQual       159
LotFrontage      486
FireplaceQu     1420
SalePrice       1459
Fence           2346
Alley           2719
MiscFeature     2812
PoolQC          2908
dtype: int64

Because the features include both string and numeric types, they are handled separately.
The following string-type features are filled with the string "None":

cols1 = ["PoolQC" , "MiscFeature", "Alley", "Fence", "FireplaceQu", "GarageQual", "GarageCond", "GarageFinish", "GarageYrBlt", "GarageType", "BsmtExposure", "BsmtCond", "BsmtQual", "BsmtFinType2", "BsmtFinType1", "MasVnrType"]
for col in cols1:
    full[col].fillna("None",inplace=True)

The following numeric features are filled with 0:

cols=["MasVnrArea", "BsmtUnfSF", "TotalBsmtSF", "GarageCars", "BsmtFinSF2", "BsmtFinSF1", "GarageArea"]
for col in cols:
    full[col].fillna(0, inplace=True)

Fill the nulls in LotFrontage with the mean of the column:

full["LotFrontage"].fillna(np.mean(full["LotFrontage"]),inplace=True)
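A common refinement, not used in this post, fills LotFrontage with the median within each Neighborhood instead of the global mean, since frontage is strongly tied to location. A sketch that would replace the line above:

# Alternative: neighborhood-wise median fill for LotFrontage
full["LotFrontage"] = full.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median()))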

Fill the following columns with their mode (most frequent value):

cols2 = ["MSZoning", "BsmtFullBath", "BsmtHalfBath", "Utilities", "Functional", "Electrical", "KitchenQual", "SaleType","Exterior1st", "Exterior2nd"]
for col in cols2:
    full[col].fillna(full[col].mode()[0], inplace=True)

Next, check what is still unfilled: only SalePrice has nulls, because the test rows have no label.

full.isnull().sum()[full.isnull().sum()>0]  # only the label column remains null
SalePrice    1459
dtype: int64

3. Data Preprocessing - Converting String Features to Numeric

Now convert the string features into numeric ones. LabelEncoder and get_dummies are the usual tools for this.

from sklearn.preprocessing import LabelEncoder #Label coding

lab = LabelEncoder()
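On a toy column, the two approaches differ as follows (an illustrative sketch; s is a made-up Series):

s = pd.Series(["Ex", "Gd", "None", "Gd"])
lab.fit_transform(s)  # array([0, 1, 2, 1]): one integer code per category
pd.get_dummies(s)     # three 0/1 indicator columns, one per category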

Apply label encoding to the following features (using a loop rather than one call per column):

cols3 = ["Alley", "PoolQC", "MiscFeature", "Fence", "FireplaceQu", "GarageQual",
         "GarageCond", "GarageFinish", "GarageYrBlt", "GarageType", "BsmtExposure",
         "BsmtCond", "BsmtQual", "BsmtFinType2", "BsmtFinType1", "MasVnrType",
         "MSZoning", "BsmtFullBath", "BsmtHalfBath", "Utilities", "Functional",
         "Electrical", "KitchenQual", "SaleType", "Exterior1st", "Exterior2nd"]
for col in cols3:
    # astype(str) guards against mixed types (e.g. GarageYrBlt contains years and "None")
    full[col] = lab.fit_transform(full[col].astype(str))

In this way, most of the features are converted to numerical types.

full.head()


Finally, delete the SalePrice column (the label):

full.drop("SalePrice",axis=1,inplace=True) #delete
full.shape  # (2917, 79)

4. Pipeline Construction

A pipeline makes it easy to chain feature transformations together and to redo the feature engineering later in the workflow.
First, construct the transformer classes:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, make_pipeline #Building pipelines
from scipy.stats import skew #skewness


class labelenc(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    
    def fit(self,X,y=None):
        return self
    # Label-encode the year columns (and BldgType); more columns can be added here
    def transform(self,X):
        lab=LabelEncoder()
        X["YearBuilt"] = lab.fit_transform(X["YearBuilt"])
        X["YearRemodAdd"] = lab.fit_transform(X["YearRemodAdd"])
        X["GarageYrBlt"] = lab.fit_transform(X["GarageYrBlt"])
        X["BldgType"] = lab.fit_transform(X["BldgType"])
        
        return X

class skew_dummies(BaseEstimator, TransformerMixin):
    def __init__(self,skew=0.5):  # skewness threshold
        self.skew = skew
    
    def fit(self,X,y=None):
        return self
    
    def transform(self,X):
        # keep only the numeric columns (exclude object dtype)
        X_numeric=X.select_dtypes(exclude=["object"])
        # compute the skewness of each numeric column
        skewness = X_numeric.apply(lambda x: skew(x))
        # indices of columns whose absolute skewness reaches the threshold
        skewness_features = skewness[abs(skewness) >= self.skew].index
        # log-transform them so they are closer to a normal distribution
        X[skewness_features] = np.log1p(X[skewness_features])
        # one-hot encode the remaining object columns
        X = pd.get_dummies(X)
        return X
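To see what the log1p transform does, check one right-skewed column (GrLivArea is used as an example):

skew(full["GrLivArea"])            # clearly positive: right-skewed
skew(np.log1p(full["GrLivArea"]))  # much closer to 0 after the transform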

Build the pipeline:

pipe = Pipeline([  # run the two transformers in order
    ('labenc', labelenc()),
    ('skew_dummies', skew_dummies(skew=2)),
    ])

Run the data through the pipeline:

full2 = full.copy()
pipeline_data = pipe.fit_transform(full2)
pipeline_data.shape  # (2917, 178)

Split the processed data back into a training set and a test set:

from sklearn.preprocessing import StandardScaler

n_train = train.shape[0]  # number of rows in the training set
X = pipeline_data[:n_train]       # processed training features
test_X = pipeline_data[n_train:]  # the remaining rows form the test set
y = train.SalePrice
y_log = np.log(train.SalePrice)   # log target: closer to normal and matches the metric

scaler = StandardScaler().fit(X)          # fit the scaler on the training set only
X_scaled = scaler.transform(X)
test_X_scaled = scaler.transform(test_X)  # reuse the training statistics, avoiding leakage

5. Feature Selection - Selection Based on Feature Importance

Fit a Lasso model on the training set and use its coefficients as feature importances:

from sklearn.linear_model import Lasso

lasso=Lasso(alpha=0.001)
lasso.fit(X_scaled,y_log)

FI_lasso = pd.DataFrame({"Feature Importance":lasso.coef_}, index=pipeline_data.columns)

FI_lasso.sort_values("Feature Importance",ascending=False)  # sort from high to low
FI_lasso[FI_lasso["Feature Importance"]!=0].sort_values("Feature Importance").plot(kind="barh",figsize=(15,25))
plt.xticks(rotation=90)
plt.show()  # display the non-zero coefficients


With the feature-importance plot in hand, we can select features and construct new ones.
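One option for the selection step, not used in the original flow, is sklearn's SelectFromModel, which keeps only the features whose Lasso coefficients are effectively non-zero. A sketch reusing the fitted model above:

from sklearn.feature_selection import SelectFromModel

sfm = SelectFromModel(lasso, prefit=True)  # reuse the already-fitted Lasso
X_selected = sfm.transform(X_scaled)       # drops near-zero-coefficient features
print(X_selected.shape)

The post itself instead constructs new combined features, using the transformer below.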

class add_feature(BaseEstimator, TransformerMixin):  # custom transformer that adds combined features
    def __init__(self,additional=1):
        self.additional = additional
    
    def fit(self,X,y=None):
        return self
    
    def transform(self,X):
        if self.additional==1:
            X["TotalHouse"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"]
            X["TotalArea"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"]
        else:
            X["TotalHouse"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"]
            X["TotalArea"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"]
            
            X["+_TotalHouse_OverallQual"] = X["TotalHouse"] * X["OverallQual"]
            X["+_GrLivArea_OverallQual"] = X["GrLivArea"] * X["OverallQual"]
            X["+_oMSZoning_TotalHouse"] = X["MSZoning"] * X["TotalHouse"]
            X["+_oMSZoning_OverallQual"] = X["MSZoning"] + X["OverallQual"]
            X["+_oMSZoning_YearBuilt"] = X["MSZoning"] + X["YearBuilt"]
            X["+_oNeighborhood_TotalHouse"] = X["Neighborhood"] * X["TotalHouse"]
            X["+_oNeighborhood_OverallQual"] = X["Neighborhood"] + X["OverallQual"]
            X["+_oNeighborhood_YearBuilt"] = X["Neighborhood"] + X["YearBuilt"]
            X["+_BsmtFinSF1_OverallQual"] = X["BsmtFinSF1"] * X["OverallQual"]
            
            X["-_oFunctional_TotalHouse"] = X["Functional"] * X["TotalHouse"]
            X["-_oFunctional_OverallQual"] = X["Functional"] + X["OverallQual"]
            X["-_LotArea_OverallQual"] = X["LotArea"] * X["OverallQual"]
            X["-_TotalHouse_LotArea"] = X["TotalHouse"] + X["LotArea"]
            X["-_oCondition1_TotalHouse"] = X["Condition1"] * X["TotalHouse"]
            X["-_oCondition1_OverallQual"] = X["Condition1"] + X["OverallQual"]
            
            X["Bsmt"] = X["BsmtFinSF1"] + X["BsmtFinSF2"] + X["BsmtUnfSF"]
            X["Rooms"] = X["FullBath"] + X["TotRmsAbvGrd"]
            X["PorchArea"] = X["OpenPorchSF"] + X["EnclosedPorch"] + X["3SsnPorch"] + X["ScreenPorch"]
            X["TotalPlace"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"] + X["OpenPorchSF"] + X["EnclosedPorch"] + X["3SsnPorch"] + X["ScreenPorch"]
        
        # return from both branches, not just the else branch
        return X

Add all feature processing to the pipeline

pipe = Pipeline([
    ('labenc', labelenc()),
    ('add_feature', add_feature(additional=2)),
    ('skew_dummies', skew_dummies(skew=4)),
    ])

full3 = full.copy()
pipeline_data = pipe.fit_transform(full3)
pipeline_data.shape  # (2917, 159)

n_train = train.shape[0]  # number of rows in the training set
X = pipeline_data[:n_train]       # processed training features
test_X = pipeline_data[n_train:]  # remaining rows form the test set
y = train.SalePrice
y_log = np.log(train.SalePrice)   # log target, closer to a normal distribution

scaler = StandardScaler().fit(X)  # again, fit on the training set only
X_scaled = scaler.transform(X)
test_X_scaled = scaler.transform(test_X)

6. Building Decision Tree Model

from sklearn.tree import DecisionTreeRegressor#Import model

model = DecisionTreeRegressor()
model1 = model.fit(X_scaled,y_log)

predict = np.exp(model1.predict(test_X_scaled)) #np.exp is the inverse transformation after the logarithmic transformation above.
result=pd.DataFrame({'Id':test.Id, 'SalePrice':predict})
result.to_csv("submission.csv",index=False)
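Before uploading, the leaderboard score can be estimated locally with cross-validation, since the metric is the RMSE on log prices. A sketch using sklearn's cross_val_score (the choice of 5 folds is arbitrary):

from sklearn.model_selection import cross_val_score

# sklearn returns negative MSE scores; negate and take the square root to get RMSE
rmse = np.sqrt(-cross_val_score(model, X_scaled, y_log, scoring="neg_mean_squared_error", cv=5))
print(rmse.mean())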

Uploading the resulting submission.csv gives a score of about 0.19760.
