100 Days of Machine Learning, Day 1: Data Preprocessing

Hello, everyone. I'm xiaok. I've been studying machine learning on my phone for nearly a month and have been exposed to a lot of material, but most of it has stayed at the level of surface familiarity. Starting today, I want to gradually build a deeper understanding of machine learning, and I will post things I find useful on this platform from time to time to share with you (the template content follows Avik Jain). Today I'd like to talk about data preprocessing.

Step 1: import libraries

import numpy as np
import pandas as pd

Step 2: load the dataset

The dataset I chose here is the training set from the rent-prediction beginner competition on the DC competition platform. The link is: https://js.dclab.run/v2/cmptDetail.html?id=361

data = pd.read_csv(r'C:\Users\admin\Desktop\train.csv')
print(data.head())  # Preview the first five rows of the dataset

The output is as follows:
   ID     Location rental mode     Number of bedrooms in the district  ...  Time floor decoration         distance      Label
0   0  118.0   NaN  11.0     1  ...   1   2   NaN  76.416667   5.602716
1   1  100.0   NaN  10.0     1  ...   1   1   NaN  70.916667  16.977929
2   2  130.0   NaN  12.0     2  ...   1   0   NaN  57.250000   8.998302
3   3   90.0   NaN   7.0     3  ...   1   2   NaN  65.833333   5.602716
4   4   31.0   NaN   3.0     2  ...   1   0   NaN        NaN   7.300509
It is not hard to spot some 'NaN' entries in this data. NaN means a null value, i.e. the value is missing. Before we can train on this data, we have to handle these first; this is the most basic part of data preprocessing.
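Before handling the missing values, it helps to count how many each column has. Here is a minimal sketch on a toy DataFrame (the column names are made up for illustration, not taken from train.csv):

```python
import numpy as np
import pandas as pd

# Toy DataFrame standing in for train.csv (hypothetical columns)
df = pd.DataFrame({
    'area': [118.0, 100.0, np.nan],
    'orientation': ['south', np.nan, 'north'],
})

# Count NaN values per column
missing = df.isnull().sum()
print(missing)  # one missing value in each column
```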

Step 3: classify object types

Looking at the data above, we can see that some features, such as the orientation of the house, are not numeric (int) types. We all know that a computer only understands numbers; it doesn't understand Chinese characters. So the first thing we need to do here is separate the string-type features from the numeric-type features.

number_columns = [col for col in data.columns if data[col].dtype != 'object']
category_columns = [col for col in data.columns if data[col].dtype == 'object']
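pandas can also do this split directly with select_dtypes; a sketch equivalent to the list comprehensions above, on a toy DataFrame:

```python
import pandas as pd

# Toy DataFrame with one numeric and one string column (illustrative names)
df = pd.DataFrame({'area': [118.0, 100.0], 'orientation': ['south', 'north']})

# Numeric columns are everything that is not object dtype
number_columns = df.select_dtypes(exclude='object').columns.tolist()
# String columns are stored as object dtype
category_columns = df.select_dtypes(include='object').columns.tolist()

print(number_columns)    # ['area']
print(category_columns)  # ['orientation']
```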

Step 4: convert object type

Here I use sklearn's LabelEncoder(). It's very simple and takes only a few lines of code.

from sklearn.preprocessing import LabelEncoder  # import the encoder for string features

le = LabelEncoder()
for col in category_columns:
    # astype(str) avoids a TypeError when a column mixes strings and NaN
    data[col] = le.fit_transform(data[col].astype(str))

Isn't that simple? Just a few lines. But if you want to improve your score, these lines alone won't be enough QAQ
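To see what fit_transform actually does, here is a sketch on a toy column: LabelEncoder sorts the distinct values and maps each one to its index in that sorted order.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Encode a small list of orientation strings (made-up sample values)
codes = le.fit_transform(['south', 'north', 'south', 'east'])

print(list(le.classes_))  # ['east', 'north', 'south']
print(list(codes))        # [2, 1, 2, 0]
```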

Step 5: fill missing values

In my view, the most critical step in data preprocessing is filling in the missing values! I won't say much here; straight to the code.

for item in data.columns:
    # Check the column's dtype (the column name itself is always a string)
    if data[item].dtype == 'object':
        # String feature: fill with the placeholder 'None'
        if data[item].isnull().sum() > 0:
            data[item].fillna('None', inplace=True)
    else:
        # Numeric feature: fill with the column median
        if data[item].isnull().sum() > 0:
            data[item].fillna(data[item].median(), inplace=True)

print(data.head())

Before filling:
   ID     Location rental mode     Number of bedrooms in the district  ...  Time floor decoration         distance      Label
0   0  118.0   NaN  11.0     1  ...   1   2   NaN  76.416667   5.602716
1   1  100.0   NaN  10.0     1  ...   1   1   NaN  70.916667  16.977929
2   2  130.0   NaN  12.0     2  ...   1   0   NaN  57.250000   8.998302
3   3   90.0   NaN   7.0     3  ...   1   2   NaN  65.833333   5.602716
4   4   31.0   NaN   3.0     2  ...   1   0   NaN        NaN   7.300509


After filling:
   ID     Location rental mode     Number of bedrooms in the district  ...  Time floor decoration         distance      Label
0   0  118.0  None  11.0     1  ...   1   2  None  76.416667   5.602716
1   1  100.0  None  10.0     1  ...   1   1  None  70.916667  16.977929
2   2  130.0  None  12.0     2  ...   1   0  None      57.25   8.998302
3   3   90.0  None   7.0     3  ...   1   2  None  65.833333   5.602716
4   4   31.0  None   3.0     2  ...   1   0  None       None   7.300509

See the change? Here I use a for loop to traverse each feature and handle it one by one.

Step 6: split the dataset

from sklearn.model_selection import train_test_split

X = data.drop('Label', axis=1)  # features: everything except the target column
Y = data['Label']               # target: the Label column shown above
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

For test_size, I personally prefer 0.3.

Description of the main parameters:
train_data: the sample set to be split
train_target: the labels of the sample set to be split
test_size: if a float, the proportion of the test set; if an integer, the number of test samples
random_state: the seed of the random number generator

Step 7: feature scaling

In a dataset, features often differ greatly in scale, which is bad for training, so we need to scale the features to a common range.

from sklearn.preprocessing import StandardScaler

data_s = StandardScaler()
data = data_s.fit_transform(data)  # returns a NumPy array of standardized values

StandardScaler standardizes each feature to zero mean and unit variance. This prevents two feature columns with very different value ranges from slowing down convergence during training.
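A quick check on toy data that StandardScaler really produces zero mean and unit variance per column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (made-up values)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]
```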

 

Part of this article is adapted from Avik Jain's machine learning series.

 

Friends with similar interests are welcome to add me on WeChat to chat.

 

Keywords: Machine Learning

Added by Mark.P.W on Tue, 01 Feb 2022 15:41:43 +0200