Hello, everyone. I'm xiaok. I've been learning machine learning for nearly a month now, and I've been exposed to a lot of it, but most of that knowledge has stayed at the level of basic awareness. Starting today, I want to gradually build a deeper understanding of machine learning, and from time to time I will share knowledge I find useful on this platform (the template content follows Avik Jain). Today, I'd like to talk about data preprocessing.
Step 1: import the libraries
import numpy as np
import pandas as pd
Step 2: load the dataset
The dataset I selected here is the training set of the rent-prediction beginner competition on the DC competition platform. The link is as follows: https://js.dclab.run/v2/cmptDetail.html?id=361
data = pd.read_csv(r'C:\Users\admin\Desktop\train.csv')
print(data.head())  # review the first five rows of the dataset

The output is as follows:

   ID  Location  rental mode  bedrooms  district  ...  Time  floor  decoration   distance      Label
0   0     118.0          NaN      11.0         1  ...     1      2         NaN  76.416667   5.602716
1   1     100.0          NaN      10.0         1  ...     1      1         NaN  70.916667  16.977929
2   2     130.0          NaN      12.0         2  ...     1      0         NaN  57.250000   8.998302
3   3      90.0          NaN       7.0         3  ...     1      2         NaN  65.833333   5.602716
4   4      31.0          NaN       3.0         2  ...     1      0         NaN        NaN   7.300509
It is not difficult to see that there are some 'NaN' values in the data. NaN means a null value; in other words, that entry is missing. Before we can train on this data, we have to deal with those missing values first. This is the most basic part of data preprocessing.
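Before filling anything, it helps to count how many values are missing in each column. A minimal sketch on a toy frame (the column names `rent` and `mode` are made up for illustration; the real train.csv has its own columns):

```python
import numpy as np
import pandas as pd

# A tiny stand-in for train.csv: one numeric and one string column,
# each with a deliberately missing entry.
toy = pd.DataFrame({
    'rent': [118.0, 100.0, np.nan],
    'mode': [np.nan, 'whole', 'shared'],
})

# Count the NaN entries per column before deciding how to fill them.
missing = toy.isnull().sum()
print(missing)
# rent    1
# mode    1
```

A quick look at these counts tells you which columns need filling at all and how severe the problem is.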
Step 3: separate string and numeric columns
Looking at the data above, we can see that some features, such as the orientation of the house, will not be stored as an int type. We all know that a computer only understands numbers; it does not understand Chinese characters. So the first thing we need to do here is separate the string-type (object) columns from the numeric columns.
number_columns = [col for col in data.columns if data[col].dtype != 'object']
category_columns = [col for col in data.columns if data[col].dtype == 'object']
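As an aside, pandas can do the same split directly with `select_dtypes`; a small sketch on a made-up frame (the columns `area` and `district` are invented for the example):

```python
import pandas as pd

toy = pd.DataFrame({
    'area': [50, 80, 65],          # numeric feature
    'district': ['A', 'B', 'A'],   # string feature, stored as object dtype
})

# Equivalent to the two list comprehensions above.
number_columns = toy.select_dtypes(exclude='object').columns.tolist()
category_columns = toy.select_dtypes(include='object').columns.tolist()
print(number_columns, category_columns)  # ['area'] ['district']
```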
Step 4: convert object type
Here I use the LabelEncoder() method. It's very simple and only needs a few lines of code:
from sklearn.preprocessing import LabelEncoder  # import the encoder needed for this step

le = LabelEncoder()
for col in category_columns:
    data[col] = le.fit_transform(data[col])
Isn't that simple? Just a few lines. But if you want to improve your score, these lines alone are not enough QAQ
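To see what LabelEncoder actually does, here is a toy example on a made-up orientation column; `classes_` holds the learned mapping, and categories are numbered in sorted order:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A made-up string feature standing in for something like house orientation.
s = pd.Series(['south', 'north', 'south', 'east'])

le = LabelEncoder()
encoded = le.fit_transform(s)

print(list(le.classes_))  # sorted categories: 'east' -> 0, 'north' -> 1, 'south' -> 2
print(list(encoded))      # [2, 1, 2, 0]
```

One thing to keep in mind: LabelEncoder imposes an arbitrary order on the categories, which tree models tolerate well but linear models may misread as a real ranking.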
Step 5: data filling
The most critical step in data preprocessing is filling in the missing values. I won't say much here; let's go straight to the code.
for item in data.columns:
    if data[item].dtype == 'object':  # string columns: fill missing values with the text 'None'
        if data[item].isnull().sum() > 0:
            data[item].fillna('None', inplace=True)
    else:                             # numeric columns: fill missing values with the column median
        if data[item].isnull().sum() > 0:
            data[item].fillna(data[item].median(), inplace=True)
print(data.head())

Before filling:

   ID  Location  rental mode  bedrooms  district  ...  Time  floor  decoration   distance      Label
0   0     118.0          NaN      11.0         1  ...     1      2         NaN  76.416667   5.602716
1   1     100.0          NaN      10.0         1  ...     1      1         NaN  70.916667  16.977929
2   2     130.0          NaN      12.0         2  ...     1      0         NaN  57.250000   8.998302
3   3      90.0          NaN       7.0         3  ...     1      2         NaN  65.833333   5.602716
4   4      31.0          NaN       3.0         2  ...     1      0         NaN        NaN   7.300509

After filling:

   ID  Location  rental mode  bedrooms  district  ...  Time  floor  decoration   distance      Label
0   0     118.0         None      11.0         1  ...     1      2        None  76.416667   5.602716
1   1     100.0         None      10.0         1  ...     1      1        None  70.916667  16.977929
2   2     130.0         None      12.0         2  ...     1      0        None      57.25   8.998302
3   3      90.0         None       7.0         3  ...     1      2        None  65.833333   5.602716
4   4      31.0         None       3.0         2  ...     1      0        None       None   7.300509
Can you see the change? Here a for loop walks over every column and handles each one according to its type: string (object) columns have their missing values filled with the placeholder string 'None', while numeric columns are filled with the column median.
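The same fill logic on a tiny made-up frame, so the before/after effect is easy to check (the median of [100.0, 130.0] is 115.0; the column names are invented):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'rent': [100.0, np.nan, 130.0],        # numeric -> fill with the median
    'decoration': ['new', np.nan, 'old'],  # string -> fill with 'None'
})

for col in toy.columns:
    if toy[col].isnull().sum() > 0:
        if toy[col].dtype == 'object':
            toy[col] = toy[col].fillna('None')
        else:
            toy[col] = toy[col].fillna(toy[col].median())

print(toy)
```

The median is a common default for numeric gaps because, unlike the mean, it is not dragged around by a few extreme rents.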
Step 6: divide the data set
from sklearn.model_selection import train_test_split

X = data.drop('Label', axis=1)  # features: every column except the target
Y = data['Label']               # target: the 'Label' column shown in the output above
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)

(For test_size I personally prefer to use 0.3.)

Description of the main parameters:
X: the sample (feature) set to be split
Y: the label set of those samples
test_size: the proportion of the data placed in the test set; if an integer is passed, it is the absolute number of test samples
random_state: the seed of the random number generator, which makes the split reproducible
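A quick sketch with made-up data to confirm what the split returns: with 10 samples and test_size=0.2, you get 8 training samples and 2 test samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 toy samples with 3 features each; a fixed random_state makes the split reproducible.
X = np.arange(30).reshape(10, 3)
Y = np.arange(10)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)  # (8, 3) (2, 3)
```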
Step 7: feature scaling
In a dataset, features often differ greatly in scale, which is not good for training, so we need to scale the features to a common range.
from sklearn.preprocessing import StandardScaler

data_s = StandardScaler()
data = data_s.fit_transform(data)
StandardScaler standardizes each feature to zero mean and unit variance. This prevents the value ranges of different feature columns from differing too much, which would otherwise slow down convergence during training.
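A small check with made-up numbers: after fit_transform, every column has mean 0 and standard deviation 1, no matter how different the original scales were:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales, e.g. area in square meters and rent in yuan.
X = np.array([[50.0, 2000.0],
              [80.0, 5000.0],
              [65.0, 3500.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Each column is now centered at 0 with (population) standard deviation 1.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```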
Part of this article is adapted from Avik Jain's machine learning material.
Friends with similar interests are welcome to add me on WeChat to chat.