Machine learning - Data Preprocessing

Data preprocessing has its own characteristics: redundant and invalid data must be filtered out according to the data format. Data preprocessing is roughly divided into three steps: data preparation, data transformation and data output. It is not only a basic step in systems engineering but also an effective way to improve the accuracy of an algorithm. In machine learning, therefore, the data should be transformed according to the characteristics of both the algorithm and the data. Here we use scikit-learn to transform the data so that the processed data can be fed to the algorithm, which can also improve the accuracy of the model.
This article mainly introduces the following data transformation methods: rescaling data, standardizing data, normalizing data and binarizing data.

Format data

Format data flow:
1. Import data
2. Sort out the data according to the input and output of the algorithm
3. Format input data
4. Summarize and display the changes to the data
scikit-learn provides two standard methods for formatting data, and each suits particular situations. Data prepared with either method can be used directly to train a model. The methods are as follows:
1. Fit and multiple transform
2. Combined fit and transform
The fit and multiple transform method is recommended as the first choice. First call the fit() function to compute the parameters of the transformation, then call the transform() function to preprocess the data. The combined fit and transform method is convenient when the data is only being plotted or summarized.
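
The two approaches above can be sketched as follows. This is a minimal illustration rather than part of the original walkthrough: the arrays `X_train` and `X_test` are hypothetical toy data, and MinMaxScaler is used only as a convenient example transformer.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

# Approach 1: fit once, then transform as many datasets as needed.
scaler = MinMaxScaler().fit(X_train)      # learns the per-column min and max
train_scaled = scaler.transform(X_train)
test_scaled = scaler.transform(X_test)    # reuses the min/max learned from X_train

# Approach 2: fit and transform the same data in a single call.
train_scaled_combined = MinMaxScaler().fit_transform(X_train)

print(train_scaled)
print(test_scaled)  # [[0.5 1.5]]: values outside the training range fall outside [0, 1]
```

Fitting separately is what allows the same parameters to be reused on new data, which is why it is usually preferred when training and evaluating a model.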

Adjust data scale

If the attributes of a dataset are measured on different scales, rescaling them so that all attributes share the same scale makes training machine learning models much easier. This method usually rescales every attribute to values between 0 and 1, which is very useful for gradient descent algorithms and plays an important role in improving the accuracy of regression algorithms, neural networks and the K-nearest neighbor algorithm.
In statistics, measurement scales are divided into four levels according to how precisely they describe things: nominal, ordinal, interval and ratio. The nominal scale measures the category of things, grouping or classifying them by their attributes. The ordinal scale measures the rank or order of things, so they can be compared or ranked. The interval and ratio scales measure the distance between categories or ranks. An interval scale can not only distinguish and order things but also state exactly how large the gap between them is. A ratio scale goes one step further: unlike an interval scale, it has a fixed absolute zero point. Because the two scales rarely differ in most statistical analyses, they are often not strictly distinguished.
You can rescale data with the MinMaxScaler class. Bringing data measured in different units onto the same scale helps with classifying or grouping things. MinMaxScaler scales each attribute to a specified range (0 to 1 by default); it does not center the data around 0 with a variance of 1, which is what standardization does. A uniform data scale usually improves the accuracy of distance-based algorithms (such as the K-nearest neighbor algorithm). Here is an example of rescaling data.

# Rescale data to the range [0, 1]
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import MinMaxScaler
# Import data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
# Split the data into input features (X) and output labels (Y)
array = data.values
X = array[:, 0:8]
Y = array[:, 8]
transformer = MinMaxScaler(feature_range=(0, 1))
# Fit the scaler and transform the data in one step
newX = transformer.fit_transform(X)
# Set the print precision and display the result
set_printoptions(precision=3)
print(newX)
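
To make the rescaling concrete, the sketch below reproduces the min-max formula (x - min) / (max - min) by hand with NumPy and compares it with MinMaxScaler. The small array `X` is made-up illustration data, not the Pima dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: each column is one attribute.
X = np.array([[1.0, 50.0], [5.0, 60.0], [9.0, 100.0]])

# Manual min-max rescaling, computed column by column.
col_min = X.min(axis=0)
col_max = X.max(axis=0)
manual = (X - col_min) / (col_max - col_min)

# The same transformation using scikit-learn.
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

print(manual)
# [[0.  0. ]
#  [0.5 0.2]
#  [1.  1. ]]
```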

Standardize data

Standardization is an effective way to process data that follows a Gaussian distribution. The output has a mean of 0 and a standard deviation of 1, and serves as input to algorithms that assume Gaussian-distributed data. These algorithms include linear regression, logistic regression and linear discriminant analysis. Here, you can use the StandardScaler class provided by scikit-learn to standardize the data.

# Standardize data (mean 0, standard deviation 1)
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import StandardScaler
# Import data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
# Split the data into input features (X) and output labels (Y)
array = data.values
X = array[:, 0:8]
Y = array[:, 8]
# Compute the per-column mean and standard deviation
transformer = StandardScaler().fit(X)
# Transform the data
newX = transformer.transform(X)
# Set the print precision and display the result
set_printoptions(precision=3)
print(newX)
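
As a quick check of what StandardScaler computes, the following sketch applies (x - mean) / std to each column by hand and compares the result with the class. The array `X` is hypothetical toy data. Note that StandardScaler uses the population standard deviation, which matches NumPy's default `ddof=0`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: each column is one attribute.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

# Manual standardization per column (np.std defaults to ddof=0).
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transformation using scikit-learn.
standardized = StandardScaler().fit_transform(X)

# Each column of the result has mean 0 and standard deviation 1
# (up to floating-point rounding).
print(standardized.mean(axis=0))
print(standardized.std(axis=0))
```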

Normalize data

Normalization rescales each row of data so that its vector length is 1 (a unit vector in linear algebra). It is suitable for sparse data (data containing many zeros). Normalized data can improve the accuracy of neural networks, which use weighted inputs, and of the K-nearest neighbor algorithm, which uses distances. It is implemented using the Normalizer class in scikit-learn.

# Normalize data (each row scaled to unit length)
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import Normalizer
# Import data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
# Split the data into input features (X) and output labels (Y)
array = data.values
X = array[:, 0:8]
Y = array[:, 8]
transformer = Normalizer().fit(X)
# Transform the data
newX = transformer.transform(X)
# Set the print precision and display the result
set_printoptions(precision=3)
print(newX)
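
The row-wise nature of this transformation is easy to miss, so here is a small sketch on hypothetical toy data. It divides each row by its Euclidean (L2) length, which is the default norm of Normalizer, and confirms that every resulting row has length 1.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical toy data: each row is one sample.
X = np.array([[3.0, 4.0], [0.0, 5.0], [1.0, 0.0]])

# Manual normalization: divide each row by its L2 norm.
row_norms = np.linalg.norm(X, axis=1, keepdims=True)
manual = X / row_norms

# The same transformation using scikit-learn (norm='l2' is the default).
normalized = Normalizer().fit_transform(X)

print(normalized)
# [[0.6 0.8]
#  [0.  1. ]
#  [1.  0. ]]
```

Unlike rescaling and standardization, which work column by column, this operates on each sample independently, so fit() has nothing to learn from the data.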

Binarize data

Binarization converts data into binary values: values greater than the threshold are set to 1, and values less than or equal to the threshold are set to 0. This process is called binarizing, or thresholding, the data. It is useful when you need crisp values or want to add new attributes during feature engineering. It is implemented using the Binarizer class in scikit-learn.

# Binarize data
from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import Binarizer
# Import data
filename = 'pima_data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
# Split the data into input features (X) and output labels (Y)
array = data.values
X = array[:, 0:8]
Y = array[:, 8]
# Values greater than the threshold become 1, all others become 0
transformer = Binarizer(threshold=0.0).fit(X)
# Transform the data
newX = transformer.transform(X)
# Set the print precision and display the result
set_printoptions(precision=3)
print(newX)
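
The effect of the threshold can be seen on a small example. The array `X` and the threshold of 2.0 below are arbitrary values chosen for illustration; note that a value exactly equal to the threshold maps to 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical toy data, including a value exactly at the threshold.
X = np.array([[1.0, 2.0, 3.0], [0.0, 2.5, -1.0]])

# Values strictly greater than 2.0 become 1; everything else becomes 0.
binarized = Binarizer(threshold=2.0).fit_transform(X)

# The equivalent NumPy expression.
manual = (X > 2.0).astype(float)

print(binarized)
# [[0. 0. 1.]
#  [0. 1. 0.]]
```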

Keywords: Python Machine Learning

Added by niesom on Thu, 10 Feb 2022 07:14:40 +0200