ML - data normalization

1, About Normalization

Data normalization: map all features onto the same scale.

Scope: continuous variables are preprocessed and standardized; for unordered categorical variables, dummy variables need to be generated.
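For the categorical part, a minimal sketch of dummy-variable generation, assuming pandas is available; the 'district' column and its values are invented for illustration:

import pandas as pd

# Invented unordered categorical feature
df = pd.DataFrame({'district': ['east', 'west', 'north', 'east']})

# Each category becomes its own 0/1 indicator column
dummies = pd.get_dummies(df['district'], prefix='district')
print(dummies)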

Why normalize

When samples are compared with Euclidean distance, some features span a large range while others span a small one.
For example, suppose the room areas are 70, 100 and 120 and the numbers of rooms are 3, 4 and 5. The features have different units and scales, so making decisions from the raw values alone is unreasonable, which is why the data must first be standardized or normalized.
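A quick sketch of the issue using the numbers above: the distance between two houses is dominated by the large-scale area feature.

import numpy as np

# [area, number of rooms] for two of the houses above
a = np.array([70.0, 3.0])
b = np.array([100.0, 5.0])

# Without scaling, the area difference (30) swamps the room difference (2)
print(np.linalg.norm(a - b))   # ~30.07, almost entirely due to area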

2, Normalization method

1. Min-max normalization

The most commonly used normalization method is min-max scaling, often called 0-1 normalization: the data are compressed into the interval [0, 1].

$ x_{scale} = \frac{x - x_{min}}{x_{max} - x_{min}} $

Applicable to: distributions with a clear boundary, such as exam scores (0–100) or pixel values (0–255).
Disadvantage: it is strongly affected by outliers, for example incomes, which are very widely distributed (see the sketch below).
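A small illustration of that drawback (the incomes below are made up): one extreme value squeezes all the other scaled values into a narrow band near 0.

import numpy as np

income = np.array([3000.0, 4000.0, 5000.0, 6000.0, 1000000.0])   # one extreme earner
scaled = (income - income.min()) / (income.max() - income.min())
print(scaled)   # the first four values are all below 0.01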

2. Mean-variance normalization

Mean-variance normalization is also known as standardization or Z-score normalization.
Method: scale the data so that the result has a mean of 0 and a variance of 1.

Calculation formula: $ x_{scale} = \frac{x - x_{mean}}{s} $

It can also be written as: $ z = \frac{x - \mu}{\sigma} $

After scaling, the mean is $ \mu = 0 $ and the standard deviation is $ \sigma = 1 $.

After the transformation, the data is centered at the origin, and all features lie on a comparable scale.

Applicable to: data without obvious boundaries, where extreme values may occur.
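For comparison, a sketch of mean-variance normalization on the same made-up incomes; the result is not confined to [0, 1], but its mean is 0 and its standard deviation is 1.

import numpy as np

income = np.array([3000.0, 4000.0, 5000.0, 6000.0, 1000000.0])
z = (income - income.mean()) / income.std()
print(np.mean(z), np.std(z))   # ~0 and 1, up to floating-point error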

3, Normalization in Python code

import numpy as np
 
x = np.random.randint(0, 100, size = 100)
 
x
'''
    array([29, 45, 97, 97,  5, ...
           82, 57,  2, 92, 14, 92, 70,  3, 56])
'''


(x - np.min(x)) / (np.max(x) - np.min(x))
''' 
    array([0.28125   , 0.44791667, 0.98958333, 0.98958333, 0.03125   ,
           ...
           0.125     , 0.9375    , 0.70833333, 0.01041667, 0.5625    ])
'''

Processing a matrix (one feature per column)

X = np.random.randint(0, 100, (50, 2))
X[:10,:]

''' 
    array([[23, 47],
           [18,  6], 
           [58, 48],
           [65, 88]])

''' 
# Convert X to floating point number
X = np.array(X, dtype = float)
 
# Min-max normalize the first feature column
X1 = X[:,0]
X11 = (X1 - np.min(X1))/ (np.max(X1) - np.min(X1))
X11

''' 
    array([0.23232323, 0.18181818, 0.11111111, 0.92929293, 0.78787879, 
           ... 
           0.        , 0.96969697, 0.02020202, 0.5959596 , 0.66666667])
''' 

X2 = X[:,1]
X21 = (X2 - np.min(X2))/ (np.max(X2) - np.min(X2))
X21

'''
    array([0.47474747, 0.06060606, 0.        , 0.76767677, 0.57575758,
           ...
           0.64646465, 0.09090909, 0.41414141, 0.08080808, 0.61616162])
'''


# View the mean and standard deviation of each normalized column
print('Mean 1:', np.mean(X11))
print('Std 1:', np.std(X11))
print('Mean 2:', np.mean(X21))
print('Std 2:', np.std(X21))


'''
    Mean 1: 0.5335353535353535
    Std 1: 0.3290554389302859
    Mean 2: 0.4656565656565656
    Std 2: 0.31417389159473386
'''

import matplotlib.pyplot as plt
 
plt.scatter(X11, X21)
# You can see that the values are between 0 and 1

Mean-variance normalization (standardization)

# (X - np.min(X)) / (np.max(X) - np.min(X))
  
X12 = (X1 - np.mean(X1))/ np.std(X1)
X22 = (X2 - np.mean(X2))/ np.std(X2)
 
print('Mean 1:', np.mean(X12))
print('Std 1:', np.std(X12))
print('Mean 2:', np.mean(X22))
print('Std 2:', np.std(X22))

# The mean is 0 and the standard deviation is 1
'''
    Mean 1: -8.881784197001253e-18
    Std 1: 1.0
    Mean 2: -4.5519144009631415e-17
    Std 2: 0.9999999999999999
'''

plt.scatter(X12, X22)

4, Implementation in Sklearn

sklearn handles normalization with Scaler classes, and their interface is kept as similar as possible to the other estimators: fit on the training data, then transform.

from sklearn import datasets

iris = datasets.load_iris()

X = iris.data
y = iris.target
 
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42)
 
from sklearn.preprocessing import MinMaxScaler, StandardScaler
 
# Mean variance normalization

std_scaler = StandardScaler()
std_scaler.fit(X_train) 
# StandardScaler(copy=True, with_mean=True, with_std=True)
 
# Check the means: one value per feature column of the training matrix (four features)
std_scaler.mean_
# array([5.80916667, 3.06166667, 3.72666667, 1.18333333])
 
# View the per-feature spread (standard deviation); older sklearn versions exposed this as std_, it is now scale_
std_scaler.scale_
# array([0.82036535, 0.44724776, 1.74502786, 0.74914766])
 
X_train_std = std_scaler.transform(X_train)
X_train_std

'''  
    array([[-1.47393679,  1.20365799, -1.56253475, -1.31260282],
           [-0.13307079,  2.99237573, -1.27600637, -1.04563275],
           [ 1.08589829,  0.08570939,  0.38585821,  0.28921757],
             ...
           [-0.01117388, -1.0322392 ,  0.15663551,  0.02224751],
           [ 1.57348593, -0.13788033,  1.24544335,  1.22361279]])
''' 

X_test_std = std_scaler.transform(X_test)
X_test_std
''' 
    array([[ 0.35451684, -0.58505976,  0.55777524,  0.02224751],
           [-0.13307079,  1.65083742, -1.16139502, -1.17911778], 
           ... 
           [-1.23014297, -0.13788033, -1.33331205, -1.17911778],
           [-1.23014297,  0.08570939, -1.2187007 , -1.31260282]])
'''

Use the normalized data to classify iris with kNN:

from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=3)

knn_clf.fit(X_train_std, y_train)
# KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=None, n_neighbors=3, p=2, weights='uniform')

 
knn_clf.score(X_test_std, y_test) # Note: X_train was standardized before fitting, so X_test must be transformed with the same scaler here
# 1.0
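A more compact equivalent, shown here as a sketch, is to chain the scaler and the classifier with sklearn's Pipeline, reusing X_train/X_test from the split above; the pipeline fits the scaler on the training data and applies it to the test data automatically.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
pipe.fit(X_train, y_train)          # the scaler is fit on the training data only
print(pipe.score(X_test, y_test))   # the test data is transformed with the training statistics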

Typical usage on other datasets (df below stands for a DataFrame with 'alcohol' and 'acid' columns)

from sklearn.preprocessing import StandardScaler, MinMaxScaler

std_scaler = StandardScaler().fit(df[['alcohol', 'acid']])
df_std = std_scaler.transform(df[['alcohol', 'acid']])

minmax_scaler = MinMaxScaler().fit( df[['alcohol', 'acid']] )
df_minmax = minmax_scaler.transform( df[['alcohol', 'acid']] )
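A self-contained sketch of the same pattern; the DataFrame below is invented to stand in for the df above, and fit_transform simply combines fit and transform in one call.

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Invented stand-in for the df used above
df = pd.DataFrame({'alcohol': [12.1, 13.5, 11.8, 14.0],
                   'acid': [0.30, 0.45, 0.25, 0.50]})

df_std = StandardScaler().fit_transform(df[['alcohol', 'acid']])
df_minmax = MinMaxScaler().fit_transform(df[['alcohol', 'acid']])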

5, Encapsulated normalization class

import numpy as np
 
class StandardScaler:

    def __init__(self):
        self.mean_ = None
        self.scale_ = None
    
    def fit(self, X):
        """According to the training data set X Obtain the mean and variance of the data"""
        assert X.ndim == 2, "The dimension of X must be 2"
        self.mean_ = np.array([np.mean(X[:,i]) for i in range(X.shape[1])])
        self.scale_ = np.array([np.std(X[:,i]) for i in range(X.shape[1])])
        return self
    
    def transform(self, X):
        """Standardize X column by column, using the mean_ and scale_ learned by fit"""
        # Only two-dimensional data is processed
        assert X.ndim == 2, "The dimension of X must be 2"
        assert self.mean_ is not None and self.scale_ is not None,  "must fit before transform!"
        assert X.shape[1] == len(self.mean_),   "the feature number of X must be equal to mean_ and std_"
    
        resX = np.empty(shape=X.shape, dtype=float)
        for col in range(X.shape[1]):
            resX[:,col] = (X[:,col] - self.mean_[col]) / self.scale_[col]
        return resX
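A quick usage sketch: since this class is defined after the sklearn import, the name StandardScaler now refers to the hand-written version; reusing X_train from the iris split in section 4, it should give per-feature means close to 0 and standard deviations close to 1.

my_scaler = StandardScaler()          # the hand-written class defined above
my_scaler.fit(X_train)                # X_train from the iris split in section 4
X_train_my = my_scaler.transform(X_train)

print(np.mean(X_train_my, axis=0))    # per-feature means, all close to 0
print(np.std(X_train_my, axis=0))     # per-feature standard deviations, all close to 1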

6, Normalization of test data

Should the training data and the test data each be normalized with their own statistics?

Correct practice: normalize the test set with the mean_train and std_train obtained from the training set (see the code sketch after the list below):
(x_test - mean_train) / std_train

Reasons for this:

  • The test data simulates the real environment, where the mean and variance of all incoming data are generally not available;
  • Normalizing the data is itself part of the algorithm, so it must rely only on what was learned from the training data.
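A minimal NumPy sketch of that rule, reusing X_train and X_test from the iris split in section 4; the statistics come from the training data only.

mean_train = np.mean(X_train, axis=0)
std_train = np.std(X_train, axis=0)

X_train_norm = (X_train - mean_train) / std_train
# The test set is scaled with the training mean and std, never with its own statistics
X_test_norm = (X_test - mean_train) / std_train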

Reference: StandardScaler in sklearn
https://scikit-learn.org/0.20/modules/generated/sklearn.preprocessing.StandardScaler.html
