1, About Normalization
Data normalization: map all features onto the same scale (a common value range).
Dimensional units: continuous variables are preprocessed and standardized, while unordered categorical variables need to be converted into dummy variables.
Why normalize
When a model evaluates samples with Euclidean distance, features with a large value range dominate features with a small one.
For example, suppose three properties have areas 70, 100, and 120 and room counts 3, 4, and 5. The units differ, so making decisions from the raw values alone is unreasonable; the data must first be standardized or normalized.
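To make this concrete, here is a minimal sketch (reusing the made-up area/room-count values above) of how min-max scaling changes the Euclidean distance between samples; the numbers are purely illustrative:

```python
import numpy as np

# Each row is a property: [area, number of rooms] (values from the example above)
homes = np.array([[70, 3],
                  [100, 4],
                  [120, 5]], dtype=float)

# Raw distance between the first two properties: area dominates completely
print(np.linalg.norm(homes[0] - homes[1]))    # ~30.02 (sqrt(30**2 + 1**2))

# After min-max scaling, both features contribute on an equal footing
scaled = (homes - homes.min(axis=0)) / (homes.max(axis=0) - homes.min(axis=0))
print(np.linalg.norm(scaled[0] - scaled[1]))  # ~0.78 (sqrt(0.6**2 + 0.5**2))
```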
2, Normalization methods
1. Min-max normalization
The most commonly used normalization method is min-max scaling, often called 0-1 normalization: each feature is linearly compressed into the interval [0, 1].
$ x_{scale} = \frac{x - x_{min}}{x_{max} - x_{min}} $
Applicable when the distribution has clear boundaries, e.g. exam scores (0–100) or pixel values (0–255).
Disadvantage: it is strongly affected by outliers; income, for example, is very widely distributed.
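A minimal sketch of this weakness, with made-up income figures: a single extreme earner compresses everyone else into a sliver of the [0, 1] interval:

```python
import numpy as np

# Four ordinary incomes plus one extreme outlier (made-up numbers)
income = np.array([3000, 4000, 5000, 6000, 1_000_000], dtype=float)

scaled = (income - income.min()) / (income.max() - income.min())
print(scaled)
# [0.         0.00100301 0.00200602 0.00300903 1.        ]
# The four ordinary incomes collapse into a band only ~0.003 wide.
```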
2. Mean-variance normalization
Mean-variance normalization is also known as standardization or Z-score normalization.
Method: scale all data to a distribution with mean 0 and variance 1.
Calculation formula: $x_{scale} = \frac{x - x_{mean}}{s}$

It can also be written as: $z = \frac{x - \mu}{\sigma}$

After scaling, the mean is $\mu = 0$ and the standard deviation is $\sigma = 1$;
The data is then distributed symmetrically around the origin, and all features share the same scale.
Applicable when the data has no obvious boundaries and may contain extreme values.
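As a quick worked example (reusing the room areas 70, 100, 120 from above, with the population standard deviation):

$\mu = \frac{70 + 100 + 120}{3} \approx 96.67, \qquad \sigma = \sqrt{\frac{(70-\mu)^2 + (100-\mu)^2 + (120-\mu)^2}{3}} \approx 20.55$

so the standardized values are $z \approx (-1.30,\ 0.16,\ 1.14)$, which indeed have mean 0 and standard deviation 1.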
3, Normalization in Python code
```python
import numpy as np

x = np.random.randint(0, 100, size=100)
x
'''
array([29, 45, 97, 97,  5, ... 82, 57,  2, 92, 14, 92, 70,  3, 56])
'''

# Min-max normalization
(x - np.min(x)) / (np.max(x) - np.min(x))
'''
array([0.28125   , 0.44791667, 0.98958333, 0.98958333, 0.03125   ,
       ... 0.125     , 0.9375    , 0.70833333, 0.01041667, 0.5625    ])
'''
```
Processing a matrix
```python
X = np.random.randint(0, 100, (50, 2))
X[:10, :]
'''
array([[23, 47],
       [18,  6],
       [58, 48],
       [65, 88]])
'''

# Convert X to floating point
X = np.array(X, dtype=float)

# Min-max normalize the first feature column
X1 = X[:, 0]
X11 = (X1 - np.min(X1)) / (np.max(X1) - np.min(X1))
X11
'''
array([0.23232323, 0.18181818, 0.11111111, 0.92929293, 0.78787879,
       ... 0.        , 0.96969697, 0.02020202, 0.5959596 , 0.66666667])
'''

# Min-max normalize the second feature column
X2 = X[:, 1]
X21 = (X2 - np.min(X2)) / (np.max(X2) - np.min(X2))
X21
'''
array([0.47474747, 0.06060606, 0.        , 0.76767677, 0.57575758,
       ... 0.64646465, 0.09090909, 0.41414141, 0.08080808, 0.61616162])
'''

# Check the mean and standard deviation (np.std returns the standard deviation)
print('Mean 1:', np.mean(X11))
print('Std 1:', np.std(X11))
print('Mean 2:', np.mean(X21))
print('Std 2:', np.std(X21))
'''
Mean 1: 0.5335353535353535
Std 1: 0.3290554389302859
Mean 2: 0.4656565656565656
Std 2: 0.31417389159473386
'''

import matplotlib.pyplot as plt
plt.scatter(X11, X21)  # all values fall between 0 and 1
```
Mean-variance normalization
```python
# Standardize each column: (x - mean) / std
X12 = (X1 - np.mean(X1)) / np.std(X1)
X22 = (X2 - np.mean(X2)) / np.std(X2)

# The mean is now 0 and the standard deviation 1 (up to floating-point error)
print('Mean 1:', np.mean(X12))
print('Std 1:', np.std(X12))
print('Mean 2:', np.mean(X22))
print('Std 2:', np.std(X22))
'''
Mean 1: -8.881784197001253e-18
Std 1: 1.0
Mean 2: -4.5519144009631415e-17
Std 2: 0.9999999999999999
'''

plt.scatter(X12, X22)
```
4, Implementation in Sklearn
sklearn handles normalization with Scaler classes, whose interface is deliberately kept as close as possible to that of its other estimators: fit on the training data, then transform.
```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

iris = datasets.load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Mean-variance normalization (standardization)
std_scaler = StandardScaler()
std_scaler.fit(X_train)
# StandardScaler(copy=True, with_mean=True, with_std=True)

# The fitted mean of each of the four features
std_scaler.mean_
# array([5.80916667, 3.06166667, 3.72666667, 1.18333333])

# The fitted per-feature scale (standard deviation);
# earlier sklearn versions exposed this as std_, later renamed to scale_
std_scaler.scale_
# array([0.82036535, 0.44724776, 1.74502786, 0.74914766])

X_train_std = std_scaler.transform(X_train)
X_train_std
'''
array([[-1.47393679,  1.20365799, -1.56253475, -1.31260282],
       [-0.13307079,  2.99237573, -1.27600637, -1.04563275],
       [ 1.08589829,  0.08570939,  0.38585821,  0.28921757],
       ...
       [-0.01117388, -1.0322392 ,  0.15663551,  0.02224751],
       [ 1.57348593, -0.13788033,  1.24544335,  1.22361279]])
'''

X_test_std = std_scaler.transform(X_test)
X_test_std
'''
array([[ 0.35451684, -0.58505976,  0.55777524,  0.02224751],
       [-0.13307079,  1.65083742, -1.16139502, -1.17911778],
       ...
       [-1.23014297, -0.13788033, -1.33331205, -1.17911778],
       [-1.23014297,  0.08570939, -1.2187007 , -1.31260282]])
'''
```
Use the normalized data to classify iris with kNN:
```python
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train_std, y_train)
# KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
#                      metric_params=None, n_jobs=None, n_neighbors=3, p=2,
#                      weights='uniform')

# X_train was standardized above, so X_test must also be transformed
# with the same scaler before scoring
knn_clf.score(X_test_std, y_test)
# 1.0
```
A common usage pattern on other datasets:
```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# df is assumed to be a pandas DataFrame with 'alcohol' and 'acid' columns
std_scaler = StandardScaler().fit(df[['alcohol', 'acid']])
df_std = std_scaler.transform(df[['alcohol', 'acid']])

minmax_scaler = MinMaxScaler().fit(df[['alcohol', 'acid']])
df_minmax = minmax_scaler.transform(df[['alcohol', 'acid']])
```
5, Encapsulated normalization class
```python
import numpy as np

class StandardScaler:
    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, X):
        """Compute the per-feature mean and standard deviation of the training set X."""
        assert X.ndim == 2, "The dimension of X must be 2"
        self.mean_ = np.array([np.mean(X[:, i]) for i in range(X.shape[1])])
        self.scale_ = np.array([np.std(X[:, i]) for i in range(X.shape[1])])
        return self

    def transform(self, X):
        """Apply mean-variance normalization to X using the fitted statistics."""
        # Only two-dimensional data is handled
        assert X.ndim == 2, "The dimension of X must be 2"
        assert self.mean_ is not None and self.scale_ is not None, \
            "must fit before transform!"
        assert X.shape[1] == len(self.mean_), \
            "the feature number of X must be equal to mean_ and std_"
        resX = np.empty(shape=X.shape, dtype=float)
        for col in range(X.shape[1]):
            resX[:, col] = (X[:, col] - self.mean_[col]) / self.scale_[col]
        return resX
```
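A quick usage sketch of the class above on random data, mirroring sklearn's fit-then-transform workflow (the data here is made up):

```python
import numpy as np

X_train = np.random.randint(0, 100, (50, 2)).astype(float)
X_test = np.random.randint(0, 100, (10, 2)).astype(float)

scaler = StandardScaler()
scaler.fit(X_train)                       # learn mean_ and scale_ from training data
X_train_std = scaler.transform(X_train)   # per-column mean ~0, std ~1
X_test_std = scaler.transform(X_test)     # reuses the training statistics

print(X_train_std.mean(axis=0))  # ~[0, 0]
print(X_train_std.std(axis=0))   # ~[1, 1]
```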
6, Normalization of test data
Should the training data and the test data each be normalized with their own statistics? No.
Correct practice: normalize the test set using the mean_train and std_train obtained from the training set:
(x_test - mean_train) / std_train
Reasons for this:
- The test data simulates the real environment; in production, the mean and variance of all incoming data may not be available;
- Normalizing the data is itself part of the algorithm, so the same transformation must be applied consistently (see the sketch below).
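A minimal sketch of this workflow with plain NumPy (variable names are illustrative): the statistics come from the training set only and are reused unchanged on the test set:

```python
import numpy as np

X_train = np.random.randint(0, 100, (50, 2)).astype(float)
X_test = np.random.randint(0, 100, (10, 2)).astype(float)

mean_train = X_train.mean(axis=0)
std_train = X_train.std(axis=0)

X_train_std = (X_train - mean_train) / std_train
# Note: the training statistics, not X_test's own mean and std
X_test_std = (X_test - mean_train) / std_train
```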
sklearn's StandardScaler documentation:
https://scikit-learn.org/0.20/modules/generated/sklearn.preprocessing.StandardScaler.html