Machine learning - PCA dimensionality reduction: coping with the curse of high-dimensional data (case study)

Why dimensionality reduction?


High-dimensional data sets have a large number of features (variables), which makes machine-learning computation harder, and some of those features may be correlated with one another. Dimensionality reduction transforms the original features into a smaller set of new variables while losing as little information as possible.

Principal component analysis PCA

Principal component analysis (PCA) is the most widely used dimensionality reduction technique.


PCA reduces the dimension of a data set made up of many correlated variables while preserving as much of the data set's variance as possible.

It finds a new set of variables, each a linear combination of the original variables.
These new variables are called the Principal Components.

Next, let's walk through a PCA example using the built-in iris data set

import pandas as pd
url = ''  # data set URL elided in the original
df = pd.read_csv(url, names=['sepal length','sepal width','petal length','petal width','target'])

The loaded DataFrame holds the four iris features plus the target column (the original post shows a preview here).
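Since the URL is elided above, a hedged alternative is to load the same iris data from scikit-learn's bundled copy (an assumption: it contains the same 150 samples as the UCI CSV; sklearn's label names lack the "Iris-" prefix, so we add it back to match the plotting code later):

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
df = pd.DataFrame(iris.data,
                  columns=['sepal length', 'sepal width',
                           'petal length', 'petal width'])
# Restore the "Iris-" prefix used by the UCI CSV and the plots below
df['target'] = ['Iris-' + iris.target_names[i] for i in iris.target]
print(df.shape)  # (150, 5)
```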

Step 1: Standardization

Here we use the StandardScaler class from sklearn.preprocessing

from sklearn.preprocessing import StandardScaler
variables = ['sepal length','sepal width','petal length','petal width']
x = df.loc[:, variables].values
y = df.loc[:,['target']].values
x = StandardScaler().fit_transform(x) 
x = pd.DataFrame(x)

After standardization, each feature column has mean 0 and unit variance.
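The effect of standardization can be verified directly; a minimal sketch on stand-in values (the numbers below are hypothetical, not the real iris data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy 2-feature matrix standing in for the iris features
x = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 3.4], [5.9, 3.0]])
x_std = StandardScaler().fit_transform(x)

# Each column now has mean 0 and (population) standard deviation 1
print(np.allclose(x_std.mean(axis=0), 0))  # True
print(np.allclose(x_std.std(axis=0), 1))   # True
```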

Step 2: PCA

from sklearn.decomposition import PCA
pca = PCA()
x_pca = pca.fit_transform(x)
x_pca = pd.DataFrame(x_pca)

The transformed data looks like this.

There are four features in the original data set, so PCA produces the same number of principal components.
We use PCA's explained_variance_ratio_ attribute to obtain the proportion of variance captured by each principal component.

Step 3: calculate the principal components' variance contributions

explained_variance = pca.explained_variance_ratio_
print(explained_variance)
# Returns a vector:
# [0.72770452 0.23030523 0.03683832 0.00515193]

Explanation: the first principal component accounts for 72.77% of the variance, and the second, third, and fourth account for 23.03%, 3.68%, and 0.52% respectively. We can say that the first and second principal components together capture 72.77% + 23.03% = 95.80% of the information. We often want to keep only the important components and drop the unimportant ones; the rule of thumb is to retain the principal components that capture significant variance and ignore those with small variance.
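This rule of thumb can be automated with the cumulative explained variance: passing a float to PCA's n_components asks scikit-learn for "enough components to reach that fraction of variance". A sketch on the sklearn copy of iris:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

x = StandardScaler().fit_transform(load_iris().data)

# Cumulative variance: monotonically increasing, ending at 1.0
ratios = PCA().fit(x).explained_variance_ratio_
print(np.cumsum(ratios).round(4))

# Equivalently, ask PCA directly for 95% of the variance;
# for iris this keeps the first two components
pca95 = PCA(n_components=0.95)
x_reduced = pca95.fit_transform(x)
print(x_reduced.shape[1])  # 2
```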

Step 4: merge into new data

x_pca.columns = ['PC1','PC2','PC3','PC4']
x_pca['target'] = df['target'].values  # append the class labels to the components


Step 5: result visualization

import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_title('2 component PCA')
targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indicesToKeep = x_pca['target'] == target
    ax.scatter(x_pca.loc[indicesToKeep, 'PC1'],
               x_pca.loc[indicesToKeep, 'PC2'],
               c=color, s=50)
ax.legend(targets)
plt.show()

Surprisingly, the first two principal components already separate the three classes well in the plane.

Kernel PCA

In general, PCA is based on a linear transformation, while kernel PCA (KPCA) applies PCA after a nonlinear transformation and is used to handle linearly inseparable data sets.

The general idea of KPCA is: given a data matrix X in the input space, first map all the samples of X into a high-dimensional, possibly infinite-dimensional space (called the feature space) with a nonlinear mapping φ, making them linearly separable there, and then perform ordinary PCA in that high-dimensional space.
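A minimal sketch of this on scikit-learn's concentric-circles data set, a classic linearly inseparable case (the kernel and gamma=10 are assumed values, not from the original post):

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric rings: no straight line separates the classes
# in the original 2-D input space
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# RBF-kernel PCA implicitly maps the points into a feature space
# and does PCA there; plotting X_kpca shows the rings pulled apart
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)  # (400, 2)
```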


Linear discriminant analysis LDA
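This section is a stub in the original; as a hedged sketch, scikit-learn's LinearDiscriminantAnalysis reduces iris to at most n_classes - 1 = 2 dimensions. Unlike PCA it is supervised: it uses the class labels and seeks directions that maximize class separation rather than variance:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
# LDA uses the labels y, unlike PCA, and yields at most
# n_classes - 1 = 2 discriminant axes for the 3-class iris data
lda = LinearDiscriminantAnalysis(n_components=2)
x_lda = lda.fit_transform(iris.data, iris.target)
print(x_lda.shape)  # (150, 2)
```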


Singular value decomposition SVD
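This section is also a stub in the original; as a sketch of the connection to PCA, the SVD of the centered data matrix X = U diag(s) Vᵀ gives the principal directions as the rows of Vᵀ, and projecting onto the first k of them reproduces a k-component PCA (the random matrix below is a hypothetical stand-in for the iris data):

```python
import numpy as np

# Stand-in 150x4 data matrix (same shape as iris)
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))

Xc = X - X.mean(axis=0)                 # center the data first
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal directions;
# projecting onto the first two reproduces a 2-component PCA
scores = Xc @ Vt[:2].T
print(scores.shape)  # (150, 2)
```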


Keywords: Python Machine Learning

Added by vintox on Tue, 21 Dec 2021 04:05:49 +0200