Why dimensionality reduction?
Reason
High-dimensional data sets contain a large number of features (variables), which makes machine learning computation harder, and some of those features may be correlated with each other. Dimensionality reduction transforms the features into a smaller set of new variables while ensuring that not too much information from the original data is lost.
Principal component analysis (PCA)
Principal component analysis (PCA) is the most widely used dimensionality reduction technique.
Idea
Reduce the dimension of a data set made up of a large number of correlated variables, while preserving as much of the data set's variance as possible.
To do this, find a new set of variables, each of which is simply a linear combination of the original variables.
These new variables are called the principal components.
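As a minimal sketch of this idea (independent of the worked example below; the random array X and all names here are purely illustrative), PCA can be computed by centering the data and eigendecomposing its covariance matrix:

import numpy as np

# Illustrative sketch: PCA as an eigendecomposition of the covariance matrix.
# X stands in for any (n_samples, n_features) data array.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

X_centered = X - X.mean(axis=0)           # center each feature
cov = np.cov(X_centered, rowvar=False)    # feature covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvectors = directions of maximum variance

order = np.argsort(eigvals)[::-1]         # sort directions by decreasing variance
components = eigvecs[:, order]            # each column defines one principal component
X_new = X_centered @ components           # new variables = linear combinations of the originals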
Next, let's walk through a PCA example using the classic Iris data set.
import pandas as pd

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
df = pd.read_csv(url, names=['sepal length', 'sepal width', 'petal length', 'petal width', 'target'])
print(df.head())
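If the UCI URL is unavailable, an equivalent way to build a comparable DataFrame is to use the copy of Iris bundled with scikit-learn (a sketch; note that the class labels differ slightly from the UCI file):

from sklearn.datasets import load_iris
import pandas as pd

# Alternative: load Iris from scikit-learn instead of the UCI URL.
iris = load_iris()
df = pd.DataFrame(iris.data, columns=['sepal length', 'sepal width', 'petal length', 'petal width'])
# These labels are 'setosa', 'versicolor', 'virginica' rather than 'Iris-setosa', etc.,
# so the plotting step later would need matching names.
df['target'] = [iris.target_names[i] for i in iris.target]
print(df.head())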
The first five rows show four numeric measurements per flower plus the species label in the target column.
Step 1: Standardization
Here we use the StandardScaler class from sklearn.preprocessing.
from sklearn.preprocessing import StandardScaler

variables = ['sepal length', 'sepal width', 'petal length', 'petal width']
x = df.loc[:, variables].values
y = df.loc[:, ['target']].values
x = StandardScaler().fit_transform(x)
x = pd.DataFrame(x)
print(x)
This is what the data looks like after standardization: every column has been rescaled to zero mean and unit variance.
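PCA is sensitive to the scale of the features, which is why this step matters. As a quick sanity check (not part of the original walkthrough), you can print the per-column mean and standard deviation of the standardized data:

# Each standardized column should have mean close to 0 and standard deviation close to 1.
print(x.mean(axis=0).round(6))
print(x.std(axis=0).round(6))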
Step 2: PCA
from sklearn.decomposition import PCA

pca = PCA()
x_pca = pca.fit_transform(x)
x_pca = pd.DataFrame(x_pca)
x_pca.head()
The result looks like this: four new columns, one per principal component.
There are four features in the original data set, so by default PCA returns the same number of principal components.
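Each of those components is just a linear combination of the original features; the weights are stored in the fitted estimator's components_ attribute. Continuing with the pca object and standardized x from above (a sketch), you can also ask PCA for a fixed number of components up front:

# Loading matrix: one row per principal component, one column per original feature.
print(pca.components_.shape)    # (4, 4) for the four Iris features

# Keep only the first two components directly.
pca_2 = PCA(n_components=2)
x_pca_2 = pca_2.fit_transform(x)
print(x_pca_2.shape)            # (150, 2)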
We use PCA's explained_variance_ratio_ attribute to obtain the proportion of variance captured by each principal component.
Step 3: calculate the variance contribution of each principal component
explained_variance = pca.explained_variance_ratio_
print(explained_variance)
# Returns a vector: [0.72770452 0.23030523 0.03683832 0.00515193]
Explanation: the first principal component accounts for 72.77% of the variance, and the second, third and fourth account for 23.03%, 3.68% and 0.52% respectively. The first and second principal components together capture 72.77 + 23.03 ≈ 95.80% of the information. We often want to keep only the important components and drop the unimportant ones. The rule of thumb is to retain the principal components that capture significant variance and ignore those with small variance.
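One common way to apply this rule of thumb (a sketch, continuing from the fitted pca and standardized x above) is to inspect the cumulative explained variance, or to let scikit-learn choose the number of components needed to reach a variance threshold:

import numpy as np

# Cumulative share of variance captured by the first k components.
print(np.cumsum(pca.explained_variance_ratio_))

# A float in (0, 1) passed as n_components keeps just enough components
# to explain at least that fraction of the total variance.
pca_95 = PCA(n_components=0.95)
x_reduced = pca_95.fit_transform(x)
print(pca_95.n_components_)     # number of components actually kept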
Step 4: merge into new data
x_pca['target'] = y
x_pca.columns = ['PC1', 'PC2', 'PC3', 'PC4', 'target']
x_pca.head()
The merged data frame now contains the four principal components plus the target label.
Step 5: result visualization
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_title('2 component PCA')
targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    indicesToKeep = x_pca['target'] == target
    ax.scatter(x_pca.loc[indicesToKeep, 'PC1'],
               x_pca.loc[indicesToKeep, 'PC2'],
               c=color, s=50)
ax.legend(targets)
ax.grid()
plt.show()
As the scatter plot shows, the three classes separate quite cleanly in the plane spanned by principal components 1 and 2.
Kernel PCA
Ordinary PCA is based on a linear transformation, whereas kernel PCA (KPCA) is based on a nonlinear transformation and is used to handle data sets that are not linearly separable.
The general idea of KPCA is: for a data matrix X in the input space, first map all of the samples to a high-dimensional (possibly infinite-dimensional) feature space with a nonlinear mapping, so that they become linearly separable there, and then perform ordinary PCA in that high-dimensional space.
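As a small illustration (a sketch using scikit-learn's KernelPCA on a synthetic two-circles data set that is not linearly separable; the kernel and gamma values here are illustrative choices):

from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: no linear projection can separate the two classes.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Ordinary PCA only applies a linear transformation to the data.
X_lin = PCA(n_components=2).fit_transform(X)

# Kernel PCA with an RBF kernel implicitly maps the samples to a high-dimensional
# feature space first, where the two circles become (nearly) linearly separable.
X_kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10).fit_transform(X)
print(X_kpca.shape)   # (400, 2)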
Linear discriminant analysis (LDA)