Dimensionality reduction in machine learning: principal component analysis (PCA)

1. PCA theory

Common ways to address overfitting: add more training data, add regularization terms, and reduce the feature dimensionality.
Dimensionality reduction methods: direct feature selection, linear dimensionality reduction (PCA, MDS), and nonlinear dimensionality reduction (manifold learning).

1.1 representation of the mean and covariance of high-dimensional samples

Given the data matrix $X=(x_1,x_2,\dots,x_N)^{T}_{N\times p}$, where each row $x_i^{T}$ is one sample with $p$ features.
Sample mean: $\overline{X}_{p\times 1}=\frac{1}{N}\sum_{i=1}^N x_i$
Sample covariance matrix: $S_{p\times p}=\frac{1}{N}\sum_{i=1}^N(x_i-\overline{X})(x_i-\overline{X})^{T}$
These can then be written in matrix form. For the mean:

$\overline{X}_{p\times 1}=\frac{1}{N}\sum_{i=1}^N x_i$
$=\frac{1}{N}(x_1,x_2,\dots,x_N)\begin{pmatrix}1\\ \vdots\\ 1\end{pmatrix}_{N\times 1}$
$=\frac{1}{N}X^{T}1_{N\times 1}$

For the covariance matrix:

$S_{p\times p}=\frac{1}{N}\sum_{i=1}^N(x_i-\overline{X})(x_i-\overline{X})^{T}$
$=\frac{1}{N}(x_1-\overline{X},\dots,x_N-\overline{X})\begin{pmatrix}(x_1-\overline{X})^{T}\\ \vdots\\ (x_N-\overline{X})^{T}\end{pmatrix}_{N\times p}$
$=\frac{1}{N}X^{T}\left(I_N-\frac{1}{N}1_N1_N^{T}\right)\left(I_N-\frac{1}{N}1_N1_N^{T}\right)^{T}X$
Writing the expression in parentheses as the centering matrix $H=I_N-\frac{1}{N}1_N1_N^{T}$, this becomes:
$=\frac{1}{N}X^{T}HH^{T}X$
A short calculation shows that $H$ is symmetric and idempotent:
$H^{T}=H,\qquad H^{n}=H$
so that:
$S_{p\times p}=\frac{1}{N}X^{T}HX$
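
As a quick sanity check, here is a minimal NumPy sketch (the random data and variable names are just for illustration) verifying that $H$ is symmetric and idempotent and that $\frac{1}{N}X^{T}HX$ equals the covariance computed directly from the centered data:

import numpy as np

rng = np.random.default_rng(0)
N, p = 6, 3
X = rng.normal(size=(N, p))                        # toy data matrix, rows are samples

H = np.eye(N) - np.ones((N, N)) / N                # centering matrix H = I_N - (1/N) 1 1^T
S_H = X.T @ H @ X / N                              # S = (1/N) X^T H X
Xc = X - X.mean(axis=0)
S_direct = Xc.T @ Xc / N                           # covariance computed directly

print(np.allclose(H, H.T), np.allclose(H @ H, H))  # H is symmetric and idempotent
print(np.allclose(S_H, S_direct))                  # both expressions agree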

1.2 maximum projection variance & minimum reconstruction distance

Classical PCA can be summarized as:
One center: reconstruction of the original feature space (turning correlated original features into uncorrelated new features);
Two basic points: 1. maximum projection variance; 2. minimum reconstruction distance.

The basic idea is to project the original features onto a direction. This is somewhat similar to, but different from, LDA (linear discriminant analysis), which will be covered in a separate note. PCA chooses the projection direction so that the variance of the projected samples is maximized, which keeps the new samples as distinguishable as possible and allows the original samples to be reconstructed from the projections as well as possible. As shown in the figure:

Centering: $x_i-\overline{X}$
Projection direction: $\vec{u_1}$ (a unit vector, since we will require $u_1^{T}u_1=1$)
By the projection theorem shown in the figure, the projection variance is:

$J=\frac{1}{N}\sum_{i=1}^N\left((x_i-\overline{X})^{T}\vec{u_1}\right)^2$
$=\frac{1}{N}\sum_{i=1}^N\vec{u_1}^{T}(x_i-\overline{X})(x_i-\overline{X})^{T}\vec{u_1}$
$=\vec{u_1}^{T}\left(\frac{1}{N}\sum_{i=1}^N(x_i-\overline{X})(x_i-\overline{X})^{T}\right)\vec{u_1}$
$=\vec{u_1}^{T}S\vec{u_1}$

The solution of PCA can then be cast as a constrained optimization problem:
$u_1'=\underset{u_1}{\arg\max}\;u_1^{T}Su_1,\qquad \text{s.t.}\;u_1^{T}u_1=1$
By the Lagrange multiplier method:
$f(u_1,\lambda)=u_1^{T}Su_1+\lambda(1-u_1^{T}u_1)$
Setting the partial derivative with respect to $u_1$ to zero:
$\frac{\partial f}{\partial u_1}=2Su_1-2\lambda u_1=0$
which gives:
$Su_1=\lambda u_1$
so the optimal direction $u_1$ is an eigenvector of $S$, the projection variance $u_1^{T}Su_1=\lambda$ is the corresponding eigenvalue, and the best direction is the eigenvector with the largest eigenvalue.
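
A minimal sketch of this result in code (using the iris data that also appears in the practice section below): solving $Su_1=\lambda u_1$ with np.linalg.eigh recovers the same directions as sklearn's PCA, up to sign:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / len(X)                       # covariance with the 1/N convention used above

lam, U = np.linalg.eigh(S)                   # solve S u = lambda u
order = np.argsort(lam)[::-1]                # sort by decreasing eigenvalue
lam, U = lam[order], U[:, order]

pca = PCA(n_components=2).fit(X)
for k in range(2):                           # directions agree with components_ up to sign
    print(np.allclose(np.abs(U[:, k]), np.abs(pca.components_[k]), atol=1e-6))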

It can be seen that PCA is in fact a reconstruction of the feature space: each selected direction $u_k$ is an eigenvector of $S$ with a corresponding eigenvalue $\lambda_k$. In the reconstructed coordinate system $u_1,u_2,\dots,u_p$, each centered sample can be expressed as:
$x_i=\sum_{k=1}^p(x_i^{T}u_k)u_k$
Dimensionality reduction keeps only $q$ of the $p$ directions of the reconstructed feature space ($q<p$), chosen as needed. The sample after dimensionality reduction is then:
$x_i'=\sum_{k=1}^q(x_i^{T}u_k)u_k$
Then the minimum reconstruction cost can be expressed as:
$J=\frac{1}{N}\sum_{i=1}^{N}\|x_i-x_i'\|^2=\frac{1}{N}\sum_{i=1}^{N}\sum_{k=q+1}^{p}(x_i^{T}u_k)^2$
Taking the centering into account:
$=\frac{1}{N}\sum_{i=1}^{N}\sum_{k=q+1}^{p}\left((x_i-\overline{X})^{T}u_k\right)^2=\sum_{k=q+1}^{p}\sum_{i=1}^{N}\frac{1}{N}\left((x_i-\overline{X})^{T}u_k\right)^2=\sum_{k=q+1}^{p}u_k^{T}Su_k$
Minimizing this cost means discarding the directions with the smallest $u_k^{T}Su_k=\lambda_k$ and keeping those with the largest, so the two perspectives, maximum projection variance and minimum reconstruction distance, lead to the same result.
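
A minimal numerical sketch of this equivalence (random toy data, illustrative only): keep the top $q$ eigenvectors, reconstruct, and check that the mean reconstruction cost equals the sum of the discarded eigenvalues:

import numpy as np

rng = np.random.default_rng(0)
N, p, q = 500, 5, 2
X = rng.normal(size=(N, p)) @ rng.normal(size=(p, p))   # correlated toy data

Xc = X - X.mean(axis=0)                                  # centering
S = Xc.T @ Xc / N                                        # 1/N covariance as above
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]                           # descending eigenvalues

X_hat = Xc @ U[:, :q] @ U[:, :q].T                       # reconstruct from the top-q directions
mse = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
print(np.isclose(mse, lam[q:].sum()))                    # True: cost = sum of discarded eigenvalues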

2. PCA practice

The dimensionality reduction algorithms in sklearn are included in the decomposition module.

2.1 important parameter n_components

This is the target dimensionality after reduction, i.e. the number of features to keep.

2.1.1 visualizing the feature distribution of high-dimensional data

Reduce the features to two (or three) dimensions so that their distribution can be visualized.

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
iris = load_iris()
y = iris.target
X = iris.data
import pandas as pd
print(pd.DataFrame(X))

pca = PCA(n_components=2)
pca = pca.fit(X) # Fitting model
X_dr = pca.transform(X) # Get new matrix
# One step: X_dr = PCA(2).fit_transform(X)

# Take out the first new feature of the samples labeled 0
X_dr[y == 0, 0]
# Take out the second new feature of the samples labeled 0
X_dr[y == 0, 1]

# Plot the three classes (this could also be written as a loop)
plt.figure()
plt.scatter(X_dr[y==0, 0], X_dr[y==0, 1], c="red", label=iris.target_names[0])
plt.scatter(X_dr[y==1, 0], X_dr[y==1, 1], c="black", label=iris.target_names[1])
plt.scatter(X_dr[y==2, 0], X_dr[y==2, 1], c="orange", label=iris.target_names[2])
plt.legend()
plt.title("PCA of IRIS dataset")
plt.show()

# View the explained variance of each new feature
print(pca.explained_variance_)
# View the explained variance ratio of each new feature (the proportion of the original information each new feature carries)
print(pca.explained_variance_ratio_)
# View the total proportion of the original information retained after dimensionality reduction
print(pca.explained_variance_ratio_.sum())

import numpy as np
# Cumulative sum of the explained variance ratio
np.cumsum(pca.explained_variance_ratio_)

2.1.2 selecting n_components by other methods

# Maximum likelihood estimation can be used to choose the value of n_components
pca_mle = PCA(n_components="mle")
pca_mle = pca_mle.fit(X)
X_mle = pca_mle.transform(X)

# Select n_components by the proportion of information to retain
pca_f = PCA(n_components=0.97, svd_solver="full")
pca_f = pca_f.fit(X)
X_f = pca_f.transform(X)
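
As a small follow-up, the fitted n_components_ attribute shows how many components each strategy actually kept (continuing with the pca_mle and pca_f objects above):

# Number of components chosen by MLE
print(pca_mle.n_components_)
# Smallest number of components whose cumulative explained variance ratio reaches 0.97
print(pca_f.n_components_)
print(pca_f.explained_variance_ratio_.sum())   # should be >= 0.97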

2.2 parameters svd_solver & random_state

PCA can also be viewed from the perspective of singular value decomposition:
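
Briefly, in the notation of section 1: decompose the centered data as $HX=U\Sigma V^{T}$; since $H$ is symmetric and idempotent,

$S=\frac{1}{N}X^{T}HX=\frac{1}{N}(HX)^{T}(HX)=\frac{1}{N}V\Sigma^{2}V^{T}$

so the right singular vectors (the columns of $V$) are exactly the principal directions $u_k$, with eigenvalues $\lambda_k=\sigma_k^{2}/N$. sklearn's PCA works through such an SVD of the centered data rather than forming $S$ explicitly.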

svd_solver has four options: "auto" (uses "full" for small data and "randomized" for large data), "full" (computes the complete singular value decomposition), "arpack" (speeds up the computation, suited to large and sparse feature matrices), and "randomized" (suited to large feature matrices where the computation would otherwise be heavy). random_state controls the randomness when svd_solver is "arpack" or "randomized".
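
A minimal sketch of these two parameters together (iris data just for illustration): the randomized solver depends on random_state, so fixing it makes the result reproducible, and it comes out very close to the exact "full" solver:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# "full" computes the exact SVD; "randomized" uses a randomized approximation
pca_full = PCA(n_components=2, svd_solver="full").fit(X)
pca_rand = PCA(n_components=2, svd_solver="randomized", random_state=42).fit(X)

print(pca_full.explained_variance_ratio_)
print(pca_rand.explained_variance_ratio_)      # very close to the exact result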

2.3 important attribute components_

from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np
faces = fetch_lfw_people(min_faces_per_person=60)
X = faces.data

# Display the original images in the dataset
# 4 rows and 5 columns: 4 * 5 = 20 images
# figsize is the canvas size
fig, axes = plt.subplots(4, 5, figsize=(8, 4), subplot_kw={"xticks": [], "yticks": []})  # hide the axes
for i, ax in enumerate(axes.flat):
	ax.imshow(faces.images[i,:,:], cmap="gray")

pca = PCA(150).fit(X)
V = pca.components_
# Visualize the core information stored in components_
fig, axes = plt.subplots(3, 8, figsize=(8, 4), subplot_kw={"xticks": [], "yticks": []})  # hide the axes
for i, ax in enumerate(axes.flat):
	ax.imshow(V[i,:].reshape(62,47), cmap="gray")

2.4 important interface inverse_transform

Dimensionality reduction discards information, so it cannot be fully undone: inverse_transform maps the reduced data back to the original feature space, but the result is only an approximation of the original samples.

X_dr = pca.transform(X)  # reduce the face data fitted above
X_inverse = pca.inverse_transform(X_dr)
fig, axes = plt.subplots(2, 10, figsize=(10, 2.5), subplot_kw={"xticks": [], "yticks": []})  # hide the axes
for i in range(10):
	axes[0, i].imshow(faces.images[i, :, :], cmap="binary_r")
	axes[1, i].imshow(X_inverse[i].reshape(62, 47), cmap="binary_r")

2.5 noise filtering with PCA

from sklearn.datasets import load_digits

digits = load_digits()
# View label
print(set(digits.target.tolist()))
# Plotting helper
def plot(data):
	fig, axes = plt.subplots(4, 10, figsize=(10, 4), subplot_kw={"xticks": [], "yticks": []})  # hide the axes
	for i, ax in enumerate(axes.flat):
		ax.imshow(data[i].reshape(8,8), cmap="binary")

# Add Gaussian noise to the data
rng = np.random.RandomState(42)
noisy = rng.normal(digits.data, 2)
plot(noisy)

pca = PCA(0.5, svd_solver="full").fit(noisy)
X_dr = pca.transform(noisy)

without_noise = pca.inverse_transform(X_dr)
plot(without_noise)

2.6 hands-on: dimensionality reduction of a handwritten digits dataset

Packages that were already imported above are not imported again.

from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.model_selection import cross_val_score
data = pd.read_csv("path to the dataset CSV")
X = data.iloc[:, 1:]
y = data.iloc[:,0]

# Draw the cumulative variance contribution rate curve to find the best dimension range after dimension reduction
pca_line = PCA().fit(X)
plt.figure(figsize=[20,5])
plt.plot(np.cumsum(pca_line.explained_variance_ratio_))
plt.xlabel("number of components after dimension reduction")
plt.ylabel("cumulative explained variance")
plt.show()

# Learning curve over the number of components, to narrow down the best dimensionality
score = []
# First scan a coarse range; then narrow it (e.g. to (10, 25)) and draw the curve again
for i in range(1, 101, 10):
	X_dr = PCA(i).fit_transform(X)
	once = cross_val_score(RFC(n_estimators=10, random_state=0), X_dr, y, cv=5).mean()
	score.append(once)
plt.figure(figsize=[20,5])
plt.plot(range(1, 101, 10), score)
plt.show()

# Determine the dimension after finding the best parameter
X_dr = PCA(23).fit_transform(X)
cross_val_score(RFC(n_estimators=100, random_state=0), X_dr, y, cv=5).mean()

# Use KNN to train
from sklearn.neighbors import KNeighborsClassifier as KNN
cross_val_score(KNN(), X_dr, y, cv=5).mean()

# k-value learning curve of KNN
score = []
# Scan k from 1 to 10; the value of k can then be read off the curve (e.g. k = 3)
for i in range(10):
	X_dr = PCA(23).fit_transform(X)
	once = cross_val_score(KNN(i+1), X_dr, y, cv=5).mean()
	score.append(once)
plt.figure(figsize=[20,5])
plt.plot(range(1, 11), score)
plt.show()
