This notebook contains the after-class code for the "Dimensionality Reduction" chapter of the machine learning course. Course address: https://www.icourse163.org/course/WZU-1464096179 Complete course code: https://github.com/fengdu78/WZU-machine-learning-course Code modified and annotated by Huang Haiguang, haiguang2000@wzu.edu.cn
Principal Component Analysis
PCA is a linear transformation that finds the "principal components", the directions of maximum variance, in a dataset. It can be used for dimensionality reduction. In this exercise we first implement PCA and apply it to a simple two-dimensional dataset to see how it works. Let's start by loading and visualizing the dataset.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
from scipy.io import loadmat
```
```python
data = pd.read_csv('data/pcadata.csv')
data.head()
```
|   | X1 | X2 |
|---|----------|----------|
| 0 | 3.381563 | 3.389113 |
| 1 | 4.527875 | 5.854178 |
| 2 | 2.655682 | 4.411995 |
| 3 | 2.765235 | 3.715414 |
| 4 | 2.846560 | 4.175506 |
```python
X = data.values
```
```python
fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(X[:, 0], X[:, 1])
plt.show()
```

The PCA algorithm is quite simple. After making sure the data is normalized, the main computation is just the singular value decomposition of the covariance matrix of the data.
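In formulas, with X the normalized m × n data matrix (m examples, n features), the two steps are:

```latex
\Sigma = \frac{1}{m} X^{\mathsf{T}} X, \qquad \Sigma = U S V^{\mathsf{T}}
```

The columns of U are the principal components, and the entries of S measure the variance captured along each of them. This is exactly what the function below computes.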
```python
def pca(X):
    # normalize the features
    X = (X - X.mean()) / X.std()

    # compute the covariance matrix
    X = np.matrix(X)
    cov = (X.T * X) / X.shape[0]

    # perform SVD
    U, S, V = np.linalg.svd(cov)

    return U, S, V
```
```python
U, S, V = pca(X)
U, S, V
```
```
(matrix([[-0.79241747, -0.60997914],
         [-0.60997914,  0.79241747]]),
 array([1.43584536, 0.56415464]),
 matrix([[-0.79241747, -0.60997914],
         [-0.60997914,  0.79241747]]))
```
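As a quick sanity check (my addition, not part of the original exercise), the columns of U come from an SVD and should therefore be orthonormal:

```python
# U comes from an SVD, so U @ U.T should be the identity matrix
print(np.allclose(U @ U.T, np.eye(2)))  # True
```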
Now we have the principal components (the matrix U), and we can use them to project the original data into a lower-dimensional space. For this task we implement a function that computes the projection and keeps only the top k components, which effectively reduces the dimensionality.
```python
def project_data(X, U, k):
    U_reduced = U[:, :k]
    return np.dot(X, U_reduced)
```
```python
Z = project_data(X, U, 1)
```
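Z now holds a single coordinate per example: the position of each point along the first principal component. A quick look (again just a check, not in the original exercise):

```python
# one projected coordinate per example
print(Z.shape)  # (number of examples, 1)
print(Z[:3])    # projections of the first three points
```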
We can also approximately recover the original data by reversing the projection.
```python
def recover_data(Z, U, k):
    U_reduced = U[:, :k]
    return np.dot(Z, U_reduced.T)
```
```python
X_recovered = recover_data(Z, U, 1)
```
```python
fig, ax = plt.subplots(figsize=(12, 8))
ax.scatter(list(X_recovered[:, 0]), list(X_recovered[:, 1]))
plt.show()
```

Note that the projection axis of the first principal component is essentially a diagonal line through the dataset. When we reduce the data to one dimension, we lose the variation around that diagonal, so in our reconstruction everything lies along it.
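We can put a rough number on how much was lost. The sketch below (my addition, not part of the original exercise) compares the squared reconstruction error to the total squared norm of the data:

```python
# rough fraction of the data's squared norm retained by the 1-D projection
X_arr = np.asarray(X)
X_rec = np.asarray(X_recovered)
retained = 1 - np.sum((X_arr - X_rec) ** 2) / np.sum(X_arr ** 2)
print(retained)
```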
Our last task in this exercise is to apply PCA to images of faces. Using the same dimensionality reduction technique, we can capture the "essence" of an image with far less data than the original.
```python
faces = loadmat('data/ex7faces.mat')
X = faces['X']
X.shape
```
```
(5000, 1024)
```
```python
def plot_n_image(X, n):
    """Plot the first n images; n has to be a square number."""
    pic_size = int(np.sqrt(X.shape[1]))
    grid_size = int(np.sqrt(n))

    first_n_images = X[:n, :]

    fig, ax_array = plt.subplots(nrows=grid_size, ncols=grid_size,
                                 sharey=True, sharex=True, figsize=(8, 8))

    for r in range(grid_size):
        for c in range(grid_size):
            ax_array[r, c].imshow(first_n_images[grid_size * r + c].reshape(
                (pic_size, pic_size)))
            plt.xticks(np.array([]))
            plt.yticks(np.array([]))
```
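For reference, rendering an 8 × 8 grid of the first 64 faces would look like this (a usage sketch; n must be a perfect square):

```python
plot_n_image(X, n=64)
plt.show()
```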
The exercise code includes a function that renders the first 100 faces in the dataset. Rather than reproduce that here, you can see what they look like in the exercise text. We can at least easily render a single image.
```python
face = np.reshape(X[3, :], (32, 32))
plt.imshow(face)
plt.show()
```

It looks awful. These images are only 32 × 32 grayscale (and this one is rendered sideways, but we can ignore that for now). Our next step is to run PCA on the faces dataset and take the top 100 principal components.
```python
U, S, V = pca(X)
Z = project_data(X, U, 100)
```
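Since S measures the variance along each component, we can check roughly how much of the total variance the top 100 components capture (a sanity check of my own, not part of the exercise):

```python
# fraction of total variance captured by the first 100 components
print(S[:100].sum() / S.sum())
```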
Now we can try to recover the original structure and render it again.
```python
X_recovered = recover_data(Z, U, 100)
face = np.reshape(X_recovered[3, :], (32, 32))
plt.imshow(face)
plt.show()
```

We can see that the data dimension has been reduced roughly tenfold (from 1024 values per image to 100), yet the recognizable details of the face are preserved.
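To make the comparison easier, here is a small sketch (not part of the original exercise) that renders the original and the recovered face side by side:

```python
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.imshow(np.reshape(X[3, :], (32, 32)))
ax1.set_title('Original')
ax2.imshow(np.reshape(np.asarray(X_recovered)[3, :], (32, 32)))
ax2.set_title('Recovered (k = 100)')
plt.show()
```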
References
- Prof. Andrew Ng. Machine Learning. Stanford University