Machine Learning - Principal Component Analysis

Posted by 周宝航 on July 31, 2018

Dimensionality Reduction

Dimensionality reduction is mainly used for:

  1. Data Compression
  2. Data Visualization

Principal Component Analysis (PCA)

  • Reduce data from $n$-dimensions to $k$-dimensions:

1. Compute the "covariance matrix" (after mean normalization):

$\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)})(x^{(i)})^{T}$

2. Compute the "eigenvectors" of matrix $\Sigma$:

[U,S,V] = svd(Sigma);

we get the matrix $U = \begin{bmatrix} u^{(1)} & u^{(2)} & \dots & u^{(n)} \end{bmatrix} \in \mathbb{R}^{n \times n}$, whose columns are the eigenvectors of $\Sigma$.

To reduce the dimensionality, keep the first $k$ columns of $U$ and project each example $x \in \mathbb{R}^n$ onto them, giving $z = U_{reduce}^{T} x \in \mathbb{R}^k$:

Ureduce = U(:,1:k);
z = Ureduce'*x;
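As a rough NumPy sketch of the three steps above (covariance matrix, SVD of $\Sigma$, projection), run on synthetic mean-normalized data; the variable names mirror the Octave snippet and are only illustrative:

import numpy as np

# Synthetic data: m examples with n features (illustrative only)
m, n = 50, 3
X = np.random.randn(m, n)
X = X - X.mean(axis=0)            # mean normalization

Sigma = X.T @ X / m               # covariance matrix, n x n
U, S, _ = np.linalg.svd(Sigma)    # columns of U are the eigenvectors of Sigma

k = 2
U_reduce = U[:, :k]               # keep the first k principal components
Z = X @ U_reduce                  # projected data, shape (m, k)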

Choosing $k$ (number of principal components)

  • Average squared projection error: $\frac{1}{m} \sum_{i=1}^m \| x^{(i)} - x_{approx}^{(i)} \|^2$
  • Total variation in the data: $\frac{1}{m} \sum_{i=1}^m \| x^{(i)} \|^2$

  • Typically, choose $k$ to be the smallest value such that:

$\frac{\frac{1}{m} \sum_{i=1}^m \| x^{(i)} - x_{approx}^{(i)} \|^2}{\frac{1}{m} \sum_{i=1}^m \| x^{(i)} \|^2} \le 0.01$

i.e. 99% of variance is retained.

Algorithm

[U,S,V] = svd(Sigma);

Pick the smallest value of $k$ for which

$\frac{\sum_{i=1}^{k} S_{ii}}{\sum_{i=1}^{n} S_{ii}} \ge 0.99$

(99% of variance is retained).
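A small sketch of this selection rule, assuming S is the vector of singular values returned by svd(Sigma), sorted in decreasing order (as both Octave and NumPy return it); the helper name choose_k is only illustrative:

import numpy as np

def choose_k(S, retain=0.99):
    # Return the smallest k such that the first k singular values
    # account for at least `retain` of the total variance
    total = np.sum(S)
    running = 0.0
    for k, s in enumerate(S, start=1):
        running += s
        if running / total >= retain:
            return k
    return len(S)

# e.g. k = choose_k(S)   # with S taken from the svd step above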

Reconstruction from Compressed Representation

Since $z = U_{reduce}^{T} x$, the compressed representation can be mapped back to the original space by

$x_{approx} = U_{reduce} \, z \approx x$
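A one-line sketch of this step (the helper name recover_data is illustrative; Z and U_reduce are assumed to come from the projection sketch earlier):

def recover_data(Z, U_reduce):
    # Map the k-dimensional projections back to the original n-dimensional space
    return Z @ U_reduce.T

# X_approx = recover_data(Z, U_reduce) only approximates X;
# the discrepancy is the average squared projection error defined above.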

Advice for Applying PCA

Supervised learning speedup

$(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})$

Extract the inputs to get an unlabeled dataset $x^{(1)}, x^{(2)}, \dots, x^{(m)} \in \mathbb{R}^n$, then apply PCA to obtain $z^{(1)}, z^{(2)}, \dots, z^{(m)} \in \mathbb{R}^k$ ($k < n$).

New training set:

$(z^{(1)},y^{(1)}),(z^{(2)},y^{(2)}),\dots,(z^{(m)},y^{(m)})$

Note:

Mapping $x^{(i)} \to z^{(i)}$ should be defined by running PCA only on the training set. This mapping can be applied as well to the examples $x_{cv}^{(i)}$ and $x_{test}^{(i)}$ in the cross validation and test sets.
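A hedged sketch of that note, using hypothetical fit_pca / apply_pca helpers: the normalization parameters and the matrix U are estimated on the training set only, and the identical mapping is then reused for the cross-validation and test examples:

import numpy as np

def fit_pca(X_train, k):
    # Learn the mean/std normalization and the projection from the training set only
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    X_norm = (X_train - mu) / sigma
    Sigma = X_norm.T @ X_norm / X_train.shape[0]
    U, _, _ = np.linalg.svd(Sigma)
    return mu, sigma, U[:, :k]

def apply_pca(X, mu, sigma, U_reduce):
    # Apply the previously learned mapping (no re-fitting) to any dataset
    return ((X - mu) / sigma) @ U_reduce

# mu, sigma, U_reduce = fit_pca(X_train, k)
# Z_train = apply_pca(X_train, mu, sigma, U_reduce)
# Z_cv    = apply_pca(X_cv,    mu, sigma, U_reduce)   # same mapping as the training set
# Z_test  = apply_pca(X_test,  mu, sigma, U_reduce)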

Application of PCA

  • Compression
    • Reduce memory/disk needed to store data
    • Speed up learning algorithm
  • Visualization

Exercise

  • The original course exercise covers PCA visualization, plus dimensionality reduction and reconstruction on face image data.

pca.py

# -*- coding: utf-8 -*-
"""
Created on Mon Jul 30 20:41:36 2018

@author: 周宝航
"""

import numpy as np
import numpy.linalg as la

class PCA(object):
    
    def __init__(self):
        pass
    
    def featureNormalize(self, X):
        # Normalize each feature to zero mean and unit standard deviation
        m, n = X.shape
        mu = np.zeros((1, n))
        sigma = np.zeros((1, n))
        for i in range(n):
            mu[0, i] = np.mean(X[:,i])
            sigma[0, i] = np.std(X[:,i])
        X_norm = (X - mu) / sigma
        return X_norm, mu, sigma
    
    def pca(self, X):
        # Covariance matrix of the (already normalized) data,
        # then its eigenvectors via SVD
        m, n = X.shape
        Sigma = X.T.dot(X) / m
        U, S, _ = la.svd(Sigma)
        return U, S
    
    def projectData(self, X, U, K):
        # Project the data onto the first K principal components
        U_reduce = U[:, :K]
        Z = X.dot(U_reduce)
        return Z
    
    def recoverData(self, Z, U, K):
        # Approximately reconstruct the data from its K-dimensional projection
        U_reduce = U[:, :K]
        X = Z.dot(U_reduce.T)
        return X
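As a quick sanity check of the class above, a minimal usage sketch on synthetic data (the array shapes and K value are illustrative only):

import numpy as np
from pca import PCA

pca = PCA()
X = np.random.randn(100, 5)                      # 100 examples, 5 features (synthetic)
X_norm, mu, sigma = pca.featureNormalize(X)      # zero mean, unit variance per feature
U, S = pca.pca(X_norm)                           # principal components and singular values
Z = pca.projectData(X_norm, U, K=2)              # shape (100, 2)
X_rec = pca.recoverData(Z, U, K=2)               # shape (100, 5), approximation of X_norm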

main.py

# -*- coding: utf-8 -*-
"""
Created on Mon Jul 30 20:44:41 2018

@author: 周宝航
"""

from pca import PCA
import numpy as np
import scipy.io as sio
import matplotlib.pyplot as plt

# ================== Load Example Dataset  ===================
data = sio.loadmat('data\\ex7data1.mat')
X = data['X']
m, n = X.shape

fig = plt.figure()
ax = fig.add_subplot(1,2,1)

ax.plot(X[:, 0], X[:, 1], 'bo')

# =============== Principal Component Analysis ===============
pca = PCA()

X_norm, mu, sigma = pca.featureNormalize(X)

U, S = pca.pca(X_norm)

# Draw the principal component directions (eigenvectors scaled by the
# corresponding singular values) on top of the original data
p1 = mu
p2 = mu + 1.5 * S[0] * U[:, 0].T
data = np.r_[p1, p2].reshape([-1, 2])
ax.plot(data[:, 0], data[:, 1], '-k', linewidth=2)
p2 = mu + 1.5 * S[1] * U[:, 1].T
data = np.r_[p1, p2].reshape([-1, 2])
ax.plot(data[:, 0], data[:, 1], '-k', linewidth=2)

# =================== Dimension Reduction ===================
ax = fig.add_subplot(1,2,2)
ax.plot(X_norm[:, 0], X_norm[:, 1], 'bo')

# Project the normalized data onto the top K = 1 principal component,
# then recover its approximation in the original 2-D space
K = 1

Z = pca.projectData(X_norm, U, K)

X_rec = pca.recoverData(Z, U, K)

ax.plot(X_rec[:, 0], X_rec[:, 1], 'ro')


# Connect each normalized point to its projection with a dashed line
for i in range(m):
    data = np.r_[X_norm[i, :], X_rec[i, :]].reshape([-1, 2])
    ax.plot(data[:, 0], data[:, 1], '--k', linewidth=2)

# =============== Loading and Visualizing Face Data =============
def displayData(X):
    # Arrange the examples in X (one image per row) into a grid and display it
    m, n = X.shape
    example_width = int(np.sqrt(n))
    example_height = int(n / example_width)
    display_rows = int(np.floor(np.sqrt(m)))
    display_cols = int(np.ceil(m / display_rows))
    pad = 1

    display_array = - np.ones([pad + display_rows * (example_height + pad),
                               pad + display_cols * (example_width + pad)])
    curr_ex = 0
    for j in range(display_rows):
        for i in range(display_cols):
            if curr_ex == m:
                break
            # Scale each example to [-1, 1] before copying it into the grid
            max_val = np.max(np.abs(X[curr_ex, :]))
            row = pad + j * (example_height + pad)
            col = pad + i * (example_width + pad)
            display_array[row:row + example_height, col:col + example_width] = \
                X[curr_ex, :].reshape([example_height, example_width]) / max_val
            curr_ex += 1
        if curr_ex == m:
            break
    # Transpose because the face images are stored in column-major (MATLAB) order
    plt.imshow(display_array.T, cmap='gray')
    
data = sio.loadmat('data\\ex7faces.mat')
X = data['X']

#displayData(X[:100, :])    # show the first 100 faces

# =========== PCA on Face Data: Eigenfaces  ===================
X_norm, mu, sigma = pca.featureNormalize(X)

U, S = pca.pca(X_norm)

#displayData(U[:, :36].T)    # show the first 36 eigenfaces

# Project each face onto the first 100 principal components
K = 100
Z = pca.projectData(X_norm, U, K)

# ==== Visualization of Faces after PCA Dimension Reduction ====
X_rec  = pca.recoverData(Z, U, K)

fig = plt.figure()

original = fig.add_subplot(1, 2, 1)
original.set_title('Original faces')
displayData(X_norm[:100, :])

recovered = fig.add_subplot(1, 2, 2)
recovered.set_title('Recovered faces')
displayData(X_rec[:100, :])

Result 1

  • The left plot shows the principal component directions found by PCA on the example data.
  • In the right plot, the blue points are the original (normalized) data and the red points are the data after dimensionality reduction; the dashed lines show the mapping from each original point to its projection.


Result 2

  • The left panel shows the original face images, and the right panel shows the faces recovered after dimensionality reduction.
