Machine Learning - Neural Networks

Posted by 周宝航 on July 22, 2018

Neural Networks

  • Origins: Algorithms that try to mimic the brain.
  • Was very widely used in the 80s and early 90s; popularity diminished in the late 90s.
  • Recent resurgence: State-of-the-art technique for many applications

Model Representation

  • Simplistic Model
  • With one hidden layer
  • $a_i^{(j)}=$ "activation" of unit $i$ in layer $j$

  • $\Theta^{(j)}=$ matrix of weights controlling the function mapping from layer $j$ to layer $j+1$

  • Details

  • If the network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j+1)$
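  • For example, if layer $j$ has $s_j = 2$ units and layer $j+1$ has $s_{j+1} = 4$ units, then $\Theta^{(j)}$ is a $4 \times 3$ matrix (the extra column multiplies the bias unit $a_0^{(j)} = 1$).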

Forward propagation: Vectorized implementation

  • We define $z^{(2)} = \Theta^{(1)} a^{(1)}$ (with $a^{(1)} = x$) and $a^{(2)} = g(z^{(2)})$; after adding the bias unit $a_0^{(2)} = 1$, similarly $z^{(3)} = \Theta^{(2)} a^{(2)}$.
  • So $h_\Theta(x) = a^{(3)} = g(z^{(3)})$: each layer applies the same logistic computation to the previous layer's activations (see the NumPy sketch below).
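A minimal NumPy sketch of this vectorized forward pass, for one example through one hidden layer (the layer sizes here are illustrative assumptions, not from the original notes):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: 3 input units, 5 hidden units, 1 output unit.
Theta1 = np.random.randn(5, 4)            # maps layer 1 -> layer 2 (4 = 3 inputs + bias)
Theta2 = np.random.randn(1, 6)            # maps layer 2 -> layer 3 (6 = 5 hidden + bias)

x = np.random.randn(3)                    # one training example
a1 = np.concatenate(([1.0], x))           # add bias unit a0^(1) = 1
z2 = Theta1.dot(a1)
a2 = np.concatenate(([1.0], sigmoid(z2))) # add bias unit a0^(2) = 1
z3 = Theta2.dot(a2)
h = sigmoid(z3)                           # h_Theta(x) = a^(3)
print(h)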

Multiclass Classification

  • There are four kinds of images: Pedestrian, Car, Motorcycle, Truck

  • So:$h_\Theta(x)\in \mathbb{R}^4$

  • Training set: $(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})$

$y^{(i)}$ is one of $\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} \to \text{pedestrian}, \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix} \to \text{car}, \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix} \to \text{motorcycle}, \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} \to \text{truck}$
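As a small aside (a NumPy sketch, not part of the original notes; the label-to-class mapping is assumed), integer class labels can be expanded into exactly these one-hot target vectors:

import numpy as np

classes = ['pedestrian', 'car', 'motorcycle', 'truck']
labels = np.array([0, 2, 1, 3])        # example integer labels for four images
Y = np.eye(len(classes))[labels].T     # shape (4, m): one one-hot column per example
print(Y)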

Cost function

  • $\{ (x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)}) \}$
  • $L=$ total no. of layers in network
  • $s_l=$ no. of units (not counting bias unit) in layer $l$
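For reference, these definitions feed into the standard regularized cost function from the course:

$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[ y_k^{(i)} \log (h_\Theta(x^{(i)}))_k + (1-y_k^{(i)}) \log (1-(h_\Theta(x^{(i)}))_k) \right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} (\Theta_{ji}^{(l)})^2$

where $K = s_L$ is the number of output units and the regularization term skips the bias weights.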

Backpropagation algorithm

Gradient computation

  • Assume a four-layer neural network
  • Given one training example $(x,y)$
  • Forward propagation: compute $a^{(1)} = x$, then $z^{(l+1)} = \Theta^{(l)} a^{(l)}$ and $a^{(l+1)} = g(z^{(l+1)})$ layer by layer, up to $a^{(4)} = h_\Theta(x)$
  • Intuition: $\delta_j^{(l)}=$ "error" of node $j$ in layer $l$.

For each output unit (layer L = 4)

$\delta^{(4)}=a^{(4)}-y$

$\delta^{(3)}=(\Theta^{(3)})^T \delta^{(4)} .* g'(z^{(3)})$

$\delta^{(2)}=(\Theta^{(2)})^T \delta^{(3)} .* g'(z^{(2)})$
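Note that for the sigmoid activation, $g'(z^{(l)}) = a^{(l)} .* (1 - a^{(l)})$, so the derivative term can be computed directly from the activations; this is the identity implemented by the `__sigmoid(..., derive=True)` helper in the exercise code below.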

Summarize

Training set $\{ (x^{(1)},y^{(1)}),\dots,(x^{(m)},y^{(m)}) \}$

Set $\Delta_{ij}^{(l)}=0$ (for all $l, i, j$)
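The rest of the standard procedure then loops over the training set and accumulates the gradients:

For $i = 1$ to $m$:

  • Set $a^{(1)} = x^{(i)}$
  • Perform forward propagation to compute $a^{(l)}$ for $l = 2, 3, \dots, L$
  • Using $y^{(i)}$, compute $\delta^{(L)} = a^{(L)} - y^{(i)}$, then $\delta^{(L-1)}, \dots, \delta^{(2)}$
  • Accumulate $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)} (a^{(l)})^T$

Finally, $D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)} + \lambda \Theta_{ij}^{(l)}$ if $j \neq 0$, $D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)}$ if $j = 0$, and $\frac{\partial}{\partial \Theta_{ij}^{(l)}} J(\Theta) = D_{ij}^{(l)}$.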

Example

Topology

So

Define

$\delta_j^{(l)}=$ "error" of cost for $a_j^{(l)}$ (unit $j$ in layer $l$)

Formally, $\delta_j^{(l)}=\frac{\partial}{\partial z_j^{(l)}}cost(i)$ (for $j \ge 0$), where $cost(i)=y^{(i)}\log h_\Theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\Theta(x^{(i)}))$

E.g.

Advanced Optimization

Neural Network (L = 4)

$\Theta^{(1)},\Theta^{(2)},\Theta^{(3)}$ - matrices (Theta1, Theta2, Theta3)

$D^{(1)}, D^{(2)}, D^{(3)}$ - matrices (D1, D2, D3)

Example

  • s1 = 10, s2 = 10, s3 = 10, s4 = 1
thetaVec = [Theta1(:); Theta2(:); Theta3(:)];

DVec = [D1(:); D2(:); D3(:)];

Theta1 = reshape(thetaVec(1:110), 10, 11)
Theta2 = reshape(thetaVec(111:220), 10, 11)
Theta3 = reshape(thetaVec(221:231), 1, 11)
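For readers following the Python code later in this post, a rough NumPy equivalent of the same unroll/reshape round trip (a sketch; sizes match the example above):

import numpy as np

Theta1 = np.random.randn(10, 11)
Theta2 = np.random.randn(10, 11)
Theta3 = np.random.randn(1, 11)

# Unroll all parameter matrices into one vector for the optimizer.
thetaVec = np.concatenate([Theta1.ravel(), Theta2.ravel(), Theta3.ravel()])

# Inside the cost function, reshape the vector back into matrices.
Theta1_r = thetaVec[0:110].reshape(10, 11)
Theta2_r = thetaVec[110:220].reshape(10, 11)
Theta3_r = thetaVec[220:231].reshape(1, 11)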

Learning Algorithm

**Have initial parameters:** $\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}, \dots$ Unroll them into a single vector initialTheta to pass to the optimizer; inside the cost function, reshape the vector back into $\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}$, run forward/backpropagation to compute $J(\Theta)$ and $D^{(1)}, D^{(2)}, D^{(3)}$, and unroll the $D$'s into the gradient vector.

Gradient Checking

  • Used to verify that backpropagation is computing the gradients correctly, so that the network can converge properly

Define

**Parameter vector:** $\theta$

$\theta \in \mathbb{R}^n$ (e.g. $\theta$ is the "unrolled" vector of $\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}$).

$\theta = [\theta_1, \theta_2, \theta_3, \dots, \theta_n]$
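The gradient is then approximated component-wise with the two-sided difference $\frac{\partial}{\partial \theta_i} J(\theta) \approx \frac{J(\theta + \epsilon e_i) - J(\theta - \epsilon e_i)}{2\epsilon}$ (with $\epsilon \approx 10^{-4}$). A minimal NumPy sketch of gradApprox (function and variable names are illustrative):

import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    # Two-sided finite-difference approximation of dJ/dtheta_i.
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += eps
        theta_minus[i] -= eps
        grad_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)
    return grad_approx

# Sanity check on a function with a known gradient: J(theta) = sum(theta^2).
theta = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(lambda t: np.sum(t ** 2), theta))  # approximately [2, -4, 6]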

Implementation Note:

  • Implement backprop to compute DVec
  • Implement numerical gradient check to compute gradApprox
  • Make sure they give similar values
  • Turn off gradient checking. Use the backprop code for learning.

Random initialization

  • Initialize each $\Theta_{ij}^{(l)}$ to a random value in $[-\epsilon, \epsilon]$
  • (i.e. $- \epsilon \leq \Theta_{ij}^{(l)} \leq \epsilon$)

E.g.

Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON
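A rough NumPy equivalent (the value of INIT_EPSILON is an assumption; any small constant works):

import numpy as np

INIT_EPSILON = 0.12
Theta1 = np.random.rand(10, 11) * 2 * INIT_EPSILON - INIT_EPSILON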

Train a neural network

Pick a network architecture (connectivity pattern between neurons).

  • No. of input units: Dimension of features $x^{(i)}$
  • No. of output units: Number of classes.
  1. Randomly initialize weights.
  2. Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$.
  3. Implement code to compute cost function $J(\Theta)$.
  4. Implement backprop to compute partial derivatives $\frac{\partial}{\partial \Theta_{jk}^{(l)}}J(\Theta)$.
  5. Use gradient checking to compare $\frac{\partial}{\partial \Theta_{jk}^{(l)}}J(\Theta)$ computed by backpropagation with a numerical estimate of the gradient of $J(\Theta)$; then disable gradient checking.
  6. Use gradient descent or an advanced optimization method with backpropagation to try to minimize $J(\Theta)$ as a function of the parameters $\Theta$.

Exercise

  • Finally, let's implement a neural network by hand.
  • The training data is the simple XOR problem (each row of xor.txt below is x1 x2 y).
  • I originally wanted to train it on handwritten digit recognition; that will have to wait for later.

xor.txt

0 0 0
1 0 1
0 1 1
1 1 0

neural_network.py

# -*- coding: utf-8 -*-
"""
Created on Sat Jul 21 21:02:28 2018

@author: 周宝航
"""

import numpy as np
import logging
import matplotlib.pyplot as plt

class NeuralNetwork(object):
    
    def __init__(self, sizes, num_iters=None, alpha=None):
        # layers numbers
        self.num_layers = len(sizes)
        # weight matrices between consecutive layers (note: this implementation uses no bias units)
        self.weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]
        # iteration numbers
        self.num_iters = num_iters if num_iters else 6000
        # learning rate
        self.alpha = alpha if alpha else 1
        # logger
        self.logger = logging.getLogger()
        logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
        logging.root.setLevel(level=logging.INFO)
        
    def __read_data(self, file_path=None):
        data = None
        if file_path:
            self.logger.info("loading train data from {0}".format(file_path))
            data = np.loadtxt(file_path)
        return data
        
    def __sigmoid(self, z, derive=False):
        if derive:
            return self.__sigmoid(z) * (1.0 - self.__sigmoid(z))
        else:
            return 1.0 / (1.0 + np.exp(-z))
    
    def save(self, file_path):
        if file_path:
            import pickle
            with open(file_path, 'wb') as f:
                pickle.dump(self.weights, f)
    
    def load(self, file_path):
        if file_path:
            import pickle
            with open(file_path, 'rb') as f:
                self.weights = pickle.load(f)
    
    def forwardprop(self, X):
        activation = X
        activations = [X]
        zs = []
        for w in self.weights:
            z = w.dot(activation)
            zs.append(z)
            activation = self.__sigmoid(z)
            activations.append(activation)
        return (activations, zs)
    
    def costFunction(self, y, _y):
        m = len(y)
        return - np.sum(y * np.log(_y) + (1.0 - y) * np.log(1.0 - _y)) / m
    
    def backprop(self, X, y):
        # gradient of the cost w.r.t. each weight matrix
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # forward propagation
        activations, zs = self.forwardprop(X)
        # output-layer error: for the cross-entropy cost with a sigmoid output,
        # delta^(L) = a^(L) - y (the g'(z) factor cancels)
        delta = activations[-1] - y
        nabla_w[-1] = delta.dot(activations[-2].T)
        # back propagation through the hidden layers
        for l in range(2, self.num_layers):
            # delta^(l) = (weights^(l))^T delta^(l+1) .* g'(z^(l))
            delta = self.weights[-l+1].T.dot(delta) * self.__sigmoid(zs[-l], derive=True)
            nabla_w[-l] = delta.dot(activations[-l-1].T)

        # full-batch gradient descent step on the accumulated gradients
        self.weights = [w - self.alpha * delta_w for w, delta_w in zip(self.weights, nabla_w)]
        return activations[-1]
    
    def train_model(self, file_path=None):
        train_data = self.__read_data(file_path)
        self.logger.info("getting feature values")
        X = train_data[:, :-1].T
        self.logger.info("getting object values")
        y = train_data[:, -1]
        J_history = []
        for i in range(self.num_iters):
            _y = self.backprop(X, y)
            cost = self.costFunction(y, _y)
            self.logger.info("epoch {0} cost : {1}".format(i, cost))
            J_history.append(cost)
        fig = plt.figure()
        ax_loss = fig.add_subplot(1,1,1)
        ax_loss.set_title('Loss')
        ax_loss.set_xlabel('Iteration')
        ax_loss.set_ylabel('Loss')
        ax_loss.set_xlim(0,self.num_iters)
        ax_loss.plot(J_history)
        plt.show()

train_model.py

# -*- coding: utf-8 -*-
"""
Created on Sat Jul 21 22:28:54 2018

@author: 周宝航
"""

import logging
import os.path
import sys
import argparse
from neural_network import NeuralNetwork

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    
    parser = argparse.ArgumentParser(prog=program, description='train a neural network model')
    parser.add_argument("--in_path", "-i", required=True, help="train data path")
    parser.add_argument("--out_path", "-o", help="output model path, file type is : *.pkl")
    parser.add_argument("--num_iters", "-n", type=int,help="iteration times")
    parser.add_argument("--alpha", "-a", type=float, help="learning rate")
    parser.add_argument("--layers", "-l", help="neural network architechtures, e.g. 784,100,10")
    args = parser.parse_args()
    
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
    
    if args.layers:
        sizes = [int(layer) for layer in args.layers.split(',')]
        nn = NeuralNetwork(sizes=sizes, num_iters=args.num_iters, alpha=args.alpha)
        logger.info("start training")
        nn.train_model(args.in_path)   

        if args.out_path:
            if args.out_path.split('.')[-1] == "pkl":
                nn.save(args.out_path)
            else:
                print("model file type error. Please use *.pkl to name your model.")
                sys.exit(1)
    else:
        print("Please give the neural network architechtures args as the note of help")

Usage

python train_model.py -i data\xor.txt -n 3000 -l 2,4,1

  • Result plot

(Figure: training loss vs. iteration)
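If a model was also saved with `-o` (say `-o model.pkl`, a hypothetical path), it can be reloaded and evaluated on the XOR inputs roughly like this:

import numpy as np
from neural_network import NeuralNetwork

nn = NeuralNetwork(sizes=[2, 4, 1])
nn.load('model.pkl')                   # hypothetical path passed to -o during training

# One column per example, matching train_model's X = train_data[:, :-1].T
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]]).T
activations, _ = nn.forwardprop(X)
print(activations[-1])                 # predicted probabilities for 0 xor 0, 1 xor 0, 0 xor 1, 1 xor 1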