Machine Learning: Logistic Regression

Posted by 周宝航 on July 18, 2018

Logistic Regression

Classification

  • Email : Spam/Not Spam?
  • Online Transactions : Fraudulent(Yes/No)?
  • Tumor : Malignant/Benign?

Problem: with linear regression, $h_\theta(x)$ can be $> 1$ or $< 0$, even though the labels are only 0 or 1.

Logistic Regression: $0 \le h_\theta(x)\le 1$

Hypothesis Representation

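  • want $0 \le h_\theta(x)\le 1$

  • original (linear regression): $h_\theta(x)=\theta^Tx$

  • sigmoid function: $g(z)=\dfrac{1}{1+e^{-z}}$
  • new: $h_\theta(x)=g(\theta^Tx)=\dfrac{1}{1+e^{-\theta^Tx}}$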
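The sigmoid hypothesis can be sketched in a few lines of plain Python. This is only an illustration: the function names `sigmoid` and `hypothesis` are chosen here, and the branch on the sign of `z` is an added safeguard against overflow, not something the notes prescribe.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # safe: z < 0, so exp(z) is below 1
    return ez / (1.0 + ez)

def hypothesis(theta, x):
    """h_theta(x) = g(theta^T x); x must already include x_0 = 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

print(sigmoid(0.0))  # 0.5, the midpoint of the curve
```

Whatever the value of $\theta^Tx$, the output stays inside $[0, 1]$, which is what allows it to be read as a probability.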

Interpretation of Hypothesis Output

$h_\theta(x)=$ estimated probability that $y=1$ on input $x$

Example: if $x=\begin{bmatrix}x_0\\ x_1\end{bmatrix}=\begin{bmatrix}1\\ \text{tumorSize}\end{bmatrix}$ and $h_\theta(x)=0.7$

Tell the patient that there is a 70% chance of the tumor being malignant.

Summary:

  • $h_\theta(x)=P(y=1\mid x;\theta)$: “Probability that y = 1, given x, parameterized by $\theta$”
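As a small worked step for the tumor example (the complement rule is standard probability rather than something stated in these notes): write $P(y=1\mid x;\theta)$ for the probability that $h_\theta(x)$ estimates. Since $y$ can only be 0 or 1, the two probabilities sum to one, so

$P(y=0\mid x;\theta)=1-P(y=1\mid x;\theta)=1-0.7=0.3$

That is, the same hypothesis output implies a 30% chance that the tumor is benign.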

Decision boundary

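  • The decision boundary is a property of the hypothesis function; it divides the plane into two regions. It depends only on the parameters of the hypothesis, so it is not a property of the dataset.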

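E.g.

$h_\theta(x)=g(\theta_0+\theta_1x_1+\theta_2x_2)$

and

$\theta=\begin{bmatrix}-3\\ 1\\ 1\end{bmatrix}$

We compute $\theta^Tx$ and get $-3+x_1+x_2$.

Predict “$y=1$” if $-3+x_1+x_2 \ge 0$, i.e. if $x_1+x_2 \ge 3$. The line $x_1+x_2=3$ is the decision boundary, where “$h_\theta(x)=0.5$”.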
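The rule above can be checked numerically. The sketch below assumes $\theta=(-3, 1, 1)$, which is the parameter vector that yields $\theta^Tx=-3+x_1+x_2$; the function name `predict` and the test points are made up for illustration.

```python
def predict(theta, x1, x2):
    """Return 1 if theta^T x >= 0 (equivalently h_theta(x) >= 0.5), else 0."""
    z = theta[0] + theta[1] * x1 + theta[2] * x2
    return 1 if z >= 0 else 0

theta = [-3.0, 1.0, 1.0]  # gives theta^T x = -3 + x1 + x2

print(predict(theta, 1.0, 1.0))  # 0, since -3 + 1 + 1 = -1 < 0
print(predict(theta, 2.0, 2.0))  # 1, since -3 + 2 + 2 = 1 >= 0
print(predict(theta, 1.5, 1.5))  # 1, the point lies exactly on the boundary
```

Note that the data never enters this function: the boundary is fixed entirely by `theta`.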

Cost function

Training set: $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\dots,(x^{(m)},y^{(m)})\}$

m examples: $x=\begin{bmatrix}x_0\\ x_1\\ \vdots\\ x_n\end{bmatrix},\quad x_0=1,\quad y\in\{0,1\}$

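Redefine:

$\mathrm{Cost}(h_\theta(x),y)=\begin{cases}-\log(h_\theta(x)) & \text{if } y=1\\ -\log(1-h_\theta(x)) & \text{if } y=0\end{cases}$

Logistic regression cost function

  • Combining the two cases: $\mathrm{Cost}(h_\theta(x),y)=-y\log(h_\theta(x))-(1-y)\log(1-h_\theta(x))$
  • Substituting this into the original cost function $J(\theta)=\frac{1}{m}\sum_{i=1}^{m}\mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)})$ gives: $J(\theta)=-\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$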
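A minimal sketch of the logistic regression cost (the average of $-y\log h-(1-y)\log(1-h)$ over the training set) in plain Python. The function name `cost`, the two-point dataset, and the small `eps` clipping that keeps `log` away from zero are all assumptions added for illustration.

```python
import math

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def cost(theta, X, y, eps=1e-12):
    """Average of -y*log(h) - (1-y)*log(1-h) over all examples."""
    m = len(y)
    total = 0.0
    for xi, yi in zip(X, y):
        h = sigmoid(sum(t * v for t, v in zip(theta, xi)))
        h = min(max(h, eps), 1.0 - eps)  # assumption: clip so log() never sees 0
        total += -yi * math.log(h) - (1 - yi) * math.log(1.0 - h)
    return total / m

# Made-up data: each row is [x_0 = 1, x_1].
X = [[1.0, -1.0], [1.0, 1.0]]
y = [0, 1]

print(cost([0.0, 0.0], X, y))  # log(2), about 0.693: every h is 0.5
print(cost([0.0, 2.0], X, y))  # smaller, because both points are now on the right side
```

With all parameters at zero every prediction is 0.5, so the cost starts at $\log 2$ regardless of the data; training should drive it below that value.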

Gradient Descent

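Repeat {

$\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)x_j^{(i)}$ (simultaneously update all $\theta_j$)

}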
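The repeat loop can be sketched as batch gradient descent in plain Python. This is an illustrative version, not the author's implementation: it averages the gradient over the $m$ examples, whereas `LogisticRegression.gradientDescent` further down uses the summed gradient, and the function name, learning rate, iteration count, and toy data are all assumptions.

```python
import math

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for logistic regression.

    X is a list of rows that already start with x_0 = 1; y holds 0/1 labels.
    """
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(num_iters):
        # prediction error h_theta(x^(i)) - y^(i) for every example
        errors = [sigmoid(sum(t * v for t, v in zip(theta, xi))) - yi
                  for xi, yi in zip(X, y)]
        # averaged gradient component for each theta_j
        grad = [sum(e * xi[j] for e, xi in zip(errors, X)) / m for j in range(n)]
        # simultaneous update of all parameters
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Made-up, linearly separable toy data with a single feature.
X = [[1.0, -2.0], [1.0, -1.5], [1.0, -1.0], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0]]
y = [0, 0, 0, 1, 1, 1]

theta = gradient_descent(X, y)
predictions = [1 if sigmoid(theta[0] + theta[1] * xi[1]) >= 0.5 else 0 for xi in X]
print(theta, predictions)
```

Building the new `theta` list in one expression is what makes the update simultaneous: every component is computed from the old parameters before any of them changes.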

Code

matlab

Multiclass Classification

  • Email foldering/tagging: Work, Friends, Family, Hobby
  • Medical diagnosis: Not ill, Cold, Flu
  • Weather: Sunny, Cloudy, Rain, Snow

One versus rest

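  • Suppose we now want to classify the data into three classes: T, S, C

Procedure

  1. First consider class T and treat all the remaining classes as a single class. This turns the task back into a binary classification problem, and logistic regression then gives us a classifier $h_\theta^{(1)}(x)$.
  2. Then repeat the same step for classes S and C to obtain $h_\theta^{(2)}(x)$ and $h_\theta^{(3)}(x)$.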

Summary:

Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y=i$.

On a new input $x$, to make a prediction, pick the class $i$ that maximizes $h_\theta^{(i)}(x)$.
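The one-versus-rest recipe can be sketched as follows. Everything here is illustrative: the helper names (`train_binary`, `train_one_vs_rest`, `predict`), the hyperparameters, and the three made-up clusters labelled T, S, and C are assumptions, and the binary trainer is a plain batch gradient descent rather than the class from the exercise.

```python
import math

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def dot(theta, x):
    return sum(t * v for t, v in zip(theta, x))

def train_binary(X, y, alpha=0.1, num_iters=1000):
    """Fit one logistic regression classifier with batch gradient descent."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(num_iters):
        errors = [sigmoid(dot(theta, xi)) - yi for xi, yi in zip(X, y)]
        grad = [sum(e * xi[j] for e, xi in zip(errors, X)) / m for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

def train_one_vs_rest(X, labels):
    """Train one classifier per class: that class is 1, all other classes are 0."""
    classifiers = {}
    for cls in sorted(set(labels)):
        y = [1 if label == cls else 0 for label in labels]
        classifiers[cls] = train_binary(X, y)
    return classifiers

def predict(classifiers, x):
    """Pick the class whose classifier reports the highest probability."""
    return max(classifiers, key=lambda cls: sigmoid(dot(classifiers[cls], x)))

# Made-up toy data: each row is [x_0 = 1, x_1, x_2]; three clusters T, S, C.
X = [
    [1.0, 0.0, 3.0], [1.0, 0.5, 3.5], [1.0, -0.5, 3.5],       # T
    [1.0, -3.0, -2.0], [1.0, -3.5, -2.5], [1.0, -2.5, -2.5],  # S
    [1.0, 3.0, -2.0], [1.0, 3.5, -2.5], [1.0, 2.5, -2.5],     # C
]
labels = ["T"] * 3 + ["S"] * 3 + ["C"] * 3

classifiers = train_one_vs_rest(X, labels)
print(predict(classifiers, [1.0, 0.0, 4.0]))
print(predict(classifiers, [1.0, -4.0, -3.0]))
print(predict(classifiers, [1.0, 4.0, -3.0]))
```

The three classifiers are trained independently, so their outputs need not sum to one; only their ranking matters for the final prediction.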

Exercise

  • Time to put this into practice again; the code and the result figure are attached below.

logistic_regression.py

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 18 16:52:11 2018

@author: 周宝航
"""

import numpy as np
import matplotlib.pyplot as plt
import logging

class LogisticRegression(object):
    
    def __init__(self, num_iters=None, alpha=None, num_params=3):
        # iteration numbers
        self.num_iters = num_iters if num_iters else 1500
        # learning rate
        self.alpha = alpha if alpha else 0.01
        # parameters
        self.theta = np.zeros([num_params,1])
        # training data
        self.data = None
        # logger
        self.logger = logging.getLogger()
        logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
        logging.root.setLevel(level=logging.INFO)
        
    def read_data(self, file_path=None):
        if file_path:
            self.logger.info("reading the data from %s" % file_path)
            self.data = np.loadtxt(file_path)
            
    def save(self, path=None):
        if path:
            import pickle
            # open in binary write mode; "rb" would make pickle.dump fail
            with open(path, "wb") as f:
                pickle.dump(self.theta, f)
                
    def load(self, path=None):
        if path:
            import pickle
            with open(path, "rb") as f:
                self.theta = pickle.load(f)
                
    def sigmoid(self, z):
        return 1 / (1 + np.exp(- z))
    
    def computeCost(self, X, y, theta):
        m = len(y)
        h = self.sigmoid(X.dot(theta))
        J = - (y.T.dot(np.log(h)) + (1.0 - y).T.dot(np.log(1.0 - h))) / m
        return np.sum(J)
    
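    def gradientDescent(self, X, y):
        # Batch gradient descent; yields the cost after every update.
        # Note: the update uses the summed gradient X^T(h - y) without a 1/m
        # factor, so the step is m times larger than with the averaged gradient.
        for i in range(self.num_iters):
            h = self.sigmoid(X.dot(self.theta))
            self.theta = self.theta - self.alpha * X.T.dot(h - y)
            J = self.computeCost(X, y, self.theta)
            yield J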
    
    def train_model(self, file_path=None):
        self.read_data(file_path)
        self.logger.info("getting the feature values")
        x = self.data[:,:-1]
        self.logger.info("getting the label values")
        y = self.data[:,-1].reshape([-1, 1])
        # generate the feature matrix
        X = np.c_[np.ones([len(x), 1]), x]
        self.logger.info("start gradient descent")
        fig = plt.figure()
        ax_model = fig.add_subplot(1,2,1)
        for feature,tag in zip(x,y):
            color = 'or' if tag==0 else 'ob'
            ax_model.plot(feature[0], feature[1], color)
        ax_loss = fig.add_subplot(1,2,2)
        J_history = []
        for J in self.gradientDescent(X, y):
            J_history.append(J)
        ax_model.set_title('Logistic regression')
        ax_model.set_xlabel('feature 1')
        ax_model.set_ylabel('feature 2')
        tx = x[:,0]
        ty = (-self.theta[0, 0] - self.theta[1, 0] * tx) / self.theta[2, 0]
        ax_model.plot(tx, ty, color='r')

        ax_loss.set_title('Loss')
        ax_loss.set_xlabel('Iteration')
        ax_loss.set_ylabel('Loss')
        ax_loss.set_xlim(0,self.num_iters)
        ax_loss.plot(J_history)
        plt.show()
        self.logger.info("end")

train_model.py

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 18 16:50:20 2018

@author: 周宝航
"""

import logging
import os.path
import sys
import argparse
from logistic_regression import LogisticRegression

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    
    parser = argparse.ArgumentParser(prog=program, description = 'train the model by logistic regression')
    parser.add_argument("--in_path", "-i", required=True, help="train data path")
    parser.add_argument("--out_path", "-o", help="output model path, file type is : *.pkl")
    parser.add_argument("--num_iters", "-n", type=int,help="iteration times")
    parser.add_argument("--alpha", "-a", type=float, help="learning rate")
    args = parser.parse_args()
    
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
    
    lr_model = LogisticRegression(num_iters=args.num_iters, alpha=args.alpha)
    logger.info("start training")
    lr_model.train_model(args.in_path)   

    if args.out_path:
        if args.out_path.split('.')[-1] == "pkl":
            lr_model.save(args.out_path)
        else:
            print("model file type error. Please use *.pkl to name your model.")
            sys.exit(1)

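Result figure

[Figure: left panel "Logistic regression" plots feature 1 against feature 2 with the fitted decision boundary; right panel "Loss" plots the loss against the iteration number.]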