机器学习-异常检测

Posted by 周宝航 on August 1, 2018

Anomaly Detection

Example

  1. Fraud detection:
    • $x^{(i)}=$features of user $i$’s activities
    • Model $p(x)$ from data
    • Identify unusual users by checking which have $p(x) < \epsilon$
  2. Manufacturing
    • Monitoring computers in a data center

Gaussian distribution

parameter estimation

Dataset: ${ x^{(1)},x^{(2)},\dots,x^{(m)} } \quad x^{(i)} \in \mathbb R$

we want to know $\mu, \sigma^2$

Algorithm

  1. Choose features $x_i$ that you think might be indicative of anomalous examples.
  2. Fit examples $\mu_1,\dots,\mu_n,\sigma_1^2,\dots,\sigma_n^2$

  3. Given new example $x$, compute $p(x)$:

Anomaly if $p(x) < \epsilon$

Developing and evaluating anomaly detection system

Aircraft engines motivating example

  • 10000 good(normal) engines
  • 20 flawed engines

  • Always
    • Training set:6000 good engines
    • CV:2000 good engines($y=0$),10 anomalous($y=1$)
    • Test:2000 good engines($y=0$),10 anomalous($y=1$)
  • Alternative:
    • Training set:6000 good engines
    • CV:4000 good engines($y=0$),10 anomalous($y=1$)
    • Test:4000 good engines($y=0$),10 anomalous($y=1$)

Algorithm evaluation

  • Fit model $p(x)$ on training set ${ x^{(1)},\dots,x^{(m)} }$
  • On a cross validation/test example $x$, predict

  • Possible evaluation metrics:
    • True positive, false positive, false negative, true negative
    • Precision/Recall
    • $F_1$ score
  • Can alse use cross validation set to choose parameter $\epsilon$

Anomaly detection vs. supervised learning

Anomaly detection Supervised learning
Very small number of positive examples($y=1$).(0-20 is common)
Large number of negative ($y=0$) examples.
Large number of positive and negative examples.
Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far. Enough positive examples for algorithm to get a sense of what positive examples are like, future positive examples likely to be similar to ones in training set.
Fraud detection
Manufacturing(e.g. aircraft engines)
Monitoring machines in a data center
Email spam classification
Weather prediction
Cancer classification

Choosing what features to use

  • we always do something to make our datas more Gaussian: like $x_1 \leftarrow \log(x_1)$

Error analysis for anomaly detection

Want:

  • $p(x)$ large for normal examples $x$.
  • $p(x)$ small for anomalous examples $x$.

Most common problem:

  • $p(x)$ is comparable (say, both large) for normal and anomalous examples.

Now:

  • we can create some new features

Multivariate Gaussian distribution

  • Parameters: $\mu \in \mathbb R^n, \Sigma \in \mathbb R^{n \times n}$

Parameter fitting:

  • Given training set: ${ x^{(1)},x^{(2)},\dots,x^{(m)} }$

Given a new example $x$, compute:

**Flag an anomaly if ** $p(x) < \epsilon$

Relationship to original model

Original model: $p(x)=p(x_1;\mu_1,\sigma_1^2) \times p(x_2;\mu_2,\sigma_2^2) \times \cdots \times p(x_n;\mu_n,\sigma_n^2)$

Corresponds to multivariate Gaussian

where

Original model vs. Multivariate Gaussian

Original model Multivariate Gaussian
Manually create features to capture anomalies where $x_1,x_2$ take unusual combinations of values. Automatically captures correlations between features
Computationally cheaper(alternatively, scales better to large $n$) Computationally more expensive
Ok even if $m$(training set size) is small Must have $m > n$, or else $\Sigma$ is non-invertible

Exercise

  • 本次实验包括:可视化异常监测结果,以及应用异常监测于一个小数据集。

anomaly_detection.py

# -*- coding: utf-8 -*-
"""
Created on Wed Aug  1 09:59:30 2018

@author: 周宝航
"""

import numpy as np
import numpy.linalg as la


class AnomalyDetector(object):
    
    def __init__(self):
        pass
    
    def estimateGaussian(self, X):
        m, n = X.shape
        mu = np.zeros([n, 1])
        sigma2 = np.zeros([n, 1])
        for i in range(n):
            mu[i] = np.mean(X[:, i])
            sigma2[i] = np.mean((X[:, i] - mu[i])**2)
        return mu,sigma2
    
    def multivariateGaussian(self, X, mu, sigma2):
        k = len(mu)
        sigma2 = np.diag(sigma2.flatten())
        X = X - mu.T
        p = (2 * np.pi) ** (- k / 2) * la.det(sigma2) ** (- 0.5) * \
            np.exp(-0.5 * np.sum(X.dot(la.pinv(sigma2)) * X, 1))
        return p.reshape([-1, 1])
    
    def selectThreshold(self, yval, pval):
        bestEpsilon = 0
        bestF1 = 0
        F1 = 0
        maxPval, minPval = max(pval), min(pval)
        stepsize = (maxPval - minPval) / 1000
        for epsilon in np.arange(minPval, maxPval, stepsize):
            predictions = pval < epsilon
            tp = np.sum((predictions == 1) & (yval == 1))
            fn = np.sum((predictions == 0) & (yval == 1))
            fp = np.sum((predictions == 1) & (yval == 0))
            prec = tp / (tp + fp)
            rec = tp / (tp + fn)
            F1 = 2 * prec * rec / (prec + rec)
            if F1 > bestF1:
                bestF1 = F1
                bestEpsilon = epsilon
        return bestEpsilon,bestF1

main.py

# -*- coding: utf-8 -*-
"""
Created on Wed Aug  1 11:02:43 2018

@author: 周宝航
"""

from anomaly_detect import AnomalyDetector
import matplotlib.pyplot as plt
import scipy.io as sio
import numpy as np

# ================== Load Example Dataset  ===================
data = sio.loadmat('data\\ex8data1.mat')

X = data['X']

fig = plt.figure()
ax = fig.add_subplot(1,1,1)

ax.plot(X[:, 0], X[:, 1], 'bx')
ax.axis([0, 30, 0, 30])
ax.set_xlabel('Latency (ms)')
ax.set_ylabel('Throughput (mb/s)')

# ================== Estimate the dataset statistics ===================
def visualizeFit(anomalyDetector, X, mu, sigma2):
    rg = np.arange(0,35.5,0.5)
    X1, X2 = np.meshgrid(rg, rg)
    Z = anomalyDetector.multivariateGaussian(np.c_[X1.reshape([-1, 1]), X2.reshape([-1, 1])], mu, sigma2)
    Z = Z.reshape(X1.shape)
    
    plt.contour(X1, X2, Z, 10**np.arange(-20,0,3,dtype=float).T)

anomalyDetector = AnomalyDetector()

mu, sigma2 = anomalyDetector.estimateGaussian(X)

p = anomalyDetector.multivariateGaussian(X, mu, sigma2)

visualizeFit(anomalyDetector, X, mu, sigma2)

# ================== Find Outliers ===================
Xval = data['Xval']
yval = data['yval']

pval = anomalyDetector.multivariateGaussian(Xval, mu, sigma2)

epsilon, F1 = anomalyDetector.selectThreshold(yval, pval)

outliers = np.where(p < epsilon)[0]

ax.scatter(X[outliers, 0], X[outliers, 1], marker='o', color='', edgecolors='r', linewidth=2)
plt.show()
# ================== Multidimensional Outliers ===================
data = sio.loadmat('data\\ex8data2.mat')
X = data['X']
Xval = data['Xval']
yval = data['yval']

mu, sigma2 = anomalyDetector.estimateGaussian(X)

p = anomalyDetector.multivariateGaussian(X, mu, sigma2)

pval = anomalyDetector.multivariateGaussian(Xval, mu, sigma2)

epsilon, F1 = anomalyDetector.selectThreshold(yval, pval)

print('Best epsilon found using cross-validation: %e\n' % epsilon);
print('Best F1 on Cross Validation Set:  %f\n' % F1);
print('   (you should see a value epsilon of about 1.38e-18)\n');
print('   (you should see a Best F1 value of 0.615385)\n');
print('# Outliers found: %d\n\n' % np.sum(p < epsilon));

result

  • 下图中,红圈标记的数据点为异常数据,而位于黄色椭圆圈附近的为正常数据。

Alt text