3.7 outlier analysis

Krish_ver2 16,006 views 17 slides May 07, 2015
Slide 1
Slide 1 of 17
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10
Slide 11
11
Slide 12
12
Slide 13
13
Slide 14
14
Slide 15
15
Slide 16
16
Slide 17
17

About This Presentation

Data Mining-outlier analysis


Slide Content

Outlier Analysis
1

Outlier Analysis
Outlier – data objects that are grossly different from or
inconsistent with the remaining set of data
Causes
Measurement / Execution errors
Inherent data variability
Outliers – maybe valuable patterns
Fraud detection
Customized marketing
Medical Analysis
2

Outlier Mining
Given n data points and k – expected number of
outliers find the top k dissimilar objects
Define inconsistent data
Residuals in Regression
Difficulties – Multi-dimensional data, non-numeric data
Mine the outliers
Visualization based methods
Not applicable to cyclic plots, high dimensional data and categorical data
Approaches
Statistical Approach
Distance-based approach
Density based outlier approach
Deviation based approach
3

Statistical Distribution-based Outlier
detection
Assumes data follows a probability distribution and uses
discordancy test
Discordancy testing
Working hypothesis – H: o
i
Î F i=1,2,..n
Test verifies whether an object oi is significantly different from F
Significance probability SP(v
i
) = Prob(T>v
i
)
IF SP is small o
i
is discordant and working hypothesis is rejected
and alternate hypothesis that o
i
comes from another distribution
model G is adopted
4

Statistical Distribution-based Outlier
detection
Alternative distributions
Inherent alternative distribution
Alternative hypothesis: All objects arise from another distribution G
Mixture alternative distribution
Discordant values are not outliers but contaminants from G H’: o
i
Î (1-
l) F + lG i=1,2,..n
Slippage alternative distribution
Some Objects are independent observations from a modified version
of F (different parameters)
5

Statistical Distribution-based Outlier
detection
Procedures for detecting Outliers
Block procedures
All are outliers or all are consistent
Consecutive Procedures
Inside-out procedure: Least likely object is tested first
If it is an outlier – more extreme values are also considered as outliers
Disadvantages of Statistical Approach
Tests are for single attributes
Data distribution may not be known
6

Distance based Outlier Detection
Distance-based outlier
A DB(p, D)-outlier is an object O in a dataset T such that at least
a fraction p of the objects in T lies at a distance greater than D
from O
Object does not have enough neighbours
Avoids excessive computation of Statistical models
If an object is an outlier according to a discordancy test then o is
DB(p, D) outlier for some p and D
7

Distance based Outlier Detection
Index based Algorithm
Uses multi-dimensional indexing structures such as k-d trees and R-trees
M – maximum number of objects within dmin neighborhood
Once M+1 neighbours are found o is not an outlier
O(n
2
k) apart from index construction
Nested loop algorithm
Avoids index construction
Tries to minimize I/Os
Divides memory buffer space into two halves and data set into several logical
blocks
8

Distance based Outlier Detection
Cell based Algorithm
Complexity : O(c
k
+n) c- depends on number of cells ; k – dimensionality
Data space is partitioned into cells: dmin / 2Ök
Two layers surround each cell
First layer – One cell thick
Second layer - é 2Ök-1 ù cells thick
Algorithm processes cells instead of objects
Maintains three counts: cell_count, cell_+_1_layer_count,
cell_+_2_layers_count
An object in a cell is an outlier if cell_+_1_layer_count <= M, if not, no
objects in the cell are outliers
If cell_+_2_layers_count, <= M then all objects in cell – Outliers
If > M some may be outliers
Object by object processing has to be done
9

Density based Outlier detection
Previous methods assume data are uniformly
distributed
Data may have different density distributions
Difficulty in choosing dmin
10

Density based Outlier detection
Local Outlier – if its outlying relative to its local
neighbourhood particularly wrt the density of the
neighborhood
O2 is a local outlier wrt C2; o1 is also an outlier; none of the objects
in C1 are treated as outliers
Considers degree to which an object is an outlier
Local Outlier factor – degree depends on how isolated the object is
wrt its surroundings
11

Density based Outlier detection
The k-distance of an object p is the maximal distance that p gets
from its k-nearest neighbors d(p, o)
there are at least k objects in D that are as close as or closer to p than o;
for k o’ d(p, o’) <= d(p, o)
there are at most k-1 objects that are closer to p than o; for k-1 o” d(p,
o”) < d(p, o)
k-distance neighborhood
contains every object whose distance is not greater than the MinPts (k)-
distance of p
The reachability distance of an object p with respect to object o, is
defined as reach_dist
MinPts
(p, o) = max { MinPts-distance(o), d(p, o) }
12

OPTICS
Complexity : O(n log n)
13

Density based Outlier detection
Local reachability density of p is the inverse of the
average reachability density based on the MinPts-
nearest neighbors of p.
Local outlier factor (LOF) of p captures the degree to
which we call p an outlier.
It is the average of the ratio of the local reachability density of p
and those of p’s MinPts-nearest neighbors.
LOF is higher for outliers
14

Deviation based Outlier detection
Identifies outliers by examining the main characteristics
of objects in a group
Objects that “deviate” from this description are
considered outliers
Sequential exception technique
Simulates the way in which humans can distinguish unusual
objects from among a series of supposedly like objects
15

Sequential exception technique
Given a data set D a sequence of subsets {D1, D2, ..Dm} is built
such that Dj-1 Í Dj; Dissimilarities are assessed between
subsets in the sequence
Exception Set – Smallest subset of objects whose removal
results in greatest reduction of dissimilarity
Dissimilarity function – 1/n å
i=1

n
(x
i
-x’)
2
Smoothing factor: Assesses how much the dissimilarity can be
reduced by removing the subset from the original set of objects
Can be repeated to avoid the influence of order
16
Deviation based Outlier detection

Deviation based Outlier detection
OLAP Data Cube technique
Uses data cubes to identify regions of anomalies
A cell value in a cube is an exception if it differs
significantly from an expected value
Visualization effects guide user
May drill down
17