3.7 outlier analysis

Krish_ver2 16,006 views 17 slides May 07, 2015

Slide 1 of 17

About This Presentation

Data Mining-outlier analysis

Size: 208.13 KB

Language: en

Added: May 07, 2015

Slides: 17 pages

Slide Content

Outlier Analysis
1

Outlier Analysis
Outlier – data objects that are grossly different from or
inconsistent with the remaining set of data
Causes
Measurement / Execution errors
Inherent data variability
Outliers – maybe valuable patterns
Fraud detection
Customized marketing
Medical Analysis
2

Outlier Mining
Given n data points and k – expected number of
outliers find the top k dissimilar objects
Define inconsistent data
Residuals in Regression
Difficulties – Multi-dimensional data, non-numeric data
Mine the outliers
Visualization based methods
Not applicable to cyclic plots, high dimensional data and categorical data
Approaches
Statistical Approach
Distance-based approach
Density based outlier approach
Deviation based approach
3

Statistical Distribution-based Outlier
detection
Assumes data follows a probability distribution and uses
discordancy test
Discordancy testing
Working hypothesis – H: o
i
Î F i=1,2,..n
Test verifies whether an object oi is significantly different from F
Significance probability SP(v
i
) = Prob(T>v
i
)
IF SP is small o
i
is discordant and working hypothesis is rejected
and alternate hypothesis that o
i
comes from another distribution
model G is adopted
4

Statistical Distribution-based Outlier
detection
Alternative distributions
Inherent alternative distribution
Alternative hypothesis: All objects arise from another distribution G
Mixture alternative distribution
Discordant values are not outliers but contaminants from G H’: o
i
Î (1-
l) F + lG i=1,2,..n
Slippage alternative distribution
Some Objects are independent observations from a modified version
of F (different parameters)
5

Statistical Distribution-based Outlier
detection
Procedures for detecting Outliers
Block procedures
All are outliers or all are consistent
Consecutive Procedures
Inside-out procedure: Least likely object is tested first
If it is an outlier – more extreme values are also considered as outliers
Disadvantages of Statistical Approach
Tests are for single attributes
Data distribution may not be known
6

Distance based Outlier Detection
Distance-based outlier
A DB(p, D)-outlier is an object O in a dataset T such that at least
a fraction p of the objects in T lies at a distance greater than D
from O
Object does not have enough neighbours
Avoids excessive computation of Statistical models
If an object is an outlier according to a discordancy test then o is
DB(p, D) outlier for some p and D
7

Distance based Outlier Detection
Index based Algorithm
Uses multi-dimensional indexing structures such as k-d trees and R-trees
M – maximum number of objects within dmin neighborhood
Once M+1 neighbours are found o is not an outlier
O(n
2
k) apart from index construction
Nested loop algorithm
Avoids index construction
Tries to minimize I/Os
Divides memory buffer space into two halves and data set into several logical
blocks
8

Distance based Outlier Detection
Cell based Algorithm
Complexity : O(c
k
+n) c- depends on number of cells ; k – dimensionality
Data space is partitioned into cells: dmin / 2Ök
Two layers surround each cell
First layer – One cell thick
Second layer - é 2Ök-1 ù cells thick
Algorithm processes cells instead of objects
Maintains three counts: cell_count, cell_+_1_layer_count,
cell_+_2_layers_count
An object in a cell is an outlier if cell_+_1_layer_count <= M, if not, no
objects in the cell are outliers
If cell_+_2_layers_count, <= M then all objects in cell – Outliers
If > M some may be outliers
Object by object processing has to be done
9

Density based Outlier detection
Previous methods assume data are uniformly
distributed
Data may have different density distributions
Difficulty in choosing dmin
10

Density based Outlier detection
Local Outlier – if its outlying relative to its local
neighbourhood particularly wrt the density of the
neighborhood
O2 is a local outlier wrt C2; o1 is also an outlier; none of the objects
in C1 are treated as outliers
Considers degree to which an object is an outlier
Local Outlier factor – degree depends on how isolated the object is
wrt its surroundings
11

Density based Outlier detection
The k-distance of an object p is the maximal distance that p gets
from its k-nearest neighbors d(p, o)
there are at least k objects in D that are as close as or closer to p than o;
for k o’ d(p, o’) <= d(p, o)
there are at most k-1 objects that are closer to p than o; for k-1 o” d(p,
o”) < d(p, o)
k-distance neighborhood
contains every object whose distance is not greater than the MinPts (k)-
distance of p
The reachability distance of an object p with respect to object o, is
defined as reach_dist
MinPts
(p, o) = max { MinPts-distance(o), d(p, o) }
12

OPTICS
Complexity : O(n log n)
13

Density based Outlier detection
Local reachability density of p is the inverse of the
average reachability density based on the MinPts-
nearest neighbors of p.
Local outlier factor (LOF) of p captures the degree to
which we call p an outlier.
It is the average of the ratio of the local reachability density of p
and those of p’s MinPts-nearest neighbors.
LOF is higher for outliers
14

Deviation based Outlier detection
Identifies outliers by examining the main characteristics
of objects in a group
Objects that “deviate” from this description are
considered outliers
Sequential exception technique
Simulates the way in which humans can distinguish unusual
objects from among a series of supposedly like objects
15

Sequential exception technique
Given a data set D a sequence of subsets {D1, D2, ..Dm} is built
such that Dj-1 Í Dj; Dissimilarities are assessed between
subsets in the sequence
Exception Set – Smallest subset of objects whose removal
results in greatest reduction of dissimilarity
Dissimilarity function – 1/n å
i=1

n
(x
i
-x’)
2
Smoothing factor: Assesses how much the dissimilarity can be
reduced by removing the subset from the original set of objects
Can be repeated to avoid the influence of order
16
Deviation based Outlier detection

Deviation based Outlier detection
OLAP Data Cube technique
Uses data cubes to identify regions of anomalies
A cell value in a cube is an exception if it differs
significantly from an expected value
Visualization effects guide user
May drill down
17

3.7 outlier analysis

About This Presentation

Slide Content

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

3.7 outlier analysis

About This Presentation

Slide Content

Slide 1

Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Tags

Categories

Download

Quick Actions

Statistics

Related Slideshows

Pray For The Peace Of Jerusalem and You Will Prosper

Don_t_Waste_Your_Life_God.....powerpoint

VILLASUR_FACTORS_TO_CONSIDER_IN_PLATING_SALAD_10-13.pdf

Fertility awareness methods for women in the society

Chapter 5 Arithmetic Functions Computer Organisation and Architecture

syakira bhasa inggris (1) (1).pptx.......