1.8 Discretization


About This Presentation

Data Mining: Data Preprocessing - Discretization


Slide Content

Slide 1: Discretization and Concept Hierarchy Generation

Slide 2: Discretization
Types of attributes:
- Nominal: values from an unordered set, e.g., color, profession
- Ordinal: values from an ordered set, e.g., military or academic rank
- Continuous: numeric values such as integers or real numbers
Discretization:
- Divides the range of a continuous attribute into intervals
- Reduces data size

Slide 3: Discretization and Concept Hierarchy
Discretization:
- Reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals
- Interval labels can then be used to replace actual data values
- Can be supervised or unsupervised
- Can split (top-down) or merge (bottom-up)
- Can be performed recursively on an attribute

Slide 4: Concept Hierarchy
Concept hierarchy formation: recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as young, middle-aged, or senior).
- Some detail is lost, but the result is more meaningful, easier to interpret, and easier to mine
- Several concept hierarchies can be defined for the same attribute
- Hierarchies can be specified manually or generated implicitly

Slide 5: Discretization and Concept Hierarchy Generation for Numeric Data
Typical methods (all can be applied recursively):
- Binning
- Histogram analysis
- Clustering analysis
- Entropy-based discretization
- χ² (chi-square) merging
- Segmentation by natural partitioning

Slide 6: Techniques
Binning:
- Distribute values into bins, then replace each value by its bin mean or median (see the sketch after this list)
- Applied recursively, it yields a concept hierarchy
- Unsupervised
Histogram analysis:
- Partitions according to the data distribution
- Equi-width: e.g., (0-100], (100-200], …
- Equi-depth: each bucket holds roughly the same number of values
- Can be applied recursively, subject to a minimum interval size
- Unsupervised
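The two styles differ only in how bin edges are chosen. Below is a minimal Python sketch of equal-width and equal-depth binning with smoothing by bin means; NumPy and the sample data are our own assumptions, as the slides do not fix a data set:

```python
import numpy as np

def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    edges = np.linspace(values.min(), values.max(), k + 1)
    # Assign each value to a bin index in 0..k-1.
    return np.clip(np.digitize(values, edges[1:-1]), 0, k - 1), edges

def equal_depth_bins(values, k):
    """Split so that each bin holds roughly the same number of values."""
    edges = np.quantile(values, np.linspace(0, 1, k + 1))
    return np.clip(np.digitize(values, edges[1:-1]), 0, k - 1), edges

def smooth_by_bin_means(values, idx, k):
    """Replace every value by the mean of the bin it falls in."""
    means = np.array([values[idx == b].mean() for b in range(k)])
    return means[idx]

prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)
idx, edges = equal_depth_bins(prices, 3)
print(edges)                                # ~[4, 19, 24.3, 34]
print(smooth_by_bin_means(prices, idx, 3))  # [9, 9, 9, 22, 22, 22, 29, 29, 29]
```

Equi-depth edges come from quantiles, so skewed data still yields evenly populated bins; equi-width edges are cheaper to compute but sensitive to outliers.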

Slide 7: Techniques
Cluster analysis:
- Clusters form the nodes of a concept hierarchy
- Clusters can be decomposed (lower levels of the hierarchy) or combined (higher levels); a sketch follows
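The slides do not name a specific clustering algorithm; the sketch below uses k-means from scikit-learn (our choice) to turn clusters of a 1-D attribute into intervals, i.e., hierarchy nodes. The ages data is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-D attribute (ages); the slides do not fix a data set.
ages = np.array([18, 21, 22, 25, 34, 37, 41, 58, 62, 67], dtype=float).reshape(-1, 1)

# Each cluster becomes one interval, i.e., one node of the concept hierarchy.
# Re-clustering the members of a node decomposes it into lower-level nodes;
# merging nearby clusters combines them into higher-level nodes.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(ages)
for c in range(3):
    members = ages[km.labels_ == c].ravel()
    print(f"node {c}: interval [{members.min():.0f}, {members.max():.0f}]")
```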

Slide 8: Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S_1 and S_2 using boundary T, the expected information requirement after partitioning is

I(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2)

Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S_1 is

\mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)

where p_i is the probability of class i in S_1.
The boundary that minimizes the expected information requirement over all possible boundaries is selected as a binary discretization. The process is applied recursively to the resulting partitions until some stopping criterion is met.
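A minimal Python sketch of one split step, computing I(S, T) and Entropy exactly as defined above. The recursion and the stopping criterion (e.g., an MDL test) are left out, and the sample data is illustrative:

```python
import numpy as np

def entropy(labels):
    """Entropy(S) = -sum_i p_i log2(p_i) over the class distribution of S."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def best_boundary(values, labels):
    """Return the boundary T minimizing I(S, T), scanning all candidates."""
    order = np.argsort(values)
    values, labels = values[order], labels[order]
    n = len(values)
    best_t, best_info = None, np.inf
    for i in range(1, n):
        if values[i] == values[i - 1]:
            continue                      # identical values cannot be split
        t = (values[i] + values[i - 1]) / 2
        info = (i / n) * entropy(labels[:i]) + ((n - i) / n) * entropy(labels[i:])
        if info < best_info:
            best_t, best_info = t, info
    return best_t, best_info

vals = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])
cls = np.array([0, 0, 0, 1, 1, 1])
print(best_boundary(vals, cls))   # (5.5, 0.0): this boundary separates the classes
```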

Slide 9: Entropy-Based Discretization
- Reduces data size
- Class information is considered
- Improves accuracy

Slide 10: Interval Merging by χ² Analysis
ChiMerge:
- Bottom-up approach: finds the best neighbouring intervals and merges them to form larger intervals
- Supervised: if two adjacent intervals have a similar distribution of classes, they can be merged
- Initially, each distinct value is placed in its own interval
- χ² tests are performed on adjacent intervals; the pairs with the smallest χ² values are merged
- Merging is repeated until a stopping condition is met (a χ² threshold or a target number of intervals); see the sketch below
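A minimal Python sketch of ChiMerge, assuming merging stops at a target number of intervals (a χ² threshold works the same way). Helper names and the sample data are illustrative:

```python
import numpy as np

def chi2_stat(a, b):
    """Pearson chi-square statistic for the class counts of two adjacent intervals."""
    obs = np.array([a, b], dtype=float)               # 2 x m contingency table
    exp = obs.sum(axis=1, keepdims=True) * obs.sum(axis=0, keepdims=True) / obs.sum()
    mask = exp > 0                                    # skip empty classes
    return float((((obs - exp) ** 2)[mask] / exp[mask]).sum())

def chimerge(values, labels, max_intervals):
    """Bottom-up merging: one interval per distinct value, then repeatedly
    merge the adjacent pair with the smallest chi-square statistic."""
    classes = np.unique(labels)
    distinct = sorted(set(values.tolist()))
    counts = [np.array([np.sum((values == v) & (labels == c)) for c in classes])
              for v in distinct]
    lefts = list(distinct)                            # left edge of each interval
    while len(counts) > max_intervals:
        chi = [chi2_stat(counts[i], counts[i + 1]) for i in range(len(counts) - 1)]
        i = int(np.argmin(chi))                       # most similar neighbours
        counts[i] = counts[i] + counts.pop(i + 1)
        lefts.pop(i + 1)
    return lefts

v = np.array([1, 3, 7, 8, 9, 11, 23, 37, 39, 45, 46, 59])
y = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
print(chimerge(v, y, 3))   # [1, 7, 37]: the intervals start at these values
```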

Slide 11: Segmentation by Natural Partitioning
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals:
- If an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
- If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
- If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals

Slide 12: Segmentation by Natural Partitioning
- Outliers could be present, so consider only the majority of the values
- Partition based on the 5th percentile to 95th percentile range

Slide 13: Example of the 3-4-5 Rule (profit data)
Step 1: Examine the data: Min = -$351, Low (5th percentile) = -$159, High (95th percentile) = $1,838, Max = $4,700
Step 2: msd = $1,000; rounding Low and High at the msd gives Low' = -$1,000 and High' = $2,000
Step 3: The range (-$1,000 - $2,000) covers 3 distinct values at the msd, so it is partitioned into 3 equi-width intervals: (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000)
Step 4: Adjust the top level to the actual Min and Max, giving (-$400 - $5,000) with intervals (-$400 - 0), (0 - $1,000), ($1,000 - $2,000), ($2,000 - $5,000), then partition each recursively:
- (-$400 - 0): (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
- (0 - $1,000): (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
- ($1,000 - $2,000): ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
- ($2,000 - $5,000): ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)
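A minimal Python sketch of one level of the 3-4-5 rule, applied to the Low/High values from Step 1. The recursive refinement and the min/max adjustment of Step 4 are omitted, and the function name is ours:

```python
import math

def three_four_five(low, high):
    """One level of the 3-4-5 rule: round low/high at the most significant
    digit (msd), then cut the range into 3, 4, or 5 equi-width intervals."""
    msd = 10 ** math.floor(math.log10(max(abs(low), abs(high))))
    lo = math.floor(low / msd) * msd      # round low down at the msd
    hi = math.ceil(high / msd) * msd      # round high up at the msd
    distinct = round((hi - lo) / msd)     # distinct msd values the range covers
    if distinct in (3, 6, 7, 9):
        k = 3
    elif distinct in (2, 4, 8):
        k = 4
    else:                                 # 1, 5, or 10
        k = 5
    width = (hi - lo) / k
    return [(lo + i * width, lo + (i + 1) * width) for i in range(k)]

# Low/High from Step 1 (5th and 95th percentiles of profit):
print(three_four_five(-159, 1838))
# [(-1000.0, 0.0), (0.0, 1000.0), (1000.0, 2000.0)] -- matches Steps 2-3
```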

Slide 14: Concept Hierarchy Generation for Categorical Data
- Specification of a partial ordering of attributes explicitly at the schema level by users or experts: the user or expert defines the hierarchy, e.g., street < city < state < country
- Specification of a portion of a hierarchy by explicit data grouping: intermediate-level information is supplied manually, e.g., grouping values into categories such as industrial or agricultural

Slide 15: Concept Hierarchy Generation for Categorical Data
- Specification of a set of attributes but not their partial ordering: the hierarchy is inferred automatically using a heuristic rule: concepts at higher levels tend to have fewer distinct values (see the sketch below)
- Specification of only a partial set of attributes: embed data semantics so that attributes with tight semantic connections are pinned together
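For the heuristic above, a minimal sketch: order attributes by their distinct-value counts, so the attribute with the most values becomes the lowest hierarchy level. The counts here are illustrative; in practice they would be computed from the data:

```python
# Hypothetical distinct-value counts per attribute; in practice these come
# from the data, e.g., one COUNT(DISTINCT col) query per attribute.
distinct_counts = {
    "street": 674_339,
    "city": 3_567,
    "state": 365,
    "country": 15,
}

# Heuristic: the fewer distinct values an attribute has, the higher it sits
# in the concept hierarchy, so sort from most values (bottom) to fewest (top).
bottom_up = sorted(distinct_counts, key=distinct_counts.get, reverse=True)
print(" < ".join(bottom_up))   # street < city < state < country
```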