Chapter 2 Data Preprocessing part3.ppt

logeswarisaravanan 19 views 24 slides May 05, 2024


Data Preprocessing part3

Knowledge Discovery (KDD) Process
Data mining: core of the knowledge discovery process
Data Cleaning
Data Integration
Databases
Data Warehouse
Task-relevant Data
Selection
Data Mining
Pattern Evaluation

Forms of Data Preprocessing

Data Transformation
Data transformation: the data are transformed or consolidated into forms
appropriate for mining

Data Transformation
Data Transformation can involve the
following:
Smoothing: remove noise from the data; techniques include
binning, regression, and clustering
Aggregation
Generalization
Normalization
Attribute construction

Normalization
Min-max normalization
Z-score normalization
Decimal normalization

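Min-max normalization
Min-max normalization maps a value v of attribute A to v' in the range [new_min_A, new_max_A]:
\[ v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]
Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is
mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]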

Z-score normalization
Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):
\[ v' = \frac{v - \mu_A}{\sigma_A} \]
Ex. Let μ_A = 54,000 and σ_A = 16,000. Then $73,600 is mapped to
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
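The same example as a short Python sketch. The function name `z_score` is an illustrative assumption; the mean and standard deviation are passed in directly, as on the slide, rather than estimated from data.

```python
def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    if std_a == 0:
        raise ValueError("standard deviation must be non-zero")
    return (v - mean_a) / std_a


# Slide example: mean 54,000, standard deviation 16,000
z = z_score(73_600, 54_000, 16_000)
print(round(z, 3))  # 1.225
```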

Decimal normalization
Normalization by decimal scaling
\[ v' = \frac{v}{10^{j}} \]
where j is the smallest integer such that max(|v'|) < 1
Ex. Suppose the recorded values of A range from -986 to 917. The maximum
absolute value is 986, so j = 3: -986 normalizes to -0.986 and 917 to 0.917
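A minimal sketch of decimal scaling in Python. The function names are illustrative assumptions, and the search for j starts at 0, which assumes that values already smaller than 1 in absolute value are left unscaled.

```python
def decimal_scaling_exponent(values):
    """Smallest non-negative integer j such that max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j


def decimal_scale(values):
    """Divide every value by 10**j and return the scaled values with j."""
    j = decimal_scaling_exponent(values)
    return [v / 10 ** j for v in values], j


# Slide example: values of A range from -986 to 917
scaled, j = decimal_scale([-986, 917])
print(j, scaled)  # 3 [-0.986, 0.917]
```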

Data Reduction
Why data reduction?
A database/data warehouse may store
terabytes of data
Complex data analysis/mining may take a
very long time to run on the complete data
set

Data Reduction
Data reduction
Obtain a reduced representation of the
data set that is much smaller in volume
yet produces the same (or almost the same)
analytical results

Data Reduction
Data reduction strategies
Data cube aggregation: aggregation operations are
applied to the data
Attribute subset selection: irrelevant, weakly
relevant, or redundant attributes or dimensions are
removed
Dimensionality reduction, e.g., remove
unimportant attributes
Numerosity reduction, e.g., fit data into models
Discretization and concept hierarchy generation:
values are generalized from low-level to higher-level concepts

Data cube aggregation

Data cube aggregation
A data cube stores multidimensional
aggregated information
The lowest level of a data cube is called the
base cuboid and the highest level is the apex
cuboid (e.g., the total sales for all
years)
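The roll-up from a base cuboid toward the apex cuboid can be sketched in Python as below. The sales figures and the (year, quarter) layout are made-up illustration data, not values from the slides, and `roll_up_to_year` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical base cuboid: sales per (year, quarter). Figures are invented.
base_cuboid = {
    (2008, "Q1"): 224, (2008, "Q2"): 408, (2008, "Q3"): 350, (2008, "Q4"): 586,
    (2009, "Q1"): 310, (2009, "Q2"): 420, (2009, "Q3"): 380, (2009, "Q4"): 600,
}


def roll_up_to_year(cuboid):
    """Aggregate away the quarter dimension, leaving one total per year."""
    totals = defaultdict(int)
    for (year, _quarter), sales in cuboid.items():
        totals[year] += sales
    return dict(totals)


annual = roll_up_to_year(base_cuboid)  # one level above the base cuboid
apex = sum(annual.values())            # apex cuboid: a single grand total
print(annual, apex)
```

Each roll-up level is smaller than the one below it: eight cells become two annual totals, and then one apex value.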

Attribute subset selection
Dimensionality reduction
Feature selection (i.e., attribute subset
selection):
Select a minimum set of features such that the
probability distribution of different classes given
the values for those features is as close as
possible to the original distribution given the
values of all features
Reduces the number of patterns found, making them easier to
understand

Attribute subset selection
Dimensionality reduction
Heuristic methods (due to exponential
# of choices):
Step-wise forward selection
Step-wise backward elimination
Combining forward selection and backward
elimination
Decision-tree induction
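Step-wise forward selection can be sketched as a greedy loop. This is an illustrative sketch only: `forward_selection`, the caller-supplied `score` function, and the toy weights are assumptions, not something defined in the slides. In practice `score` would be a statistical significance test or a model-quality measure.

```python
def forward_selection(attributes, score):
    """Greedy step-wise forward selection.

    Starts from an empty set and repeatedly adds the attribute that most
    improves score(subset); stops when no attribute improves the score.
    """
    selected = []
    remaining = list(attributes)
    best_score = score(selected)
    while remaining:
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:
            break  # no remaining attribute helps
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected


# Toy scoring function (invented): usefulness weights minus a size penalty.
weights = {"A1": 0.3, "A4": 0.5, "A6": 0.2}


def toy_score(subset):
    return sum(weights.get(a, 0.0) for a in subset) - 0.05 * len(subset)


chosen = forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score)
print(chosen)  # ['A4', 'A1', 'A6']
```

Backward elimination follows the same pattern in reverse: start from the full set and repeatedly drop the attribute whose removal improves the score most.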

Attribute subset selection
Dimensionality reduction
Initial attribute set:
{A1, A2, A3, A4, A5, A6}
[Figure: decision tree with A4? at the root and A1? and A6? as child tests, each leading to leaves Class 1 and Class 2]
=> Reduced attribute set: {A1, A4, A6}

Numerosity reduction
Reduce data volume by choosing
alternative, smaller forms of data
representation
Major families: histograms, clustering,
sampling

Data Reduction Method:
Histograms
[Figure: histogram; horizontal axis from 10,000 to 100,000, vertical axis from 0 to 40]

Data Reduction Method:
Histograms
Divide data into buckets and store the average (or sum) for
each bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth): each bucket holds roughly the same number of values
V-optimal: the histogram with the least variance, where histogram variance is a weighted sum
of the original values that each bucket represents
MaxDiff: place a bucket boundary between each pair of adjacent values whose
difference is among the β-1 largest, where β is the number of buckets
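The first two partitioning rules can be sketched in Python as follows. The function names and the sample values are illustrative assumptions; V-optimal and MaxDiff are not covered here.

```python
def equal_width_buckets(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    buckets = [[] for _ in range(k)]
    if hi == lo:
        buckets[0].extend(values)  # degenerate case: all values identical
        return buckets
    width = (hi - lo) / k
    for v in values:
        idx = min(int((v - lo) / width), k - 1)  # clamp the maximum into the last bucket
        buckets[idx].append(v)
    return buckets


def equal_frequency_buckets(values, k):
    """Split the sorted values into k buckets of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


def bucket_means(buckets):
    """Store one average per bucket; empty buckets get None."""
    return [sum(b) / len(b) if b else None for b in buckets]


data = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # invented values with one outlier
width_buckets = equal_width_buckets(data, 3)
freq_buckets = equal_frequency_buckets(data, 3)
print(bucket_means(width_buckets))  # [4.5, None, 30.0]
print(bucket_means(freq_buckets))   # [2.0, 5.0, 15.0]
```

The outlier shows the trade-off: equal-width leaves one bucket empty, while equal-frequency keeps the buckets balanced but gives the last one a very wide range.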

Data Reduction Method:
Clustering
Partition data set into clusters based on similarity,
and store cluster representation (e.g., centroid and
diameter) only
There are many choices of clustering definitions and
clustering algorithms
Cluster analysis will be studied in depth in Chapter 7
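As a small illustration of storing only a cluster representation, the sketch below reduces an already-formed cluster to its centroid and diameter. It assumes the diameter is the largest pairwise Euclidean distance within the cluster (other definitions exist), and `summarize_cluster` is a hypothetical helper name. It requires Python 3.8 or later for `math.dist`.

```python
import math
from itertools import combinations


def summarize_cluster(points):
    """Return (centroid, diameter) for a non-empty list of equal-length tuples."""
    n = len(points)
    dim = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / n for d in range(dim))
    # Diameter assumed here to be the maximum distance between any two members.
    diameter = max(
        (math.dist(p, q) for p, q in combinations(points, 2)),
        default=0.0,
    )
    return centroid, diameter


cluster = [(0, 0), (0, 4), (3, 0), (3, 4)]  # invented 2-D points
centroid, diameter = summarize_cluster(cluster)
print(centroid, diameter)  # (1.5, 2.0) 5.0
```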

Data Reduction Method:
Sampling
Sampling: obtaining a small sample s to
represent the whole data set N
Simple random sample without replacement (SRSWOR)
Simple random sample with replacement (SRSWR)
Cluster sample: if the tuples in D are grouped
into M mutually disjoint clusters, then a simple
random sample of s clusters can be obtained, where s < M
Stratified sample
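The four sampling schemes can be sketched with Python's standard `random` module. The helper names, the (id, age group) tuples, and the choice to sample the same fraction from every stratum are illustrative assumptions.

```python
import random


def srswor(data, s, rng):
    """Simple random sample without replacement: no tuple is drawn twice."""
    return rng.sample(data, s)


def srswr(data, s, rng):
    """Simple random sample with replacement: a tuple may be drawn again."""
    return rng.choices(data, k=s)


def cluster_sample(clusters, s, rng):
    """Pick s whole clusters at random and keep all of their tuples."""
    chosen = rng.sample(clusters, s)
    return [row for cluster in chosen for row in cluster]


def stratified_sample(data, stratum_of, fraction, rng):
    """Draw the same fraction from every stratum (at least one tuple each)."""
    strata = {}
    for row in data:
        strata.setdefault(stratum_of(row), []).append(row)
    sample = []
    for rows in strata.values():
        k = max(1, round(len(rows) * fraction))
        sample.extend(rng.sample(rows, k))
    return sample


rng = random.Random(42)  # fixed seed so the sketch is repeatable
# Hypothetical tuples: (customer id, age group)
data = [(i, "young") for i in range(8)] + [(i, "senior") for i in range(8, 10)]
clusters = [data[0:3], data[3:6], data[6:10]]

without = srswor(data, 4, rng)
with_repl = srswr(data, 4, rng)
by_cluster = cluster_sample(clusters, 2, rng)
stratified = stratified_sample(data, lambda row: row[1], 0.5, rng)
print(without, with_repl, by_cluster, stratified, sep="\n")
```

The stratified sample keeps the small "senior" group represented, which a plain random sample of the same size might miss.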

Sampling: with or without
Replacement
[Figure: Raw Data]

Sampling: Cluster or Stratified
Sampling
[Figure: Raw Data; Cluster/Stratified Sample]