logeswarisaravanan
May 05, 2024
Data Preprocessing
Data Preprocessing, Part 3
Knowledge Discovery (KDD) Process
Data mining is the core of the knowledge discovery process
[Figure: KDD pipeline. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Forms of Data Preprocessing
Data Transformation
Data transformation: the data are transformed or consolidated into forms
appropriate for mining
Data Transformation
Data transformation can involve the following:
Smoothing: remove noise from the data; techniques include binning,
regression, and clustering
Aggregation
Generalization
Normalization
Attribute construction
Min-max normalization
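Min-max normalization maps a value v of attribute A to v' in the range
[new_min_A, new_max_A]:
\[
v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A
\]
Ex. Let the income range $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,600 is mapped to
\[
\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716
\]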
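The min-max mapping can be sketched in a few lines of Python. This is a minimal illustration, not a library routine: the function name and parameter names are hypothetical, and the guard against a zero-width range is an added assumption. The call reuses the income figures from the example (range 12,000 to 98,000, value 73,600).

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    if max_a == min_a:
        # Assumption: a constant attribute cannot be rescaled.
        raise ValueError("max_a and min_a must differ")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


income = min_max_normalize(73_600, 12_000, 98_000)
print(round(income, 3))  # 0.716
```

Values outside the original [min_A, max_A] range would land outside the target range, so new data may need the bounds to be recomputed.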
Z-score normalization
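Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):
\[
v' = \frac{v - \mu_A}{\sigma_A}
\]
Ex. Let μ_A = 54,000 and σ_A = 16,000. Then 73,600 is mapped to
\[
\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225
\]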
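A short Python sketch of z-score normalization follows. The function names are hypothetical. The helper that normalizes a whole column uses the population standard deviation (`statistics.pstdev`), which is an assumption; the slides do not say whether the population or sample deviation is intended.

```python
import statistics


def z_score(v, mean, std):
    """Return how many standard deviations v lies from the mean."""
    return (v - mean) / std


def z_score_column(values):
    """Normalize a whole column using its own mean and population std."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population deviation
    return [z_score(v, mean, std) for v in values]


print(round(z_score(73_600, 54_000, 16_000), 3))  # 1.225
```

For a column such as `[2, 4, 4, 4, 5, 5, 7, 9]` (mean 5, population standard deviation 2), `z_score_column` returns values centered on 0.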
Decimal normalization
Normalization by decimal scaling:
\[
v' = \frac{v}{10^{j}}
\]
where j is the smallest integer such that max(|v'|) < 1
Ex. Suppose the recorded values of A range from -986 to 917. The maximum
absolute value is 986, so j = 3; -986 normalizes to -0.986 and 917 to 0.917
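Decimal scaling can be sketched as below. The function names are hypothetical, and the sketch assumes j is a non-negative integer, so values that are already all below 1 in absolute value are left unchanged.

```python
def decimal_scaling_exponent(values):
    """Smallest non-negative j such that max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j


def decimal_scale(values):
    """Divide every value by 10**j, with j chosen from the data."""
    j = decimal_scaling_exponent(values)
    return [v / 10 ** j for v in values]


print(decimal_scaling_exponent([-986, 917]))  # 3
print(decimal_scale([-986, 917]))
```

Because j depends on the largest absolute value, a single extreme value shifts the scale of every other value in the column.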
Data Reduction
Why data reduction?
A database/data warehouse may store
terabytes of data
Complex data analysis/mining may take a
very long time to run on the complete data
set
Data Reduction
Data reduction
Obtain a reduced representation of the
data set that is much smaller in volume
yet produces the same (or almost the same)
analytical results
Data Reduction
Data reduction strategies
Data cube aggregation: aggregation operations are
applied to the data to construct a data cube
Attribute subset selection: irrelevant, weakly
relevant, or redundant attributes or dimensions are
removed
Dimensionality reduction: e.g., remove
unimportant attributes
Numerosity reduction: e.g., fit data into models
Discretization and concept hierarchy generation:
replace low-level values with higher-level concepts
Data cube aggregation
Data cube aggregation
A data cube stores multidimensional
aggregated information
The lowest level of a data cube is called the
base cuboid; the highest level is the apex
cuboid (e.g., the total sales for all
years)
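The relation between the base cuboid and the apex cuboid can be sketched in plain Python. Everything here is illustrative: the dimension names, the sales figures, and the `roll_up` helper are hypothetical and not taken from the slides.

```python
from collections import defaultdict

# Hypothetical base-cuboid cells: (year, quarter, branch) -> sales amount.
DIMENSIONS = ("year", "quarter", "branch")
BASE_CUBOID = {
    (2022, "Q1", "A"): 100,
    (2022, "Q2", "A"): 150,
    (2022, "Q1", "B"): 80,
    (2023, "Q1", "A"): 120,
    (2023, "Q1", "B"): 90,
}


def roll_up(cells, keep):
    """Sum the cells over every dimension that is not listed in `keep`."""
    kept = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(int)
    for key, amount in cells.items():
        totals[tuple(key[i] for i in kept)] += amount
    return dict(totals)


by_year = roll_up(BASE_CUBOID, ["year"])
apex = roll_up(BASE_CUBOID, [])
print(by_year)  # {(2022,): 330, (2023,): 210}
print(apex)     # {(): 540}
```

Keeping fewer dimensions yields a smaller cuboid; keeping none yields the single apex cell, here the total sales over all years.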
Attribute subset selection
Dimensionality reduction
Feature selection (i.e., attribute subset
selection):
Select a minimum set of features such that the
probability distribution of different classes given
the values for those features is as close as
possible to the original distribution given the
values of all features
Reduces the number of attributes appearing in the discovered patterns,
making the patterns easier to understand
Attribute subset selection
Dimensionality reduction
Heuristic methods (due to exponential
# of choices):
Step-wise forward selection
Step-wise backward elimination
Combining forward selection and backward
elimination
Decision-tree induction
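Step-wise forward selection from the list above can be sketched as a greedy loop: start with no attributes, add the one that improves a score the most, and stop when nothing improves it. The scoring function here (leave-one-out accuracy of a 1-nearest-neighbour classifier), the stopping rule, and the toy data are all assumptions chosen for illustration; the slides do not prescribe them.

```python
def loo_1nn_accuracy(rows, labels, cols):
    """Leave-one-out accuracy of 1-nearest-neighbour using only `cols`."""
    correct = 0
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((row[c] - rows[j][c]) ** 2 for c in cols),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


def forward_selection(rows, labels, n_features, score=loo_1nn_accuracy):
    """Greedily add the best attribute until the score stops improving."""
    selected, best = [], 0.0
    remaining = list(range(n_features))
    while remaining:
        scored = [(score(rows, labels, selected + [c]), c) for c in remaining]
        top_score, top_col = max(scored, key=lambda pair: pair[0])
        if top_score <= best:
            break  # no candidate improves the current subset
        selected.append(top_col)
        remaining.remove(top_col)
        best = top_score
    return selected


# Hypothetical data: columns are (noise, signal, constant).
FEATURE_NAMES = ["noise", "signal", "constant"]
ROWS = [
    (5.0, 0.0, 0.0), (-3.0, 0.1, 0.0), (2.5, 0.2, 0.0),
    (-4.0, 1.0, 0.0), (6.0, 1.1, 0.0), (-1.0, 1.2, 0.0),
]
LABELS = [0, 0, 0, 1, 1, 1]

chosen = forward_selection(ROWS, LABELS, n_features=3)
print([FEATURE_NAMES[c] for c in chosen])  # ['signal']
```

Backward elimination runs the same idea in reverse: start from all attributes and repeatedly drop the one whose removal hurts the score least.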
Numerosity reduction
Reduce data volume by choosing
alternative, smaller forms of data
representation
Major families: histograms, clustering,
sampling
Data Reduction Method:
Histograms
Divide data into buckets and store the average (or sum) for
each bucket
Partitioning rules:
Equal-width: each bucket covers an equal value range
Equal-frequency (or equal-depth): each bucket holds roughly the same number of values
V-optimal: the histogram with the least variance, where histogram variance
is a weighted sum of the original values that each bucket represents
MaxDiff: place a bucket boundary between each pair of adjacent values whose
difference is among the β - 1 largest, where β is the number of buckets
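The first two partitioning rules can be compared with a small sketch. The function names and the sample values are hypothetical; V-optimal and MaxDiff are not covered here.

```python
def equal_width_buckets(values, n_buckets):
    """Split the value range into n_buckets intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets or 1.0  # assumption: avoid zero width
    buckets = [[] for _ in range(n_buckets)]
    for v in values:
        index = min(int((v - lo) / width), n_buckets - 1)
        buckets[index].append(v)
    return buckets


def equal_frequency_buckets(values, n_buckets):
    """Split the sorted values so each bucket holds about the same count."""
    ordered = sorted(values)
    n = len(ordered)
    bounds = [round(i * n / n_buckets) for i in range(n_buckets + 1)]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(n_buckets)]


def bucket_averages(buckets):
    """Store one average per bucket (None for an empty bucket)."""
    return [sum(b) / len(b) if b else None for b in buckets]


DATA = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # hypothetical values with one outlier
print([len(b) for b in equal_width_buckets(DATA, 3)])      # [8, 0, 1]
print([len(b) for b in equal_frequency_buckets(DATA, 3)])  # [3, 3, 3]
```

With a skewed column, equal-width buckets can end up empty or overloaded, while equal-frequency buckets stay balanced at the cost of uneven ranges.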
Data Reduction Method:
Clustering
Partition data set into clusters based on similarity,
and store cluster representation (e.g., centroid and
diameter) only
There are many choices of clustering definitions and
clustering algorithms
Cluster analysis will be studied in depth in Chapter 7
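Storing only a cluster representation can be sketched as follows. The sketch assumes the cluster assignment already exists and takes the diameter to be the maximum distance between any two points in the cluster; the function name and the sample points are hypothetical.

```python
import math
from itertools import combinations


def summarize_cluster(points):
    """Return (centroid, diameter) for one cluster of equal-length tuples."""
    n = len(points)
    dims = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / n for d in range(dims))
    # Assumption: diameter = largest pairwise Euclidean distance.
    diameter = max(
        (math.dist(p, q) for p, q in combinations(points, 2)),
        default=0.0,
    )
    return centroid, diameter


cluster = [(0, 0), (2, 0), (0, 2), (2, 2)]
centroid, diameter = summarize_cluster(cluster)
print(centroid)  # (1.0, 1.0)
print(round(diameter, 3))  # 2.828
```

Only the two summary values are kept per cluster, so the reduction is effective when the data form compact, well-separated groups.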
Data Reduction Method:
Sampling
Sampling: obtaining a small sample of s tuples to
represent the whole data set of N tuples
Simple random sample without replacement
Simple random sample with replacement
Cluster sample: if the tuples in D are grouped
into M mutually disjoint clusters, then a simple
random sample of s clusters can be obtained, where s < M
Stratified sample: if D is divided into mutually disjoint
strata, a simple random sample is drawn from each stratum
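The four sampling schemes can be sketched with Python's standard `random` module. The function names, the `fraction` parameter, and the rule of taking at least one tuple per stratum are illustrative assumptions.

```python
import random
from collections import defaultdict


def srswor(data, s, rng):
    """Simple random sample without replacement: no tuple repeats."""
    return rng.sample(data, s)


def srswr(data, s, rng):
    """Simple random sample with replacement: a tuple may repeat."""
    return rng.choices(data, k=s)


def cluster_sample(clusters, s, rng):
    """Pick s whole clusters at random and keep all of their tuples."""
    chosen = rng.sample(clusters, s)
    return [t for cluster in chosen for t in cluster]


def stratified_sample(data, stratum_of, fraction, rng):
    """Draw the same fraction from every stratum (at least one tuple)."""
    strata = defaultdict(list)
    for t in data:
        strata[stratum_of(t)].append(t)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample


rng = random.Random(0)  # fixed seed so runs are repeatable
data = list(range(100))
clusters = [data[i:i + 10] for i in range(0, 100, 10)]
print(len(srswor(data, 10, rng)))
print(len(cluster_sample(clusters, 3, rng)))
print(len(stratified_sample(data, lambda t: t % 2, 0.1, rng)))
```

Stratified sampling guarantees that small groups are represented, which a simple random sample of the same size does not.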
Sampling: with or without Replacement
[Figure: raw data sampled with and without replacement]
Sampling: Cluster or Stratified Sampling
[Figure: raw data and the resulting cluster/stratified sample]