Chapter 2 Data Preprocessing part3.ppt

logeswarisaravanan 19 views 24 slides May 05, 2024

About This Presentation

Data Preprocessing

Slide Content

Data Preprocessing part3

Knowledge Discovery (KDD) Process
Data mining: the core of the knowledge discovery process
[Figure: KDD pipeline. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

Forms of Data Preprocessing

Data Transformation
Data transformation: the data are
transformed or consolidated into forms
appropriate for mining

Data Transformation
Data Transformation can involve the
following:
Smoothing: remove noise from the data;
techniques include binning, regression, and clustering
Aggregation
Generalization
Normalization
Attribute construction
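
As one illustration of smoothing, the sketch below applies binning with bin means: the sorted values are split into equal-size bins and each value is replaced by the mean of its bin. The function name `smooth_by_bin_means`, the bin size, and the sample values are assumptions made for this example, not part of the slides.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into bins of bin_size, and replace
    every value with the mean of the bin it falls into."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        current_bin = ordered[start:start + bin_size]
        bin_mean = sum(current_bin) / len(current_bin)
        smoothed.extend([bin_mean] * len(current_bin))
    return smoothed


# Illustrative values only.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, bin_size=3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```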

Normalization
Min-max normalization
Z-score normalization
Decimal normalization

Min-max normalization
Min-max normalization: to [new_min_A, new_max_A]
\[ v' = \frac{v - \mathrm{min}_A}{\mathrm{max}_A - \mathrm{min}_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A \]
Ex. Let income range from $12,000 to $98,000,
normalized to [0.0, 1.0]. Then $73,600 is
mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
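
The min-max mapping can be sketched in Python as follows. This is a minimal illustration, not a library routine: the function name and its default target range [0.0, 1.0] are assumptions. The call reproduces the income example (73,600 within the range 12,000 to 98,000).

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Linearly map v from [min_a, max_a] onto [new_min, new_max]."""
    if max_a == min_a:
        raise ValueError("min_a and max_a must differ")
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


print(round(min_max_normalize(73_600, 12_000, 98_000), 3))  # 0.716
```

Note that a value outside the original [min_A, max_A] range maps outside the target range, so the bounds should come from the full data set.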




Z-score normalization
Z-score normalization (μ: mean, σ:
standard deviation):
\[ v' = \frac{v - \mu_A}{\sigma_A} \]
Ex. Let μ = 54,000, σ = 16,000. Then
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
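
A small Python sketch of z-score normalization is given here. The function names are illustrative, and `z_score_all` assumes the population standard deviation (`statistics.pstdev`); the slides do not say whether the population or sample form is intended.

```python
from statistics import mean, pstdev


def z_score(v, mu, sigma):
    """Number of standard deviations by which v lies above the mean."""
    return (v - mu) / sigma


def z_score_all(values):
    """Normalize a whole attribute (assumption: population std. deviation)."""
    mu, sigma = mean(values), pstdev(values)
    return [z_score(v, mu, sigma) for v in values]


print(z_score(73_600, 54_000, 16_000))  # 1.225
```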



Decimal normalization
Normalization by decimal scaling
\[ v' = \frac{v}{10^j} \]
where j is the smallest integer such that Max(|v'|) < 1
Suppose the recorded values of A range from
-986 to 917. The max absolute value is 986,
so j = 3, and each value is divided by 1,000
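
Decimal scaling can be sketched as below. The helper names are hypothetical, and the search is restricted to non-negative j, which is an assumption that suits data whose largest absolute value is at least 1.

```python
def decimal_scaling_exponent(values):
    """Smallest non-negative integer j such that max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return j


def decimal_scale(values):
    """Divide every value by 10**j, moving the decimal point j places."""
    j = decimal_scaling_exponent(values)
    return [v / 10 ** j for v in values]


print(decimal_scaling_exponent([-986, 917]))  # 3
print(decimal_scale([-986, 917]))             # [-0.986, 0.917]
```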

Data Reduction
Why data reduction?
A database/data warehouse may store
terabytes of data
Complex data analysis/mining may take a
very long time to run on the complete data
set

Data Reduction
Data reduction
Obtain a reduced representation of the
data set that is much smaller in volume
yet produces the same (or almost the same)
analytical results

Data Reduction
Data reduction strategies
Data cube aggregation: aggregation operations are
applied to the data
Attribute subset selection: irrelevant, weakly
relevant, or redundant attributes (dimensions) are
removed
Dimensionality reduction: e.g., remove
unimportant attributes
Numerosity reduction: e.g., fit data into models
Discretization and concept hierarchy generation:
replace low-level values with higher-level concepts

Data cube aggregation

Data cube aggregation
Data cubes store multidimensional
aggregated information
In a data cube, the lowest level is called the
base cuboid and the highest level is the apex
cuboid (e.g., representing the total sales for all
years)
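
The roll-up from a base cuboid toward the apex cuboid can be illustrated with a few lines of Python. The rows, the `(year, quarter, sales)` layout, and the `roll_up` helper are all made up for this sketch.

```python
from collections import defaultdict

# Hypothetical base-cuboid rows: (year, quarter, sales). Figures are invented.
BASE_CUBOID = [
    (2022, "Q1", 100), (2022, "Q2", 150), (2022, "Q3", 120), (2022, "Q4", 130),
    (2023, "Q1", 200), (2023, "Q2", 180),
]


def roll_up(rows, key):
    """Sum the sales measure over all rows that share the same key."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[2]
    return dict(totals)


# One level up: quarterly sales aggregated to annual sales.
sales_by_year = roll_up(BASE_CUBOID, key=lambda row: row[0])
# Apex cuboid: a single total over every dimension.
apex_total = sum(row[2] for row in BASE_CUBOID)

print(sales_by_year)  # {2022: 500, 2023: 380}
print(apex_total)     # 880
```

The aggregated table has one row per year instead of one per quarter, which is the volume reduction the strategy relies on.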

Attribute subset selection
Dimensionality reduction
Feature selection (i.e., attribute subset
selection):
Select a minimum set of features such that the
probability distribution of different classes given
the values for those features is as close as
possible to the original distribution given the
values of all features
Reduces the number of attributes appearing in the
discovered patterns, making them easier to understand

Attribute subset selection
Dimensionality reduction
Heuristic methods (needed because the number of
possible attribute subsets is exponential):
Step-wise forward selection
Step-wise backward elimination
Combining forward selection and backward
elimination
Decision-tree induction
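
Step-wise forward selection can be sketched as a greedy loop: start from an empty set and repeatedly add the attribute that improves a score the most, stopping when no attribute helps. The `score` callback and the toy weights below are hypothetical stand-ins; in practice the score might come from a statistical significance test or a classifier's accuracy.

```python
def forward_selection(attributes, score):
    """Greedy step-wise forward selection.

    score(subset) -> number, higher is better (hypothetical callback).
    """
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        # Evaluate every one-attribute extension of the current subset.
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best:
            break  # no remaining attribute improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected


# Toy usefulness weights, invented for the example.
WEIGHTS = {"A1": 3, "A2": 0, "A3": 0, "A4": 5, "A5": 0, "A6": 2}
toy_score = lambda subset: sum(WEIGHTS[a] for a in subset)

print(forward_selection(WEIGHTS, toy_score))  # ['A4', 'A1', 'A6']
```

Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal hurts the score least.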

Attribute subset selection
Dimensionality reduction
Initial attribute set:
{A1, A2, A3, A4, A5, A6}
[Figure: decision tree. The root tests A4?; its two branches test A1? and A6?; each of those ends in leaves Class 1 and Class 2]
Reduced attribute set: {A1, A4, A6}

Numerosity reduction
Reduce data volume by choosing
alternative, smaller forms of data
representation
Major families: histograms, clustering,
sampling

Data Reduction Method:
Histograms
[Figure: histogram; x-axis from 10000 to 100000 in steps of 10000, y-axis from 0 to 40]

Data Reduction Method:
Histograms
Divide data into buckets and store the average (or sum) for
each bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth): roughly the same number of values in each bucket
V-optimal: the histogram with the least variance (histogram variance is a weighted sum
of the original values that each bucket represents)
MaxDiff: set a bucket boundary between each pair of adjacent values for the pairs
having the β–1 largest differences, where β is the number of buckets
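
The two simplest partitioning rules, equal-width and equal-frequency, can be sketched as follows. The function names and the sample values are assumptions for illustration; each bucket is summarized by its count and sum, from which the average follows.

```python
def equal_width_buckets(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("all values are identical; width would be zero")
    width = (hi - lo) / k
    buckets = [[] for _ in range(k)]
    for v in values:
        index = min(int((v - lo) / width), k - 1)  # max value goes in last bucket
        buckets[index].append(v)
    return buckets


def equal_frequency_buckets(values, k):
    """Split the sorted values into k buckets of (nearly) equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]


def summarize(buckets):
    """Keep only (count, sum) per bucket instead of the raw values."""
    return [(len(b), sum(b)) for b in buckets]


values = [5, 7, 8, 10, 12, 15, 18, 20, 21, 25, 28, 30]  # illustrative data
print(summarize(equal_width_buckets(values, 3)))      # [(5, 42), (4, 74), (3, 83)]
print(summarize(equal_frequency_buckets(values, 3)))  # [(4, 30), (4, 65), (4, 104)]
```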

Data Reduction Method:
Clustering
Partition data set into clusters based on similarity,
and store cluster representation (e.g., centroid and
diameter) only
There are many choices of clustering definitions and
clustering algorithms
Cluster analysis will be studied in depth in Chapter 7
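
To make the idea of storing only a cluster representation concrete, the sketch below reduces an already-formed cluster to its centroid and diameter. It assumes the cluster assignment is given, and it takes the diameter to be the maximum pairwise Euclidean distance, which is one common definition among several. `math.dist` requires Python 3.8 or later.

```python
from itertools import combinations
from math import dist


def summarize_cluster(points):
    """Return (centroid, diameter) for a non-empty list of same-length tuples."""
    n = len(points)
    centroid = tuple(sum(coords) / n for coords in zip(*points))
    # Assumption: diameter = largest distance between any two members.
    diameter = max((dist(p, q) for p, q in combinations(points, 2)), default=0.0)
    return centroid, diameter


cluster = [(0, 0), (3, 4), (0, 4), (3, 0)]  # illustrative points
print(summarize_cluster(cluster))  # ((1.5, 2.0), 5.0)
```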

Data Reduction Method:
Sampling
Sampling: obtaining a small sample s to
represent the whole data set N
Simple random sample without replacement
Simple random sample with replacement
Cluster sample: if the tuples in D are grouped
into M mutually disjoint clusters, then a simple
random sample of s clusters can be obtained, where s < M
Stratified sample
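
The four sampling schemes can be sketched with Python's standard `random` module. The function names, the fixed sampling fraction for strata, and the rule of taking at least one tuple per stratum are assumptions of this example.

```python
import random


def srswor(data, s, rng):
    """Simple random sample WITHOUT replacement: s distinct tuples."""
    return rng.sample(data, s)


def srswr(data, s, rng):
    """Simple random sample WITH replacement: a tuple may be drawn twice."""
    return [rng.choice(data) for _ in range(s)]


def cluster_sample(clusters, s, rng):
    """Pick s whole clusters at random and keep all of their tuples."""
    chosen = rng.sample(clusters, s)
    return [t for cluster in chosen for t in cluster]


def stratified_sample(strata, fraction, rng):
    """Draw the same fraction from every stratum (at least one tuple each)."""
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


rng = random.Random(0)  # fixed seed so the run is repeatable
data = list(range(100))
clusters = [data[i:i + 10] for i in range(0, 100, 10)]
strata = {"young": data[:60], "senior": data[60:]}

print(len(srswor(data, 10, rng)))                 # 10
print(len(cluster_sample(clusters, 3, rng)))      # 30
print(len(stratified_sample(strata, 0.1, rng)))   # 10
```

Stratified sampling keeps small groups represented even when the data are skewed, which a plain random sample does not guarantee.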
