ShanthiSubramaniam
12 slides
Aug 29, 2024
About This Presentation
Preprocessing
DATA CLEANING
Dr. S. SHANTHI, Associate Professor, Department of CSE, Kongu Engineering College, Erode
Dr. S.SHANTI/ASP/CSE/KEC
Preparing Data for Analysis
1. CLEANING THE DATA
2. REMOVING OBSERVATIONS AND VARIABLES
3. GENERATING CONSISTENT SCALES ACROSS VARIABLES
4. NEW FREQUENCY DISTRIBUTION
5. CONVERTING TEXT TO NUMBERS
6. CONVERTING CONTINUOUS DATA TO CATEGORIES
7. COMBINING VARIABLES
8. GENERATING GROUPS
9. PREPARING UNSTRUCTURED DATA
Preparing Data for Analysis
1. CLEANING THE DATA
  For variables on a nominal or ordinal scale, inspect all possible values to uncover mistakes, duplications, and inconsistencies.
  Each value should map onto a unique term (e.g., "Kongu Engg. College", "KEC", and "Kongu" should all map onto one term).
  For numeric variables, look for non-numeric entries such as "above 50" or "out of range".
  Check for missing data values.
  Outliers: a single data value, or a small number of data values, that differ greatly from the rest of the values.
    Why an outlier occurs: for example, an error in the measurement.
    Methods to identify outliers: histograms and box plots.
  Check for duplicate entries.
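The cleaning checks above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a prescribed procedure: the `CANONICAL` mapping, the function names, and the 1.5 x IQR cut-off (the usual box-plot whisker convention) are assumptions introduced here, and the value 250 is an artificial outlier added to the ages from the normalization exercise.

```python
import statistics

# Hypothetical mapping from spelling variants to one canonical term.
CANONICAL = {
    "kongu engg. college": "Kongu Engineering College",
    "kec": "Kongu Engineering College",
    "kongu": "Kongu Engineering College",
}

def canonicalize(value):
    """Map a raw label onto its canonical term; unknown labels are only trimmed."""
    key = value.strip().lower()
    return CANONICAL.get(key, value.strip())

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] (box-plot whisker rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def drop_duplicates(rows):
    """Remove exact duplicate rows (tuples), keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

ages_with_error = [8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34, 250]
print(canonicalize(" KEC "))
print(iqr_outliers(ages_with_error))
print(drop_duplicates([("a", 1), ("b", 2), ("a", 1)]))
```

A flagged value is only a candidate for review; whether it is a measurement error still has to be decided from knowledge of the data.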
Preparing Data for Analysis
2. REMOVING OBSERVATIONS AND VARIABLES
  Constants and variables with too many missing data values are candidates for removal.
3. GENERATING CONSISTENT SCALES ACROSS VARIABLES
  Normalization types:
    Min-max normalization
    Z-score normalization
    Normalization by decimal scaling
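The removal criteria in step 2 (constants and variables with too many missing values) could be checked as in the sketch below. The table layout (a dict of column lists with `None` for missing), the function name, and the 50% threshold are all assumptions made for illustration; the slides do not state how many missing values is "too many".

```python
def columns_to_drop(table, max_missing_fraction=0.5):
    """Return names of columns that are constant or exceed the missing-value threshold.

    table maps a column name to a list of values; None marks a missing value.
    max_missing_fraction is a hypothetical cut-off, to be tuned per data set.
    """
    drop = []
    for name, values in table.items():
        if not values:
            drop.append(name)  # an empty column carries no information
            continue
        present = [v for v in values if v is not None]
        missing_fraction = 1 - len(present) / len(values)
        is_constant = len(set(present)) <= 1
        if is_constant or missing_fraction > max_missing_fraction:
            drop.append(name)
    return drop

# Hypothetical table for illustration only.
table = {
    "college": ["KEC", "KEC", "KEC", "KEC"],  # constant
    "age": [21, None, 24, 30],                # 25% missing, kept
    "income": [None, None, None, 50000],      # 75% missing
}
print(columns_to_drop(table))
```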
Preparing Data for Analysis - Normalization
Min-max normalization maps a value v of attribute A to v' in the range [new_min_A, new_max_A]:
\[ v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A \]
Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
Exercise: Consider the following ages of 12 persons: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34. Normalize the age attribute using min-max normalization with a new minimum of 1 and a new maximum of 5.
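As a check on the income example and the age exercise, here is a minimal sketch of min-max normalization in plain Python. The function names are introduced here for illustration; the arithmetic is the standard min-max mapping onto [new_min, new_max].

```python
def min_max_value(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Map v from [old_min, old_max] onto [new_min, new_max]."""
    if old_max == old_min:
        raise ValueError("attribute has zero range; min-max normalization is undefined")
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Normalize a whole attribute using its own observed minimum and maximum."""
    old_min, old_max = min(values), max(values)
    return [min_max_value(v, old_min, old_max, new_min, new_max) for v in values]

# Income example from the slide: $73,600 in the range $12,000 to $98,000.
print(round(min_max_value(73_600, 12_000, 98_000), 3))  # 0.716

# Age exercise: rescale to the range [1, 5].
ages = [8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34]
scaled_ages = min_max_normalize(ages, new_min=1, new_max=5)
print([round(a, 2) for a in scaled_ages])
```

The smallest age (8) maps to 1, the largest (34) maps to 5, and 21, which lies halfway between them, maps to 3.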
Preparing Data for Analysis - Normalization
Z-score normalization (μ: mean, σ: standard deviation of attribute A):
\[ v' = \frac{v - \mu}{\sigma} \]
Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
Normalization by decimal scaling:
\[ v' = \frac{v}{10^{j}} \]
where j is the smallest integer such that max(|v'|) < 1.
Exercise: Consider the following ages of 12 persons: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34. Normalize the age attribute using min-max normalization to [0, 1] and using z-score normalization.
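The two normalizations on this slide can be sketched as follows. Two choices here are assumptions rather than something the slides specify: the population standard deviation (`statistics.pstdev`) is used for σ, and the decimal-scaling exponent j is searched from 0 upward, so it is never negative.

```python
from statistics import mean, pstdev

def z_score_value(v, mu, sigma):
    """Z-score of a single value given the attribute mean and standard deviation."""
    return (v - mu) / sigma

def z_score_normalize(values):
    """Z-score normalize an attribute using its own mean and population SD."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD; a sample SD is also common
    if sigma == 0:
        raise ValueError("attribute is constant; z-score is undefined")
    return [z_score_value(v, mu, sigma) for v in values]

def decimal_scale(values):
    """Divide by 10**j, with j the smallest non-negative integer giving max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return j, [v / 10 ** j for v in values]

# Slide example: mu = 54,000 and sigma = 16,000 applied to $73,600.
print(round(z_score_value(73_600, 54_000, 16_000), 3))  # 1.225

ages = [8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34]
print([round(z, 2) for z in z_score_normalize(ages)])
print(decimal_scale(ages))
```

For the ages, the largest absolute value is 34, so j = 2 and every age is divided by 100.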
Preparing Data for Analysis
4. NEW FREQUENCY DISTRIBUTION
  To obtain a frequency distribution that more closely approximates a normal distribution, it may be necessary to take the log, exponential, or a Box–Cox transformation of the variable.
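To illustrate how such a transformation changes the shape of a distribution, the sketch below compares the skewness of a right-skewed variable before and after a log transformation. The sample values are invented for illustration, and the Box–Cox parameter `lam` is passed in by hand here; in practice it is usually estimated from the data.

```python
import math
from statistics import mean, pstdev

def skewness(values):
    """Population skewness: the mean of the cubed standardized values."""
    mu, sigma = mean(values), pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])

def box_cox(values, lam):
    """Box-Cox transform for positive values: (x**lam - 1) / lam, or log(x) when lam == 0."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

# Hypothetical right-skewed data (a few very large values).
raw = [1, 2, 2, 3, 3, 3, 4, 5, 8, 13, 40, 100]
logged = box_cox(raw, 0)  # lam = 0 is the plain log transformation

print(round(skewness(raw), 2))
print(round(skewness(logged), 2))
```

The skewness of the log-transformed values is closer to zero than that of the raw values, which means the long right tail has been pulled in.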
Preparing Data for Analysis
5. CONVERTING TEXT TO NUMBERS
  A variable with the values "low", "medium", and "high" may be replaced with 0, 1, 2.
  Dummy variables: one 0/1 variable per category.
6. CONVERTING CONTINUOUS DATA TO CATEGORIES
  Credit score => categories: poor, average, good, and excellent
  Mark => grade
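Steps 5 and 6 can be sketched together. The function names and, in particular, the credit-score cut points (580, 670, 740) are hypothetical values chosen only to make the example runnable; the slides give the category names but no thresholds.

```python
from bisect import bisect_right

def ordinal_encode(values, order):
    """Replace ordered text labels with their position in `order` (0, 1, 2, ...)."""
    index = {label: i for i, label in enumerate(order)}
    return [index[v] for v in values]

def dummy_encode(values, categories):
    """Create one 0/1 dummy variable per category for each value."""
    return [[1 if v == c else 0 for c in categories] for v in values]

def to_category(value, cut_points, labels):
    """Map a continuous value to a label; each cut point is the inclusive lower bound of the next label."""
    return labels[bisect_right(cut_points, value)]

levels = ["low", "medium", "high"]
print(ordinal_encode(["medium", "low", "high"], levels))
print(dummy_encode(["medium", "low"], levels))

# Hypothetical cut points for credit-score categories.
cuts = [580, 670, 740]
names = ["poor", "average", "good", "excellent"]
print([to_category(s, cuts, names) for s in [500, 580, 700, 800]])
```

Integer codes suit ordered labels such as low/medium/high, while dummy variables avoid implying an order among categories that have none.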
Preparing Data for Analysis
8. GENERATING GROUPS
  Larger data sets take more computational time to analyze; creating subsets from the data can speed up the analysis.
  One approach is to select a diverse set of observations, e.g., to analyze the performance of placement training during evening hours.
  Another is to break the data set down into subsets based on your knowledge of the data, e.g., when predicting the house price.
9. PREPARING UNSTRUCTURED DATA
  Images, text documents, web logs, device readouts, and audio or video information.
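Both ways of generating groups in step 8 can be sketched with the standard library. Simple random sampling is used here as one easy stand-in for selecting a varied subset, which is an assumption rather than the slides' method; the house records and the `city` grouping key are likewise invented for illustration.

```python
import random
from collections import defaultdict

def random_subset(rows, k, seed=0):
    """Draw k rows without replacement; a fixed seed makes the subset reproducible."""
    rng = random.Random(seed)
    return rng.sample(rows, k)

def split_by(rows, key):
    """Break the data set into subsets that share the same key(row) value."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return dict(groups)

# Hypothetical house records: (city, price).
houses = [
    ("Erode", 40), ("Erode", 55), ("Chennai", 120),
    ("Chennai", 150), ("Salem", 60), ("Salem", 45),
]
print(random_subset(houses, 3))
print(split_by(houses, key=lambda row: row[0]))
```

Each subset can then be analyzed separately, which is often faster than working on the full data set.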
How to Handle Missing Data?
  Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the percentage of missing values per attribute varies considerably.
  Fill in the missing value manually: tedious and often infeasible.
  Fill it in automatically with:
    a global constant, e.g., "unknown" (which may act as a new class)
    the attribute mean
    the attribute mean for all samples belonging to the same class (smarter)
    the most probable value: inference-based, such as a Bayesian formula or a decision tree
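Three of the automatic fill-in strategies above can be sketched as follows. The representation (`None` for a missing value) and the function names are assumptions, as is the fallback to the overall mean when a class has no observed values; the inference-based option is not shown.

```python
from collections import defaultdict
from statistics import mean

def fill_with_constant(values, constant="unknown"):
    """Replace each missing value (None) with a global constant."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Replace each missing value with the mean of the observed values."""
    observed_mean = mean([v for v in values if v is not None])
    return [observed_mean if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace each missing value with the mean of observed values in the same class."""
    by_class = defaultdict(list)
    for v, label in zip(values, labels):
        if v is not None:
            by_class[label].append(v)
    overall = mean([v for v in values if v is not None])
    filled = []
    for v, label in zip(values, labels):
        if v is not None:
            filled.append(v)
        elif by_class[label]:
            filled.append(mean(by_class[label]))
        else:
            filled.append(overall)  # assumption: fall back to the overall mean
    return filled

incomes = [10, None, 30, None, 50]
classes = ["a", "a", "b", "b", "b"]
print(fill_with_constant(incomes))
print(fill_with_mean(incomes))
print(fill_with_class_mean(incomes, classes))
```

The class-conditional mean uses more information than the global mean: here the missing value in class "a" is filled with 10 and the one in class "b" with 40, rather than both with 30.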
How to Handle Noisy Data?
Binning
  First sort the data and partition it into (equal-frequency) bins.
  Then smooth by bin means, by bin medians, by bin boundaries, etc.
  Binning is also used for data discretization.
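The binning procedure above can be sketched in plain Python, reusing the 12 ages from the normalization exercise. The function names, the choice of three bins, and the tie rule in boundary smoothing (a value equally close to both boundaries goes to the lower one) are assumptions; the sketch also assumes there are at least as many values as bins.

```python
from statistics import mean, median

def equal_frequency_bins(values, n_bins):
    """Sort the data and split it into n_bins bins of (nearly) equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value in a bin with the bin median."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer bin boundary (ties go to the lower one)."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

ages = [8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34]
bins = equal_frequency_bins(ages, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_medians(bins))
print(smooth_by_boundaries(bins))
```

With three bins the sorted ages split into [8, 9, 15, 16], [21, 21, 24, 26], and [27, 30, 30, 34], and each smoothing method replaces the values within a bin by a representative of that bin.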
Data Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values.
Methods
  Smoothing: remove noise from the data
  Attribute/feature construction: new attributes constructed from the given ones
  Aggregation: summarization, data cube construction
  Normalization: values scaled to fall within a smaller, specified range
    min-max normalization
    z-score normalization
    normalization by decimal scaling
  Generalization: concept hierarchy climbing