Importance of Data Cleaning in Data Analytics

ShanthiSubramaniam 354 views 12 slides Aug 29, 2024


About This Presentation

Preprocessing

Slide Content

DATA CLEANING
Dr. S. SHANTHI, Associate Professor, Department of CSE, Kongu Engineering College, Erode
Dr. S.SHANTI/ASP/CSE/KEC

Preparing Data for Analysis
1. CLEANING THE DATA
2. REMOVING OBSERVATIONS AND VARIABLES
3. GENERATING CONSISTENT SCALES ACROSS VARIABLES
4. NEW FREQUENCY DISTRIBUTION
5. CONVERTING TEXT TO NUMBERS
6. CONVERTING CONTINUOUS DATA TO CATEGORIES
7. COMBINING VARIABLES
8. GENERATING GROUPS
9. PREPARING UNSTRUCTURED DATA

Preparing Data for Analysis
1. CLEANING THE DATA
- For a variable on a nominal or ordinal scale, inspect all possible values to uncover mistakes, duplications and inconsistencies. Each value should map onto a unique term (e.g., Kongu Engg. College, KEC, Kongu, ... should all map to one term).
- For a numeric variable, look for non-numeric entries such as “above 50” or “out of range.”
- Check for missing data values.
- Outliers: a single data value or a small number of data values that differ greatly from the rest of the values.
  - Why an outlier occurs: for example, an error in the measurement.
  - Methods to identify outliers: histograms and box plots.
- Check for duplicate entries.
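The cleaning checks above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the alias table, the sample marks, and the 1.5 × IQR fence (the usual box-plot whisker rule) are not from the slides.

```python
from statistics import quantiles

# Hypothetical alias table: every known spelling maps to one canonical term.
ALIASES = {
    "kongu engg. college": "Kongu Engineering College",
    "kec": "Kongu Engineering College",
    "kongu": "Kongu Engineering College",
}

def canonical(value):
    """Return the canonical term for a raw label, or the trimmed label if no alias is known."""
    key = value.strip().lower()
    return ALIASES.get(key, value.strip())

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], the fences a box plot draws."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def drop_duplicates(rows):
    """Remove exact duplicate rows while keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

colleges = [canonical(c) for c in ["KEC", "Kongu ", "Kongu Engg. College"]]
marks = [52, 55, 58, 60, 61, 63, 65, 250]  # 250 is an illustrative entry error
outliers = iqr_outliers(marks)
rows = drop_duplicates([("A", 52), ("B", 55), ("A", 52)])
```

A flagged value is only a candidate for review; whether it is a measurement error or a genuine extreme still has to be decided from knowledge of the data.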

Preparing Data for Analysis
2. REMOVING OBSERVATIONS AND VARIABLES
- Constants and variables with too many missing data values are candidates for removal.
3. GENERATING CONSISTENT SCALES ACROSS VARIABLES
- Normalization types:
  - Min-max normalization
  - Z-score normalization
  - Normalization by decimal scaling
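As a rough sketch of the removal step, the function below lists columns that are constant or mostly missing. The table layout (a dict of column name to list, with `None` for missing), the sample columns, and the 50% threshold are all assumptions chosen for illustration.

```python
def columns_to_drop(table, max_missing=0.5):
    """List columns that are constant or have too many missing values.

    table: dict mapping column name -> non-empty list of values (None = missing).
    max_missing: hypothetical threshold on the fraction of missing values.
    """
    drop = []
    for name, values in table.items():
        present = [v for v in values if v is not None]
        missing_ratio = 1 - len(present) / len(values)
        is_constant = len(set(present)) <= 1
        if is_constant or missing_ratio > max_missing:
            drop.append(name)
    return drop

table = {
    "country": ["IN", "IN", "IN", "IN"],  # constant: carries no information
    "age": [21, 34, None, 27],            # one missing value: kept
    "fax": [None, None, None, "0424"],    # 75% missing: candidate for removal
}
to_drop = columns_to_drop(table)
```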

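Preparing Data for Analysis: Normalization
Min-max normalization maps a value v of attribute A to v' in the range [new_min_A, new_max_A]:

$$v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A$$

Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to

$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0.0) + 0.0 = 0.716$$

Exercise: Consider the following ages of 12 persons: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34. Normalize the age attribute using min-max normalization with a new minimum of 1 and a new maximum of 5.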
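A minimal sketch of min-max normalization applied to the income example and the age exercise. It assumes the age list is read as twelve values starting 8, 16, 9; the function name `min_max` is just an illustrative choice.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        raise ValueError("min-max normalization is undefined for a constant attribute")
    return [(v - lo) / span * (new_max - new_min) + new_min for v in values]

# Income example: $73,600 within the range $12,000 to $98,000, scaled to [0.0, 1.0].
income = min_max([12_000, 73_600, 98_000])[1]

# Age exercise, scaled to [1, 5].
ages = [8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34]
scaled_ages = min_max(ages, new_min=1, new_max=5)
```

Here the youngest age (8) maps to 1, the oldest (34) maps to 5, and 21 lands exactly halfway at 3.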

Preparing Data for Analysis: Normalization
Z-score normalization (μ: mean, σ: standard deviation):

$$v' = \frac{v - \mu}{\sigma}$$

Ex. Let μ = 54,000, σ = 16,000. Then an income value of $73,600 is mapped to

$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$

Normalization by decimal scaling:

$$v' = \frac{v}{10^{j}}$$

where j is the smallest integer such that max(|v'|) < 1.

Exercise: Consider the following ages of 12 persons: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34. Normalize the age attribute using min-max normalization to [0, 1] and using z-score normalization.
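The two remaining normalizations can be sketched as follows. Two assumptions are made here: the population standard deviation (`statistics.pstdev`) is used, since the slide does not say whether σ is the population or sample value, and the age list is read as twelve values starting 8, 16, 9.

```python
from statistics import mean, pstdev

def z_score(values):
    """Centre values on the mean and scale by the population standard deviation.

    Assumes the attribute is not constant (sigma > 0).
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that max(|v'|) < 1."""
    peak = max(abs(v) for v in values)
    j = 0
    while peak / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

ages = [8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34]
z_ages = z_score(ages)
scaled_ages, j = decimal_scaling(ages)

# Single-value check with a given mean and standard deviation.
z_income = (73_600 - 54_000) / 16_000
```

For the ages, the largest value is 34, so j = 2 and every age is divided by 100.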

Preparing Data for Analysis
4. NEW FREQUENCY DISTRIBUTION
- Some analyses require a variable whose frequency distribution more closely approximates a normal distribution. To obtain one, it may be necessary to apply a log, exponential, or Box–Cox transformation.
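To illustrate, the sketch below applies a Box–Cox transformation with a fixed, hand-picked λ (λ = 0 reduces to the natural log) and compares skewness before and after. The income figures are made up, and in practice λ is usually estimated from the data rather than chosen by hand.

```python
import math
from statistics import mean, pstdev

def box_cox(values, lam):
    """Box-Cox transform for strictly positive values; lam == 0 gives the natural log."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

def skewness(values):
    """Population skewness: the mean of cubed z-scores (0 for a symmetric distribution)."""
    mu, sigma = mean(values), pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])

# Illustrative right-skewed incomes (not from the slides).
incomes = [12_000, 15_000, 18_000, 22_000, 30_000, 45_000, 98_000]
logged = box_cox(incomes, lam=0)

skew_before = skewness(incomes)
skew_after = skewness(logged)
```

On this sample the log transform pulls in the long right tail, so the skewness moves closer to zero.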

Preparing Data for Analysis
5. CONVERTING TEXT TO NUMBERS
- A variable with values “low,” “medium” and “high” may be replaced with 0, 1, 2.
- Dummy variables.
6. CONVERTING CONTINUOUS DATA TO CATEGORIES
- Credit score => categories: poor, average, good, and excellent.
- Mark => Grade.
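Both conversions can be sketched briefly. The ordinal codes follow the slide (low, medium, high as 0, 1, 2); the credit-score cut points are hypothetical values picked only to make the example concrete.

```python
from bisect import bisect_right

# Ordinal text values mapped to numbers that preserve their order.
ORDER = {"low": 0, "medium": 1, "high": 2}

def encode_ordinal(values):
    """Replace each ordinal label with its numeric code."""
    return [ORDER[v] for v in values]

def dummy_variables(values):
    """One 0/1 indicator per distinct category, for variables with no natural order."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

# Hypothetical cut points: scores below 580 are "poor", 580-669 "average", and so on.
CUTS = [580, 670, 740]
LABELS = ["poor", "average", "good", "excellent"]

def to_category(score):
    """Map a continuous score onto its category using the cut points above."""
    return LABELS[bisect_right(CUTS, score)]

codes = encode_ordinal(["low", "high", "medium"])
dummies = dummy_variables(["low", "high"])
bands = [to_category(s) for s in (550, 600, 700, 800)]
```

The same `to_category` pattern covers the mark-to-grade case by swapping in grade boundaries and grade labels.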

Preparing Data for Analysis
8. GENERATING GROUPS
- Larger data sets take more computational time to analyze; creating subsets from the data can speed up the analysis.
- Subsets can be created to select a diverse set of observations. Example: analyze the performance of placement training during evening hours.
- The data set can be broken down into subsets based on your knowledge of the data. Example: predicting the house price.
9. PREPARING UNSTRUCTURED DATA
- Images, text documents, web logs, device readouts, audio or video information.
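Two simple ways of generating groups are sketched below: a random subset to cut computation time, and a split on a field chosen from domain knowledge. The house records and the `locality` field are hypothetical; the slide names house-price prediction only as an example.

```python
import random
from collections import defaultdict

def random_subset(rows, k, seed=0):
    """Draw k rows without replacement; a fixed seed keeps the subset reproducible."""
    rng = random.Random(seed)
    return rng.sample(rows, k)

def group_by(rows, key):
    """Split rows into subsets that share the same value for the given field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)

# Hypothetical house records; "locality" stands in for any knowledge-based split.
houses = [
    {"locality": "urban", "price": 95},
    {"locality": "rural", "price": 40},
    {"locality": "urban", "price": 110},
    {"locality": "rural", "price": 35},
]
subset = random_subset(houses, k=2)
by_locality = group_by(houses, "locality")
```

Each subset in `by_locality` could then be analyzed or modeled separately.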

How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (when doing classification). Not effective when the percentage of missing values per attribute varies considerably.
- Fill in the missing value manually: tedious and often infeasible.
- Fill it in automatically with:
  - a global constant, e.g., “unknown” (which may act as a new class);
  - the attribute mean;
  - the attribute mean for all samples belonging to the same class (smarter);
  - the most probable value: inference-based, such as a Bayesian formula or a decision tree.
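The first three automatic strategies can be sketched as below, with `None` standing for a missing value. The function names and sample data are illustrative, and the class-wise version assumes every class has at least one observed value.

```python
from collections import defaultdict
from statistics import mean

def fill_constant(values, constant="unknown"):
    """Replace every missing value with one global constant."""
    return [constant if v is None else v for v in values]

def fill_mean(values):
    """Replace missing values with the mean of the observed values."""
    mu = mean(v for v in values if v is not None)
    return [mu if v is None else v for v in values]

def fill_class_mean(values, labels):
    """Replace missing values with the mean of observed values in the same class."""
    observed = defaultdict(list)
    for v, label in zip(values, labels):
        if v is not None:
            observed[label].append(v)
    class_mean = {label: mean(vs) for label, vs in observed.items()}
    return [class_mean[label] if v is None else v
            for v, label in zip(values, labels)]

incomes = [10, None, 30, None, 50]
classes = ["a", "a", "b", "b", "b"]

by_constant = fill_constant(["red", None])
by_mean = fill_mean(incomes)
by_class = fill_class_mean(incomes, classes)
```

With the overall mean both gaps become 30, while the class-wise mean fills them with 10 and 40, which is why the slide calls that option smarter.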

How to Handle Noisy Data?
Binning
- First sort the data and partition it into (equal-frequency) bins.
- Then smooth by bin means, by bin medians, by bin boundaries, etc.
- Binning is also used for data discretization.
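A small sketch of equal-frequency binning and smoothing, using the twelve ages from the normalization exercise (read as 8, 16, 9, ...). Two choices here are assumptions: three bins of four values each, and ties in boundary smoothing going to the lower boundary.

```python
from statistics import mean, median

def equal_frequency_bins(values, n_bins):
    """Sort the values and cut them into n_bins bins; the last bin takes any remainder."""
    ordered = sorted(values)
    size = len(ordered) // n_bins
    bins = [ordered[i * size:(i + 1) * size] for i in range(n_bins - 1)]
    bins.append(ordered[(n_bins - 1) * size:])
    return bins

def smooth_by_means(bins):
    """Replace every value with the mean of its bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value with the median of its bin."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

ages = [8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34]
bins = equal_frequency_bins(ages, 3)
by_means = smooth_by_means(bins)
by_medians = smooth_by_medians(bins)
by_boundaries = smooth_by_boundaries(bins)
```

The sorted ages split into [8, 9, 15, 16], [21, 21, 24, 26] and [27, 30, 30, 34]; smoothing by means replaces them with 12, 23 and 30.25 respectively.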

Data Transformation
A function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values.
Methods:
- Smoothing: remove noise from data.
- Attribute/feature construction: new attributes constructed from the given ones.
- Aggregation: summarization, data cube construction.
- Normalization: values scaled to fall within a smaller, specified range.
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Generalization: concept hierarchy climbing.
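Three of the listed methods (attribute construction, aggregation, and generalization) can be illustrated with toy data. All field names, sales figures, and the city-to-state hierarchy below are made up for the example.

```python
from collections import defaultdict

def add_area(rows):
    """Attribute construction: derive a new 'area' attribute from width and height."""
    return [{**row, "area": row["width"] * row["height"]} for row in rows]

def yearly_totals(monthly_sales):
    """Aggregation: summarize (year, month) -> amount into one total per year."""
    totals = defaultdict(float)
    for (year, _month), amount in monthly_sales.items():
        totals[year] += amount
    return dict(totals)

# Generalization: climb one level of a hypothetical concept hierarchy, city -> state.
CITY_TO_STATE = {"Erode": "Tamil Nadu", "Chennai": "Tamil Nadu", "Kochi": "Kerala"}

def generalize(cities):
    """Replace each city with its state; unmapped cities become 'unknown'."""
    return [CITY_TO_STATE.get(city, "unknown") for city in cities]

plots = add_area([{"width": 3, "height": 4}])
totals = yearly_totals({(2023, 1): 10.0, (2023, 2): 15.0, (2024, 1): 7.5})
states = generalize(["Erode", "Kochi", "Paris"])
```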
