Handling noisy data

vivekgandhi399 5,238 views 15 slides Apr 18, 2018

About This Presentation

Data Mining


Slide Content

G H Patel College of Engineering & Technology
Gujarat Technological University
Submitted by: Vivek Gandhi (140110107017)
Submitted to: Ms. Manpreet Bagga

Data Mining Handling Noisy Data

Noisy Data: Noise is random error or variance in a measured variable; in other words, meaningless data.

Incorrect attribute values may be due to:
* Faulty data collection
* Data entry problems
* Data transmission problems
* Technology limitations
* Inconsistency in naming conventions

How to Handle Noisy Data?
* Binning method
* Clustering
* Regression

Binning Method: First sort the data and partition it into bins. The values can then be smoothed by bin means, bin medians, or bin boundaries.

Equal-width (distance) partitioning:
* Divides the range into N intervals of equal size (a uniform grid).
* If A and B are the lowest and highest values of the attribute, the width of each interval is W = (B - A) / N.
* This is the most straightforward approach, but skewed data is not handled well.
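The width formula above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function name `equal_width_bins` is hypothetical, and placing the maximum value in the last bin is an assumption about how the closed upper edge is handled. The input is the price list from this deck's smoothing example.

```python
def equal_width_bins(values, n_bins):
    """Partition values into n_bins intervals of equal width (hypothetical helper)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # W = (B - A) / N
    bins = [[] for _ in range(n_bins)]
    for v in values:
        if width == 0:
            idx = 0  # all values identical: everything goes in the first bin
        else:
            # Assumption: the maximum value is clamped into the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
        bins[idx].append(v)
    return width, bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
width, bins = equal_width_bins(prices, 3)
print(width)
print(bins)
```

With this data the last interval receives twice as many values as the other two, which shows how equal-width bins can end up unevenly filled.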

Equal-depth (frequency) partitioning:
* Divides the range into N intervals, each containing approximately the same number of samples.
* Gives good data scaling.
* Managing categorical attributes can be tricky.
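A minimal sketch of equal-depth partitioning, assuming the data fits in a list. The helper name `equal_depth_bins` is hypothetical, and giving any leftover values to the first bins is one possible convention among several.

```python
def equal_depth_bins(values, n_bins):
    """Split sorted values into n_bins bins of (nearly) equal size (hypothetical helper)."""
    data = sorted(values)
    base, extra = divmod(len(data), n_bins)
    bins = []
    start = 0
    for i in range(n_bins):
        # Assumption: when the count is not divisible, the first bins get one extra value.
        size = base + (1 if i < extra else 0)
        bins.append(data[start:start + size])
        start += size
    return bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_depth_bins(prices, 3))
```

For the twelve prices used in this deck, each of the three bins receives four values.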

Binning Methods for Data Smoothing
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into bins:
 - Bin 1: 4, 8, 9, 15
 - Bin 2: 21, 21, 24, 25
 - Bin 3: 26, 28, 29, 34
* Smoothing by bin means (rounded to the nearest integer):
 - Bin 1: 9, 9, 9, 9
 - Bin 2: 23, 23, 23, 23
 - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries (each value is replaced by the closer of the bin's minimum and maximum):
 - Bin 1: 4, 4, 4, 15
 - Bin 2: 21, 21, 25, 25
 - Bin 3: 26, 26, 26, 34
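The two smoothing rules in this example can be reproduced with a short sketch. The function names are hypothetical. Rounding means to the nearest integer and sending ties to the lower boundary are assumptions; the slide does not state either rule, although the rounding matches its numbers.

```python
def smooth_by_means(bins):
    """Replace every value in a bin with the bin's mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Assumption: a value equally close to both boundaries goes to the lower one.
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```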

Cluster Analysis: Outliers may be detected by clustering, where similar values are organized into groups, or clusters. Values that fall outside every cluster may be considered outliers.
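As a rough illustration of the idea, the sketch below groups one-dimensional values with a simple gap rule and flags groups that are too small. This is not a specific clustering algorithm from the slides; the function names, the `max_gap` and `min_size` parameters, and the appended reading of 95 are all hypothetical choices made for the example.

```python
def gap_clusters(values, max_gap):
    """Group sorted 1-D values; a gap larger than max_gap starts a new cluster."""
    data = sorted(values)
    clusters = [[data[0]]]
    for v in data[1:]:
        if v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def find_outliers(values, max_gap, min_size=2):
    """Treat members of clusters smaller than min_size as outliers (an assumption)."""
    return [v for c in gap_clusters(values, max_gap) if len(c) < min_size for v in c]


# Prices from this deck plus one hypothetical extreme reading (95).
readings = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 95]
print(find_outliers(readings, max_gap=10))  # [95]
```

The thresholds control what counts as an outlier, so in practice they would need to be chosen with the data's scale in mind.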

Example:
* Data: 4, 8, 15, 21, 21, 24, 25, 28, 34
* Partition into equal-depth bins:
 - Bin 1: 4, 8, 15
 - Bin 2: 21, 21, 24
 - Bin 3: 25, 28, 34
* Smooth by bin means:
 - Bin 1: 9, 9, 9
 - Bin 2: 22, 22, 22
 - Bin 3: 29, 29, 29

Example:
* Smooth by bin boundaries:
 - Bin 1: 4, 4, 15
 - Bin 2: 21, 21, 24
 - Bin 3: 25, 25, 34

Regression: Data can be smoothed by fitting it to a function. Linear regression involves finding the “best” line to fit two attributes, so that one attribute can be used to predict the other. Multiple linear regression is an extension of linear regression in which more than two attributes are involved and the data are fit to a multidimensional surface.
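A minimal sketch of smoothing with simple linear regression, assuming "best" means the ordinary least-squares line. The helper name `fit_line` and the sample values are hypothetical and serve only to illustrate the idea.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx  # requires at least two distinct x values
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noisy measurements that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_line(xs, ys)

# Smoothing: replace each observed y with the value predicted by the fitted line.
smoothed = [slope * x + intercept for x in xs]
print(slope, intercept)
print(smoothed)
```

Multiple linear regression follows the same idea with more than one predictor attribute, and is usually solved with a linear algebra library rather than by hand.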
