Major Tasks in Data Preprocessing - Data cleaning

VidhyaB10, Feb 28, 2025



Slide Content

Data Mining & Warehousing
Dr.VIDHYA B
ASSISTANT PROFESSOR & HEAD
Department of Computer Technology
Sri Ramakrishna College of Arts and Science
Coimbatore - 641 006
Tamil Nadu, India
Unit 2 - Preprocessing

Chapter 3: Data Preprocessing

Data Preprocessing: An Overview

Data Quality

Major Tasks in Data Preprocessing

Data Cleaning

Data Integration

Data Reduction

Data Transformation and Data Discretization

Summary

Data Quality: Why Preprocess the Data?

Measures for data quality: A multidimensional view

Accuracy: correct or wrong, accurate or not

Completeness: not recorded, unavailable, …

Consistency: some modified but some not, dangling, …

Timeliness: timely update

Believability: how much the data are trusted to be correct

Interpretability: how easily the data can be understood
Data Quality: Why Preprocess the Data?

Example: analyzing a company's sales data for its branches.

On inspecting the company's database and data warehouse, users of the database system report errors, unusual values, and inconsistencies in the data recorded for some transactions.

Data to be analyzed by data mining techniques are often:

incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data);

inaccurate or noisy (containing errors, or values that deviate from the expected);

inconsistent (e.g., containing discrepancies in the department codes used to categorize items)
Data Quality: Why Preprocess the Data?

Reasons for inaccurate data (i.e., having incorrect attribute values):

The data collection instruments used may be faulty.

There may have been human or computer errors occurring at data
entry.

Users may purposely submit incorrect data values for mandatory
fields when they do not wish to submit personal information (e.g.,
by choosing the default value “January 1” displayed for birthday).
This is known as disguised missing data.

There may be technology limitations: limited buffer size for
coordinating synchronized data transfer and consumption.

Incorrect data may also result from inconsistencies in naming
conventions or data codes, or inconsistent formats for input fields
(e.g., date). Duplicate tuples also require data cleaning.
Data Quality: Why Preprocess the Data?

Reasons for incomplete data:

Attributes of interest may not always be available, such as customer
information for sales transaction data.

Relevant data may not be recorded due to a misunderstanding or because
of equipment malfunctions.

The recording of the data history or modifications may have been
overlooked. Missing data, particularly for tuples with missing values for
some attributes, may need to be inferred.

Timeliness also affects data quality. Month-end data that are not updated
in a timely fashion have a negative impact on the data quality.

Two other factors affecting data quality are believability and interpretability

Believability reflects how much the data are trusted by users

Interpretability reflects how easily the data can be understood
Major Tasks in Data Preprocessing

Data cleaning

Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies.

Dirty data can cause confusion for the mining procedure,
resulting in unreliable output.

Data integration

Integration of multiple databases, data cubes, or files

Some attributes representing a given concept may have
different names in different databases, causing
inconsistencies and redundancies.

E.g., customer_id in one data store and cust_id in another.

A large amount of redundant data may slow down or confuse
the knowledge discovery process.
Major Tasks in Data Preprocessing

Data reduction obtains a reduced representation of the data set that is
smaller in volume, yet produces the same (or almost the same) analytical results.

Dimensionality reduction: data encoding schemes are applied to obtain a
reduced or “compressed” representation of the original data, e.g., attribute
subset selection (removing irrelevant attributes) and attribute
construction (deriving a small set of more useful attributes from the
original set).

Numerosity reduction : the data are replaced by alternative, smaller
representations using parametric models (e.g., regression or log-linear
models) or nonparametric models (e.g., histograms, clusters, sampling, or
data aggregation).

Data compression

Data transformation and data discretization

Normalization and concept hierarchy generation are powerful tools that
allow data mining at multiple abstraction levels.
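As a sketch of one such transformation, min-max normalization rescales a numeric attribute to [0, 1]; the price values below are made up for illustration:

```python
# Min-max normalization of a numeric attribute to [0, 1].
# The values are illustrative, not from the slides.
values = [200.0, 300.0, 400.0, 600.0, 1000.0]
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]
# normalized == [0.0, 0.125, 0.25, 0.5, 1.0]
```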
Major Tasks in Data Preprocessing
Chapter 3: Data Preprocessing

Data Preprocessing: An Overview

Data Quality

Major Tasks in Data Preprocessing

Data Cleaning

Data Integration

Data Reduction

Data Transformation and Data Discretization

Summary

Data Cleaning - Introduction

Data in the Real World Is Dirty: lots of potentially incorrect data, e.g.,
faulty instruments, human or computer error, transmission error

Data cleaning (or data cleansing) routines attempt to fill in missing
values, smooth out noise while identifying outliers, and correct
inconsistencies in the data.

incomplete: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data

e.g., Occupation=“ ” (missing data)

noisy: containing noise, errors, or outliers

e.g., Salary=“−10” (an error)

inconsistent: containing discrepancies in codes or names, e.g.,

Age=“42”, Birthday=“03/07/2010”

Was rating “1, 2, 3”, now rating “A, B, C”

discrepancy between duplicate records

Intentional (e.g., disguised missing data)

Jan. 1 as everyone’s birthday?
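The disguised-missing-data case above (a default like Jan. 1 standing in for a real birthday) can be flagged by looking for an implausibly over-represented value; the birthday data and the 25% threshold below are assumptions for illustration:

```python
from collections import Counter

# Hypothetical birthday field; "01-01" is the form's default value,
# so its over-representation hints at disguised missing data.
birthdays = ["01-01", "03-15", "01-01", "07-22", "01-01", "01-01", "11-02"]
top_value, top_count = Counter(birthdays).most_common(1)[0]
suspicious = top_count / len(birthdays) > 0.25  # crude threshold (assumption)
```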
Data Cleaning – Missing Values

Data is not always available

E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data

Missing data may be due to

equipment malfunction

inconsistent with other recorded data and thus
deleted

data not entered due to misunderstanding

certain data may not be considered important at the
time of entry

history or changes of the data not registered

Missing data may need to be inferred

Data Cleaning – Missing Values

Ignore the tuple:

when the class label is missing

Not very effective, unless the tuple contains several
attributes with missing values. It is especially poor when
the percentage of missing values per attribute varies
considerably.

Fill in the missing value manually:

Time consuming and may not be feasible given a large data
set with many missing values.

Use a global constant to fill in the missing value:

Replace all missing attribute values by the same constant
such as a label like “Unknown” or −∞
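The "ignore the tuple" and "global constant" strategies above can be sketched in plain Python; the toy tuples below are assumptions:

```python
# Toy tuples; None marks a missing value, "label" is the class attribute.
rows = [
    {"age": 34, "occupation": "engineer", "label": "yes"},
    {"age": 41, "occupation": None,       "label": "no"},
    {"age": 29, "occupation": "teacher",  "label": None},
]

# Ignore the tuple: drop rows whose class label is missing.
kept = [dict(r) for r in rows if r["label"] is not None]

# Global constant: replace the remaining missing values with "Unknown".
for r in kept:
    for k, v in r.items():
        if v is None:
            r[k] = "Unknown"
```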
Data Cleaning – Missing Values

Use a measure of central tendency for the attribute (e.g.,
the mean or median) to fill in the missing value:

For normal (symmetric) data distributions the mean can be
used, while skewed data distributions should employ the
median

Use the attribute mean or median for all samples
belonging to the same class as the given tuple:

If the data distribution for a given class is skewed, the
median value is a better choice

Use the most probable value to fill in the missing value:

Determined with regression, inference-based tools using a
Bayesian formalism, or decision tree induction
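A minimal sketch of the central-tendency strategies above, using Python's statistics module; the (class, income) tuples are made up:

```python
from statistics import median

# Hypothetical (class, income) tuples; None marks a missing income.
rows = [("A", 30.0), ("A", None), ("A", 50.0),
        ("B", 100.0), ("B", 120.0), ("B", None)]

# Fill with the overall median of the attribute.
known = [v for _, v in rows if v is not None]
overall = [v if v is not None else median(known) for _, v in rows]

# Fill with the median of the tuple's own class instead.
def class_median(cls):
    return median(v for c, v in rows if c == cls and v is not None)

by_class = [v if v is not None else class_median(c) for c, v in rows]
```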
Data Cleaning - Noisy Data

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to

faulty data collection instruments

data entry problems

data transmission problems

technology limitation

inconsistency in naming convention

Three methods to smooth noisy data:

Binning

Regression

Outlier Analysis
How to Handle Noisy Data?

Binning is also used as a discretization technique.

First sort the data and partition it into (equal-frequency)
bins.

Then one can smooth by bin means, by bin medians, by bin
boundaries, etc.

In smoothing by bin means, each value in a bin is replaced
by the mean value of the bin. For example, the mean of the
values 4, 8, and 15 in Bin 1 is 9, so each original value in
this bin is replaced by the value 9.

Similarly, smoothing by bin medians can be employed, in
which each bin value is replaced by the bin median.

In smoothing by bin boundaries, the minimum and maximum
values in a given bin are identified as the bin boundaries,
and each bin value is then replaced by the closest boundary
value.
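The smoothing steps above can be sketched in plain Python. Bin 1 (4, 8, 15) comes from the slide; the remaining values are assumed for illustration:

```python
# Equal-frequency binning with smoothing; data assumed already sorted.
# Bin 1 (4, 8, 15) matches the slide's example; the rest is illustrative.
data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bin_size = 3
bins = [data[i:i + bin_size] for i in range(0, len(data), bin_size)]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer of
# the bin's minimum or maximum.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]
# by_means[0] == [9.0, 9.0, 9.0]; by_bounds[0] == [4, 4, 15]
```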
How to Handle Noisy Data?

Regression

Data smoothing can also be done by regression, which fits
the data values to a function.

Linear regression involves finding the “best” line to fit
two attributes (or variables), so that one attribute can be
used to predict the other.

Multiple linear regression is an extension in which more
than two attributes are involved and the data are fit to a
multidimensional surface.

Outlier analysis

Outliers may be detected by clustering, where similar values
are organized into groups, or “clusters.”

Intuitively, values that fall outside of the set of
clusters may be considered outliers
Data Cleaning as a Process
Data discrepancy detection

The first step in data cleaning as a process is discrepancy detection.
Discrepancies can arise from several factors:

poorly designed data entry forms with many optional fields

human error in data entry

deliberate errors, e.g., users who do not wish to reveal information
about themselves

data decay, e.g., outdated addresses

inconsistent data representations and inconsistent use of codes

errors in instrumentation devices

data being (inadequately) used for purposes other than originally
intended

inconsistencies due to data integration
Data Cleaning as a Process
Data discrepancy detection

Use metadata (e.g., domain, range, dependency, distribution)

Check field overloading

Check the uniqueness rule, consecutive rule, and null rule
- A uniqueness rule says that each value of the given attribute must be
different from all other values for that attribute.
- A consecutive rule says that there can be no missing values between
the lowest and highest values for the attribute, and that all values must
also be unique.
- A null rule specifies the use of blanks, question marks, special
characters, or other strings that may indicate the null condition (e.g.,
where a value for a given attribute is not available), and how such
values should be handled.
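The three rules above can be sketched as simple checks; the ID values and the set of null markers below are assumptions:

```python
# Sketch of the three rules applied to toy attributes.
ids = [101, 102, 103, 105]  # illustrative ID attribute

# Uniqueness rule: every value differs from all others.
unique_ok = len(ids) == len(set(ids))

# Consecutive rule: unique, and no gaps between lowest and highest value.
consecutive_ok = (unique_ok
                  and sorted(ids) == list(range(min(ids), max(ids) + 1)))

# Null rule: which strings stand in for "value not available" (assumed set).
NULL_MARKERS = {"", "?", "N/A"}
occupations = ["engineer", "?", "teacher"]
null_flags = [v in NULL_MARKERS for v in occupations]
```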

Data Cleaning as a Process

Use commercial tools

Data scrubbing: use simple domain knowledge (e.g., postal
codes, spell-checking) to detect errors and make corrections

Data auditing: analyze the data to discover rules and
relationships and to detect violators (e.g., use correlation and
clustering to find outliers)
Data migration and integration

Data migration tools: allow transformations to be specified

ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface.

The two-step process of discrepancy detection and data
transformation iterates, which is error-prone and time consuming.

New iterative and interactive approaches, e.g., Potter's Wheel, a
publicly available data cleaning tool.

Development of declarative languages