Data Mining & Warehousing
Dr. VIDHYA B
ASSISTANT PROFESSOR & HEAD
Department of Computer Technology
Sri Ramakrishna College of Arts and Science
Coimbatore - 641 006
Tamil Nadu, India
Unit 2 - Preprocessing
Chapter 3: Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization
Summary
Data Quality: Why Preprocess the Data?
Measures for data quality: A multidimensional view
Accuracy: correct or wrong, accurate or not
Completeness: not recorded, unavailable, …
Consistency: some modified but some not, dangling, …
Timeliness: timely update
Believability: how much the data are trusted to be correct
Interpretability: how easily the data can be understood
Data Quality: Why Preprocess the Data?
Example: analyzing a company's branch sales data.
On inspecting the company's database and data warehouse, users of the database system report errors, unusual values, and inconsistencies in the data recorded for some transactions.
The data to be analyzed by data mining techniques are often:
incomplete (lacking attribute values or certain attributes of interest, or containing only aggregate data);
inaccurate or noisy (containing errors, or values that deviate from the expected);
inconsistent (e.g., containing discrepancies in the department codes used to categorize items)
Data Quality: Why Preprocess the Data?
Reasons for inaccurate data (i.e., having incorrect attribute values):
The data collection instruments used may be faulty.
There may have been human or computer errors occurring at data
entry.
Users may purposely submit incorrect data values for mandatory
fields when they do not wish to submit personal information (e.g.,
by choosing the default value “January 1” displayed for birthday).
This is known as disguised missing data.
There may be technology limitations: limited buffer size for
coordinating synchronized data transfer and consumption.
Incorrect data may also result from inconsistencies in naming
conventions or data codes, or inconsistent formats for input fields
(e.g., date). Duplicate tuples also require data cleaning.
Data Quality: Why Preprocess the Data?
Reasons for incomplete data:
Attributes of interest may not always be available, such as customer information for sales transaction data.
Relevant data may not be recorded due to a misunderstanding or because of equipment malfunctions.
The recording of the data history or modifications may have been overlooked. Missing data, particularly for tuples with missing values for some attributes, may need to be inferred.
Timeliness also affects data quality: month-end data that are not updated in a timely fashion have a negative impact on data quality.
Two other factors affecting data quality are believability and interpretability.
Believability reflects how much the data are trusted by users.
Interpretability reflects how easily the data can be understood.
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies.
Dirty data can cause confusion for the mining procedure,
resulting in unreliable output.
Data integration
Integration of multiple databases, data cubes, or files
Some attributes representing a given concept may have
different names in different databases, causing
inconsistencies and redundancies.
e.g., customer_id in one data store and cust_id in another (see the sketch below).
A large amount of redundant data may slow down or confuse
the knowledge discovery process.
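As an illustration, a minimal pandas sketch of reconciling such naming differences before integration; the two store frames and their column names here are hypothetical:

import pandas as pd

# Two hypothetical sources naming the customer key differently.
store_a = pd.DataFrame({"customer_id": [1, 2], "region": ["N", "S"]})
store_b = pd.DataFrame({"cust_id": [1, 2], "sales": [250.0, 310.0]})

# Rename to a common schema, then merge on the shared key so the
# integrated data does not carry a redundant second key column.
store_b = store_b.rename(columns={"cust_id": "customer_id"})
merged = store_a.merge(store_b, on="customer_id", how="inner")
print(merged)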
Major Tasks in Data Preprocessing
Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results.
Dimensionality reduction: data encoding schemes are applied to obtain a reduced or "compressed" representation of the original data, e.g., attribute subset selection (removing irrelevant attributes) and attribute construction (deriving a small set of more useful attributes from the original set).
Numerosity reduction: the data are replaced by alternative, smaller representations using parametric models (e.g., regression or log-linear models) or nonparametric models (e.g., histograms, clusters, sampling, or data aggregation).
Data compression
Data transformation and data discretization
Normalization and concept hierarchy generation are powerful tools that allow data mining at multiple abstraction levels (a min-max normalization sketch follows).
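For instance, a minimal sketch of min-max normalization in pandas; the income values are hypothetical:

import pandas as pd

df = pd.DataFrame({"income": [12000, 47000, 73600, 98000]})

# Min-max normalization rescales each value to [0, 1]:
#   v' = (v - min) / (max - min)
lo, hi = df["income"].min(), df["income"].max()
df["income_norm"] = (df["income"] - lo) / (hi - lo)
print(df)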
Major Tasks in Data Preprocessing
[Figure illustrating the major data preprocessing tasks]
Chapter 3: Data Preprocessing
Data Preprocessing: An Overview
Data Quality
Major Tasks in Data Preprocessing
Data Cleaning
Data Integration
Data Reduction
Data Transformation and Data Discretization
Summary
Data Cleaning - Introduction
Data in the real world is dirty: lots of potentially incorrect
data, e.g., faulty instruments, human or computer errors,
transmission errors
Data cleaning process: [figure]
Data Cleaning - Introduction
Data in the real world is dirty: lots of potentially incorrect data, e.g.,
faulty instruments, human or computer errors, transmission errors
Data cleaning (or data cleansing) routines attempt to fill in missing
values, smooth out noise while identifying outliers, and correct
inconsistencies in the data.
incomplete: lacking attribute values, lacking certain attributes of interest,
or containing only aggregate data
e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
Data Cleaning – Missing Values
Data is not always available
E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus
deleted
data not entered due to misunderstanding
certain data may not be considered important at the
time of entry
history or changes of the data were not registered
Missing data may need to be inferred
Data Cleaning – Missing Values
Ignore the tuple:
usually done when the class label is missing (see the sketch after this list)
Not very effective, unless the tuple contains several
attributes with missing values. It is especially poor when
the percentage of missing values per attribute varies
considerably.
Fill in the missing value manually:
Time consuming and may not be feasible given a large data
set with many missing values.
Use a global constant to fill in the missing value:
Replace all missing attribute values by the same constant
such as a label like “Unknown” or −∞
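A minimal pandas sketch of the first and third strategies; the column names and values are hypothetical:

import pandas as pd

df = pd.DataFrame({"occupation": ["engineer", None, "clerk", None],
                   "label": ["yes", None, "no", "yes"]})

# Ignore the tuple: drop rows whose class label is missing.
df = df.dropna(subset=["label"])

# Global constant: replace remaining missing values with "Unknown".
df["occupation"] = df["occupation"].fillna("Unknown")
print(df)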
Data Cleaning – Missing Values
Use a measure of central tendency for the attribute (e.g.,
the mean or median) to fill in the missing value:
For normal (symmetric) data distributions the mean can be
used, while skewed data distributions should employ the
median (a sketch follows this list)
Use the attribute mean or median for all samples
belonging to the same class as the given tuple:
If the data distribution for a given class is skewed, the
median value is a better choice
Use the most probable value to fill in the missing value:
Determined with regression, inference-based tools using a
Bayesian formalism, or decision tree induction
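A small pandas sketch of the central-tendency strategies, including the class-conditional median; the data are hypothetical:

import numpy as np
import pandas as pd

df = pd.DataFrame({"class": ["A", "A", "B", "B"],
                   "income": [30.0, np.nan, 50.0, np.nan]})

# Overall median fill (preferred over the mean for skewed data).
df["income_overall"] = df["income"].fillna(df["income"].median())

# Fill with the median of the tuple's own class.
df["income_by_class"] = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.median()))
print(df)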
Data Cleaning - Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Three methods to handle noisy data:
Binning
Regression
Outlier Analysis
How to Handle Noisy Data?
Binning is also used as a discretization technique.
First sort the data and partition it into (equal-frequency)
bins.
Then one can smooth by bin means, smooth by bin
medians, smooth by bin boundaries, etc.
In smoothing by bin means,
each value in a bin is replaced by the mean value
of the bin.
For example, the mean of the values 4, 8, and 15
in Bin 1 is 9.
Therefore, each original value in this bin is
replaced by the value 9.
Similarly, smoothing by bin medians can be
employed, in which each bin value is replaced by
the bin median.
In smoothing by bin boundaries, the minimum
and maximum values in a given bin are identified
as the bin boundaries.
Each bin value is then replaced by the closest
boundary value (see the sketch below).
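A numpy sketch of both smoothing variants, assuming the classic nine-value sorted list from which the slide's Bin 1 (4, 8, 15) is drawn:

import numpy as np

# Sorted data partitioned into equal-frequency bins of depth 3;
# Bin 1 is {4, 8, 15}, whose mean is 9, as in the example above.
data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = data.reshape(-1, 3)

# Smoothing by bin means: every value becomes its bin's mean.
by_means = np.repeat(bins.mean(axis=1), 3)

# Smoothing by bin boundaries: snap each value to the nearer of
# the bin's minimum or maximum.
lo = bins.min(axis=1, keepdims=True)
hi = bins.max(axis=1, keepdims=True)
by_bounds = np.where(bins - lo <= hi - bins, lo, hi).ravel()

print(by_means)    # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
print(by_bounds)   # [ 4  4 15 21 21 24 25 25 34]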
How to Handle Noisy Data?
Regression
Data smoothing can also be done by regression.
It fits data values to a function.
Linear regression involves finding the “best” line
to fit two attributes (or variables) so that one
attribute can be used to predict the other.
In multiple linear regression, more than two
attributes are involved and the data are fit to a
multidimensional surface (a one-variable sketch follows).
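A minimal numpy sketch of linear-regression smoothing; the x/y values are hypothetical:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # noisy, roughly y = 2x

# Fit the "best" line through (x, y), then replace the noisy y
# values with the fitted (smoothed) predictions.
slope, intercept = np.polyfit(x, y, deg=1)
y_smoothed = slope * x + intercept
print(y_smoothed)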
Outlier analysis
Outliers may be detected by clustering, where similar
values are organized into groups, or "clusters."
Intuitively, values that fall outside of the set of
clusters may be considered outliers (see the sketch below).
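As one sketch of this intuition, DBSCAN from scikit-learn labels points that belong to no dense cluster as noise; the values and the eps threshold are illustrative:

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups of values plus one isolated value.
values = np.array([[10.0], [11.0], [12.0], [50.0], [51.0], [52.0], [500.0]])

# Points falling in no cluster are labelled -1 (noise), i.e. outliers.
labels = DBSCAN(eps=5.0, min_samples=2).fit_predict(values)
print(values[labels == -1].ravel())   # [500.] is flagged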
Data Cleaning as a Process
Data discrepancy detection
The first step in data cleaning as a process is discrepancy detection
Several factors of data discrepancy detection are:
poorly designed data entry forms have many optional fields
human error in data entry
deliberate errors – users do not want to reveal personal information
data decay – outdated addresses
inconsistent data representations & inconsistent use of codes
errors in instrumentation devices
when the data are (inadequately) used for purposes other than
originally intended.
Inconsistencies due to data integration
Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Check field overloading
Check uniqueness rule, consecutive rule, and null rule (sketched after this list)
- A unique rule says that each value of the given attribute must be
different from all other values for that attribute.
- A consecutive rule says that there can be no missing values between
the lowest and highest values for the attribute, and that all values must
also be unique.
- A null rule specifies the use of blanks, question marks, special
characters, or other strings that may indicate the null condition (e.g.,
where a value for a given attribute is not available), and how such
values should be handled.
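A small pandas sketch checking the uniqueness and consecutive rules on a hypothetical ticket-number attribute:

import pandas as pd

tickets = pd.Series([101, 102, 102, 105], name="ticket_no")

# Uniqueness rule: every value must differ from all other values.
duplicates = tickets[tickets.duplicated(keep=False)]

# Consecutive rule: no gaps between the lowest and highest values.
expected = set(range(tickets.min(), tickets.max() + 1))
gaps = sorted(expected - set(tickets))

print("duplicate values:", duplicates.tolist())   # [102, 102]
print("missing in sequence:", gaps)               # [103, 104]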
Data Cleaning as a Process
Use commercial tools
Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
Data auditing: by analyzing data to discover rules and
relationship to detect violators (e.g., correlation and
clustering to find outliers)
Data migration and integration
Data migration tools: allow transformations to be specified
ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface.
The two processes of discrepancy detection and data
transformation iterate, which is error-prone and time-consuming.
New iterative and interactive approaches –
e.g., Potter's Wheel, a publicly available data cleaning tool.
Development of declarative languages for specifying data transformation operators