Remediating Data Quality by using Data cleaning

gabbz308 15 views 10 slides Sep 06, 2024
Slide 1
Slide 1 of 10
Slide 1
1
Slide 2
2
Slide 3
3
Slide 4
4
Slide 5
5
Slide 6
6
Slide 7
7
Slide 8
8
Slide 9
9
Slide 10
10

About This Presentation

Data Cleaning


Slide Content

1/18/18
1
DATA CLEANING & DATA
QUALITY
Admin
¨Sign-up for presentations
¤Volunteers for next week
¨Projects
¤Finalize project topics by end of the month
2
Poor Data Quality
Poor data quality is a serious problem for organizations
¤Incorrect decision making
q“multiple versions of the truth”, lack of user confidence in data
¤Operational inefficiencies and mistakes
qMajor Canadian bank faxed sensitive customer data to US junk yard
(incorrect extra digit in fax number)
¤Waste of time and money [Gartner Research Report]
qOrganizations are each losing an average of $8.2M annually
qMore than 25% of critical data in Fortune 1000 companies will
continue to be inaccurate and incomplete
3
Examples
Class EDUC MR OCC REL SAL
pri vateBachelor MarriedExec Husband>50k
pri vateBachelor MarriedProf Husband>50k
self-empMasters MarriedExec Wife >50k
? HS Di v orc e d? Not-fam<50k
self-empMasters MarriedAdmin Wife <50k
nev-workPrimary Si n gl e? child >50k
? Bachelor Di v orc e d? Not-fam>50k
Sample from US Census
City #Beds# BathsMonthlyRent
To r o n t o ,ON1 1 $1000
To r o n t o ,ON1 1 $900
To r o n t o1 1 $380To r o n t o , O h i o , U S A ?
About:IBM
An Entity of Type : Public company
LocationCity: Uni te d S tate s
Armo nk, Ne w Yo rk
NumberOfEmployees: 426751
Da t eOfDea t h: 1956
44

1/18/18
2
Causes of Inconsistencies
¨Human Error
¨We a k e n fo rcemen t o f co n stra i nts (o r l a ck o f co n stra i nts)
¨Results of data integration or information extraction
¨Violation of constraints due to errors or schema/data evolution
C1: [S al ary < $100k] à[rate = 5%]
C2: [S al ary > $100k] à[rate = 3%]
Cl i e nt I DName Net Wort hSalary Ot he r Pr oduct sAmt Rat e
225 Jan eWalters$300k $85k $15k $150k5%
420 TracySmith $800k $150k $90k $350k3%
333 Jen Wright $100k $65k $0 $200k5%
128 RonMaples $235k $70k $45k $200k4%
5
Objective
¨Integrity constraints are the primary means for preserving data
integrity
¨Use constraints as the benchmark for defining conditions the data
should satisfy
¨Constraints may be stale, or better constraints may be available
to meet the application needs
Leverage the power of constraints to improve data quality.
6
F | I and [Y], t [Y]t
then [X] t [X] tif I, t,t
21
2121
==
=
F1: [Income] à[Privileges]
New Semantic Constraints
Cl i e nt I DIncomePri vi l egesCount r yAmt Status
225 Low None USA $150kApproved
420 High Full USA $350kApproved
333 MiddleLimited USA $200kApproved
128 MiddleLimited USA $200kApproved
193 MiddleLimited UK $0 Pending
550 Low None UK $0 Pending
243 High Full UK $200kIncomplete
376 High Full UK $175kApproved
Functional Dependency (FD) F: X àY
F2: [Amt] à[Status]
Conditional Functional Dependencies (CFD)
[Maher97,Bohannon
+
07]
[Country = ‘USA’, Amt] à[Status]
7
[Status = ‘Pending’] à[Amt = $0]
7
Data and Constraint Repair
Cl i e nt I DIncomeOt he r Pr odRat eCount r yCi t y State
225 High A 2.1%USA SanFranciscoMD
420 High A 2.1%USA San FranciscoCA
333 High B 3.0%USA San FranciscoCA
576 High B 3.0%USA San FranciscoCA
128 Low C 4.5%UK Reading Berkshire
193 Low C 4.5%UK London London
550 Low B 3.5%UK London London
F1: [Income ] à[Rate]

F2: [Country, City] à[State]
1)Repair the data or the constraints?
2)How to find the repairs?
8

1/18/18
3
Outline for Today
¨Review and discuss this week’s readings
¨Advice for giving presentations
¨Project topics
9
This Week’s Papers
1.E. Rahm, H. Do. Data Cleaning: Problems and Current
Approaches.IEEE Data Eng. Bull. 23 (4), pp. 3-13.
2.W. F a n , F. Geerts. Foundations of Data Quality Management.
Synthesis Lectures on Data Management.
10
Data Quality and Data Integration
S1
S2
S4
S3
Josephine Smith Jo Smith
J. Smith
Joe Smith
11
Single vs. Multi-Source
What examples can you provide for each type of
problem?
12

1/18/18
4
Data Cleaning Steps
1.Data Analysis
¤Identify the inconsistencies
2.Defining workflow and rules
¤What is the data cleaning process? What are the data
quality rules? Can we programmatically define the
transformation process?
3.Verification
¤Validate the rules and data cleaning process were
correct. How do we validate? Suggestions?
4.Execution
13
Data Analysis: Example
14
Example 1
Number of cars sold with this colour
0
100
200
300
400
500
600
blackblue buleorangered greenwhiteyellow
Cr e dit: David Co r ne
15
Ex 2: What problems do you observe?
Histogram of Online Supermarket customers by age field
0
100
200
300
400
500
600
<00-1011-2021-3031-4041-5051-6061-7071-8081-90>90
16

1/18/18
5
Tools
¨Many tools available to help…
¨Data analysis tools (identify potential errors)
¤Commercial tools
¨Domain specific tools
¤Addresses, customer relationship management, energy
usage data, etc.
¨Duplicate Elimination
¨ETL
¤Good at providing aggregated data summaries
17
Let’s Discuss
1.What do you think is the main contribution of this
paper? What did you take away?
2.Strengths?
3.Weaknesses?
4.Improvements?
18
¨W. Fan, F. Geerts. Foundations of Data Quality
Management. Synthesis Lectures on Data
Management.
19
Data Quality
¨DBMS focus on efficient processing of large
quantities of data, but not quality.
¤If we provide poor quality data, we will get poor
quality results.
¨Real data is often dirty, duplicated and inconsistent
DB
20

1/18/18
6
Examples
¨US: Pentagon asked 200+ dead officers to re-enlist
¨UK: there are 81M national insurance numbers but
only 60M people are eligible
¨Australia: 500K dead people retain active medical
cards
¨Typical data error rate in industry: 1%-5%
Source: F. Ge e rts, W. Fan, X. Jia
21
Lots of Statistics
¨Poor data costs companies $600B annually
¨Incorrectly priced data costs retailers in the US
$2.5B annually
¨30% -80% of time spent on data integration
projects is for data cleaning
¨The market for data quality tools is growing at
17% annually (vs. 7% for other IT segments)
22
Central Issues in DQ
¨Data Consistency: whether the data contains errors or
conflicts that arise as violations of rules.
¤Example: age = 82 and age = 28 for the same patient
¤Different types of rules may be used to judge consistency
¨Accuracy: how close a value (representing an entity) is
to the true value.
¤Example: age <= 200 vs. age = 45
23
Data Quality Issues (cont’d)
¨Completeness: whether a query can be answered given
the information available.
¤Example: age = null (missing value) in a patient record, or
missing patient record (missing tuple)
¤What is the degree of incompleteness?
¨Currency:whether the data is too outdated to answer a
query.
¤Example: Census results in 2013, based on old address and
demographic information from 2010
¤We want timely data!Can y o u thi nk o f an e x ampl e whe re
timeliness doesn’t matter?
24

1/18/18
7
Data Quality Issues (cont’d)
¨Deduplication: whether two records refer to the
same entity
¤Example: If Jane Smith and Joan Smith have the same
email address, are they referring to the same person?
25
Improving DQ with Constraints
¨Need constraints to help improve DQ, how:
¨Types of errors:
¤Syntactic: values do not conform to type, length, domain
range [easier to identify]
¤Semantic: data values deviate from the true values of the
entity they represent [harder]
¨Constraints model the intended application semantics
¤Many different types of constraints exist
¤Identify patterns, similarity, containment of tuples,
timeliness of data
26
Constraint Considerations
¨Discovering data quality rules
¨Identifying errors
¨Repairing the data (or the constraints)
27
Interactions Between DQ Issues
¨These issues clearly do not occur in isolation!
¨Let’s consider a case study…
28

1/18/18
8
Let’s Discuss
1.What do you think is the main contribution of this
paper? What did you take away?
2.Strengths?
3.Weaknesses?
4.Improvements?
Do you think any one of the properties is
more important than the others?
29
Case Study
¨The International Seismological Centre (ISC)
(located in the UK)
¨It’s a charity funded by various governments.
¨Their role is to be the repository for recording all
earthquake events on the planet.
Cr e dit: David Co r ne
30
Gathering Data
ISCData gathering centres
31
Data Delivery
¨Raw seismograph data from local collection points to
‘data gathering’ centres.
¨Send to ISC
¨Some data gathering centres provide raw data, some
provide interpreted data
¤Some will send partial data if they believe it’s erroneous
32

1/18/18
9
Data Integration
¨Integrate the raw data from the incoming sources
¨ISC job is actually to determine:
¤where and when the Earth tremors were
¤Hundreds per month around the world
¤Reverse engineer from seismograph readings
33
Analysis
¨Where can errors arise in data …
Accuracy
Consistency Currency
Completeness Du p l i c a t es
What other factors are important to consider?
34
Presentation Tips
35
Giving Presentations
Things to avoid:
¨You don’t want people to lose focus on you (the presenter)
¤Do not make the slides the focus of the audience
¨Lose the attention of the audience altogether
¨Don’t disrespect the audience –they can read and don’t
need a voice over of words
36

1/18/18
10
Advice
1)Limit the number of slides. Guideline: about 2-3
minutes per slide
2)Limit the words on the page
3)Use imagery: pictures help the audience
remember and you will need to describe what the
picture is about
37
Advice (cont’d)
4) A presentation should aim to focus on three major
points. Any more and the audience may forget.
oMake them clear, make them relevant and most all…make
them.
5) Make it personal: if you inject a personal touch it will
connect with the audience, and you are likely to be
more engaged and animated.
6) Use stories: effective way to get a message across.
Make sure the story is relevant to the key message you
are conveying and make it short.
38
Project Topics
¨Yo u are e nco urage d to co me up with y o ur o wn ide as,
and discuss them with me.
¨Send me a ½ page proposal idea
¨Finalize a topic in the next few weeks
¨I will propose a few projects if needed.
39
Tags