IBM Software Group
IBM Information Server
Delivering information you can trust
Understand: Discover, model, and govern information structure and content
Cleanse: Standardize, merge, and correct information
Transform: Combine and restructure information for new uses
Deliver: Synchronize, virtualize and move information for in-line delivery
Platform Services: Parallel Processing, Connectivity, Metadata, Deployment, Administration
Support for Service-Oriented Architectures
The IBM Solution: IBM Information Server
Delivering information you can trust
Understand Transform Deliver
Parallel Processing
Rich Connectivity to Applications, Data, and Content
IBM Information Server
Unified Deployment
Unified Metadata Management
Cleanse: WebSphere QualityStage
Data cleansing, standardization, matching, and survivorship for enhancing data quality and creating coherent business views
Need for Data Quality
Critical Problems
Need to create & maintain 360 degree views of customers, suppliers, products, locations, events
Need to leverage data - make reliable decisions, comply with regulations, meet service agreements
Why?
No common standards across organization
Unexpected values stored in fields
Required information buried in free-form fields
Fields evolve - used for multiple purposes
No reliable keys for consolidated views
Operational data degrades 2% per month
Alternative Approaches
Denial - problem misunderstood and ignored until too late; load and explode
Hand-coding - clerical exception processing; very time consuming and resource intensive
Simplistic cleansing apps - evolved from direct marketing & list hygiene, lack flexibility
Data Sources / Data Values:
Kent Fried Chick
Kentucky Fried
Kentucky Fried Chicken
KFC
Molly Talber DBA KFC
Mrs. M. Talber
John & Molly Talber
Talber, KFC, ATIMA
227G CB&NATURAL STICK MOZZ WRAPPER
227G CB&NAT STICK P QUE/MOZZ WRAPP.
Why Should I Care About Cleansing Information?
Lack of information standards
Different formats & structures across different systems
Data surprises in individual fields
Data misplaced in the database
Information buried in free-form fields
Data myopia
Lack of consistent identifiers inhibits a single view
The redundancy nightmare
Duplicate records with a lack of standards
Kate A. Roberts 416 Columbus Ave #2, Boston, Mass 02116
Catherine Roberts Four sixteen Columbus APT2, Boston, MA 02116
Mrs. K. Roberts 416 Columbus Suite #2, Suffolk County 02116
Name | Tax ID | Telephone
J Smith DBA Lime Cons. | 228-02-1975 | 6173380300
Williams & Co. C/O Bill | 025 -37-1888 | 415-392-2000
1st NatlProvident | 34-2671434 | 3380321
HP 15 State St. | 508-466-1200 | Orlando
WING ASSY DRILL 4 HOLE USE 5J868A HEXBOLT 1/4 INCH
WING ASSEMBY, USE 5J868 -A HEX BOLT .25”-DRILL FOUR HOLES
USE 4 5J868A BOLTS (HEX .25) -DRILL HOLES FOR EA ON WING ASSEM
RUDER, TAP 6 WHOLES, SECURE W/KL2301 RIVETS (10 CM)
19-84-103 RS232 Cable 6' M-F CandS
CS-89641 6 ft. Cable Male-F, RS232 #87951
C&SUCH6 Male/Female 25 PIN 6 Foot Cable
90328574 IBM 187 N.Pk. Str. Salem NH 01456
90328575 I.B.M. Inc. 187 N.Pk. St. Salem NH 01456
90238495 Int. Bus. Machines 187 No. Park St Salem NH 04156
90233479 International Bus. M. 187 Park Ave Salem NH 04156
90233489 Inter-Nation Consults 15 Main Street Andover MA 02341
90345672 I.B. Manufacturing Park Blvd. BostnoMA 04106
Importance of Data Quality
Low data quality impacts an organization in several ways
Poor data quality leads to misguided marketing promotions
Cross sell opportunities may be missed because the same customer appears several times in slightly different ways
Valued customers may not be recognized during support calls or other important touchpoints
Data mining is difficult because related items are not detected as related
What is good data quality?
Two percent of “bad” data doesn’t sound that bad?
Two percent of 10M rows means that you have 200K errors
200K errors add up to a big problem for analytics/operations/anything!
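The arithmetic above can be checked with a minimal sketch. The function name `expected_errors` is hypothetical and only illustrates the calculation; integer percent is used to avoid floating-point rounding.

```python
def expected_errors(total_rows, bad_percent):
    """Return the number of rows expected to be bad, given a whole-number percent."""
    return total_rows * bad_percent // 100


# Two percent of 10M rows, as on the slide.
errors = expected_errors(10_000_000, 2)
print(f"{errors:,} bad rows")
```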
Enterprise initiatives…
Supply chain collaboration & item synchronization
Inventory consolidation
Single view of a customer or supplier
ERP Implementations
ERP instance consolidation
IT System renovation
Consolidation resulting from M&A activity
Enterprise Data Warehouse
Compliance & Regulatory projects (SOX, HIPAA, ACCORD, etc.)
…need high quality data…
…to satisfy critical business requirements.
Compliance
Business to Business Standards
Risk Management
Reduce Costs & Increase Productivity
Increase Revenue / CRM Payoff
Business Intelligence Payoff
IBM WebSphere QualityStage
Shared design environment with DataStage increases functionality and reduces development time
Visual match rule interface simplifies match tuning
Service orientation provides ‘continuous’ quality & delivers confidence in your data
Parallel architecture shortens execution time
How will you get an accurate, consolidated view of your business?
Customers, Transactions, Vendors / Suppliers, Products / Materials
WebSphere QualityStage Process:
1. Free Form Investigation
2. Data Standardization
3. Data Matching
4. Data Survivorship
Target: Database with Consolidated Views
Why Investigate
Discover trends and potential anomalies in the data
100% visibility of single domain and free-form fields
Identify invalid and default values
Reveal undocumented business rules and common terminology
Verify the reliability of the data in the fields to be used as matching criteria
Gain complete understanding of data within context
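One way to picture single-domain investigation is a frequency profile of values and of their character patterns. The sketch below is illustrative only and is not the QualityStage implementation; `char_pattern` and `profile` are hypothetical helper names, and the sample values are taken from the Telephone column shown earlier in this deck.

```python
from collections import Counter


def char_pattern(value):
    """Map each digit to 'n' and each letter to 'a'; keep other characters as-is."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("n")
        elif ch.isalpha():
            out.append("a")
        else:
            out.append(ch)
    return "".join(out)


def profile(values):
    """Return (value frequencies, pattern frequencies) for one field."""
    return Counter(values), Counter(char_pattern(v) for v in values)


phones = ["6173380300", "415-392-2000", "3380321", "Orlando"]
value_counts, pattern_counts = profile(phones)
for pattern, count in pattern_counts.most_common():
    print(pattern, count)
```

Rare or unexpected patterns (here, a seven-digit number and a city name in a phone field) are the candidates for invalid or misplaced values.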
Investigation - Free Form
“The instructions for handling the data are inherent within the data itself.”
Parsing: Separating multi-valued fields into individual pieces
123 | St. | Virginia | St.
Lexical analysis: Determining business significance of individual pieces
123 | St. | Virginia | St.
number | street type | state | street type
Context Sensitive: Identifying various data structures and content
123 | St. Virginia | St.
House Number | Street Name | Street Type
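The three steps can be sketched on the "123 St. Virginia St." example. This is a simplified illustration, not the QualityStage rule engine: the lookup sets `STREET_TYPES` and `STATES` and the functions `parse`, `classify`, and `interpret` are all hypothetical.

```python
# Hypothetical lookup tables; a real rule set would carry far larger classifications.
STREET_TYPES = {"ST", "AVE", "RD", "BLVD"}
STATES = {"VIRGINIA", "GEORGIA", "OHIO"}


def parse(text):
    """Parsing: split a free-form field into individual pieces."""
    return text.split()


def classify(token):
    """Lexical analysis: assign a class to a single piece, ignoring context."""
    word = token.rstrip(".").upper()
    if word.isdigit():
        return "number"
    if word in STREET_TYPES:
        return "street type"
    if word in STATES:
        return "state"
    return "unknown"


def interpret(tokens):
    """Context-sensitive step: a leading number and a trailing street type
    bracket the street name, whatever the middle tokens look like alone."""
    classes = [classify(t) for t in tokens]
    if len(tokens) >= 3 and classes[0] == "number" and classes[-1] == "street type":
        return {
            "house number": tokens[0],
            "street name": " ".join(tokens[1:-1]),
            "street type": tokens[-1],
        }
    return None


tokens = parse("123 St. Virginia St.")
print([classify(t) for t in tokens])
print(interpret(tokens))
```

Taken alone, "St." and "Virginia" look like a street type and a state; only their position in the pattern shows that together they form the street name.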
Rule Sets
Pre-defined rules for parsing and standardizing:
Name
Address
Area (City, State and Zip)
Multi-national address processing
Validate structure:
Tax ID
US Phone
Date
Email
Append ISO country codes
Pre-process or filter name, address and area
Rule sets are stored in the common repository
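As a rough illustration of what a "validate structure" rule checks, the sketch below tests Tax ID and US Phone values from the earlier sample table. The patterns and the function names `valid_tax_id` and `valid_us_phone` are assumptions made for this example; they are not the shipped rule sets.

```python
import re

# Assumed formats: SSN-style nnn-nn-nnnn or EIN-style nn-nnnnnnn.
TAX_ID = re.compile(r"\d{3}-\d{2}-\d{4}|\d{2}-\d{7}")


def valid_tax_id(value):
    """True when the whole value matches one of the assumed Tax ID layouts."""
    return TAX_ID.fullmatch(value.strip()) is not None


def valid_us_phone(value):
    """True when the value contains exactly ten digits once punctuation is removed."""
    digits = re.sub(r"\D", "", value)
    return len(digits) == 10


for tax_id in ["228-02-1975", "34-2671434", "508-466-1200"]:
    print(tax_id, valid_tax_id(tax_id))
for phone in ["6173380300", "415-392-2000", "3380321", "Orlando"]:
    print(phone, valid_us_phone(phone))
```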
Standardization - Example
Input File:
Address Line 1 | Address Line 2
639 N MILLS AVENUE | ORLANDO, FLA 32803
306 W MAIN STR, | CUMMING, GA 30130
3142 WEST CENTRAL AV | TOLEDO OH 43606
843 HEARD AVE | AUGUSTA-GA-30904
1139 GREENE ST ACCT #1234 | AUGUSTA GEORGIA 30901
4275 OWENS ROAD SUITE 536 | EVANS GA 30809
Result File:
House # | Dir | Str. Name | Type | Unit | No. | NYSIIS | City | SOUNDEX | State | Zip | ACCT#
639 | N | MILLS | AVE | | | MAL | ORLANDO | O645 | FL | 32803 |
306 | W | MAIN | ST | | | MAN | CUMMING | C552 | GA | 30130 |
3142 | W | CENTRAL | AVE | | | CANTRAL | TOLEDO | T430 | OH | 43606 |
843 | | HEARD | AVE | | | HAD | AUGUSTA | A223 | GA | 30904 |
1139 | | GREENE | ST | | | GRAN | AUGUSTA | A223 | GA | 30901 | 1234
4275 | | OWENS | RD | STE | 536 | ON | EVANS | E152 | GA | 30809 |
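Two parts of this example can be sketched in a few lines: replacing variant tokens with a standard abbreviation, and computing the SOUNDEX code for the city. The `ABBREVIATIONS` table and the `standardize` function are hypothetical and far simpler than a real rule set (for instance, they would wrongly abbreviate a street that is actually named "WEST"). The `soundex` function follows the standard American Soundex rules.

```python
import re

# Hypothetical lookup covering only the variants seen in the input file.
ABBREVIATIONS = {
    "AVENUE": "AVE", "AV": "AVE", "STR": "ST", "ROAD": "RD",
    "SUITE": "STE", "WEST": "W", "FLA": "FL", "GEORGIA": "GA",
}

SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def standardize(line):
    """Upper-case, split on spaces, commas and hyphens, and map known variants."""
    tokens = re.split(r"[\s,\-]+", line.upper().strip())
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens if t)


def soundex(name):
    """Standard American Soundex: first letter plus three digits."""
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    result = name[0]
    prev = SOUNDEX_CODES.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":
            continue  # H and W do not separate consonants with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is emitted again
    return (result + "000")[:4]


print(standardize("3142 WEST CENTRAL AV"))
print(standardize("AUGUSTA-GA-30904"))
print(soundex("ORLANDO"), soundex("CUMMING"), soundex("AUGUSTA"))
```

The phonetic codes give later matching steps a key that tolerates spelling variation in names.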
Why Match
Identify duplicate entities within one or more files
Perform householding
Create consolidated view of customer
Establish cross-reference linkage
Enrich existing data with new attributes from external sources
Two Methods to Decide a Match
WILLIAM J KAZANGIAN 128 MAIN ST 02111 12/8/62
WILLAIM JOHN KAZANGIAN 128 MAINE AVE 02110 12/8/62
Are these two records a match?
Deterministic Decision Tables:
• Fields are compared
• Letter grade assigned
• Combined letter grades are compared to a vendor delivered file
• Result: Match; Fail; Suspect
B B A A B D B A = BBAABDBA
Probabilistic Record Linkage:
• Fields are evaluated for degree-of-match
• Weight assigned: represents the “information content” by value
• Weights are summed to derive a total score
• Result: Statistical probability of a match
+5 +2 +20 +3 +4 -1 +7 +9 = +49
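The probabilistic approach can be sketched with Fellegi-Sunter style weights, one common formulation of record linkage; it is not claimed to be the exact QualityStage computation. The (m, u) probabilities, the split of each record into fields, and the use of exact equality instead of degree-of-match are all assumptions made for illustration.

```python
import math

# Hypothetical probabilities per field:
#   m = P(field agrees | records are the same entity)
#   u = P(field agrees | records are different entities)
FIELD_PROBABILITIES = {
    "first": (0.90, 0.02),
    "last": (0.95, 0.001),
    "house": (0.90, 0.05),
    "street": (0.90, 0.05),
    "zip": (0.90, 0.01),
    "dob": (0.95, 0.001),
}


def field_weight(agrees, m, u):
    """Agreement earns log2(m/u); disagreement earns log2((1-m)/(1-u)), a negative weight."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_score(rec_a, rec_b):
    """Sum the per-field weights into a total score."""
    return sum(
        field_weight(rec_a[name] == rec_b[name], m, u)
        for name, (m, u) in FIELD_PROBABILITIES.items()
    )


record_a = {"first": "WILLIAM", "last": "KAZANGIAN", "house": "128",
            "street": "MAIN ST", "zip": "02111", "dob": "12/8/62"}
record_b = {"first": "WILLAIM", "last": "KAZANGIAN", "house": "128",
            "street": "MAINE AVE", "zip": "02110", "dob": "12/8/62"}

score = match_score(record_a, record_b)
print(round(score, 2))
```

Agreement on a rare value (a small u, such as an uncommon surname) contributes a larger weight than agreement on a common one, which is the sense in which a weight represents information content.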
Why Survive
Provide consolidated view of data
Provide consolidated view containing the “best-of-breed” data
Resolve conflicting values and fill missing values
Cross-populate best available data
Implement business and mapping rules
Create cross-reference keys
Survivorship - Example
Survivorship Input (Match Output)
Group | Legacy | First | Middle | Last | No. | Dir. | Str. Name | Type | Unit | No.
1 | D150 | Bob | | Dixon | 1500 | SE | ROSS CLARK | CIR | |
1 | A1367 | Robert | | Dickson | 1500 | | ROSS CLARK | CIR | |
23 | D689 | Ernest | A | Obrian | 5901 | SW | 74TH | ST | STE | 202
23 | A436 | Ernie | Alex | O’Brian | 5901 | SW | 74TH | ST | |
23 | D352 | Ernie | | Obrian | 5901 | | 74 | ST | # | 202
Consolidated Output
Group | First | Middle | Last | No. | Dir. | Str. Name | Type | Unit | No.
1 | Robert | | Dickson | 1500 | SE | ROSS CLARK | CIR | |
23 | Ernie | Alex | O’Brian | 5901 | SW | 74TH | ST | STE | 202
Group | Legacy
1 | D150
1 | A1367
23 | D689
23 | A436
23 | D352
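A minimal survivorship sketch, assuming a single hypothetical rule ("keep the longest non-empty value per field") applied to the group 1 records above. Real survivorship rules are configured per field; this one rule reproduces the group 1 output but would not reproduce every field of group 23 (it would pick "Ernest" rather than "Ernie"). The function names and the reduced set of fields are illustrative.

```python
from collections import defaultdict


def survive(records, skip=("group", "legacy")):
    """Build one consolidated record per match group using the
    longest non-empty value of each field (hypothetical rule)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["group"]].append(rec)
    consolidated = {}
    for group_id, members in groups.items():
        best = {}
        for field in members[0]:
            if field in skip:
                continue
            values = [m[field] for m in members if m[field]]
            best[field] = max(values, key=len) if values else ""
        consolidated[group_id] = best
    return consolidated


def cross_reference(records):
    """Map each group to the legacy keys of the records it consolidates."""
    xref = defaultdict(list)
    for rec in records:
        xref[rec["group"]].append(rec["legacy"])
    return dict(xref)


records = [
    {"group": 1, "legacy": "D150", "first": "Bob", "last": "Dixon", "dir": "SE"},
    {"group": 1, "legacy": "A1367", "first": "Robert", "last": "Dickson", "dir": ""},
]
print(survive(records))
print(cross_reference(records))
```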
How Does WebSphere QualityStage Integrate
Database: DB2, Oracle, Sybase, Onyx, IDMS, etc.
Data Extraction and Load Routines
QualityStage:
1. Investigation
2. Standardization
3. Integration
4. Survivorship
Target: DB2, Oracle, Sybase, Onyx, IDMS, etc.
WebSphere DataStage and
WebSphere QualityStage: Fully Integrated!
QualityStage: Data Quality Extensions
IBM WebSphere QualityStage GeoLocator
IBM WebSphere QualityStage Postal Verification Products
WAVES (Worldwide): IBM WebSphere Worldwide Address Verification Solution
IBM WebSphere QualityStage Postal Certification Products
CASS (United States)
SERP (Canada)
DPID (Australia)
IBM Information Server Data Quality Module for SAP
IBM WebSphere QualityStage for Siebel
Key Strengths for IBM QualityStage
Intuitive, “Design as you think” User Interface
Simple rule design & fine tuning
Seamless Data Flow integration
Intuitive rule design & fine tuning
Defining the technology standard with SOA
Industry leading probabilistic matching engine