Qualitystage IBM Infosphere server cleanse

ssusera92ed61 15 views 22 slides Sep 04, 2024

IBM Software Group
©IBM Corporation
IBM Information Server
Cleanse - QualityStage

IBM Information Server
Delivering information you can trust
Understand: Discover, model, and govern information structure and content
Cleanse: Standardize, merge, and correct information
Transform: Combine and restructure information for new uses
Deliver: Synchronize, virtualize and move information for in-line delivery
Platform Services: Parallel Processing, Connectivity, Metadata, Deployment, Administration
Support for Service-Oriented Architectures

The IBM Solution: IBM Information Server
Delivering information you can trust
Understand, Cleanse, Transform, Deliver
Parallel Processing
Rich Connectivity to Applications, Data, and Content
Unified Deployment
Unified Metadata Management
Cleanse: WebSphere QualityStage
Data cleansing, standardization, matching, and survivorship for enhancing data quality and creating coherent business views

Need for Data Quality
Critical Problems
Need to create & maintain 360-degree views of customers, suppliers, products, locations, events
Need to leverage data: make reliable decisions, comply with regulations, meet service agreements
Why?
No common standards across the organization
Unexpected values stored in fields
Required information buried in free-form fields
Fields evolve and are used for multiple purposes
No reliable keys for consolidated views
Operational data degrades 2% per month
Alternative Approaches
Denial: problem misunderstood and ignored until too late; load and explode
Hand-coding: clerical exception processing; very time consuming and resource intensive
Simplistic cleansing apps: evolved from direct marketing & list hygiene; lack flexibility
Example data values from different data sources:
Kent Fried Chick
Kentucky Fried
Kentucky Fried Chicken
KFC
Molly Talber DBA KFC
Mrs. M. Talber
John & Molly Talber
Talber, KFC, ATIMA
227G CB&NATURAL STICK MOZZ WRAPPER
227G CB&NAT STICK P QUE/MOZZ WRAPP.

Why Should I Care About Cleansing Information?
Lack of information standards: different formats & structures across different systems
Data surprises in individual fields: data misplaced in the database
Information buried in free-form fields
Data myopia: lack of consistent identifiers inhibits a single view
The redundancy nightmare: duplicate records with a lack of standards
Examples:
Kate A. Roberts 416 Columbus Ave #2, Boston, Mass 02116
Catherine Roberts Four sixteen Columbus APT2, Boston, MA 02116
Mrs. K. Roberts 416 Columbus Suite #2, Suffolk County 02116
Name Tax ID Telephone
J Smith DBA Lime Cons. 228-02-1975 6173380300
Williams & Co. C/O Bill 025 -37-1888 415-392-2000
1st NatlProvident 34-2671434 3380321
HP 15 State St. 508-466-1200 Orlando
WING ASSY DRILL 4 HOLE USE 5J868A HEXBOLT 1/4 INCH
WING ASSEMBY, USE 5J868 -A HEX BOLT .25”-DRILL FOUR HOLES
USE 4 5J868A BOLTS (HEX .25) -DRILL HOLES FOR EA ON WING ASSEM
RUDER, TAP 6 WHOLES, SECURE W/KL2301 RIVETS (10 CM)
19-84-103 RS232 Cable 6' M-F CandS
CS-89641 6 ft. Cable Male-F, RS232 #87951
C&SUCH6 Male/Female 25 PIN 6 Foot Cable
90328574 IBM 187 N.Pk. Str. Salem NH 01456
90328575 I.B.M. Inc. 187 N.Pk. St. Salem NH 01456
90238495 Int. Bus. Machines 187 No. Park St Salem NH 04156
90233479 International Bus. M. 187 Park Ave Salem NH 04156
90233489 Inter-Nation Consults 15 Main Street Andover MA 02341
90345672 I.B. Manufacturing Park Blvd. BostnoMA 04106

Importance of Data Quality
Low data quality impacts an organization in several ways
Poor data quality leads to misguided marketing promotions
Cross-sell opportunities may be missed because the same customer appears several times in slightly different ways
Valued customers may not be recognized during support calls or other important touchpoints
Data mining is difficult because related items are not detected as related
What is good data quality?
Two percent of "bad" data doesn't sound that bad?
Two percent of 10M rows means that you have 200K errors
200K errors add up to a big problem for analytics, operations, anything!

Enterprise initiatives…
Supply chain collaboration & item synchronization
Inventory consolidation
Single view of a customer or supplier
ERP implementations
ERP instance consolidation
IT system renovation
Consolidation resulting from M&A activity
Enterprise Data Warehouse
Compliance & regulatory projects (SOX, HIPAA, ACCORD, etc.)
…need high quality data…
…to satisfy critical business requirements.
Compliance
Business to Business Standards
Risk Management
Reduce Costs & Increase Productivity
Increase Revenue / CRM Payoff
Business Intelligence Payoff

IBM WebSphere QualityStage
Shared design environment with DataStage increases functionality and reduces development time
Visual match rule interface simplifies match tuning
Service orientation provides 'continuous' quality & delivers confidence in your data
Parallel architecture shortens execution time

How will you get an accurate, consolidated view of your business?
Sources: Customers, Transactions, Vendors / Suppliers, Products / Materials
WebSphere QualityStage Process:
1. Free Form Investigation
2. Data Standardization
3. Data Matching
4. Data Survivorship
Target: Database with Consolidated Views

Why Investigate
Discover trends and potential anomalies in the data
100% visibility of single domain and free-form fields
Identify invalid and default values
Reveal undocumented business rules and common terminology
Verify the reliability of the data in the fields to be used as matching criteria
Gain complete understanding of data within context

Investigation - Free Form
"The instructions for handling the data are inherent within the data itself."
Parsing: Separating multi-valued fields into individual pieces
123 | St. | Virginia | St.
Lexical analysis: Determining business significance of individual pieces
123 = number, St. = street type, Virginia = state, St. = street type
Context Sensitive: Identifying various data structures and content
123 = House Number, St. Virginia = Street Name, St. = Street Type
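The three steps on this slide (parsing, lexical analysis, context-sensitive interpretation) can be sketched in a few lines of Python. This is only an illustration of the idea, not how QualityStage implements it: the lookup tables, the function names `parse`, `classify`, and `interpret`, and the position-based context rule are all assumptions made for the example.

```python
# Hypothetical lookup tables; real rule sets are far larger.
STREET_TYPES = {"ST", "AVE", "RD"}
STATES = {"VIRGINIA", "GEORGIA", "OHIO"}


def parse(text):
    """Parsing: split a free-form field into individual tokens."""
    return text.split()


def classify(token):
    """Lexical analysis: label one token without looking at its neighbours."""
    bare = token.rstrip(".,").upper()
    if bare.isdigit():
        return "number"
    if bare in STREET_TYPES:
        return "street type"
    if bare in STATES:
        return "state"
    return "word"


def interpret(tokens):
    """Context-sensitive step: use token position to assign address fields.

    Assumed rule: a leading number is the house number, a trailing
    street-type token is the street type, and everything in between
    is the street name (even if it looks like a state or a type).
    """
    rest = list(tokens)
    fields = {"house_number": "", "street_name": "", "street_type": ""}
    if rest and classify(rest[0]) == "number":
        fields["house_number"] = rest.pop(0)
    if rest and classify(rest[-1]) == "street type":
        fields["street_type"] = rest.pop()
    fields["street_name"] = " ".join(rest)
    return fields


tokens = parse("123 St. Virginia St.")
labels = [classify(t) for t in tokens]
address = interpret(tokens)
```

Taken in isolation, the first "St." looks like a street type and "Virginia" looks like a state; only the context step recognizes "St. Virginia" as the street name.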

Rule Sets
Pre-defined rules for parsing and standardizing:
Name
Address
Area (City, State and Zip)
Multi-national address processing
Validate structure:
Tax ID
US Phone
Date
Email
Append ISO country codes
Pre-process or filter name, address and area
Rule sets are stored in the common repository
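As a rough illustration of what "validate structure" means for the four field types listed, the sketch below checks shape only, using Python regular expressions and `datetime.strptime`. The patterns and function names are assumptions chosen for the example; they are not the rules shipped in the QualityStage rule sets, and real validation is considerably stricter.

```python
import re
from datetime import datetime

# Assumed shapes: SSN-style 3-2-4 or EIN-style 2-7 digit groups.
TAX_ID = re.compile(r"\d{3}-\d{2}-\d{4}|\d{2}-\d{7}")
# Assumed shape: 10 digits with optional parentheses and separators.
US_PHONE = re.compile(r"\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}")
# Deliberately loose: something@something.something
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def valid_tax_id(value):
    return TAX_ID.fullmatch(value) is not None


def valid_us_phone(value):
    return US_PHONE.fullmatch(value) is not None


def valid_email(value):
    return EMAIL.fullmatch(value) is not None


def valid_date(value, fmt="%m/%d/%y"):
    """True when the value parses as a calendar date in the given format."""
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True
```

For instance, `valid_tax_id("025 -37-1888")` and `valid_us_phone("3380321")` both return `False`: the first contains a stray space and the second has only seven digits.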

Standardization - Example
Input File:
Address Line 1            | Address Line 2
639 N MILLS AVENUE        | ORLANDO, FLA 32803
306 W MAIN STR,           | CUMMING, GA 30130
3142 WEST CENTRAL AV      | TOLEDO OH 43606
843 HEARD AVE             | AUGUSTA-GA-30904
1139 GREENE ST ACCT #1234 | AUGUSTA GEORGIA 30901
4275 OWENS ROAD SUITE 536 | EVANS GA 30809
Result File:
House # | Dir | Str. Name | Type | Unit | No. | NYSIIS  | City    | SOUNDEX | State | Zip   | ACCT#
639     | N   | MILLS     | AVE  |      |     | MAL     | ORLANDO | O645    | FL    | 32803 |
306     | W   | MAIN      | ST   |      |     | MAN     | CUMMING | C552    | GA    | 30130 |
3142    | W   | CENTRAL   | AVE  |      |     | CANTRAL | TOLEDO  | T430    | OH    | 43606 |
843     |     | HEARD     | AVE  |      |     | HAD     | AUGUSTA | A223    | GA    | 30904 |
1139    |     | GREENE    | ST   |      |     | GRAN    | AUGUSTA | A223    | GA    | 30901 | 1234
4275    |     | OWENS     | RD   | STE  | 536 | ON      | EVANS   | E152    | GA    | 30809 |
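The example above can be approximated with a small Python sketch: one function splits Address Line 1 into house number, direction, street name, and street type, and another computes the standard American Soundex code shown in the SOUNDEX column. The lookup tables and the names `standardize_street` and `soundex` are assumptions for illustration and cover only the abbreviations on this slide. The NYSIIS column and unit parsing are left out.

```python
import re

# Hypothetical lookup tables covering only the abbreviations seen on the slide.
DIRECTIONS = {"N": "N", "NORTH": "N", "S": "S", "SOUTH": "S",
              "E": "E", "EAST": "E", "W": "W", "WEST": "W"}
STREET_TYPES = {"AVENUE": "AVE", "AVE": "AVE", "AV": "AVE",
                "STREET": "ST", "STR": "ST", "ST": "ST",
                "ROAD": "RD", "RD": "RD"}

SOUNDEX_CODES = {ch: digit
                 for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"),
                                        ("DT", "3"), ("L", "4"),
                                        ("MN", "5"), ("R", "6"))
                 for ch in letters}


def soundex(word):
    """American Soundex: first letter plus up to three digit codes."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate repeated codes
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is emitted again
    return (result + "000")[:4]


def standardize_street(line):
    """Split an address line into house, dir, name, type, and leftover text."""
    tokens = re.sub(r"[,.]", " ", line.upper()).split()
    out = {"house": "", "dir": "", "name": "", "type": "", "rest": ""}
    if tokens and tokens[0].isdigit():
        out["house"] = tokens.pop(0)
    if len(tokens) > 1 and tokens[0] in DIRECTIONS:
        out["dir"] = DIRECTIONS[tokens.pop(0)]
    name = []
    while tokens:
        token = tokens.pop(0)
        if token in STREET_TYPES and name:
            out["type"] = STREET_TYPES[token]
            break
        name.append(token)
    out["name"] = " ".join(name)
    out["rest"] = " ".join(tokens)  # e.g. "ACCT #1234" or "SUITE 536"
    return out
```

Anything after the street type is returned untouched in `rest`, so unit and account numbers would need a further rule of their own.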

Why Match
Identify duplicate entities within one or more files
Perform householding
Create consolidated view of customer
Establish cross-reference linkage
Enrich existing data with new attributes from external sources

Two Methods to Decide a Match
WILLIAM J KAZANGIAN 128 MAIN ST 02111 12/8/62
WILLAIM JOHN KAZANGIAN 128 MAINE AVE 02110 12/8/62
Are these two records a match?
Deterministic Decision Tables:
• Fields are compared
• Letter grade assigned
• Combined letter grades are compared to a vendor delivered file
• Result: Match; Fail; Suspect
B B A A B D B A = BBAABDBA
Probabilistic Record Linkage:
• Fields are evaluated for degree-of-match
• Weight assigned: represents the "information content" by value
• Weights are summed to derive a total score
• Result: Statistical probability of a match
+5 +2 +20 +3 +4 -1 +7 +9 = +49
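The two approaches can be contrasted in a short Python sketch. The decision table contents, the cutoff values, and all function names below are hypothetical. The slide gives only the grade pattern and the per-field weights. The `agreement_weight` and `disagreement_weight` helpers use the usual Fellegi-Sunter log-ratio form, which is an assumption here since the slide does not state how its weights were derived.

```python
import math

# Hypothetical decision table: combined grade pattern -> outcome.
DECISION_TABLE = {"AAAAAAAA": "Match", "BBAABDBA": "Suspect"}


def deterministic_decision(grades):
    """Look up the combined letter grades; unknown patterns fail."""
    return DECISION_TABLE.get("".join(grades), "Fail")


def agreement_weight(m, u):
    """Weight when a field agrees (assumed Fellegi-Sunter form).

    m: probability the field agrees for true matches.
    u: probability the field agrees by chance for non-matches.
    """
    return math.log2(m / u)


def disagreement_weight(m, u):
    """Weight (usually negative) when a field disagrees."""
    return math.log2((1 - m) / (1 - u))


def probabilistic_decision(weights, match_cutoff, clerical_cutoff):
    """Sum the field weights and compare the total against two cutoffs."""
    score = sum(weights)
    if score >= match_cutoff:
        return score, "Match"
    if score >= clerical_cutoff:
        return score, "Suspect"
    return score, "Fail"


# Field weights as printed on the slide; the cutoffs are made up.
slide_weights = [5, 2, 20, 3, 4, -1, 7, 9]
score, outcome = probabilistic_decision(slide_weights,
                                        match_cutoff=40, clerical_cutoff=20)
```

The deterministic table has to enumerate every grade pattern in advance, whereas the probabilistic score degrades gradually: one weak field (the -1) lowers the total without forcing a failure.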

Why Survive
Provide consolidated view of data
Provide consolidated view containing the "best-of-breed" data
Resolve conflicting values and fill missing values
Cross-populate best available data
Implement business and mapping rules
Create cross-reference keys

Survivorship - Example
Survivorship Input (Match Output)
Group | Legacy | First  | Middle | Last    | No.  | Dir. | Str. Name  | Type | Unit No.
1     | D150   | Bob    |        | Dixon   | 1500 | SE   | ROSS CLARK | CIR  |
1     | A1367  | Robert |        | Dickson | 1500 |      | ROSS CLARK | CIR  |
23    | D689   | Ernest | A      | Obrian  | 5901 | SW   | 74TH       | ST   | STE 202
23    | A436   | Ernie  | Alex   | O’Brian | 5901 | SW   | 74TH       | ST   |
23    | D352   | Ernie  |        | Obrian  | 5901 |      | 74         | ST   | # 202
Consolidated Output
Group | First  | Middle | Last    | No.  | Dir. | Str. Name  | Type | Unit No.
1     | Robert |        | Dickson | 1500 | SE   | ROSS CLARK | CIR  |
23    | Ernie  | Alex   | O’Brian | 5901 | SW   | 74TH       | ST   | STE 202
Cross-reference
Group | Legacy
1     | D150
1     | A1367
23    | D689
23    | A436
23    | D352
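A minimal Python sketch of the survivorship step follows, using the five input rows from this slide. The per-column rules are assumptions picked so that the sketch reproduces the consolidated output shown: the first name takes the most frequent value (ties go to the longest), and every other column takes the longest non-blank value. The slide does not say which rules produced its output, and the names `survive`, `longest`, and `most_frequent` are hypothetical.

```python
from collections import Counter, defaultdict

FIELDS = ("group", "legacy", "first", "middle", "last",
          "no", "dir", "street", "type", "unit")
ROWS = [
    ("1", "D150", "Bob", "", "Dixon", "1500", "SE", "ROSS CLARK", "CIR", ""),
    ("1", "A1367", "Robert", "", "Dickson", "1500", "", "ROSS CLARK", "CIR", ""),
    ("23", "D689", "Ernest", "A", "Obrian", "5901", "SW", "74TH", "ST", "STE 202"),
    ("23", "A436", "Ernie", "Alex", "O'Brian", "5901", "SW", "74TH", "ST", ""),
    ("23", "D352", "Ernie", "", "Obrian", "5901", "", "74", "ST", "# 202"),
]


def longest(values):
    """Assumed rule: the longest value carries the most information."""
    return max(values, key=len, default="")


def most_frequent(values):
    """Assumed rule: the most common value wins; ties go to the longest."""
    counts = Counter(values)
    return max(counts, key=lambda v: (counts[v], len(v)), default="")


# Columns not listed here fall back to `longest`.
RULES = {"first": most_frequent}


def survive(rows):
    """Return (consolidated records, group-to-legacy cross-reference)."""
    groups = defaultdict(list)
    for row in rows:
        record = dict(zip(FIELDS, row))
        groups[record["group"]].append(record)
    consolidated, xref = [], []
    for group_id, members in groups.items():
        best = {"group": group_id}
        for field in FIELDS[2:]:
            values = [m[field] for m in members if m[field]]  # skip blanks
            best[field] = RULES.get(field, longest)(values)
        consolidated.append(best)
        xref.extend((group_id, m["legacy"]) for m in members)
    return consolidated, xref


consolidated, xref = survive(ROWS)
```

Blank values never win, which is how group 1 keeps the "SE" direction that only one of its two records carries. The cross-reference list preserves every legacy key, so each source record can still be traced to its surviving group.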

How Does WebSphere QualityStage Integrate
Source Database: DB2, Oracle, Sybase, Onyx, IDMS, etc.
Data Extraction and Load Routines
QualityStage:
1. Investigation
2. Standardization
3. Integration
4. Survivorship
Target: DB2, Oracle, Sybase, Onyx, IDMS, etc.

WebSphere DataStage and WebSphere QualityStage: Fully Integrated!

QualityStage: Data Quality Extensions
IBM WebSphere QualityStage GeoLocator
IBM WebSphere QualityStage Postal Verification Products
WAVES (Worldwide): IBM WebSphere Worldwide Address Verification Solution
IBM WebSphere QualityStage Postal Certification Products
CASS (United States)
SERP (Canada)
DPID (Australia)
IBM Information Server Data Quality Module for SAP
IBM WebSphere QualityStage for Siebel

Key Strengths for IBM QualityStage
Intuitive, "Design as you think" User Interface
Simple rule design & fine tuning
Seamless Data Flow integration
Intuitive rule design & fine tuning
Defining the technology standard with SOA
Industry leading probabilistic matching engine

Thank You