slides on the subject of information integration and application
sawan23003
Oct 06, 2024
W2
Information Integration scenarios,
opportunities and challenges
Acknowledgement: This lecture includes contents from many open sources.
[Figure: Mastering information across the Information Supply Chain (Trusted ◆ Relevant ◆ Governed). Stages: Integrate, Manage, Analyze, Govern. Other labels: Information Sources; Information Integration; Transactional & Collaborative Applications; Business Analytics Applications; Master Data; Data; Content; Streaming Information; Cubes; Streams; Data Warehouses; Content Analytics; Big Data; Information Governance; Quality; Lifecycle; Security & Privacy; Standards.]
Information Integration
•Information Integration refers to a category of middleware that lets applications access data as though it were in a single database, whether or not it is.
•It enables the integration of data and content sources to provide real-time read and write access, to transform data for business analysis and data interchange, and to place data for performance, currency, and availability.
•The goal of data integration: tie together different sources, controlled by many people, under a common schema.
▪Numerous works in the past 30 years
▪In many communities: DB, AI, KDD, Web, Semantic Web
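The goal stated above, tying different sources together under a common schema, can be sketched as follows. This is a minimal illustration and not a real middleware API: the two sources, their field names, the sample rows, and the `COMMON_FIELDS` schema are all hypothetical.

```python
# Hypothetical sketch: two sources with different layouts are exposed
# through one common schema, so the caller sees "a single database".

COMMON_FIELDS = ("name", "city")  # assumed common schema

# Source A: rows as dicts with their own column names (made-up data).
hr_rows = [{"full_name": "Asha Rao", "town": "Pune"}]
# Source B: rows as positional tuples (made-up data).
crm_rows = [("Vikram Shah", "Delhi")]


def from_hr(row):
    """Map a source-A row onto the common schema."""
    return {"name": row["full_name"], "city": row["town"]}


def from_crm(row):
    """Map a source-B row onto the common schema."""
    return dict(zip(COMMON_FIELDS, row))


def integrated_view():
    """Yield every row from every source in the common schema."""
    for row in hr_rows:
        yield from_hr(row)
    for row in crm_rows:
        yield from_crm(row)


def query(city):
    """Answer a query against the common schema, hiding the sources."""
    return [r["name"] for r in integrated_view() if r["city"] == city]


print(query("Pune"))
```

The caller of `query` never sees which source a row came from; only the per-source mapping functions know the source layouts.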
Long-Standing Challenges in Information Integration
II Architecture: Virtualization Layer Approach
II Architecture: A Data Warehousing Approach
Information Integration scenarios
Integration for creating a single site to search for jobs/rentals/…
Integration through (sub-)query federation (Mediator Approach)
[Figure: mediator architecture for a single view of the researcher.
Multiple Data Sources: Citations/DBLP DB (<DBLP Data/>), Patent DB; https://sites.google.com/site/anhaidgroup/useful-stuff/data, https://developer.uspto.gov/api-catalog
Components: Context Builder; Information Extraction; Standardization; Duplicate detection; Entity Matching; Data cleansing and normalization; XML processing at large scales; Query Interface; Query Decomposer and Optimizer
Layers: Data Noise & Format handling; Data correlation & De-dup; Query/Analytics Distribution; Infrastructure for enabling smart data use and analysis
Applications: Single View of Researcher; Researcher Value estimation; Researcher’s interest evolution]
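A minimal sketch of the mediator idea named in this slide: the query is decomposed into one sub-query per source, each source answers at query time, and the mediator merges the partial answers. The source contents, the exact-name join, and all function names are hypothetical; a real system would use entity matching rather than exact string equality.

```python
# Hypothetical mediator sketch: data stays in the sources and is
# fetched only when a query arrives (nothing is copied in advance).

# Made-up stand-ins for a citation source and a patent source.
CITATION_DB = [
    {"name": "A. Gupta", "paper": "Query Federation"},
    {"name": "B. Iyer", "paper": "Entity Matching"},
]
PATENT_DB = [
    {"name": "A. Gupta", "patent": "P-001"},
]


def citation_subquery(name):
    """Sub-query pushed to the citation source."""
    return [r["paper"] for r in CITATION_DB if r["name"] == name]


def patent_subquery(name):
    """Sub-query pushed to the patent source."""
    return [r["patent"] for r in PATENT_DB if r["name"] == name]


def single_view_of_researcher(name):
    """Decompose the query, run each sub-query, and merge the answers."""
    return {
        "name": name,
        "papers": citation_subquery(name),
        "patents": patent_subquery(name),
    }


print(single_view_of_researcher("A. Gupta"))
```

Because every call goes back to the sources, answers are always current, at the cost of source latency on each query.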
Materialization Approach
[Figure: materialization architecture for a single view of the researcher.
Multiple Data Sources: Citations/DBLP DB (<DBLP Data/>), Patent DB; https://sites.google.com/site/anhaidgroup/useful-stuff/data, https://developer.uspto.gov/api-catalog
Components: Context Builder; Information Extraction; Standardization; Duplicate detection; Entity Matching; Data cleansing and normalization; XML processing at large scales; Query Interface; DBMS holding Integrated Master Data
Layers: Data Noise & Format handling; Data correlation & De-dup; Query/Analytics Distribution; Infrastructure for enabling smart data use and analysis
Applications: Single View of Researcher; Researcher Value estimation; Researcher’s interest evolution]
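A minimal sketch of the materialization idea: source data is copied into one integrated store ahead of time, and queries run against that store only. Here an in-memory SQLite database stands in for the DBMS; the table name `researcher_item`, the sample rows, and the function names are hypothetical.

```python
import sqlite3

# Made-up stand-ins for a citation source and a patent source.
CITATION_DB = [("A. Gupta", "Query Federation"), ("B. Iyer", "Entity Matching")]
PATENT_DB = [("A. Gupta", "P-001")]


def materialize():
    """Copy both sources into one integrated store before any query runs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE researcher_item (name TEXT, kind TEXT, item TEXT)")
    conn.executemany(
        "INSERT INTO researcher_item VALUES (?, 'paper', ?)", CITATION_DB)
    conn.executemany(
        "INSERT INTO researcher_item VALUES (?, 'patent', ?)", PATENT_DB)
    conn.commit()
    return conn


def items_for(conn, name):
    """Query the integrated store only; the sources are not contacted."""
    rows = conn.execute(
        "SELECT kind, item FROM researcher_item "
        "WHERE name = ? ORDER BY kind, item", (name,))
    return rows.fetchall()


conn = materialize()
print(items_for(conn, "A. Gupta"))
```

Queries are fast and independent of source availability, but the store is only as fresh as the last load.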
Integration for Single truth
Source records:
•Landline Phone: Rahul K Sharma; DOB: 06/17/1934; (022) 7314-5577
•Satellite TV: Rahul Kumar Sharma; 55 Link Road; (022) 7314-5577; XX/1133107
•Mobile Phone: R Sharma; 55 Firoza Link Road; (022) 9311234590; 537-27-6402; XX/0001133107
Merged record: Rahul K Sharma; 55 Firoza Link Road; (022) 9311234590; 537-27-6402; XX/0001133107
External profiles:
•Linked-In: Rahul Kumar Sharma; 55 Firoza Link Road; (022) 9311234590; 537-27-6402; CEO: KP Technologies; Member of IEEE
•Twitter: Rahul K Sharma; 55 Firoza Link Road; 537-27-6402; XX/0001133107; Proud owner of a santro XL
Extended View:
•CEO: KP Technologies; Member of IEEE
•Vehicle Owned: Santro X
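One way the three source records on this slide could be linked is sketched below. It is an assumption-laden illustration, not the method behind the slide: the field labels `id_number` and `account_no` are guesses (the slide does not label those values), records are linked when they share a normalized phone number or account number, and conflicts are resolved by the crude rule "keep the longest value".

```python
import re

# Values copied from the slide; the field labels are assumptions.
RECORDS = [
    {"source": "Landline Phone", "name": "Rahul K Sharma",
     "dob": "06/17/1934", "phone": "(022) 7314-5577"},
    {"source": "Satellite TV", "name": "Rahul Kumar Sharma",
     "address": "55 Link Road", "phone": "(022) 7314-5577",
     "account_no": "XX/1133107"},
    {"source": "Mobile Phone", "name": "R Sharma",
     "address": "55 Firoza Link Road", "phone": "(022) 9311234590",
     "id_number": "537-27-6402", "account_no": "XX/0001133107"},
]


def match_keys(rec):
    """Normalized identifiers used to decide that two records match."""
    keys = set()
    if "phone" in rec:
        keys.add(("phone", re.sub(r"\D", "", rec["phone"])))
    if "account_no" in rec:
        prefix, _, number = rec["account_no"].partition("/")
        # Assumed rule: leading zeros in the number are padding.
        keys.add(("account", prefix, number.lstrip("0")))
    return keys


def merge_matching(records):
    """Group records that share any key; keep the longest value per field."""
    clusters = []
    for rec in records:
        new = {"keys": match_keys(rec),
               "sources": [rec["source"]],
               "record": {k: v for k, v in rec.items() if k != "source"}}
        kept = []
        for cluster in clusters:
            if cluster["keys"] & new["keys"]:
                new["keys"] |= cluster["keys"]
                new["sources"] = cluster["sources"] + new["sources"]
                for field, value in cluster["record"].items():
                    current = new["record"].get(field, "")
                    new["record"][field] = max(current, value, key=len)
            else:
                kept.append(cluster)
        clusters = kept + [new]
    return clusters


merged = merge_matching(RECORDS)
print(merged[0]["sources"])
print(merged[0]["record"])
```

The landline and satellite TV records link through the shared phone number, and the satellite TV and mobile records link through the account number once leading zeros are stripped, so all three collapse into one cluster. The "longest value" rule is only a placeholder for a real survivorship policy.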
‘Text + Data’ Integration
Data analysis
Data integration, data wrangling, …
●The raw data to insight pipeline
Is there any correlation between location and revenue?
Building Data Driven Artifacts
Information Integration in Google search --
Data Finds Data: Entity relationship discovery
Questions:
•Where does he live?
•Who is associated with him?
•Give me records on him.
•Alert me based on events of interest around him.
Data sources: PolNet (photos), Passport, Driving License, Vehicle Registration, Electoral Rolls, Water Meter, Mobile Phones, Immigration Records, FIR Data, Bank Transactions
Single/federated View:
•Rangaga St, Wamana, MTW
•Visited Afghanistan in last 3 months
•Seen in Rally
•Bob Smyth
•Manish Deshraj
•Mogd Yokub Thapa
•International visitor gave address as Australia contact
Tracker
Challenges
•Data collection and maintenance (with data quality)
•Information extraction
•Multi-modal data integration
•Entity Matching (privacy-preserving)
•Integrated Analysis and Intelligence
Integrating Real-time Audio with Databases
Integration of Structured Query Results with Unstructured Data
Request: “Get the 3 companies with max price variation” and related documents (keywords not required)
RDBMS:
SELECT name, max(price) - min(price)
FROM stocks
GROUP BY name
ORDER BY 2 DESC
FETCH FIRST 3 ROWS ONLY
Search Engine:
•Keywords from the query result: “IBM” “ORCL” “MSFT”
•Other keywords shown: “Database” “Data Cloud Services”
•“Doctype:Patents” (optional directive)
•SCORE
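The flow on this slide, where a structured query produces the keywords for a document search, can be sketched as below. This is an illustration under assumptions: SQLite stands in for the RDBMS (so `LIMIT` replaces `FETCH FIRST 3 ROWS ONLY`), the prices are made up, and `build_search_request` only formats a keyword string for a hypothetical search engine rather than calling one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stocks (name TEXT, price REAL)")
# Made-up prices, chosen only so the variations differ.
conn.executemany("INSERT INTO stocks VALUES (?, ?)", [
    ("IBM", 100.0), ("IBM", 130.0),
    ("ORCL", 50.0), ("ORCL", 70.0),
    ("MSFT", 200.0), ("MSFT", 210.0),
    ("ACME", 10.0), ("ACME", 11.0),
])


def top_variation(conn, n=3):
    """Names of the n companies with the largest max-min price spread."""
    rows = conn.execute(
        "SELECT name, MAX(price) - MIN(price) AS variation "
        "FROM stocks GROUP BY name "
        "ORDER BY variation DESC LIMIT ?", (n,))
    return [name for name, _ in rows]


def build_search_request(names, doctype=None):
    """Format the names as quoted keywords plus an optional directive."""
    terms = " ".join(f'"{n}"' for n in names)
    return f"{terms} Doctype:{doctype}" if doctype else terms


names = top_variation(conn)
print(build_search_request(names, doctype="Patents"))
```

The descending sort matters: the request asks for the largest variation, so an ascending `ORDER BY` would return the three least volatile companies instead.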
Integrating Unstructured Documents with Structured Data
Annotated document:
I am <Name> Bharat Kumar </Name> …. …… bought a <Company>Sony</Company> <product> DVD player </product> …. from <Company>JK Electronics </Company> ……..

Structured data:
CustId   StoreId   Payment       Discount      Terms
A756K9   S8976     Card (AMEX)   Promo# 1236   NOREFND

CustId   Name           Loyalty Club     Addr
A756K9   Bharat Kumar   Platinum Royal   Okhla Phase 3, New Delhi

Additional “sidebar” information available as a result of the annotation
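A minimal sketch of how the annotations could drive the "sidebar" lookup: tagged spans are pulled out of the text, and the annotated name is used as a key into the structured tables. The regular expression, the dictionary layout, and the exact-name lookup are assumptions made for illustration; the column split of the purchase row follows the table reconstruction and is itself a reading of the slide.

```python
import re

ANNOTATED = (
    "I am <Name> Bharat Kumar </Name> bought a "
    "<Company>Sony</Company> <product> DVD player </product> "
    "from <Company>JK Electronics </Company>"
)

# Structured data from the slide, keyed for lookup (layout is assumed).
CUSTOMERS = {
    "Bharat Kumar": {"CustId": "A756K9", "Loyalty Club": "Platinum Royal",
                     "Addr": "Okhla Phase 3, New Delhi"},
}
PURCHASES = {
    "A756K9": {"StoreId": "S8976", "Payment": "Card (AMEX)",
               "Discount": "Promo# 1236", "Terms": "NOREFND"},
}

# Matches <Tag> value </Tag> and trims whitespace around the value.
TAG = re.compile(r"<(\w+)>\s*(.*?)\s*</\1>")


def annotations(text):
    """Collect annotated values per lower-cased tag name."""
    found = {}
    for tag, value in TAG.findall(text):
        found.setdefault(tag.lower(), []).append(value)
    return found


def sidebar(text):
    """Look up structured rows for every annotated customer name."""
    result = []
    for name in annotations(text).get("name", []):
        customer = CUSTOMERS.get(name)
        if customer:
            purchase = PURCHASES.get(customer["CustId"], {})
            result.append({**customer, **purchase})
    return result


print(annotations(ANNOTATED))
print(sidebar(ANNOTATED))
```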
[Figure: Integration Hub. Data Sources: DB, Data Stream, Web, MDB. Connectors: Adaptor and Monitor (four of each shown). Other labels: Business Logic/Process; Feedback; Active Functionalities.]
Event based Information Integration for STP
•Many applications require a more proactive approach: integration triggered by events
•Example: if the sales of a toy in the different regions are less than 100 units by 2 PM, give a discount of 10%
•Useful for making timely business decisions
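The toy-sales rule above can be sketched as an event-condition-action check. The event shape, the names, and the sample events are hypothetical, and the rule is read here as applying per region (the slide could also mean the total across regions). Events are assumed to arrive in time order and to carry the cumulative units sold so far that day.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class SalesEvent:
    toy: str
    region: str
    units_sold: int  # cumulative units sold so far today (assumed)
    at: time         # time of day the count was reported


THRESHOLD_UNITS = 100
CUTOFF = time(14, 0)  # 2 PM
DISCOUNT = 0.10


def discounts(events):
    """Return {(toy, region): discount} for slow sellers at the cutoff."""
    latest = {}
    for event in events:
        # Condition input: only counts reported by the cutoff are used.
        if event.at <= CUTOFF:
            latest[(event.toy, event.region)] = event.units_sold
    # Action: a discount wherever the last count is under the threshold.
    return {key: DISCOUNT for key, units in latest.items()
            if units < THRESHOLD_UNITS}


events = [
    SalesEvent("robot", "North", 40, time(10, 0)),
    SalesEvent("robot", "South", 60, time(13, 0)),
    SalesEvent("robot", "North", 120, time(13, 30)),
    SalesEvent("robot", "South", 150, time(15, 0)),  # after cutoff, ignored
]
print(discounts(events))
```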
A Platform for Information Integration
[Figure: a platform for information integration over any data]
Source categories: Mainframe files; Mainframe databases; Relational databases; Content, Workflow & Imaging systems; Collaboration Systems; Packaged applications; Web; XML; Web services; Other
Sources and connectors shown: VSAM, Sequential, IMS, Adabas, CA-Datacom, CA-IDMS, DB2 UDB, Informix, Oracle, Sybase, Teradata, Microsoft SQL Server, ODBC, OLE DB, Excel, Flat files, IBM Lotus Extended Search, Web search, LDAP, Custom, DB2 CM Family, Domino.doc, Documentum, FileNet, Open Text, Stellent, Interwoven, Hummingbird, WebSphere, WebSphere BI Adaptors, SAP, PeopleSoft, Siebel, Lotus Notes, Microsoft Index Server, Sametime, QuickPlace, Microsoft Exchange
Multiple access paradigms: Search, SQL, XQuery, Content
Multiple integration disciplines: Find, Consolidate, Publish, Federate, Transform
Platform layers: Data and Content Access; Metadata Management; Integration Design Tools
Information Integration Key Challenges
Managing different platforms:
•Identifying relevant information from multiple data sources
•Logical specification of the data desired
•Handling dynamic arrival and departure of data sources
Automated data transformations:
•Data curation
•Defining and working with data quality
–What characteristics matter? What’s a “good” answer?
–How does quality compose across sources, across characteristics, and for different activities?
Schema and data heterogeneity:
•Integrating diverse information from the recorded state of the business within cost and skill constraints
•Schema mapping, data mapping, information discovery, …
•Uniform (or source-specific) query access to data sources
•Distributed query processing and optimization
•Consolidating, transforming, and mining data for analysis
Can AI help in Information Integration cycle?
•Reduce the effort needed to set up data curation and integration tasks.
•Enable the system to perform gracefully under uncertainty (e.g., on the web, with noisy sources, …)