W2
Information Integration scenarios,
opportunities and challenges
Acknowledgement: This lecture includes contents from many open sources.
Mastering information across the Information Supply Chain
Trusted ◆ Relevant ◆ Governed
Figure labels: Transactional & Collaborative Applications; Business Analytics Applications; Information Sources; Integrate; Manage; Analyze; Govern (Quality, Lifecycle, Security & Privacy, Standards); Cubes; Streams; Master Data; Data; Content; Streaming Information; Information Governance; Data Warehouses; Content Analytics; Big Data
Information Integration
• Information Integration refers to a category of middleware which lets
applications access data as though it were in a single database, whether or not
it is.
• It enables the integration of data and content sources to provide real-time read
and write access, to transform data for business analysis and data
interchange, and to place data for performance, currency, and availability.
• The goal of data integration: tie together different sources, controlled by many
people, under a common schema.
Long-Standing Challenges in Information Integration
▪ Numerous works in the past 30 years
▪ In many communities: DB, AI, KDD, Web, Semantic Web
II Architecture: Virtualization Layer Approach
II Architecture: A Data Warehousing Approach
Information Integration scenarios
Integration for creating a single site to search for jobs/rentals/…
Integration through (sub-)query federation (Mediator Approach)
Figure labels: Applications; Single View of Researcher; Researcher Value estimation; Researcher’s interest evolution; Query Interface; Query Decomposer and Optimizer; Context Builder; Entity Matching
Infrastructure for enabling smart data use and analysis: Data Noise & Format handling; Data correlation & De-dup; Query/Analytics Distribution
Processing steps: Data cleansing and normalization; XML processing at large scales; Information Extraction; Standardization; Duplicate detection
Multiple Data Sources: Citations/DBLP DB (<DBLP Data/>); Patent DB
https://sites.google.com/site/anhaidgroup/useful-stuff/data
https://developer.uspto.gov/api-catalog
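The mediator idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two in-memory SQLite databases stand in for the citation and patent sources, and every table name, column name, and row (including the researcher names) is hypothetical. The point it shows is that data stays in the sources, and a query on the global "researcher" view is decomposed into one sub-query per source at query time.

```python
import sqlite3

# Two independent sources with their own schemas (all names and rows are hypothetical).
citations = sqlite3.connect(":memory:")
citations.execute("CREATE TABLE papers (author TEXT, title TEXT)")
citations.executemany(
    "INSERT INTO papers VALUES (?, ?)",
    [("A. Rao", "Entity matching at scale"), ("B. Lee", "Stream joins")],
)

patents = sqlite3.connect(":memory:")
patents.execute("CREATE TABLE grants (inventor TEXT, patent_title TEXT)")
patents.executemany(
    "INSERT INTO grants VALUES (?, ?)",
    [("A. Rao", "Record linkage apparatus")],
)

# One sub-query per source: each rewrites the global question
# "what has this researcher produced?" into the source's own schema.
SUBQUERIES = [
    (citations, "SELECT title, 'paper' FROM papers WHERE author = ?"),
    (patents, "SELECT patent_title, 'patent' FROM grants WHERE inventor = ?"),
]


def single_view(researcher):
    """Run each sub-query against its source and merge the results."""
    works = []
    for conn, sql in SUBQUERIES:
        works.extend(conn.execute(sql, (researcher,)).fetchall())
    return sorted(works)


print(single_view("A. Rao"))
```

Nothing is copied ahead of time here, so answers are always current, but every query pays the cost of contacting each source.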
Materialization Approach
Figure labels: Applications; Single View of Researcher; Researcher Value estimation; Researcher’s interest evolution; Query Interface; Context Builder; Entity Matching; DBMS; Integrated Master Data
Infrastructure for enabling smart data use and analysis: Data Noise & Format handling; Data correlation & De-dup; Query/Analytics Distribution
Processing steps: Data cleansing and normalization; XML processing at large scales; Information Extraction; Standardization; Duplicate detection
Multiple Data Sources: Citations/DBLP DB (<DBLP Data/>); Patent DB
https://sites.google.com/site/anhaidgroup/useful-stuff/data
https://developer.uspto.gov/api-catalog
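A minimal sketch of the materialization approach, assuming hypothetical source extracts and a toy standardization rule: rows are cleaned and copied into one integrated store ahead of time, and queries then run only against that store. The table layout, the `standardize` helper, and the sample rows are illustrative assumptions, not part of the lecture material.

```python
import sqlite3

# Hypothetical extracts pulled from the source systems.
citation_rows = [("A. Rao", "Entity matching at scale")]
patent_rows = [("a. rao ", "Record linkage apparatus")]


def standardize(name):
    """Toy standardization: collapse whitespace and title-case the name."""
    return " ".join(name.split()).title()


# The integrated store that holds the materialized, cleaned data.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE works (researcher TEXT, title TEXT, kind TEXT)")


def load(rows, kind):
    """Standardize each extracted row and copy it into the integrated table."""
    warehouse.executemany(
        "INSERT INTO works VALUES (?, ?, ?)",
        [(standardize(name), title, kind) for name, title in rows],
    )


load(citation_rows, "paper")
load(patent_rows, "patent")


def single_view(researcher):
    """Answer from the integrated store only; the sources are not contacted."""
    cursor = warehouse.execute(
        "SELECT title, kind FROM works WHERE researcher = ? ORDER BY title",
        (researcher,),
    )
    return cursor.fetchall()


print(single_view("A. Rao"))
```

Queries are fast and cleaning happens once, at load time; the trade-off is that the integrated copy can go stale until the next load.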
Integration for Single truth --
• Landline Phone: Rahul K Sharma; DOB: 06/17/1934; (022) 7314-5577
• Satellite TV: Rahul Kumar Sharma; 55 Link Road; (022) 7314-5577; XX/1133107
• Mobile Phone: R Sharma; 55 Firoza Link Road; (022) 9311234590; 537-27-6402; XX/0001133107
• Rahul K Sharma; 55 Firoza Link Road; (022) 9311234590; 537-27-6402; XX/0001133107
• Rahul Kumar Sharma; 55 Firoza Link Road; (022) 9311234590; 537-27-6402; CEO: KP Technologies; Member of IEEE (Linked-In)
• Rahul K Sharma; 55 Firoza Link Road; 537-27-6402; XX/0001133107; Proud owner of a santro XL (Twitter)
• Extended View: CEO: KP Technologies; Member of IEEE
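One way to link the three service records above into a single customer is to normalize their identifiers and group records that share any of them. The sketch below is a simplified illustration, not a production entity matcher: the field names `phone`, `account`, and `id_number` are assumptions about what the values on the slide represent, and the rule that leading zeros in the account number are padding is also an assumption.

```python
import re

# Records transcribed from the slide; the field names are assumed.
records = [
    {"source": "Landline Phone", "name": "Rahul K Sharma",
     "dob": "06/17/1934", "phone": "(022) 7314-5577"},
    {"source": "Satellite TV", "name": "Rahul Kumar Sharma",
     "address": "55 Link Road", "phone": "(022) 7314-5577",
     "account": "XX/1133107"},
    {"source": "Mobile Phone", "name": "R Sharma",
     "address": "55 Firoza Link Road", "phone": "(022) 9311234590",
     "id_number": "537-27-6402", "account": "XX/0001133107"},
]


def match_keys(rec):
    """Return normalized identifiers that can link two records."""
    keys = set()
    if "phone" in rec:
        # Keep digits only so formatting differences do not matter.
        keys.add(("phone", re.sub(r"\D", "", rec["phone"])))
    if "account" in rec:
        # Assumption: leading zeros after the prefix are padding.
        prefix, _, number = rec["account"].partition("/")
        keys.add(("account", prefix + "/" + number.lstrip("0")))
    return keys


def cluster(recs):
    """Group records that share at least one key, directly or transitively."""
    clusters = []  # list of (key set, member records)
    for rec in recs:
        rec_keys = match_keys(rec)
        merged_keys, merged_recs = set(rec_keys), [rec]
        untouched = []
        for keys, members in clusters:
            if keys & rec_keys:
                merged_keys |= keys
                merged_recs = members + merged_recs
            else:
                untouched.append((keys, members))
        clusters = untouched + [(merged_keys, merged_recs)]
    return [members for _, members in clusters]


for group in cluster(records):
    print([rec["source"] for rec in group])
```

The landline and satellite records share a phone number, and the satellite and mobile records share an account number once padding is removed, so all three end up in one group even though no single identifier appears in all of them.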
‘Text + Data’ Integration
Data analysis
Data integration, data wrangling, …
● The raw data to insight pipeline
Is there any correlation between location and revenue?
Building Data Driven Artifacts
Information Integration in Google search --
Data Finds Data: Entity relationship discovery
Questions: Where does he live? Who is associated with him? Give me records on him. Alert me based on events of interest around him.
Sources: PolNet (photos); Passport; Driving License; Vehicle Registration; Electoral Rolls; Water Meter; Mobile Phones; Immigration Records; FIR Data; Bank Transactions
Single/federated View: Rangaga St, Wamana, MTW
• Visited Afghanistan in last 3 months
• Seen in Rally
Names in the figure: Bob Smyth; Manish Deshraj; Mogd Yokub Thapa
Tracker: International visitor gave address as Australia contact
Challenges
• Data collection and maintenance (with data quality)
• Information extraction
• Multi-modal data integration
• Entity Matching (privacy-preserving)
• Integrated Analysis and Intelligence
Integrating Real-time Audio with Databases
Integration of Structured Query Results with Unstructured Data
RDBMS
Search Engine
SELECT name, MAX(price) - MIN(price)
FROM stocks
GROUP BY name
-- sort descending so the largest price variations come first
ORDER BY 2 DESC
FETCH FIRST 3 ROWS ONLY
Request: “Get the 3 companies with max price variation” and related documents (keywords not required)
Query result: “IBM” “ORCL” “MSFT”
“Database” “Data Cloud Services”
SCORE
“Doctype:Patents” (optional directive)
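The flow on this slide can be sketched as follows: run the structured query, then use the returned company names as search terms over a document collection, optionally restricted by document type. This is a rough illustration under several assumptions: SQLite stands in for the RDBMS (so `LIMIT 3` replaces `FETCH FIRST 3 ROWS ONLY`), the prices and documents are made up, and a plain keyword scan stands in for the search engine.

```python
import sqlite3

# Hypothetical price data; SQLite stands in for the RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stocks (name TEXT, price REAL)")
conn.executemany("INSERT INTO stocks VALUES (?, ?)", [
    ("IBM", 100), ("IBM", 130), ("ORCL", 50), ("ORCL", 70),
    ("MSFT", 200), ("MSFT", 215), ("ACME", 10), ("ACME", 11),
])

# Step 1: structured query for the three largest price variations.
top_names = [row[0] for row in conn.execute(
    "SELECT name, MAX(price) - MIN(price) FROM stocks "
    "GROUP BY name ORDER BY 2 DESC LIMIT 3"
)]

# Hypothetical document collection standing in for the search index.
documents = [
    {"doctype": "Patents", "text": "IBM database indexing method"},
    {"doctype": "News", "text": "ORCL quarterly results"},
    {"doctype": "Patents", "text": "MSFT data cloud services patent"},
    {"doctype": "Patents", "text": "ACME anvil design"},
]


def related_documents(names, doctype=None):
    """Step 2: return documents mentioning any name, optionally by doctype."""
    hits = []
    for doc in documents:
        if doctype and doc["doctype"] != doctype:
            continue
        if any(name in doc["text"].split() for name in names):
            hits.append(doc["text"])
    return hits


print(top_names)
print(related_documents(top_names, doctype="Patents"))
```

The user never types the company names as keywords; they come from the query result, which is the sense in which keywords are not required.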
Integrating Unstructured Documents with Structured Data
I am <Name> Bharat Kumar </Name> …. …… bought a <Company>Sony</Company> <product> DVD player </product> …. from <Company>JK Electronics </Company> ……..
CustId | StoreId | Payment     | Discount    | Terms
A756K9 | S8976   | Card (AMEX) | Promo# 1236 | NOREFND

CustId | Name         | Loyalty Club   | Addr
A756K9 | Bharat Kumar | Platinum Royal | Okhla Phase 3, New Delhi
Additional “sidebar” information available as a result of the annotation
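A small sketch of how the annotations can drive the lookup: pull the tagged values out of the text, then use the extracted name to fetch the matching customer row as "sidebar" information. The sample text below is a condensed version of the slide's example with the elided parts left out, and the regular expression, the `annotations` helper, and the dictionary that stands in for the customer table are all illustrative assumptions.

```python
import re

# Condensed version of the annotated text from the slide (elisions omitted).
text = (
    "I am <Name> Bharat Kumar </Name> bought a <Company>Sony</Company> "
    "<product> DVD player </product> from <Company>JK Electronics </Company>"
)

# Stand-in for the structured customer table, keyed by name.
customers = {
    "Bharat Kumar": {
        "CustId": "A756K9",
        "Loyalty Club": "Platinum Royal",
        "Addr": "Okhla Phase 3, New Delhi",
    },
}


def annotations(doc):
    """Collect the text inside each <Tag>...</Tag> pair, grouped by tag name."""
    found = {}
    for tag, value in re.findall(r"<(\w+)>(.*?)</\1>", doc):
        found.setdefault(tag.lower(), []).append(value.strip())
    return found


tags = annotations(text)
sidebar = [customers[name] for name in tags.get("name", []) if name in customers]

print(tags)
print(sidebar)
```

A real system would need an entity matcher rather than an exact name lookup, since names in free text rarely match database values character for character.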
Figure labels: Data Sources (DB, Data Stream, Web, MDB); Connectors (an Adaptor and a Monitor per source); Integration Hub; Active Functionalities; Business Logic/Process; Feedback
Event based Information Integration for STP
• Many applications require a more proactive approach: integration
triggered by events
• Example: if the sales of a toy in the different regions are less than
100 units by 2 PM, give a discount of 10%
• Useful for making timely business decisions
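The toy-sales rule can be written as a small event handler. This is a sketch under assumptions: the rule is read as applying to each region separately, the sales figures arrive as a dictionary of units sold per region, and the function names are hypothetical.

```python
from datetime import time

THRESHOLD_UNITS = 100   # minimum units expected by the deadline
DEADLINE = time(14, 0)  # 2 PM
DISCOUNT = 0.10         # 10% discount


def regions_to_discount(sales_by_region, now):
    """Return regions selling below the threshold once the deadline is reached."""
    if now < DEADLINE:
        return []
    return sorted(
        region for region, units in sales_by_region.items()
        if units < THRESHOLD_UNITS
    )


def discounted_price(price):
    """Apply the discount to a list price."""
    return round(price * (1 - DISCOUNT), 2)


sales = {"north": 80, "south": 120, "east": 99}
print(regions_to_discount(sales, time(14, 0)))
print(discounted_price(20.0))
```

In an integration hub, a monitor on the sales source would call such a rule whenever new data arrives, instead of waiting for someone to run a report.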
A Platform for Information Integration
Any data
Source categories: Mainframe files; Mainframe databases; Relational databases; XML; Web services; Packaged applications; Content, Workflow & Imaging systems; Collaboration Systems; Web; Other
Sources named in the figure:
▪ VSAM, Sequential
▪ IMS, Adabas, CA-Datacom, CA-IDMS
▪ DB2 UDB, Informix, Oracle, Sybase, Teradata, Microsoft SQL Server
▪ OLE DB, Excel, Flat files
▪ IBM Lotus Extended Search, Web search, LDAP
▪ DB2 CM Family, Domino.doc, Documentum, FileNet, Open Text, Stellent, Interwoven, Hummingbird
▪ WebSphere, FileNet, WebSphere BI Adaptors
▪ SAP, PeopleSoft, Siebel
▪ Lotus Notes, Microsoft Index Server, IBM Lotus Extended Search, Sametime, QuickPlace, Microsoft Exchange
Multiple access paradigms: Search, SQL, XQuery, Content
Multiple integration disciplines: Find, Consolidate, Publish, Federate, Transform
Data and Content Access; Metadata Management; Integration Design Tools
Information Integration Key Challenges
Managing different platforms:
• Identifying relevant information from multiple data sources
• Logical specification of the data desired
• Handling dynamic arrival and departure of data sources
Automated data transformations:
• Data curation
• Defining and working with data quality
– What characteristics matter? What’s a “good” answer?
– How does quality compose across sources, across characteristics, and for different activities?
Schema and data heterogeneity:
• Integrating diverse information from the recorded state of the business within
cost and skill constraints
• Schema mapping, data mapping, information discovery, …
• Uniform (or source-specific) query access to data sources
• Distributed query processing and optimization
• Consolidating, transforming, and mining data for analysis
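As a small illustration of schema mapping, the sketch below renames source-specific fields into one shared (global) schema so records from different systems can be compared. The source names, field names, and sample records are all hypothetical, and real mappings usually also need value conversions, which are left out here.

```python
# Hypothetical per-source mappings: source field name -> global field name.
MAPPINGS = {
    "crm": {"cust_name": "name", "tel": "phone"},
    "billing": {"customer": "name", "phone_no": "phone", "amt_due": "balance"},
}


def to_global(source, record):
    """Rename the fields of one source record into the global schema.

    Fields with no mapping are dropped rather than guessed at.
    """
    mapping = MAPPINGS[source]
    return {
        mapping[field]: value
        for field, value in record.items()
        if field in mapping
    }


crm_row = {"cust_name": "Bharat Kumar", "tel": "011-5550100", "notes": "vip"}
billing_row = {"customer": "Bharat Kumar", "phone_no": "011-5550100", "amt_due": 42}

print(to_global("crm", crm_row))
print(to_global("billing", billing_row))
```

Writing and maintaining such mappings by hand for many sources is one of the efforts that the challenges listed above refer to.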
Can AI help in the Information Integration cycle?
• Reduce the effort needed to set up data curation and integration tasks.
• Enable the system to perform gracefully with uncertainty (e.g., on
Important Links
https://sites.google.com/site/anhaidgroup/useful-stuff/the-magellan-data-repository?authuser=0
https://www.kaggle.com/datasets
https://indiadataportal.com/
data.gov.in
data.gov
data.gov.au
https://research.cs.wisc.edu/dibook/
You can reach out to me any day for project discussion. Send me an email first to
schedule the online/offline meeting.