Enterprises have always grappled with information silos, which had to be merged using multiple data warehouses (DWs) and business intelligence (BI) tools so that the enterprise could mine this disparate data for business decisions and strategy. Traditionally, this data integration was done with ETL, consolidating multiple DBMSs into a single data storage facility.
Data virtualization enables abstraction, transformation, federation, and delivery of data from a variety of heterogeneous data sources as if they were a single virtual data source, without physically copying the data for integration. It lets consuming applications or users access data from these sources through a request to a single access point, delivering information as a service (IaaS).
In this presentation, we explore what data virtualization is and how it differs from the traditional data integration architecture. We also validate the data virtualization and federation concepts by working through an example (see videos at the GitHub repo) that federates data across two heterogeneous data sources, MySQL and MongoDB, using the JBoss Teiid data virtualization platform.
7. Problems with ETL
More than 1 copy of data for staging
Intermediate data => Errors
Lead time to add new source
Domain knowledge for mapping
Batch Process => No real time data
8. Problems with DBMS consolidation
Alternate approach => Single EIS (say RDBMS)
Extensive changes to existing apps
Might not satisfy everyone’s requirements
9. Agenda
• Use cases
• What does it mean?
• Implementation Frameworks
• Demo
• Questions?
• Architecture explained
10. Data Virtualization & Federation
Single API to access data
Only metadata stored at virtualization layer
Real time access without copying/moving data
Federate data across heterogeneous/homogeneous sources
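The points above can be illustrated with a minimal sketch. It is not Teiid code: an in-memory SQLite database stands in for a relational source such as MySQL, a list of dictionaries stands in for a MongoDB collection, and `customer_orders` is a hypothetical federated view invented for this illustration. It assumes only that the two sources share a customer identifier.

```python
import sqlite3

# Source 1: a relational store (in-memory SQLite standing in for MySQL).
orders_db = sqlite3.connect(":memory:")
orders_db.execute(
    "CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL)"
)
orders_db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 100, 25.0), (2, 101, 40.0), (3, 100, 15.5)],
)

# Source 2: a document store (a list of dicts standing in for a MongoDB collection).
customer_docs = [
    {"_id": 100, "name": "Asha"},
    {"_id": 101, "name": "Ben"},
]


def customer_orders():
    """Hypothetical federated view: joins both sources at query time.

    Nothing is copied into a staging area; every call reads the
    sources as they are at that moment.
    """
    names = {doc["_id"]: doc["name"] for doc in customer_docs}
    rows = orders_db.execute(
        "SELECT order_id, customer_id, amount FROM orders ORDER BY order_id"
    ).fetchall()
    return [
        {"order_id": oid, "customer": names.get(cid), "amount": amount}
        for oid, cid, amount in rows
    ]


before = customer_orders()
# A change made at the source is visible on the next query, with no batch reload.
orders_db.execute("INSERT INTO orders VALUES (4, 101, 9.99)")
after = customer_orders()
print(len(before), len(after))
```

The caller sees one access point (`customer_orders`) and never learns which store holds which field. A real virtualization layer would also push filters and joins down to the sources rather than reading whole tables as this sketch does.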
15. Vendors
Commercial Products
Composite Software
http://www.compositesw.com/data-virtualization/
Denodo
http://www.denodo.com/en/product/overview.php?n=h
IBM
http://www-03.ibm.com/software/products/en/ibminfofedeserv
Informatica
http://www.informatica.com/us/data-virtualization/
Red Hat
http://www.redhat.com/products/jbossenterprisemiddleware/data-virtualization/
Open Source
JBoss Teiid
http://teiid.jboss.org/
16. Selected Platform – JBoss Teiid
Open Source
Number of relational/NoSQL/ERP/CRM data stores
JEE standards
Add custom EIS support using JEE components
Active & responsive community
Synerzip contribution: Defect discovery, root cause analysis, feature verification
17. Teiid Components
Virtual Database
container for components used to integrate data from multiple data sources
Source Models
structure and characteristics of physical data sources
View Models
structure and characteristics of abstract structures you want to expose to your applications
Teiid Designer
Eclipse based UI to dynamically discover data source objects and apply data federation
Generate virtual database from 1 or more sources
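To make the relationship between these components concrete, here is a small sketch of a virtual database as a container that holds only metadata. The class names mirror the slide terms, but the fields, the `add_view` check, and the translator labels are assumptions made for illustration; they are not Teiid's actual VDB format or API.

```python
from dataclasses import dataclass, field


@dataclass
class SourceModel:
    """Structure of a physical data source; holds no rows, only metadata."""
    name: str
    translator: str   # illustrative label for the source type
    tables: dict      # table name -> list of column names


@dataclass
class ViewModel:
    """Abstract structure exposed to applications."""
    name: str
    columns: list
    depends_on: list  # names of the source models the view reads from


@dataclass
class VirtualDatabase:
    """Container for the models used to integrate multiple sources."""
    name: str
    sources: dict = field(default_factory=dict)
    views: dict = field(default_factory=dict)

    def add_source(self, model):
        self.sources[model.name] = model

    def add_view(self, view):
        # A view may only refer to source models already in this VDB.
        missing = [s for s in view.depends_on if s not in self.sources]
        if missing:
            raise ValueError(f"unknown source model(s): {missing}")
        self.views[view.name] = view


vdb = VirtualDatabase("demo_vdb")
vdb.add_source(SourceModel("orders_src", "relational", {"orders": ["order_id", "customer_id"]}))
vdb.add_source(SourceModel("customers_src", "document", {"customers": ["_id", "name"]}))
vdb.add_view(ViewModel("customer_orders", ["order_id", "name"], ["orders_src", "customers_src"]))
print(sorted(vdb.sources), sorted(vdb.views))
```

Because the container stores only names, columns, and dependencies, deploying it moves no data; the rows stay in the physical sources.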
18. Teiid Components
Translator
Provides abstraction layer between Teiid Query Engine and source system
Convert Teiid SQL commands to source specific execution commands
Convert result data from source system to Teiid specific format
Resource Adapter
Provides connectivity to the physical data source
Integration provided through Java Connector Architecture (JCA) API
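The translator's two jobs, converting a command into a source-specific form and converting results back into a common format, can be sketched as follows. This is a simplified illustration, not Teiid's translator API: the function names and the tiny `eq`/`gt`/`lt` condition vocabulary are hypothetical, and only the MongoDB `$gt`/`$lt` operators are taken from a real system.

```python
def to_sql(table, column, op, value):
    """Render a single condition as a parameterized SQL statement."""
    sql_ops = {"eq": "=", "gt": ">", "lt": "<"}
    return f"SELECT * FROM {table} WHERE {column} {sql_ops[op]} ?", (value,)


def to_mongo_filter(column, op, value):
    """Render the same condition as a MongoDB-style filter document."""
    if op == "eq":
        return {column: value}
    mongo_ops = {"gt": "$gt", "lt": "$lt"}
    return {column: {mongo_ops[op]: value}}


def document_to_row(doc, columns):
    """Convert a source document into the engine's common row format (a tuple)."""
    return tuple(doc.get(col) for col in columns)


# One logical condition, two source-specific commands.
sql_command = to_sql("orders", "amount", "gt", 20)
mongo_command = to_mongo_filter("amount", "gt", 20)
row = document_to_row({"_id": 100, "name": "Asha"}, ["_id", "name"])
print(sql_command, mongo_command, row)
```

In this picture the resource adapter is whatever holds the connection that executes `sql_command` or `mongo_command`; the translator itself only converts commands and results.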
19. Teiid – Supported EIS
Amazon SimpleDB
Apache Accumulo
Apache SOLR
Cassandra
File
Google Spreadsheet
JPA
LDAP
Excel – as file
SalesForce
JDBC
MS Access, DB2, Derby, excel-odbc, Greenplum, H2, Hive (for accessing Hadoop), Oracle, Teradata and most RDBMS
MongoDB
Object
OData
OLAP
Web Services
SAP Netweaver Gateway
20. Performance Characteristics
Access same data using Oracle and Teiid drivers
Retrieval times comparable when accessing tables having no Blobs
[Chart: No. of rows vs. time (ms), no Blobs; series: Oracle-JDBC and Teiid-JDBC]
21. Performance Characteristics
Teiid slower when accessing Blob data
Can be tuned
[Chart: No. of rows vs. time (ms), Blobs; series: Oracle-JDBC and Teiid-JDBC]
24. Demo-Steps
Pre-requisites
MySQL server 5.5+ installed
MongoDB 2.4.x+ installed
Steps
Load the MySQL and MongoDB databases with sample data
Set up environment – JBoss, Eclipse
Create Teiid project in Eclipse using Teiid Designer
Import source model using JDBC
Create the virtual model and federate data from the source model
Create a virtual database (VDB) and deploy to JBoss
Access data using JDBC client or through browser using OData
33. Conclusion
Data Virtualization and Federation is a rapidly emerging technology that addresses traditional BI/ETL problems.
It shortens time to market, distributes data across the enterprise as a service, and provides real time access to enterprise data.
Editor's Notes
Require more than 1 copy of data for staging
Creating, storing and manipulating this intermediate data can lead to errors in data quality
Lead time required to add data from new sources
Depends on domain knowledge of mapping entities between different data sources
Batch processing – information lagging behind real time data
Alternate approach is to move all enterprise data to a common Enterprise Information System (typically RDBMS)
Extensive changes to existing applications resulting in end user impact
Might not satisfy every group’s requirements – say group 1 has partitioned data but the target RDBMS doesn’t support partitioning
Single API to access data from heterogeneous sources
Only metadata stored at virtualization layer
Real time access of data without copying/moving data from the source Enterprise Information System (EIS)
Federate data across multiple heterogeneous/homogeneous sources
An enterprise information system (EIS) is any kind of information system that improves an enterprise's business processes through integration. An EIS could use a database, web service, flat files, or any other custom system for storing this information.
JBoss Teiid
Open Source
Supports a number of relational and non-relational data sources
Integrated with the JBoss Application Server and JEE architecture
Ability to add custom data sources using standard JEE components
Very active and responsive community
Amazon SimpleDB - web service for running queries on structured data in real time
Apache Accumulo - sorted, distributed key value store
Apache SOLR - search system for indexing data/services
Cassandra - NoSQL database
File - exposes stored procedures to leverage file system resources
JPA - reverse a JPA object model into a relational model
LDAP - exposes an LDAP directory tree relationally
MongoDB - NoSQL database
Object - reads Java objects from external sources (e.g., Infinispan Cache or Map cache)
OData - consumes OData web services and also acts as a web server to expose a VDB as an OData service
OLAP - online analytical processing, exposing data as multidimensional arrays called cubes
SalesForce - CRM product
SAP Netweaver Gateway - Web service calls to SAP
Web Services - exposes stored procedures for calling web services