Information Virtualization: Query Federation on Data Lakes

© 2015 IBM Corporation
Beate Porst
porst@us.ibm.com
Product Manager Information Server
Jo Ramos
joramos@us.ibm.com
Distinguished Engineer – Big Data and Analytics @IBM

Agenda
 Data Lakes and Data Reservoirs
 Information Virtualization and Federation
 Examples of Federation and Best Practices
 Information Integration on Hadoop

The true value of Big Data is in context
Raw data
Feature extraction metadata
Domain linkages
Full contextual analytics
Location risk
Occupational risk
Dietary risk
Family history
Actuarial data
Government statistics
Epidemic data
Chemical exposure
Personal financial situation
Social relationships
Travel history
Weather history
. . .
. . .
Patient records

A growing data demand … and organizational tensions
Data Scientists seeking data for new analytics models.
Marketer seeking data for new campaigns.
Fraud investigator seeking data to understand the details of suspicious activity.
Lines of Business
• Agility
• Data Access Freedom
• Any kind of data
• Powerful Analysis & Visualization
IT Organization
• Security
• Data Privacy
• Standards
• ...
Application Developer
Knowledge Worker

Why a Data Reservoir and Not a Lake
Data Lake
 Data flows in “naturally” and just sits there
Data Reservoir
 Built to extract value from the data

The Data Reservoir subsystems
[Diagram: the Data Reservoir and the teams and systems around it]
• Information Management and Governance Fabric
• Data Reservoir Repositories: Sandbox, Master Data Management, Cache Data, Data Marts, Operational Data Stores, Information Warehouse (EDW), Deep Data (aka Hadoop, aka Data Lake)
• Catalogue, Self-Service Access, Enterprise IT Data Exchange, Raw Data Interaction
• Analytics Teams; Governance, Risk and Compliance Team; Information Curator; Line of Business Teams; Data Reservoir Operations
• Enterprise IT: New Sources, System of Record, Systems of Engagement

Data Reservoir Logical Architecture
[Diagram: Data Reservoir logical architecture. Recoverable labels:]
• Data Reservoir Repositories, by group: Harvested Data, Descriptive Data, Shared Operational Data, Deposited Data, Historical Data, Published
• Repositories: Information Warehouse, Deep Data, Information Views, Catalog, Search Index, Asset Hub, Activity Hub, Code Hub, Content Hub, Audit Data, Operational History, Offline Archive, Sandboxes, Reporting Data Marts, Object Cache
• Information Integration & Governance: Information Broker, Operational Governance Hub, Code Hub, Workflow, Staging Areas, Guards, Monitor
• Interfaces: Catalog Interfaces, Raw Data Interaction, View-based Interaction, Access and Feedback, Enterprise IT Interaction (Service Interfaces, Data Ingestion, Publishing Feeds, Continuous Analytics), Streaming Analytics, Event Correlation
• Users and systems: Line of Business Applications, Data Reservoir Operations, Compliance Team, Information Curator, Analytics Tools, Consumers of Insight (Simple, ad hoc Discovery and Analysis; Reporting; Analytical Insight Applications), Other Systems of Insight, Other Data Reservoirs, New Sources (Third Party Feeds, Third Party APIs, Internal Sources), Enterprise IT (System of Record Applications, Enterprise Service Bus, Systems of Engagement)
• Flows: Information Service Calls, Search Requests, Report Requests, Data Access, Data Deposit, Data In, Data Out, Events to Evaluate, Notifications, Deploy Decision Models, Deploy Real-time Decision Models, Curation Interaction, Management, Decision Model Management, Understand Information Sources, Understand Compliance, Report Compliance, Advertise Information Source

INFORMATION VIRTUALIZATION & FEDERATION

Information Virtualization hides the complexity of the information landscape
[Diagram: the Information Virtualization layer sits between consumers and the information landscape]
• Consumers: Data Scientist, Line of Business
• Requests: Report on Values, View related Values, Search Values, Browse Sources, Analyze Values, Provision
• Layers: Information Provisioning, Information Delivery, Data Access APIs, Semantic/Business Objects

Different Styles of Information Provisioning
[Diagram: four styles of information provisioning: Federation, Replication, Caching and Consolidation. Other labels: Analytical & Reporting Tools; Web Applications; Product Performance (Region 1, Region 2); Real-time Inventory Level; Headquarters; Stores; Primary Data Center; Backup Data Center; Cache; Database.]

Example – Integrating the enterprise across independent silos
[Diagram: two ways to build a global view over Silo 1, Silo 2 and Silo 3]
• ETL transforming data for consistency, then presenting a global view
• Federated queries over consistent data sources, presenting a global view
The optimal approach depends on how consistent the data is across the silos, how much spare capacity each silo has to support additional queries, and whether all silos are available when a global query has to be answered.
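The federated option can be illustrated with a minimal Python sketch. It assumes three silos that share a consistent schema; the in-memory SQLite databases, the `sales` table and the helper names `make_silo` and `global_average` are hypothetical stand-ins, not part of any IBM product. Each silo computes a partial aggregate locally, and an unavailable silo makes the global answer unavailable instead of silently incomplete.

```python
import sqlite3

def make_silo(rows):
    """Create an in-memory database standing in for one silo (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

def global_average(silos):
    """Answer a global query by merging per-silo partial aggregates.

    Each silo computes SUM and COUNT locally, so only two numbers per silo
    are transferred. A silo given as None is treated as unavailable.
    """
    total, count = 0.0, 0
    for name, conn in silos.items():
        if conn is None:
            raise RuntimeError(f"silo {name!r} is unavailable")
        partial_sum, partial_count = conn.execute(
            "SELECT COALESCE(SUM(amount), 0), COUNT(*) FROM sales"
        ).fetchone()
        total += partial_sum
        count += partial_count
    return total / count if count else None

silos = {
    "silo1": make_silo([("north", 10.0), ("north", 20.0)]),
    "silo2": make_silo([("south", 30.0)]),
    "silo3": make_silo([("west", 40.0)]),
}
result = global_average(silos)
print(result)
```

The sketch only works because the silos agree on schema and units; if they did not, the ETL route would be needed first.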

Example – Creating a logical warehouse
[Diagram: a requested view drawn from Deep Data (Hadoop system) and a System of Record. One repository holds detailed data maintained for exploratory analysis and investigations; the other holds structured information optimized for complex analytics and reporting.]
Information virtualization hides the complexities of where the data is located. Here different repositories are being used to host different workloads, but this complexity is hidden by the information virtualization layer.
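A logical warehouse can be sketched in Python with SQLite, assuming two attached in-memory databases stand in for the two repositories. The schema names `main` and `deep`, the `claims_*` tables and the view name `requested_view` are hypothetical. The consumer queries one view and never sees which repository a row came from.

```python
import sqlite3

# One connection plays the role of the virtualization layer.
conn = sqlite3.connect(":memory:")                  # "main": structured summary data
conn.execute("ATTACH DATABASE ':memory:' AS deep")  # "deep": detailed data (Hadoop stand-in)

conn.execute("CREATE TABLE main.claims_summary (customer TEXT, year INTEGER, total REAL)")
conn.execute("CREATE TABLE deep.claims_detail (customer TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO main.claims_summary VALUES (?, ?, ?)",
    [("alice", 2014, 300.0), ("bob", 2014, 120.0)],
)
conn.executemany(
    "INSERT INTO deep.claims_detail VALUES (?, ?, ?)",
    [("alice", 2015, 50.0), ("alice", 2015, 25.0), ("bob", 2015, 10.0)],
)

# A TEMP view is used because SQLite lets temporary views span attached databases.
conn.execute("""
    CREATE TEMP VIEW requested_view AS
    SELECT customer, year, total FROM main.claims_summary
    UNION ALL
    SELECT customer, year, SUM(amount) AS total
    FROM deep.claims_detail
    GROUP BY customer, year
""")

rows = conn.execute(
    "SELECT customer, year, total FROM requested_view ORDER BY customer, year"
).fetchall()
for row in rows:
    print(row)
```

In a real deployment the two sources would be separate systems and the view would live in a federation engine; a single SQLite process is used here only to keep the example self-contained.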

Virtual Information Collection: Information Federation Process
Database Federation
• Relational data only
• SQL pushdown
• Challenges: query optimization, out-of-memory, complex SQL/joins
Service Federation
• Data is combined in-memory
• Technology: SOA, Message Broker, Spark, BI & Reporting Tools
• Challenges: performance (network, memory, etc.)
Semantic Federation
• Use a triple store and ontology to create the virtualized interfaces on the fly. New technology, e.g. Spark
• Challenges: query optimization, security

IBM FEDERATION SOLUTIONS

BigSQL Fluid Query (federation)
 Data never lives in isolation
• Whether Hadoop serves as a landing zone or as a queryable archive, it is desirable to query data across Hadoop and active data warehouses
 Big SQL provides the ability to query heterogeneous systems
• Join Hadoop to other relational databases
• The query optimizer understands the capabilities of the external system, including available statistics
• As much work as possible is pushed to each system to process
[Diagram: a Big SQL head node and four compute nodes; each compute node runs a Task Tracker, a Data Node and Big SQL.]

BigSQL Fluid Query: Federation to RDBMS Engines
[Diagram: applications and user interaction send SQL to the BigSQL MPP engine on BigInsights (Hadoop). Table-3 is local to BigSQL, stored on HDFS & GPFS in file formats such as Parquet, CSV, Seq, RC, Avro, JSON, ORC or custom. Table-1 and Table-2 are local to relational database engines (Oracle, Teradata, Netezza, DB2).]
The application needs to join Table-1, Table-2 and Table-3.

[Diagram, next build: a Federation Engine is added to the BigSQL MPP engine. It holds Table-1 (alias) and Table-2 (alias) next to Table-3 (local); Table-1 and Table-2 remain local to the relational database engines (Oracle, Teradata, Netezza, DB2). Table-3 is stored on HDFS & GPFS in file formats such as Parquet, CSV, Seq, RC, Avro, JSON, ORC or custom.]
The application needs to join Table-1, Table-2 and Table-3.
1. Create an alias for Table-1 and Table-2 on the BigSQL Federation Engine.

[Diagram, final build: the application's SQL reaches the BigSQL Federation Engine, which sends SQL through client drivers to the relational database engines (Oracle, Teradata, Netezza, DB2) and reads Table-3 locally from HDFS & GPFS. The legend distinguishes data access from data flow.]
The application needs to join Table-1, Table-2 and Table-3.
1. Create an alias for Table-1 and Table-2 on the BigSQL Federation Engine.
2. The query optimizer pushes part of the SQL to be executed on the remote RDBMS.
3. The final join/aggregation is executed on BigSQL.
• Joins, predicates and aggregation are pushed down to the backend RDBMS engine to reduce data transfers.
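The effect of pushdown on data transfer can be sketched in Python. This is not Big SQL code: two in-memory SQLite databases stand in for the remote RDBMS and the local Hadoop table, and the `orders` and `customers` tables and all helper names are hypothetical. The sketch compares shipping every remote row against pushing the predicate and aggregation to the remote side, with the final join done on the federating engine in both cases.

```python
import sqlite3

def make_remote():
    """Stand-in for a remote RDBMS table (hypothetical 'orders' data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL, year INTEGER)")
    rows = [(cid, 10.0 * cid, year)
            for cid in range(1, 101) for year in (2013, 2014, 2015)]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return conn

def make_local():
    """Stand-in for a table that is local to the federating engine."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (customer_id INTEGER, segment TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(cid, "gold" if cid % 10 == 0 else "std") for cid in range(1, 101)],
    )
    return conn

def final_join(local, remote_rows):
    """Final join and aggregation, executed on the federating engine."""
    local.execute("DROP TABLE IF EXISTS temp.remote_part")
    local.execute("CREATE TEMP TABLE remote_part (customer_id INTEGER, amount REAL)")
    local.executemany("INSERT INTO remote_part VALUES (?, ?)", remote_rows)
    return local.execute(
        "SELECT c.segment, SUM(r.amount) FROM customers c "
        "JOIN remote_part r ON r.customer_id = c.customer_id "
        "GROUP BY c.segment ORDER BY c.segment"
    ).fetchall()

remote, local = make_remote(), make_local()

# Without pushdown: every remote row is shipped, then filtered locally.
all_rows = remote.execute("SELECT customer_id, amount, year FROM orders").fetchall()
naive = final_join(local, [(cid, amt) for cid, amt, year in all_rows if year == 2015])

# With pushdown: the predicate and aggregation run on the remote side.
pushed_rows = remote.execute(
    "SELECT customer_id, SUM(amount) FROM orders WHERE year = 2015 GROUP BY customer_id"
).fetchall()
pushed = final_join(local, pushed_rows)

print("rows shipped without pushdown:", len(all_rows))
print("rows shipped with pushdown:", len(pushed_rows))
print(pushed)
```

Both paths return the same answer; only the number of rows crossing the network differs, which is the quantity a federation optimizer tries to minimize.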

IBM Fluid Query V1.0
Unifying PureData System for Analytics (PDA) with Hadoop
 Connectors:
• Routes PDA (Netezza) queries to the top Hadoop providers
 Data movement:
• Allows rapid data movement between PDA and Hadoop
• PDA to Hadoop
• Hadoop to PDA
 Initially supported Hadoop SQL query engines:
• BigInsights – Hive2, BigSQL v1, BigSQL v3, BigSQL v4
• Hortonworks – Hive2
• Cloudera – Hive2, Impala

Netezza Fluid Query to Hadoop Engines
[Diagram: applications and user interaction send SQL to PureData for Analytics (Netezza). Fluid Query on the NPS MPP engine holds Table-1 (alias) and Table-2 (alias) next to Table-3 (local), and sends SQL through client drivers to Impala / Hive and BigSQL, where Table-1 and Table-2 are local on HDFS in file formats such as Parquet, CSV, Seq, RC, Avro, JSON and ORC.]
The application needs to join Table-1, Table-2 and Table-3.
• Joins, predicates and aggregation are applied on Hadoop via views to minimize data transfers.
• Final joins, predicates and aggregation are applied on Netezza.

Query Federation Best Practices
 Avoid complex joins across multiple disparate repositories
• Example: joining tables from BigSQL, Oracle, Teradata and Netezza in the same SQL statement.
• Consider other techniques (copying data locally, caching, etc.)
 Keep statistics current on every table that is part of the federated system
• Statistics are critical for query optimization.
 Watch out for network bandwidth and traffic
• Large data transfers can overload the network (intermediate results need to be generated)
 Consider implementing workload management and a query governor
• Prevent a federated query from overloading a system.
 Avoid complex data transformations (in-flight transformation)
• They can impact any of the involved systems
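The query governor practice can be sketched in Python as a simple admission check. The class `QueryGovernor`, its parameters and its thresholds are hypothetical and do not reflect any product's workload manager. A federated query is rejected when its estimated row count is over budget or when too many federated queries are already running against the source.

```python
import threading

class QueryGovernor:
    """Hypothetical admission control for federated queries against one source."""

    def __init__(self, max_concurrent, max_estimated_rows):
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._max_rows = max_estimated_rows

    def run(self, estimated_rows, query_fn):
        """Run query_fn only if the estimate is within budget and a slot is free."""
        if estimated_rows > self._max_rows:
            raise RuntimeError(
                f"rejected: estimate {estimated_rows} exceeds {self._max_rows} rows"
            )
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("rejected: too many concurrent federated queries")
        try:
            return query_fn()
        finally:
            # Always free the slot, even when the query itself fails.
            self._slots.release()

governor = QueryGovernor(max_concurrent=1, max_estimated_rows=1_000_000)
small = governor.run(500, lambda: "ok")
print(small)
```

The row estimate is only as good as the statistics behind it, which is one reason the list above asks for current statistics on every federated table.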

When to Apply Federation
 Building multi-temperature data systems
• Hot/warm/cold data on different repositories
 Data is changing dynamically, in particular through schema evolution
 Federated queries can perform reasonably without impacting any of the systems involved
 Real-time access to a small set of data on distributed systems
 Remote data cannot be moved to a local system
• Regulatory issues
 The number of federated queries is manageable

Some considerations to provide access to information
Access in place
 Up-to-date information
 Cost-effective
 Slower access path
• Remote Access
• Reformatting
Make a local copy
 Specially formatted for use case
 Local data access
 Local control
 Local cost
 Potentially stale values
 Consider these questions and make the best choice
• How much information?
• How rapidly is it changing?
• How frequently is it accessed?
• How much transformation is required to consume the information?
• When is the information available?
• Who owns the information?
• How easily can it be changed?
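The trade-off between access in place and a local copy can be sketched in Python as a copy that is refreshed once it exceeds a maximum age. The class `LocalCopy`, the `max_age_seconds` parameter and the inventory data are hypothetical. Reads between refreshes are local and fast but may return stale values.

```python
import time

class LocalCopy:
    """Serve a local copy of remote data and refresh it once it is too old."""

    def __init__(self, fetch_remote, max_age_seconds, clock=time.monotonic):
        self._fetch = fetch_remote
        self._max_age = max_age_seconds
        self._clock = clock
        self._value = None
        self._loaded_at = None
        self.remote_calls = 0

    def get(self):
        now = self._clock()
        stale = self._loaded_at is None or now - self._loaded_at > self._max_age
        if stale:
            # Slower access path: go to the remote source and take a new copy.
            self._value = self._fetch()
            self._loaded_at = now
            self.remote_calls += 1
        return self._value

# A controllable clock keeps the demonstration deterministic.
fake_now = [0.0]
inventory = {"widgets": 5}
cached = LocalCopy(lambda: dict(inventory), max_age_seconds=60,
                   clock=lambda: fake_now[0])

first = cached.get()        # first read goes to the remote source
inventory["widgets"] = 3    # the source changes
second = cached.get()       # local copy is served, now stale
fake_now[0] = 61.0
third = cached.get()        # copy is older than 60 seconds, so it is refreshed
print(first, second, third, cached.remote_calls)
```

How rapidly the information changes and how frequently it is accessed, two of the questions above, determine whether any `max_age_seconds` value is acceptable or whether access in place is the better choice.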

IBM INFORMATION SERVER FOR HADOOP

The Data Reservoir subsystems
[Diagram: the Data Reservoir and the teams and systems around it]
• Information Management and Governance Fabric
• Data Reservoir Repositories: Sandbox, Master Data Management, In-Memory Cache, Data Marts, Operational Data Stores, Information Warehouse (EDW), Deep Data (aka Hadoop, aka Data Lake)
• Catalogue, Self-Service Access, Enterprise IT Data Exchange, Raw Data Interaction
• Analytics Teams, Compliance Team, Information Curator, Line of Business Teams, Data Reservoir Operations
• Enterprise IT: New Sources, System of Record, Systems of Engagement

IBM Confidential
IAP PMOM Std DCP Template – V1 May, 2015
Introducing IBM Information Server for Apache Hadoop:
Information Empowerment for Your Hadoop Environment
 Superfast data ingest and processing
• Integrate, prepare and enrich data with speed and confidence, running natively on Hadoop with speeds 10-15x faster than MapReduce
• No other vendor has this scale or speed
 Complete confidence in your data
• Understand what data is available and where it came from; monitor and cleanse the quality of data; catalog metadata assets and trace lineage
• Extends existing leadership into the Hadoop domain
 Higher level of productivity
• Develop integration processes much faster than with hand coding, based on existing enterprise skills: a graphical data flow development environment with 100s of prebuilt stages and 1000s of prebuilt functions
• Proven development paradigm

Maximize your IT resources utilization through “anywhere” execution
• Optimize your integration and DQ workload based on data locality and resource availability
• Design your transformation or cleansing once and run it on your Hadoop cluster, on your traditional engine, or optimized to run on your database
[Diagram: One Integration & Quality Design, Execute “Anywhere”: Databases, Traditional ETL Engine and Hadoop cluster. This release adds the pattern of running natively on the Hadoop cluster.]

Questions?

REFERENCE MATERIAL
New Information Architectures and Capabilities
