© 2015 IBM Corporation
Information Virtualization: Query Federation on Data Lakes
Beate Porst
porst@us.ibm.com
Product Manager Information Server
Jo Ramos
joramos@us.ibm.com
Distinguished Engineer – Big Data and Analytics @IBM
© 2015 IBM Corporation2
Agenda
 Data Lakes and Data Reservoirs
 Information Virtualization and Federation
 Examples of Federation and Best Practices
 Information Integration on Hadoop
© 2015 IBM Corporation3
The true value of Big Data is in context
Raw data
Feature extraction metadata
Domain linkages
Full contextual analytics
Location risk
Occupational risk
Dietary risk
Family history
Actuarial data
Government statistics
Epidemic data
Chemical exposure
Personal financial situation
Social relationships
Travel history
Weather history
. . .
. . .
Patient records
© 2015 IBM Corporation4
A growing data demand … and organizational tensions
Data Scientists seeking data for new analytics models.
Marketer seeking data for new campaigns.
Fraud investigator seeking data to understand the details of suspicious activity.
Agility
Data Access Freedom
Any kinds of data
Powerful Analysis & Visualization
Security
Data Privacy
Standards
..
Application Developer
Knowledge Worker
Lines of Business IT Organization
© 2015 IBM Corporation5
Why a Data Reservoir and Not a Lake
 Data Lake: data flows in “naturally” and just sits there
 Data Reservoir: built to extract value from the data
© 2015 IBM Corporation6
The Data Reservoir subsystems
Data Reservoir
Information Management and Governance Fabric
Data Reservoir Repositories: SandBox, Master Data Management, Cache Data, Data Marts, Operational Data Stores, Information Warehouse (EDW), Deep Data (aka Hadoop, aka Data Lake)
Access paths: Catalogue, Self-Service Access, Enterprise IT Data Exchange, Raw Data Interaction
Teams: Analytics Teams; Governance, Risk and Compliance Team; Information Curator; Line of Business Teams; Data Reservoir Operations
Connected systems: Enterprise IT, New Sources, System of Record, Systems of Engagement
© 2015 IBM Corporation8
Data Reservoir Logical Architecture
[Diagram: Data Reservoir logical architecture. Labels recovered from the diagram, listed without their original layout.]
Data Reservoir Repositories: Harvested Data, Descriptive Data, Shared Operational Data, Deposited Data, Historical Data, Published
Repository components: Information Warehouse, Deep Data, Information Views, Catalog, Search Index, Asset Hub, Activity Hub, Code Hub, Content Hub, Audit Data, Operational History, Offline Archive, Sand Boxes, Reporting Data Marts, Object Cache
Information Integration & Governance: Information Broker, Operational Governance Hub, Code Hub, Workflow, Staging Areas, Guards, Monitor
Interfaces: Catalog Interfaces, Raw Data Interaction, View-based Interaction, Enterprise IT Interaction, Service Interfaces, Data Ingestion, Publishing Feeds, Continuous Analytics, Streaming Analytics, Event Correlation
Users and systems: Line of Business Applications; Data Reservoir Operations; Governance, Risk and Compliance Team; Information Curator; Analytics Tools; Reporting; Analytical Insight Applications; Consumers of Insight; Enterprise IT; System of Record Applications; Enterprise Service Bus; Systems of Engagement; New Sources; Third Party Feeds; Third Party APIs; Internal Sources; Other Systems of Insight; Other Data Reservoirs
Interactions: Information Service Calls, Search Requests, Report Requests, Deploy Decision Models, Deploy Real-time Decision Models, Decision Model Management, Data Access, Data Deposit, Data In, Data Out, Events to Evaluate, Notifications, Curation Interaction, Management, Understand Information Sources, Understand Compliance, Report Compliance, Advertise Information Source, Simple ad hoc Discovery and Analysis, Access and Feedback
© 2015 IBM Corporation9
INFORMATION VIRTUALIZATION & FEDERATION
© 2015 IBM Corporation10
Information Virtualization hides the complexity of the information landscape
Information
Virtualization
Report on
Values
View related
Values
Search
Values
Browse
Sources
Analyze
Values
Provision
Information Provisioning
Information Delivery
Data Access APIs
Semantic/Business Objects
Data Scientist
Line of
Business
© 2015 IBM Corporation11
Different Styles of Information Provisioning
Federation
Replication Caching
Consolidation
Analytical & Reporting Tools
Web Applications
Product
Performance
Real-time
Inventory Level
Consolidation
Headquarters Stores
Primary
Data Center
Backup
Data Center
Replication
Replication
Cache
Region 1
Product
Performance
Region 2
Product
Performance
Consolidation
Replication
Replication
Database
Federation
© 2015 IBM Corporation12
Example – Integrating the enterprise across independent silos
Two ways to build a global view over Silo 1, Silo 2 and Silo 3:
• ETL: transform the data for consistency, then load it into the global view.
• Federated queries: build the global view directly on top of consistent data sources.
The optimal approach depends on how consistent the data is across the silos, how much spare capacity each silo has to support additional queries, and whether all silos are available when a global query must be answered.
© 2015 IBM Corporation13
Example – Creating a logical warehouse
Deep Data (Hadoop system)
System of Record
Requested View
Information virtualization hides the complexity of where the data is located. Here, different repositories host different workloads, but the information virtualization layer hides that from the consumer.
Detailed data maintained for exploratory analysis and investigations.
Structured information optimized for complex analytics and reporting.
© 2015 IBM Corporation14
Virtual Information Collection
Information Federation Process
Database Federation
• Relational data only
• SQL pushdown
• Challenges: query optimization, out-of-memory conditions, complex SQL/joins
Service Federation
• Data is combined in-memory
• Technology: SOA, Message Broker, Spark, BI & Reporting Tools
• Challenges: performance (network, memory, etc.)
Semantic Federation
• Uses a triple store and an ontology to create the virtualized interfaces on the fly; new technology, e.g. Spark
• Challenges: query optimization, security
© 2015 IBM Corporation15
IBM FEDERATION SOLUTIONS
© 2015 IBM Corporation16
BigSQL Fluid Query (federation)
 Data never lives in isolation
• Whether Hadoop serves as a landing zone or as a queryable archive, it is desirable to query data across Hadoop and active data warehouses
 Big SQL provides the ability to query heterogeneous systems
• Join Hadoop to other relational databases
• The query optimizer understands the capabilities of the external system, including available statistics
• As much work as possible is pushed to each system to process
[Diagram: a Head Node running Big SQL, and four Compute Nodes, each running a Task Tracker, a Data Node and Big SQL]
© 2015 IBM Corporation17
BigInsights (hadoop)
BIGSQL MPP Engine
Relational Engines
Relational Database
Engines
Applications
User
Interaction
BigSQL Fluid Query: Federation to RDBMS Engines
Local Data
Sources
SQL
??
Oracle
Teradata Netezza
DB2
Table-2 (local)
Table-1 (local)
Table-3 (local)
File Formats
Parquet
CSV
Seq
RC
Avro
JSON
Custom
ORC
Application
needs to join
Table-1, Table-2
and Table-3
HDFS & GPFS
© 2015 IBM Corporation18
BigInsights (hadoop)
BIGSQL MPP Engine
Federation
Engine
Relational Engines
Relational Database
Engines
Applications
User
Interaction
BigSQL Fluid Query: Federation to RDBMS Engines
Local Data
Sources
SQL
Oracle
Teradata Netezza
DB2
Table-2 (local)
Table-1 (local)
Table-3 (local)
Table-2 (alias)
Table-1 (alias)
File Formats
Parquet
CSV
Seq
RC
Avro
JSON
Custom
ORC
Application
needs to join
Table-1, Table-2
and Table-3
1. Create Alias for Table-1 and Table-2 on
BigSQL Federation Engine.
HDFS & GPFS
© 2015 IBM Corporation19
BigInsights (hadoop)
BIGSQL MPP Engine
Federation
Engine
Relational Engines
Relational Database
Engines
Applications
User
Interaction
BigSQL Fluid Query: Federation to RDBMS Engines
Local Data
Sources
SQL
• Joins, Predicates, Aggregation are pushed down to backend RDBMS engine to reduce
data transfers.
Oracle
Teradata Netezza
DB2
Table-2 (local)
Table-1 (local)
SQL
Table-3 (local)
Table-2 (alias)
Table-1 (alias)
File Formats
Parquet
CSV
Seq
RC
Avro
JSON
Custom
ORC
SQL
Application
needs to join
Table-1, Table-2
and Table-3
1. Create an alias for Table-1 and Table-2 on the BigSQL Federation Engine.
2. The query optimizer pushes part of the SQL down to be executed on the remote RDBMS.
3. The final join/aggregation is executed on BigSQL.
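The pushdown flow in the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Big SQL syntax or behavior: two in-memory SQLite databases stand in for the remote RDBMS and for the local Hadoop-side engine, the `remote` connection plays the role of the alias, and the table names (`orders`, `customers`, `remote_totals`) and the function `federated_totals` are hypothetical.

```python
import sqlite3

# Hypothetical stand-in for a remote RDBMS reached through an alias.
remote = sqlite3.connect(":memory:")
remote.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL, region TEXT)")
remote.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "EU"), (1, 15.0, "EU"), (2, 7.5, "US"), (3, 20.0, "EU")],
)

# Hypothetical stand-in for the local table held on the Hadoop side.
local = sqlite3.connect(":memory:")
local.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
local.executemany(
    "INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Bo"), (3, "Cy")]
)


def federated_totals(region):
    # Step 2: the predicate and the aggregation run on the remote engine,
    # so only one row per customer is transferred.
    pushed_down = remote.execute(
        "SELECT customer_id, SUM(amount) FROM orders "
        "WHERE region = ? GROUP BY customer_id",
        (region,),
    ).fetchall()

    # Step 3: land the reduced intermediate result locally, then do the final join.
    local.execute("CREATE TEMP TABLE remote_totals (customer_id INTEGER, total REAL)")
    local.executemany("INSERT INTO remote_totals VALUES (?, ?)", pushed_down)
    rows = local.execute(
        "SELECT c.name, t.total FROM customers c "
        "JOIN remote_totals t ON c.customer_id = t.customer_id "
        "ORDER BY c.name"
    ).fetchall()
    local.execute("DROP TABLE remote_totals")
    return rows


print(federated_totals("EU"))
```

The point of the sketch is the division of labor: filtering and grouping happen where the data lives, and only the reduced result is moved for the final join.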
HDFS & GPFS
Client Driver
Client Driver
Data Access
Data flow
© 2015 IBM Corporation20
IBM Fluid Query V1.0
 Connectors:
• Routes PDA (Netezza) queries to the top Hadoop providers
 Data movement:
• Allows rapid data movement between PDA and Hadoop
• PDA to Hadoop
• Hadoop to PDA
 Initial Supported Hadoop SQL Query Engines
• BigInsights – Hive2, BigSQL v1, BigSQL v3, BigSQL v4
• Hortonworks – Hive2
• Cloudera – Hive2, Impala
Unifying PureData System for Analytics (PDA) with Hadoop
© 2015 IBM Corporation21
Applications
User
Interaction
PureData for Analytics
(Netezza)
Netezza Fluid Query to Hadoop Engines
NPS MPP Engine
Fluid
Query
Table-1 (alias)
Table-3 (local)
SQL SQL
Table-2 (alias)
Joins, predicates and aggregation are applied on Hadoop via views to minimize data transfers.
Final joins, predicates and aggregation are applied on Netezza.
Client Driver
Client Driver
Application
needs to join
Table-1, Table-2
and Table-3
Impala / Hive
BigSQL
Table-1 (local)
Table-2 (local)
SQL
Local Data
Sources
File Formats
Parquet
CSV
Seq
RC
Avro
JSON
ORC
HDFS
Data flow
© 2015 IBM Corporation22
Query Federation Best Practices
 Avoid Complex Joins Across Multiple Disparate Repositories
• Example: joining tables from BigSQL, Oracle, Teradata and Netezza in the same SQL statement.
• Consider other techniques (copying data locally, caching, etc.)
 Keep Statistics Current on Every Table That Is Part of the Federated System
• Statistics are critical for query optimization.
 Watch Out for Network Bandwidth and Traffic
• Large data transfers can overload the network (intermediate results need to be generated)
 Consider Implementing Workload Management and a Query Governor
• Prevent a federated query from overloading a system.
 Avoid Complex Data Transformations (in-flight transformation)
• They can impact any of the involved systems
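As a rough illustration of the query governor idea, the sketch below caps the number of rows a single query may pull from a source. It is a deliberately simple assumption of what a governor might do. Real workload managers also limit CPU, elapsed time and concurrency. The names `governed_fetch` and `RowLimitExceeded` are hypothetical, and an in-memory SQLite database stands in for the remote source.

```python
import sqlite3


class RowLimitExceeded(Exception):
    """Raised when a query returns more rows than the governor allows."""


def governed_fetch(conn, sql, params=(), max_rows=1000):
    # Fetch one row beyond the limit so an overflow is detected
    # without pulling the whole result set from the source.
    cursor = conn.execute(sql, params)
    rows = cursor.fetchmany(max_rows + 1)
    if len(rows) > max_rows:
        raise RowLimitExceeded(f"query returned more than {max_rows} rows")
    return rows


# Hypothetical source table with 50 rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE events (id INTEGER)")
source.executemany("INSERT INTO events VALUES (?)", [(i,) for i in range(50)])

# A selective query passes; an unfiltered one would raise RowLimitExceeded.
print(len(governed_fetch(source, "SELECT id FROM events WHERE id < 10", max_rows=20)))
```

Rejecting the oversized query early protects both the network and the remote system, at the cost of forcing the caller to add a predicate or choose another provisioning style.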
© 2015 IBM Corporation23
When to Apply Federation
 When building multi-temperature data systems
• Hot/Cold/Warm data on different repositories
 When data is changing dynamically, in particular through schema evolution
 When federated queries can perform reasonably without impacting any of the systems involved
 When real-time access is needed to a small set of data on distributed systems
 When remote data cannot be moved to the local system
• Regulatory issues
 When the number of federated queries is manageable
© 2015 IBM Corporation24
Some considerations to provide access to information
Access in place
 Up-to-date information
 Cost-effective
 Slower access path
• Remote Access
• Reformatting
Make a local copy
 Specially formatted for use case
 Local data access
 Local control
 Local cost
 Potentially stale values
 Consider these questions and make the best choice
• How much information?
• How rapidly is it changing?
• How frequently is it accessed?
• How much transformation is required to consume the information?
• When is the information available?
• Who owns the information?
• How easily can it be changed?
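The trade-off between accessing data in place and keeping a local copy can be made concrete with a small sketch. It assumes a hypothetical `LocalCopy` wrapper that serves a locally held value until it is older than a chosen maximum age, then goes back to the source. The class, its parameters and the injected clock are illustrative only.

```python
import time


class LocalCopy:
    """Serves a local copy of a remote value, refreshing it once it is too old."""

    def __init__(self, fetch_remote, max_age_seconds, clock=time.monotonic):
        self._fetch_remote = fetch_remote
        self._max_age = max_age_seconds
        self._clock = clock
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at > self._max_age:
            # Slower path: access in place, always up to date.
            self._value = self._fetch_remote()
            self._fetched_at = now
        # Fast path: local access, possibly stale by up to max_age_seconds.
        return self._value


# Usage with a fake clock so the behavior is deterministic.
fake_now = [0.0]
remote_reads = []


def read_remote():
    remote_reads.append(fake_now[0])
    return len(remote_reads)


copy = LocalCopy(read_remote, max_age_seconds=10, clock=lambda: fake_now[0])
first = copy.get()       # remote read
fake_now[0] = 5.0
second = copy.get()      # served locally, may be stale
fake_now[0] = 11.0
third = copy.get()       # copy is too old, remote read again
print(first, second, third, len(remote_reads))
```

How rapidly the information changes and how frequently it is accessed determine a sensible maximum age: a short one approaches access in place, a long one approaches a static local copy.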
© 2015 IBM Corporation25
IBM INFORMATION SERVER FOR HADOOP
© 2015 IBM Corporation26
The Data Reservoir subsystems
Data Reservoir
Information Management and Governance Fabric
Data Reservoir Repositories: SandBox, Master Data Management, In-Memory Cache, Data Marts, Operational Data Stores, Information Warehouse (EDW), Deep Data (aka Hadoop, aka Data Lake)
Access paths: Catalogue, Self-Service Access, Enterprise IT Data Exchange, Raw Data Interaction
Teams: Analytics Teams; Governance, Risk and Compliance Team; Information Curator; Line of Business Teams; Data Reservoir Operations
Connected systems: Enterprise IT, New Sources, System of Record, Systems of Engagement
© 2015 IBM Corporation27
IBM Confidential
IAP PMOM Std DCP Template – V1 May, 2015
Introducing IBM Information Server for Apache Hadoop:
Information Empowerment for Your Hadoop Environment
Superfast data ingest and processing
• Integrate, prepare and enrich data with speed and confidence
• Runs natively on Hadoop, with speeds 10-15x faster than MapReduce
Complete confidence in your data
• Understand what data is available and where it came from
• Monitor and cleanse the quality of data; catalog metadata assets and trace lineage
Higher Level of Productivity
• Develop integration processes much faster than with hand coding, based on existing enterprise skills
• Graphical data flow development environment with 100s of prebuilt stages and 1000s of prebuilt functions
Callouts: no other vendor has this scale or speed; extend existing leadership into the Hadoop domain; proven development paradigm
© 2015 IBM Corporation28
• Optimize your integration and DQ workload based on data locality and resource availability
• Design your transformation or cleansing once and run it on your Hadoop cluster, on your traditional engine, or optimized to run on your database
Traditional ETL Engine
Databases
Execute “Anywhere”
One Integration & Quality Design
Maximize your IT resource utilization through “anywhere” execution
Callout: this release adds this pattern to run natively on the Hadoop cluster
© 2015 IBM Corporation29
Questions?
© 2015 IBM Corporation30
REFERENCE MATERIAL
New Information Architectures and Capabilities
