2. Please Note
• IBM’s statements regarding its plans, directions, and intent are subject to change or
withdrawal without notice at IBM’s sole discretion.
• Information regarding potential future products is intended to outline our general
product direction and it should not be relied on in making a purchasing decision.
• The information mentioned regarding potential future products is not a commitment,
promise, or legal obligation to deliver any material, code or functionality. Information
about potential future products may not be incorporated into any contract.
• The development, release, and timing of any future features or functionality described
for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks
in a controlled environment. The actual throughput or performance that any user will
experience will vary depending upon many factors, including considerations such as
the amount of multiprogramming in the user’s job stream, the I/O configuration, the
storage configuration, and the workload processed. Therefore, no assurance can be
given that an individual user will achieve results similar to those stated here.
3. Agenda
• Big Data & the Hybrid Data Warehouse
A quick recap on Big Data
The Hybrid Data Warehouse
• Big SQL 3.0 – SQL on Hadoop without compromises
What is Big SQL ?
Features
• Setting up a Hybrid Data Warehouse with Big SQL
Federation Overview
Configuration
5. Information is at the Center of a New Wave of Opportunity… And Organizations Need Deeper Insights
• 44x as much data and content over the coming decade: 800,000 petabytes (2009) growing to 35 zettabytes (2020)
• 80% of the world's data is unstructured
• 1 in 3 business leaders frequently make decisions based on information they don't trust, or don't have
• 1 in 2 business leaders say they don't have access to the information they need to do their jobs
• 83% of CIOs cited "Business intelligence and analytics" as part of their visionary plans to enhance competitiveness
• 60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions
6. Hadoop
• Framework / ecosystem of components aimed at distributing work
across (very) big clusters for parallel processing
• Components
A distributed file system running on commodity hardware (HDFS)
The MapReduce programming model & associated APIs
Hive, HBase, Sqoop, etc ...
• Hadoop nodes store and process data
Bringing the program to the data
Easy scalability – just add more nodes
7. Bringing the Program to the Data
[Diagram. A logical file is split into blocks 1, 2, 3 distributed across the cluster's nodes. The application (compute) submits a MapReduce job; MR tasks run on the nodes that hold each split, and the result is returned to the application.]
8. Hadoop for Big Data ?
• Well suited for Big Data
Distributed, parallel execution
Cheap(er) per storage unit
Flexible: no schema
Highly available
• But …
Data needs to be integrated to generate insights
Prioritization of applications – capitalizing on investment
[Quadrant: high priority / high performance SQL applications; low priority, low performance SQL applications; semi-structured data applications; experimental applications]
9. Hybrid Data Warehouse
[Big Data Reference Architecture diagram. All data sources (raw data, structured data, video/audio, network/sensor data) flow through information ingestion and operational information (stream processing, data integration, master data) into three zones: a real-time analytic zone (streams, analytic applications); a landing area, analytics zone and archive (text analytics, data mining, entity analytics, machine learning); and exploration, integrated warehouse, and mart zones (discovery, deep reflection, operational, predictive). These feed decision management, BI and predictive analytics, and analytic applications / intelligence analysis.]
10. Hybrid Data Warehouse
[Repeat of the Big Data Reference Architecture diagram from slide 9.]
11. Hybrid Data Warehouse – Business Scenarios
• (1) Application and Data Portability
Analytic Sandbox – analytic dev / test, exploratory analytics, lower cost analytics, unstructured data
DW Off-load – complete off-load, low priority applications, cold data
Hadoop for ETL – big or unstructured data sets, using MapReduce
• (2) Hybrid Data Warehouse
Joining with Archive – archived & hot data
Correlating data across multiple sources and types – structured & unstructured
Access data where optimal – where it makes most performance sense
[Architecture excerpt: landing area, analytics zone and archive (text analytics, data mining, entity analytics, machine learning); exploration, integrated warehouse, and mart zones (discovery, deep reflection, operational, predictive).]
12. SQL on Hadoop ?
• Hadoop / HDFS great for storing lots of data cheaply
• But …
Requires strong programming expertise
Steep learning curve
• Yet, many, if not most, use cases are about structured data
• Why not use SQL where its strengths shine?
Familiar, widely used syntax
Separation of what you want vs. how to get it
Robust ecosystems of tools
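The "what you want vs. how to get it" separation can be made concrete with a minimal sketch (table and column names are hypothetical, not from the deck): an aggregation that would require custom MapReduce code is a single declarative statement in SQL.

```sql
-- Hypothetical sales table stored in Hadoop; names are illustrative only.
-- The engine, not the programmer, decides how to parallelize the scan,
-- the grouping, and the sort across the cluster.
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region
ORDER BY total_sales DESC;
```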
14. Big SQL 3.0
• IBM BigInsights SQL-on-Hadoop solution
BigInsights is IBM's enterprise ready Hadoop
distribution
Big SQL is a standard component of BigInsights
• Integrates seamlessly with other components
• Big SQL applies SQL to your data in Hadoop
Performant
Rich SQL Compliance
15. What is Big SQL ?
• Architected for low latency and high
throughput
• MapReduce replaced with a modern MPP
architecture
Compiler and runtime are native code
Worker daemons live on cluster nodes
• Continuously running
• Processing happens locally at the data
Message passing allows data to flow
directly between nodes
• Operations occur in memory with the
ability to spill to disk
Supports aggregations and sorts larger
than available RAM
[Diagram. SQL-based applications connect through the IBM Data Server Client to Big SQL's SQL MPP runtime inside InfoSphere BigInsights, which reads data sources in Parquet, CSV, Seq, RC, Avro, ORC, JSON, and custom formats.]
16. Architecture
[Cluster diagram. Management nodes host the Big SQL master node, the Big SQL scheduler, the DDL FMP and UDF FMP processes, the database service, and the Hive metastore / Hive server. Each compute node runs a Big SQL worker node (Java I/O FMP, native I/O FMP, UDF FMP, local temp data) alongside the HDFS data node, MR task tracker, and other services, with HDFS data blocks stored locally.
*FMP = Fenced mode process]
17. Open Architecture
• No proprietary storage format
• A “table” is simply a view on your
Hadoop data
• Table definitions shared with Hive
The Hive Metastore catalogs table
definitions
Reading/writing data logic is shared
with Hive
Definitions can be shared across the
Hadoop ecosystem
• Can still use MapReduce & other
Hadoop ecosystem tools
Sometimes SQL is not the answer
[Diagram. Hive, Pig, Sqoop, and Big SQL all access the Hadoop cluster through the Hive APIs, sharing table definitions in the Hive metastore.]
18. Big SQL 3.0 - At a glance
• Application Portability & Integration
Data shared with Hadoop ecosystem
Comprehensive file format support
Superior enablement of IBM software
Enhanced by third party software
• Performance
Modern MPP runtime
Powerful SQL query rewriter
Cost based optimizer
Optimized for concurrent user throughput
Results not constrained by memory
• Federation
Distributed requests to multiple data sources within a single SQL statement
Main data sources supported: DB2 LUW, Teradata, Oracle, Netezza
• Enterprise Features
Advanced security / auditing
Resource and workload management
Self tuning memory management
Comprehensive monitoring
• Rich SQL
Comprehensive SQL support
IBM SQL PL compatibility
Next in Jasmine A: IIH-4666 – Big SQL Is Here: Blistering Fast SQL Access To Your Hadoop Data
19. Rich SQL Support
• Modern SQL capabilities
• All standard join operations
Standard and ANSI join syntax - inner, outer, and full outer joins
Equality, non-equality, cross join support
Multi-value join (WHERE (c1, c2) = …)
UNION, INTERSECT, EXCEPT
• Full support for subqueries
In SELECT, FROM, WHERE and HAVING
Correlated and uncorrelated
Equality, non-equality subqueries
EXISTS, NOT EXISTS, IN, ANY, SOME, etc.
• SQL Procedures, Flow of Control, etc ...
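As a hedged sketch (the schema is hypothetical, not from the deck), several of these capabilities can appear in a single statement:

```sql
-- Hypothetical tables: customers(custkey, name, nation, mktsegment)
--                      orders(orderkey, custkey, totalprice, status)
SELECT c.name, o.totalprice
FROM customers c
LEFT OUTER JOIN orders o                                  -- ANSI outer join syntax
  ON c.custkey = o.custkey
WHERE (c.nation, c.mktsegment) = ('FRANCE', 'BUILDING')   -- multi-value comparison
  AND EXISTS (SELECT 1 FROM orders o2                     -- correlated subquery
              WHERE o2.custkey = c.custkey
                AND o2.status = 'OPEN');
```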
20. Drivers & Tooling Support
• Big SQL 3.0 adopts IBM's standard Data Server Client Drivers
Robust, standards compliant ODBC, JDBC, and .NET drivers
Same driver used for DB2 LUW, DB2/z and Informix
Expands support to numerous languages (Python, Ruby, Perl, etc.)
• Together with rich SQL support, provides application portability
Allows interaction with external Information Management tools
21. Federation at a Glance
• Federation enables Big SQL 3.0 to access a variety of existing
relational data stores and run queries across the two systems
Joins, mixed workloads
• Wide range of vendors & systems supported
DB2 LUW, Oracle, Netezza, Teradata
More coming …
• Fully integrated into the Big SQL processing engine
Optimizer will pick most efficient plan based on remote data source
capabilities
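A hedged illustration of a mixed workload (all names hypothetical): a single federated statement joins a Big SQL table over HDFS data with a nickname pointing at a remote relational table.

```sql
-- weblogs: a Big SQL table over data stored in HDFS (hypothetical)
-- DB2_CUSTOMERS: a nickname over a table in a remote DB2 LUW database (hypothetical)
SELECT c.name, COUNT(*) AS visits
FROM weblogs w
JOIN DB2_CUSTOMERS c ON w.custkey = c.custkey
GROUP BY c.name;
```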
23. Federation – A Closer Look
• Server: A remote data source.
• Wrapper: Library handling communication between engine &
remote database client.
• Nickname: A local name representing an object (Table, View,
etc…) in the remote data source.
• User mapping: An association between a local and a remote
authorization ID.
24. The Wrapper
• A library to access a particular type of data source
Typically one wrapper for each external vendor and / or version
• One wrapper to be registered for each data source type, regardless
of the number of data sources of that type
• Implemented by means of a library of routines called wrapper
module
Performs remote query execution using remote system client APIs
• Register in federated database using CREATE WRAPPER
statement
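For example, registering the Oracle wrapper (the library name comes from the support matrix on a later slide; the wrapper name itself is arbitrary):

```sql
-- One registration serves every Oracle data source defined afterwards.
CREATE WRAPPER NET8 LIBRARY 'libdb2net8.so'
```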
25. The Server
• Defines the properties and options of a specific data source
Type and version of data source
Database name for the data source
Access parameters & other metadata specific to the data source
• A server is defined using the CREATE SERVER statement
• A wrapper for this type of data source must have been previously
registered to the federated server
• Multiple servers can be defined for the same remote data source
instance
• e.g. Multiple servers may be defined for two different databases
of a remote Oracle instance
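A sketch of that last point (all names hypothetical; assumes an Oracle wrapper, here called NET8, is already registered): two servers over two databases of the same remote Oracle instance.

```sql
-- Each server points at a different database (TNS entry) of the same instance.
CREATE SERVER ORA_SALES TYPE ORACLE VERSION 11 WRAPPER NET8
  OPTIONS (NODE 'ora_sales_tns', PUSHDOWN 'Y');

CREATE SERVER ORA_HR TYPE ORACLE VERSION 11 WRAPPER NET8
  OPTIONS (NODE 'ora_hr_tns', PUSHDOWN 'Y');
```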
26. Nicknames & Mappings
• Nickname
A nickname “maps” a remote table or view into Big SQL
Once declared, can be used transparently by the application
• Mappings
Possible to map other objects from the remote data source locally
• User ID
• Data Types
• Functions
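A minimal sketch tying the pieces together (all names hypothetical; the server ORASERV is assumed to be already defined): once the user mapping and nickname are declared, the remote table behaves like a local one.

```sql
-- dbuser's statements run on the remote source as 'orauser'.
CREATE USER MAPPING FOR dbuser SERVER ORASERV
  OPTIONS (REMOTE_AUTHID 'orauser', REMOTE_PASSWORD 'orapwd');

-- ORA_ORDERS now stands in for the remote SALES.ORDERS table.
CREATE NICKNAME ORA_ORDERS FOR ORASERV.SALES.ORDERS;

-- Transparent use by the application:
SELECT COUNT(*) FROM ORA_ORDERS WHERE status = 'OPEN';
```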
28. Putting it All Together
[Diagram. The client / application sends SQL to the Big SQL engine on the Big SQL 3.0 management node. The optimizer and runtime resolve nicknames, and the wrappers send the remote statement through the data source client to the remote relational data source; Big SQL worker nodes on the compute nodes process the Hadoop-resident data.]
29. Practical Example
What the application submits:
SELECT c.cust_nation, SUM(o.totalprice) FROM customers c, orders o
WHERE o.orderstatus = 'OPEN' AND c.custkey = o.custkey AND c.mktsegment = 'BUILDING'
GROUP BY c.cust_nation
What Big SQL Federation sends to each source:
SELECT o.custkey, o.totalprice FROM orders o WHERE o.orderstatus = 'OPEN' (against Orders)
SELECT c.custkey, c.cust_nation FROM customers c WHERE c.mktsegment = 'BUILDING' (against Customers)
What Big SQL Federation then does:
Join rows from both sources
Sort them by cust_nation and sum up total order price for each nation
Return result to application
30. Optimization & Push-down
• Federated execution plan chosen by cost based optimizer
• Different plans depending on how much work is executed locally vs. how much is executed on the remote data source (“push-down”)
31. String Comparison
• Things to consider when comparing string types
Collation sequence - A > B ?
Blank sensitivity - “ABC” = “ABC ” ?
Empty string as NULL - “” = NULL ?
• Big SQL uses BINARY collation
Byte for byte comparison of the data – Hive behaviour
Blank sensitive, empty strings are not NULL
• More string operation processing can be pushed to the remote
server if the collations are compatible
Use the COLLATING_SEQUENCE parameter to indicate
compatibility with the BINARY collation
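A hedged sketch of what the BINARY collation implies for predicates (table and column names are hypothetical):

```sql
-- Collation: byte-for-byte, so 'a' (0x61) compares greater than 'B' (0x42).
SELECT name FROM t WHERE name > 'B';
-- Blank sensitive: 'ABC' and 'ABC ' are different values.
SELECT name FROM t WHERE name = 'ABC ';   -- does not match rows holding 'ABC'
-- Empty strings are not NULL:
SELECT name FROM t WHERE name = '';       -- matches rows holding '', unlike IS NULL
```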
32. Remote Data Source String Comparison Behaviour
• Big SQL must be made aware of the remote data source string comparison behaviour
OPTIONS parameter when declaring server

Data Source | Collation | Blank Sensitive | Empty Strings are NULL | SERVER OPTIONS parameter (recommended)
DB2 LUW     | Identity  | N               | N                      | COLLATING_SEQUENCE=N, PUSHDOWN=Y
Oracle      | Binary    | Y               | Y                      | COLLATING_SEQUENCE=N, PUSHDOWN=Y
Teradata    | ASCII     | N               | N                      | COLLATING_SEQUENCE=N, PUSHDOWN=Y
Netezza     | Binary    | Y               | N                      | COLLATING_SEQUENCE=Y, PUSHDOWN=Y
33. Federation - Installation & Support
• Available out of the box
Setup by BigInsights installation
Wrappers available for the following RDBMSs: DB2, Teradata, Oracle, Netezza
• Versions supported

RDBMS    | Version                 | Wrapper Library
DB2 LUW  | 9.7, 9.8, 10.1, 10.5    | libdb2drda.so
Teradata | 12, 13                  | libdb2teradata.so
Oracle   | 11g, 11gR1, 11gR2       | libdb2net8.so
Netezza  | 4.6, 5.0, 6.0, 7.1, 7.2 | libdb2rcodbc.so
34. Federation – Set up Overview
• Set-up the environment
• Set-up the data source client
Environment variables and/or entries in $HOME/sqllib/cfg/db2dj.ini
If db2dj.ini is modified, reload the db2profile
• Create the wrapper
• Define the server object on Big SQL
• Create the nicknames
• Create the user & data type mappings
35. Federation – Example Detailed Set-up for Teradata
• Set-up the data source client
Add the Teradata client location to your environment
export TERADATA_LIB_DIR=/opt/teradata/client/lib64
Run the djxlinkTeradata tool
su root
<HOME>/sqllib/bin/djxlinkTeradata
Set the TERADATA_CHARSET variable in the db2dj.ini file
TERADATA_CHARSET=ASCII
• Create the wrapper
CREATE WRAPPER TERA LIBRARY 'libdb2teradata.so'
36. Federation – Example Detailed Set-up for Teradata
• Define the server object on Big SQL
CREATE SERVER TERASERV TYPE TERADATA
VERSION 13 WRAPPER TERA
AUTHORIZATION 'terauser' PASSWORD 'terapwd'
OPTIONS (NODE 'teranode.ibm.com', PUSHDOWN 'Y',
COLLATING_SEQUENCE 'N')
• Create the user & data type mappings
CREATE USER MAPPING FOR dbuser SERVER TERASERV
OPTIONS (REMOTE_AUTHID 'terauser', REMOTE_PASSWORD
'terapwd')
• Create the Nickname
CREATE NICKNAME TD_CUSTOMERS
FOR TERASERV.SALES.CUSTOMERS
37. Federation – Example Detailed Set-up for Teradata
• Run a query involving the remote table through the nickname
SELECT * FROM TD_CUSTOMERS
38. Conclusion
• Big Data offers new opportunities & potential
for competitive advantages
• Hadoop is the solution, but must be integrated
with existing data management infrastructure
• Big SQL 3.0 offers a performant, rich SQL-on-Hadoop
solution, with powerful federation capabilities, on
which to build your hybrid data warehouse
39. Reference
• IBM InfoSphere BigInsights 3.0 Knowledge Center
• IBM Big SQL 3.0
• Set up and use federation in InfoSphere BigInsights Big SQL V3.0
• Try it for yourself at the drop-in lab
LCI-6260: Federating Big SQL 3.0 with a Relational Data Store
• Coming next in this room !
IIH-4666: Big SQL Is Here: Blistering Fast SQL Access To Your
Hadoop data
40. Want to Learn More ?
• Get it
BigInsights Quick Start Edition VM – ibm.co/quickstart
Analytics for Hadoop Service – bluemix.net
Big SQL Tech Preview – bigsql.imdemocloud.com
• Learn it
Follow online tutorials – ibm.biz/tutorial
Enroll in online classes – BigDataUniversity.com
• HadoopDev
All links available; watch video demos, read articles, etc.
https://developer.ibm.com/hadoop
41. We Value Your Feedback!
• Don’t forget to submit your Insight session and speaker feedback!
Your feedback is very important to us – we use it to continually
improve the conference.
• Access the Insight Conference Connect tool to quickly submit your
surveys from your smartphone, laptop or conference kiosk.