2. Please Note
• IBM’s statements regarding its plans, directions, and intent are subject to change or
withdrawal without notice at IBM’s sole discretion.
• Information regarding potential future products is intended to outline our general
product direction and it should not be relied on in making a purchasing decision.
• The information mentioned regarding potential future products is not a commitment,
promise, or legal obligation to deliver any material, code or functionality. Information
about potential future products may not be incorporated into any contract.
• The development, release, and timing of any future features or functionality described
for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks
in a controlled environment. The actual throughput or performance that any user will
experience will vary depending upon many factors, including considerations such as
the amount of multiprogramming in the user’s job stream, the I/O configuration, the
storage configuration, and the workload processed. Therefore, no assurance can be
given that an individual user will achieve results similar to those stated here.
3. Agenda
• Big Data & the Hybrid Data Warehouse
A quick recap on Big Data
The Hybrid Data Warehouse
• Big SQL 3.0 – SQL on Hadoop without compromises
What is Big SQL ?
Features
• Setting up a Hybrid Data Warehouse with Big SQL
Federation Overview
Configuration
5. Information is at the Center of a New Wave of Opportunity… And Organizations Need Deeper Insights
• 44x as much data and content over the coming decade: 800,000 petabytes (2009) growing to 35 zettabytes (2020)
• 80% of the world's data is unstructured
• 1 in 3 business leaders frequently make decisions based on information they don't trust, or don't have
• 1 in 2 business leaders say they don't have access to the information they need to do their jobs
• 83% of CIOs cited "Business intelligence and analytics" as part of their visionary plans to enhance competitiveness
• 60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions
6. Hadoop
• Framework / ecosystem of components aimed at distributing work
across (very) big clusters for parallel processing
• Components
A distributed file system running on commodity hardware (HDFS)
The MapReduce programming model & associated APIs
Hive, HBase, Sqoop, etc ...
• Hadoop nodes store and process data
Bringing the program to the data
Easy scalability – just add more nodes
7. Bringing the Program to the Data
[Diagram. A logical file is split into blocks 1, 2, 3 distributed across the cluster's nodes. The application (compute) submits a MapReduce job; MR tasks run on the nodes that hold each split, and the result is returned to the application.]
8. Hadoop for Big Data ?
• Well suited for Big Data
Distributed, parallel execution
Cheap(er) per storage unit
Flexible: no schema
Highly available
• But …
Data needs to be integrated to generate insights
Prioritization of applications – capitalizing on investment
[Quadrant: high priority / high performance SQL applications; low priority, low performance SQL applications; semi-structured data applications; experimental applications]
9. Hybrid Data Warehouse
[Big Data Reference Architecture diagram. All data sources (raw data, structured data, video/audio, network/sensor data) flow through information ingestion and operational information (stream processing, data integration, master data) into three zones: a real-time analytic zone (streams, analytic applications); a landing area, analytics zone and archive (text analytics, data mining, entity analytics, machine learning); and exploration, integrated warehouse, and mart zones (discovery, deep reflection, operational, predictive). These feed decision management, BI and predictive analytics, and analytic applications / intelligence analysis.]
10. Hybrid Data Warehouse
[Repeat of the Big Data Reference Architecture diagram from slide 9.]
11. Hybrid Data Warehouse – Business Scenarios
• (1) Application and Data Portability
Analytic Sandbox – analytic dev / test, exploratory analytics, lower cost analytics, unstructured data
DW Off-load – complete off-load, low priority applications, cold data
Hadoop for ETL – big or unstructured data sets, using MapReduce
• (2) Hybrid Data Warehouse
Joining with Archive – archived & hot data
Correlating data across multiple sources and types – structured & unstructured
Access data where optimal – where it makes most performance sense
[Architecture excerpt: landing area, analytics zone and archive (text analytics, data mining, entity analytics, machine learning); exploration, integrated warehouse, and mart zones (discovery, deep reflection, operational, predictive).]
12. SQL on Hadoop ?
• Hadoop / HDFS great for storing lots of data cheaply
• But …
Requires strong programming expertise
Steep learning curve
• Yet, many, if not most, use cases are about structured data
• Why not use SQL where its strengths shine?
Familiar, widely used syntax
Separation of what you want vs. how to get it
Robust ecosystems of tools
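The "what you want vs. how to get it" separation can be made concrete with a minimal sketch (table and column names are hypothetical, not from the deck): an aggregation that would require custom MapReduce code is a single declarative statement in SQL.

```sql
-- Hypothetical sales table stored in Hadoop; names are illustrative only.
-- The engine, not the programmer, decides how to parallelize the scan,
-- the grouping, and the sort across the cluster.
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region
ORDER BY total_sales DESC;
```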
14. Big SQL 3.0
• IBM BigInsights SQL-on-Hadoop solution
BigInsights is IBM's enterprise ready Hadoop
distribution
Big SQL is a standard component of BigInsights
• Integrates seamlessly with other components
• Big SQL applies SQL to your data in Hadoop
Performant
Rich SQL Compliance
15. What is Big SQL ?
• Architected for low latency and high
throughput
• MapReduce replaced with a modern MPP
architecture
Compiler and runtime are native code
Worker daemons live on cluster nodes
• Continuously running
• Processing happens locally at the data
Message passing allows data to flow
directly between nodes
• Operations occur in memory with the
ability to spill to disk
Supports aggregations and sorts larger
than available RAM
[Diagram. SQL-based applications connect through the IBM Data Server Client to Big SQL's SQL MPP runtime inside InfoSphere BigInsights, which reads data sources in Parquet, CSV, Seq, RC, Avro, ORC, JSON, and custom formats.]
16. Architecture
[Cluster diagram. Management nodes host the Big SQL master node, the Big SQL scheduler, the DDL FMP and UDF FMP processes, the database service, and the Hive metastore / Hive server. Each compute node runs a Big SQL worker node (Java I/O FMP, native I/O FMP, UDF FMP, local temp data) alongside the HDFS data node, MR task tracker, and other services, with HDFS data blocks stored locally.
*FMP = Fenced mode process]
17. Open Architecture
• No proprietary storage format
• A “table” is simply a view on your
Hadoop data
• Table definitions shared with Hive
The Hive Metastore catalogs table
definitions
Reading/writing data logic is shared
with Hive
Definitions can be shared across the
Hadoop ecosystem
• Can still use MapReduce & other
Hadoop ecosystem tools
Sometimes SQL is not the answer
[Diagram. Hive, Pig, Sqoop, and Big SQL all access the Hadoop cluster through the Hive APIs, sharing table definitions in the Hive metastore.]
18. Big SQL 3.0 - At a glance
• Application Portability & Integration
Data shared with Hadoop ecosystem
Comprehensive file format support
Superior enablement of IBM software
Enhanced by third party software
• Performance
Modern MPP runtime
Powerful SQL query rewriter
Cost based optimizer
Optimized for concurrent user throughput
Results not constrained by memory
• Federation
Distributed requests to multiple data sources within a single SQL statement
Main data sources supported: DB2 LUW, Teradata, Oracle, Netezza
• Enterprise Features
Advanced security / auditing
Resource and workload management
Self tuning memory management
Comprehensive monitoring
• Rich SQL
Comprehensive SQL support
IBM SQL PL compatibility
Next in Jasmine A: IIH-4666 – Big SQL Is Here: Blistering Fast SQL Access To Your Hadoop Data
19. Rich SQL Support
• Modern SQL capabilities
• All standard join operations
Standard and ANSI join syntax - inner, outer, and full outer joins
Equality, non-equality, cross join support
Multi-value join (WHERE (c1, c2) = …)
UNION, INTERSECT, EXCEPT
• Full support for subqueries
In SELECT, FROM, WHERE and HAVING
Correlated and uncorrelated
Equality, non-equality subqueries
EXISTS, NOT EXISTS, IN, ANY, SOME, etc.
• SQL Procedures, Flow of Control, etc ...
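As a hedged sketch (the schema is hypothetical, not from the deck), several of these capabilities can appear in a single statement:

```sql
-- Hypothetical tables: customers(custkey, name, nation, mktsegment)
--                      orders(orderkey, custkey, totalprice, status)
SELECT c.name, o.totalprice
FROM customers c
LEFT OUTER JOIN orders o                                  -- ANSI outer join syntax
  ON c.custkey = o.custkey
WHERE (c.nation, c.mktsegment) = ('FRANCE', 'BUILDING')   -- multi-value comparison
  AND EXISTS (SELECT 1 FROM orders o2                     -- correlated subquery
              WHERE o2.custkey = c.custkey
                AND o2.status = 'OPEN');
```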
20. Drivers & Tooling Support
• Big SQL 3.0 adopts IBM's standard Data Server Client Drivers
Robust, standards compliant ODBC, JDBC, and .NET drivers
Same driver used for DB2 LUW, DB2/z and Informix
Expands support to numerous languages (Python, Ruby, Perl, etc.)
• Together with rich SQL support, provides application portability
Allows interaction with external Information Management tools
21. Federation at a Glance
• Federation enables Big SQL 3.0 to access a variety of existing
relational data stores and run queries across the two systems
Joins, mixed workloads
• Wide range of vendors & systems supported
DB2 LUW, Oracle, Netezza, Teradata
More coming …
• Fully integrated into the Big SQL processing engine
Optimizer will pick most efficient plan based on remote data source
capabilities
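A hedged illustration of a mixed workload (all names hypothetical): a single federated statement joins a Big SQL table over HDFS data with a nickname pointing at a remote relational table.

```sql
-- weblogs: a Big SQL table over data stored in HDFS (hypothetical)
-- DB2_CUSTOMERS: a nickname over a table in a remote DB2 LUW database (hypothetical)
SELECT c.name, COUNT(*) AS visits
FROM weblogs w
JOIN DB2_CUSTOMERS c ON w.custkey = c.custkey
GROUP BY c.name;
```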
23. Federation – A Closer Look
• Server: A remote data source.
• Wrapper: Library handling communication between engine &
remote database client.
• Nickname: A local name representing an object (Table, View,
etc…) in the remote data source.
• User mapping: An association between a local and a remote
authorization ID.
24. The Wrapper
• A library to access a particular type of data source
Typically one wrapper for each external vendor and / or version
• One wrapper to be registered for each data source type, regardless
of the number of data sources of that type
• Implemented by means of a library of routines called wrapper
module
Performs remote query execution using remote system client APIs
• Register in federated database using CREATE WRAPPER
statement
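For example, registering the Oracle wrapper (the library name comes from the support matrix on a later slide; the wrapper name itself is arbitrary):

```sql
-- One registration serves every Oracle data source defined afterwards.
CREATE WRAPPER NET8 LIBRARY 'libdb2net8.so'
```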
25. The Server
• Defines the properties and options of a specific data source
Type and version of data source
Database name for the data source
Access parameters & other metadata specific to the data source
• A server is defined using the CREATE SERVER statement
• A wrapper for this type of data source must have been previously
registered to the federated server
• Multiple servers can be defined for the same remote data source
instance
• e.g. Multiple servers may be defined for two different databases
of a remote Oracle instance
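A sketch of that last point (all names hypothetical; assumes an Oracle wrapper, here called NET8, is already registered): two servers over two databases of the same remote Oracle instance.

```sql
-- Each server points at a different database (TNS entry) of the same instance.
CREATE SERVER ORA_SALES TYPE ORACLE VERSION 11 WRAPPER NET8
  OPTIONS (NODE 'ora_sales_tns', PUSHDOWN 'Y');

CREATE SERVER ORA_HR TYPE ORACLE VERSION 11 WRAPPER NET8
  OPTIONS (NODE 'ora_hr_tns', PUSHDOWN 'Y');
```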
26. Nicknames & Mappings
• Nickname
A nickname “maps” a remote table or view into Big SQL
Once declared, can be used transparently by the application
• Mappings
Possible to map other objects from the remote data source locally
• User ID
• Data Types
• Functions
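A minimal sketch tying the pieces together (all names hypothetical; the server ORASERV is assumed to be already defined): once the user mapping and nickname are declared, the remote table behaves like a local one.

```sql
-- dbuser's statements run on the remote source as 'orauser'.
CREATE USER MAPPING FOR dbuser SERVER ORASERV
  OPTIONS (REMOTE_AUTHID 'orauser', REMOTE_PASSWORD 'orapwd');

-- ORA_ORDERS now stands in for the remote SALES.ORDERS table.
CREATE NICKNAME ORA_ORDERS FOR ORASERV.SALES.ORDERS;

-- Transparent use by the application:
SELECT COUNT(*) FROM ORA_ORDERS WHERE status = 'OPEN';
```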
28. Putting it All Together
[Diagram. The client / application sends SQL to the Big SQL engine on the Big SQL 3.0 management node. The optimizer and runtime resolve nicknames, and the wrappers send the remote statement through the data source client to the remote relational data source; Big SQL worker nodes on the compute nodes process the Hadoop-resident data.]
29. Practical Example
What the application submits:
SELECT c.cust_nation, SUM(o.totalprice) FROM customers c, orders o
WHERE o.orderstatus = 'OPEN' AND c.custkey = o.custkey AND c.mktsegment = 'BUILDING'
GROUP BY c.cust_nation
What Big SQL Federation sends to each source:
SELECT o.custkey, o.totalprice FROM orders o WHERE o.orderstatus = 'OPEN' (against Orders)
SELECT c.custkey, c.cust_nation FROM customers c WHERE c.mktsegment = 'BUILDING' (against Customers)
What Big SQL Federation then does:
Join rows from both sources
Sort them by cust_nation and sum up total order price for each nation
Return result to application
30. Optimization & Push-down
• Federated execution plan chosen by cost based optimizer
• Different plans depending on how much work is executed locally vs. how much is executed on the remote data source (“push-down”)
31. String Comparison
• Things to consider when comparing string types
Collation sequence - A > B ?
Blank sensitivity - “ABC” = “ABC ” ?
Empty string as NULL - “” = NULL ?
• Big SQL uses BINARY collation
Byte for byte comparison of the data – Hive behaviour
Blank sensitive, empty strings are not NULL
• More string operation processing can be pushed to the remote
server if the collations are compatible
Use the COLLATING_SEQUENCE parameter to indicate
compatibility with the BINARY collation
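A hedged sketch of what the BINARY collation implies for predicates (table and column names are hypothetical):

```sql
-- Collation: byte-for-byte, so 'a' (0x61) compares greater than 'B' (0x42).
SELECT name FROM t WHERE name > 'B';
-- Blank sensitive: 'ABC' and 'ABC ' are different values.
SELECT name FROM t WHERE name = 'ABC ';   -- does not match rows holding 'ABC'
-- Empty strings are not NULL:
SELECT name FROM t WHERE name = '';       -- matches rows holding '', unlike IS NULL
```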
32. Remote Data Source String Comparison Behaviour
• Big SQL must be made aware of the remote data source string comparison behaviour
OPTIONS parameter when declaring server

Data Source | Collation | Blank Sensitive | Empty Strings are NULL | SERVER OPTIONS parameter (recommended)
DB2 LUW     | Identity  | N               | N                      | COLLATING_SEQUENCE=N, PUSHDOWN=Y
Oracle      | Binary    | Y               | Y                      | COLLATING_SEQUENCE=N, PUSHDOWN=Y
Teradata    | ASCII     | N               | N                      | COLLATING_SEQUENCE=N, PUSHDOWN=Y
Netezza     | Binary    | Y               | N                      | COLLATING_SEQUENCE=Y, PUSHDOWN=Y
33. Federation - Installation & Support
• Available out of the box
Setup by BigInsights installation
Wrappers available for the following RDBMSs: DB2, Teradata, Oracle, Netezza
• Versions supported

RDBMS    | Version                 | Wrapper Library
DB2 LUW  | 9.7, 9.8, 10.1, 10.5    | libdb2drda.so
Teradata | 12, 13                  | libdb2teradata.so
Oracle   | 11g, 11gR1, 11gR2       | libdb2net8.so
Netezza  | 4.6, 5.0, 6.0, 7.1, 7.2 | libdb2rcodbc.so
34. Federation – Set up Overview
• Set-up the environment
• Set-up the data source client
Environment variables and/or entries in $HOME/sqllib/cfg/db2dj.ini
If db2dj.ini is modified, reload the db2profile
• Create the wrapper
• Define the server object on Big SQL
• Create the nicknames
• Create the user & data type mappings
35. Federation – Example Detailed Set-up for Teradata
• Set-up the data source client
Add the Teradata client location to your environment
export TERADATA_LIB_DIR=/opt/teradata/client/lib64
Run the djxlinkTeradata tool
su root
<HOME>/sqllib/bin/djxlinkTeradata
Set the TERADATA_CHARSET variable in the db2dj.ini file
TERADATA_CHARSET=ASCII
• Create the wrapper
CREATE WRAPPER TERA LIBRARY 'libdb2teradata.so'
36. Federation – Example Detailed Set-up for Teradata
• Define the server object on Big SQL
CREATE SERVER TERASERV TYPE TERADATA
VERSION 13 WRAPPER TERA
AUTHORIZATION 'terauser' PASSWORD 'terapwd'
OPTIONS (NODE 'teranode.ibm.com', PUSHDOWN 'Y',
COLLATING_SEQUENCE 'N')
• Create the user & data type mappings
CREATE USER MAPPING FOR dbuser SERVER TERASERV
OPTIONS (REMOTE_AUTHID 'terauser', REMOTE_PASSWORD
'terapwd')
• Create the Nickname
CREATE NICKNAME TD_CUSTOMERS
FOR TERASERV.SALES.CUSTOMERS
37. Federation – Example Detailed Set-up for Teradata
• Run a query involving the remote table through the nickname
SELECT * FROM TD_CUSTOMERS
38. Conclusion
• Big Data offers new opportunities & potential
for competitive advantages
• Hadoop is the solution, but must be integrated
with existing data management infrastructure
• Big SQL 3.0 offers a performant, rich SQL-on-Hadoop
solution, with powerful federation capabilities, on
which to build your hybrid data warehouse
39. Reference
• IBM InfoSphere BigInsights 3.0 Knowledge Center
• IBM Big SQL 3.0
• Set up and use federation in InfoSphere BigInsights Big SQL V3.0
• Try it for yourself at the drop-in lab
LCI-6260: Federating Big SQL 3.0 with a Relational Data Store
• Coming next in this room !
IIH-4666: Big SQL Is Here: Blistering Fast SQL Access To Your
Hadoop data
40. Want to Learn More ?
• Get it
BigInsights Quick Start Edition VM – ibm.co/quickstart
Analytics for Hadoop Service – bluemix.net
Big SQL Tech Preview – bigsql.imdemocloud.com
• Learn it
Follow online tutorials – ibm.biz/tutorial
Enroll in online classes – BigDataUniversity.com
• HadoopDev
All links available; watch video demos, read articles, etc.
https://developer.ibm.com/hadoop
41. We Value Your Feedback!
• Don’t forget to submit your Insight session and speaker feedback!
Your feedback is very important to us – we use it to continually
improve the conference.
• Access the Insight Conference Connect tool to quickly submit your
surveys from your smartphone, laptop or conference kiosk.