Got Big Data? Then check out what Big SQL can do for you. Learn how IBM's industry-standard SQL interface enables you to leverage your existing SQL skills to query, analyze, and manipulate data managed in an Apache Hadoop environment in the cloud or on premises. This quick technical tour is filled with practical examples designed to get you started working with Big SQL in no time. Specifically, you'll learn how to create Big SQL tables over Hadoop data in HDFS, Hive, or HBase; populate Big SQL tables with data from HDFS, a remote file system, or a remote RDBMS; execute simple and complex Big SQL queries; work with non-traditional data formats; and more. These charts are for session ALB-3663 at the IBM World of Watson 2016 conference.
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 session ALB-3663)
1. Getting off to a fast start
with Big Data analytics
using Big SQL (ALB-3663)
C. M. Saracco
Oct. 25, 2016
2. Please note: IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and at IBM's sole discretion.
Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.
The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
10/19/16 World of Watson 2016
3. • Big SQL = industry-standard SQL for Hadoop platforms
• Easy on-ramp to Hadoop for SQL professionals
• Supports familiar SQL tools / apps (via JDBC and ODBC drivers)
• How to get started?
• Create tables / views. Store data in HDFS, HBase, or Hive warehouse
• Load data into tables (from remote RDBMS, files)
• Query data (project, restrict, join, union, wide range of sub-queries, wide range of built-
in functions, UDFs, . . . . )
• Explore advanced features
– Collect statistics and inspect data access plan
– Transparently join/union Hadoop & RDBMS data (query federation)
– Leverage Big Data technologies: SerDes, Spark, . . .
Overview
4. • Command line: Java SQL Shell (JSqsh)
• Web tooling (Data Server Manager)
• Tools supporting IBM JDBC/ODBC
driver
Invocation options
5. Creating a Big SQL table
Standard CREATE TABLE DDL with extensions
create hadoop table users
(
id int not null primary key,
office_id int null,
fname varchar(30) not null,
lname varchar(30) not null)
row format delimited
fields terminated by '|'
stored as textfile;
Worth noting:
• “Hadoop” keyword creates table in DFS
• Row format delimited and textfile formats are default
• Constraints not enforced (but useful for query optimization)
6. Results from previous CREATE TABLE . . .
• Data stored in subdirectory of Hive warehouse
. . . /hive/warehouse/myid.db/users
• Default schema is user ID. Can create new schemas
• “Table” is just a subdirectory under schema.db
• Table’s data are files within table subdirectory
• Metadata collected (Big SQL & Hive)
• SYSCAT.* and SYSHADOOP.* views
• Optionally, use LOCATION clause of CREATE TABLE to layer Big SQL
schema over existing DFS directory contents
• Useful if table contents already in DFS
• Avoids need to LOAD data
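The LOCATION clause described above can be sketched as follows. This is a hypothetical example: the directory path, table name, and columns are assumptions for illustration, not part of the original deck.

```sql
-- Hypothetical example: layer a Big SQL schema over files already in DFS.
-- EXTERNAL means DROP TABLE leaves the directory and its data in place.
create external hadoop table web_logs
(
   ip_addr  varchar(15),
   req_date varchar(20),
   url      varchar(200)
)
row format delimited
fields terminated by ','
stored as textfile
location '/user/myid/weblog-data';

-- No LOAD needed; the existing files are queryable immediately
select count(*) from web_logs;
```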
7. Populating Tables via LOAD
• Typically best runtime performance
• Load data from remote file system
load hadoop using file url
'sftp://myID:myPassword@myServer.ibm.com:22/install-dir/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE gosalesdw.GO_REGION_DIM overwrite;
• Loads data from RDBMS (DB2, Netezza, Teradata, Oracle, MS-SQL, Informix) via JDBC
load hadoop using jdbc connection url
'jdbc:db2://some.host.com:portNum/sampledb' with parameters (user='myID',
password='myPassword') from table MEDIA columns (ID, NAME) where 'CONTACTDATE
< ''2012-02-01''' into table media_db2table_jan overwrite with load properties
('num.map.tasks' = 10);
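Besides LOAD, a Big SQL table can also be populated with standard INSERT ... SELECT syntax. A minimal sketch, assuming a hypothetical target table with a matching layout (LOAD typically offers better bulk runtime performance, as noted above):

```sql
-- Hypothetical target table gosalesdw.go_region_dim_copy; convenient
-- for deriving one Hadoop table from another without an external load
insert into gosalesdw.go_region_dim_copy
   select * from gosalesdw.GO_REGION_DIM;
```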
8. Querying your Big SQL tables
• Same as ISO SQL-compliant RDBMS
• No special query syntax for Hadoop tables
• Projections, restrictions
• UNION, INTERSECT, EXCEPT
• Wide range of built-in functions (e.g. OLAP)
• Full support for subqueries
• All standard join operations
• . . .
SELECT
s_name,
count(*) AS numwait
FROM
supplier,
lineitem l1,
orders,
nation
WHERE
s_suppkey = l1.l_suppkey
AND o_orderkey = l1.l_orderkey
AND o_orderstatus = 'F'
AND l1.l_receiptdate > l1.l_commitdate
AND EXISTS (
SELECT
*
FROM
lineitem l2
WHERE
l2.l_orderkey = l1.l_orderkey
AND l2.l_suppkey <> l1.l_suppkey
)
AND NOT EXISTS (
SELECT
*
FROM
lineitem l3
WHERE
l3.l_orderkey = l1.l_orderkey
AND l3.l_suppkey <> l1.l_suppkey
AND l3.l_receiptdate >
l3.l_commitdate
)
AND s_nationkey = n_nationkey
AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name;
9. A word about . . . SerDes
• Custom serializers / deserializers (SerDes)
• Read / write complex or “unusual” data formats (e.g., JSON)
• Commonly used by Hadoop community. Developed by user or available publicly
• Add SerDes to directories; reference SerDe when creating table
-- Create table for JSON data using the open source hive-json-serde-0.2.jar SerDe
-- LOCATION clause points to the DFS directory containing the JSON data
-- EXTERNAL clause means the DFS directory & data won't be dropped by a DROP TABLE command
create external hadoop table socialmedia-json (Country varchar(20), . . . )
row format serde 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
location '<hdfs_path>/myJSON';
select * from socialmedia-json;
10. Sample JSON input for previous example
JSON-based social media data to load into Big SQL Table socialmedia-json defined with SerDe
11. Sample Big SQL query output for JSON data
Sample output: Select * from socialmedia-json
12. Accessing Big SQL data from Spark shell
// imports needed for the MLlib calls below
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
// establish a Hive context and query some Big SQL data
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val saleFacts = sqlContext.sql("select * from bigsql.sls_sales_fact")
// take some action on the data – count # of rows
saleFacts.count()
. . .
// transform the data as needed (create a Vector with data from 2 cols)
val subset = saleFacts.map {row => Vectors.dense(row.getDouble(16), row.getDouble(17))}
// invoke basic Spark MLlib statistical function over the data
val stats = Statistics.colStats(subset)
// print one of the statistics collected
println(stats.mean)
13. Big SQL query federation = virtualized data access
Transparent
• Appears to be one source
• Programmers don't need to know how / where data is stored
Heterogeneous
• Accesses data from diverse sources
High Function
• Full query support against all data
• Capabilities of sources as well
Autonomous
• Non-disruptive to data sources, existing applications, and systems
High Performance
• Optimization of distributed queries
[Diagram: SQL tools / applications → virtualized data → data sources]
15. Creating and using federated objects (example)
-- Create wrapper to identify client library (Oracle Net8)
CREATE WRAPPER ORA LIBRARY 'libdb2net8.so';
-- Create server for Oracle data source
CREATE SERVER ORASERV TYPE ORACLE VERSION 11 WRAPPER ORA AUTHORIZATION
"orauser" PASSWORD "orauser" OPTIONS (NODE 'TNSNODENAME', PUSHDOWN 'Y', COLLATING_SEQUENCE 'N');
-- Map the local user 'orauser' to the Oracle user 'orauser' / password 'orauser'
CREATE USER MAPPING FOR orauser SERVER ORASERV OPTIONS (REMOTE_PASSWORD 'orauser');
-- Create nickname for Oracle table / view
CREATE NICKNAME NICK1 FOR ORASERV.ORAUSER.TABLE1;
-- Query the nickname
SELECT * FROM NICK1 WHERE COL1 < 10;
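Once the nickname exists, it behaves like a local table, so Hadoop and RDBMS data can be joined transparently. A sketch, assuming the users table from the earlier CREATE TABLE example and the NICK1 nickname above (the join condition is a hypothetical illustration):

```sql
-- Hypothetical federated join: the optimizer decides which predicates
-- to push down to Oracle and which to evaluate locally in Big SQL
SELECT u.fname, u.lname, n.COL1
FROM users u, NICK1 n
WHERE u.office_id = n.COL1
  AND n.COL1 < 10;
```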
16. A word about . . . data access plans
• Cost-based optimizer with query rewrite
• Example: SELECT DISTINCT COL_PK,
COL_X . . . FROM TABLE
• Automatically rewrite query to avoid sort
– PK constraint implies no nulls or duplicates
– Transparent to programmer
• ANALYZE TABLE … collects statistics
• Automatic or manual collection
• Efficient runtime performance
• EXPLAIN reports detailed access plan
• Subset shown at right
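The statistics collection and EXPLAIN points above can be sketched as follows; the column names are illustrative assumptions based on the GOSALESDW sample schema used earlier.

```sql
-- Collect table and column statistics for the cost-based optimizer
ANALYZE TABLE gosalesdw.GO_REGION_DIM COMPUTE STATISTICS FOR ALL COLUMNS;

-- Capture the access plan; details are written to the explain tables
-- and can be formatted with DB2 tooling such as db2exfmt
EXPLAIN PLAN FOR
   SELECT region_en FROM gosalesdw.GO_REGION_DIM WHERE region_key < 100;
```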
17. Get started with Big SQL: External resources
Hadoop Dev: videos, tutorials, forum, . . . https://developer.ibm.com/hadoop/
http://www.slideshare.net/CynthiaSaracco/presentations
19. Notices and disclaimers (continued)
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other
publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of
performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be
addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-
party products to interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED,
INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents,
copyrights, trademarks or other intellectual property right.
IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document
Management System™, FASP®, FileNet®, Global Business Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM
SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON,
OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®,
pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ,
Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of
International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at:
www.ibm.com/legal/copytrade.shtml.