SlideShare a Scribd company logo
1 of 22
Download to read offline
© 2015 IBM Corporation
Big SQL: A Technical Introduction
IBM and WANdisco (partner) Meetup in Jersey City
“Continuous Data Management for Open Source Storage”
Dec 9th, 2015
Virender Thakur - Big Data Sales Specialist – NY/NJ, IBM
Brett Rudenstein - Director of Product Management, Big Data, Wandisco
2 © 2015 IBM Corporation
Text Analytics
POSIX Distributed
Filesystem
Multi-workload, multi-tenant
scheduling
IBM BigInsights
Enterprise Management
Machine Learning on
Big R
Big R (R support)
IBM Open Platform with Apache Hadoop*
(HDFS, YARN, MapReduce, Ambari, Hbase, Hive, Oozie, Parquet, Parquet Format, Pig,
Snappy, Solr, Spark, Sqoop, Zookeeper, Open JDK, Knox, Slider)
IBM BigInsights
Data Scientist
IBM BigInsights
Analyst
Big SQL
BigSheets
Industry standard SQL
(Big SQL)
Spreadsheet-style
tool (BigSheets)
*IBM Open Platform with Apache Hadoop is a 100% open source Apache Hadoop distribution.
IBM will include the Open Data Platform common kernel once available.
Overview of BigInsights
Free Quick Start (non production):
• IBM Open Platform
• BigInsights Analyst, Data Scientist
features
• Community support
. . .
3 © 2015 IBM Corporation
What is Big SQL?
SQL-based
Application
Big SQL Engine
Data Storage
IBM data server
client
SQL MPP Run-time
DFS
3
 Comprehensive, standard SQL
– SELECT: joins, unions, aggregates, subqueries . . .
– GRANT/REVOKE, INSERT … INTO
– PL/SQL
– Stored procs, user-defined functions
– IBM data server JDBC and ODBC drivers
 Optimization and performance
– IBM MPP engine (C++) replaces Java MapReduce layer
– Continuous running daemons (no start up latency)
– Message passing allow data to flow between nodes
without persisting intermediate results
– In-memory operations with ability to spill to disk (useful
for aggregations, sorts that exceed available RAM)
– Cost-based query optimization with 140+ rewrite rules
 Various storage formats supported
– Text (delimited), Sequence, RCFile, ORC, Avro, Parquet
– Data persisted in DFS, Hive, HBase
– No IBM proprietary format required
 Integration with RDBMSs via LOAD, query
federation
BigInsights
4 © 2015 IBM Corporation
Hive is Really 3 Things…
Storage Format, Metastore, and Execution Engine
4
SQL Execution Engine
Hive
(Open Source)
Hive Storage Model
(open source)
CSV Parquet RC Others…Tab Delim.
Hive Metastore
(open source)MapReduce
Applications
5 © 2015 IBM Corporation
Big SQL preserves open source foundation
Leverages Hive metastore and storage formats.
No Lock-in. Data part of Hadoop, not BigSQL. Fall back to Open Source Hive Engine at any time.
5
SQL Execution Engines
IBM BigSQL
(IBM)
Hive
(Open Source)
Hive Storage Model
(open source)
CSV Parquet RC Others…Tab Delim.
Hive Metastore
(open source)
Applications
6 © 2015 IBM Corporation
Big SQL enterprise readiness capabilities simplify
management and lower impact of failures
Installation & YARN Integration High Availability & Disaster Recovery
Data Management
SPARK
Hive
Pig
BigSQL
Data Access
HDFS
MapReduce
YARN
 Simplified install with Ambari & YARN
 Hadoop 2.0 introduced YARN for comprehensive
resource management
– Dynamically manages cluster resources
– Works across heterogeneous cluster services
– Allocates CPU and memory resources based
upon a given application’s current needs
 Big SQL integrates with YARN via the Slider project
– YARN chooses suitable hosts for Big SQL worker nodes
– Big SQL resources are accounted for by YARN
– Size of the Big SQL cluster may dynamically grow or
shrink as needed
Big SQL Head node & Catalog HA
Big SQL head node maintain state
• Applications are connected to head node
• Catalog is hosted on head node
• Note: Worker nodes do not need state
Logs are shipped between primary & secondary head node
to maintain state
Queries are transferred seamlessly to another nodes if a
worker node fails
Manage Big
SQL just like
any other
YARN enabled
process!
User &
application
continues
without impact
if failure occurs!
7 © 2015 IBM Corporation
Application Portability & Integration
Data shared with Hadoop ecosystem
Comprehensive file format support
Superior enablement of IBM and Third Party
software
Performance
Modern MPP runtime
Powerful SQL query rewriter
Cost based optimizer
Optimized for concurrent user throughput
Results not constrained by memory
Federation
Distributed requests to multiple data sources
within a single SQL statement
Main data sources supported:
DB2 LUW, Teradata, Oracle, Netezza,
Informix, SQL Server
Enterprise Features
Advanced security/auditing
Resource and workload management
Self tuning memory management
Comprehensive monitoring
Rich SQL
Comprehensive SQL Support
IBM SQL PL compatibility
Extensive Analytic Functions
Distinguishing characteristics
8 © 2015 IBM Corporation
IBM Big SQL – Runs 100% of the queries
Key points
 With Impala and Hive, many
queries needed to be re-written,
some significantly
 Owing to various restrictions,
some queries could not be re-
written or failed at run-time
 Re-writing queries in a
benchmark scenario where
results are known is one thing –
doing this against real databases
in production is another
Other environments require significant effort at scale
Results for 10TB scale shown here
9 © 2015 IBM Corporation
A word about . . . query federation
 Data rarely lives in isolation
 Big SQL transparently queries heterogeneous systems
– Join Hadoop to RDBMSs
– Query optimizer understands capabilities of external system
• Including available statistics
– As much work as possible is pushed to each system to process
Head Node
Big SQL
Compute Node
Task
Tracker
Data
Node
Big
SQL
Compute Node
Task
Tracker
Data
Node
Big
SQL
Compute Node
Task
Tracker
Data
Node
Big
SQL
Compute Node
Task
Tracker
Data
Node
Big
SQL
10 © 2015 IBM Corporation
Invocation options
 Command-line interface:
Java SQL Shell (JSqsh)
 Web tooling (Data Server
Manager)
 Tools that support IBM
JDBC/ODBC driver
11 © 2015 IBM Corporation
Creating a Big SQL table
Standard CREATE TABLE DDL with extensions
create hadoop table users
(
id int not null primary key,
office_id int null,
fname varchar(30) not null,
lname varchar(30) not null)
row format delimited
fields terminated by '|'
stored as textfile;
Create view
create view my_users as
select fname, lname from biadmin.users where id > 100;
12 © 2015 IBM Corporation
Populating Tables via LOAD
 Typically best runtime performance
 Load data from local or remote file system
load hadoop using file url
'sftp://myID:myPassword@myServer.ibm.com:22/install-
dir/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt’ with SOURCE PROPERTIES
('field.delimiter'='t') INTO TABLE gosalesdw.GO_REGION_DIM overwrite;
 Loads data from RDBMS (DB2, Netezza, Teradata, Oracle, MS-SQL,
Informix) via JDBC connection
load hadoop
using jdbc connection url 'jdbc:db2://some.host.com:portNum/sampledb'
with parameters (user='myID', password='myPassword')
from table MEDIA columns (ID, NAME)
where 'CONTACTDATE < ''2012-02-01'''
into table media_db2table_jan overwrite
with load properties ('num.map.tasks' = 10);
13 © 2015 IBM Corporation
Populating Tables via INSERT
 INSERT INTO . . . SELECT FROM . . .
– Parallel read and write operations
insert into big_sales_parquet
SELECT sales.product_key, pnumb.product_name, sales.quantity, meth.order_method_en
FROM sls_sales_fact sales, sls_product_dim prod,sls_product_lookup pnumb,
sls_order_method_dim meth
WHERE
pnumb.product_language='EN'
AND sales.product_key=prod.product_key
AND prod.product_number=pnumb.product_number
AND meth.order_method_key=sales.order_method_key
and sales.quantity > 5500;
 CREATE . . . TABLE . . . AS SELECT . . .
CREATE HADOOP TABLE IF NOT EXISTS sls_product_flat
( product_key INT NOT NULL
, product_line_code INT NOT NULL
, product_type_key INT NOT NULL
, product_type_code INT NOT NULL
, product_line_en VARCHAR(90)
, product_line_de VARCHAR(90)
)
as select product_key, d.product_line_code, product_type_key,
product_type_code, product_line_en, product_line_de
from extern.sls_product_dim d, extern.sls_product_line_lookup l
where d.product_line_code = l.product_line_code;
14 © 2015 IBM Corporation
SQL capability highlights
 Query operations
– Projections, restrictions
– UNION, INTERSECT, EXCEPT
– Wide range of built-in functions (e.g. OLAP)
 Full support for subqueries
– In SELECT, FROM, WHERE and
HAVING clauses
– Correlated and uncorrelated
– Equality, non-equality subqueries
– EXISTS, NOT EXISTS, IN, ANY,
SOME, etc.
 All standard join operations
– Standard and ANSI join syntax
– Inner, outer, and full outer joins
– Equality, non-equality, cross join support
– Multi-value join
 Stored procedures, UDFs
– DB2 compatible PL/SQL support
– Cursors, flow of control (if/then/else, error handling, …), etc.
SELECT
s_name,
count(*) AS numwait
FROM
supplier,
lineitem l1,
orders,
nation
WHERE
s_suppkey = l1.l_suppkey
AND o_orderkey = l1.l_orderkey
AND o_orderstatus = 'F'
AND l1.l_receiptdate > l1.l_commitdate
AND EXISTS (
SELECT
*
FROM
lineitem l2
WHERE
l2.l_orderkey = l1.l_orderkey
AND l2.l_suppkey <> l1.l_suppkey
)
AND NOT EXISTS (
SELECT
*
FROM
lineitem l3
WHERE
l3.l_orderkey = l1.l_orderkey
AND l3.l_suppkey <> l1.l_suppkey
AND l3.l_receiptdate >
l3.l_commitdate
)
AND s_nationkey = n_nationkey
AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name;
15 © 2015 IBM Corporation
A word about . . . column masking
2) Create permissions *
CREATE MASK SALARY_MASK ON SAL_TBL FOR
COLUMN SALARY RETURN
CASE WHEN VERIFY_ROLE_FOR_USER(SESSION_USER,'MANAGER') = 1
THEN SALARY
ELSE 0.00
END
ENABLE
3) Enable access control *
ALTER TABLE SAL_TBL ACTIVATE COLUMN ACCESS CONTROL
4b) Select as a MANAGER
CONNECT TO TESTDB USER socrates
SELECT "*" FROM SAL_TBL
EMP_NO FIRST_NAME SALARY
------- ------------ -----------
1 Steve 250000
2 Chris 200000
3 Paula 1000000
3 record(s) selected.
Data
SELECT "*" FROM SAL_TBL
EMP_NO FIRST_NAME SALARY
------- ------------ -----------
1 Steve 250000
2 Chris 200000
3 Paula 1000000
1) Create and grant access and roles *
CREATE ROLE MANAGER
CREATE ROLE EMPLOYEE
GRANT SELECT ON SAL_TBL TO USER socrates
GRANT SELECT ON SAL_TBL TO USER newton
GRANT ROLE MANAGER TO USER socrates
GRANT ROLE EMPLOYEE TO USER newton
4a) Select as an EMPLOYEE
CONNECT TO TESTDB USER newton
SELECT "*" FROM SAL_TBL
EMP_NO FIRST_NAME SALARY
------- ------------ -----------
1 Steve 0
2 Chris 0
3 Paula 0
3 record(s) selected.
* Note: Steps 1, 2, and 3 are done by a user with SECADM
authority.
16 © 2015 IBM Corporation
A word about . . . row-based access control
2) Create permissions *
CREATE PERMISSION BRANCH_A_ACCESS ON BRANCH_TBL
FOR ROWS WHERE(VERIFY_ROLE_FOR_USER(SESSION_USER,'BRANCH_A_ROLE') = 1
AND
BRANCH_TBL.BRANCH_NAME = 'Branch_A')
ENFORCED FOR ALL ACCESS
ENABLE
3) Enable access control *
ALTER TABLE BRANCH_TBL ACTIVATE ROW ACCESS CONTROL
4) Select as Branch_A user
CONNECT TO TESTDB USER newton
SELECT "*" FROM BRANCH_TBL
EMP_NO FIRST_NAME BRANCH_NAME
----------- ------------ -----------
2 Chris Branch_A
3 Paula Branch_A
5 Pete Branch_A
8 Chrissie Branch_A
4 record(s) selected.
Data
SELECT "*" FROM BRANCH_TBL
EMP_NO FIRST_NAME BRANCH_NAME
------- ------------ -----------
1 Steve Branch_B
2 Chris Branch_A
3 Paula Branch_A
4 Craig Branch_B
5 Pete Branch_A
6 Stephanie Branch_B
7 Julie Branch_B
8 Chrissie Branch_A
1) Create and grant access and roles *
CREATE ROLE BRANCH_A_ROLE
GRANT ROLE BRANCH_A_ROLE TO USER newton
GRANT SELECT ON BRANCH_TBL TO USER newton
* Note: Steps 1, 2, and 3 are done by a user with SECADM
authority.
17 © 2015 IBM Corporation
A word about . . . SerDes
 Custom serializers / deserializers (SerDes)
– Read / write complex or “unusual” data formats (e.g., JSON)
– Commonly used by Hadoop community
– Developed by user or available publicly
 Users add SerDes to appropriate directory and reference SerDe when
creating table
 Example
-- Create table for JSON data using open source hive-json-serde-0.2.jar SerDe
-- Location clause points to DFS dir containing JSON data
-- External clause means DFS dir & data won’t be drop after DROP TABLE command
create external hadoop table socialmedia-json (Country varchar(20), FeedInfo
varchar(300), . . . )
row format serde 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
location '</hdfs_path>/myJSON';
select * from socialmedia-json;
18 © 2015 IBM Corporation
Get started with Big SQL: External resources
 Hadoop Dev: links to videos, white paper, lab, . . . .
https://developer.ibm.com/hadoop/
19 © 2015 IBM Corporation
Big SQL vs Spark SQL: Right Tool for the Right Job
Machine Learning
Simpler SQL
Good Performance
Ideal tool for BI Data Analysts and
production workloads
Ideal tool for Data Scientists and
discovery
Big SQL Spark SQL
Migrating existing
workloads to Hadoop
Security
Many Concurrent Users
Best Performance
Not Mutually Exclusive. Big SQL & Spark SQL can co-exist in the cluster
20 © 2015 IBM Corporation
Performance Test – Hadoop-DS
20 (Physical) Cluster
 TPC-DS stands for Transaction Processing Council – Decision
Support (workload) which is an industry standard benchmark for
SQL. Hadoop-DS is IBM derivative to work on Hadoop
Spark 1.5.0
IBM Open Platform V4.1
20 Nodes
Big SQL V4.1
IBM Open Platform V4.1
20 Nodes
IBM Open Platform V4.1 shipped with Spark 1.4.1 but upgraded to 1.5.0
*Not an official TPC-DS Benchmark.
21 © 2015 IBM Corporation
Big SQL runs more SQL out-of-box
Big SQL 4.1 Spark SQL 1.5.0
1 hour 3-4 weeksPorting Effort:
Big SQL is the
only engine that
can execute all 99
queries with
minimal porting
effort
22 © 2015 IBM Corporation
Get started with Big SQL: External resources
 Hadoop Dev: links to videos, white paper, lab, . . . .
https://developer.ibm.com/hadoop/

More Related Content

What's hot

The Challenges of SQL on Hadoop
The Challenges of SQL on HadoopThe Challenges of SQL on Hadoop
The Challenges of SQL on HadoopDataWorks Summit
 
MySQL Sharding: Tools and Best Practices for Horizontal Scaling
MySQL Sharding: Tools and Best Practices for Horizontal ScalingMySQL Sharding: Tools and Best Practices for Horizontal Scaling
MySQL Sharding: Tools and Best Practices for Horizontal ScalingMats Kindahl
 
A-Team Mobile Persistence Accelerator Overview
A-Team Mobile Persistence Accelerator OverviewA-Team Mobile Persistence Accelerator Overview
A-Team Mobile Persistence Accelerator OverviewSteven Davelaar
 
Big Data: SQL query federation for Hadoop and RDBMS data
Big Data:  SQL query federation for Hadoop and RDBMS dataBig Data:  SQL query federation for Hadoop and RDBMS data
Big Data: SQL query federation for Hadoop and RDBMS dataCynthia Saracco
 
Big SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on HadoopBig SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on HadoopWilfried Hoge
 
Avoiding.the.pitfallsof.oracle.migration.2013
Avoiding.the.pitfallsof.oracle.migration.2013Avoiding.the.pitfallsof.oracle.migration.2013
Avoiding.the.pitfallsof.oracle.migration.2013EDB
 
Spark and scala reference architecture
Spark and scala reference architectureSpark and scala reference architecture
Spark and scala reference architectureAdrian Tanase
 
Meetup Oracle Database MAD_BCN: 1.2 Oracle Database 18c (autonomous database)
Meetup Oracle Database MAD_BCN: 1.2 Oracle Database 18c (autonomous database)Meetup Oracle Database MAD_BCN: 1.2 Oracle Database 18c (autonomous database)
Meetup Oracle Database MAD_BCN: 1.2 Oracle Database 18c (autonomous database)avanttic Consultoría Tecnológica
 
EDB Postgres Platform 11 Webinar
EDB Postgres Platform 11 WebinarEDB Postgres Platform 11 Webinar
EDB Postgres Platform 11 WebinarEDB
 
IBM Power leading Cognitive Systems
IBM Power leading Cognitive SystemsIBM Power leading Cognitive Systems
IBM Power leading Cognitive SystemsHugo Blanco
 
IBM Power Systems Announcement Update
IBM Power Systems Announcement UpdateIBM Power Systems Announcement Update
IBM Power Systems Announcement UpdateDavid Spurway
 
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scalaSunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scalaMopuru Babu
 
Running ADF Faces on Tablets and Mobile Phones
Running ADF Faces on Tablets and Mobile PhonesRunning ADF Faces on Tablets and Mobile Phones
Running ADF Faces on Tablets and Mobile PhonesSteven Davelaar
 
TechEvent 2019: Status of the partnership Trivadis and EDB - Comparing Postgr...
TechEvent 2019: Status of the partnership Trivadis and EDB - Comparing Postgr...TechEvent 2019: Status of the partnership Trivadis and EDB - Comparing Postgr...
TechEvent 2019: Status of the partnership Trivadis and EDB - Comparing Postgr...Trivadis
 

What's hot (20)

The Challenges of SQL on Hadoop
The Challenges of SQL on HadoopThe Challenges of SQL on Hadoop
The Challenges of SQL on Hadoop
 
MySQL Sharding: Tools and Best Practices for Horizontal Scaling
MySQL Sharding: Tools and Best Practices for Horizontal ScalingMySQL Sharding: Tools and Best Practices for Horizontal Scaling
MySQL Sharding: Tools and Best Practices for Horizontal Scaling
 
Resume
ResumeResume
Resume
 
A-Team Mobile Persistence Accelerator Overview
A-Team Mobile Persistence Accelerator OverviewA-Team Mobile Persistence Accelerator Overview
A-Team Mobile Persistence Accelerator Overview
 
NetWeaver Gateway- Introduction to REST
NetWeaver Gateway- Introduction to RESTNetWeaver Gateway- Introduction to REST
NetWeaver Gateway- Introduction to REST
 
Big Data: SQL query federation for Hadoop and RDBMS data
Big Data:  SQL query federation for Hadoop and RDBMS dataBig Data:  SQL query federation for Hadoop and RDBMS data
Big Data: SQL query federation for Hadoop and RDBMS data
 
Machine Learning in SQL Server 2019
Machine Learning in SQL Server 2019Machine Learning in SQL Server 2019
Machine Learning in SQL Server 2019
 
NetWeaver Gateway- Gateway Service Consumption
NetWeaver Gateway- Gateway Service ConsumptionNetWeaver Gateway- Gateway Service Consumption
NetWeaver Gateway- Gateway Service Consumption
 
Big SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on HadoopBig SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on Hadoop
 
NetWeaver Gateway- Introduction to OData
NetWeaver Gateway- Introduction to ODataNetWeaver Gateway- Introduction to OData
NetWeaver Gateway- Introduction to OData
 
Avoiding.the.pitfallsof.oracle.migration.2013
Avoiding.the.pitfallsof.oracle.migration.2013Avoiding.the.pitfallsof.oracle.migration.2013
Avoiding.the.pitfallsof.oracle.migration.2013
 
Spark and scala reference architecture
Spark and scala reference architectureSpark and scala reference architecture
Spark and scala reference architecture
 
Meetup Oracle Database MAD_BCN: 1.2 Oracle Database 18c (autonomous database)
Meetup Oracle Database MAD_BCN: 1.2 Oracle Database 18c (autonomous database)Meetup Oracle Database MAD_BCN: 1.2 Oracle Database 18c (autonomous database)
Meetup Oracle Database MAD_BCN: 1.2 Oracle Database 18c (autonomous database)
 
EDB Postgres Platform 11 Webinar
EDB Postgres Platform 11 WebinarEDB Postgres Platform 11 Webinar
EDB Postgres Platform 11 Webinar
 
IBM Power leading Cognitive Systems
IBM Power leading Cognitive SystemsIBM Power leading Cognitive Systems
IBM Power leading Cognitive Systems
 
IBM Power Systems Announcement Update
IBM Power Systems Announcement UpdateIBM Power Systems Announcement Update
IBM Power Systems Announcement Update
 
SAP NetWeaver Gateway - Gateway Service Consumption
SAP NetWeaver Gateway - Gateway Service Consumption SAP NetWeaver Gateway - Gateway Service Consumption
SAP NetWeaver Gateway - Gateway Service Consumption
 
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scalaSunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
 
Running ADF Faces on Tablets and Mobile Phones
Running ADF Faces on Tablets and Mobile PhonesRunning ADF Faces on Tablets and Mobile Phones
Running ADF Faces on Tablets and Mobile Phones
 
TechEvent 2019: Status of the partnership Trivadis and EDB - Comparing Postgr...
TechEvent 2019: Status of the partnership Trivadis and EDB - Comparing Postgr...TechEvent 2019: Status of the partnership Trivadis and EDB - Comparing Postgr...
TechEvent 2019: Status of the partnership Trivadis and EDB - Comparing Postgr...
 

Similar to Big SQL NYC Event December by Virender

Big Data: SQL on Hadoop from IBM
Big Data:  SQL on Hadoop from IBM Big Data:  SQL on Hadoop from IBM
Big Data: SQL on Hadoop from IBM Cynthia Saracco
 
2014.07.11 biginsights data2014
2014.07.11 biginsights data20142014.07.11 biginsights data2014
2014.07.11 biginsights data2014Wilfried Hoge
 
InfoSphere BigInsights - Analytics power for Hadoop - field experience
InfoSphere BigInsights - Analytics power for Hadoop - field experienceInfoSphere BigInsights - Analytics power for Hadoop - field experience
InfoSphere BigInsights - Analytics power for Hadoop - field experienceWilfried Hoge
 
Big SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeBig SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeNicolas Morales
 
Informix warehouse and accelerator overview
Informix warehouse and accelerator overviewInformix warehouse and accelerator overview
Informix warehouse and accelerator overviewKeshav Murthy
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drillJulien Le Dem
 
Big SQL: Powerful SQL Optimization - Re-Imagined for open source
Big SQL: Powerful SQL Optimization - Re-Imagined for open sourceBig SQL: Powerful SQL Optimization - Re-Imagined for open source
Big SQL: Powerful SQL Optimization - Re-Imagined for open sourceDataWorks Summit
 
Breaching the 100TB Mark with SQL Over Hadoop
Breaching the 100TB Mark with SQL Over HadoopBreaching the 100TB Mark with SQL Over Hadoop
Breaching the 100TB Mark with SQL Over HadoopDataWorks Summit
 
Consolidate your SAP System landscape Teched && d-code 2014
Consolidate your SAP System landscape Teched && d-code 2014Consolidate your SAP System landscape Teched && d-code 2014
Consolidate your SAP System landscape Teched && d-code 2014Goetz Lessmann
 
2016 August POWER Up Your Insights - IBM System Summit Mumbai
2016 August POWER Up Your Insights - IBM System Summit Mumbai2016 August POWER Up Your Insights - IBM System Summit Mumbai
2016 August POWER Up Your Insights - IBM System Summit MumbaiAnand Haridass
 
Prateek tulsiyan app_integration_tcs
Prateek tulsiyan app_integration_tcsPrateek tulsiyan app_integration_tcs
Prateek tulsiyan app_integration_tcsPrateek Tulsiyan
 
OFF SHORE RECRUITER TRAINING
OFF SHORE RECRUITER TRAININGOFF SHORE RECRUITER TRAINING
OFF SHORE RECRUITER TRAININGsatish_kumar646
 
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid WarehouseUsing the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid WarehouseRizaldy Ignacio
 
IMS capabilities today
IMS capabilities todayIMS capabilities today
IMS capabilities todayKyle Charlet
 
Ibm integrated analytics system
Ibm integrated analytics systemIbm integrated analytics system
Ibm integrated analytics systemModusOptimum
 
The Central View of your Data with Postgres
The Central View of your Data with PostgresThe Central View of your Data with Postgres
The Central View of your Data with PostgresEDB
 
Hadoop, SQL & NoSQL: No Longer an Either-or Question
Hadoop, SQL & NoSQL: No Longer an Either-or QuestionHadoop, SQL & NoSQL: No Longer an Either-or Question
Hadoop, SQL & NoSQL: No Longer an Either-or QuestionTony Baer
 
Hadoop, SQL and NoSQL, No longer an either/or question
Hadoop, SQL and NoSQL, No longer an either/or questionHadoop, SQL and NoSQL, No longer an either/or question
Hadoop, SQL and NoSQL, No longer an either/or questionDataWorks Summit
 
Informatica
InformaticaInformatica
Informaticamukharji
 

Similar to Big SQL NYC Event December by Virender (20)

Big Data: SQL on Hadoop from IBM
Big Data:  SQL on Hadoop from IBM Big Data:  SQL on Hadoop from IBM
Big Data: SQL on Hadoop from IBM
 
2014.07.11 biginsights data2014
2014.07.11 biginsights data20142014.07.11 biginsights data2014
2014.07.11 biginsights data2014
 
InfoSphere BigInsights - Analytics power for Hadoop - field experience
InfoSphere BigInsights - Analytics power for Hadoop - field experienceInfoSphere BigInsights - Analytics power for Hadoop - field experience
InfoSphere BigInsights - Analytics power for Hadoop - field experience
 
Big SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeBig SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor Landscape
 
Informix warehouse and accelerator overview
Informix warehouse and accelerator overviewInformix warehouse and accelerator overview
Informix warehouse and accelerator overview
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
Big SQL: Powerful SQL Optimization - Re-Imagined for open source
Big SQL: Powerful SQL Optimization - Re-Imagined for open sourceBig SQL: Powerful SQL Optimization - Re-Imagined for open source
Big SQL: Powerful SQL Optimization - Re-Imagined for open source
 
Breaching the 100TB Mark with SQL Over Hadoop
Breaching the 100TB Mark with SQL Over HadoopBreaching the 100TB Mark with SQL Over Hadoop
Breaching the 100TB Mark with SQL Over Hadoop
 
Consolidate your SAP System landscape Teched && d-code 2014
Consolidate your SAP System landscape Teched && d-code 2014Consolidate your SAP System landscape Teched && d-code 2014
Consolidate your SAP System landscape Teched && d-code 2014
 
2016 August POWER Up Your Insights - IBM System Summit Mumbai
2016 August POWER Up Your Insights - IBM System Summit Mumbai2016 August POWER Up Your Insights - IBM System Summit Mumbai
2016 August POWER Up Your Insights - IBM System Summit Mumbai
 
Prateek tulsiyan app_integration_tcs
Prateek tulsiyan app_integration_tcsPrateek tulsiyan app_integration_tcs
Prateek tulsiyan app_integration_tcs
 
OFF SHORE RECRUITER TRAINING
OFF SHORE RECRUITER TRAININGOFF SHORE RECRUITER TRAINING
OFF SHORE RECRUITER TRAINING
 
Ibm db2 big sql
Ibm db2 big sqlIbm db2 big sql
Ibm db2 big sql
 
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid WarehouseUsing the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
Using the Power of Big SQL 3.0 to Build a Big Data-Ready Hybrid Warehouse
 
IMS capabilities today
IMS capabilities todayIMS capabilities today
IMS capabilities today
 
Ibm integrated analytics system
Ibm integrated analytics systemIbm integrated analytics system
Ibm integrated analytics system
 
The Central View of your Data with Postgres
The Central View of your Data with PostgresThe Central View of your Data with Postgres
The Central View of your Data with Postgres
 
Hadoop, SQL & NoSQL: No Longer an Either-or Question
Hadoop, SQL & NoSQL: No Longer an Either-or QuestionHadoop, SQL & NoSQL: No Longer an Either-or Question
Hadoop, SQL & NoSQL: No Longer an Either-or Question
 
Hadoop, SQL and NoSQL, No longer an either/or question
Hadoop, SQL and NoSQL, No longer an either/or questionHadoop, SQL and NoSQL, No longer an either/or question
Hadoop, SQL and NoSQL, No longer an either/or question
 
Informatica
InformaticaInformatica
Informatica
 

Big SQL NYC Event December by Virender

  • 1. © 2015 IBM Corporation Big SQL: A Technical Introduction IBM and WANdisco (partner) Meetup in Jersey City “Continuous Data Management for Open Source Storage” Dec 9th, 2015 Virender Thakur - Big Data Sales Specialist – NY/NJ, IBM Brett Rudenstein - Director of Product Management, Big Data, Wandisco
  • 2. 2 © 2015 IBM Corporation Text Analytics POSIX Distributed Filesystem Multi-workload, multi-tenant scheduling IBM BigInsights Enterprise Management Machine Learning on Big R Big R (R support) IBM Open Platform with Apache Hadoop* (HDFS, YARN, MapReduce, Ambari, Hbase, Hive, Oozie, Parquet, Parquet Format, Pig, Snappy, Solr, Spark, Sqoop, Zookeeper, Open JDK, Knox, Slider) IBM BigInsights Data Scientist IBM BigInsights Analyst Big SQL BigSheets Industry standard SQL (Big SQL) Spreadsheet-style tool (BigSheets) *IBM Open Platform with Apache Hadoop is a 100% open source Apache Hadoop distribution. IBM will include the Open Data Platform common kernel once available. Overview of BigInsights Free Quick Start (non production): • IBM Open Platform • BigInsights Analyst, Data Scientist features • Community support . . .
  • 3. 3 © 2015 IBM Corporation What is Big SQL? SQL-based Application Big SQL Engine Data Storage IBM data server client SQL MPP Run-time DFS 3  Comprehensive, standard SQL – SELECT: joins, unions, aggregates, subqueries . . . – GRANT/REVOKE, INSERT … INTO – PL/SQL – Stored procs, user-defined functions – IBM data server JDBC and ODBC drivers  Optimization and performance – IBM MPP engine (C++) replaces Java MapReduce layer – Continuous running daemons (no start up latency) – Message passing allow data to flow between nodes without persisting intermediate results – In-memory operations with ability to spill to disk (useful for aggregations, sorts that exceed available RAM) – Cost-based query optimization with 140+ rewrite rules  Various storage formats supported – Text (delimited), Sequence, RCFile, ORC, Avro, Parquet – Data persisted in DFS, Hive, HBase – No IBM proprietary format required  Integration with RDBMSs via LOAD, query federation BigInsights
  • 4. 4 © 2015 IBM Corporation Hive is Really 3 Things… Storage Format, Metastore, and Execution Engine 4 SQL Execution Engine Hive (Open Source) Hive Storage Model (open source) CSV Parquet RC Others…Tab Delim. Hive Metastore (open source)MapReduce Applications
  • 5. 5 © 2015 IBM Corporation Big SQL preserves open source foundation Leverages Hive metastore and storage formats. No Lock-in. Data part of Hadoop, not BigSQL. Fall back to Open Source Hive Engine at any time. 5 SQL Execution Engines IBM BigSQL (IBM) Hive (Open Source) Hive Storage Model (open source) CSV Parquet RC Others…Tab Delim. Hive Metastore (open source) Applications
  • 6. 6 © 2015 IBM Corporation Big SQL enterprise readiness capabilities simplify management and lower impact of failures Installation & YARN Integration High Availability & Disaster Recovery Data Management SPARK Hive Pig BigSQL Data Access HDFS MapReduce YARN  Simplified install with Ambari & YARN  Hadoop 2.0 introduced YARN for comprehensive resource management – Dynamically manages cluster resources – Works across heterogeneous cluster services – Allocates CPU and memory resources based upon a given application’s current needs  Big SQL integrates with YARN via the Slider project – YARN chooses suitable hosts for Big SQL worker nodes – Big SQL resources are accounted for by YARN – Size of the Big SQL cluster may dynamically grow or shrink as needed Big SQL Head node & Catalog HA Big SQL head node maintain state • Applications are connected to head node • Catalog is hosted on head node • Note: Worker nodes do not need state Logs are shipped between primary & secondary head node to maintain state Queries are transferred seamlessly to another nodes if a worker node fails Manage Big SQL just like any other YARN enabled process! User & application continues without impact if failure occurs!
  • 7. 7 © 2015 IBM Corporation Application Portability & Integration Data shared with Hadoop ecosystem Comprehensive file format support Superior enablement of IBM and Third Party software Performance Modern MPP runtime Powerful SQL query rewriter Cost based optimizer Optimized for concurrent user throughput Results not constrained by memory Federation Distributed requests to multiple data sources within a single SQL statement Main data sources supported: DB2 LUW, Teradata, Oracle, Netezza, Informix, SQL Server Enterprise Features Advanced security/auditing Resource and workload management Self tuning memory management Comprehensive monitoring Rich SQL Comprehensive SQL Support IBM SQL PL compatibility Extensive Analytic Functions Distinguishing characteristics
  • 8. 8 © 2015 IBM Corporation IBM Big SQL – Runs 100% of the queries Key points  With Impala and Hive, many queries needed to be re-written, some significantly  Owing to various restrictions, some queries could not be re- written or failed at run-time  Re-writing queries in a benchmark scenario where results are known is one thing – doing this against real databases in production is another Other environments require significant effort at scale Results for 10TB scale shown here
  • 9. 9 © 2015 IBM Corporation A word about . . . query federation  Data rarely lives in isolation  Big SQL transparently queries heterogeneous systems – Join Hadoop to RDBMSs – Query optimizer understands capabilities of external system • Including available statistics – As much work as possible is pushed to each system to process Head Node Big SQL Compute Node Task Tracker Data Node Big SQL Compute Node Task Tracker Data Node Big SQL Compute Node Task Tracker Data Node Big SQL Compute Node Task Tracker Data Node Big SQL
  • 10. 10 © 2015 IBM Corporation Invocation options  Command-line interface: Java SQL Shell (JSqsh)  Web tooling (Data Server Manager)  Tools that support IBM JDBC/ODBC driver
  • 11. 11 © 2015 IBM Corporation Creating a Big SQL table Standard CREATE TABLE DDL with extensions create hadoop table users ( id int not null primary key, office_id int null, fname varchar(30) not null, lname varchar(30) not null) row format delimited fields terminated by '|' stored as textfile; Create view create view my_users as select fname, lname from biadmin.users where id > 100;
  • 12. 12 © 2015 IBM Corporation Populating Tables via LOAD  Typically best runtime performance  Load data from local or remote file system load hadoop using file url 'sftp://myID:myPassword@myServer.ibm.com:22/install- dir/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt’ with SOURCE PROPERTIES ('field.delimiter'='t') INTO TABLE gosalesdw.GO_REGION_DIM overwrite;  Loads data from RDBMS (DB2, Netezza, Teradata, Oracle, MS-SQL, Informix) via JDBC connection load hadoop using jdbc connection url 'jdbc:db2://some.host.com:portNum/sampledb' with parameters (user='myID', password='myPassword') from table MEDIA columns (ID, NAME) where 'CONTACTDATE < ''2012-02-01''' into table media_db2table_jan overwrite with load properties ('num.map.tasks' = 10);
  • 13. 13 © 2015 IBM Corporation Populating Tables via INSERT  INSERT INTO . . . SELECT FROM . . . – Parallel read and write operations insert into big_sales_parquet SELECT sales.product_key, pnumb.product_name, sales.quantity, meth.order_method_en FROM sls_sales_fact sales, sls_product_dim prod,sls_product_lookup pnumb, sls_order_method_dim meth WHERE pnumb.product_language='EN' AND sales.product_key=prod.product_key AND prod.product_number=pnumb.product_number AND meth.order_method_key=sales.order_method_key and sales.quantity > 5500;  CREATE . . . TABLE . . . AS SELECT . . . CREATE HADOOP TABLE IF NOT EXISTS sls_product_flat ( product_key INT NOT NULL , product_line_code INT NOT NULL , product_type_key INT NOT NULL , product_type_code INT NOT NULL , product_line_en VARCHAR(90) , product_line_de VARCHAR(90) ) as select product_key, d.product_line_code, product_type_key, product_type_code, product_line_en, product_line_de from extern.sls_product_dim d, extern.sls_product_line_lookup l where d.product_line_code = l.product_line_code;
  • 14. 14 © 2015 IBM Corporation SQL capability highlights  Query operations – Projections, restrictions – UNION, INTERSECT, EXCEPT – Wide range of built-in functions (e.g. OLAP)  Full support for subqueries – In SELECT, FROM, WHERE and HAVING clauses – Correlated and uncorrelated – Equality, non-equality subqueries – EXISTS, NOT EXISTS, IN, ANY, SOME, etc.  All standard join operations – Standard and ANSI join syntax – Inner, outer, and full outer joins – Equality, non-equality, cross join support – Multi-value join  Stored procedures, UDFs – DB2 compatible PL/SQL support – Cursors, flow of control (if/then/else, error handling, …), etc. SELECT s_name, count(*) AS numwait FROM supplier, lineitem l1, orders, nation WHERE s_suppkey = l1.l_suppkey AND o_orderkey = l1.l_orderkey AND o_orderstatus = 'F' AND l1.l_receiptdate > l1.l_commitdate AND EXISTS ( SELECT * FROM lineitem l2 WHERE l2.l_orderkey = l1.l_orderkey AND l2.l_suppkey <> l1.l_suppkey ) AND NOT EXISTS ( SELECT * FROM lineitem l3 WHERE l3.l_orderkey = l1.l_orderkey AND l3.l_suppkey <> l1.l_suppkey AND l3.l_receiptdate > l3.l_commitdate ) AND s_nationkey = n_nationkey AND n_name = ':1' GROUP BY s_name ORDER BY numwait desc, s_name;
  • 15. 15 © 2015 IBM Corporation A word about . . . column masking 2) Create permissions * CREATE MASK SALARY_MASK ON SAL_TBL FOR COLUMN SALARY RETURN CASE WHEN VERIFY_ROLE_FOR_USER(SESSION_USER,'MANAGER') = 1 THEN SALARY ELSE 0.00 END ENABLE 3) Enable access control * ALTER TABLE SAL_TBL ACTIVATE COLUMN ACCESS CONTROL 4b) Select as a MANAGER CONNECT TO TESTDB USER socrates SELECT "*" FROM SAL_TBL EMP_NO FIRST_NAME SALARY ------- ------------ ----------- 1 Steve 250000 2 Chris 200000 3 Paula 1000000 3 record(s) selected. Data SELECT "*" FROM SAL_TBL EMP_NO FIRST_NAME SALARY ------- ------------ ----------- 1 Steve 250000 2 Chris 200000 3 Paula 1000000 1) Create and grant access and roles * CREATE ROLE MANAGER CREATE ROLE EMPLOYEE GRANT SELECT ON SAL_TBL TO USER socrates GRANT SELECT ON SAL_TBL TO USER newton GRANT ROLE MANAGER TO USER socrates GRANT ROLE EMPLOYEE TO USER newton 4a) Select as an EMPLOYEE CONNECT TO TESTDB USER newton SELECT "*" FROM SAL_TBL EMP_NO FIRST_NAME SALARY ------- ------------ ----------- 1 Steve 0 2 Chris 0 3 Paula 0 3 record(s) selected. * Note: Steps 1, 2, and 3 are done by a user with SECADM authority.
  • 16. 16 © 2015 IBM Corporation A word about . . . row-based access control 2) Create permissions * CREATE PERMISSION BRANCH_A_ACCESS ON BRANCH_TBL FOR ROWS WHERE(VERIFY_ROLE_FOR_USER(SESSION_USER,'BRANCH_A_ROLE') = 1 AND BRANCH_TBL.BRANCH_NAME = 'Branch_A') ENFORCED FOR ALL ACCESS ENABLE 3) Enable access control * ALTER TABLE BRANCH_TBL ACTIVATE ROW ACCESS CONTROL 4) Select as Branch_A user CONNECT TO TESTDB USER newton SELECT "*" FROM BRANCH_TBL EMP_NO FIRST_NAME BRANCH_NAME ----------- ------------ ----------- 2 Chris Branch_A 3 Paula Branch_A 5 Pete Branch_A 8 Chrissie Branch_A 4 record(s) selected. Data SELECT "*" FROM BRANCH_TBL EMP_NO FIRST_NAME BRANCH_NAME ------- ------------ ----------- 1 Steve Branch_B 2 Chris Branch_A 3 Paula Branch_A 4 Craig Branch_B 5 Pete Branch_A 6 Stephanie Branch_B 7 Julie Branch_B 8 Chrissie Branch_A 1) Create and grant access and roles * CREATE ROLE BRANCH_A_ROLE GRANT ROLE BRANCH_A_ROLE TO USER newton GRANT SELECT ON BRANCH_TBL TO USER newton * Note: Steps 1, 2, and 3 are done by a user with SECADM authority.
  • 17. 17 © 2015 IBM Corporation A word about . . . SerDes  Custom serializers / deserializers (SerDes) – Read / write complex or “unusual” data formats (e.g., JSON) – Commonly used by Hadoop community – Developed by user or available publicly  Users add SerDes to appropriate directory and reference SerDe when creating table  Example -- Create table for JSON data using open source hive-json-serde-0.2.jar SerDe -- Location clause points to DFS dir containing JSON data -- External clause means DFS dir & data won’t be drop after DROP TABLE command create external hadoop table socialmedia-json (Country varchar(20), FeedInfo varchar(300), . . . ) row format serde 'org.apache.hadoop.hive.contrib.serde2.JsonSerde' location '</hdfs_path>/myJSON'; select * from socialmedia-json;
  • 18. 18 © 2015 IBM Corporation Get started with Big SQL: External resources  Hadoop Dev: links to videos, white paper, lab, . . . . https://developer.ibm.com/hadoop/
  • 19. 19 © 2015 IBM Corporation Big SQL vs Spark SQL: Right Tool for the Right Job Machine Learning Simpler SQL Good Performance Ideal tool for BI Data Analysts and production workloads Ideal tool for Data Scientists and discovery Big SQL Spark SQL Migrating existing workloads to Hadoop Security Many Concurrent Users Best Performance Not Mutually Exclusive. Big SQL & Spark SQL can co-exist in the cluster
  • 20. 20 © 2015 IBM Corporation Performance Test – Hadoop-DS 20 (Physical) Cluster  TPC-DS stands for Transaction Processing Council – Decision Support (workload) which is an industry standard benchmark for SQL. Hadoop-DS is IBM derivative to work on Hadoop Spark 1.5.0 IBM Open Platform V4.1 20 Nodes Big SQL V4.1 IBM Open Platform V4.1 20 Nodes IBM Open Platform V4.1 shipped with Spark 1.4.1 but upgraded to 1.5.0 *Not an official TPC-DS Benchmark.
  • 21. 21 © 2015 IBM Corporation Big SQL runs more SQL out-of-box Big SQL 4.1 Spark SQL 1.5.0 1 hour 3-4 weeksPorting Effort: Big SQL is the only engine that can execute all 99 queries with minimal porting effort
  • 22. 22 © 2015 IBM Corporation Get started with Big SQL: External resources  Hadoop Dev: links to videos, white paper, lab, . . . . https://developer.ibm.com/hadoop/