© 2016 IBM Corporation
Using your DB2 skills with Hadoop and Spark
Presented to TRIDEX DB2 Users Group, June 2017
C. M. Saracco, IBM Silicon Valley Lab
https://www.slideshare.net/CynthiaSaracco/presentations
© 2016 IBM Corporation2
Executive summary
§ About Apache Hadoop and Spark
− Popular open source technologies for working with Big Data
• Clustered computing > scalability
• Varied data > no pre-set structure or schema requirements
− Hadoop: distributed file system (storage), MapReduce API, . . .
− Spark: in-memory data processing (speed), built-in libraries, . . .
§ About Big SQL
− DB2-compatible query engine for Hadoop data (IBM or Hortonworks distributions)
− Based on decades of IBM R&D investment in RDBMS technology, including database
parallelism and query optimization. Strong runtime performance for analytical workloads.
§ Some ways to leverage DB2 SQL skills
− Create / manage / query “local” or distributed tables in Hadoop
− Query / join Hadoop data with DB2, Oracle, Teradata, etc. data via query federation
− Leverage Spark to query and manipulate Big SQL or DB2 data
− Leverage Big SQL to initiate Spark jobs and analyze results
© 2016 IBM Corporation3
Agenda
§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities
§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation
§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL
§ Performance: 100TB benchmark summary
§ Summary
© 2016 IBM Corporation4
Agenda
§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities
§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation
§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL
§ Performance: 100TB benchmark summary
§ Summary
© 2016 IBM Corporation5
§ 1 in 3 business leaders frequently make decisions based on information they don’t trust, or don’t have
§ 1 in 2 business leaders say they don’t have access to the information they need to do their jobs
§ 83% of CIOs cited “Business intelligence and analytics” as part of their visionary plans to enhance competitiveness
§ 60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions
Information is at the center of a new wave of opportunity… and organizations need deeper insights
§ 4 million “likes” per minute
§ 300,000 tweets per minute
§ 150 million emails per minute
§ 2.78 million video views per minute
§ 2.5 TB per day per A350 plane
§ > 1 PB per day from gas turbines
§ 1 ZB = 1 billion TB
© 2016 IBM Corporation6
Big Data adoption (study results)
§ Educate (learning about big data capabilities): 24%-26% in 2012 to 2014, 10% in 2015 (250% decrease)
§ Explore (exploring internal use cases and developing a strategy): 43%-47% in 2012 to 2014, 53% in 2015 (125% increase)
§ Engage (implementing infrastructure and running pilot activities): 22%-27% in 2012 to 2014, 25% in 2015 (0% change)
§ Execute (using big data and analytics pervasively across the enterprise): 5%-6% in 2012 to 2014, 13% in 2015 (210% increase)
2015 IBV study “Analytics: The Upside of Disruption” (ibm.biz/w3_2015analytics)
© 2016 IBM Corporation7
Return on investment period for big data and analytics projects
as reported by respondents
Big Data ROI often < 18 months
2015 IBV study “Analytics: The Upside of Disruption” (ibm.biz/w3_2015analytics)
© 2016 IBM Corporation8
§ Both open source Apache projects
− Exploit distributed computing environments
− Enable processing of large volumes of varied data
§ Hadoop
− Inspired by Google technologies (MapReduce, GFS)
− Originally designed for batch-oriented, read-intensive applications
− “Core” consists of distributed file system, MapReduce, job scheduler, utilities
− Complementary projects span data warehousing, workflow management,
columnar data storage, activity monitoring, . . .
§ Spark
− Began as a UC Berkeley project
− Fast, general-purpose engine for working with Big Data in memory
− Popular built-in libraries for machine learning, streaming data, query (SQL), . . .
− No built-in storage. Interfaces to Hadoop, other stores
About Hadoop and Spark
© 2016 IBM Corporation9
IBM contributions: Hadoop and Spark
Snapshots taken Jan. 2017.
Latest content available online
via Apache dashboards.
IOP relates to Hadoop; STC
relates to Spark.
© 2016 IBM Corporation10
What is Big SQL?
[Architecture diagram: a SQL-based application connects through the IBM data server client to the Big SQL engine (SQL MPP run-time), which accesses data stored in HDFS.]
§ Comprehensive, standard SQL for Hadoop
– SELECT: joins, unions, aggregates, subqueries . . .
– UPDATE/DELETE (HBase-managed tables)
– GRANT/REVOKE, INSERT … INTO
– SQL procedural logic (SQL PL)
– Stored procs, user-defined functions
– IBM data server JDBC and ODBC drivers
§ Optimization and performance
– IBM MPP engine (C++) replaces Java MapReduce layer
– Continuously running daemons (no start-up latency)
– Message passing allows data to flow between nodes
without persisting intermediate results
– In-memory operations with ability to spill to disk (useful
for aggregations, sorts that exceed available RAM)
– Cost-based query optimization with 140+ rewrite rules
§ Various storage formats supported
– Text (delimited), Sequence, RCFile, ORC, Avro, Parquet
– Data persisted in DFS, Hive, HBase
– No IBM proprietary format required
§ Integration with RDBMSs via LOAD, query
federation
IBM Open Platform or
Hortonworks Data Platform
© 2016 IBM Corporation11
Agenda
§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities
§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation
§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL
§ Performance: 100TB benchmark summary
§ Summary
© 2016 IBM Corporation12
§ Big SQL
− Easy on-ramp to Hadoop for DB2 SQL professionals
− Create query-ready data lake
− Offload “cold” RDBMS warehouse data to Hadoop
− . . . .
§ Some ways to use Big SQL . . .
− Create tables
− Load / insert data
− Execute complex queries
− Exploit various DB2 features: UDFs, EXPLAIN, workload management, Oracle /
Netezza SQL compatibility. . . .
− Exploit various Hadoop features: Hive, HBase, SerDes, . . .
About Hadoop and Big SQL
© 2016 IBM Corporation13
Invocation options
§ Command-line interface:
Java SQL Shell (JSqsh)
§ Web tooling (Data Server
Manager)
§ Tools that support IBM
JDBC/ODBC driver
© 2016 IBM Corporation14
Creating a Big SQL table
§ Standard CREATE TABLE DDL with extensions
create hadoop table users
(
id int not null primary key,
office_id int null,
fname varchar(30) not null,
lname varchar(30) not null)
row format delimited
fields terminated by '|'
stored as textfile;
Worth noting:
• “Hadoop” keyword creates table in HDFS
• Row format delimited and textfile formats are default
• Constraints not enforced (but useful for query optimization)
• Examples in these charts focus on HDFS storage, both within and external to the Hive warehouse
(see the sketch below); HBase examples are provided separately
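As a sketch of the “external to the Hive warehouse” case noted above, a LOCATION clause (as in Hive DDL) can point a table at an existing HDFS directory; the path, table, and column names here are hypothetical:

create hadoop table web_logs
( log_ts varchar(30),
  url varchar(200),
  status int )
row format delimited
fields terminated by ','
stored as textfile
location '/user/bigsql/weblog_data';  -- existing HDFS directory outside the Hive warehouse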
© 2016 IBM Corporation15
CREATE VIEW
§ Standard SQL syntax
create view my_users as
select fname, lname from biadmin.users where id > 100;
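A quick usage example (assuming the users table from the previous slide has been populated): the view is queried like any table, and DB2-style row limiting applies.

select fname, lname
from my_users
where lname like 'S%'
fetch first 10 rows only;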
© 2016 IBM Corporation16
Populating tables via LOAD
§ Typically best runtime performance
§ Load data from local or remote file system
load hadoop using file url
'sftp://myID:myPassword@myServer.ibm.com:22/install-dir/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE gosalesdw.GO_REGION_DIM overwrite;
§ Load data from an RDBMS (DB2, Netezza, Teradata, Oracle, MS-SQL,
Informix) via a JDBC connection
load hadoop
using jdbc connection url 'jdbc:db2://some.host.com:portNum/sampledb'
with parameters (user='myID', password='myPassword')
from table MEDIA columns (ID, NAME)
where 'CONTACTDATE < ''2012-02-01'''
into table media_db2table_jan overwrite
with load properties ('num.map.tasks' = 10);
© 2016 IBM Corporation17
Populating tables via INSERT
§ INSERT INTO . . . SELECT FROM . . .
− Parallel read and write operations
CREATE HADOOP TABLE IF NOT EXISTS big_sales_parquet
( product_key INT NOT NULL, product_name VARCHAR(150),
Quantity INT, order_method_en VARCHAR(90) )
STORED AS parquetfile;
-- source tables do not need to be in Parquet format
insert into big_sales_parquet
SELECT sales.product_key, pnumb.product_name, sales.quantity, meth.order_method_en
FROM sls_sales_fact sales, sls_product_dim prod,sls_product_lookup pnumb,
sls_order_method_dim meth
WHERE
pnumb.product_language='EN'
AND sales.product_key=prod.product_key
AND prod.product_number=pnumb.product_number
AND meth.order_method_key=sales.order_method_key
and sales.quantity > 5500;
§ INSERT INTO . . . VALUES(. . . )
− Not parallelized. 1 file per INSERT. Not recommended except for quick tests
CREATE HADOOP TABLE foo (col1 int, col2 varchar(10));
INSERT INTO foo VALUES (1, 'hello');
© 2016 IBM Corporation18
CREATE . . . TABLE . . . AS SELECT . . .
§ Create a Big SQL table based on contents of other table(s)
§ Source tables can be in different file formats or use different
underlying storage mechanisms
-- source tables in this example are external (just DFS files)
CREATE HADOOP TABLE IF NOT EXISTS sls_product_flat
( product_key INT NOT NULL
, product_line_code INT NOT NULL
, product_type_key INT NOT NULL
, product_type_code INT NOT NULL
, product_line_en VARCHAR(90)
, product_line_de VARCHAR(90)
)
as select product_key, d.product_line_code, product_type_key,
product_type_code, product_line_en, product_line_de
from extern.sls_product_dim d, extern.sls_product_line_lookup l
where d.product_line_code = l.product_line_code;
© 2016 IBM Corporation19
SQL capability highlights
§ Query operations
− Projections, restrictions
− UNION, INTERSECT, EXCEPT
− Wide range of built-in functions (e.g. OLAP)
− Various Oracle, Netezza compatibility items
§ Full support for subqueries
− In SELECT, FROM, WHERE and
HAVING clauses
− Correlated and uncorrelated
− Equality, non-equality subqueries
− EXISTS, NOT EXISTS, IN, ANY,
SOME, etc.
§ All standard join operations
− Standard and ANSI join syntax
− Inner, outer, and full outer joins
− Equality, non-equality, cross join support
− Multi-value join
§ Stored procedures, user-defined
functions, user-defined aggregates
SELECT s_name, count(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
  AND o_orderkey = l1.l_orderkey
  AND o_orderstatus = 'F'
  AND l1.l_receiptdate > l1.l_commitdate
  AND EXISTS (
    SELECT * FROM lineitem l2
    WHERE l2.l_orderkey = l1.l_orderkey
      AND l2.l_suppkey <> l1.l_suppkey )
  AND NOT EXISTS (
    SELECT * FROM lineitem l3
    WHERE l3.l_orderkey = l1.l_orderkey
      AND l3.l_suppkey <> l1.l_suppkey
      AND l3.l_receiptdate > l3.l_commitdate )
  AND s_nationkey = n_nationkey
  AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name;
© 2016 IBM Corporation20
Power of standard SQL
§ Big SQL executes all 22 TPC-H queries without modification
§ Big SQL executes all 99 TPC-DS queries without modification
§ Big SQL leverages DB2 query rewrite technology for efficient optimization
Original query:

SELECT s_name, count(*) AS numwait
FROM supplier, lineitem l1, orders, nation
WHERE s_suppkey = l1.l_suppkey
  AND o_orderkey = l1.l_orderkey
  AND o_orderstatus = 'F'
  AND l1.l_receiptdate > l1.l_commitdate
  AND EXISTS (
    SELECT * FROM lineitem l2
    WHERE l2.l_orderkey = l1.l_orderkey
      AND l2.l_suppkey <> l1.l_suppkey)
  AND NOT EXISTS (
    SELECT * FROM lineitem l3
    WHERE l3.l_orderkey = l1.l_orderkey
      AND l3.l_suppkey <> l1.l_suppkey
      AND l3.l_receiptdate > l3.l_commitdate)
  AND s_nationkey = n_nationkey
  AND n_name = ':1'
GROUP BY s_name
ORDER BY numwait desc, s_name

Re-written query (after DB2 query rewrite):

SELECT s_name, count(1) AS numwait
FROM
  (SELECT s_name FROM
    (SELECT s_name, t2.l_orderkey, l_suppkey,
            count_suppkey, max_suppkey
     FROM
       (SELECT l_orderkey,
               count(distinct l_suppkey) as count_suppkey,
               max(l_suppkey) as max_suppkey
        FROM lineitem
        WHERE l_receiptdate > l_commitdate
        GROUP BY l_orderkey) t2
     RIGHT OUTER JOIN
       (SELECT s_name, l_orderkey, l_suppkey
        FROM
          (SELECT s_name, t1.l_orderkey, l_suppkey,
                  count_suppkey, max_suppkey
           FROM
             (SELECT l_orderkey,
                     count(distinct l_suppkey) as count_suppkey,
                     max(l_suppkey) as max_suppkey
              FROM lineitem
              GROUP BY l_orderkey) t1
           JOIN
             (SELECT s_name, l_orderkey, l_suppkey
              FROM orders o
              JOIN
                (SELECT s_name, l_orderkey, l_suppkey
                 FROM nation n
                 JOIN supplier s
                   ON s.s_nationkey = n.n_nationkey
                  AND n.n_name = 'INDONESIA'
                 JOIN lineitem l
                   ON s.s_suppkey = l.l_suppkey
                 WHERE l.l_receiptdate > l.l_commitdate) l1
              ON o.o_orderkey = l1.l_orderkey
             AND o.o_orderstatus = 'F') l2
           ON l2.l_orderkey = t1.l_orderkey) a
        WHERE (count_suppkey > 1) OR ((count_suppkey = 1)
          AND (l_suppkey <> max_suppkey))) l3
     ON l3.l_orderkey = t2.l_orderkey) b
   WHERE (count_suppkey is null)
     OR ((count_suppkey = 1) AND (l_suppkey = max_suppkey))) c
GROUP BY s_name
ORDER BY numwait DESC, s_name
© 2016 IBM Corporation21
Query federation = virtualized data access
Transparent
§ Appears to be one source
§ Programmers don’t need to know how /
where data is stored
Heterogeneous
§ Accesses data from diverse sources
High Function
§ Full query support against all data
§ Capabilities of sources as well
Autonomous
§ Non-disruptive to data sources, existing
applications, systems.
High Performance
§ Optimization of distributed queries
[Diagram: SQL tools and applications access virtualized data drawn from multiple underlying data sources.]
© 2016 IBM Corporation22
Federation in practice
§ Admin enables
federation
§ Apps connect to Big
SQL database
§ Nicknames look like
tables to the app
§ Big SQL optimizer
creates global data
access plan with cost
analysis, query push
down
§ Query fragments
executed remotely
[Diagram: the application connects to bigsql; the federation server (Big SQL) uses its cost-based optimizer to build local + remote execution plans, and wrappers with client libraries translate query fragments into each remote source’s native dialect, exposing remote tables to the application as nicknames.]
© 2016 IBM Corporation23
Joining data across sources
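As a rough sketch of joining data across sources in SQL: the wrapper, server options, remote schema, and names such as sales_db2, salesadm.offices, and db2_offices are illustrative, federation is assumed to be enabled by the admin, and the exact DDL depends on the source type and driver configuration.

-- register a remote DB2 source and a user mapping (options are illustrative;
-- the remote database is assumed to be cataloged on the Big SQL head node)
create wrapper drda;
create server sales_db2 type db2/udb version 11.1 wrapper drda
  authorization "myID" password "myPassword"
  options (dbname 'SALESDB');
create user mapping for user server sales_db2
  options (remote_authid 'myID', remote_password 'myPassword');

-- a nickname looks like a local table but maps to a remote DB2 table
create nickname db2_offices for sales_db2.salesadm.offices;

-- join remote DB2 data with the Hadoop table created earlier
select o.office_id, count(*) as num_users
from db2_offices o, users u
where o.office_id = u.office_id
group by o.office_id;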
© 2016 IBM Corporation24
Agenda
§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities
§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation
§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL
§ Performance: 100TB benchmark summary
§ Summary
© 2016 IBM Corporation25
About Spark and Big SQL
§ Easy to query Big SQL (or DB2 LUW) tables through Spark SQL
− See link to self-study lab in “Resources” section
§ Follow typical Spark SQL JDBC data source pattern
− Identify JDBC driver and connection properties
− Load table contents into DataFrame, Spark SQL temporary view
− Execute Spark SQL queries
− Applies to Big SQL tables in Hive warehouse, HBase, or arbitrary HDFS
directory
− Query results can be manipulated via other Spark libraries
§ Technical preview: Launch Spark jobs from Big SQL via UDF
© 2016 IBM Corporation26
Accessing Big SQL data from Spark shell
// based on BigInsights tech preview release that includes Spark 2.1
// Launch shell with --driver-class-path pointing to JDBC driver .jar
// read data from Big SQL table "t1" and load into a DataFrame
val sampleDF = spark.read.format("jdbc")
.option("url", "jdbc:db2://yourHost.com:32051/BIGSQL")
.option("dbtable", "yourSchema.t1")
.option("user", "yourID").option("password", "yourPassword")
.load()
// display full contents
sampleDF.show()
// create a Spark SQL temporary view to query
sampleDF.createOrReplaceTempView("v1")
// query the view and display the results
sql("select col1, col3 from v1 where col2 > 100 limit 15").show()
© 2016 IBM Corporation27
Technical preview: launch Spark jobs from Big SQL
§ Spark jobs can be invoked from Big SQL using a table UDF
abstraction
§ Example: Call the SYSHADOOP.EXECSPARK built-in UDF to kick
off a Spark job that reads a JSON file stored on HDFS
SELECT *
FROM TABLE(SYSHADOOP.EXECSPARK(
language => 'scala',
class =>
'com.ibm.biginsights.bigsql.examples.ReadJsonFile',
uri =>
'hdfs://host.port.com:8020/user/bigsql/demo.json',
card => 100000)) AS doc
WHERE doc.country IS NOT NULL
© 2016 IBM Corporation28
Agenda
§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities
§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation
§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL
§ Performance: 100TB benchmark summary
https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/
§ Summary
© 2016 IBM Corporation29
What is TPC-DS?
§ TPC = Transaction Processing Performance Council
− Non-profit corporation (vendor independent)
− Defines various industry-driven database benchmarks; DS = Decision Support
− Models a multi-domain data warehouse environment for a hypothetical retailer
(retail sales, web sales, inventory, demographics, promotions)
§ 99 pre-defined queries spanning several classes: reporting, ad hoc, iterative OLAP, data mining
§ Multiple scale factors: 100GB, 300GB, 1TB, 3TB, 10TB, 30TB and 100TB
© 2016 IBM Corporation30
100TB TPC-DS is BIG data
© 2016 IBM Corporation31
Benchmark Environment: IBM “F1” Spark SQL Cluster
§ 28 Nodes Total (Lenovo x3640 M5)
§ Each configured as:
• 2 sockets (18 cores/socket)
• 1.5 TB RAM
• 8x 2TB SSD
§ 2 Racks
− 20x 2U servers per rack (42U racks)
§ 1 Switch, 100GbE, 32 ports Mellanox
SN2700
© 2016 IBM Corporation32
Spark SQL 2.1, Hadoop-DS @ 100TB at a glance
§ Spark SQL completes more TPC-DS queries than any other open source SQL engine for Hadoop at 100TB scale
§ Compression: 60% space saved with Parquet
© 2016 IBM Corporation33
Query Compliance Through the Scale Factors
§ SQL compliance is important because Business Intelligence tools generate standard SQL
− Rewriting queries is painful and impacts productivity
§ Spark SQL: version 2.1 can run all 99 TPC-DS queries, but only at lower scale factors
− Spark SQL failures @ 100 TB: 12 runtime errors, 4 timeouts (> 10 hours)
§ Big SQL: has been successfully executing all 99 queries since Oct 2014
− IBM is the only vendor that has proven SQL compatibility at scale factors up to 100TB
© 2016 IBM Corporation34
Big SQL is 3.2X faster than Spark 2.1
(4 Concurrent Streams)
Big SQL @ 99 queries still outperforms Spark SQL @ 83 queries
© 2016 IBM Corporation35
Hadoop-DS @ 100TB at a glance (Big SQL vs. Spark SQL)
§ Performance: Big SQL is 3.2x faster
§ CPU: Big SQL uses 3.7x less CPU; average CPU usage 76.4%
§ I/O: Big SQL reads 12x less data and writes 30x less data; max I/O throughput: read 4.4 GB/sec, write 2.8 GB/sec
§ Compression: 60% space saved with Parquet
© 2016 IBM Corporation36
Recommendation: Right Tool for the Right Job
§ Big SQL: ideal tool for BI data analysts and production workloads
− Migrating existing workloads to Hadoop, security, many concurrent users, best performance
§ Spark SQL: ideal tool for data scientists and discovery
− Machine learning, simpler SQL, good performance
§ Not mutually exclusive: Big SQL and Spark SQL can co-exist in the cluster
© 2016 IBM Corporation37
Agenda
§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities
§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation
§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL
§ Performance: 100TB benchmark summary
§ Summary
© 2016 IBM Corporation38
Summary
§ Big SQL = easy path for DB2 professionals to work with Big Data
§ Runs on popular Hadoop platforms from IBM, Hortonworks
§ Integrates with Spark
§ Compatible with DB2 and ISO SQL
§ Brings high-performance, enterprise-grade query engine to popular
open source Big Data platforms
© 2016 IBM Corporation39
Want to learn more?
§ Hadoop Dev
https://developer.ibm.com/hadoop/
§ Labs: Big SQL intro, Spark / Big SQL, . . .
https://developer.ibm.com/hadoop/docs/getting-started/tutorials/big-sql-hadoop-tutorial/
§ 100TB benchmark
https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/
§ This presentation
https://www.slideshare.net/CynthiaSaracco/presentations
© 2016 IBM Corporation40
Supplemental
© 2016 IBM Corporation41
Big SQL architecture
§ Head (coordinator / management) node
− Listens for JDBC/ODBC connections
− Compiles and optimizes queries
− Optionally stores user data in a DB2-compatible table (single node only); useful for some reference data (see the sketch below)
§ Big SQL worker processes reside on compute nodes (some or all)
§ Worker nodes stream data between each other as needed
§ Workers can spill large data sets to local disk if needed
− Allows Big SQL to work with data sets larger than available memory
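A minimal sketch of the head-node table option noted above (table and column names are illustrative): omitting the HADOOP keyword creates a regular DB2-style table on the head node, which can then be joined with distributed Hadoop tables.

-- small reference table stored on the head node (not in HDFS)
create table office_ref
( office_id int not null primary key,
  city varchar(40) );
insert into office_ref values (1, 'San Jose'), (2, 'Toronto');

-- join the reference table with the distributed Hadoop table "users"
select r.city, count(*) as num_users
from users u, office_ref r
where u.office_id = r.office_id
group by r.city;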
© 2016 IBM Corporation42
CPU Profile for Big SQL vs. Spark SQL
Hadoop-DS @ 100TB, 4 Concurrent Streams
Spark SQL uses almost 3x more system CPU; these are wasted CPU cycles.
Average CPU utilization: 76.4% for Big SQL vs. 88.2% for Spark SQL
© 2016 IBM Corporation43
I/O Profile for Big SQL vs. Spark SQL
Hadoop-DS @ 100TB, 4 Concurrent Streams
Spark SQL required 3.6x more reads and 9.5x more writes; Big SQL can drive peak I/O nearly 2x higher.
More Related Content

What's hot

Big Data: SQL query federation for Hadoop and RDBMS data
Big Data:  SQL query federation for Hadoop and RDBMS dataBig Data:  SQL query federation for Hadoop and RDBMS data
Big Data: SQL query federation for Hadoop and RDBMS dataCynthia Saracco
 
Big Data: Querying complex JSON data with BigInsights and Hadoop
Big Data:  Querying complex JSON data with BigInsights and HadoopBig Data:  Querying complex JSON data with BigInsights and Hadoop
Big Data: Querying complex JSON data with BigInsights and HadoopCynthia Saracco
 
Hands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
Hands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQLHands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
Hands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQLPiotr Pruski
 
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL Big Data: Using free Bluemix Analytics Exchange Data with Big SQL
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL Cynthia Saracco
 
Big Data: Big SQL and HBase
Big Data:  Big SQL and HBase Big Data:  Big SQL and HBase
Big Data: Big SQL and HBase Cynthia Saracco
 
Big Data: SQL on Hadoop from IBM
Big Data:  SQL on Hadoop from IBM Big Data:  SQL on Hadoop from IBM
Big Data: SQL on Hadoop from IBM Cynthia Saracco
 
Big Data: Explore Hadoop and BigInsights self-study lab
Big Data:  Explore Hadoop and BigInsights self-study labBig Data:  Explore Hadoop and BigInsights self-study lab
Big Data: Explore Hadoop and BigInsights self-study labCynthia Saracco
 
Big SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeBig SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeNicolas Morales
 
Taming Big Data with Big SQL 3.0
Taming Big Data with Big SQL 3.0Taming Big Data with Big SQL 3.0
Taming Big Data with Big SQL 3.0Nicolas Morales
 
Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
Adding Value to HBase with IBM InfoSphere BigInsights and BigSQLAdding Value to HBase with IBM InfoSphere BigInsights and BigSQL
Adding Value to HBase with IBM InfoSphere BigInsights and BigSQLPiotr Pruski
 
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...Cynthia Saracco
 
Getting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with BluemixGetting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with BluemixNicolas Morales
 
Big SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on HadoopBig SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on HadoopWilfried Hoge
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Data Con LA
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoopnvvrajesh
 
Running Cognos on Hadoop
Running Cognos on HadoopRunning Cognos on Hadoop
Running Cognos on HadoopSenturus
 
Advanced Security In Hadoop Cluster
Advanced Security In Hadoop ClusterAdvanced Security In Hadoop Cluster
Advanced Security In Hadoop ClusterEdureka!
 

What's hot (17)

Big Data: SQL query federation for Hadoop and RDBMS data
Big Data:  SQL query federation for Hadoop and RDBMS dataBig Data:  SQL query federation for Hadoop and RDBMS data
Big Data: SQL query federation for Hadoop and RDBMS data
 
Big Data: Querying complex JSON data with BigInsights and Hadoop
Big Data:  Querying complex JSON data with BigInsights and HadoopBig Data:  Querying complex JSON data with BigInsights and Hadoop
Big Data: Querying complex JSON data with BigInsights and Hadoop
 
Hands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
Hands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQLHands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
Hands-on-Lab: Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
 
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL Big Data: Using free Bluemix Analytics Exchange Data with Big SQL
Big Data: Using free Bluemix Analytics Exchange Data with Big SQL
 
Big Data: Big SQL and HBase
Big Data:  Big SQL and HBase Big Data:  Big SQL and HBase
Big Data: Big SQL and HBase
 
Big Data: SQL on Hadoop from IBM
Big Data:  SQL on Hadoop from IBM Big Data:  SQL on Hadoop from IBM
Big Data: SQL on Hadoop from IBM
 
Big Data: Explore Hadoop and BigInsights self-study lab
Big Data:  Explore Hadoop and BigInsights self-study labBig Data:  Explore Hadoop and BigInsights self-study lab
Big Data: Explore Hadoop and BigInsights self-study lab
 
Big SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor LandscapeBig SQL Competitive Summary - Vendor Landscape
Big SQL Competitive Summary - Vendor Landscape
 
Taming Big Data with Big SQL 3.0
Taming Big Data with Big SQL 3.0Taming Big Data with Big SQL 3.0
Taming Big Data with Big SQL 3.0
 
Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
Adding Value to HBase with IBM InfoSphere BigInsights and BigSQLAdding Value to HBase with IBM InfoSphere BigInsights and BigSQL
Adding Value to HBase with IBM InfoSphere BigInsights and BigSQL
 
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
Big Data: Getting off to a fast start with Big SQL (World of Watson 2016 sess...
 
Getting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with BluemixGetting started with Hadoop on the Cloud with Bluemix
Getting started with Hadoop on the Cloud with Bluemix
 
Big SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on HadoopBig SQL 3.0 - Fast and easy SQL on Hadoop
Big SQL 3.0 - Fast and easy SQL on Hadoop
 
Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014Hadoop Innovation Summit 2014
Hadoop Innovation Summit 2014
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Running Cognos on Hadoop
Running Cognos on HadoopRunning Cognos on Hadoop
Running Cognos on Hadoop
 
Advanced Security In Hadoop Cluster
Advanced Security In Hadoop ClusterAdvanced Security In Hadoop Cluster
Advanced Security In Hadoop Cluster
 

Similar to Using your DB2 SQL Skills with Hadoop and Spark

Rajeev kumar apache_spark &amp; scala developer
Rajeev kumar apache_spark &amp; scala developerRajeev kumar apache_spark &amp; scala developer
Rajeev kumar apache_spark &amp; scala developerRajeev Kumar
 
Big SQL NYC Event December by Virender
Big SQL NYC Event December by VirenderBig SQL NYC Event December by Virender
Big SQL NYC Event December by Virendervithakur
 
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data PlatformModernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data PlatformHortonworks
 
Agile data lake? An oxymoron?
Agile data lake? An oxymoron?Agile data lake? An oxymoron?
Agile data lake? An oxymoron?samthemonad
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Anant Corporation
 
Run Oracle Apps in the Cloud with dashDB
Run Oracle Apps in the Cloud with dashDBRun Oracle Apps in the Cloud with dashDB
Run Oracle Apps in the Cloud with dashDBIBM Cloud Data Services
 
Power BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle CloudPower BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle CloudKellyn Pot'Vin-Gorman
 
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scalaSunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scalaMopuru Babu
 
Chris Asano.dba.20160512a
Chris Asano.dba.20160512aChris Asano.dba.20160512a
Chris Asano.dba.20160512aChris Asano
 
NRB - LUXEMBOURG MAINFRAME DAY 2017 - Data Spark and the Data Federation
NRB - LUXEMBOURG MAINFRAME DAY 2017 - Data Spark and the Data FederationNRB - LUXEMBOURG MAINFRAME DAY 2017 - Data Spark and the Data Federation
NRB - LUXEMBOURG MAINFRAME DAY 2017 - Data Spark and the Data FederationNRB
 
NRB - BE MAINFRAME DAY 2017 - Data spark and the data federation
NRB - BE MAINFRAME DAY 2017 - Data spark and the data federation NRB - BE MAINFRAME DAY 2017 - Data spark and the data federation
NRB - BE MAINFRAME DAY 2017 - Data spark and the data federation NRB
 
Rama prasad owk etl hadoop_developer
Rama prasad owk etl hadoop_developerRama prasad owk etl hadoop_developer
Rama prasad owk etl hadoop_developerramaprasad owk
 
Big Data with SQL Server
Big Data with SQL ServerBig Data with SQL Server
Big Data with SQL ServerMark Kromer
 
Demystifying Data Warehouse as a Service
Demystifying Data Warehouse as a ServiceDemystifying Data Warehouse as a Service
Demystifying Data Warehouse as a ServiceSnowflake Computing
 
Professional Portfolio
Professional PortfolioProfessional Portfolio
Professional PortfolioMoniqueO Opris
 

Similar to Using your DB2 SQL Skills with Hadoop and Spark (20)

ESGYN Overview
ESGYN OverviewESGYN Overview
ESGYN Overview
 
Rajeev kumar apache_spark &amp; scala developer
Rajeev kumar apache_spark &amp; scala developerRajeev kumar apache_spark &amp; scala developer
Rajeev kumar apache_spark &amp; scala developer
 
Big SQL NYC Event December by Virender
Big SQL NYC Event December by VirenderBig SQL NYC Event December by Virender
Big SQL NYC Event December by Virender
 
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data PlatformModernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform
 
Uotm workshop
Uotm workshopUotm workshop
Uotm workshop
 
Ibm db2 big sql
Ibm db2 big sqlIbm db2 big sql
Ibm db2 big sql
 
Agile data lake? An oxymoron?
Agile data lake? An oxymoron?Agile data lake? An oxymoron?
Agile data lake? An oxymoron?
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
SAMADMohammad
SAMADMohammadSAMADMohammad
SAMADMohammad
 
Run Oracle Apps in the Cloud with dashDB
Run Oracle Apps in the Cloud with dashDBRun Oracle Apps in the Cloud with dashDB
Run Oracle Apps in the Cloud with dashDB
 
Big Data Journey
Big Data JourneyBig Data Journey
Big Data Journey
 
Power BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle CloudPower BI with Essbase in the Oracle Cloud
Power BI with Essbase in the Oracle Cloud
 
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scalaSunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
Sunshine consulting mopuru babu cv_java_j2ee_spring_bigdata_scala
 
Chris Asano.dba.20160512a
Chris Asano.dba.20160512aChris Asano.dba.20160512a
Chris Asano.dba.20160512a
 
NRB - LUXEMBOURG MAINFRAME DAY 2017 - Data Spark and the Data Federation
NRB - LUXEMBOURG MAINFRAME DAY 2017 - Data Spark and the Data FederationNRB - LUXEMBOURG MAINFRAME DAY 2017 - Data Spark and the Data Federation
NRB - LUXEMBOURG MAINFRAME DAY 2017 - Data Spark and the Data Federation
 
NRB - BE MAINFRAME DAY 2017 - Data spark and the data federation
NRB - BE MAINFRAME DAY 2017 - Data spark and the data federation NRB - BE MAINFRAME DAY 2017 - Data spark and the data federation
NRB - BE MAINFRAME DAY 2017 - Data spark and the data federation
 
Rama prasad owk etl hadoop_developer
Rama prasad owk etl hadoop_developerRama prasad owk etl hadoop_developer
Rama prasad owk etl hadoop_developer
 
Big Data with SQL Server
Big Data with SQL ServerBig Data with SQL Server
Big Data with SQL Server
 
Demystifying Data Warehouse as a Service
Demystifying Data Warehouse as a ServiceDemystifying Data Warehouse as a Service
Demystifying Data Warehouse as a Service
 
Professional Portfolio
Professional PortfolioProfessional Portfolio
Professional Portfolio
 

Recently uploaded

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 

Recently uploaded (20)

WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 

Using your DB2 SQL Skills with Hadoop and Spark

  • 1. © 2016 IBM Corporation Using your DB2 skills with Hadoop and Spark Presented to TRIDEX DB2 Users Group, June 2017 C. M. Saracco, IBM Silicon Valley Lab https://www.slideshare.net/CynthiaSaracco/presentations
  • 2. © 2016 IBM Corporation2 Executive summary § About Apache Hadoop and Spark − Popular open source technologies for working with Big Data • Clustered computing > scalability • Varied data > no pre-set structure or schema requirements − Hadoop: distributed file system (storage), MapReduce API, . . . − Spark: in-memory data processing (speed), built-in libraries, . . . § About Big SQL − DB2-compatible query engine for Hadoop data (IBM or Hortonworks distributions) − Based on decades of IBM R&D investment in RDBMS technology, including database parallelism and query optimization. Strong runtime performance for analytical workloads. § Some ways to leverage DB2 SQL skills − Create / manage / query “local” or distributed tables in Hadoop − Query / join Hadoop data with DB2, Oracle, Teradata, etc. data via query federation − Leverage Spark to query and manipulate Big SQL or DB2 data − Leverage Big SQL to initiate Spark jobs and analyze result
  • 3. © 2016 IBM Corporation3 Agenda § Big Data background − Market drivers − Open source technologies: Hadoop, Spark − Big SQL architecture / capabilities § Using Hadoop and Big SQL − Create tables / populate with data − Query tables − Explore query federation § Using Spark and Big SQL − Query data using Spark SQL − Launch Spark jobs from Big SQL § Performance: 100TB benchmark summary § Summary
  • 4. © 2016 IBM Corporation4 Agenda § Big Data background − Market drivers − Open source technologies: Hadoop, Spark − Big SQL architecture / capabilities § Using Hadoop and Big SQL − Create tables / populate with data − Query tables − Explore query federation § Using Spark and Big SQL − Query data using Spark SQL − Launch Spark jobs from Big SQL § Performance: 100TB benchmark summary § Summary
  • 5. © 2016 IBM Corporation5 Business leaders frequently make decisions based on information they don’ttrust, or don’t have1in3 83% of CIOs cited “Business intelligence and analytics” as part of their visionary plans to enhance competitiveness Business leaders say they don’t have access to the information they need to do their jobs 1in2 of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions 60% … and organizations need deeper insights Information is at the center of a new wave of opportunity… 4 million “likes” per minute 300,000 tweets per minute 150 million emails per minute 2.78 million video views per minute 2.5 TB per day per A350 plane > 1 PB per day gas turbines 1 ZB = 1 billion TB
  • 6. © 2016 IBM Corporation6 Big Data adoption (study results) 2012 to 2014 2015 22%-27% 25% 0% change 2012 to 2014 2015 24%-26% 10% 250% decrease Educate: Learning about big data capabilities 2012 to 2014 2015 43%-47% 53% 125% increase Explore: Exploring internal use cases and developing a strategy Engage: Implementing infrastructure and running pilot activities 2012 to 2014 2015 5%-6% 13% 210% increase Execute: Using big data and analytics pervasively across the enterprise 2015 IBV study “Analytics: The Upside of Disruption” (ibm.biz/w3_2015analytics)
  • 7. © 2016 IBM Corporation7 Return on investment period for big data and analytics projects as reported by respondents Big Data ROI often < 18 months 2015 IBV study “Analytics: The Upside of Disruption” (ibm.biz/w3_2015analytics)
  • 8. © 2016 IBM Corporation8 § Both open source Apache projects − Exploit distributed computing environments − Enable processing of large volumes of varied data § Hadoop − Inspired by Google technologies (MapReduce, GFS) − Originally designed for batch-oriented, read-intensive applications − “Core” consists of distributed file system, MapReduce, job scheduler, utilities − Complementary projects span data warehousing, workflow management, columnar data storage, activity monitoring, . . . § Spark − Began as a UC Berkeley project − Fast, general-purpose engine for working with Big Data in memory − Popular built-in libraries for machine learning, streaming data, query (SQL), . . . − No built-in storage. Interfaces to Hadoop, other stores About Hadoop and Spark
  • 9. © 2016 IBM Corporation9 IBM contributions: Hadoop and Spark Snapshots taken Jan. 2017. Latest content available online via Apache dashboards. IOP relates to Hadoop; STC relates to Spark.
  • 10. © 2016 IBM Corporation10 What is Big SQL? SQL-based Application Big SQL Engine Data Storage IBM data server client SQL MPP Run-time HDFS § Comprehensive, standard SQL for Hadoop – SELECT: joins, unions, aggregates, subqueries . . . – UPDATE/DELETE (HBase-managed tables) – GRANT/REVOKE, INSERT … INTO – SQL procedural logic (SQL PL) – Stored procs, user-defined functions – IBM data server JDBC and ODBC drivers § Optimization and performance – IBM MPP engine (C++) replaces Java MapReduce layer – Continuous running daemons (no start up latency) – Message passing allow data to flow between nodes without persisting intermediate results – In-memory operations with ability to spill to disk (useful for aggregations, sorts that exceed available RAM) – Cost-based query optimization with 140+ rewrite rules § Various storage formats supported – Text (delimited), Sequence, RCFile, ORC, Avro, Parquet – Data persisted in DFS, Hive, HBase – No IBM proprietary format required § Integration with RDBMSs via LOAD, query federation IBM Open Platform or Hortonworks Data Platform
  • 11. © 2016 IBM Corporation11 Agenda § Big Data background − Market drivers − Open source technologies: Hadoop, Spark − Big SQL architecture / capabilities § Using Hadoop and Big SQL − Create tables / populate with data − Query tables − Explore query federation § Using Spark and Big SQL − Query data using Spark SQL − Launch Spark jobs from Big SQL § Performance: 100TB benchmark summary § Summary
  • 12. © 2016 IBM Corporation12 § Big SQL − Easy on-ramp to Hadoop for DB2 SQL professionals − Create query-ready data lake − Offload “cold” RDBMS warehouse data to Hadoop − . . . . § Some ways to use Big SQL . . . − Create tables − Load / insert data − Execute complex queries − Exploit various DB2 features: UDFs, EXPLAIN, workload management, Oracle / Netezza SQL compatibility. . . . − Exploit various Hadoop features: Hive, HBase, SerDes, . . . About Hadoop and Big SQL
  • 13. © 2016 IBM Corporation13 Invocation options § Command-line interface: Java SQL Shell (JSqsh) § Web tooling (Data Server Manager) § Tools that support IBM JDBC/ODBC driver
  • 14. © 2016 IBM Corporation14 Creating a Big SQL table § Standard CREATE TABLE DDL with extensions create hadoop table users ( id int not null primary key, office_id int null, fname varchar(30) not null, lname varchar(30) not null) row format delimited fields terminated by '|' stored as textfile; Worth noting: • “Hadoop” keyword creates table in HDFS • Row format delimited and textfile formats are default • Constraints not enforced (but useful for query optimization) • Examples in these charts focus on HDFS storage, both within or external to Hive warehouse. HBase examples provided separately
  • 15. © 2016 IBM Corporation15 CREATE VIEW § Standard SQL syntax create view my_users as select fname, lname from biadmin.users where id > 100;
  • 16. © 2016 IBM Corporation16 Populating tables via LOAD § Typically best runtime performance § Load data from local or remote file system load hadoop using file url 'sftp://myID:myPassword@myServer.ibm.com:22/install- dir/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt’ with SOURCE PROPERTIES ('field.delimiter'='t') INTO TABLE gosalesdw.GO_REGION_DIM overwrite; § Loads data from RDBMS (DB2, Netezza, Teradata, Oracle, MS-SQL, Informix) via JDBC connection load hadoop using jdbc connection url 'jdbc:db2://some.host.com:portNum/sampledb' with parameters (user='myID', password='myPassword') from table MEDIA columns (ID, NAME) where 'CONTACTDATE < ''2012-02-01''' into table media_db2table_jan overwrite with load properties ('num.map.tasks' = 10);
  • 17. © 2016 IBM Corporation17 Populating tables via INSERT § INSERT INTO . . . SELECT FROM . . . − Parallel read and write operations CREATE HADOOP TABLE IF NOT EXISTS big_sales_parquet ( product_key INT NOT NULL, product_name VARCHAR(150), Quantity INT, order_method_en VARCHAR(90) ) STORED AS parquetfile; -- source tables do not need to be in Parquet format insert into big_sales_parquet SELECT sales.product_key, pnumb.product_name, sales.quantity, meth.order_method_en FROM sls_sales_fact sales, sls_product_dim prod,sls_product_lookup pnumb, sls_order_method_dim meth WHERE pnumb.product_language='EN' AND sales.product_key=prod.product_key AND prod.product_number=pnumb.product_number AND meth.order_method_key=sales.order_method_key and sales.quantity > 5500; § INSERT INTO . . . VALUES(. . . ) − Not parallelized. 1 file per INSERT. Not recommended except for quick tests CREATE HADOOP TABLE foo col1 int, col2 varchar(10)); INSERT INTO foo VALUES (1, ‘hello’);
  • 18. © 2016 IBM Corporation18 CREATE . . . TABLE . . . AS SELECT . . . § Create a Big SQL table based on contents of other table(s) § Source tables can be in different file formats or use different underlying storage mechanisms -- source tables in this example are external (just DFS files) CREATE HADOOP TABLE IF NOT EXISTS sls_product_flat ( product_key INT NOT NULL , product_line_code INT NOT NULL , product_type_key INT NOT NULL , product_type_code INT NOT NULL , product_line_en VARCHAR(90) , product_line_de VARCHAR(90) ) as select product_key, d.product_line_code, product_type_key, product_type_code, product_line_en, product_line_de from extern.sls_product_dim d, extern.sls_product_line_lookup l where d.product_line_code = l.product_line_code;
  • 19. © 2016 IBM Corporation19 SQL capability highlights § Query operations − Projections, restrictions − UNION, INTERSECT, EXCEPT − Wide range of built-in functions (e.g. OLAP) − Various Oracle, Netezza compatibility items § Full support for subqueries − In SELECT, FROM, WHERE and HAVING clauses − Correlated and uncorrelated − Equality, non-equality subqueries − EXISTS, NOT EXISTS, IN, ANY, SOME, etc. § All standard join operations − Standard and ANSI join syntax − Inner, outer, and full outer joins − Equality, non-equality, cross join support − Multi-value join § Stored procedures, user-defined functions, user-defined aggregates SELECT s_name, count(*) AS numwait FROM supplier, lineitem l1, orders, nation WHERE s_suppkey = l1.l_suppkey AND o_orderkey = l1.l_orderkey AND o_orderstatus = 'F' AND l1.l_receiptdate > l1.l_commitdate AND EXISTS ( SELECT * FROM lineitem l2 WHERE l2.l_orderkey = l1.l_orderkey AND l2.l_suppkey <> l1.l_suppkey ) AND NOT EXISTS ( SELECT * FROM lineitem l3 WHERE l3.l_orderkey = l1.l_orderkey AND l3.l_suppkey <> l1.l_suppkey AND l3.l_receiptdate > l3.l_commitdate ) AND s_nationkey = n_nationkey AND n_name = ':1' GROUP BY s_name ORDER BY numwait desc, s_name;
  • 20. © 2016 IBM Corporation20 Power of standard SQL § Big SQL executes all 22 TPC-H queries without modification § Big SQL executes all 99 TPC-DS queries without modification § Big SQL leverages DB2 query rewrite technology for efficient optimization SELECT s_name, count(*) AS numwait FROM supplier, lineitem l1, orders, nation WHERE s_suppkey = l1.l_suppkey AND o_orderkey = l1.l_orderkey AND o_orderstatus = 'F' AND l1.l_receiptdate > l1.l_commitdate AND EXISTS ( SELECT * FROM lineitem l2 WHERE l2.l_orderkey = l1.l_orderkey AND l2.l_suppkey <> l1.l_suppkey) AND NOT EXISTS ( SELECT * FROM lineitem l3 WHERE l3.l_orderkey = l1.l_orderkey AND l3.l_suppkey <> l1.l_suppkey AND l3.l_receiptdate > l3.l_commitdate) AND s_nationkey = n_nationkey AND n_name = ':1' GROUP BY s_name ORDER BY numwait desc, s_name JOIN (SELECT s_name, l_orderkey, l_suppkey FROM orders o JOIN (SELECT s_name, l_orderkey, l_suppkey FROM nation n JOIN supplier s ON s.s_nationkey = n.n_nationkey AND n.n_name = 'INDONESIA' JOIN lineitem l ON s.s_suppkey = l.l_suppkey WHERE l.l_receiptdate > l.l_commitdate) l1 ON o.o_orderkey = l1.l_orderkey AND o.o_orderstatus = 'F') l2 ON l2.l_orderkey = t1.l_orderkey) a WHERE (count_suppkey > 1) or ((count_suppkey=1) AND (l_suppkey <> max_suppkey))) l3 ON l3.l_orderkey = t2.l_orderkey) b WHERE (count_suppkey is null) OR ((count_suppkey=1) AND (l_suppkey = max_suppkey))) c GROUP BY s_name ORDER BY numwait DESC, s_name SELECT s_name, count(1) AS numwait FROM (SELECT s_name FROM (SELECT s_name, t2.l_orderkey, l_suppkey, count_suppkey, max_suppkey FROM (SELECT l_orderkey, count(distinct l_suppkey) as count_suppkey, max(l_suppkey) as max_suppkey FROM lineitem WHERE l_receiptdate > l_commitdate GROUP BY l_orderkey) t2 RIGHT OUTER JOIN (SELECT s_name, l_orderkey, l_suppkey FROM (SELECT s_name, t1.l_orderkey, l_suppkey, count_suppkey, max_suppkey FROM (SELECT l_orderkey, count(distinct l_suppkey) as count_suppkey, max(l_suppkey) as max_suppkey FROM lineitem GROUP BY l_orderkey) t1 Original Query Re-written query
  • 21. © 2016 IBM Corporation21 Query federation = virtualized data access Transparent § Appears to be one source § Programmers don’t need to know how / where data is stored Heterogeneous § Accesses data from diverse sources High Function § Full query support against all data § Capabilities of sources as well Autonomous § Non-disruptive to data sources, existing applications, systems. High Performance § Optimization of distributed queries SQL tools, applications Data sources Virtualized data
• 22. © 2016 IBM Corporation22 Federation in practice
§ Admin enables federation
§ Apps connect to the Big SQL database
§ Nicknames look like tables to the app
§ Big SQL optimizer creates a global data access plan with cost analysis and query push-down
§ Query fragments are executed remotely
(Diagram: client connects to the Big SQL federation server; the cost-based optimizer builds local + remote execution plans, and wrappers / client libraries send fragments to the remote sources in their native dialect)
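For reference, a hypothetical sketch of the DDL behind these steps (server, schema, and credential names are invented, the remote DB2 database is assumed to be cataloged locally, and exact clauses vary by source type and release):
    -- Hypothetical sketch: register a remote DB2 source and expose one table as a nickname
    CREATE WRAPPER DRDA;                                    -- client library for DB2-family sources
    CREATE SERVER db2_sales TYPE DB2/UDB VERSION 11.1
           WRAPPER DRDA AUTHORIZATION "dbuser" PASSWORD "dbpasswd"
           OPTIONS (DBNAME 'SALESDB');                      -- 'SALESDB' assumed cataloged on this node
    CREATE USER MAPPING FOR USER SERVER db2_sales
           OPTIONS (REMOTE_AUTHID 'dbuser', REMOTE_PASSWORD 'dbpasswd');
    CREATE NICKNAME sales.orders_dw FOR db2_sales.DWSCHEMA.ORDERS;   -- looks like a local table to apps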
  • 23. © 2016 IBM Corporation23 Joining data across sources
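Once a nickname exists, one Big SQL statement can join Hadoop data with the remote relational data, and the optimizer decides which fragments to push down to the remote source. A hypothetical sketch (table, nickname, and column names are invented for illustration; the nickname is the one from the previous sketch):
    -- Hypothetical sketch: join a Big SQL table over HDFS with a nickname over a remote DB2 table
    SELECT w.product_id,
           SUM(w.clicks)   AS web_clicks,
           SUM(o.quantity) AS store_units
    FROM   weblogs_hadoop w              -- Big SQL table over HDFS data
    JOIN   sales.orders_dw o             -- nickname for the remote DB2 table
      ON   w.product_id = o.product_id
    GROUP BY w.product_id
    ORDER BY store_units DESC
    FETCH FIRST 10 ROWS ONLY;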
• 24. © 2016 IBM Corporation24 Agenda
§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities
§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation
§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL
§ Performance: 100TB benchmark summary
§ Summary
• 25. © 2016 IBM Corporation25 About Spark and Big SQL
§ Easy to query Big SQL (or DB2 LUW) tables through Spark SQL
− See link to self-study lab in “Resources” section
§ Follow the typical Spark SQL JDBC data source pattern
− Identify the JDBC driver and connection properties
− Load table contents into a DataFrame and a Spark SQL temporary view
− Execute Spark SQL queries
− Applies to Big SQL tables in the Hive warehouse, HBase, or an arbitrary HDFS directory
− Query results can be manipulated via other Spark libraries
§ Technical preview: launch Spark jobs from Big SQL via a UDF
• 26. © 2016 IBM Corporation26 Accessing Big SQL data from Spark shell
    // Based on the BigInsights tech preview release that includes Spark 2.1
    // Launch the shell with --driver-class-path pointing to the Big SQL JDBC driver .jar

    // Read data from Big SQL table "t1" and load it into a DataFrame
    val sampleDF = spark.read.format("jdbc")
      .option("url", "jdbc:db2://yourHost.com:32051/BIGSQL")
      .option("dbtable", "yourSchema.t1")
      .option("user", "yourID")
      .option("password", "yourPassword")
      .load()

    // Display full contents
    sampleDF.show()

    // Create a Spark SQL temporary view to query
    sampleDF.createOrReplaceTempView("v1")

    // Query the view and display the results
    sql("select col1, col3 from v1 where col2 > 100 limit 15").show()
• 27. © 2016 IBM Corporation27 Technical preview: launch Spark jobs from Big SQL
§ Spark jobs can be invoked from Big SQL using a table UDF abstraction
§ Example: call the SYSHADOOP.EXECSPARK built-in UDF to kick off a Spark job that reads a JSON file stored on HDFS
    SELECT *
    FROM TABLE(SYSHADOOP.EXECSPARK(
           language => 'scala',
           class    => 'com.ibm.biginsights.bigsql.examples.ReadJsonFile',
           uri      => 'hdfs://host.port.com:8020/user/bigsql/demo.json',
           card     => 100000)) AS doc
    WHERE doc.country IS NOT NULL
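Because EXECSPARK returns a table, its output can feed any further SQL processing. A minimal, hypothetical extension of the slide's own example (the country column is assumed from that example) aggregates the Spark job's results directly in the same statement:
    -- Hypothetical sketch: summarize the Spark job's output with ordinary SQL
    SELECT doc.country, COUNT(*) AS num_docs
    FROM TABLE(SYSHADOOP.EXECSPARK(
           language => 'scala',
           class    => 'com.ibm.biginsights.bigsql.examples.ReadJsonFile',
           uri      => 'hdfs://host.port.com:8020/user/bigsql/demo.json',
           card     => 100000)) AS doc
    WHERE doc.country IS NOT NULL
    GROUP BY doc.country
    ORDER BY num_docs DESC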
• 28. © 2016 IBM Corporation28 Agenda
§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities
§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation
§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL
§ Performance: 100TB benchmark summary
https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/
§ Summary
• 29. © 2016 IBM Corporation29 What is TPC-DS?
§ TPC = Transaction Processing Performance Council
− Non-profit corporation (vendor independent)
− Defines various industry-driven database benchmarks; DS = Decision Support
− Models a multi-domain data warehouse environment for a hypothetical retailer (retail sales, web sales, inventory, demographics, promotions)
§ Multiple scale factors: 100GB, 300GB, 1TB, 3TB, 10TB, 30TB and 100TB
§ 99 pre-defined queries
− Query classes: reporting, ad hoc, iterative OLAP, data mining
• 30. © 2016 IBM Corporation30 100TB TPC-DS is BIG data
• 31. © 2016 IBM Corporation31 Benchmark environment: IBM “F1” Spark SQL cluster
§ 28 nodes total (Lenovo x3640 M5), each configured as:
− 2 sockets (18 cores/socket)
− 1.5 TB RAM
− 8x 2TB SSD
§ 2 racks
− 20x 2U servers per rack (42U racks)
§ 1 switch: 100GbE, 32 ports (Mellanox SN2700)
• 32. © 2016 IBM Corporation32 Spark SQL 2.1, Hadoop-DS @ 100TB: at a glance
§ Compression: 60% space saved with Parquet
§ Spark SQL completes more TPC-DS queries than any other open source SQL engine for Hadoop @ 100TB scale
(Slide also shows at-a-glance performance and working-query figures)
• 33. © 2016 IBM Corporation33 Query compliance through the scale factors
§ SQL compliance is important because Business Intelligence tools generate standard SQL
− Rewriting queries is painful and impacts productivity
§ Spark SQL
− Spark SQL 2.1 can run all 99 TPC-DS queries, but only at lower scale factors
− Spark SQL failures @ 100 TB: 12 runtime errors, 4 timeouts (> 10 hours)
§ Big SQL
− Big SQL has been successfully executing all 99 queries since Oct 2014
− IBM is the only vendor that has proven SQL compatibility at scale factors up to 100TB
• 34. © 2016 IBM Corporation34 Big SQL is 3.2x faster than Spark 2.1 (4 concurrent streams)
§ Big SQL @ 99 queries still outperforms Spark SQL @ 83 queries
• 35. © 2016 IBM Corporation35 Big SQL, Hadoop-DS @ 100TB: at a glance
§ Performance: Big SQL is 3.2x faster than Spark SQL
§ CPU (vs Spark): Big SQL uses 3.7x less CPU; average CPU usage 76.4%
§ I/O (vs Spark): Big SQL reads 12x less data and writes 30x less data; max I/O throughput 4.4 GB/sec read, 2.8 GB/sec write
§ Compression: 60% space saved with Parquet
(Slide also shows a working-queries figure)
• 36. © 2016 IBM Corporation36 Recommendation: right tool for the right job
§ Spark SQL: ideal tool for Data Scientists and discovery
− Machine learning
− Simpler SQL
− Good performance
§ Big SQL: ideal tool for BI Data Analysts and production workloads
− Migrating existing workloads to Hadoop
− Security
− Many concurrent users
− Best performance
§ Not mutually exclusive: Big SQL and Spark SQL can co-exist in the cluster
• 37. © 2016 IBM Corporation37 Agenda
§ Big Data background
− Market drivers
− Open source technologies: Hadoop, Spark
− Big SQL architecture / capabilities
§ Using Hadoop and Big SQL
− Create tables / populate with data
− Query tables
− Explore query federation
§ Using Spark and Big SQL
− Query data using Spark SQL
− Launch Spark jobs from Big SQL
§ Performance: 100TB benchmark summary
§ Summary
• 38. © 2016 IBM Corporation38 Summary
§ Big SQL = easy path for DB2 professionals to work with Big Data
§ Runs on popular Hadoop platforms from IBM, Hortonworks
§ Integrates with Spark
§ Compatible with DB2 and ISO SQL
§ Brings a high-performance, enterprise-grade query engine to popular open source Big Data platforms
• 39. © 2016 IBM Corporation39 Want to learn more?
§ Hadoop Dev: https://developer.ibm.com/hadoop/
§ Labs (Big SQL intro, Spark / Big SQL, . . .): https://developer.ibm.com/hadoop/docs/getting-started/tutorials/big-sql-hadoop-tutorial/
§ 100TB benchmark: https://developer.ibm.com/hadoop/2017/02/07/experiences-comparing-big-sql-and-spark-sql-at-100tb/
§ This presentation: https://www.slideshare.net/CynthiaSaracco/presentations
  • 40. © 2016 IBM Corporation40 Supplemental
• 41. © 2016 IBM Corporation41 Big SQL architecture
§ Head (coordinator / management) node
− Listens for JDBC/ODBC connections
− Compiles and optimizes the query
− Optionally stores user data in a DB2-compatible table (single node only); useful for some reference data
§ Big SQL worker processes reside on compute nodes (some or all)
§ Worker nodes stream data between each other as needed
§ Workers can spill large data sets to local disk if needed
− Allows Big SQL to work with data sets larger than available memory
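To make the storage distinction above concrete, here is a minimal, hypothetical sketch (table and column names are invented and exact clauses vary by release): a table declared with the HADOOP keyword lives in HDFS and is processed by the workers, while a plain CREATE TABLE produces the DB2-compatible table held only on the head node.
    -- Hypothetical sketch: distributed Hadoop table vs. head-node-only reference table
    CREATE HADOOP TABLE sales_fact (
        order_id   BIGINT,
        product_id INT,
        amount     DECIMAL(12,2),
        country    CHAR(2)
    ) STORED AS PARQUETFILE;               -- data stored in HDFS, queried by Big SQL workers

    CREATE TABLE country_ref (             -- small reference data kept on the head node only
        country    CHAR(2),
        name       VARCHAR(64)
    );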
• 42. © 2016 IBM Corporation42 CPU profile for Big SQL vs. Spark SQL (Hadoop-DS @ 100TB, 4 concurrent streams)
§ Spark SQL uses almost 3x more system CPU; these are wasted CPU cycles
§ Average CPU utilization: 76.4% (Big SQL) vs. 88.2% (Spark SQL)
• 43. © 2016 IBM Corporation43 I/O profile for Big SQL vs. Spark SQL (Hadoop-DS @ 100TB, 4 concurrent streams)
§ Spark SQL required 3.6x more reads and 9.5x more writes
§ Big SQL can drive peak I/O nearly 2x more