info@rittmanmead.com www.rittmanmead.com @rittmanmead
Gluent New World #02: 

SQL-on-Hadoop with Mark Rittman
Mark Rittman, CTO, Rittman Mead
April 2016
info@rittmanmead.com www.rittmanmead.com @rittmanmead 2
•Mark Rittman, Co-Founder of Rittman Mead

‣Oracle ACE Director, specialising in Oracle BI&DW

‣14 Years Experience with Oracle Technology

‣Regular columnist for Oracle Magazine

•Author of two Oracle Press Oracle BI books

‣Oracle Business Intelligence Developers Guide

‣Oracle Exalytics Revealed

‣Writer for Rittman Mead Blog :

http://www.rittmanmead.com/blog

•Email : mark.rittman@rittmanmead.com

•Twitter : @markrittman
About the Speaker
info@rittmanmead.com www.rittmanmead.com @rittmanmead 3
•Why Hadoop? And what are the key Hadoop platform features?

•Introducing SQL-on-Hadoop, and Apache Hive

•How Hive works, and how it’s not just about SELECTing data

•Solving Hive’s ad-hoc query performance problem

•So what’s all this about Apache Drill?

•…. and Oracle Big Data SQL, IBM Big SQL?

•Apache Spark, and Spark SQL

•Security, Hadoop and SQL-on-Hadoop

•Selecting a SQL-on-Hadoop query engine
Agenda
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Everyone’s talking about Hadoop and “Big Data”
Hadoop is the Big Hot Topic In IT / Analytics
info@rittmanmead.com www.rittmanmead.com @rittmanmead
Highly Scalable (and Affordable) Cluster Computing
•Enterprise High-End RDBMSs such as Oracle can scale into the petabytes, using clustering

‣Sharded databases (e.g. Netezza) can scale further but with complexity / single workload trade-offs

•Hadoop was designed from the outset for massive horizontal scalability - using cheap hardware

•Anticipates hardware failure and makes multiple copies of data as protection

•The more nodes you add, the more stable it becomes

•And at a fraction of the cost of traditional

RDBMS platforms
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Store and analyze huge volumes of structured and unstructured data 

•In the past, we had to throw away the detail

•No need to define a data model during ingest

•Supports multiple, flexible schemas

•Separation of storage from compute engine

•Allows multiple query engines and frameworks

to work on the same raw datasets
Store Everything Forever - And Process in Many Ways
Hadoop	Data	Lake
Webserver

Log	Files	(txt)
Social	Media

Logs	(JSON)
DB	Archives

(CSV)
Sensor	Data

(XML)
Spatial & Graph

(XML,	txt)
IoT	Logs

(JSON,	txt)
Chat	Transcripts

(Txt)
DB	Transactions

(CSV,	XML)
Blogs,	Articles

(TXT,	HTML)
Raw	Data Processed	Data
NoSQL	Key-Value

Store	DB Tabular	Data

(Hive	Tables)
Aggregates

(Impala	Tables) NoSQL	Document	

Store	DB
info@rittmanmead.com www.rittmanmead.com @rittmanmead 7
•Data for a customer 360 system is typically landed into a Hadoop & NoSQL-based data reservoir

•Applies aggregation, joining and machine-learning processes to extract insights
Design Pattern : “Data Lake” or “Data Reservoir”
Data	Transfer Data	Access
Data	Factory
Data	Reservoir
Business	
Intelligence	Tools
Hadoop	Platform
File	Based	
Integration
Stream	
Based	
Integration
Data	streams
Discovery	&	Development	Labs
Safe	&	secure	Discovery	and	Development	
environment
Data	sets	and	
samples
Models	and	
programs
Marketing	/
Sales	Applications
Models
Machine
Learning
Segments
Operational	Data
Transactions
Customer
Master Data
Unstructured	Data
Voice	+	Chat	
Transcripts
ETL	Based
Integration
Raw	
Customer	Data
Data	stored	in	
the	original	
format	(usually	
files)		such	as	
SS7,	ASN.1,	
JSON	etc.
Mapped	
Customer	Data
Data	sets	
produced	by	
mapping	and	
transforming	
raw	data
info@rittmanmead.com www.rittmanmead.com @rittmanmead 8
•Combine with a traditional data warehouse to add storage, support for new datatypes

•Land raw data in real-time into Hadoop, then process and store
Combine with Traditional Data Warehouse
info@rittmanmead.com www.rittmanmead.com @rittmanmead 9
•Hadoop is the overall framework for enabling low-cost, scalable cluster computing

‣HDFS cluster filesystem stores the data, in a process/query neutral form (files)

‣YARN resource manager allocates resources to Hadoop jobs

‣MapReduce and other processing frameworks 

then work on that data

•Data is decoupled from the engine that processes it

•Layers can be swapped out (Mesos for YARN etc)

•Hadoop takes care of the overall cluster framework
Key Hadoop Platform Technologies
Hadoop	Distributed	Filesystem	(HDFS)
YARN	Resource	Manager
Query	and	Processing	Engines
Batch

(MapReduce)
In-Memory

(Spark)
Streaming	

(Spark,	Storm)
Graph	+	Search

(Solr,	Giraph)
Unstructured	/

Semi-Structured

Log	Data
Offloaded

Archive

Data
Social	Graphs

&	Networks
Smart	Meter

&	Sensor	Data
sounds good
but I’m a DBA
it’s all files
I don’t know MapReduce
or Scala
or whatever the latest

made-up Hadoop language is
info@rittmanmead.com www.rittmanmead.com @rittmanmead
Introducing SQL-on-Hadoop
•Hadoop is not a cheap substitute for enterprise DW
platforms - don’t use it like this

•But adding SQL processing and abstraction can help
in many scenarios:

• Query access to data stored in Hadoop as an
archive

• Aggregating, sorting, filtering data

• Set-based transformation capabilities for other
frameworks (e.g. Spark)

• Ad-hoc analysis and data discovery in real time

• Providing tabular abstractions over complex
datatypes
19
Hadoop	Distributed	Filesystem	(HDFS)
YARN	Resource	Manager
Query	and	Processing	Engines
Batch

(MapReduce)
In-Memory

(Spark)
Streaming	

(Spark,	Storm)
Graph	+	Search

(Solr,	Giraph)
Unstructured	/

Semi-Structured

Log	Data
Offloaded

Archive

Data
Social	Graphs

&	Networks
Smart	Meter

&	Sensor	Data
SQL

Engine
SQL

Engine
info@rittmanmead.com www.rittmanmead.com @rittmanmead 20
•Modern SQL-on-Hadoop engines often provide connectivity

to data sources outside of the Hadoop cluster

‣Traditional DW platforms

‣No-SQL databases e.g. MongoDB

‣Files, JDBC etc

•Provide a framework for data integration

and data federation, using JDBC drivers
Enables Integration with External (And Internal) Data
Hadoop	Distributed	Filesystem	(HDFS)
YARN	Resource	Manager
Query	and	Processing	Engines
In-Memory

(Spark)
Unstructured	/

Semi-Structured

Log	Data
Offloaded

Archive

Data
Social	Graphs

&	Networks
Smart	Meter

&	Sensor	Data
SQL

Engine
20
NoSQL	Key-Value

Store	DB
info@rittmanmead.com www.rittmanmead.com @rittmanmead 21
•Most traditional data warehousing vendors offer a Hadoop integration option

•Oracle Big Data SQL

•IBM Big SQL etc

•Leverage lower-level SQL-on-Hadoop

metadata but use own server process

•Allows DBAs to write SQL using RDBMS

SQL dialect, run across relational, Hadoop

and NoSQL servers
Hadoop	Distributed	Filesystem	(HDFS)
YARN	Resource	Manager
Query	and	Processing	Engines
Oracle

Big	Data	SQL

Server
Unstructured	/

Semi-Structured

Log	Data
Offloaded

Archive

Data
Social	Graphs

&	Networks
Smart	Meter

&	Sensor	Data
21
NoSQL	Key-Value

Store	DB
Platform for Traditional DW Integration with Hadoop
Oracle

RDBMS
So how do they work?
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•The original SQL-on-Hadoop engine, developed at Facebook and now a top-level Apache project

•Allows users to query Hadoop data using SQL-like language

•Tabular metadata layer that overlays files, can interpret semi-structured data (e.g. JSON)

•Generates MapReduce code to return required data

•Extensible through SerDes and Storage Handlers

•JDBC and ODBC drivers for most platforms/tools

•Perfect for set-based access + batch ETL work
23
Apache Hive : SQL Metadata + Engine over Hadoop
YARN	Resource	Manager
Hadoop	Distributed	Filesystem	(HDFS)
Unstructured	/

Semi-Structured

Log	Data
Offloaded

Archive

Data
Social	Graphs

&	Networks
Smart	Meter

&	Sensor	Data
23
MapReduce	Processing	Framework
Apache	Hive	SQL	Processing	Engine
HiveQL	SQL	Commands
Java	JARs
Submitted	Jobs
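To make this concrete, here is a minimal sketch of overlaying a Hive table onto files already sitting in HDFS - the table, column and path names are illustrative:

-- Hypothetical example: overlay a tabular schema on raw CSV log files in HDFS
-- No data is moved; Hive just records the file-to-table mapping in its metastore
CREATE EXTERNAL TABLE access_log (
  host       STRING,
  request_ts STRING,
  url        STRING,
  status     INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/iot/logs/';

-- Queries against the table then compile down to MapReduce jobs
SELECT status, COUNT(*) AS requests
FROM access_log
GROUP BY status;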
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Queries come in via JDBC/ODBC, the Hive Thrift Server,

from the CLI or via Hue (for example)

•The Hive Metastore (data dictionary) maps files and

other Hadoop data structures onto tables and columns

•The Hive SQL engine parses, plans and then executes

the query, using an execution plan similar to Oracle,

SQL Server and other RDBMS engines

•MapReduce code is then auto-generated, and submitted

to YARN, and then run on the Hadoop cluster
24
Apache Hive Logical Architecture
Hive	Thrift	Server
JDBC	/	ODBC
Parser Planner
Execution	Engine
Metastore
HueCLI
MapReduce
HDFS
hive> select count(*) from src_customer;


Total MapReduce jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=

In order to set a constant number of reducers:

set mapred.reduce.tasks=

Starting Job = job_201303171815_0003, Tracking URL = 

http://localhost.localdomain:50030/jobdetails.jsp…

Kill Command = /usr/lib/hadoop-0.20/bin/

hadoop job -Dmapred.job.tracker=localhost.localdomain:8021 

-kill job_201303171815_0003



2013-04-17 04:06:59,867 Stage-1 map = 0%, reduce = 0%

2013-04-17 04:07:03,926 Stage-1 map = 100%, reduce = 0%

2013-04-17 04:07:14,040 Stage-1 map = 100%, reduce = 33%

2013-04-17 04:07:15,049 Stage-1 map = 100%, reduce = 100%

Ended Job = job_201303171815_0003

OK

25

Time taken: 22.21 seconds
HiveQL

Query
MapReduce

Job	submitted
Results	

returned
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Data integration tools such as Oracle Data Integrator can load and process Hadoop data

•BI tools such as Oracle Business Intelligence 12c can report on Hadoop data

•Generally use MapReduce and Hive to access data

‣ODBC and JDBC access to Hive tabular data

‣Allows Hadoop unstructured/semi-structured

data on HDFS to be accessed like RDBMS
Provides a SQL Interface for BI + ETL Tools
Access Hive directly, or extract using ODI12c
for structured OBIEE dashboard analysis
What pages are people visiting?
Who is referring to us on Twitter?
What content has the most reach?
T : +44 (0) 1273 911 268 (UK) or (888) 631-1410 (USA) or 

+61 3 9596 7186 (Australia & New Zealand) or +91 997 256 7970 (India)
E : info@rittmanmead.com
W : www.rittmanmead.com
Connecting to Hive using Beeline CLI
•From the command-line, either use the Hive CLI or the Beeline CLI

‣HUE (“Hadoop User Experience”) provides a web interface into Hive (think Oracle APEX)
[iot@cdh-node1 ~]$ beeline -u jdbc:hive2://cdh-node1:10000 -n iot -p welcome1 -d org.apache.hive.jdbc.HiveDriver
Connecting to jdbc:hive2://cdh-node1:10000

Connected to: Apache Hive (version 1.1.0-cdh5.5.1)

Driver: Hive JDBC (version 1.1.0-cdh5.5.1)

Transaction isolation: TRANSACTION_REPEATABLE_READ

Beeline version 1.1.0-cdh5.5.1 by Apache Hive
0: jdbc:hive2://cdh-node1:10000> show tables;

+-----------------------------------+--+

| tab_name |

+-----------------------------------+--+

| flight_delays |

| my_second_table |

| oracle_analytics_tweets |

+-----------------------------------+--+
8 rows selected (0.137 seconds)

0: jdbc:hive2://cdh-node1:10000>
Add SQL*Developer
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Hive is extensible in four major ways that help with accessing and integrating new data sets

•SerDes : Serializer-Deserializers that interpret semi-structured sources + make tabular

•UDFs + Hive Streaming : Add user-defined functions and whole-row external processing

•File Formats : make use of compressed and/or optimised file storage

•Storage Handlers : use storage other than HDFS (e.g. MongoDB) as data source
Hive Extensibility - The “Swiss Army Knife” of Hadoop
Client
Client
HDFS	Fileformats
JDBC	/	ODBC
Metastore
MapReduce
UDF/UDAFs
SerDe
Scripts
HBase
MongoDB
Parser
Execution	Engine
HiveQL
Planner
Storage	Hdlrs
TextFile
Parquet
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Extend Hive by adding new computation and aggregation capabilities

•UDFs (row-based), UDAFs (aggregation) and UDTFs (table functions)
Hive Extensibility through UDFs and UDAFs
add jar target/JsonSplit-1.0-SNAPSHOT.jar;
create temporary function json_split 

as 'com.pythian.hive.udf.JsonSplitUDF';
create table json_example (json string);
load data local inpath 'split_example.json' 

into table json_example;
SELECT ex.* FROM json_example 

LATERAL VIEW explode(json_split(json_example.json)) ex;

ADD JAR ./ext.jar;
CREATE TEMPORARY FUNCTION process_names as 'com.matthewrathbone.example.NameParserGenericUDTF';
SELECT
adTable.name,
adTable.surname
FROM people
lateral view process_names(name) adTable as name, surname;
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Allows data to be stored in optimised storage format

‣Column-store for analytics

‣Self-describing, splittable storage

for general-purpose use

‣Compressed data

‣Semi-structured (e.g. log) data
29
SerDes & Storage Handlers Further Decouple Storage
Hadoop	Distributed	Filesystem	(HDFS)
Query and Processing Engines
MapReduce
Unstructured	/

Semi-Structured

Log	Data
Offloaded

Archive

Data
Social	Graphs

&	Networks
Smart	Meter

&	Sensor	Data
SQL

Engine
NoSQL	Key-Value

Store	DB
RegEx	Serde Parquet	SerDe
JSON	SerDe
NoSQL	Key-Value

Store	DB
MongoDB

Store	Handler
MongoDB

Store	Handler
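As an illustration of the storage handler route, here is a sketch of a Hive table defined over a MongoDB collection using the mongo-hadoop connector - this assumes the connector JARs are installed, and the URI, database and column mappings are invented for the example:

-- Hypothetical: Hive table backed by a MongoDB collection rather than HDFS files
CREATE EXTERNAL TABLE mongo_tweets (
  id   STRING,
  text STRING
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES ('mongo.columns.mapping' = '{"id":"_id","text":"text"}')
TBLPROPERTIES ('mongo.uri' = 'mongodb://mongodb-host:27017/twitter.tweets');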
info@rittmanmead.com www.rittmanmead.com @rittmanmead 30
•Splittability - can the file be split into blocks and processed in parallel

‣CSV files can be split by file line; XML files can’t because of opening and closing tags

•Ability to compress - CSV files can’t be block compressed, which impacts space / performance

•Support for schema evolution - does the file contain in-built schema information that self-describes
the data?
File Formats in Hadoop Are Important
2016-01-28T09:30:28Z,2016-01-28T11:56:24Z,145.933

2016-01-29T00:19:35Z,2016-01-29T01:36:49Z,77.233

2016-01-29T02:10:35Z,2016-01-29T02:32:18Z,21.717

2016-01-29T03:08:07Z,2016-01-29T03:16:11Z,8.067

2016-01-29T03:51:24Z,2016-01-29T06:57:44Z,186.333

2016-01-29T07:05:50Z,2016-01-29T07:13:21Z,7.517

2016-01-29T07:25:53Z,2016-01-29T07:30:23Z,4.5

2016-01-29T23:30:00Z,2016-01-30T07:00:30Z,450.5

2016-01-31T23:30:00Z,2016-02-01T07:30:00Z,480

2016-02-02T00:35:54Z,2016-02-02T02:10:54Z,95
CSV	Extract	from	Apple	Health
• Human	readable,	splittable	
• No	ability	to	block	compress	
• No	in-built	self-describing	metadata	
• Timestamps	will	need	special	processing	
• Store	final	data	in	parquet	format	to	
address	some	of	these	concerns
{"entities": {"user_mentions": [], "media": [],
"hashtags": [], "urls": []}, "text": "Off to visit
our office in Bangalore in 15 mins. It'll be good
to meet up with Venkat again, plus his team of Ram
and Jay.", "created_at": "2010-09-01 00:00:00
+0000", "source": "<a href="http://twitter.com"
rel="nofollow">Twitter Web Client</a>", "id_str":
"22684302309", "geo": {}, "id": 22684302309,
"user": {"verified": false, "name": "Mark Rittman",
"profile_image_url_https": "https://pbs.twimg.com/
profile_images/702537100890087425/
rAlqgrGX_normal.jpg", "protected": false, "id_str":
"14716125", "id": 14716125, "screen_name":
"markrittman"}}
JSON	Records	from	Twitter
• Human	readable,	splittable	
• No	ability	to	block	compress	(+verbose)	
• In-built self-describing metadata
• Less	mature	SerDe	support
info@rittmanmead.com www.rittmanmead.com @rittmanmead 31
•Beginners usually store data in HDFS using text file formats (CSV) but these have limitations

•Apache AVRO is often used for general-purpose processing

‣Splittability, schema evolution, in-built metadata, support for block compression

•Parquet is now commonly used with Impala due to its column-orientated storage

‣Mirrors work in RDBMS world around column-store

‣Only return (project) the columns you require across a wide table
Specialised File Formats - Parquet and AVRO
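As a sketch (the table names are illustrative), converting an existing text-format table into these formats is a one-line CTAS in HiveQL:

-- Store a copy in Parquet for column-pruned, analytic-style scans
create table flight_delays_parquet stored as parquet
as select * from flight_delays;

-- ...or in AVRO for general-purpose pipelines that need schema evolution
-- (STORED AS AVRO requires Hive 0.14 or later)
create table flight_delays_avro stored as avro
as select * from flight_delays;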
info@rittmanmead.com www.rittmanmead.com @rittmanmead 32
Example HiveQL Commands to Create + Populate Table
create table health_sleep_analysis_tmp (

asleep_start string,

asleep_end string,

mins_asleep float)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'

WITH SERDEPROPERTIES (

"separatorChar" = “,",

"quoteChar" = "'",

"escapeChar" = ""

)

STORED AS TEXTFILE;
create table health_sleep_analysis

stored as parquet

as

select from_unixtime(unix_timestamp(asleep_start, "yyyy-MM-dd'T'hh:mm:ss'Z'")) asleep_start_ts,

from_unixtime(unix_timestamp(asleep_end, "yyyy-MM-dd'T'hh:mm:ss'Z'")) asleep_end_ts,

mins_asleep

from health_sleep_analysis_tmp;
• Define a temporary Hive table to store start and end times/dates as strings,
as we can’t do the string-to-timestamp conversion using the LOAD DATA
command
• Use the OpenCSVSerde SerDe so that we can specify delimiters, quote
chars and escape chars for file data
• Store	as	regular	uncompressed	human-readable	text	file
LOAD DATA INPATH '/user/iot/Health/apple_health_sleep_analysis_noheader.csv'
OVERWRITE INTO TABLE health_sleep_analysis_tmp;
• Load	the	data	file	into	that	temporary	Hive	table
• Now	re-load	that	temporary	data	into	more	
optimised	Parquet	format	files,	suitable	for	ad-hoc	
analytic	querying	
• Convert	the	timestamps	currently	held	in	generic	
string	datatype	fields	into	more	optimal	TIMESTAMP	
datatypes	using	a	Hive	UDF
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•One of several third-party SerDes available to download from Github
Use of Third-Party (Community) Serde - JSONSerde
CREATE EXTERNAL TABLE tweets(
id string,
created_at string,
source string,
favorited boolean,
retweeted_status struct<text:string,
user:struct<screen_name:string,name:string>,
retweet_count:int>,
entities struct<urls:array
<struct<expanded_url:string>>, 

user_mentions:array<struct<screen_name:string,name:string>>,
hashtags:array<struct<text:string>>>,
text string,
user struct<screen_name:string,name:string,friends_count:int,followers_count:int,
statuses_count:int,verified:boolean,utc_offset:int,time_zone:string>,
in_reply_to_screen_name string
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
STORED AS TEXTFILE
LOCATION '/user/iot/tweets/';
• Note	the	use	of	STRUCT	and	ARRAY	datatypes	
• Used	to	handle	arrays	of	hashtags,	URLs	etc	in	tweets
Just	select	the	JSON	elements	that	we	want	from	the	
overall	schema	in	JSON	records
Created	as	an	external	Hive	table,	so	overlays	schema	on	
existing	directory	of	files
info@rittmanmead.com www.rittmanmead.com @rittmanmead 34
•Hive SELECT statement against nested columns returns data as arrays

•Can parse programmatically, or create further views or CTAS tables to split out the array
Support for Nested (Array)-Type Structures
hive> select entities, user from tweets
> limit 3;
OK
{"urls":[{"expanded_url":"http://www.rittmanmead.com/
biforum2013"}],"user_mentions":[],"hashtags":[]}
{"screen_name":"markrittman","name":"Mark
Rittman","friends_count":null,"followers_count":null,"statuses_count":null,"ver
ified":false,"utc_offset":null,"time_zone":null}
{"urls":[{"expanded_url":"http://www.bbc.co.uk/news/
technology-22299503"}],"user_mentions":[],"hashtags":[]}
{"screen_name":"markrittman","name":"Mark
Rittman","friends_count":null,"followers_count":null,"statuses_count":null,"ver
ified":false,"utc_offset":null,"time_zone":null}
{"urls":[{"expanded_url":"http://pocket.co/seb2e"}],"user_mentions":
[{"screen_name":"ArtOfBI","name":"Christian Screen"},
{"screen_name":"wiseanalytics","name":"Lyndsay Wise"}],"hashtags":[]}
{"screen_name":"markrittman","name":"Mark
Rittman","friends_count":null,"followers_count":null,"statuses_count":null,"ver
ified":false,"utc_offset":null,"time_zone":null}
How do you work with these values?
CREATE TABLE tweets_expanded
stored as parquet
AS select
tweets.id,
tweets.created_at,
tweets.user.screen_name as user_screen_name,
tweets.user.friends_count as user_friends_count,
tweets.user.followers_count as user_followers_count,
tweets.user.statuses_count as user_tweets_count,
tweets.text,
tweets.in_reply_to_screen_name,
tweets.retweeted_status.user.screen_name as retweet_user_screen_name,
tweets.retweeted_status.retweet_count as retweet_count,
tweets.entities.urls[0].expanded_url as url1,
tweets.entities.urls[1].expanded_url as url2,
tweets.entities.hashtags[0].text as hashtag1,
tweets.entities.hashtags[1].text as hashtag2,
tweets.entities.hashtags[2].text as hashtag3,
tweets.entities.hashtags[3].text as hashtag4
from tweets;
Create	a	copy	of	the	table	in	Parquet	storage	format
“Denormalize”	the	array	by	selecting	individual	elements	
CREATE view tweets_expanded_view
AS select
tweets.id,
tweets.created_at,
tweets.user.screen_name as user_screen_name,
tweets.user.friends_count as user_friends_count,
tweets.user.followers_count as user_followers_count,
tweets.user.statuses_count as user_tweets_count,
tweets.text,
tweets.in_reply_to_screen_name,
tweets.retweeted_status.user.screen_name as retweet_user_screen_name,
tweets.retweeted_status.retweet_count as retweet_count,
tweets.entities.urls[0].expanded_url as url1,
tweets.entities.urls[1].expanded_url as url2,
tweets.entities.hashtags[0].text as hashtag1,
tweets.entities.hashtags[1].text as hashtag2,
tweets.entities.hashtags[2].text as hashtag3,
tweets.entities.hashtags[3].text as hashtag4
from tweets;
…	or	create	as	a	view	(not	all	BI	tools	support	views	though)
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Use HiveQL to create aggregations, select individual columns (JSON elements) from data

•Use WHERE clause to limit data returned & ORDER BY to sort - as per normal SQL
35
Calculating Aggregations, Filtering Tweet Data
select text, hashtag1, hashtag2 from tweets_expanded

where hashtag1 = ‘obiee’;
Column	selection	only	=	just	MAP	task
select in_reply_to_screen_name, count(*) as total_replies_to
from tweets_expanded
group by in_reply_to_screen_name
order by total_replies_to desc
limit 10;
Selection	and	aggregation	=	MAP()	and	REDUCE	task
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Hive MR jobs can have multiple stages

•MapReduce Stages, Metastore operations

•File Move / Rename etc
Multi-Stage MapReduce Jobs
SELECT
LOWER(hashtags.text),
COUNT(*) AS total_count
FROM (
SELECT * FROM tweets WHERE regexp_extract(created_at,"(2015)*",1) = "2015"
) tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15
1
2
info@rittmanmead.com www.rittmanmead.com @rittmanmead
Multi-Step HiveQL Transforms - Tweet Sentiment
create external table load_tweets(id string,text STRING) 

ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe' 

LOCATION '/user/iot/tweets';
create table split_words as 

select id as id,split(text,' ') as words 

from load_tweets;
create table tweet_word as 

select id as id,word 

from split_words 

LATERAL VIEW explode(words) w as word;
create table dictionary

(word string,rating int) 

ROW FORMAT DELIMITED 

FIELDS TERMINATED BY '\t';
create table word_join as 

select tweet_word.id,tweet_word.word,dictionary.rating 

from tweet_word 

LEFT OUTER JOIN dictionary 

ON(tweet_word.word =dictionary.word);
select t.text, r.rating from tweets_expanded t
join (select id,AVG(rating) as rating 

from word_join 

GROUP BY word_join.id) r on t.id = r.id
order by r.rating;
LOAD DATA INPATH 'afinn.txt' 

into TABLE dictionary;
1
2
3
4
5
6
7
Take	all	the	text	within	a	set	of	tweets,	and	explode-out	
all	the	words	into	a	table,	one	row	per	word
Load	in	a	dictionary	file	that	we’ll	use	to	determine	the	
sentiment	of	words	in	these	tweets
Join	the	words	and	the	dictionary	sentiment	scores	
together,	so	every	word	used	with	any	of	the	tweets	has	
a	sentiment	score	we	can	use
Now	average-out	the	sentiment	scores	for	each	word	
within	a	tweet,	and	return	the	tweet	text	and	those	
averages	listed	in	descending	sentiment	order
info@rittmanmead.com www.rittmanmead.com @rittmanmead 38
•Not all join types are available in Hive - joins must be equality joins (see the sketch after this slide)

•No sequences, no primary keys on tables

•Generally need to stage Oracle or other external data into Hive before joining to it

•Hive latency - not good for small microbatch-type work

‣But other alternatives exist - Spark, Impala etc

•Don’t assume that HiveQL == Oracle SQL

‣Test assumptions before committing to platform

•Hive is INSERT / APPEND only - no updates, deletes etc

‣But HBase may be suitable for CRUD-type loading
SQL Considerations : Using Hive vs. Regular Oracle SQL
vs.
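For example, a classic equality join works fine in HiveQL, but a theta join does not - a minimal sketch with illustrative table names:

-- Equality join: supported
select f.dest, c.carrier_name
from flight_delays f
join carriers c on f.carrier = c.code;

-- Non-equality (theta) join: NOT supported by classic Hive, and must be
-- rewritten or pre-staged in the RDBMS first:
-- select ... from a join b on a.amount < b.threshold;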
info@rittmanmead.com www.rittmanmead.com @rittmanmead 39
•Based on the BigTable paper from Google, 2006 (Chang et al.)

‣“Bigtable is a sparse, distributed, persistent multi-dimensional sorted map.”

•Key Features:

‣Distributed storage across cluster of machines – Random, online read and write data access

‣Schemaless data model (“NoSQL”)

‣Self-managed data partitions

•Why would you use it with Hive?

‣Allows you to do update and delete

activity rather than just Hive append-only

‣Very fast for incremental loading

‣Can define Hive tables over HBase ones,

allowing OBIEE to then access them
What is HBase?
info@rittmanmead.com www.rittmanmead.com @rittmanmead 40
•HBase Shell CLI allows you to create HBase tables

•GET and PUT commands can then be used to add/update cells, query cells etc
Creating HBase Tables using HBase Shell
hbase shell
create 'carriers','details'
create 'geog_origin','origin'
create 'geog_dest','dest'
create 'flight_delays','dims','measures'
put 'geog_dest','LAX','dest:airport_name','Los Angeles, CA: Los Angeles'
put 'geog_dest','LAX','dest:city','Los Angeles, CA'
put 'geog_dest','LAX','dest:state','California'
put 'geog_dest','LAX','dest:id','12892'
hbase(main):015:0> scan 'geog_dest'
ROW                                    COLUMN+CELL

LAX                                   column=dest:airport_name, timestamp=1432067861347, value=Los Angeles, CA: Los Angeles

LAX                                   column=dest:city, timestamp=1432067861375, value=Los Angeles,CA

LAX                                   column=dest:id, timestamp=1432067862018,value=12892

LAX                                   column=dest:state, timestamp=1432067861404,value=California

1 row(s) in 0.0240 seconds
info@rittmanmead.com www.rittmanmead.com @rittmanmead 41
•Direct extract from salesforce.com into HBase 

using Python and add-in packages

‣Python packages extend functionality 

by adding APIs, integration etc

‣Happybase, Beatbox and Pyhs2 packages 

installed along with Python

•All free and open-source
Programmatically Loading HBase Tables using Python
import pyhs2
import happybase
connection = happybase.Connection('bigdatalite')
flight_delays_hbase_table = connection.table('test1_flight_delays')
b = flight_delays_hbase_table.batch(batch_size=10000)
with pyhs2.connect(host='bigdatalite',
               port=10000,
               authMechanism="PLAIN",
               user='oracle',
               password='welcome1',
               database='default') as conn:
    with conn.cursor() as cur:
        #Execute query
        cur.execute("select * from flight_delays_initial_load")
        #Fetch table results
        for i in cur.fetch():
            b.put(str(i[0]),{'dims:year': i[1],
                             'dims:carrier': i[2],
                             'dims:orig': i[3],
                             'dims:dest': i[4],
                             'measures:flights': i[5],
                             'measures:late': i[6],
                             'measures:cancelled': i[7],
                             'measures:distance': i[8]})
b.send()
info@rittmanmead.com www.rittmanmead.com @rittmanmead 42
•Create Hive tables over the HBase ones to provide SQL load/query capabilities

‣Uses the HBaseStorageHandler storage handler for HBase

‣HBase columns mapped to Hive columns using SERDEPROPERTIES
Create Hive Table Metadata over HBase Tables
CREATE EXTERNAL TABLE hbase_flight_delays
(key string,
  year string,
  carrier string,
  orig string,
  dest string,
  flights string,
  late   string,
  cancelled string,
  distance string
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES
("hbase.columns.mapping" = ":key,dims:year,dims:carrier,dims:orig,dims:dest,

measures:flights,measures:late,measures:cancelled,measures:distance")
TBLPROPERTIES ("hbase.table.name" = "test1_flight_delays");
info@rittmanmead.com www.rittmanmead.com @rittmanmead 43
•Use HiveQL commands INSERT INTO TABLE … SELECT to load (merge) new data

•Use HiveQL SELECT query to retrieve data from HBase table
Load and Query HBase using HiveQL
insert into table hbase_flight_delays              
select * from flight_delays_initial_load;      
        
Total jobs = 1
...
Total MapReduce CPU Time Spent: 11 seconds 870 msec
OK
Time taken: 40.301 seconds
select count(*), min(cast(key as bigint)) as min_key, max(cast(key as bigint)) as max_key
from hbase_flight_delays;
Total jobs = 1
...
Total MapReduce CPU Time Spent: 14 seconds 660 msec
OK
200000  1  200000
Time taken: 53.076 seconds, Fetched: 1 row(s)
info@rittmanmead.com www.rittmanmead.com @rittmanmead 44
•But Parquet (and HDFS) have significant limitations for real-time analytics applications

‣Append-only orientation, focus on column-store 

makes streaming ingestion harder

•Cloudera Kudu aims to combine 

best of HDFS + HBase

‣Real-time analytics-optimised 

‣Supports updates to data

‣Fast ingestion of data

‣Accessed using SQL-style tables

and get/put/update/delete API
Cloudera Kudu - Combining Best of HBase and Column-Store
info@rittmanmead.com www.rittmanmead.com @rittmanmead 45
•Kudu storage used with Impala - create tables using Kudu storage handler

•Can now UPDATE, DELETE and INSERT into Hadoop tables, not just SELECT and LOAD DATA
Example Impala DDL + DML Commands with Kudu
CREATE TABLE `my_first_table` (
`id` BIGINT,
`name` STRING
)
TBLPROPERTIES(
'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
'kudu.table_name' = 'my_first_table',
'kudu.master_addresses' = 'kudu-master.example.com:7051',
'kudu.key_columns' = 'id'
);
INSERT INTO my_first_table VALUES (99, "sarah");
INSERT IGNORE INTO my_first_table VALUES (99, "sarah");
UPDATE my_first_table SET name="bob" where id = 3;
DELETE FROM my_first_table WHERE id < 3;
DELETE c FROM my_second_table c, stock_symbols s WHERE c.name = s.symbol;
only one problem…
Hive is slow
too slow

for ad-hoc querying
info@rittmanmead.com www.rittmanmead.com @rittmanmead 50
•MapReduce’s great innovation was to break processing down into distributed jobs

•Jobs that have no functional dependency on each other, only upstream tasks

•Provides a framework that is infinitely scalable and very fault tolerant

•Hadoop handled job scheduling and resource management

‣All MapReduce code had to do was provide the “map” and “reduce” functions

‣Automatic distributed processing

‣Slow but extremely powerful
Hadoop 1.0 and MapReduce
info@rittmanmead.com www.rittmanmead.com @rittmanmead 51
•A typical Hive or Pig script compiles down into multiple MapReduce jobs

•Each job stages its intermediate results to disk

•Safe, but slow - write to disk, spin-up separate JVMs for each job
MapReduce - Scales By Writing Intermediate Results to Disk
SELECT
LOWER(hashtags.text),
COUNT(*) AS total_count
FROM (
SELECT * FROM tweets WHERE regexp_extract(created_at,"(2015)*",1) = "2015"
) tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.34 sec HDFS Read: 10952994 HDFS Write: 5239 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 2.1 sec HDFS Read: 9983 HDFS Write: 164 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 440 msec
OK
1
2
info@rittmanmead.com www.rittmanmead.com @rittmanmead 52
•MapReduce 2 (MR2) splits the functionality of the JobTracker

by separating resource management and job scheduling/monitoring

•Introduces YARN (Yet Another Resource Negotiator)

•Permits other processing frameworks in addition to MR

‣For example, Apache Spark

•Maintains backwards compatibility with MR1

•Introduced with CDH5+
MapReduce 2 and YARN
Node

Manager
Node

Manager
Node

Manager
Resource

Manager
Client
Client
info@rittmanmead.com www.rittmanmead.com @rittmanmead 53
•Runs on top of YARN, provides a faster execution engine than MapReduce for Hive, Pig etc

•Models processing as an entire data flow graph (DAG), rather than separate job steps

‣DAG (Directed Acyclic Graph) is a new programming style for distributed systems

‣Dataflow steps pass data between them as streams, rather than writing/reading from disk

•Supports in-memory computation, enables Hive on Tez (Stinger) and Pig on Tez

•The in-memory / Hive v2 route favoured by Hortonworks
Apache Tez
InputData
TEZ DAG
Map()
Map()
Map()
Reduce()
OutputData
Reduce()
Reduce()
Reduce()
InputData
Map()
Map()
Reduce()
Reduce()
info@rittmanmead.com www.rittmanmead.com @rittmanmead 54
Tez Advantage - Drop-In Replacement for MR with Hive, Pig
set hive.execution.engine=mr → 4m 17s
set hive.execution.engine=tez → 2m 25s
info@rittmanmead.com www.rittmanmead.com @rittmanmead 56
•Cloudera’s answer to Hive query response time issues

•MPP SQL query engine running on Hadoop, bypasses MapReduce for
direct data access

•Mostly in-memory, but spills to disk if required

•Uses Hive metastore to access Hive table metadata

•Similar SQL dialect to Hive - not as rich though and no support for Hive
SerDes, storage handlers etc
Cloudera Impala - Fast, MPP-style Access to Hadoop Data
info@rittmanmead.com www.rittmanmead.com @rittmanmead 57
How Impala Works
Impala

Daemon
HDFS

DataNode
SQL

App
ODBC	/

JDBC
HDFS

DataNode
HDFS

DataNode
HDFS

DataNode
Impala

Daemon
Impala

Daemon
Impala

Daemon
Hive

MetaStore
Impala

StateStore
•Cloudera-based solution for ad-hoc SQL-on-Hadoop

•MPP SQL query engine running on Hadoop, with
daemons running on each Hadoop node 

•In contrast to jobs being submitted via YARN

•Mostly in-memory, but spills to disk if required

•Uses Hive metastore to access Hive table metadata

•Similar SQL dialect to Hive - not as rich though and
no support for Hive SerDes, storage handlers etc
info@rittmanmead.com www.rittmanmead.com @rittmanmead 58
•Log into Impala Shell, run INVALIDATE METADATA command to refresh Impala table list

•Run SHOW TABLES Impala SQL command to view tables available

•Run COUNT(*) on main ACCESS_PER_POST table to see typical response time
Enabling Hive Tables for Impala
[oracle@bigdatalite ~]$ impala-shell
Starting Impala Shell without Kerberos authentication
[bigdatalite.localdomain:21000] > invalidate metadata;
Query: invalidate metadata
Fetched 0 row(s) in 2.18s
[bigdatalite.localdomain:21000] > show tables;
Query: show tables
+-----------------------------------+
| name |
+-----------------------------------+
| access_per_post |
| access_per_post_cat_author |
| … |
| posts |
+-----------------------------------+
Fetched 45 row(s) in 0.15s
[bigdatalite.localdomain:21000] > select count(*) 

from access_per_post;
Query: select count(*) from access_per_post
+----------+
| count(*) |
+----------+
| 343 |
+----------+
Fetched 1 row(s) in 2.76s
info@rittmanmead.com www.rittmanmead.com @rittmanmead 59
•Significant improvement over Hive response time

•Now makes Hadoop suitable for ad-hoc querying
Significantly-Improved Ad-Hoc Query Response Time vs Hive
Simple Two-Table Join against Hive Data Only:
Logical Query Summary Stats: Elapsed time 50, Response time 49, Compilation time 0 (seconds)

Simple Two-Table Join against Impala Data Only:
Logical Query Summary Stats: Elapsed time 2, Response time 1, Compilation time 0 (seconds)
What about integration

with my Data Warehouse?
info@rittmanmead.com www.rittmanmead.com @rittmanmead 61
•Most traditional data warehousing vendors offer a Hadoop integration option

•Oracle Big Data SQL

•IBM Big SQL etc

•Leverage lower-level SQL-on-Hadoop

metadata but use own server process

•Allows DBAs to write SQL using RDBMS

SQL dialect, run across relational, Hadoop

and NoSQL servers
Hadoop	Distributed	Filesystem	(HDFS)
YARN	Resource	Manager
Query	and	Processing	Engines
Oracle

Big	Data	SQL

Server
Unstructured	/

Semi-Structured

Log	Data
Offloaded

Archive

Data
Social	Graphs

&	Networks
Smart	Meter

&	Sensor	Data
61
NoSQL	Key-Value

Store	DB
Platform for Traditional DW Integration with Hadoop
Oracle

RDBMS
info@rittmanmead.com www.rittmanmead.com @rittmanmead 62
•Originally part of Oracle Big Data Appliance 4.0 (BDA-only) but now available for commodity Hadoop installs

‣Also requires Oracle Database 12c (no longer dependent on Exadata from Big Data SQL 3.0)

‣Extends Oracle Data Dictionary to cover Hive

•Extends Oracle SQL and SmartScan to Hadoop

•Extends Oracle Security Model over Hadoop

‣Fine-grained access control

‣Data redaction, data masking

‣Uses fast C-based readers where possible

(vs. Hive MapReduce generation)
Oracle Big Data SQL
Exadata

Storage Servers
Hadoop

Cluster
Exadata Database

Server
Oracle Big

Data SQL
SQL Queries
SmartScan SmartScan
info@rittmanmead.com www.rittmanmead.com @rittmanmead 63
•Oracle Database 12c 12.1.0.2.0 with Big Data SQL option can view Hive table metadata

‣Linked by Exadata configuration steps to one or more BDA clusters

•DBA_HIVE_TABLES and USER_HIVE_TABLES exposes Hive metadata

•Oracle SQL*Developer 4.0.3, with Cloudera Hive drivers, can connect to Hive metastore
View Hive Table Metadata in the Oracle Data Dictionary
SQL> col database_name for a30
SQL> col table_name for a30
SQL> select database_name, table_name
2 from dba_hive_tables;
DATABASE_NAME TABLE_NAME
------------------------------ ------------------------------
default access_per_post
default access_per_post_categories
default access_per_post_full
default apachelog
default categories
default countries
default cust
default hive_raw_apache_access_log
info@rittmanmead.com www.rittmanmead.com @rittmanmead 64
•Big Data SQL accesses Hive tables through external table mechanism

‣ORACLE_HIVE external table type imports Hive metastore metadata

‣ORACLE_HDFS requires metadata to be specified

•Access parameters cluster and tablename specify Hive table source and BDA cluster
Hive Access through Oracle External Tables + Hive Driver
CREATE TABLE access_per_post_categories(
hostname varchar2(100),
request_date varchar2(100),
post_id varchar2(10),
title varchar2(200),
author varchar2(100),
category varchar2(100),
ip_integer number)
organization external
(type oracle_hive
default directory default_dir
access parameters(com.oracle.bigdata.tablename=default.access_per_post_categories));
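Once defined, the external table behaves like any other Oracle table - so, as a sketch (the Oracle-resident dimension table here is hypothetical), a single query can join Hadoop-resident and relational data:

-- Join Hive-backed data to a regular Oracle table in one SQL statement
SELECT a.author, COUNT(*) AS page_views
FROM   access_per_post_categories a
JOIN   blog_authors o ON a.author = o.author_name
GROUP  BY a.author
ORDER  BY page_views DESC;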
info@rittmanmead.com www.rittmanmead.com @rittmanmead 65
•Brings query-offloading features similar to Exadata

to Oracle Big Data Appliance

•Query across both Oracle and Hadoop sources

•Intelligent query optimisation applies SmartScan

close to ALL data

•Use same SQL dialect across both sources

•Apply same security rules, policies, 

user access rights across both sources
Extending SmartScan, and Oracle SQL, Across All Data
hold on…
“where we’re going

we don’t need roads”
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Apache Drill is another SQL-on-Hadoop project, one that focuses on schema-free data discovery

•Inspired by Google Dremel; the innovation is querying raw data with the schema optional

•Automatically infers and detects schema from semi-structured datasets and NoSQL DBs

•Join across different silos of data e.g. JSON records, Hive tables and HBase database

•Aimed at different use-cases than Hive - 

low-latency queries, discovery 

(think Endeca vs OBIEE)
Introducing Apache Drill - “We Don’t Need No Roads”
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Most modern datasource formats embed their schema in the data (“schema-on-read”)

•Apache Drill makes these as easy to join to traditional datasets as “point me at the data”

•Cuts out unnecessary work in defining Hive schemas for data that’s self-describing

•Supports joining across files,

databases, NoSQL etc
Self-Describing Data - Parquet, AVRO, JSON etc
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Files can exist either on the local filesystem, or on HDFS

•Connection to directory or file defined in storage configuration

•Can work with CSV, TXT, TSV etc

•First row of file can provide schema (column names)
Apache Drill and Text Files
SELECT * FROM dfs.`/tmp/csv_with_header.csv2`;
+-------+------+------+------+
| name | num1 | num2 | num3 |
+-------+------+------+------+
| hello | 1 | 2 | 3 |
| hello | 1 | 2 | 3 |
| hello | 1 | 2 | 3 |
| hello | 1 | 2 | 3 |
| hello | 1 | 2 | 3 |
| hello | 1 | 2 | 3 |
| hello | 1 | 2 | 3 |
+-------+------+------+------+
7 rows selected (0.12 seconds)
SELECT * FROM dfs.`/tmp/csv_no_header.csv`;
+------------------------+
| columns |
+------------------------+
| ["hello","1","2","3"] |
| ["hello","1","2","3"] |
| ["hello","1","2","3"] |
| ["hello","1","2","3"] |
| ["hello","1","2","3"] |
| ["hello","1","2","3"] |
| ["hello","1","2","3"] |
+------------------------+
7 rows selected (0.112 seconds)
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•JSON (Javascript Object Notation) documents are
often used for data interchange

•Exports from Twitter and other consumer services

•Web service responses and other B2B interfaces

•A more lightweight alternative to XML that is “self-describing”

•Handles evolving schemas, and optional attributes

•Drill treats each document as a row, and has features to:

•Flatten nested data (extract elements from arrays)

•Generate key/value pairs for loosely structured data
Apache Drill and JSON Documents
use dfs.iot;
show files;
select in_reply_to_user_id, text from `all_tweets.json`
limit 5;
+---------------------+------+
| in_reply_to_user_id | text |
+---------------------+------+
| null | BI Forum 2013 in Brighton has now sold-out |
| null | "Football has become a numbers game |
| null | Just bought Lyndsay Wise’s Book |
| null | An Oracle BI "Blast from the Past" |
| 14716125 | Dilbert on Agile Programming |
+---------------------+------+
5 rows selected (0.229 seconds)
select name, flatten(fillings) as f 

from dfs.users.`/donuts.json` 

where f.cal < 300;
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Drill can connect to Hive to make use of metastore (incl. multiple Hive metastores)

•NoSQL databases (HBase etc)

•Parquet files (native storage format - columnar + self describing)
Apache Drill and Hive, HBase, Parquet Sources etc
USE hbase;
SELECT * FROM students;
+-------------+-----------------------+-----------------------------------------------------+
| row_key | account | address |
+-------------+-----------------------+------------------------------------------------------+
| [B@e6d9eb7 | {"name":"QWxpY2U="} | {"state":"Q0E=","street":"MTIzIEJhbGxtZXIgQXY="} |
| [B@2823a2b4 | {"name":"Qm9i"} | {"state":"Q0E=","street":"MSBJbmZpbml0ZSBMb29w"} |
| [B@3b8eec02 | {"name":"RnJhbms="} | {"state":"Q0E=","street":"NDM1IFdhbGtlciBDdA=="} |
| [B@242895da | {"name":"TWFyeQ=="} | {"state":"Q0E=","street":"NTYgU291dGhlcm4gUGt3eQ=="} |
+-------------+-----------------------+----------------------------------------------------------------------+
SELECT firstname,lastname FROM 

hiveremote.`customers` limit 10;

+------------+------------+
| firstname | lastname |
+------------+------------+
| Essie | Vaill |
| Cruz | Roudabush |
| Billie | Tinnes |
| Zackary | Mockus |
| Rosemarie | Fifield |
| Bernard | Laboy |
| Marianne | Earman |
+------------+------------+
SELECT * FROM dfs.`iot_demo/geodata/region.parquet`;
+--------------+--------------+-----------------------+
| R_REGIONKEY | R_NAME | R_COMMENT |
+--------------+--------------+-----------------------+
| 0 | AFRICA | lar deposits. blithe |
| 1 | AMERICA | hs use ironic, even |
| 2 | ASIA | ges. thinly even pin |
| 3 | EUROPE | ly final courts cajo |
| 4 | MIDDLE EAST | uickly special accou |
+--------------+--------------+-----------------------+
info@rittmanmead.com www.rittmanmead.com @rittmanmead
•Drill was developed for real-time, ad-hoc data exploration with schema discovery on-the-fly

•Individual analysts exploring new datasets, leveraging corporate metadata/data to help

•Hive is more about large-scale, centrally curated set-based big data access

•Drill models data conceptually as JSON, vs. Hive’s tabular approach

•Drill introspects schema from whatever it connects to, vs. formal modeling in Hive
Apache Drill vs. Apache Hive
Interactive Queries (Data Discovery, Tableau/VA): 100ms - 3 mins (Apache Drill’s sweet spot)
Reporting Queries (Canned Reports, OBIEE): 3 mins - 20 mins
ETL & Batch Queries (ODI, Scripting, Informatica): 20 mins - hours (Apache Hive’s sweet spot)
but…
what’s all this about “Spark”?
info@rittmanmead.com www.rittmanmead.com @rittmanmead 78
•Another DAG execution engine running on YARN

•More mature than Tez, with a richer API and more vendor support

•Uses concept of an RDD (Resilient Distributed Dataset)

‣RDDs like tables or Pig relations, but can be cached in-memory

‣Great for in-memory transformations, or iterative/cyclic processes

•Spark jobs comprise a DAG of tasks operating on RDDs

•Access through Scala, Python or Java APIs

•Related projects include

‣Spark SQL

‣Spark Streaming
Apache Spark
info@rittmanmead.com www.rittmanmead.com @rittmanmead 79
•Native support for multiple languages 

with identical APIs

‣Python - prototyping, data wrangling

‣Scala - functional programming features

‣Java - lower-level, application integration

•Use of closures, iterations, and other 

common language constructs to minimize code

•Integrated support for distributed +

functional programming

•Unified API for batch and streaming
Rich Developer Support + Wide Developer Ecosystem
scala> val logfile = sc.textFile("logs/access_log")
14/05/12 21:18:59 INFO MemoryStore: ensureFreeSpace(77353) 

called with curMem=234759, maxMem=309225062
14/05/12 21:18:59 INFO MemoryStore: Block broadcast_2 

stored as values to memory (estimated size 75.5 KB, free 294.6 MB)
logfile: org.apache.spark.rdd.RDD[String] = 

MappedRDD[31] at textFile at <console>:15
scala> logfile.count()
14/05/12 21:19:06 INFO FileInputFormat: Total input paths to process : 1
14/05/12 21:19:06 INFO SparkContext: Starting job: count at <console>:1
...
14/05/12 21:19:06 INFO SparkContext: Job finished: 

count at <console>:18, took 0.192536694 s
res7: Long = 154563
scala> val logfile = sc.textFile("logs/access_log").cache
scala> val biapps11g = logfile.filter(line => line.contains("/biapps11g/"))
biapps11g: org.apache.spark.rdd.RDD[String] = FilteredRDD[34] at filter at <console>:17
scala> biapps11g.count()
...
14/05/12 21:28:28 INFO SparkContext: Job finished: count at <console>:20, took 0.387960876 s
res9: Long = 403
info@rittmanmead.com www.rittmanmead.com @rittmanmead 80
•Spark SQL, and Data Frames, allow RDDs in Spark to be processed using SQL queries

•Bring in and federate additional data from JDBC sources

•Load, read and save data in Hive, Parquet and other structured tabular formats
Spark SQL - Adding SQL Processing to Apache Spark
val accessLogsFilteredDF = accessLogs
.filter( r => ! r.agent.matches(".*(spider|robot|bot|slurp).*"))
.filter( r => ! r.endpoint.matches(".*(wp-content|wp-admin).*")).toDF()
.registerTempTable("accessLogsFiltered")
val topTenPostsLast24Hour = sqlContext.sql("SELECT p.POST_TITLE, p.POST_AUTHOR, COUNT(*) 

as total 

FROM accessLogsFiltered a 

JOIN posts p ON a.endpoint = p.POST_SLUG 

GROUP BY p.POST_TITLE, p.POST_AUTHOR 

ORDER BY total DESC LIMIT 10 ")
// Persist top ten table for this window to HDFS as parquet file
topTenPostsLast24Hour.save("/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet"

, "parquet", SaveMode.Overwrite)
Hadoop is insecure
and has fragmented security
…doesn’t it?
but …
info@rittmanmead.com www.rittmanmead.com @rittmanmead 82
Consistent Security and Audit Now Emerging on Platform
info@rittmanmead.com www.rittmanmead.com @rittmanmead 83
•Clusters by default are unsecured (vulnerable to account spoofing) & need Kerberos enabled

•Data access controlled by POSIX-style permissions on HDFS files

•Hive and Impala can use Apache Sentry RBAC (see the sketch after this slide)

‣Result is data duplication and complexity

‣No consistent API or abstracted security model
Hadoop Security Initially Was a Mess
/user/mrittman/scratchpad
/user/ryeardley/scratchpad
/user/mpatel/scratchpad
/data/rm_website_analysis/logfiles/incoming
/data/rm_website_analysis/logfiles/archive
/data/rm_website_analysis/tweets/incoming
/data/rm_website_analysis/tweets/archive
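For reference, Sentry policies are defined through GRANT/REVOKE statements issued via Hive or Impala - a short sketch, with the role, group and database names invented for the example:

-- Hypothetical Sentry RBAC grants issued through impala-shell or beeline
CREATE ROLE web_analyst;
GRANT ROLE web_analyst TO GROUP analysts;
GRANT SELECT ON DATABASE rm_website_analysis TO ROLE web_analyst;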
info@rittmanmead.com www.rittmanmead.com @rittmanmead 84
•Use standard Oracle Security over Hadoop & NoSQL

‣Grant & Revoke Privileges

‣Redact Data

‣Apply Virtual Private Database

‣Provides Fine-grain Access Control

•Great solution to extend existing Oracle

security model over Hadoop datasets
Oracle Big Data SQL : Extend Oracle Security to Hadoop
Redacted
data	
subset
SQL
JSON
Customer	data
in	Oracle	DB
DBMS_REDACT.ADD_POLICY(
object_schema => 'txadp_hive_01',
object_name => 'customer_address_ext',
column_name => 'ca_street_name',
policy_name => 'customer_address_redaction',
function_type => DBMS_REDACT.RANDOM,
expression => 'SYS_CONTEXT(''SYS_SESSION_ROLES'', 

''REDACTION_TESTER'')=''TRUE'''
);
info@rittmanmead.com www.rittmanmead.com @rittmanmead 85
•Provides a higher-level, logical abstraction for data (i.e. tables or views)

‣Can be used with Spark & Spark SQL, with predicate pushdown and projection

•Returns schemed objects (instead of paths and bytes) in a similar way to HCatalog

•Unified data access path allows platform-wide performance improvements

•Secure service that does not execute arbitrary user code

‣Central location for all authorization checks using Sentry metadata.
Cloudera RecordService
info@rittmanmead.com www.rittmanmead.com @rittmanmead 87
Choosing a SQL-on-Hadoop Engine
Apache Hive: the original SQL-on-Hadoop engine, with maximum compatibility with Hadoop … but designed for batch processing
Apache Tez: a plug-in replacement for MapReduce that works via YARN and submitted jobs; speeds up Hive, but its long-term future is unclear
Impala & Drill: daemon-based MPP engines; Impala is more mature, Drill innovates around data discovery
Spark SQL: adds SQL access and set-based processing to Spark; useful for query federation
Oracle Big Data SQL / IBM Big SQL: vendor-provided RDBMS-Hadoop integration bridges
You don’t need to learn

Java, or MapReduce,

or Scala
or Toupee
SQL-on-Hadoop isn’t

just Hive
Check-out these Developer VMs:
http://www.cloudera.com/documentation/enterprise/5-3-x/topics/cloudera_quickstart_vm.html

http://hortonworks.com/products/sandbox/

http://www.oracle.com/technetwork/database/bigdata-appliance/oracle-bigdatalite-2104726.html
https://www.mapr.com/products/mapr-sandbox-hadoop/download-sandbox-drill
http://www.rittmanmead.com
info@rittmanmead.com www.rittmanmead.com @rittmanmead
Gluent New World #02: 

SQL-on-Hadoop with Mark Rittman
Mark Rittman, CTO, Rittman Mead
April 2016

GNW03: Stream Processing with Apache Kafka by Gwen Shapiragluent.
 
GNW01: In-Memory Processing for Databases
GNW01: In-Memory Processing for DatabasesGNW01: In-Memory Processing for Databases
GNW01: In-Memory Processing for DatabasesTanel Poder
 
Low Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling ExamplesLow Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling ExamplesTanel Poder
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerhdhappy001
 
Hive on spark berlin buzzwords
Hive on spark berlin buzzwordsHive on spark berlin buzzwords
Hive on spark berlin buzzwordsSzehon Ho
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparktrihug
 
How to scale relational (OLTP) databases. Think: Sharding @C16LV
How to scale relational (OLTP) databases. Think: Sharding @C16LVHow to scale relational (OLTP) databases. Think: Sharding @C16LV
How to scale relational (OLTP) databases. Think: Sharding @C16LVMaxym Kharchenko
 
Hive join optimizations
Hive join optimizationsHive join optimizations
Hive join optimizationsSzehon Ho
 
Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013alanfgates
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016alanfgates
 
Oracle Exadata Performance: Latest Improvements and Less Known Features
Oracle Exadata Performance: Latest Improvements and Less Known FeaturesOracle Exadata Performance: Latest Improvements and Less Known Features
Oracle Exadata Performance: Latest Improvements and Less Known FeaturesTanel Poder
 
Tanel Poder Oracle Scripts and Tools (2010)
Tanel Poder Oracle Scripts and Tools (2010)Tanel Poder Oracle Scripts and Tools (2010)
Tanel Poder Oracle Scripts and Tools (2010)Tanel Poder
 
Emrah METE - Oracle Cloud Day 2015 12c SQL New Features
Emrah METE - Oracle Cloud Day 2015 12c SQL New Features Emrah METE - Oracle Cloud Day 2015 12c SQL New Features
Emrah METE - Oracle Cloud Day 2015 12c SQL New Features Emrah METE
 
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2Tanel Poder
 
Hadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object StoresHadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object StoresSteve Loughran
 
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1Tanel Poder
 
Troubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contentionTroubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contentionTanel Poder
 
Modern Linux Performance Tools for Application Troubleshooting
Modern Linux Performance Tools for Application TroubleshootingModern Linux Performance Tools for Application Troubleshooting
Modern Linux Performance Tools for Application TroubleshootingTanel Poder
 

Viewers also liked (20)

GNW03: Stream Processing with Apache Kafka by Gwen Shapira
GNW03: Stream Processing with Apache Kafka by Gwen ShapiraGNW03: Stream Processing with Apache Kafka by Gwen Shapira
GNW03: Stream Processing with Apache Kafka by Gwen Shapira
 
GNW01: In-Memory Processing for Databases
GNW01: In-Memory Processing for DatabasesGNW01: In-Memory Processing for Databases
GNW01: In-Memory Processing for Databases
 
Low Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling ExamplesLow Level CPU Performance Profiling Examples
Low Level CPU Performance Profiling Examples
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stinger
 
Hive on spark berlin buzzwords
Hive on spark berlin buzzwordsHive on spark berlin buzzwords
Hive on spark berlin buzzwords
 
TriHUG Feb: Hive on spark
TriHUG Feb: Hive on sparkTriHUG Feb: Hive on spark
TriHUG Feb: Hive on spark
 
How to scale relational (OLTP) databases. Think: Sharding @C16LV
How to scale relational (OLTP) databases. Think: Sharding @C16LVHow to scale relational (OLTP) databases. Think: Sharding @C16LV
How to scale relational (OLTP) databases. Think: Sharding @C16LV
 
Hive join optimizations
Hive join optimizationsHive join optimizations
Hive join optimizations
 
Strata Stinger Talk October 2013
Strata Stinger Talk October 2013Strata Stinger Talk October 2013
Strata Stinger Talk October 2013
 
Oracle Cloud As Services
Oracle Cloud As ServicesOracle Cloud As Services
Oracle Cloud As Services
 
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
Hive2.0 sql speed-scale--hadoop-summit-dublin-apr-2016
 
Bigdata : Big picture
Bigdata : Big pictureBigdata : Big picture
Bigdata : Big picture
 
Oracle Exadata Performance: Latest Improvements and Less Known Features
Oracle Exadata Performance: Latest Improvements and Less Known FeaturesOracle Exadata Performance: Latest Improvements and Less Known Features
Oracle Exadata Performance: Latest Improvements and Less Known Features
 
Tanel Poder Oracle Scripts and Tools (2010)
Tanel Poder Oracle Scripts and Tools (2010)Tanel Poder Oracle Scripts and Tools (2010)
Tanel Poder Oracle Scripts and Tools (2010)
 
Emrah METE - Oracle Cloud Day 2015 12c SQL New Features
Emrah METE - Oracle Cloud Day 2015 12c SQL New Features Emrah METE - Oracle Cloud Day 2015 12c SQL New Features
Emrah METE - Oracle Cloud Day 2015 12c SQL New Features
 
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
 
Hadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object StoresHadoop, Hive, Spark and Object Stores
Hadoop, Hive, Spark and Object Stores
 
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
 
Troubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contentionTroubleshooting Complex Performance issues - Oracle SEG$ contention
Troubleshooting Complex Performance issues - Oracle SEG$ contention
 
Modern Linux Performance Tools for Application Troubleshooting
Modern Linux Performance Tools for Application TroubleshootingModern Linux Performance Tools for Application Troubleshooting
Modern Linux Performance Tools for Application Troubleshooting
 

Similar to Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the-Art, and Looking towards the Future

IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...Mark Rittman
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerMark Kromer
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impalamarkgrover
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelinprajods
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014cdmaxime
 
Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Thomas W. Dinsmore
 
Customer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDCCustomer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDCPrecisely
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Challenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on HadoopChallenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on HadoopDataWorks Summit
 
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data Architect
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data ArchitectHadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data Architect
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data ArchitectSoftServe
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonDremio Corporation
 
Pacemaker hadoop infrastructure and soft serve experience
Pacemaker   hadoop infrastructure and soft serve experiencePacemaker   hadoop infrastructure and soft serve experience
Pacemaker hadoop infrastructure and soft serve experienceVitaliy Bashun
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics PlatformN Masahiro
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Andrew Brust
 

Similar to Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the-Art, and Looking towards the Future (20)

IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
IlOUG Tech Days 2016 - Big Data for Oracle Developers - Towards Spark, Real-T...
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
Apache Spark - Las Vegas Big Data Meetup Dec 3rd 2014
 
Apache drill
Apache drillApache drill
Apache drill
 
Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)
 
Customer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDCCustomer Education Webcast: New Features in Data Integration and Streaming CDC
Customer Education Webcast: New Features in Data Integration and Streaming CDC
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Challenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on HadoopChallenges of Implementing an Advanced SQL Engine on Hadoop
Challenges of Implementing an Advanced SQL Engine on Hadoop
 
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data Architect
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data ArchitectHadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data Architect
Hadoop Infrastructure and SoftServe Experience by Vitaliy Bashun, Data Architect
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
 
Pacemaker hadoop infrastructure and soft serve experience
Pacemaker   hadoop infrastructure and soft serve experiencePacemaker   hadoop infrastructure and soft serve experience
Pacemaker hadoop infrastructure and soft serve experience
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Big Data Processing
Big Data ProcessingBig Data Processing
Big Data Processing
 
Big Data training
Big Data trainingBig Data training
Big Data training
 
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
Microsoft's Big Play for Big Data- Visual Studio Live! NY 2012
 

More from Mark Rittman

IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...Mark Rittman
 
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...Mark Rittman
 
Deploying Full BI Platforms to Oracle Cloud
Deploying Full BI Platforms to Oracle CloudDeploying Full BI Platforms to Oracle Cloud
Deploying Full BI Platforms to Oracle CloudMark Rittman
 
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...Mark Rittman
 
What is Big Data Discovery, and how it complements traditional business anal...
What is Big Data Discovery, and how it complements  traditional business anal...What is Big Data Discovery, and how it complements  traditional business anal...
What is Big Data Discovery, and how it complements traditional business anal...Mark Rittman
 
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015Mark Rittman
 
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...Mark Rittman
 
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...Mark Rittman
 
OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015
OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015
OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015Mark Rittman
 
BIWA2015 - Bringing Oracle Big Data SQL to OBIEE and ODI
BIWA2015 - Bringing Oracle Big Data SQL to OBIEE and ODIBIWA2015 - Bringing Oracle Big Data SQL to OBIEE and ODI
BIWA2015 - Bringing Oracle Big Data SQL to OBIEE and ODIMark Rittman
 
OGH 2015 - Hadoop (Oracle BDA) and Oracle Technologies on BI Projects
OGH 2015 - Hadoop (Oracle BDA) and Oracle Technologies on BI ProjectsOGH 2015 - Hadoop (Oracle BDA) and Oracle Technologies on BI Projects
OGH 2015 - Hadoop (Oracle BDA) and Oracle Technologies on BI ProjectsMark Rittman
 
UKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12c
UKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12cUKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12c
UKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12cMark Rittman
 
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...Mark Rittman
 
Part 4 - Hadoop Data Output and Reporting using OBIEE11g
Part 4 - Hadoop Data Output and Reporting using OBIEE11gPart 4 - Hadoop Data Output and Reporting using OBIEE11g
Part 4 - Hadoop Data Output and Reporting using OBIEE11gMark Rittman
 
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12cPart 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12cMark Rittman
 

More from Mark Rittman (15)

IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
 
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...
OBIEE12c and Embedded Essbase 12c - An Initial Look at Query Acceleration Use...
 
Deploying Full BI Platforms to Oracle Cloud
Deploying Full BI Platforms to Oracle CloudDeploying Full BI Platforms to Oracle Cloud
Deploying Full BI Platforms to Oracle Cloud
 
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...
Adding a Data Reservoir to your Oracle Data Warehouse for Customer 360-Degree...
 
What is Big Data Discovery, and how it complements traditional business anal...
What is Big Data Discovery, and how it complements  traditional business anal...What is Big Data Discovery, and how it complements  traditional business anal...
What is Big Data Discovery, and how it complements traditional business anal...
 
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015
Deploying Full Oracle BI Platforms to Oracle Cloud - OOW2015
 
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...
Delivering the Data Factory, Data Reservoir and a Scalable Oracle Big Data Ar...
 
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
 
OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015
OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015
OBIEE11g Seminar by Mark Rittman for OU Expert Summit, Dubai 2015
 
BIWA2015 - Bringing Oracle Big Data SQL to OBIEE and ODI
BIWA2015 - Bringing Oracle Big Data SQL to OBIEE and ODIBIWA2015 - Bringing Oracle Big Data SQL to OBIEE and ODI
BIWA2015 - Bringing Oracle Big Data SQL to OBIEE and ODI
 
OGH 2015 - Hadoop (Oracle BDA) and Oracle Technologies on BI Projects
OGH 2015 - Hadoop (Oracle BDA) and Oracle Technologies on BI ProjectsOGH 2015 - Hadoop (Oracle BDA) and Oracle Technologies on BI Projects
OGH 2015 - Hadoop (Oracle BDA) and Oracle Technologies on BI Projects
 
UKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12c
UKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12cUKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12c
UKOUG Tech'14 Super Sunday : Deep-Dive into Big Data ETL with ODI12c
 
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
Part 1 - Introduction to Hadoop and Big Data Technologies for Oracle BI & DW ...
 
Part 4 - Hadoop Data Output and Reporting using OBIEE11g
Part 4 - Hadoop Data Output and Reporting using OBIEE11gPart 4 - Hadoop Data Output and Reporting using OBIEE11g
Part 4 - Hadoop Data Output and Reporting using OBIEE11g
 
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12cPart 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
Part 2 - Hadoop Data Loading using Hadoop Tools and ODI12c
 

Recently uploaded

Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPramod Kumar Srivastava
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 

Recently uploaded (20)

Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 

Gluent New World #02 - SQL-on-Hadoop : A bit of History, Current State-of-the-Art, and Looking towards the Future

  • 8. info@rittmanmead.com www.rittmanmead.com @rittmanmead 8
 Combine with Traditional Data Warehouse
 •Combine with a traditional data warehouse to add storage and support for new datatypes
 •Land raw data in real-time into Hadoop, then process and store
  • 9. info@rittmanmead.com www.rittmanmead.com @rittmanmead 9
 Key Hadoop Platform Technologies
 •Hadoop is the overall framework for enabling low-cost, scalable cluster computing
 ‣HDFS cluster filesystem stores the data, in a process/query-neutral form (files)
 ‣YARN resource manager allocates resources to Hadoop jobs
 ‣MapReduce and other processing frameworks then work on that data
 •Data is decoupled from the engine that processes it
 •Layers can be swapped out (Mesos for YARN etc)
 •Hadoop takes care of the overall cluster framework
 [Diagram: HDFS and the YARN resource manager beneath query and processing engines - batch (MapReduce), in-memory (Spark), streaming (Spark, Storm) and graph + search (Solr, Giraph) - fed by unstructured/semi-structured log data, offloaded archive data, social graphs & networks, and smart meter & sensor data]
  • 11. but I’m a DBA
  • 13.
  • 14. I don’t know MapReduce
  • 16. or
  • 17. or whatever the latest made-up Hadoop language is
  • 18.
  • 19. info@rittmanmead.com www.rittmanmead.com @rittmanmead 19
 Introducing SQL-on-Hadoop
 •Hadoop is not a cheap substitute for enterprise DW platforms - don’t use it like this
 •But adding SQL processing and abstraction can help in many scenarios:
 ‣Query access to data stored in Hadoop as an archive
 ‣Aggregating, sorting and filtering data
 ‣Set-based transformation capabilities for other frameworks (e.g. Spark)
 ‣Ad-hoc analysis and data discovery in real time
 ‣Providing tabular abstractions over complex datatypes
 [Diagram: SQL engines added alongside the batch, in-memory, streaming and graph/search engines running over HDFS and YARN]
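 To make the idea of a tabular abstraction concrete before looking at Hive in detail, here is a minimal illustrative HiveQL sketch; the table name, columns and HDFS path are invented for the example rather than taken from the deck. An external table is overlaid on delimited log extracts already sitting in HDFS, which can then be queried with ordinary set-based SQL:
 create external table access_log_archive (
   request_ts string,
   url string,
   status int
 )
 row format delimited fields terminated by '\t'
 location '/archive/weblogs/';
 -- ordinary SQL over the raw files: which URLs return the most 404s?
 select url, count(*) as hits
 from access_log_archive
 where status = 404
 group by url;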
  • 20. info@rittmanmead.com www.rittmanmead.com @rittmanmead 20
 Enables Integration with External (And Internal) Data
 •Modern SQL-on-Hadoop engines often provide connectivity to data sources outside of the Hadoop cluster
 ‣Traditional DW platforms
 ‣NoSQL databases e.g. MongoDB
 ‣Files, JDBC etc
 •Provide a framework for data integration and data federation, using JDBC drivers
 [Diagram: a SQL engine on the Hadoop cluster federating queries across HDFS-resident data and an external NoSQL key-value store DB]
  • 21. info@rittmanmead.com www.rittmanmead.com @rittmanmead 21
 Platform for Traditional DW Integration with Hadoop
 •Most traditional data warehousing vendors offer a Hadoop integration option
 ‣Oracle Big Data SQL
 ‣IBM Big SQL etc
 •Leverage lower-level SQL-on-Hadoop metadata but use their own server process
 •Allows DBAs to write SQL using the RDBMS SQL dialect, run across relational, Hadoop and NoSQL servers
 [Diagram: an Oracle Big Data SQL Server process running against HDFS data and a NoSQL key-value store DB, alongside the Oracle RDBMS]
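 As a sketch of how this looks from the Oracle side, a Big Data SQL external table can be declared over an existing Hive table using the ORACLE_HIVE access driver, and then joined to relational tables in normal Oracle SQL. The object names below are invented for illustration, and the exact access parameters vary by release, so treat this as indicative rather than definitive:
 CREATE TABLE movielog (click VARCHAR2(4000))
 ORGANIZATION EXTERNAL (
   TYPE ORACLE_HIVE
   DEFAULT DIRECTORY default_dir
   ACCESS PARAMETERS (com.oracle.bigdata.tablename=default.movielog)
 )
 REJECT LIMIT UNLIMITED;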
  • 22. So how do they work?
  • 23. info@rittmanmead.com www.rittmanmead.com @rittmanmead 23
 Apache Hive : SQL Metadata + Engine over Hadoop
 •Original SQL-on-Hadoop engine developed at Facebook, now within the Hadoop project
 •Allows users to query Hadoop data using a SQL-like language
 •Tabular metadata layer that overlays files, can interpret semi-structured data (e.g. JSON)
 •Generates MapReduce code to return required data
 •Extensible through SerDes and Storage Handlers
 •JDBC and ODBC drivers for most platforms/tools
 •Perfect for set-based access + batch ETL work
 [Diagram: HiveQL SQL commands compiled by the Apache Hive SQL processing engine into Java JARs, submitted as MapReduce jobs via the YARN resource manager against HDFS]
  • 24. info@rittmanmead.com www.rittmanmead.com @rittmanmead 24
 Apache Hive Logical Architecture
 •Queries come in via JDBC/ODBC, the Hive Thrift Server, from the CLI or via Hue (for example)
 •The Hive Metastore (data dictionary) maps files and other Hadoop data structures onto tables and columns
 •The Hive SQL engine parses, plans and then executes the query, using an execution plan similar to Oracle, SQL Server and other RDBMS engines
 •MapReduce code is then auto-generated, submitted to YARN, and then run on the Hadoop cluster
 [Diagram: JDBC/ODBC, CLI and Hue clients feeding the Hive Thrift Server; parser, planner and execution engine backed by the Metastore; MapReduce jobs running against HDFS]
 Example: a HiveQL query submitted as a MapReduce job, with results returned:
 hive> select count(*) from src_customer;
 Total MapReduce jobs = 1
 Launching Job 1 out of 1
 Number of reduce tasks determined at compile time: 1
 In order to change the average load for a reducer (in bytes):
 set hive.exec.reducers.bytes.per.reducer=
 In order to limit the maximum number of reducers:
 set hive.exec.reducers.max=
 In order to set a constant number of reducers:
 set mapred.reduce.tasks=
 Starting Job = job_201303171815_0003, Tracking URL = http://localhost.localdomain:50030/jobdetails.jsp…
 Kill Command = /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=localhost.localdomain:8021 -kill job_201303171815_0003
 2013-04-17 04:06:59,867 Stage-1 map = 0%, reduce = 0%
 2013-04-17 04:07:03,926 Stage-1 map = 100%, reduce = 0%
 2013-04-17 04:07:14,040 Stage-1 map = 100%, reduce = 33%
 2013-04-17 04:07:15,049 Stage-1 map = 100%, reduce = 100%
 Ended Job = job_201303171815_0003
 OK
 25
 Time taken: 22.21 seconds
  • 25. info@rittmanmead.com www.rittmanmead.com @rittmanmead
 Provides a SQL Interface for BI + ETL Tools
 •Data integration tools such as Oracle Data Integrator can load and process Hadoop data
 •BI tools such as Oracle Business Intelligence 12c can report on Hadoop data
 •Generally use MapReduce and Hive to access data
 ‣ODBC and JDBC access to Hive tabular data
 ‣Allows Hadoop unstructured/semi-structured data on HDFS to be accessed like an RDBMS
 [Diagram: access Hive directly, or extract using ODI12c for structured OBIEE dashboard analysis, answering questions such as “What pages are people visiting?”, “Who is referring to us on Twitter?” and “What content has the most reach?”]
  • 26. info@rittmanmead.com www.rittmanmead.com @rittmanmead
 Connecting to Hive using Beeline CLI
 •From the command-line, either use the Hive CLI or the beeline CLI
 ‣HUE (“Hadoop User Experience”) provides a Web interface into Hive (think Oracle Apex)
 ‣SQL*Developer can also be added as a JDBC client
 [iot@cdh-node1 ~]$ beeline -u jdbc:hive2://cdh-node1:10000 -n iot -p welcome1 -d org.apache.hive.jdbc.HiveDriver
 Connecting to jdbc:hive2://cdh-node1:10000
 Connected to: Apache Hive (version 1.1.0-cdh5.5.1)
 Driver: Hive JDBC (version 1.1.0-cdh5.5.1)
 Transaction isolation: TRANSACTION_REPEATABLE_READ
 Beeline version 1.1.0-cdh5.5.1 by Apache Hive
 0: jdbc:hive2://cdh-node1:10000> show tables;
 +-----------------------------------+--+
 | tab_name |
 +-----------------------------------+--+
 | flight_delays |
 | my_second_table |
 | oracle_analytics_tweets |
 +-----------------------------------+--+
 8 rows selected (0.137 seconds)
 0: jdbc:hive2://cdh-node1:10000>
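 Once connected, ordinary HiveQL can be typed at the beeline prompt; for instance, a count against the flight_delays table listed above (illustrative only, with output omitted) is submitted to the cluster as a MapReduce job exactly as it would be from the Hive CLI:
 0: jdbc:hive2://cdh-node1:10000> select count(*) from flight_delays;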
  • 27. info@rittmanmead.com www.rittmanmead.com @rittmanmead
 Hive Extensibility - The “Swiss Army Knife” of Hadoop
 •Hive is extensible in four major ways that help with accessing and integrating new data sets
 ‣SerDes : Serializer-Deserializers that interpret semi-structured sources + make them tabular
 ‣UDFs + Hive Streaming : add user-defined functions and whole-row external processing
 ‣File Formats : make use of compressed and/or optimised file storage
 ‣Storage Handlers : use storage other than HDFS (e.g. MongoDB) as a data source
 [Diagram: clients connecting over JDBC/ODBC to the parser, planner and execution engine, with pluggable SerDes, UDF/UDAFs, scripts, file formats (TextFile, Parquet) and storage handlers (HBase, MongoDB) around the Metastore, MapReduce and HDFS]
  • 28. info@rittmanmead.com www.rittmanmead.com @rittmanmead
 Hive Extensibility through UDFs and UDAFs
 •Extend Hive by adding new computation and aggregation capabilities
 •UDFs (row-based), UDAFs (aggregation) and UDTFs (table functions)
 add jar target/JsonSplit-1.0-SNAPSHOT.jar;
 create temporary function json_split as 'com.pythian.hive.udf.JsonSplitUDF';
 create table json_example (json string);
 load data local inpath 'split_example.json' into table json_example;
 SELECT ex.* FROM json_example LATERAL VIEW explode(json_split(json_example.json)) ex;

 ADD JAR ./ext.jar;
 CREATE TEMPORARY FUNCTION process_names as 'com.matthewrathbone.example.NameParserGenericUDTF';
 SELECT adTable.name, adTable.surname FROM people LATERAL VIEW process_names(name) adTable as name, surname;
  • 29. info@rittmanmead.com www.rittmanmead.com @rittmanmead 29
 SerDes & Storage Handlers Further Decouple Storage
 •Allows data to be stored in an optimised storage format
 ‣Column-store for analytics
 ‣Self-describing, splittable storage for general-purpose use
 ‣Compressed data
 ‣Semi-structured (e.g. log) data
 [Diagram: SQL engines and MapReduce on HDFS reading through RegEx, Parquet and JSON SerDes, plus MongoDB storage handlers reaching out to NoSQL key-value store DBs]
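 As an illustration of the RegEx SerDe shown in the diagram, the sketch below overlays a Hive table directly onto raw Apache combined-format access logs. The HDFS path is invented, and the regular expression is the commonly used pattern for that log format, so adjust both to the actual data (note RegexSerDe requires all columns to be strings):
 CREATE EXTERNAL TABLE apache_access_log (
   host string,
   identity string,
   request_user string,
   request_time string,
   request string,
   status string,
   size string
 )
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
 WITH SERDEPROPERTIES (
   "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (\\[[^\\]]*\\]) (\"[^\"]*\") ([0-9]*) ([0-9]*)"
 )
 STORED AS TEXTFILE
 LOCATION '/user/iot/apache_logs/';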
  • 30. info@rittmanmead.com www.rittmanmead.com @rittmanmead 30
 File Formats in Hadoop Are Important
 •Splittability - can the file be split into blocks and processed in parallel?
 ‣CSV files can be split by file line; XML files can’t because of opening and closing tags
 •Ability to compress - CSV files can’t be block compressed, with an impact on space / performance
 •Support for schema evolution - does the file contain in-built schema information that self-describes the data?
 CSV Extract from Apple Health:
 2016-01-28T09:30:28Z,2016-01-28T11:56:24Z,145.933
 2016-01-29T00:19:35Z,2016-01-29T01:36:49Z,77.233
 2016-01-29T02:10:35Z,2016-01-29T02:32:18Z,21.717
 2016-01-29T03:08:07Z,2016-01-29T03:16:11Z,8.067
 2016-01-29T03:51:24Z,2016-01-29T06:57:44Z,186.333
 2016-01-29T07:05:50Z,2016-01-29T07:13:21Z,7.517
 2016-01-29T07:25:53Z,2016-01-29T07:30:23Z,4.5
 2016-01-29T23:30:00Z,2016-01-30T07:00:30Z,450.5
 2016-01-31T23:30:00Z,2016-02-01T07:30:00Z,480
 2016-02-02T00:35:54Z,2016-02-02T02:10:54Z,95
 ‣Human readable, splittable
 ‣No ability to block compress
 ‣No in-built self-describing metadata
 ‣Timestamps will need special processing
 ‣Store final data in Parquet format to address some of these concerns
 JSON Records from Twitter:
 {"entities": {"user_mentions": [], "media": [], "hashtags": [], "urls": []}, "text": "Off to visit our office in Bangalore in 15 mins. It'll be good to meet up with Venkat again, plus his team of Ram and Jay.", "created_at": "2010-09-01 00:00:00 +0000", "source": "<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>", "id_str": "22684302309", "geo": {}, "id": 22684302309, "user": {"verified": false, "name": "Mark Rittman", "profile_image_url_https": "https://pbs.twimg.com/profile_images/702537100890087425/rAlqgrGX_normal.jpg", "protected": false, "id_str": "14716125", "id": 14716125, "screen_name": "markrittman"}}
 ‣Human readable, splittable
 ‣No ability to block compress (+verbose)
 ‣In-built self-describing metadata
 ‣Less mature SerDe support
  • 31. info@rittmanmead.com www.rittmanmead.com @rittmanmead 31
 Specialised File Formats - Parquet and AVRO
 •Beginners usually store data in HDFS using text file formats (CSV), but these have limitations
 •Apache AVRO often used for general-purpose processing
 ‣Splittability, schema evolution, in-built metadata, support for block compression
 •Parquet now commonly used with Impala due to column-orientated storage
 ‣Mirrors work in the RDBMS world around column-store
 ‣Only return (project) the columns you require across a wide table
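 For reference, switching a dataset into either format is a one-line choice in the DDL. A minimal sketch, assuming an existing text-format staging table called events_csv (an invented name for illustration; STORED AS AVRO needs Hive 0.14 or later):
 create table events_avro stored as avro as select * from events_csv;
 create table events_parquet stored as parquet as select * from events_csv;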
  • 32. info@rittmanmead.com www.rittmanmead.com @rittmanmead 32
 Example HiveQL Commands to Create + Populate Table
 •Define a temporary Hive table that stores the start and end times/dates as strings, as we can’t do the string > timestamp conversion using the LOAD DATA command
 •Use the OpenCSVSerde so that we can specify delimiters, quote chars and escape chars for the file data
 •Store as a regular uncompressed, human-readable text file
 create table health_sleep_analysis_tmp (
   asleep_start string,
   asleep_end string,
   mins_asleep float)
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
 WITH SERDEPROPERTIES (
   "separatorChar" = ",",
   "quoteChar" = "'",
   "escapeChar" = ""
 )
 STORED AS TEXTFILE;
 •Load the data file into that temporary Hive table
 LOAD DATA INPATH '/user/iot/Health/apple_health_sleep_analysis_noheader.csv' OVERWRITE INTO TABLE health_sleep_analysis_tmp;
 •Now re-load that temporary data into more optimised Parquet-format files, suitable for ad-hoc analytic querying, converting the timestamps currently held in generic string datatype fields into more optimal TIMESTAMP datatypes using a Hive UDF
 create table health_sleep_analysis
 stored as parquet
 as
 select from_unixtime(unix_timestamp(asleep_start, "yyyy-MM-dd'T'hh:mm:ss'Z'")) asleep_start_ts,
        from_unixtime(unix_timestamp(asleep_end, "yyyy-MM-dd'T'hh:mm:ss'Z'")) asleep_end_ts,
        mins_asleep
 from health_sleep_analysis_tmp;
  • 33. info@rittmanmead.com www.rittmanmead.com @rittmanmead
 Use of Third-Party (Community) SerDe - JSONSerDe
 •One of several third-party SerDes available to download from Github
 •Created as an external Hive table, so it overlays a schema on an existing directory of files
 •Just select the JSON elements that we want from the overall schema in the JSON records
 •Note the use of STRUCT and ARRAY datatypes, used to handle arrays of hashtags, URLs etc in tweets
 CREATE EXTERNAL TABLE tweets(
   id string,
   created_at string,
   source string,
   favorited boolean,
   retweeted_status struct<text:string, user:struct<screen_name:string,name:string>, retweet_count:int>,
   entities struct<urls:array<struct<expanded_url:string>>,
                   user_mentions:array<struct<screen_name:string,name:string>>,
                   hashtags:array<struct<text:string>>>,
   text string,
   user struct<screen_name:string,name:string,friends_count:int,followers_count:int,
               statuses_count:int,verified:boolean,utc_offset:int,time_zone:string>,
   in_reply_to_screen_name string
 )
 ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
 STORED AS TEXTFILE
 LOCATION '/user/iot/tweets/';
  • 34. info@rittmanmead.com www.rittmanmead.com @rittmanmead 34
 Support for Nested (Array)-Type Structures
 •Hive SELECT statement against nested columns returns data as arrays
 •Can parse programmatically, or create further views or CTAS tables to split out the arrays
 hive> select entities, user from tweets
     > limit 3;
 OK
 {"urls":[{"expanded_url":"http://www.rittmanmead.com/biforum2013"}],"user_mentions":[],"hashtags":[]} {"screen_name":"markrittman","name":"Mark Rittman","friends_count":null,"followers_count":null,"statuses_count":null,"verified":false,"utc_offset":null,"time_zone":null}
 {"urls":[{"expanded_url":"http://www.bbc.co.uk/news/technology-22299503"}],"user_mentions":[],"hashtags":[]} {"screen_name":"markrittman","name":"Mark Rittman","friends_count":null,"followers_count":null,"statuses_count":null,"verified":false,"utc_offset":null,"time_zone":null}
 {"urls":[{"expanded_url":"http://pocket.co/seb2e"}],"user_mentions":[{"screen_name":"ArtOfBI","name":"Christian Screen"},{"screen_name":"wiseanalytics","name":"Lyndsay Wise"}],"hashtags":[]} {"screen_name":"markrittman","name":"Mark Rittman","friends_count":null,"followers_count":null,"statuses_count":null,"verified":false,"utc_offset":null,"time_zone":null}
 •How do you work with these values? “Denormalize” the arrays by selecting individual elements, creating a copy of the table in Parquet storage format:
 CREATE TABLE tweets_expanded stored as parquet AS
 select tweets.id,
        tweets.created_at,
        tweets.user.screen_name as user_screen_name,
        tweets.user.friends_count as user_friends_count,
        tweets.user.followers_count as user_followers_count,
        tweets.user.statuses_count as user_tweets_count,
        tweets.text,
        tweets.in_reply_to_screen_name,
        tweets.retweeted_status.user.screen_name as retweet_user_screen_name,
        tweets.retweeted_status.retweet_count as retweet_count,
        tweets.entities.urls[0].expanded_url as url1,
        tweets.entities.urls[1].expanded_url as url2,
        tweets.entities.hashtags[0].text as hashtag1,
        tweets.entities.hashtags[1].text as hashtag2,
        tweets.entities.hashtags[2].text as hashtag3,
        tweets.entities.hashtags[3].text as hashtag4
 from tweets;
 •… or create the same SELECT as a view, CREATE VIEW tweets_expanded_view AS with the identical column list (not all BI tools support views though)
• 35. info@rittmanmead.com www.rittmanmead.com @rittmanmead •Use HiveQL to create aggregations, select individual columns (JSON elements) from data •Use WHERE clause to limit data returned & ORDER BY to sort - as per normal SQL 35 Calculating Aggregations, Filtering Tweet Data select text, hashtag1, hashtag2 from tweets_expanded where hashtag1 = 'obiee'; Column selection only = just a MAP task select in_reply_to_screen_name, count(*) as total_replies_to from tweets_expanded group by in_reply_to_screen_name order by total_replies_to desc limit 10; Selection and aggregation = MAP and REDUCE tasks
  • 36. info@rittmanmead.com www.rittmanmead.com @rittmanmead •Hive MR jobs can have multiple stages •MapReduce Stages, Metastore operations •File Move / Rename etc Multi-Stage MapReduce Jobs SELECT LOWER(hashtags.text), COUNT(*) AS total_count FROM ( SELECT * FROM tweets WHERE regexp_extract(created_at,"(2015)*",1) = "2015" ) tweets LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags GROUP BY LOWER(hashtags.text) ORDER BY total_count DESC LIMIT 15 1 2
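To see how a HiveQL statement decomposes into stages before running it, you can prefix it with EXPLAIN — a minimal sketch using the hashtag query above:

EXPLAIN
SELECT LOWER(hashtags.text), COUNT(*) AS total_count
FROM (SELECT * FROM tweets
      WHERE regexp_extract(created_at,"(2015)*",1) = "2015") tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15;
-- the plan output lists each stage and its dependencies, which is how
-- the two MapReduce stages annotated above can be confirmed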
  • 37. info@rittmanmead.com www.rittmanmead.com @rittmanmead Multi-Step HiveQL Transforms - Tweet Sentiment create external table load_tweets(id string,text STRING) 
 ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe' 
 LOCATION '/user/iot/tweets'; create table split_words as 
 select id as id,split(text,' ') as words 
 from load_tweets; create table tweet_word as 
 select id as id,word 
 from split_words 
 LATERAL VIEW explode(words) w as word; create table dictionary
 (word string,rating int) 
 ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'; 
 select tweet_word.id,tweet_word.word,dictionary.rating 
 from tweet_word 
 LEFT OUTER JOIN dictionary 
 ON(tweet_word.word =dictionary.word); select t.text, r.rating from tweets_expanded t join (select id,AVG(rating) as rating 
 from word_join 
 GROUP BY word_join.id) r on t.id = r.id order by r.rating; LOAD DATA INPATH 'afinn.txt' 
 into TABLE dictionary; 1 2 3 4 5 6 7 Take all the text within a set of tweets, and explode-out all the words into a table, one row per word Load in a dictionary file that we’ll use to determine the sentiment of words in these tweets Join the words and the dictionary sentiment scores together, so every word used with any of the tweets has a sentiment score we can use Now average-out the sentiment scores for each word within a tweet, and return the tweet text and those averages listed in descending sentiment order
• 38. info@rittmanmead.com www.rittmanmead.com @rittmanmead 38 •Not all join types are available in Hive - joins must be equality joins (see the sketch below) •No sequences, no primary keys on tables •Generally need to stage Oracle or other external data into Hive before joining to it •Hive latency - not good for small microbatch-type work ‣But other alternatives exist - Spark, Impala etc •Don’t assume that HiveQL == Oracle SQL ‣Test assumptions before committing to platform •Hive is INSERT / APPEND only - no updates, deletes etc ‣But HBase may be suitable for CRUD-type loading SQL Considerations : Using Hive vs. Regular Oracle SQL
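As an illustration of the equality-join restriction, a range (non-equi) join that is natural in Oracle SQL has to be rewritten for older Hive versions. A minimal sketch, assuming hypothetical sales and rate_bands tables:

-- Oracle-style non-equi join, rejected by older Hive versions:
--   SELECT s.amount, r.band FROM sales s
--   JOIN rate_bands r ON s.amount BETWEEN r.low_val AND r.high_val;
-- Hive workaround: cross join, then filter in the WHERE clause
SELECT s.amount, r.band
FROM sales s CROSS JOIN rate_bands r
WHERE s.amount BETWEEN r.low_val AND r.high_val;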
• 39. info@rittmanmead.com www.rittmanmead.com @rittmanmead 39 •Based on BigTable paper from Google, 2006, Chang et al. ‣“Bigtable is a sparse, distributed, persistent multi-dimensional sorted map.” •Key Features: ‣Distributed storage across cluster of machines – Random, online read and write data access ‣Schemaless data model (“NoSQL”) ‣Self-managed data partitions •Why would you use it with Hive? ‣Allows you to do update and delete
 activity rather than just Hive append-only ‣Very fast for incremental loading ‣Can define Hive tables over HBase ones,
 allowing OBIEE to then access them What is HBase?
  • 40. info@rittmanmead.com www.rittmanmead.com @rittmanmead 40 •HBase Shell CLI allows you to create HBase tables •GET and PUT commands can then be used to add/update cells, query cells etc Creating HBase Tables using HBase Shell hbase shell create 'carriers','details' create 'geog_origin','origin' create 'geog_dest','dest' create 'flight_delays','dims','measures' put 'geog_dest','LAX','dest:airport_name','Los Angeles, CA: Los Angeles' put 'geog_dest','LAX','dest:city','Los Angeles, CA' put 'geog_dest','LAX','dest:state','California' put 'geog_dest','LAX','dest:id','12892' hbase(main):015:0> scan 'geog_dest' ROW                                    COLUMN+CELL
 LAX                                   column=dest:airport_name, timestamp=1432067861347, value=Los Angeles, CA: Los Angeles
LAX                                   column=dest:city, timestamp=1432067861375, value=Los Angeles, CA
LAX                                   column=dest:id, timestamp=1432067862018, value=12892
LAX                                   column=dest:state, timestamp=1432067861404, value=California
 1 row(s) in 0.0240 seconds
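Individual cells can be read back with GET as well — a short sketch against the geog_dest table created above:

get 'geog_dest','LAX'                             # all columns for row key LAX
get 'geog_dest','LAX',{COLUMN => 'dest:city'}     # a single column family:qualifier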
  • 41. info@rittmanmead.com www.rittmanmead.com @rittmanmead 41 •Direct extract from salesforce.com into HBase 
 using Python and add-in packages ‣Python packages extend functionality 
 by adding APIs, integration etc ‣Happybase, Beatbox and Pyhs2 packages 
installed along with Python •All free and open-source Programmatically Loading HBase Tables using Python

import pyhs2
import happybase

# connect to HBase via Thrift, open the target table and start a batch
connection = happybase.Connection('bigdatalite')
flight_delays_hbase_table = connection.table('test1_flight_delays')
b = flight_delays_hbase_table.batch(batch_size=10000)

# connect to HiveServer2 and pull the staged rows across into HBase
with pyhs2.connect(host='bigdatalite',
                   port=10000,
                   authMechanism="PLAIN",
                   user='oracle',
                   password='welcome1',
                   database='default') as conn:
    with conn.cursor() as cur:
        # Execute query
        cur.execute("select * from flight_delays_initial_load")
        # Fetch table results
        for i in cur.fetch():
            b.put(str(i[0]), {'dims:year': i[1],
                              'dims:carrier': i[2],
                              'dims:orig': i[3],
                              'dims:dest': i[4],
                              'measures:flights': i[5],
                              'measures:late': i[6],
                              'measures:cancelled': i[7],
                              'measures:distance': i[8]})
b.send()  # flush any remaining puts left in the batch
• 42. info@rittmanmead.com www.rittmanmead.com @rittmanmead 42 •Create Hive tables over the HBase ones to provide SQL load/query capabilities ‣Uses the HBaseStorageHandler storage handler for HBase ‣HBase columns mapped to Hive columns using SERDEPROPERTIES Create Hive Table Metadata over HBase Tables CREATE EXTERNAL TABLE hbase_flight_delays (key string, year string, carrier string, orig string, dest string, flights string, late string, cancelled string, distance string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,dims:year,dims:carrier,dims:orig,dims:dest,
 measures:flights,measures:late,measures:cancelled,measures:distance") TBLPROPERTIES ("hbase.table.name" = "test1_flight_delays");
  • 43. info@rittmanmead.com www.rittmanmead.com @rittmanmead 43 •Use HiveQL commands INSERT INTO TABLE … SELECT to load (merge) new data •Use HiveQL SELECT query to retrieve data from HBase table Load and Query HBase using HiveQL insert into table hbase_flight_delays               select * from flight_delays_initial_load;                Total jobs = 1 ... Total MapReduce CPU Time Spent: 11 seconds 870 msec OK Time taken: 40.301 seconds select count(*), min(cast(key as bigint)) as min_key, max(cast(key as bigint)) as max_key from hbase_flight_delays; Total jobs = 1 ... Total MapReduce CPU Time Spent: 14 seconds 660 msec OK 200000  1  200000 Time taken: 53.076 seconds, Fetched: 1 row(s)
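Because reads from HBase return the latest version of each cell, re-inserting rows whose keys already exist overwrites them — so loading through the Hive-on-HBase table behaves like a merge/upsert, which plain Hive tables can't do. A sketch, assuming a hypothetical flight_delays_updates staging table:

-- rows with existing keys are overwritten, rows with new keys are added
insert into table hbase_flight_delays
select * from flight_delays_updates;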
• 44. info@rittmanmead.com www.rittmanmead.com @rittmanmead 44 •But Parquet (and HDFS) have significant limitations for real-time analytics applications ‣Append-only orientation, focus on column-store 
 makes streaming ingestion harder •Cloudera Kudu aims to combine 
 best of HDFS + HBase ‣Real-time analytics-optimised ‣Supports updates to data ‣Fast ingestion of data ‣Accessed using SQL-style tables
 and get/put/update/delete API Cloudera Kudu - Combining Best of HBase and Column-Store
  • 45. info@rittmanmead.com www.rittmanmead.com @rittmanmead 45 •Kudu storage used with Impala - create tables using Kudu storage handler •Can now UPDATE, DELETE and INSERT into Hadoop tables, not just SELECT and LOAD DATA Example Impala DDL + DML Commands with Kudu CREATE TABLE `my_first_table` ( `id` BIGINT, `name` STRING ) TBLPROPERTIES( 'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler', 'kudu.table_name' = 'my_first_table', 'kudu.master_addresses' = 'kudu-master.example.com:7051', 'kudu.key_columns' = 'id' ); INSERT INTO my_first_table VALUES (99, "sarah"); INSERT IGNORE INTO my_first_table VALUES (99, "sarah"); UPDATE my_first_table SET name="bob" where id = 3; DELETE FROM my_first_table WHERE id < 3; DELETE c FROM my_second_table c, stock_symbols s WHERE c.name = s.symbol;
• 50. info@rittmanmead.com www.rittmanmead.com @rittmanmead 50 •MapReduce’s great innovation was to break processing down into distributed jobs •Jobs that have no functional dependency on each other, only on upstream tasks •Provides a framework that is massively scalable and very fault-tolerant •Hadoop handled job scheduling and resource management ‣All MapReduce code had to do was provide the “map” and “reduce” functions ‣Automatic distributed processing ‣Slow but extremely powerful Hadoop 1.0 and MapReduce
  • 51. info@rittmanmead.com www.rittmanmead.com @rittmanmead 51 •A typical Hive or Pig script compiles down into multiple MapReduce jobs •Each job stages its intermediate results to disk •Safe, but slow - write to disk, spin-up separate JVMs for each job MapReduce - Scales By Writing Intermediate Results to Disk SELECT LOWER(hashtags.text), COUNT(*) AS total_count FROM ( SELECT * FROM tweets WHERE regexp_extract(created_at,"(2015)*",1) = "2015" ) tweets LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags GROUP BY LOWER(hashtags.text) ORDER BY total_count DESC LIMIT 15 MapReduce Jobs Launched: Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 5.34 sec HDFS Read: 10952994 HDFS Write: 5239 SUCCESS Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 2.1 sec HDFS Read: 9983 HDFS Write: 164 SUCCESS Total MapReduce CPU Time Spent: 7 seconds 440 msec OK 1 2
• 52. info@rittmanmead.com www.rittmanmead.com @rittmanmead 52 •MapReduce 2 (MR2) splits the functionality of the JobTracker by separating resource management and job scheduling/monitoring •Introduces YARN (Yet Another Resource Negotiator) •Permits processing frameworks other than MapReduce to run on the cluster ‣For example, Apache Spark •Maintains backwards compatibility with MR1 •Introduced with CDH5+ MapReduce 2 and YARN [Diagram: a central Resource Manager coordinating Node Managers on each cluster node, with clients submitting applications]
• 53. info@rittmanmead.com www.rittmanmead.com @rittmanmead 53 •Runs on top of YARN, provides a faster execution engine than MapReduce for Hive, Pig etc •Models processing as an entire data flow graph (DAG), rather than separate job steps ‣DAG (Directed Acyclic Graph) is a new programming style for distributed systems ‣Dataflow steps pass data between them as streams, rather than writing/reading from disk •Supports in-memory computation, enables Hive on Tez (Stinger) and Pig on Tez •The in-memory / Hive v2 route favoured by Hortonworks Apache Tez [Diagram: a single Tez DAG of Map() and Reduce() stages streaming from input to output, replacing separate MapReduce jobs]
  • 54. info@rittmanmead.com www.rittmanmead.com @rittmanmead 54 Tez Advantage - Drop-In Replacement for MR with Hive, Pig set hive.execution.engine=mr set hive.execution.engine=tez 4m 17s 2m 25s
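The engine setting is per-session, so it is worth confirming which engine is active before timing a query; in the Hive CLI, set with no value prints the current setting:

set hive.execution.engine;
-- hive.execution.engine=tez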
  • 56. info@rittmanmead.com www.rittmanmead.com @rittmanmead 56 •Cloudera’s answer to Hive query response time issues •MPP SQL query engine running on Hadoop, bypasses MapReduce for direct data access •Mostly in-memory, but spills to disk if required •Uses Hive metastore to access Hive table metadata •Similar SQL dialect to Hive - not as rich though and no support for Hive SerDes, storage handlers etc Cloudera Impala - Fast, MPP-style Access to Hadoop Data
• 57. info@rittmanmead.com www.rittmanmead.com @rittmanmead 57 How Impala Works [Architecture diagram: an Impala Daemon co-located with the HDFS DataNode on each cluster node; SQL applications connect via ODBC/JDBC, the Hive MetaStore provides table metadata, and the Impala StateStore tracks daemon health] •Cloudera-based solution for ad-hoc SQL-on-Hadoop •MPP SQL query engine running on Hadoop, with daemons running on each Hadoop node •In contrast to jobs being submitted via YARN •Mostly in-memory, but spills to disk if required •Uses Hive metastore to access Hive table metadata •Similar SQL dialect to Hive - not as rich though and no support for Hive SerDes, storage handlers etc
• 58. info@rittmanmead.com www.rittmanmead.com @rittmanmead 58 •Log into Impala Shell, run INVALIDATE METADATA command to refresh Impala table list •Run SHOW TABLES Impala SQL command to view tables available •Run COUNT(*) on main ACCESS_PER_POST table to see typical response time Enabling Hive Tables for Impala [oracle@bigdatalite ~]$ impala-shell Starting Impala Shell without Kerberos authentication [bigdatalite.localdomain:21000] > invalidate metadata; Query: invalidate metadata Fetched 0 row(s) in 2.18s [bigdatalite.localdomain:21000] > show tables; Query: show tables +-----------------------------------+ | name | +-----------------------------------+ | access_per_post | | access_per_post_cat_author | | … | | posts | +-----------------------------------+ Fetched 45 row(s) in 0.15s [bigdatalite.localdomain:21000] > select count(*) 
 from access_per_post; Query: select count(*) from access_per_post +----------+ | count(*) | +----------+ | 343 | +----------+ Fetched 1 row(s) in 2.76s
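INVALIDATE METADATA with no arguments reloads the whole catalog, which gets expensive on large metastores; when only one table has changed, Impala's narrower statements are usually enough — a short sketch:

-- pick up new data files added to an existing table
REFRESH access_per_post;
-- reload metadata for just one table after a schema change
INVALIDATE METADATA access_per_post;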
• 59. info@rittmanmead.com www.rittmanmead.com @rittmanmead 59 •Significant improvement over Hive response time •Now makes Hadoop suitable for ad-hoc querying Significantly-Improved Ad-Hoc Query Response Time vs Hive Logical Query Summary Stats: Elapsed time 2, Response time 1, Compilation time 0 (seconds) Logical Query Summary Stats: Elapsed time 50, Response time 49, Compilation time 0 (seconds) Simple Two-Table Join against Hive Data Only Simple Two-Table Join against Impala Data Only vs
  • 60. What about integration
 with my Data Warehouse?
• 61. info@rittmanmead.com www.rittmanmead.com @rittmanmead 61 •Most traditional data warehousing vendors offer a Hadoop integration option •Oracle Big Data SQL •IBM Big SQL etc •Leverage lower-level SQL-on-Hadoop metadata but use own server process •Allows DBAs to write SQL using RDBMS SQL dialect, run across relational, Hadoop and NoSQL servers Platform for Traditional DW Integration with Hadoop [Diagram: the Oracle RDBMS and an Oracle Big Data SQL Server sitting over the Hadoop Distributed Filesystem (HDFS), YARN Resource Manager and other query and processing engines, covering unstructured/semi-structured log data, offloaded archive data, social graphs and networks, smart meter and sensor data, and a NoSQL key-value store DB]
• 62. info@rittmanmead.com www.rittmanmead.com @rittmanmead 62 •Originally part of Oracle Big Data 4.0 (BDA-only) but now available for commodity Hadoop installs ‣Also requires Oracle Database 12c (no longer dependent on Exadata from Big Data SQL 4.0) ‣Extends Oracle Data Dictionary to cover Hive •Extends Oracle SQL and SmartScan to Hadoop •Extends Oracle Security Model over Hadoop ‣Fine-grained access control ‣Data redaction, data masking ‣Uses fast C-based readers where possible (vs. Hive MapReduce generation) Oracle Big Data SQL [Diagram: SQL queries spanning an Exadata Database Server, Exadata Storage Servers and a Hadoop cluster, with SmartScan applied on both the storage servers and the Hadoop nodes]
• 63. info@rittmanmead.com www.rittmanmead.com @rittmanmead 63 •Oracle Database 12c 12.1.0.2.0 with Big Data SQL option can view Hive table metadata ‣Linked by Exadata configuration steps to one or more BDA clusters •DBA_HIVE_TABLES and USER_HIVE_TABLES expose Hive metadata •Oracle SQL*Developer 4.0.3, with Cloudera Hive drivers, can connect to Hive metastore View Hive Table Metadata in the Oracle Data Dictionary SQL> col database_name for a30 SQL> col table_name for a30 SQL> select database_name, table_name 2 from dba_hive_tables; DATABASE_NAME TABLE_NAME ------------------------------ ------------------------------ default access_per_post default access_per_post_categories default access_per_post_full default apachelog default categories default countries default cust default hive_raw_apache_access_log
  • 64. info@rittmanmead.com www.rittmanmead.com @rittmanmead 64 •Big Data SQL accesses Hive tables through external table mechanism ‣ORACLE_HIVE external table type imports Hive metastore metadata ‣ORACLE_HDFS requires metadata to be specified •Access parameters cluster and tablename specify Hive table source and BDA cluster Hive Access through Oracle External Tables + Hive Driver CREATE TABLE access_per_post_categories( hostname varchar2(100), request_date varchar2(100), post_id varchar2(10), title varchar2(200), author varchar2(100), category varchar2(100), ip_integer number) organization external (type oracle_hive default directory default_dir access parameters(com.oracle.bigdata.tablename=default.access_per_post_categories));
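Once defined, the ORACLE_HIVE external table can be joined to ordinary Oracle tables in a single statement — a minimal sketch, assuming a hypothetical CUSTOMERS table keyed on hostname in the same Oracle schema:

-- Hive-resident weblog data joined to relational customer data;
-- filtering work is offloaded to the Hadoop cluster where possible
SELECT c.cust_name, COUNT(*) AS page_views
FROM   access_per_post_categories a
JOIN   customers c ON a.hostname = c.hostname
GROUP  BY c.cust_name
ORDER  BY page_views DESC;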
  • 65. info@rittmanmead.com www.rittmanmead.com @rittmanmead 65 •Brings query-offloading features similar to Exadata
 to Oracle Big Data Appliance •Query across both Oracle and Hadoop sources •Intelligent query optimisation applies SmartScan
 close to ALL data •Use same SQL dialect across both sources •Apply same security rules, policies, 
 user access rights across both sources Extending SmartScan, and Oracle SQL, Across All Data
  • 68. “where we’re going
 we don’t need roads”
• 70. info@rittmanmead.com www.rittmanmead.com @rittmanmead •Apache Drill is another SQL-on-Hadoop project that focuses on schema-free data discovery •Inspired by Google Dremel, its innovation is querying raw data with the schema optional •Automatically infers and detects schema from semi-structured datasets and NoSQL DBs •Join across different silos of data e.g. JSON records, Hive tables and HBase databases •Aimed at different use-cases than Hive - 
 low-latency queries, discovery 
 (think Endeca vs OBIEE) Introducing Apache Drill - “We Don’t Need No Roads”
  • 71. info@rittmanmead.com www.rittmanmead.com @rittmanmead •Most modern datasource formats embed their schema in the data (“schema-on-read”) •Apache Drill makes these as easy to join to traditional datasets as “point me at the data” •Cuts out unnecessary work in defining Hive schemas for data that’s self-describing •Supports joining across files,
 databases, NoSQL etc Self-Describing Data - Parquet, AVRO, JSON etc
  • 72. info@rittmanmead.com www.rittmanmead.com @rittmanmead •Files can exist either on the local filesystem, or on HDFS •Connection to directory or file defined in storage configuration •Can work with CSV, TXT, TSV etc •First row of file can provide schema (column names) Apache Drill and Text Files SELECT * FROM dfs.`/tmp/csv_with_header.csv2`; +-------+------+------+------+ | name | num1 | num2 | num3 | +-------+------+------+------+ | hello | 1 | 2 | 3 | | hello | 1 | 2 | 3 | | hello | 1 | 2 | 3 | | hello | 1 | 2 | 3 | | hello | 1 | 2 | 3 | | hello | 1 | 2 | 3 | | hello | 1 | 2 | 3 | +-------+------+------+------+ 7 rows selected (0.12 seconds) SELECT * FROM dfs.`/tmp/csv_no_header.csv`; +------------------------+ | columns | +------------------------+ | ["hello","1","2","3"] | | ["hello","1","2","3"] | | ["hello","1","2","3"] | | ["hello","1","2","3"] | | ["hello","1","2","3"] | | ["hello","1","2","3"] | | ["hello","1","2","3"] | +------------------------+ 7 rows selected (0.112 seconds)
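For header-less files, the columns array shown above can be indexed directly to project and type individual fields — a small sketch against the same file:

SELECT columns[0] AS name,
       CAST(columns[1] AS INT) AS num1
FROM dfs.`/tmp/csv_no_header.csv`;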
  • 73. info@rittmanmead.com www.rittmanmead.com @rittmanmead •JSON (Javascript Object Notation) documents are often used for data interchange •Exports from Twitter and other consumer services •Web service responses and other B2B interfaces •A more lightweight form of XML that is “self- describing” •Handles evolving schemas, and optional attributes •Drill treats each document as a row, and has features to •Flatten nested data (extract elements from arrays) •Generate key/value pairs for loosely structured data Apache Drill and JSON Documents use dfs.iot; show files; select in_reply_to_user_id, text from `all_tweets.json` limit 5; +---------------------+------+ | in_reply_to_user_id | text | +---------------------+------+ | null | BI Forum 2013 in Brighton has now sold-out | | null | "Football has become a numbers game | | null | Just bought Lyndsay Wise’s Book | | null | An Oracle BI "Blast from the Past" | | 14716125 | Dilbert on Agile Programming | +---------------------+------+ 5 rows selected (0.229 seconds) select name, flatten(fillings) as f 
 from dfs.users.`/donuts.json` 
 where f.cal < 300;
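The slide also mentions generating key/value pairs for loosely structured data; Drill's KVGEN function does this, turning a map's fields into an array of key/value records that FLATTEN can then expand into rows — a sketch, assuming a hypothetical maps.json with an attributes map per document:

SELECT FLATTEN(KVGEN(attributes)) AS pair
FROM dfs.`/tmp/maps.json`;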
  • 74. info@rittmanmead.com www.rittmanmead.com @rittmanmead •Drill can connect to Hive to make use of metastore (incl. multiple Hive metastores) •NoSQL databases (HBase etc) •Parquet files (native storage format - columnar + self describing) Apache Drill and Hive, HBase, Parquet Sources etc USE hbase; SELECT * FROM students; +-------------+-----------------------+-----------------------------------------------------+ | row_key | account | address | +-------------+-----------------------+------------------------------------------------------+ | [B@e6d9eb7 | {"name":"QWxpY2U="} | {"state":"Q0E=","street":"MTIzIEJhbGxtZXIgQXY="} | | [B@2823a2b4 | {"name":"Qm9i"} | {"state":"Q0E=","street":"MSBJbmZpbml0ZSBMb29w"} | | [B@3b8eec02 | {"name":"RnJhbms="} | {"state":"Q0E=","street":"NDM1IFdhbGtlciBDdA=="} | | [B@242895da | {"name":"TWFyeQ=="} | {"state":"Q0E=","street":"NTYgU291dGhlcm4gUGt3eQ=="} | +-------------+-----------------------+----------------------------------------------------------------------+ SELECT firstname,lastname FROM 
hiveremote.`customers` limit 10;
 +------------+------------+ | firstname | lastname | +------------+------------+ | Essie | Vaill | | Cruz | Roudabush | | Billie | Tinnes | | Zackary | Mockus | | Rosemarie | Fifield | | Bernard | Laboy | | Marianne | Earman | +------------+------------+ SELECT * FROM dfs.`iot_demo/geodata/region.parquet`; +--------------+--------------+-----------------------+ | R_REGIONKEY | R_NAME | R_COMMENT | +--------------+--------------+-----------------------+ | 0 | AFRICA | lar deposits. blithe | | 1 | AMERICA | hs use ironic, even | | 2 | ASIA | ges. thinly even pin | | 3 | EUROPE | ly final courts cajo | | 4 | MIDDLE EAST | uickly special accou | +--------------+--------------+-----------------------+
• 75. info@rittmanmead.com www.rittmanmead.com @rittmanmead •Drill developed for real-time, ad-hoc data exploration with schema discovery on-the-fly •Individual analysts exploring new datasets, leveraging corporate metadata/data to help •Hive is more about large-scale, centrally curated set-based big data access •Drill models data conceptually as JSON, vs. Hive’s tabular approach •Drill introspects the schema from whatever it connects to, vs. formal modeling in Hive Apache Drill vs. Apache Hive [Spectrum diagram: Interactive Queries (Data Discovery, Tableau/VA), 100ms - 3mins — served by Apache Drill; Reporting Queries (Canned Reports, OBIEE), 3mins - 20mins, and ETL & Batch Queries (ODI, Scripting, Informatica), 20mins - hours — served by Apache Hive]
  • 77. what’s all this about “Spark”?
• 78. info@rittmanmead.com www.rittmanmead.com @rittmanmead 78 •Another DAG execution engine running on YARN •More mature than Tez, with a richer API and more vendor support •Uses concept of an RDD (Resilient Distributed Dataset) ‣RDDs are like tables or Pig relations, but can be cached in-memory ‣Great for in-memory transformations, or iterative/cyclic processes •Spark jobs comprise a DAG of tasks operating on RDDs •Access through Scala, Python or Java APIs •Related projects include ‣Spark SQL ‣Spark Streaming Apache Spark
  • 79. info@rittmanmead.com www.rittmanmead.com @rittmanmead 79 •Native support for multiple languages 
 with identical APIs ‣Python - prototyping, data wrangling ‣Scala - functional programming features ‣Java - lower-level, application integration •Use of closures, iterations, and other 
 common language constructs to minimize code •Integrated support for distributed +
 functional programming •Unified API for batch and streaming Rich Developer Support + Wide Developer Ecosystem scala> val logfile = sc.textFile("logs/access_log") 14/05/12 21:18:59 INFO MemoryStore: ensureFreeSpace(77353) 
 called with curMem=234759, maxMem=309225062 14/05/12 21:18:59 INFO MemoryStore: Block broadcast_2 
 stored as values to memory (estimated size 75.5 KB, free 294.6 MB) logfile: org.apache.spark.rdd.RDD[String] = 
 MappedRDD[31] at textFile at <console>:15 scala> logfile.count() 14/05/12 21:19:06 INFO FileInputFormat: Total input paths to process : 1 14/05/12 21:19:06 INFO SparkContext: Starting job: count at <console>:1 ... 14/05/12 21:19:06 INFO SparkContext: Job finished: 
 count at <console>:18, took 0.192536694 s res7: Long = 154563 scala> val logfile = sc.textFile("logs/access_log").cache scala> val biapps11g = logfile.filter(line => line.contains("/biapps11g/")) biapps11g: org.apache.spark.rdd.RDD[String] = FilteredRDD[34] at filter at <console>:17 scala> biapps11g.count() ... 14/05/12 21:28:28 INFO SparkContext: Job finished: count at <console>:20, took 0.387960876 s res9: Long = 403
• 80. info@rittmanmead.com www.rittmanmead.com @rittmanmead 80 •Spark SQL, and Data Frames, allow RDDs in Spark to be processed using SQL queries •Bring in and federate additional data from JDBC sources •Load, read and save data in Hive, Parquet and other structured tabular formats Spark SQL - Adding SQL Processing to Apache Spark val accessLogsFilteredDF = accessLogs .filter( r => ! r.agent.matches(".*(spider|robot|bot|slurp).*")) .filter( r => ! r.endpoint.matches(".*(wp-content|wp-admin).*")).toDF() accessLogsFilteredDF.registerTempTable("accessLogsFiltered") val topTenPostsLast24Hour = sqlContext.sql("SELECT p.POST_TITLE, p.POST_AUTHOR, COUNT(*) 
 as total 
 FROM accessLogsFiltered a 
 JOIN posts p ON a.endpoint = p.POST_SLUG 
 GROUP BY p.POST_TITLE, p.POST_AUTHOR 
 ORDER BY total DESC LIMIT 10 ") // Persist top ten table for this window to HDFS as parquet file topTenPostsLast24Hour.save("/user/oracle/rm_logs_batch_output/topTenPostsLast24Hour.parquet"
 , "parquet", SaveMode.Overwrite)
  • 81. Hadoop is insecure and has fragmented security …doesn’t it? but …
  • 82. info@rittmanmead.com www.rittmanmead.com @rittmanmead 82 Consistent Security and Audit Now Emerging on Platform
• 83. info@rittmanmead.com www.rittmanmead.com @rittmanmead 83 •Clusters by default are unsecured (vulnerable to account spoofing) & need Kerberos enabled •Data access controlled by POSIX-style permissions on HDFS files •Hive and Impala can use Apache Sentry RBAC ‣Result is data duplication and complexity ‣No consistent API or abstracted security model Hadoop Security Initially Was a Mess [Diagram: per-user scratchpad directories (/user/mrittman/scratchpad, /user/ryeardley/scratchpad, /user/mpatel/scratchpad) and shared data directories (/data/rm_website_analysis/logfiles/incoming and archive, /data/rm_website_analysis/tweets/incoming and archive), each secured separately by HDFS file permissions]
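Sentry's RBAC is administered with SQL-style statements from Hive or Impala rather than a separate API — a minimal sketch, assuming a hypothetical analysts OS/LDAP group:

CREATE ROLE analyst_role;
GRANT ROLE analyst_role TO GROUP analysts;
-- table-level, read-only access instead of raw HDFS permissions
GRANT SELECT ON TABLE default.tweets_expanded TO ROLE analyst_role;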
  • 84. info@rittmanmead.com www.rittmanmead.com @rittmanmead 84 •Use standard Oracle Security over Hadoop & NoSQL ‣Grant & Revoke Privileges ‣Redact Data ‣Apply Virtual Private Database ‣Provides Fine-grain Access Control •Great solution to extend existing Oracle
 security model over Hadoop datasets Oracle Big Data SQL : Extend Oracle Security to Hadoop Redacted data subset SQL JSON Customer data in Oracle DB DBMS_REDACT.ADD_POLICY( object_schema => 'txadp_hive_01', object_name => 'customer_address_ext', column_name => 'ca_street_name', policy_name => 'customer_address_redaction', function_type => DBMS_REDACT.RANDOM, expression => 'SYS_CONTEXT(''SYS_SESSION_ROLES'', 
 ''REDACTION_TESTER'')=''TRUE''' );
• 85. info@rittmanmead.com www.rittmanmead.com @rittmanmead 85 •Provides a higher-level, logical abstraction for data (i.e. tables or views) ‣Can be used with Spark & Spark SQL, with predicate pushdown, projection •Returns schemed objects (instead of paths and bytes) in similar way to HCatalog •Unified data access path allows platform-wide performance improvements •Secure service that does not execute arbitrary user code ‣Central location for all authorization checks using Sentry metadata. Cloudera RecordService
• 87. info@rittmanmead.com www.rittmanmead.com @rittmanmead 87 Choosing a SQL-on-Hadoop Engine •Apache Hive - the original SQL-on-Hadoop engine; maximum compatibility with Hadoop… but designed for batch processing •Apache Tez - plug-in replacement for MapReduce; works via YARN and submitting jobs; speeds-up Hive but long-term future? •Cloudera Impala and Apache Drill - daemon-based MPP engines; Impala is more mature, Drill innovates around data-discovery •Spark SQL - adds SQL access and set-based processing to Spark; useful for query federation •Oracle Big Data SQL, IBM Big SQL etc - vendor-provided RDBMS-Hadoop integration bridges
  • 88. You don’t need to learn
 Java, or MapReduce,
 or Scala
  • 91. Check-out these Developer VMs: http://www.cloudera.com/documentation/enterprise/5-3-x/topics/ cloudera_quickstart_vm.html
 
 http://hortonworks.com/products/sandbox/ http://www.oracle.com/technetwork/database/bigdata-appliance/oracle- bigdatalite-2104726.html https://www.mapr.com/products/mapr-sandbox-hadoop/download-sandbox-drill
  • 93. info@rittmanmead.com www.rittmanmead.com @rittmanmead Gluent New World #02: 
 SQL-on-Hadoop with Mark Rittman Mark Rittman, CTO, Rittman Mead April 2016