Hadoop databases for oracle DBAs

Hadoop databases: Hive, Impala, Spark, Presto
For ORACLE DBAs
Session ID: 557
Prepared by: Maxym Kharchenko, Gluent
@maxymkh

April 2-6, 2017 in Las Vegas, NV USA #C17LV
Whoami
• Database kernel developer -> ORACLE DBA -> Database Hadoop/Cloud developer
• Worked with ORACLE for the last 15 years
• OCM, ORACLE Ace alumni, Amazon alumni
• Last year: OLTP -> Hadoop

Shameless plug about my company
[Diagram: Gluent, between applications (App X, App Y, App Z) and data sources (Oracle, Teradata, MSSQL, NoSQL, Big Data Sources)]

Agenda
• What are Hadoop databases?
• Hive/Impala/Spark vs. ORACLE (hopefully, demo)
• Best ways to start

What is Hadoop:
• For “Big data”
• Can deal with “Unstructured” data
• Distributed
• Consists of: HDFS + MapReduce
• Requires you to write MapReduce jobs, NoSql
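To make "write MapReduce jobs" concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases in Python. It only illustrates the programming model: real Hadoop jobs are distributed across many nodes, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample input are hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    """Emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)

def run_job(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

# Made-up input standing in for crawler data.
crawl = ["hadoop stores data", "hadoop processes data"]
word_counts = run_job(crawl)
print(word_counts)
```

Even a simple count needs three pieces of code here, which is why the later slides treat SQL on top of Hadoop as a big step for ease of use.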

Yes, but what does it all mean ?

Imagine that you are Google in the early 2000s

Target Ads
• You need to query web crawler data
• Which is unbelievably huge
• These queries need to be:
  • (reasonably) Fast
  • (reasonably) Cheap
  • (reasonably) Easy to use

Let’s build a Data Warehouse

(traditional) Data warehouse
• Has been around for years
• Mature and (relatively) advanced
• SQL !!!

Data Warehouse scorecard
Requirements RDBMS
(reasonably) Fast   
(reasonably) Cheap   
(reasonably) Easy to use   
Able to process data ¯_(ツ)_/¯

Scaling up “Big data” ain’t cheap
• Can’t fit all of the data on a single box
• Cost is quickly getting out of hand

(cheap) Commodity systems make “big data” feasible

Solution = commodity systems
=
$$$$$ $$

Commodity systems scorecard
Requirements Commodity
(reasonably) Fast   
(reasonably) Cheap   
(reasonably) Easy to use   
Able to process data   

All your queries are Java Classes

Google
• 2003: Google File System (GFS) paper
• 2004: Google MapReduce (MR) paper

Hadoop
• 2006: Hadoop

”Traditional Data Warehouse” vs. Hadoop
Requirements Hadoop Data Warehouse
(reasonably) Fast      
(reasonably) Cheap      
(reasonably) Easy to use      
Able to process data    ¯_(ツ)_/¯

SQL on Hadoop - Hive
• 2010: Facebook releases Apache Hive
• SQL on Hadoop !

Another SQL on Hadoop - Impala
• 2012: Cloudera announces Impala
• Faster SQL on Hadoop !

And then, it exploded …

“Hadoop” vs “Relational” databases
Demo … hopefully

This is not about NoSql :-)

Same: Running SQL queries
sql> select prod_id, count(1)
from sh.sales s, sh.channels c
where c.channel_id = s.channel_id
and c.channel_desc='Catalog'
group by prod_id
order by 2 desc
limit 5;
+------------------------+----------+
| prod_id                | count(1) |
+------------------------+----------+
| 43.000000000000000000  | 5182     |
| 46.000000000000000000  | 5165     |
| 22.000000000000000000  | 5162     |
| 123.000000000000000000 | 5152     |
| 32.000000000000000000  | 5145     |
+------------------------+----------+
Fetched 5 row(s) in 3.26s

Different: What gets optimized
• No “regular” indexes
• But many operations are distributed
[Diagram: SALES and TIMES tables split into three parts: SALES 1 / TIMES 1, SALES 2 / TIMES 2, SALES 3 / TIMES 3]
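The idea behind distributed operations can be sketched as a two-step aggregation: each node counts rows in its own slice of SALES, and a coordinator merges the partial results. This is a minimal Python sketch with three hypothetical slices and made-up rows; it does not show how any particular engine implements this.

```python
from collections import Counter

# Hypothetical slices of SALES, one per node; each row is (prod_id, channel_id).
sales_slices = [
    [(43, 1), (46, 1), (43, 2)],   # SALES 1
    [(43, 1), (22, 1)],            # SALES 2
    [(46, 1), (43, 1), (22, 2)],   # SALES 3
]

def local_count(rows, channel_id):
    """Runs on each node: filter and pre-aggregate only the local rows."""
    return Counter(prod_id for prod_id, ch in rows if ch == channel_id)

def merge(partials):
    """Runs on the coordinator: add up the per-node partial counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

partials = [local_count(rows, channel_id=1) for rows in sales_slices]
top = merge(partials).most_common(2)
print(top)
```

Only the small partial counts travel between nodes; the raw rows stay where they are stored.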

Different: Native cloud filesystem support
sql> show partitions sh.sales;
s3a://bucket1/sh/sales/time_id=2011-01 | PARQUET
hdfs://clust1/sh/sales/time_id=2011-09 | PARQUET
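Because every partition carries its own location, one table can span several storage systems. The sketch below groups the two partition locations listed above by filesystem scheme, using Python's standard `urllib.parse`. The `partitions` mapping is a hypothetical stand-in for what the engine's metadata would return.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Partition key -> location, copied from the listing above.
partitions = {
    "time_id=2011-01": "s3a://bucket1/sh/sales/time_id=2011-01",
    "time_id=2011-09": "hdfs://clust1/sh/sales/time_id=2011-09",
}

# Group partition keys by the storage system that holds them.
by_scheme = defaultdict(list)
for key, location in partitions.items():
    by_scheme[urlparse(location).scheme].append(key)

print(dict(by_scheme))
```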

Database engine does NOT “own” data

Different: Different engines can work with the same data files (even at the same time)
ORACLE data files: example01.dbf, sysaux01.dbf, system01.dbf, temp01.dbf, undotbs01.dbf, users01.dbf
Hadoop data files: a01_data.parq, a02_data.parq, a03_data.parq, a04_data.parq, a05_data.parq, a06_data.parq

Different: … or copies of the data files
hdfs://adhoc/a.parq
hdfs://adhoc/b.parq
hdfs://adhoc/c.parq
hdfs://adhoc/d.parq
hdfs://adhoc/e.parq
hdfs://adhoc/f.parq
hdfs://prod/a.parq
hdfs://prod/b.parq
hdfs://prod/c.parq
hdfs://prod/d.parq
hdfs://prod/e.parq
hdfs://prod/f.parq
s3://backup/a.parq
s3://backup/b.parq
s3://backup/c.parq
s3://backup/d.parq
s3://backup/e.parq
s3://backup/f.parq

Different: Open data formats
• Not proprietary – many tools can read/write
• No additional $$ for “advanced features”:
  • Columnar storage
  • Storage indexes
  • Compression
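Columnar storage and storage indexes both reduce how much data a query has to read. The following Python sketch illustrates the general idea only, not the actual Parquet or ORC file layout: one column is stored in chunks, each chunk keeps min/max statistics, and a filter skips chunks whose range cannot contain the value. All names and data are hypothetical.

```python
def build_chunks(values, chunk_size):
    """Store one column as chunks, each with min/max statistics."""
    chunks = []
    for i in range(0, len(values), chunk_size):
        part = values[i:i + chunk_size]
        chunks.append({"min": min(part), "max": max(part), "values": part})
    return chunks

def count_equal(chunks, target):
    """Count matches, reading only chunks whose [min, max] range can contain target."""
    matches, chunks_read = 0, 0
    for chunk in chunks:
        if chunk["min"] <= target <= chunk["max"]:
            chunks_read += 1
            matches += chunk["values"].count(target)
    return matches, chunks_read

# Made-up prod_id column, already sorted so the statistics are selective.
prod_id_column = [1, 2, 2, 3, 10, 11, 12, 12, 40, 43, 43, 48]
chunks = build_chunks(prod_id_column, chunk_size=4)
matches, chunks_read = count_equal(chunks, target=43)
print(matches, "matches,", chunks_read, "of", len(chunks), "chunks read")
```

The statistics only help when similar values are stored close together, which is one reason why sorting or partitioning data on load matters in these systems.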

Same: “sqlplus-like” clients
> impala-shell -i 10.0.0.1
[10.0.0.1:21000] > select prod_id, count(1)
from sh.sales group by prod_id order by 2 desc limit 1;
+-----------------------+----------+
| prod_id               | count(1) |
+-----------------------+----------+
| 48.000000000000000000 | 74026    |
+-----------------------+----------+
> beeline -u 'jdbc:hive2://10.0.0.1:10000'
0: jdbc:hive2://10.0.0.1:1> select prod_id, count(1)
from sh.sales group by prod_id order by 2 desc limit 1;

Different: External dictionary
[Diagram: in ORACLE, user data and the dictionary (SYS) sit together in the database; in Hadoop, user data sits apart from the dictionary, which is the Hive Metastore]
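One way to picture an external dictionary: table definitions live in a shared catalog, and any engine that can read the catalog can find and read the same files. The sketch below is a toy model in Python, with a dict standing in for the Hive Metastore and a CSV file standing in for Parquet. `metastore`, `read_table`, `engine_a_count` and `engine_b_sum` are hypothetical names.

```python
import csv
import os
import tempfile

# Write a data file that no engine "owns".
data_dir = tempfile.mkdtemp()
data_file = os.path.join(data_dir, "a01_data.csv")
with open(data_file, "w", newline="") as f:
    csv.writer(f).writerows([[43, 10], [46, 20], [43, 30]])

# Toy stand-in for the metastore: table name -> columns and file locations.
metastore = {"sh.sales": {"columns": ["prod_id", "amount"], "files": [data_file]}}

def read_table(name):
    """Resolve the table through the catalog and apply the schema while reading."""
    meta = metastore[name]
    for path in meta["files"]:
        with open(path, newline="") as f:
            for row in csv.reader(f):
                yield dict(zip(meta["columns"], map(int, row)))

def engine_a_count(name):
    """First 'engine': counts rows."""
    return sum(1 for _ in read_table(name))

def engine_b_sum(name, column):
    """Second 'engine': sums a column from the same files."""
    return sum(row[column] for row in read_table(name))

print(engine_a_count("sh.sales"), engine_b_sum("sh.sales", "amount"))
```

The column names are applied only when the file is read, which is also a small illustration of schema on read.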

Different: Append only, “ETL-like” DML
• Hadoop DML is more like ETL
• Data is presumed static
• ACID: some interpretation required
• Schema on read
Example: UPDATE t SET a=12 WHERE b=1;
1. Before: Table T (base): a_data.orc
2. After the UPDATE: Table T (base): a_data.orc + Table T (delta): b_data.orc
3. Compactor runs …
4. After compaction: Table T (base): c_data.orc
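The base, delta and compaction sequence above can be sketched in Python. This is a toy model under stated assumptions, not Hive's actual ACID implementation: files are immutable sets of rows, an UPDATE writes a new delta instead of changing the base, readers merge base and deltas, and a compactor rewrites everything into a new base. Row ids and all function names are hypothetical.

```python
# Table T: immutable "files" holding rows keyed by a row id.
table = {
    "base": {"a_data.orc": {1: {"a": 5, "b": 1}, 2: {"a": 7, "b": 2}}},
    "delta": {},
}

def read(table):
    """Readers merge base rows with deltas; rows in deltas win."""
    rows = {}
    for files in (table["base"], table["delta"]):
        for file_rows in files.values():
            rows.update(file_rows)
    return rows

def update(table, delta_name, set_a, where_b):
    """UPDATE t SET a=... WHERE b=...: write changed rows to a new delta file."""
    changed = {rid: {**row, "a": set_a}
               for rid, row in read(table).items() if row["b"] == where_b}
    table["delta"][delta_name] = changed

def compact(table, new_base_name):
    """Compactor: fold base and deltas into one new base file."""
    merged = read(table)
    table["base"] = {new_base_name: merged}
    table["delta"] = {}

update(table, "b_data.orc", set_a=12, where_b=1)
before = read(table)          # served from a_data.orc + b_data.orc
compact(table, "c_data.orc")
after = read(table)           # served from c_data.orc only
print(after)
```

Readers see the same rows before and after compaction; only the set of files behind the table changes.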

Apache Hive
• “Designed” for “batch” queries (*)
• Runs on top of standard Hadoop RM: YARN
• Supports multiple “engines”: MR, Tez, Spark
• SerDes
[Diagram: Master (Hiveserver2, namenode, YARN RM); three slave nodes, each running YARN NM and datanode]

Apache Impala
• Designed for “quick interactive” queries
• “Data-local” execution
• In-memory processing
[Diagram: Master (statestored, catalogd, namenode); Slave A, Slave B, Slave C, each running impalad and datanode]

Apache Spark
• “Better Hadoop” with “native”: SQL, MLlib, GraphX
• In-memory processing, based on RDDs
• Supports many cluster managers: “native”, YARN, Mesos
• Flexible programming model
[Diagram: Master (Driver); Slave A, Slave B, Slave C, each running an Executor]

Presto
• Designed for “interactive” queries
• In-memory processing
• Custom storage “plugins”: Hive, Kafka, MySQL, Postgres,…
[Diagram: Master (coordinator); Slave A, Slave B, Slave C, each running a worker]

Step 1: Google “Hadoop ecosystem”

Step 2: Try to install the simplest thing

Step 3

Step 4

Hint: Nobody builds their own Linux anymore

Choose a Hadoop distribution that suits you

Hadoop distributions
• Pre-built and pre-integrated
(aka: all things work out of the box)
• Each has its own “philosophy” …
• … as well as a preferred Hadoop database

So what’s in it for me?
• It’s interesting (cool technology that hits many recent buzzwords)
• If you know ORACLE, it’s close to your skill set
• It’s promising and future oriented

Please Complete Your
Session Evaluation
Evaluate this session in your COLLABORATE app.
Pull up this session and tap "Session Evaluation"
to complete the survey.
Session ID: 557
