Hadoop Essentials
for
Oracle Professionals
申建忠(Frank Chien-Chung Shen)
Oracle Authorized Trainer
MySQL Authorized Trainer
Cloudera Hadoop Authorized Trainer
Big Data
• Volume
– Data amount
• Variety
– Data type
• Velocity
– Data processing speed
• Value
Hadoop Common (MR v1)
• HDFS (Hadoop Distributed FileSystem)
• MapReduce (mapred API / framework / resource management)
Hadoop Common (MR v2)
• HDFS
• MapReduce (mapreduce API / framework)
• YARN, Yet Another Resource Negotiator (YARN API / resource management)
Hadoop EcoSystem
[Diagram: ecosystem components layered over HDFS (Hadoop Distributed FileSystem), MapReduce v1, YARN (Yet Another Resource Negotiator), and MapReduce v2 (mapred/mapreduce Java APIs): Impala, Spark, Storm, HBase, Giraph, Mahout, Hive, Pig, ZooKeeper, HUE, Oozie, Solr, Flume, Sqoop, Kafka, HttpFS, WebHDFS, hadoop fs]
Hadoop EcoSystem
[Diagram: the same ecosystem, annotated with interfaces and inputs. Hive and Impala: SQL; Pig: Pig Latin; Spark: Scala/Java/Python; MapReduce: Java. Data sources feeding the cluster: logs, messages, RDBMS, HTTP/S]
Hadoop EcoSystem
• Hive – Batch SQL
• Impala – Interactive SQL
• Mahout – Machine Learning
• Storm – Streaming
• Spark – In-Memory Processing
• HBase – NoSQL Database
• Pig – Scripting Language
• Giraph – Graph Processing
• ZooKeeper – Cluster Coordination
• Oozie – Workflow
• HUE – Browser GUI
• Solr – Search
• Flume – Data Collector
• Kafka – Data Collector
• Sqoop – Data Exchange
• HttpFS / WebHDFS – HTTP Access
• hadoop fs – Command Line
(All run on top of HDFS, with MapReduce v1/v2 (mapred/mapreduce Java APIs) and YARN as the computing layer.)
Oracle Server
[Diagram: computing layer: an instance with server processes and DBWR. Storage layer: the database, made up of blocks. Server processes read blocks into the instance; DBWR writes modified blocks back.]
Real Application Cluster
[Diagram: several instances, each with its own DBWR and server processes, share one database. A parallel query coordinator on one instance hands work to parallel slaves on the other instances; each slave reads blocks from the shared storage.]
I/O is the problem
• Faster I/O
– ASM
– Exadata
• Less I/O
– Index Access
– Compression
– Partitioned Table/Index
– Materialized View/View Log
– Clustered Table
– Index Organized Table
– Exadata
Oracle ASM
[Diagram: the +ASM instance manages the DATA diskgroup, built from raw partitions /dev/sdb1 through /dev/sdg1. Each disk holds metadata plus allocation units of 1/2/4/8/16/32/64 MB composed of 4 KB physical blocks. A whole device (/dev/sdh) or a partition (/dev/sdh1) can also be added as a disk.]
SQL> create tablespace ts1 datafile '+DATA' size 3M;
[Diagram: the orcl instance and an ACFS filesystem (/acfs_data, carved from an ASM volume) both allocate space from the DATA diskgroup (/dev/sdb1 through /dev/sdg1). Extents are mirrored: each primary extent has a mirror extent on a different disk. Partitions /dev/sdh1 and /dev/sdh2 of /dev/sdh serve as additional disks.]
Exadata
[Diagram: instances with DBWR and server processes offload Smart Scans to Exadata Cells. Each cell filters blocks in the storage layer and returns only the matching data, instead of shipping whole blocks to the instance.]
Hadoop Concepts
• Moving processing is faster than moving data
• Data locality
• HDFS was designed for many millions of large files, not billions of small files
• Write once, read many times
• Default replication factor: 3
Why Not Move Data?
• Moving data from the storage layer to the computing layer takes time
• Processing can't start until the data movement completes
• More nodes, more contention
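A rough back-of-envelope comparison makes the point. All numbers below are illustrative assumptions, not from the slides:

```python
# Back-of-envelope: shipping data to the computation vs. shipping
# the computation (a small jar) to the data. Figures are assumptions.

def transfer_seconds(data_bytes: float, link_bits_per_sec: float) -> float:
    """Time to push data_bytes over a network link of the given speed."""
    return data_bytes * 8 / link_bits_per_sec

TB = 10**12
ONE_GBPS = 10**9

# Moving 1 TB of data over a single 1 Gbps link:
move_data = transfer_seconds(1 * TB, ONE_GBPS)            # 8000 s, over 2 hours

# Shipping a ~10 MB processing jar to 100 worker nodes instead:
ship_code = transfer_seconds(10 * 10**6, ONE_GBPS) * 100  # 8 s total

print(f"move data: {move_data:.0f} s, ship code: {ship_code:.0f} s")
```

The gap only widens as data grows, which is why Hadoop moves the jar, not the blocks.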
Storage Layer
[Diagram: one Name Node and four Data Nodes]
HDFS
$ hadoop fs -put /home/frank/data1.txt /user/frank/data1.txt
[Diagram, three steps: the 150M file is uploaded as 64M blocks. The Name Node records metadata for /user/frank/data1.txt (Replications: 3, Blocksize: 64M) and the Data Nodes holding each block as it is written:
  block 1 (64M): Data Nodes 1, 2, 3
  block 2 (64M): Data Nodes 2, 3, 4
  block 3 (22M): Data Nodes 1, 2, 4]
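The block splitting above is simple arithmetic. A toy model (not real HDFS code, just the bookkeeping the Name Node metadata implies):

```python
# Toy model of how HDFS splits a file into fixed-size blocks.
# The last block holds only the remainder of the file.

def split_into_blocks(file_size_mb: int, block_size_mb: int = 64) -> list[int]:
    """Return the sizes (in MB) of the HDFS blocks a file occupies."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

print(split_into_blocks(150))   # [64, 64, 22]
```

With the default replication factor of 3, those three blocks occupy nine block replicas across the Data Nodes.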
Computing Layer
[Diagram: one Resource Manager and four Node Managers]
YARN
$ hadoop jar wordcount.jar /user/frank/data1.txt /user/frank/wc_data1_output
[Diagram, three steps:
  1. The Resource Manager launches an Application Master (AM) in a container on one Node Manager.
  2. The AM requests containers from the Resource Manager; Mappers run on the Node Managers that hold the input blocks.
  3. Reducers run to collect and aggregate the Mappers' output.]
Data Locality
[Diagram: each worker host runs both a Data Node and a Node Manager, so computation can be scheduled on the host that stores the data. The Name Node and Resource Manager JVM daemons run on separate hosts.]
Suggested Host Configurations
[Diagram: worker hosts each run a Data Node and a Node Manager; master hosts run the Name Node and the Resource Manager JVM daemons.]
Map Reduce Architecture
[Diagram: input splits of HDFS://user/frank/data1.txt feed three Mappers; shuffle and sort routes their output to two Reducers, which write Part-00000 and Part-00001 to HDFS://user/frank/wc_data1_output. Only the JAR (the computing) moves to the data.]
the cat sat on the mat          the aardvark sat on the sofa
        Mapper 1                        Mapper 2
(the, 1)(cat, 1)(sat, 1)        (the, 1)(aardvark, 1)(sat, 1)
(on, 1)(the, 1)(mat, 1)         (on, 1)(the, 1)(sofa, 1)

Shuffle and Sort:
(aardvark, 1) (cat, 1) (mat, 1) (on, 1, 1) (sat, 1, 1) (sofa, 1) (the, 1, 1, 1, 1)

Reducers:
aardvark, 1
cat, 1
mat, 1
on, 2
sat, 2
sofa, 1
the, 4
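The word-count flow above can be mirrored in-process. This is only a sketch of the logic; real MapReduce distributes each phase across nodes:

```python
# Minimal in-process sketch of the map / shuffle-and-sort / reduce phases
# of word count. Mirrors the slide's example, not a distributed runtime.
from collections import defaultdict

def mapper(line: str):
    """Map phase: emit (word, 1) for every word in the line."""
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle and sort: group values by key, then order by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reducer(key, values):
    """Reduce phase: sum the counts for one key."""
    return (key, sum(values))

lines = ["the cat sat on the mat", "the aardvark sat on the sofa"]
mapped = [pair for line in lines for pair in mapper(line)]
result = [reducer(key, values) for key, values in shuffle(mapped)]
print(result)   # aardvark 1, cat 1, mat 1, on 2, sat 2, sofa 1, the 4
```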
NoSQL
[Timeline: the reading of "NoSQL" has shifted over time, from "No SQL" to "Not Only SQL" to "No, SQL"]
SQL on Hadoop
[Diagram: a latency spectrum starting at 0:
  Operational SQL: < 100 msec
  Interactive SQL: 100 msec ~ 20 min
  Batch SQL: > 20 min]
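The bands above can be captured in a small helper. The boundaries come from the slide; the function itself is only an illustration:

```python
# Classify a SQL-on-Hadoop workload by its expected query latency,
# using the boundaries from the slide (100 msec and 20 minutes).

def sql_workload_class(latency_sec: float) -> str:
    """Return the workload band for a given query latency in seconds."""
    if latency_sec < 0.1:
        return "Operational SQL"
    if latency_sec <= 20 * 60:
        return "Interactive SQL"
    return "Batch SQL"

print(sql_workload_class(0.05))   # Operational SQL
print(sql_workload_class(30))     # Interactive SQL
print(sql_workload_class(3600))   # Batch SQL
```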
SQL on Hadoop
• Batch SQL
– Hive
• Based on MapReduce
• Based on Tez (Hortonworks)
• Interactive SQL
– Spark SQL (Hive based on Spark)
– Impala (Cloudera)
– Apache Drill
– Presto (Facebook)
• Operational SQL
– Apache Phoenix (SQL on HBase)
Schema on Write
Create the table (the schema) before inserting data into it.

Raw records:
001, Frank, 25149191, 20000, M, 2000-11-01
002, Linda, 85091233, 19000, F, 2001-01-05

Empid | Ename | Salary | Hire_date
001   | Frank | 20000  | 2000-11-01
002   | Linda | 19000  | 2001-01-05

Fields with no matching column (25149191/85091233 and M/F) are missing: they were dropped at load time.
Schema on Read
The raw data is stored as-is; a schema is applied only when the data is read.

Empid | Ename | Job      | Salary | Dept | Hire_date
001   | Frank | 25149191 | 20000  | M    | 2000-11-01
002   | Linda | 85091233 | 19000  | F    | 2001-01-05

Nothing is dropped at load time; a different schema can be applied to the same raw data later.
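Schema on read can be sketched as parsing the same untouched raw file under different column lists. The column names here are illustrative:

```python
# Schema-on-read sketch: the raw file is stored untouched; each reader
# applies its own column list when parsing. Column names are assumptions.

raw_lines = [
    "001,Frank,25149191,20000,M,2000-11-01",
    "002,Linda,85091233,19000,F,2001-01-05",
]

def read_with_schema(lines, columns):
    """Apply a schema at read time: zip column names onto the raw fields."""
    return [dict(zip(columns, line.split(","))) for line in lines]

# One reader's six-column schema...
six_cols = ["Empid", "Ename", "Job", "Salary", "Dept", "Hire_date"]
rows = read_with_schema(raw_lines, six_cols)
print(rows[0]["Salary"])    # '20000'

# ...and a narrower projection applied later to the very same raw data.
two_cols = ["Empid", "Ename"]
print(read_with_schema(raw_lines, two_cols)[1]["Ename"])   # 'Linda'
```

Contrast with schema on write, where fields outside the declared columns are gone for good.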
Sqoop
[Diagram: Sqoop generates a Map Reduce job that reads table t1 from the RDBMS over JDBC and writes it to HDFS as /user/cloudera/t1/part-00000 and /user/cloudera/t1/part-00001.]
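Sqoop parallelizes the import by splitting the table's key range across mappers (its --split-by and -m options control this). A toy sketch of that splitting, with illustrative numbers:

```python
# Sketch of Sqoop-style parallel import: the split column's key range is
# divided into contiguous slices, one per mapper, each pulled over JDBC.
# Not Sqoop's actual code; just the partitioning idea.

def key_ranges(min_id: int, max_id: int, num_mappers: int):
    """Split [min_id, max_id] into num_mappers contiguous ranges."""
    span = (max_id - min_id + 1) / num_mappers
    ranges = []
    for i in range(num_mappers):
        lo = min_id + round(i * span)
        hi = min_id + round((i + 1) * span) - 1
        ranges.append((lo, hi))
    return ranges

print(key_ranges(1, 100, 2))   # [(1, 50), (51, 100)]
```

Each mapper then writes its slice as one part-NNNNN file, which is why the HDFS directory above holds part-00000 and part-00001.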
Hive Architecture
[Diagram: a query
  Select id From t1 Where name='Frank'
is compiled into a MapReduce job. The MetaStore database holds the data dictionary (table definitions); external table t1 maps to its data files on HDFS: /user/cloudera/t1/part-00000 and /user/cloudera/t1/part-00001.]
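How such a filtering query maps onto MapReduce can be sketched in-process. This is a toy model; the field layout and names are assumptions, not Hive's actual plan:

```python
# Toy model of compiling "Select id From t1 Where name='Frank'" into a
# map-side task: the mapper applies the WHERE predicate and emits the
# projected column. A simple filter like this needs no reduce phase.

rows = [("1", "Frank"), ("2", "Linda"), ("3", "Frank")]   # (id, name)

def filter_mapper(row):
    rid, name = row
    if name == "Frank":     # Where name = 'Frank'
        yield rid           # Select id

result = [rid for row in rows for rid in filter_mapper(row)]
print(result)   # ['1', '3']
```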
Oracle SQL Connector for HDFS
[Diagram: an Oracle instance queries table data stored on HDFS through the connector; the table definitions live in the Oracle data dictionary. Supported sources: Data Pump files, delimited text files, and delimited text files in Apache Hive tables.]
Flume Architecture
[Diagram: a Flume agent on a host takes events from a source (e.g. a web server), buffers them in a channel, and delivers them through a sink to HDFS, HBase, or Solr.]
Data Warehouse based on RDBMS
[Diagram: web server access logs and OLTP data are Extracted/Transformed/Loaded into the data warehouse; users run BI tools, ad hoc queries, and reports against it.]
Integrated RDBMS with Hadoop
[Diagram: web server access logs flow into the Hadoop ecosystem; Sqoop exchanges data between Hadoop and both the OLTP database and the data warehouse; users run BI tools, ad hoc queries, and reports.]
Q&A
