© Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
An Introduction to Hadoop and Cloudera
Nashville Cloudera User Group, 10/23/14
Ian Wrigley, Director, Educational Curriculum
ian@cloudera.com
@iwrigley
201405

Presentation Topics
An Introduction to Hadoop and Cloudera
§ The Motivation for Hadoop
§ ‘Core Hadoop’: HDFS and MapReduce
§ CDH and the Hadoop Ecosystem
§ Data Storage: HBase
§ Data Integration: Flume and Sqoop
§ Data Processing: Spark
§ Data Analysis: Hive, Pig, and Impala
§ Data Exploration: Cloudera Search
§ Managing Everything: Cloudera Manager
§ Conclusion

Traditional Large-Scale Computation
§ Traditionally, computation has been processor-bound
– Relatively small amounts of data
– Lots of complex processing
§ The early solution: bigger computers
– Faster processor, more memory
– But even this couldn’t keep up

Distributed Systems
§ The better solution: more computers
– Distributed systems: use multiple machines for a single job
“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, we didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.” – Grace Hopper
[Figure labels: Database, Hadoop Cluster]

Distributed Systems: Challenges
§ Challenges with distributed systems
– Programming complexity
– Keeping data and processes in sync
– Finite bandwidth
– Partial failures

Distributed Systems: The Data Bottleneck (1)
§ Traditionally, data is stored in a central location
§ Data is copied to processors at runtime
§ Fine for limited amounts of data

Distributed Systems: The Data Bottleneck (2)
§ Modern systems have much more data
– terabytes+ a day
– petabytes+ total
§ We need a new approach…

Hadoop
§ A radical new approach to distributed computing
– Distribute data when the data is stored
– Run computation where the data is stored

Hadoop: Very High-Level Overview
§ Data is split into “blocks” when loaded
§ Each task typically works on a single block
– Many run in parallel
§ A master program manages tasks
[Figure labels: Master, Slave Nodes; the data file is shown as placeholder text]

Core Hadoop Concepts
§ Applications are written in high-level code
§ Nodes talk to each other as little as possible
§ Data is distributed in advance
– Bring the computation to the data
§ Data is replicated for increased availability and reliability
§ Hadoop is scalable and fault-tolerant

Scalability
§ Adding nodes adds capacity proportionally
§ Increasing load results in a graceful decline in performance
– Not failure of the system
[Figure: Capacity plotted against Number of Nodes]

Fault Tolerance
§ Node failure is inevitable
§ What happens?
– System continues to function
– Master re-assigns tasks to a different node
– Data replication = no loss of data
– Nodes which recover rejoin the cluster automatically
“Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure.” – Ken Arnold (CORBA designer)
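
The re-assignment step can be illustrated with a minimal Python sketch. It assumes each task works on one block and that the master knows which nodes hold a replica of each block; the function name `reassign_tasks` and the data layout are hypothetical and are not Hadoop's actual scheduler API.

```python
def reassign_tasks(assignments, block_locations, failed_node):
    """Move every task that was running on failed_node to another replica holder.

    assignments:     task name -> (block id, node currently running the task)
    block_locations: block id  -> list of nodes holding a replica of that block
    """
    updated = {}
    for task, (block, node) in assignments.items():
        if node == failed_node:
            # Replication means another node still holds the same block.
            survivors = [n for n in block_locations[block] if n != failed_node]
            if not survivors:
                raise RuntimeError(f"all replicas of {block} are lost")
            node = survivors[0]
        updated[task] = (block, node)
    return updated


block_locations = {"B1": ["A", "B", "D"], "B2": ["B", "D", "E"]}
assignments = {"task1": ("B1", "A"), "task2": ("B2", "B")}
after_failure = reassign_tasks(assignments, block_locations, failed_node="A")
```

Picking the first surviving replica is only the simplest possible policy; a real scheduler would also weigh load and locality.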

‘Core Hadoop’: HDFS and MapReduce

HDFS Basic Concepts
§ The Hadoop Distributed File System (HDFS) is a filesystem written in Java
§ Sits on top of a native filesystem
§ Provides storage for massive amounts of data
– Scalable
– Fault tolerant
– Supports efficient processing with MapReduce, Spark, and other tools
[Figure labels: Hadoop Cluster, HDFS]

How Files are Stored
§ Data files are split into blocks and distributed to data nodes
§ Each block is replicated on multiple nodes (default 3x)
§ NameNode stores metadata
– Metadata: information about files and blocks
[Figure: a Very Large Data File split into Block 1, Block 2, and Block 3; copies of each block are spread across the data nodes, and the NameNode holds the metadata]
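
A minimal Python sketch of the two ideas above: cutting a file into fixed-size blocks and assigning each block to several nodes. The function names, the tiny block size, and the round-robin placement are illustrative assumptions only; real HDFS placement is rack-aware and far more involved.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut the file contents into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list, replication: int = 3) -> dict:
    """Assign each block to `replication` distinct nodes, round-robin (assumed policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


blocks = split_into_blocks(b"x" * 10, block_size=4)   # toy sizes for illustration
placement = place_replicas(len(blocks), ["A", "B", "C", "D", "E"])
```

The `placement` dictionary plays the role of the block-to-node part of the NameNode metadata.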

Example: Storing and Retrieving Files
NameNode metadata:
/logs/031512.log: B1,B2,B3
/logs/041213.log: B4,B5
B1: A,B,D
B2: B,D,E
B3: A,B,C
B4: A,B,E
B5: C,E,D
Client asks the NameNode: /logs/041213.log?
NameNode answers: B4,B5
[Figure: blocks 1 to 5 of the two log files stored on Nodes A to E as listed in the metadata]
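
The lookup in this example can be sketched in Python using the metadata exactly as shown on the slide. The helper names `locate` and `read_plan`, and the "first available replica" policy, are assumptions made for illustration; they are not the HDFS client API.

```python
# Metadata copied from the slide: file -> blocks, block -> nodes holding a replica.
FILE_BLOCKS = {
    "/logs/031512.log": ["B1", "B2", "B3"],
    "/logs/041213.log": ["B4", "B5"],
}
BLOCK_LOCATIONS = {
    "B1": ["A", "B", "D"],
    "B2": ["B", "D", "E"],
    "B3": ["A", "B", "C"],
    "B4": ["A", "B", "E"],
    "B5": ["C", "E", "D"],
}


def locate(path):
    """Return [(block, [nodes])] for a file, as the NameNode would answer a client."""
    return [(block, BLOCK_LOCATIONS[block]) for block in FILE_BLOCKS[path]]


def read_plan(path, unavailable=frozenset()):
    """Pick the first available replica for each block (hypothetical client policy)."""
    plan = []
    for block, nodes in locate(path):
        candidates = [n for n in nodes if n not in unavailable]
        if not candidates:
            raise RuntimeError(f"no live replica for {block}")
        plan.append((block, candidates[0]))
    return plan
```

For example, `read_plan("/logs/041213.log")` reads B4 from Node A and B5 from Node C; if A and C are unavailable, the same call with `unavailable={"A", "C"}` falls back to Nodes B and E.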

Important Notes About HDFS
§ HDFS performs best with a modest number of large files
– Millions, rather than billions, of files
– Each file typically 100MB or more
§ Files in HDFS are “write once”
– Files can be replaced but not changed

MapReduce
§ The Mapper
– Each Map task (typically) operates on a single HDFS block
– Map tasks (usually) run on the node where the block is stored
§ Shuffle and Sort
– Sorts and consolidates intermediate data from all mappers
– Happens after all Map tasks are complete and before Reduce tasks start
§ The Reducer
– Operates on shuffled/sorted intermediate data (Map task output)
– Produces final output
[Figure: Map, Shuffle and Sort, Reduce]
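
The three phases can be imitated in a few lines of single-process Python. This is a word-count sketch for intuition only: the function names are made up, each input string stands in for one HDFS block, and nothing here uses the Hadoop API.

```python
from collections import defaultdict


def map_task(block):
    """Mapper: emit a (word, 1) pair for every word in one block."""
    for word in block.split():
        yield (word.lower(), 1)


def shuffle_and_sort(mapped_outputs):
    """Group the values from all mappers by key, then sort by key."""
    grouped = defaultdict(list)
    for output in mapped_outputs:
        for key, value in output:
            grouped[key].append(value)
    return sorted(grouped.items())


def reduce_task(key, values):
    """Reducer: collapse all values for one key into the final result."""
    return (key, sum(values))


def run_job(blocks):
    mapped = [list(map_task(b)) for b in blocks]   # every Map task finishes first
    grouped = shuffle_and_sort(mapped)             # then shuffle and sort
    return [reduce_task(k, v) for k, v in grouped] # then the Reduce tasks run


result = run_job(["the cat sat", "the dog sat", "the end"])
```

Here `result` is a list of `(word, count)` pairs in key order, for instance `("the", 3)`.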

CDH and the Hadoop Ecosystem

The Hadoop Ecosystem (1)
[Figure labels: CDH; Hadoop Core Components (Hadoop Distributed File System, MapReduce); Hadoop Ecosystem (Sqoop, Impala, Hive, Pig, HBase, Flume, Oozie, …)]

The Hadoop Ecosystem (2)
§ CDH includes many Hadoop Ecosystem components
§ Following are more details on some of the key components
[Figure labels: Hadoop Ecosystem (Sqoop, Impala, Hive, Pig, HBase, Flume, Oozie, …)]

CDH
§ CDH (Cloudera’s Distribution, including Apache Hadoop)
– 100% open source, enterprise-ready distribution of Hadoop and related projects
– The most complete, tested, and widely-deployed distribution of Hadoop
– Integrates all key Hadoop ecosystem projects

Data Storage: HBase

HBase: The Hadoop Database
§ HBase: database layered on top of HDFS
– Provides interactive access to data
§ Stores massive amounts of data
– Petabytes+
§ High throughput
– Thousands of writes per second (per node)
§ Handles sparse data well
– No wasted space for a row with empty columns
§ Limited access model
– Optimized for lookup of a row by key rather than full queries
– No transactions: single row operations only
[Figure label: HDFS]
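
To make the access model concrete, here is a toy in-memory table in Python that stores only the columns a row actually has and supports put, get, and a key-range scan. The class `SparseTable` is a hypothetical illustration of the data model, not the HBase client API.

```python
from bisect import bisect_left, insort


class SparseTable:
    def __init__(self):
        self._rows = {}   # row key -> {column: value}; absent columns cost nothing
        self._keys = []   # row keys kept sorted so a scan is a contiguous range read

    def put(self, row_key, column, value):
        """Write one cell; only the single row is touched."""
        if row_key not in self._rows:
            self._rows[row_key] = {}
            insort(self._keys, row_key)
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Look up one row by key; returns an empty dict if the row is absent."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start_key, stop_key):
        """Yield (row_key, columns) for start_key <= row_key < stop_key, in key order."""
        i = bisect_left(self._keys, start_key)
        while i < len(self._keys) and self._keys[i] < stop_key:
            key = self._keys[i]
            yield key, dict(self._rows[key])
            i += 1


table = SparseTable()
table.put("user1", "name", "Ann")
table.put("user3", "email", "c@example.com")
table.put("user2", "name", "Bob")
```

Every operation is addressed by row key, which is why lookups by key are cheap while arbitrary queries over column values are not supported directly.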

HBase vs RDBMS
Feature                                      RDBMS       HBase
Transactions                                 Yes         Single row only
Query language                               SQL         get/put/scan (or use Hive or Impala)
Indexes                                      Yes         Row-key only
Max data size                                TBs         PBs
Read/write throughput (queries per second)   Thousands   Millions

When To Use HBase
§ Use plain HDFS if…
– You only append to your dataset (no random write)
– You usually read the whole dataset (no random read)
§ Use HBase if…
– You need random write and/or read
– You do thousands of operations per second on TB+ of data
§ Use an RDBMS if…
– Your data fits on one big node
– You need full transaction support
– You need real-time query capabilities

Data Integration: Flume and Sqoop

Flume: Real-time Data Import
§ What is Flume?
– A service to move large amounts of data in real time
– Example: storing log files in HDFS
§ Flume is
– Distributed
– Reliable and available
– Horizontally scalable
– Extensible

Flume: High-Level Overview
• Collect data as it is produced
– Files, syslogs, stdout or custom source
• Process in place
– e.g., encrypt, compress
• Pre-process data before storing
– e.g., transform, scrub, enrich
• Write in parallel
– Scalable throughput
• Store in any format
– Text, compressed, binary, or custom sink
[Figure: multiple Agents passing data through further Agent(s) into HDFS]

Sqoop: Exchanging Data With RDBMSs
§ Sqoop: SQL to Hadoop
– Transfers data between RDBMS and HDFS
– Uses a command-line tool or application connector
– Allows incremental imports
– Supports virtually all RDBMSs which speak JDBC
– Custom connectors available for some RDBMSs for increased speed
[Figure labels: RDBMS, Sqoop, HDFS]
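
The idea behind an incremental import can be sketched in Python: remember the highest value of a check column seen so far and fetch only rows above it next time. This is a conceptual sketch with a hypothetical function name and in-memory rows; it does not invoke Sqoop or a database.

```python
def incremental_import(source_rows, last_value, check_column="id"):
    """Return the rows added since the previous import and the new high-water mark."""
    new_rows = [row for row in source_rows if row[check_column] > last_value]
    new_last = max((row[check_column] for row in new_rows), default=last_value)
    return new_rows, new_last


source = [{"id": 1}, {"id": 2}, {"id": 3}]
first_batch, mark = incremental_import(source, last_value=0)

source.append({"id": 4})                      # a new row arrives in the RDBMS
second_batch, mark = incremental_import(source, last_value=mark)
```

The second call transfers only the row with `id` 4, which is the point of importing incrementally rather than re-copying the whole table.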

Data Center Integration
[Figure labels: File Server, Relational Database (OLTP), Data Warehouse (OLAP), Web/App Servers, Hadoop Cluster; connecting tools: Flume, Sqoop, hadoop fs]

Data Processing: Spark

Apache Spark
§ Apache Spark is a fast, general engine for large-scale data processing on a cluster
§ Originally developed at AMPLab at UC Berkeley
§ Open source Apache project
§ Provides several benefits over MapReduce
– Faster
– Better suited for iterative algorithms
– Can hold intermediate data in RAM, resulting in much better performance
– Easier API
– Supports Python, Scala, Java
– Supports real-time streaming data processing

Spark vs Hadoop MapReduce
§ MapReduce
– Widely used, huge investment already made
– Supports and supported by many complementary tools
– Mature, well-tested
§ Spark
– Flexible
– Elegant
– Fast
– Supports real-time streaming data processing
§ Over time Spark will supplant MapReduce as the general processing framework used by most organizations

Data Analysis: Hive, Pig, and Impala

Hive and Pig: High Level Data Languages
§ The motivation: MapReduce is powerful but hard to master
§ Even Spark requires a developer who can code in Scala or Python
§ A solution: Hive and Pig
– Built on top of MapReduce
– Currently being ported to run on top of Spark for better performance
– Leverage existing skillsets
  – Data analysts who use SQL
  – Programmers who use scripting languages
– Open source Apache projects
  – Hive initially developed at Facebook
  – Pig initially developed at Yahoo!

Hive
§ What is Hive?
– HiveQL: An SQL-like interface to Hadoop
SELECT * FROM purchases WHERE price > 10000 ORDER BY storeid
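
To see what the query returns on a tiny data set, the following Python sketch runs the same statement against an in-memory SQLite database. SQLite is only a stand-in here, not Hive; the column list is borrowed from the Pig example in this deck and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (itemid INTEGER, price REAL, storeid INTEGER, purchaserid INTEGER)"
)
# Invented sample rows: two big-ticket purchases and one small one.
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?)",
    [(1, 12000.0, 7, 100), (2, 500.0, 3, 101), (3, 25000.0, 2, 102)],
)

# The statement from the slide, unchanged.
rows = conn.execute(
    "SELECT * FROM purchases WHERE price > 10000 ORDER BY storeid"
).fetchall()
conn.close()
```

Only the two purchases above 10000 come back, ordered by store. On a cluster, Hive would compile the same statement into distributed jobs over files in HDFS.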

Pig
§ What is Pig?
– Pig Latin: A dataflow language for transforming large data sets
purchases = LOAD '/user/dave/purchases' AS (itemID, price, storeID, purchaserID);
bigticket = FILTER purchases BY price > 10000;
...
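
The dataflow style can be mirrored in plain Python, where each step consumes the output of the previous one. This sketch covers only the LOAD and FILTER steps shown above; the helper names and the sample lines are made up, and tab-separated input is assumed.

```python
def load(lines, fields):
    """Rough analogue of LOAD ... AS (fields): turn tab-separated lines into dicts."""
    for line in lines:
        yield dict(zip(fields, line.rstrip("\n").split("\t")))


def filter_by(records, predicate):
    """Rough analogue of FILTER ... BY: keep only the records matching the predicate."""
    return (record for record in records if predicate(record))


raw = ["1\t12000\t7\t100\n", "2\t500\t3\t101\n"]   # invented sample data
purchases = load(raw, ["itemID", "price", "storeID", "purchaserID"])
bigticket = list(filter_by(purchases, lambda r: float(r["price"]) > 10000))
```

Both steps are lazy generators, so nothing is read until the final `list(...)` pulls records through the pipeline, which loosely resembles how a dataflow script describes work before it is executed.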

Impala: High Performance Queries
§ High-performance SQL engine for vast amounts of data
– Similar query language to HiveQL
– 10 to 50+ times faster than Hive, Pig, or MapReduce
– Effectively, provides ‘real time’ results
§ Impala runs on Hadoop clusters
– Data stored in HDFS
– Does not use MapReduce
§ Developed by Cloudera
– 100% open source, released under the Apache software license

Which to Choose? (1)
§ Choose the best solution for the given task
– Mix and match as needed
§ MapReduce
– Low-level approach offers flexibility, control, and performance
– More time-consuming and error-prone to write
– Choose when control and performance are most important
§ Pig, Hive, and Impala
– Faster to write, test, and deploy than MapReduce
– Better choice for most analysis and processing tasks

Which to Choose? (2)
§ Use Impala when…
– You have analysts familiar with SQL
– You need near real-time responses to ad hoc queries
– You have structured data with a defined schema
§ Use Hive or Pig when…
– You need support for custom file types, or complex data types
§ Use Pig when…
– You have developers experienced with writing scripts
– Your data is unstructured/multi-structured
§ Use Hive when…
– Your data is structured and you are performing long-running, batch jobs

Comparing Pig, Hive, and Impala
Description of Feature             Pig        Hive       Impala
SQL-based query language           No         Yes        Yes
Schema                             Optional   Required   Required
Supports user-defined functions    Yes        Yes        Yes
Extensible file format support     Yes        Yes        No
Query speed                        Slow       Slow       Fast
Accessible via ODBC/JDBC           No         Yes        Yes

Do These Replace an RDBMS?
§ Probably not if the RDBMS is used for its intended purpose
§ Relational databases are optimized for:
– Relatively small amounts of data
– Immediate results
– In-place modification of data
§ Pig, Hive, and Impala are optimized for:
– Large amounts of read-only data
– Extensive scalability at low cost
§ Pig and Hive are better suited for batch processing
– Impala and RDBMSs are better for interactive use

Analysis Workflow Example
[Figure: a Hadoop Cluster with Impala, surrounded by these activities]
– Import Transaction Data from RDBMS
– Sessionize Web Log Data with Pig
– Sentiment Analysis on Social Media with Hive
– Analyst using Impala shell for ad hoc queries
– Analyst using Impala via BI tool
– Generate Nightly Reports using Pig, Hive, or Impala

Data Exploration: Cloudera Search

Cloudera Search
§ Real-time, scalable indexing
§ Load any type of data
§ Text and faceted searching

Cloudera Search Example: Twitter Feed Search
[Figure labels: Iterative search using facets; Full text search]

Managing Everything: Cloudera Manager

Reducing Complexity With Cloudera Manager
§ Putting Hadoop into production requires stringent uptimes
§ Clusters are made up of a large number of hosts
– Each host runs multiple Hadoop services
– Difficult to know the status of everything
§ Inevitable issues will arise with hardware and software
§ Keeping track of the cluster becomes an issue
– Are all hosts healthy and working?
– Am I using all of the best practices for the service?
– Is there a performance issue for a host or service?
– Is the cluster secure?

What Is Cloudera Manager?
§ Cloudera Manager is a purpose-built application designed to make the administration of Hadoop simple and straightforward
– Automates the installation of a Hadoop cluster
– Quickly adds and configures new services on a cluster
– Provides real-time monitoring of cluster activity
– Produces reports of cluster usage
– Manages users and groups who have access to the cluster
– Integrates with your existing enterprise monitoring tools
§ Cloudera Manager Express Edition
– Free
§ Cloudera Enterprise
– Cloudera Manager plus support
– Contact us for pricing

Cloudera Manager Dashboard

Health Status and Charting

Conclusion

Conclusion
§ There are several more projects in CDH
– CDH supports all the key projects you need
§ We haven’t even talked about security!
– CDH includes Kerberos integration for authentication
– Cloudera Enterprise provides all the security you need, whatever your industry
– Recently achieved PCI certification
§ Download the QuickStart VM to get started in a single VM
§ Try Cloudera on a real cluster for free
§ All available at cloudera.com/live
§ Questions?
© 
Copyright 
2010-­‐2014 
Cloudera. 
All 
rights 
reserved. 
Not 
to 
be 
reproduced 
without 
prior 
wri>en 
consent. 
59

More Related Content

What's hot

Big Data Performance and Capacity Management
Big Data Performance and Capacity ManagementBig Data Performance and Capacity Management
Big Data Performance and Capacity Managementrightsize
 
Hadoop 101
Hadoop 101Hadoop 101
Hadoop 101EMC
 
20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introduction20131205 hadoop-hdfs-map reduce-introduction
20131205 hadoop-hdfs-map reduce-introductionXuan-Chao Huang
 
Productionizing Hadoop: 7 Architectural Best Practices
Productionizing Hadoop: 7 Architectural Best PracticesProductionizing Hadoop: 7 Architectural Best Practices
Productionizing Hadoop: 7 Architectural Best PracticesMapR Technologies
 
Introduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache HadoopIntroduction to Cloudera's Administrator Training for Apache Hadoop
Introduction to Cloudera's Administrator Training for Apache HadoopCloudera, Inc.
 
Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14Hadoop_Its_Not_Just_Internal_Storage_V14
Hadoop_Its_Not_Just_Internal_Storage_V14John Sing
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce FrameworkEdureka!
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryCloudera, Inc.
 
Hadoop 2 - Beyond MapReduce
Hadoop 2 - Beyond MapReduceHadoop 2 - Beyond MapReduce
Hadoop 2 - Beyond MapReduceUwe Printz
 
Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataWANdisco Plc
 
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

  • 1. An Introduction to Hadoop and Cloudera
      Nashville Cloudera User Group, 10/23/14
      Ian Wrigley, Director, Educational Curriculum
      ian@cloudera.com, @iwrigley
      201405
      © Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
  • 2. Presentation Topics: An Introduction to Hadoop and Cloudera
      § The Motivation for Hadoop
      § ‘Core Hadoop’: HDFS and MapReduce
      § CDH and the Hadoop Ecosystem
      § Data Storage: HBase
      § Data Integration: Flume and Sqoop
      § Data Processing: Spark
      § Data Analysis: Hive, Pig, and Impala
      § Data Exploration: Cloudera Search
      § Managing Everything: Cloudera Manager
      § Conclusion
  • 3. Traditional Large-Scale Computation
      § Traditionally, computation has been processor-bound
        – Relatively small amounts of data
        – Lots of complex processing
      § The early solution: bigger computers
        – Faster processor, more memory
        – But even this couldn’t keep up
  • 4. Distributed Systems
      § The better solution: more computers
        – Distributed systems: use multiple machines for a single job
      “In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.” – Grace Hopper
      [Diagram labels: Database, Hadoop Cluster]
  • 5. Distributed Systems: Challenges
      § Challenges with distributed systems
        – Programming complexity
        – Keeping data and processes in sync
        – Finite bandwidth
        – Partial failures
  • 6. Distributed Systems: The Data Bottleneck (1)
      § Traditionally, data is stored in a central location
      § Data is copied to processors at runtime
      § Fine for limited amounts of data
  • 7. Distributed Systems: The Data Bottleneck (2)
      § Modern systems have much more data
        – terabytes+ a day
        – petabytes+ total
      § We need a new approach…
  • 8. Hadoop
      § A radical new approach to distributed computing
        – Distribute data when the data is stored
        – Run computation where the data is stored
  • 9. Hadoop: Very High-Level Overview
      § Data is split into “blocks” when loaded
      § Each task typically works on a single block
        – Many run in parallel
      § A master program manages tasks
      [Diagram labels: Master, Slave Nodes]
  • 10. Core Hadoop Concepts
      § Applications are written in high-level code
      § Nodes talk to each other as little as possible
      § Data is distributed in advance
        – Bring the computation to the data
      § Data is replicated for increased availability and reliability
      § Hadoop is scalable and fault-tolerant
  • 11. Scalability
      § Adding nodes adds capacity proportionally
      § Increasing load results in a graceful decline in performance
        – Not failure of the system
      [Chart axes: Number of Nodes, Capacity]
  • 12. Fault Tolerance
      § Node failure is inevitable
      § What happens?
        – System continues to function
        – Master re-assigns tasks to a different node
        – Data replication = no loss of data
        – Nodes which recover rejoin the cluster automatically
      “Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure.” – Ken Arnold (CORBA designer)
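The re-assignment step on the Fault Tolerance slide can be sketched in a few lines of Python. This is only an illustration, not Hadoop's scheduler: the function name `reassign`, the replica layout, and the "pick the first surviving replica" rule are all assumptions made for the example.

```python
def reassign(assignments, replicas, failed):
    """Move every task that was running on `failed` to a node holding a replica.

    assignments: block -> node currently running the task for that block
    replicas:    block -> list of nodes that store a copy of the block
    Both structures and the selection rule are hypothetical.
    """
    updated = {}
    for block, node in assignments.items():
        if node != failed:
            updated[block] = node  # task is unaffected by the failure
            continue
        survivors = [n for n in replicas[block] if n != failed]
        if not survivors:
            raise RuntimeError("no surviving replica of " + block)
        updated[block] = survivors[0]  # rerun the task next to another copy
    return updated


# Hypothetical cluster state: three blocks, each stored on three nodes.
replicas = {"B1": ["A", "B", "D"], "B2": ["B", "D", "E"], "B3": ["A", "B", "C"]}
assignments = {"B1": "A", "B2": "E", "B3": "C"}

after_failure = reassign(assignments, replicas, failed="A")
print(after_failure)
```

Because each block has more than one replica, losing node A costs a rerun of one task but no data.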
  • 13. Presentation Topics: An Introduction to Hadoop and Cloudera
      § The Motivation for Hadoop
      § ‘Core Hadoop’: HDFS and MapReduce
      § CDH and the Hadoop Ecosystem
      § Data Storage: HBase
      § Data Integration: Flume and Sqoop
      § Data Processing: Spark
      § Data Analysis: Hive, Pig, and Impala
      § Data Exploration: Cloudera Search
      § Managing Everything: Cloudera Manager
      § Conclusion
  • 14. HDFS Basic Concepts
      § The Hadoop Distributed File System (HDFS) is a filesystem written in Java
      § Sits on top of a native filesystem
      § Provides storage for massive amounts of data
        – Scalable
        – Fault tolerant
        – Supports efficient processing with MapReduce, Spark, and other tools
      [Diagram labels: Hadoop Cluster, HDFS]
  • 15. How Files are Stored (1)
      § Data files are split into blocks and distributed to data nodes
      [Diagram: Very Large Data File split into Block 1, Block 2, Block 3]
  • 16. How Files are Stored (2)
      § Data files are split into blocks and distributed to data nodes
      [Diagram: Very Large Data File split into Block 1, Block 2, Block 3; three copies of Block 1 shown]
  • 17. How Files are Stored (3)
      § Data files are split into blocks and distributed to data nodes
      § Each block is replicated on multiple nodes (default 3x)
      [Diagram: Very Large Data File split into Block 1, Block 2, Block 3; three copies of each block shown]
  • 18. How Files are Stored (4)
      § Data files are split into blocks and distributed to data nodes
      § Each block is replicated on multiple nodes (default 3x)
      § NameNode stores metadata
        – Metadata: information about files and blocks
      [Diagram: NameNode; Very Large Data File split into Block 1, Block 2, Block 3; three copies of each block shown]
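The splitting and replication described on the "How Files are Stored" slides can be modelled with a toy Python sketch. The tiny block size, the node names, and the round-robin placement are assumptions chosen for readability; HDFS uses much larger blocks and its own placement policy, which the slides do not describe.

```python
BLOCK_SIZE = 128   # bytes, for illustration only; real HDFS blocks are far larger
REPLICATION = 3    # the default replication factor named on the slide


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    A toy policy for this example, not the strategy HDFS actually uses.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for b in range(num_blocks):
        placement["Block %d" % (b + 1)] = [
            nodes[(b + r) % len(nodes)] for r in range(replication)
        ]
    return placement


data = bytes(300)                       # a stand-in for a very large data file
blocks = split_into_blocks(data)
placement = place_replicas(len(blocks), ["A", "B", "C", "D"])
print(placement)
```

The `placement` dictionary plays the role of the NameNode's metadata: it records where each block lives, while the blocks themselves sit on the data nodes.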
  • 19. Example: Storing and Retrieving Files (1)
      NameNode metadata:
        /logs/031512.log: B1,B2,B3
        /logs/041213.log: B4,B5
        B1: A,B,D
        B2: B,D,E
        B3: A,B,C
        B4: A,B,E
        B5: C,E,D
      Client request: /logs/041213.log? Response: B4,B5
      [Diagram: blocks 1-5 of the two log files stored across Node A, Node B, Node C, Node D, Node E]
  • 20. Example: Storing and Retrieving Files (2)
      NameNode metadata:
        /logs/031512.log: B1,B2,B3
        /logs/041213.log: B4,B5
        B1: A,B,D
        B2: B,D,E
        B3: A,B,C
        B4: A,B,E
        B5: C,E,D
      Client request: /logs/041213.log? Response: B4,B5
      [Diagram: blocks 1-5 of the two log files stored across Node A, Node B, Node C, Node D, Node E]
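The lookup in the example slides can be replayed as a small Python sketch. The two dictionaries copy the metadata shown on the slide; the function `locate` and its `down` parameter are hypothetical names introduced here, and the real NameNode protocol is more involved.

```python
# Metadata copied from the slide's NameNode example.
FILE_TO_BLOCKS = {
    "/logs/031512.log": ["B1", "B2", "B3"],
    "/logs/041213.log": ["B4", "B5"],
}
BLOCK_TO_NODES = {
    "B1": ["A", "B", "D"],
    "B2": ["B", "D", "E"],
    "B3": ["A", "B", "C"],
    "B4": ["A", "B", "E"],
    "B5": ["C", "E", "D"],
}


def locate(path, down=frozenset()):
    """Return [(block, live_nodes)] for a file, skipping nodes known to be down."""
    plan = []
    for block in FILE_TO_BLOCKS[path]:
        live = [n for n in BLOCK_TO_NODES[block] if n not in down]
        if not live:
            raise IOError("no live replica of " + block)
        plan.append((block, live))
    return plan


plan = locate("/logs/041213.log")
plan_without_e = locate("/logs/041213.log", down={"E"})
print(plan)
print(plan_without_e)
```

With every node up, the client learns that B4 is on A, B, and E and B5 is on C, E, and D. With node E down, both blocks are still readable from the remaining replicas.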
  • 21. Important Notes About HDFS
      § HDFS performs best with a modest number of large files
        – Millions, rather than billions, of files
        – Each file typically 100MB or more
      § Files in HDFS are “write once”
        – Files can be replaced but not changed
  • 22. MapReduce
      § The Mapper
        – Each Map task (typically) operates on a single HDFS block
        – Map tasks (usually) run on the node where the block is stored
      § Shuffle and Sort
        – Sorts and consolidates intermediate data from all mappers
        – Happens after all Map tasks are complete and before Reduce tasks start
      § The Reducer
        – Operates on shuffled/sorted intermediate data (Map task output)
        – Produces final output
      [Diagram labels: Map, Shuffle and Sort, Reduce]
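The three phases on the MapReduce slide can be shown with a single-process word count in plain Python. This is a sketch of the programming model only: it does not use the Hadoop API, the input strings stand in for HDFS blocks, and the function names are chosen for this example.

```python
from collections import defaultdict


def mapper(block):
    """Map phase: emit a (word, 1) pair for every word in one input block."""
    for word in block.split():
        yield word.lower(), 1


def shuffle_and_sort(pairs):
    """Group all intermediate values by key, then order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Reduce phase: collapse the values for one key into a final count."""
    return key, sum(values)


# Each string stands in for one block; on a cluster, each would get its own Map task.
input_blocks = ["the cat sat", "the dog sat down"]

intermediate = [pair for block in input_blocks for pair in mapper(block)]
result = [reducer(key, values) for key, values in shuffle_and_sort(intermediate)]
print(result)
```

The shuffle step only starts once every mapper has finished, which matches the ordering stated on the slide.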
  • 23. Presentation Topics: An Introduction to Hadoop and Cloudera
      § The Motivation for Hadoop
      § ‘Core Hadoop’: HDFS and MapReduce
      § CDH and the Hadoop Ecosystem
      § Data Storage: HBase
      § Data Integration: Flume and Sqoop
      § Data Processing: Spark
      § Data Analysis: Hive, Pig, and Impala
      § Data Exploration: Cloudera Search
      § Managing Everything: Cloudera Manager
      § Conclusion
  • 24. The Hadoop Ecosystem (1)
      [Diagram of CDH]
        Hadoop Ecosystem: Sqoop, Impala, Hive, Pig, HBase, Flume, Oozie, …
        Hadoop Core Components: MapReduce, Hadoop Distributed File System
  • 25. The Hadoop Ecosystem (2)
      § CDH includes many Hadoop Ecosystem components
      § Following are more details on some of the key components
      [Diagram] Hadoop Ecosystem: Sqoop, Impala, Hive, Pig, HBase, Flume, Oozie, …
  • 26. CDH
      § CDH (Cloudera’s Distribution, including Apache Hadoop)
        – 100% open source, enterprise-ready distribution of Hadoop and related projects
        – The most complete, tested, and widely-deployed distribution of Hadoop
        – Integrates all key Hadoop ecosystem projects
  • 27. Presentation Topics: An Introduction to Hadoop and Cloudera
      § The Motivation for Hadoop
      § ‘Core Hadoop’: HDFS and MapReduce
      § CDH and the Hadoop Ecosystem
      § Data Storage: HBase
      § Data Integration: Flume and Sqoop
      § Data Processing: Spark
      § Data Analysis: Hive, Pig, and Impala
      § Data Exploration: Cloudera Search
      § Managing Everything: Cloudera Manager
      § Conclusion
  • 28. HBase: The Hadoop Database § HBase: database layered on top of HDFS – Provides interactive access to data § Stores massive amounts of data – Petabytes+ § High throughput – Thousands of writes per second (per node) § Handles sparse data well – No wasted space for a row with empty columns § Limited access model – Optimized for lookup of a row by key rather than full queries – No transactions: single row operations only
  • 29. HBase vs RDBMS § Transactions – RDBMS: Yes; HBase: Single row only § Query language – RDBMS: SQL; HBase: get/put/scan (or use Hive or Impala) § Indexes – RDBMS: Yes; HBase: Row-key only § Max data size – RDBMS: TBs; HBase: PBs § Read/write throughput (queries per second) – RDBMS: Thousands; HBase: Millions
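The get/put/scan access model and the "no wasted space for empty columns" point can be illustrated with a toy in-memory store. This is a sketch of the idea only, not the HBase client API: the class `ToyRowStore`, the row keys, and the column names are all hypothetical.

```python
import bisect


class ToyRowStore:
    """Toy model of a row-key-indexed, sparse store (not the HBase API)."""

    def __init__(self):
        self._keys = []   # row keys kept in sorted order: the only index
        self._rows = {}   # row key -> {column name: value}

    def put(self, row_key, column, value):
        # Only columns that are actually written take up space (sparse rows).
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        # Lookup of a single row by key.
        return dict(self._rows.get(row_key, {}))

    def scan(self, start_key, stop_key):
        # Range scan over sorted row keys: start inclusive, stop exclusive.
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, dict(self._rows[key])


store = ToyRowStore()
store.put("user#0002", "info:name", "Bea")
store.put("user#0001", "info:name", "Al")
store.put("user#0001", "info:email", "al@example.com")
store.put("video#0001", "info:title", "Intro")

# "$" sorts immediately after "#", so this range covers every "user#" row key.
user_rows = list(store.scan("user#", "user$"))
print(user_rows)
```

Because rows are reachable only through the sorted row key, anything resembling a secondary index or a multi-row transaction has to be built by the application, which is the trade-off the comparison above describes.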
  • 30. When To Use HBase § Use plain HDFS if… – You only append to your dataset (no random write) – You usually read the whole dataset (no random read) § Use HBase if… – You need random write and/or read – You do thousands of operations per second on TB+ of data § Use an RDBMS if… – Your data fits on one big node – You need full transaction support – You need real-time query capabilities
  • 31. Presentation Topics: An Introduction to Hadoop and Cloudera § The Motivation for Hadoop § ‘Core Hadoop’: HDFS and MapReduce § CDH and the Hadoop Ecosystem § Data Storage: HBase § Data Integration: Flume and Sqoop § Data Processing: Spark § Data Analysis: Hive, Pig, and Impala § Data Exploration: Cloudera Search § Managing Everything: Cloudera Manager § Conclusion
  • 32. Flume: Real-time Data Import § What is Flume? – A service to move large amounts of data in real time – Example: storing log files in HDFS § Flume is – Distributed – Reliable and available – Horizontally scalable – Extensible
  • 33. Flume: High-Level Overview [Diagram: agents pass data through further agent(s), e.g. to encrypt and compress, and on to HDFS] § Collect data as it is produced – Files, syslogs, stdout or custom source § Pre-process data before storing – e.g., transform, scrub, enrich § Process in place – e.g., encrypt, compress § Write in parallel – Scalable throughput § Store in any format – Text, compressed, binary, or custom sink
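The collect, pre-process, and store steps above can be sketched as a tiny source-to-sink pipeline. This is a single-process illustration under stated assumptions, not Flume's configuration format or API: the function names, the `password=` scrubbing rule, and the in-memory buffer standing in for a file in HDFS are all hypothetical.

```python
import gzip
import io
import queue


def source(lines, channel):
    """Source: put each event on the channel as it is produced."""
    for line in lines:
        channel.put(line)


def scrub(event):
    """Pre-process step (hypothetical rule): mask anything after 'password='."""
    head, sep, _ = event.partition("password=")
    return head + sep + "***" if sep else event


def sink(channel, out):
    """Sink: drain the channel, scrub each event, and store it gzip-compressed."""
    with gzip.GzipFile(fileobj=out, mode="wb") as gz:
        while not channel.empty():
            gz.write((scrub(channel.get()) + "\n").encode("utf-8"))


channel = queue.Queue()          # buffers events between source and sink
source(["login user=al password=hunter2", "GET /index.html 200"], channel)

stored = io.BytesIO()            # stands in for a file in HDFS
sink(channel, stored)

restored = gzip.decompress(stored.getvalue()).decode("utf-8").splitlines()
print(restored)
```

The channel decouples the rate at which events arrive from the rate at which they are written, which is what lets several such pipelines write in parallel.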
  • 34. Sqoop: Exchanging Data With RDBMSs § Sqoop: SQL to Hadoop – Transfers data between RDBMS and HDFS – Uses a command-line tool or application connector – Allows incremental imports – Supports virtually all RDBMSs which speak JDBC – Custom connectors available for some RDBMSs for increased speed
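The idea behind an incremental import is to remember the highest value already transferred for some monotonically increasing column and fetch only newer rows next time. The sketch below shows that logic with Python's built-in `sqlite3` module; it is not Sqoop itself, and the function `incremental_import`, the `orders` table, and its columns are assumptions made up for the example.

```python
import sqlite3


def incremental_import(conn, table, check_column, last_value):
    """Fetch rows whose check column exceeds last_value; return them with the new high-water mark."""
    # Table and column names cannot be bound as SQL parameters, so they are
    # interpolated; in this sketch they come from trusted code, not user input.
    query = f"SELECT * FROM {table} WHERE {check_column} > ? ORDER BY {check_column}"
    rows = conn.execute(query, (last_value,)).fetchall()
    new_last = rows[-1][check_column] if rows else last_value
    return rows, new_last


conn = sqlite3.connect(":memory:")   # stands in for the source RDBMS
conn.row_factory = sqlite3.Row       # allows access to columns by name
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])

# First run imports everything above the initial mark of 0.
first_batch, last_id = incremental_import(conn, "orders", "id", 0)

# A new row arrives; the second run picks up only that row.
conn.execute("INSERT INTO orders VALUES (?, ?)", (3, 7.25))
second_batch, last_id = incremental_import(conn, "orders", "id", last_id)
print([tuple(row) for row in second_batch], last_id)
```

This approach only captures appended rows; updates to rows that were already imported would need a different check column, such as a last-modified timestamp.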
  • 35. Data Center Integration [Diagram labels: File Server, Relational Database (OLTP), Data Warehouse (OLAP), Web/App Servers, Hadoop Cluster; transfer tools: Flume, Sqoop, hadoop fs]
  • 36. Presentation Topics: An Introduction to Hadoop and Cloudera § The Motivation for Hadoop § ‘Core Hadoop’: HDFS and MapReduce § CDH and the Hadoop Ecosystem § Data Storage: HBase § Data Integration: Flume and Sqoop § Data Processing: Spark § Data Analysis: Hive, Pig, and Impala § Data Exploration: Cloudera Search § Managing Everything: Cloudera Manager § Conclusion
  • 37. Apache Spark § Apache Spark is a fast, general engine for large-scale data processing on a cluster § Originally developed at AMPLab at UC Berkeley § Open source Apache project § Provides several benefits over MapReduce – Faster – Better suited for iterative algorithms – Can hold intermediate data in RAM, resulting in much better performance – Easier API – Supports Python, Scala, Java – Supports real-time streaming data processing
  • 38. Spark vs Hadoop MapReduce § MapReduce – Widely used, huge investment already made – Supports and supported by many complementary tools – Mature, well-tested § Spark – Flexible – Elegant – Fast – Supports real-time streaming data processing § Over time Spark will supplant MapReduce as the general processing framework used by most organizations
  • 39. Presentation Topics: An Introduction to Hadoop and Cloudera § The Motivation for Hadoop § ‘Core Hadoop’: HDFS and MapReduce § CDH and the Hadoop Ecosystem § Data Storage: HBase § Data Integration: Flume and Sqoop § Data Processing: Spark § Data Analysis: Hive, Pig, and Impala § Data Exploration: Cloudera Search § Managing Everything: Cloudera Manager § Conclusion
  • 40. Hive and Pig: High Level Data Languages § The motivation: MapReduce is powerful but hard to master § Even Spark requires a developer who can code in Scala or Python § A solution: Hive and Pig – Built on top of MapReduce – Currently being ported to run on top of Spark for better performance – Leverage existing skillsets – Data analysts who use SQL – Programmers who use scripting languages – Open source Apache projects – Hive initially developed at Facebook – Pig initially developed at Yahoo!
  • 41. Hive § What is Hive? – HiveQL: An SQL-like interface to Hadoop SELECT * FROM purchases WHERE price > 10000 ORDER BY storeid
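To see what the query on this slide returns, the same statement can be run against a small in-memory table with Python's built-in `sqlite3` module. This is only a stand-in for illustration: Hive would compile the query into jobs over data in HDFS, and the table columns and sample rows below are assumptions, not data from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # stand-in for a Hive table; not Hive itself
conn.execute(
    "CREATE TABLE purchases (itemid INTEGER, price REAL, storeid INTEGER, purchaserid INTEGER)"
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?)",
    [
        (1, 12000.0, 7, 100),
        (2, 500.0, 3, 101),
        (3, 45000.0, 2, 102),
    ],
)

# The query from the slide, unchanged.
big_purchases = conn.execute(
    "SELECT * FROM purchases WHERE price > 10000 ORDER BY storeid"
).fetchall()
print(big_purchases)
```

The point of HiveQL is that an analyst writes only the last statement; the filtering and sorting across the cluster are generated for them.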
  • 42. Pig § What is Pig? – Pig Latin: A dataflow language for transforming large data sets purchases = LOAD '/user/dave/purchases' AS (itemID, price, storeID, purchaserID); bigticket = FILTER purchases BY price > 10000; ...
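The dataflow style of Pig Latin, where each statement names a new relation derived from an earlier one, maps naturally onto chained Python generators. The sketch below mirrors the LOAD and FILTER steps from the slide; it is not Pig, and the helper names `load` and `filter_by` as well as the comma-separated sample data are assumptions for the example.

```python
import csv
import io

# Hypothetical contents of the purchases file, one record per line.
RAW = "1,12000,7,100\n2,500,3,101\n3,45000,2,102\n"
FIELDS = ("itemID", "price", "storeID", "purchaserID")


def load(text, fields):
    """LOAD ... AS (...): attach field names to each raw record."""
    for record in csv.reader(io.StringIO(text)):
        yield dict(zip(fields, record))


def filter_by(relation, predicate):
    """FILTER ... BY ...: keep only the rows that satisfy the predicate."""
    return (row for row in relation if predicate(row))


# Each name below corresponds to a relation in the Pig Latin script.
purchases = load(RAW, FIELDS)
bigticket = filter_by(purchases, lambda row: float(row["price"]) > 10000)

# Nothing is read until the result is consumed, much as a dataflow
# definition only describes work until output is requested.
bigticket_ids = [row["itemID"] for row in bigticket]
print(bigticket_ids)
```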
  • 43. Impala: High Performance Queries § High-performance SQL engine for vast amounts of data – Similar query language to HiveQL – 10 to 50+ times faster than Hive, Pig, or MapReduce – Effectively, provides ‘real time’ results § Impala runs on Hadoop clusters – Data stored in HDFS – Does not use MapReduce § Developed by Cloudera – 100% open source, released under the Apache software license
  • 44. Which to Choose? (1) § Choose the best solution for the given task – Mix and match as needed § MapReduce – Low-level approach offers flexibility, control, and performance – More time-consuming and error-prone to write – Choose when control and performance are most important § Pig, Hive, and Impala – Faster to write, test, and deploy than MapReduce – Better choice for most analysis and processing tasks
  • 45. Which to Choose? (2) § Use Impala when… – You have analysts familiar with SQL – You need near real-time responses to ad hoc queries – You have structured data with a defined schema § Use Hive or Pig when… – You need support for custom file types, or complex data types § Use Pig when… – You have developers experienced with writing scripts – Your data is unstructured/multi-structured § Use Hive when… – Your data is structured and you are performing long-running batch jobs
  • 46. Comparing Pig, Hive, and Impala § SQL-based query language – Pig: No; Hive: Yes; Impala: Yes § Schema – Pig: Optional; Hive: Required; Impala: Required § Supports user-defined functions – Pig: Yes; Hive: Yes; Impala: Yes § Extensible file format support – Pig: Yes; Hive: Yes; Impala: No § Query speed – Pig: Slow; Hive: Slow; Impala: Fast § Accessible via ODBC/JDBC – Pig: No; Hive: Yes; Impala: Yes
  • 47. Do These Replace an RDBMS? § Probably not if the RDBMS is used for its intended purpose § Relational databases are optimized for: – Relatively small amounts of data – Immediate results – In-place modification of data § Pig, Hive, and Impala are optimized for: – Large amounts of read-only data – Extensive scalability at low cost § Pig and Hive are better suited for batch processing – Impala and RDBMSs are better for interactive use
  • 48. Analysis Workflow Example [Diagram: a Hadoop cluster with Impala at the center of these activities] § Import transaction data from RDBMS § Sessionize web log data with Pig § Sentiment analysis on social media with Hive § Analyst using Impala shell for ad hoc queries § Analyst using Impala via BI tool § Generate nightly reports using Pig, Hive, or Impala
  • 49. Presentation Topics: An Introduction to Hadoop and Cloudera § The Motivation for Hadoop § ‘Core Hadoop’: HDFS and MapReduce § CDH and the Hadoop Ecosystem § Data Storage: HBase § Data Integration: Flume and Sqoop § Data Processing: Spark § Data Analysis: Hive, Pig, and Impala § Data Exploration: Cloudera Search § Managing Everything: Cloudera Manager § Conclusion
  • 50. Cloudera Search § Real-time, scalable indexing § Load any type of data § Text and faceted searching
  • 51. Cloudera Search Example: Twitter Feed Search [Screenshot: full text search; iterative search using facets]
  • 52. Presentation Topics: An Introduction to Hadoop and Cloudera § The Motivation for Hadoop § ‘Core Hadoop’: HDFS and MapReduce § CDH and the Hadoop Ecosystem § Data Storage: HBase § Data Integration: Flume and Sqoop § Data Processing: Spark § Data Analysis: Hive, Pig, and Impala § Data Exploration: Cloudera Search § Managing Everything: Cloudera Manager § Conclusion
  • 53. Reducing Complexity With Cloudera Manager § Putting Hadoop into production requires stringent uptimes § Clusters are made up of a large number of hosts – Each host runs multiple Hadoop services – Difficult to know the status of everything § Inevitable issues will arise with hardware and software § Keeping track of the cluster becomes an issue – Are all hosts healthy and working? – Am I using all of the best practices for the service? – Is there a performance issue for a host or service? – Is the cluster secure?
  • 54. What Is Cloudera Manager? § Cloudera Manager is a purpose-built application designed to make the administration of Hadoop simple and straightforward – Automates the installation of a Hadoop cluster – Quickly adds and configures new services on a cluster – Provides real-time monitoring of cluster activity – Produces reports of cluster usage – Manages users and groups who have access to the cluster – Integrates with your existing enterprise monitoring tools § Cloudera Manager Express Edition – Free § Cloudera Enterprise – Cloudera Manager plus support – Contact us for pricing
  • 55. Cloudera Manager Dashboard
  • 56. Health Status and Charting
  • 57. Presentation Topics: An Introduction to Hadoop and Cloudera § The Motivation for Hadoop § ‘Core Hadoop’: HDFS and MapReduce § CDH and the Hadoop Ecosystem § Data Storage: HBase § Data Integration: Flume and Sqoop § Data Processing: Spark § Data Analysis: Hive, Pig, and Impala § Data Exploration: Cloudera Search § Managing Everything: Cloudera Manager § Conclusion
  • 58. Conclusion § There are several more projects in CDH – CDH supports all the key projects you need § We haven’t even talked about security! – CDH includes Kerberos integration for authentication – Cloudera Enterprise provides all the security you need, whatever your industry – Recently achieved PCI certification § Download the QuickStart VM to get started in a single VM § Try Cloudera on a real cluster for free § All available at cloudera.com/live § Questions?