An Introduction to Hadoop and Cloudera: Nashville Cloudera User Group, 10/23/14

© Copyright 2010-2014 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
An Introduction to Hadoop and Cloudera
Nashville Cloudera User Group, 10/23/14
Ian Wrigley, Director, Educational Curriculum
ian@cloudera.com
@iwrigley
201405

Presentation Topics
An Introduction to Hadoop and Cloudera
§ The Motivation for Hadoop
§ ‘Core Hadoop’: HDFS and MapReduce
§ CDH and the Hadoop Ecosystem
§ Data Storage: HBase
§ Data Integration: Flume and Sqoop
§ Data Processing: Spark
§ Data Analysis: Hive, Pig, and Impala
§ Data Exploration: Cloudera Search
§ Managing Everything: Cloudera Manager
§ Conclusion

Traditional Large-Scale Computation
§ Traditionally, computation has been processor-bound
  – Relatively small amounts of data
  – Lots of complex processing
§ The early solution: bigger computers
  – Faster processor, more memory
  – But even this couldn’t keep up

Distributed Systems
§ The better solution: more computers
  – Distributed systems – use multiple machines for a single job
“In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, we didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.” – Grace Hopper
[Figure: Database; Hadoop Cluster]

Distributed Systems: Challenges
§ Challenges with distributed systems
  – Programming complexity
  – Keeping data and processes in sync
  – Finite bandwidth
  – Partial failures

Distributed Systems: The Data Bottleneck (1)
§ Traditionally, data is stored in a central location
§ Data is copied to processors at runtime
§ Fine for limited amounts of data

Distributed Systems: The Data Bottleneck (2)
§ Modern systems have much more data
  – terabytes+ a day
  – petabytes+ total
§ We need a new approach…

Hadoop
§ A radical new approach to distributed computing
  – Distribute data when the data is stored
  – Run computation where the data is stored

Hadoop: Very High-Level Overview
§ Data is split into “blocks” when loaded
§ Each task typically works on a single block
  – Many run in parallel
§ A master program manages tasks
[Figure: Master and Slave Nodes]

Core Hadoop Concepts
§ Applications are written in high-level code
§ Nodes talk to each other as little as possible
§ Data is distributed in advance
  – Bring the computation to the data
§ Data is replicated for increased availability and reliability
§ Hadoop is scalable and fault-tolerant

Scalability
§ Adding nodes adds capacity proportionally
§ Increasing load results in a graceful decline in performance
  – Not failure of the system
[Figure: Capacity plotted against Number of Nodes]

Fault Tolerance
§ Node failure is inevitable
§ What happens?
  – System continues to function
  – Master re-assigns tasks to a different node
  – Data replication = no loss of data
  – Nodes which recover rejoin the cluster automatically
“Failure is the defining difference between distributed and local programming, so you have to design distributed systems with the expectation of failure.” – Ken Arnold (CORBA designer)
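
The re-assignment step above can be sketched in a few lines of Python. This is a toy scheduler, not Hadoop code: the function name `run_job`, the task and node names, and the "try the next node in the list" rule are all assumptions made for illustration.

```python
def run_job(tasks, nodes, failed):
    """Assign each task to a node; if that node has failed, try the next one.

    `failed` is the set of nodes the (hypothetical) master knows to be down.
    """
    assignments = {}
    for i, task in enumerate(tasks):
        for attempt in range(len(nodes)):
            node = nodes[(i + attempt) % len(nodes)]
            if node not in failed:
                assignments[task] = node
                break
        else:
            # Only reached when every node in the cluster is down.
            raise RuntimeError("no healthy node available for " + task)
    return assignments


# Node B is down, so the task that would have gone to B moves to C.
assignments = run_job(["t0", "t1", "t2", "t3"], ["A", "B", "C"], failed={"B"})
```

The job still completes with one node missing; only the placement of the affected task changes.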


HDFS Basic Concepts
§ The Hadoop Distributed File System (HDFS) is a filesystem written in Java
§ Sits on top of a native filesystem
§ Provides storage for massive amounts of data
  – Scalable
  – Fault tolerant
  – Supports efficient processing with MapReduce, Spark, and other tools
[Figure: HDFS within a Hadoop Cluster]

How Files are Stored (1)
§ Data files are split into blocks and distributed to data nodes
[Figure: a Very Large Data File split into Block 1, Block 2, and Block 3]

How Files are Stored (2)
§ Data files are split into blocks and distributed to data nodes
[Figure: a Very Large Data File split into Block 1, Block 2, and Block 3; three copies of Block 1 are shown]

How Files are Stored (3)
§ Data files are split into blocks and distributed to data nodes
§ Each block is replicated on multiple nodes (default 3x)
[Figure: a Very Large Data File split into Block 1, Block 2, and Block 3; three copies of each block are shown]

How Files are Stored (4)
§ Data files are split into blocks and distributed to data nodes
§ Each block is replicated on multiple nodes (default 3x)
§ NameNode stores metadata
[Figure: a Very Large Data File split into Block 1, Block 2, and Block 3; three copies of each block are shown; the NameNode holds Metadata: information about files and blocks]
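
The storage steps above can be sketched as a toy model in Python. This is not HDFS code: the 128-byte block size, the node names, and the round-robin placement are assumptions chosen to keep the example small (real HDFS uses far larger blocks and its own placement policy). Only the 3x replication factor comes from the slides.

```python
REPLICATION = 3  # default replication factor quoted on the slide


def split_into_blocks(data, block_size):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Map each block id to `replication` distinct nodes.

    Round-robin placement is a simplification made for this sketch.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + k) % len(nodes)] for k in range(replication)
        ]
    return placement


data = bytes(300)                      # 300-byte stand-in for a very large file
blocks = split_into_blocks(data, 128)  # block sizes: 128, 128, and 44 bytes
metadata = place_replicas(len(blocks), ["A", "B", "C", "D", "E"])
```

Here `metadata` plays the role of the NameNode's block map: it records where each block lives, while the block contents themselves sit on the data nodes.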

Example: Storing and Retrieving Files (1)
NameNode Metadata
/logs/031512.log: B1,B2,B3
/logs/041213.log: B4,B5
B1: A,B,D
B2: B,D,E
B3: A,B,C
B4: A,B,E
B5: C,E,D
[Figure: the blocks of /logs/031512.log and /logs/041213.log stored on Nodes A to E; a Client asks the NameNode for /logs/041213.log and receives B4,B5]

Example: Storing and Retrieving Files (2)
NameNode Metadata
/logs/031512.log: B1,B2,B3
/logs/041213.log: B4,B5
B1: A,B,D
B2: B,D,E
B3: A,B,C
B4: A,B,E
B5: C,E,D
[Figure: the blocks of /logs/031512.log and /logs/041213.log stored on Nodes A to E; a Client asks the NameNode for /logs/041213.log and receives B4,B5]
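
The lookup in this example can be sketched with two Python dictionaries that copy the metadata shown on the slide. The function `locate` and its `dead_nodes` parameter are hypothetical names introduced here, not part of any Hadoop API. The parameter is added only to show why replication lets a read survive a failed node.

```python
# Metadata copied from the slide: file -> blocks, block -> nodes holding a replica.
FILE_BLOCKS = {
    "/logs/031512.log": ["B1", "B2", "B3"],
    "/logs/041213.log": ["B4", "B5"],
}
BLOCK_LOCATIONS = {
    "B1": ["A", "B", "D"],
    "B2": ["B", "D", "E"],
    "B3": ["A", "B", "C"],
    "B4": ["A", "B", "E"],
    "B5": ["C", "E", "D"],
}


def locate(path, dead_nodes=()):
    """Return [(block, live nodes holding it)] for a file.

    This mimics the question a client puts to the NameNode before reading.
    """
    result = []
    for block in FILE_BLOCKS[path]:
        live = [n for n in BLOCK_LOCATIONS[block] if n not in dead_nodes]
        if not live:
            raise RuntimeError(f"all replicas of {block} are unavailable")
        result.append((block, live))
    return result


plan = locate("/logs/041213.log")
plan_without_e = locate("/logs/041213.log", dead_nodes={"E"})
```

With Node E down, both blocks of the file are still readable from two other nodes each.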

Important Notes About HDFS
§ HDFS performs best with a modest number of large files
  – Millions, rather than billions, of files
  – Each file typically 100MB or more
§ Files in HDFS are “write once”
  – Files can be replaced but not changed

MapReduce
§ The Mapper
  – Each Map task (typically) operates on a single HDFS block
  – Map tasks (usually) run on the node where the block is stored
§ Shuffle and Sort
  – Sorts and consolidates intermediate data from all mappers
  – Happens after all Map tasks are complete and before Reduce tasks start
§ The Reducer
  – Operates on shuffled/sorted intermediate data (Map task output)
  – Produces final output
[Figure: Map, Shuffle and Sort, Reduce]
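
The three phases above can be sketched in plain Python with a word count. This is a single-process illustration of the data flow, not the Hadoop MapReduce API. The function names and the sample text are assumptions, and the Map tasks run one after another here rather than in parallel on separate nodes.

```python
from collections import defaultdict


def mapper(block):
    """Map: emit a (word, 1) pair for every word in one block of text."""
    for word in block.split():
        yield word.lower(), 1


def shuffle_and_sort(mapped_pairs):
    """Shuffle and sort: group all intermediate values by key, in key order."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Reduce: collapse the values for one key into the final output."""
    return key, sum(values)


text_blocks = ["the cat sat", "the dog sat", "the end"]  # stand-ins for HDFS blocks

# One Map call per block, then a single shuffle, then one Reduce call per key.
intermediate = [pair for block in text_blocks for pair in mapper(block)]
word_counts = [reducer(k, vs) for k, vs in shuffle_and_sort(intermediate)]
```

Each mapper sees only its own block, and the reducer sees every value for a given key together. That grouping is the job of the shuffle and sort phase.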


The Hadoop Ecosystem (1)
[Figure: CDH, comprising the Hadoop Core Components (Hadoop Distributed File System, MapReduce) and the Hadoop Ecosystem (Sqoop, Impala, Hive, Pig, HBase, Flume, Oozie, …)]

The Hadoop Ecosystem (2)
§ CDH includes many Hadoop Ecosystem components
§ Following are more details on some of the key components
[Figure: Hadoop Ecosystem (Sqoop, Impala, Hive, Pig, HBase, Flume, Oozie, …)]

CDH
§ CDH (Cloudera’s Distribution, including Apache Hadoop)
  – 100% open source, enterprise-ready distribution of Hadoop and related projects
  – The most complete, tested, and widely-deployed distribution of Hadoop
  – Integrates all key Hadoop ecosystem projects


HBase: The Hadoop Database
§ HBase: database layered on top of HDFS
  – Provides interactive access to data
§ Stores massive amounts of data
  – Petabytes+
§ High throughput
  – Thousands of writes per second (per node)
§ Handles sparse data well
  – No wasted space for a row with empty columns
§ Limited access model
  – Optimized for lookup of a row by key rather than full queries
  – No transactions: single row operations only
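
The access model above (lookup by row key, sparse rows) can be sketched with a small in-memory table. `ToyTable` is a hypothetical class written for this illustration and is not the HBase client API. The row keys, column names, and values are invented, and the `family:qualifier` column naming only imitates HBase's convention.

```python
import bisect


class ToyTable:
    """In-memory stand-in for a row-key-addressed, sparse table."""

    def __init__(self):
        self._rows = {}   # row key -> {column: value}; absent columns take no space
        self._keys = []   # row keys kept sorted so range scans are cheap

    def put(self, row_key, column, value):
        """Write one cell; the operation touches a single row only."""
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        """Look up a single row by its key."""
        return dict(self._rows.get(row_key, {}))

    def scan(self, start_key, stop_key):
        """Yield (key, row) for start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, dict(self._rows[key])


table = ToyTable()
table.put("user#001", "info:name", "Ada")
table.put("user#001", "info:email", "ada@example.com")
table.put("user#002", "info:name", "Grace")   # no email cell is stored for this row
table.put("visit#001", "info:page", "/home")

row = table.get("user#002")
users = list(table.scan("user#", "user$"))    # every key with the "user#" prefix
```

Nothing in this model answers "find all rows where name is Grace" without reading every row, which is the limitation the slide describes.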

HBase vs RDBMS
                         RDBMS       HBase
Transactions             Yes         Single row only
Query language           SQL         get/put/scan (or use Hive or Impala)
Indexes                  Yes         Row-key only
Max data size            TBs         PBs
Read/write throughput    Thousands   Millions
(queries per second)

When To Use HBase
§ Use plain HDFS if…
  – You only append to your dataset (no random write)
  – You usually read the whole dataset (no random read)
§ Use HBase if…
  – You need random write and/or read
  – You do thousands of operations per second on TB+ of data
§ Use an RDBMS if…
  – Your data fits on one big node
  – You need full transaction support
  – You need real-time query capabilities


Flume: Real-time Data Import
§ What is Flume?
  – A service to move large amounts of data in real time
  – Example: storing log files in HDFS
§ Flume is
  – Distributed
  – Reliable and available
  – Horizontally scalable
  – Extensible

Flume: High-Level Overview
• Collect data as it is produced
  • Files, syslogs, stdout or custom source
• Pre-process data before storing
  • e.g., transform, scrub, enrich
• Store in any format
  • Text, compressed, binary, or custom sink
• Process in place
  • e.g., encrypt, compress
• Write in parallel
  • Scalable throughput
[Figure: multiple Agents feeding further Agent(s) (encrypt, compress) that write to HDFS]

Sqoop: Exchanging Data With RDBMSs
§ Sqoop: SQL to Hadoop
  – Transfers data between RDBMS and HDFS
  – Uses a command-line tool or application connector
  – Allows incremental imports
  – Supports virtually all RDBMSs which speak JDBC
  – Custom connectors available for some RDBMSs for increased speed
[Figure: Sqoop between an RDBMS and HDFS]
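
The idea behind an incremental import can be sketched in Python: remember the highest value of a check column seen so far and fetch only rows beyond it on the next run. This is an illustration of the concept, not Sqoop itself. An in-memory SQLite database stands in for the RDBMS, a Python list stands in for HDFS, and the function, table, and column names are invented.

```python
import sqlite3


def incremental_import(conn, table, check_column, last_value):
    """Fetch only rows whose check column exceeds the last imported value.

    Returns (rows, new_last_value). Table and column names are interpolated
    directly, so this sketch assumes they come from a trusted source.
    """
    cursor = conn.execute(
        f"SELECT * FROM {table} WHERE {check_column} > ? ORDER BY {check_column}",
        (last_value,),
    )
    columns = [d[0] for d in cursor.description]
    rows = cursor.fetchall()
    idx = columns.index(check_column)
    new_last = rows[-1][idx] if rows else last_value
    return rows, new_last


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "disk"), (2, "cable")])

# First run imports everything; the second run picks up only the new row.
first_batch, last_id = incremental_import(conn, "orders", "id", 0)
conn.execute("INSERT INTO orders VALUES (3, 'rack')")
second_batch, last_id = incremental_import(conn, "orders", "id", last_id)
```

This approach only notices appended rows. Detecting updates to existing rows would need a different check column, such as a last-modified timestamp.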

Data Center Integration
[Figure: File Server, Relational Database (OLTP), Data Warehouse (OLAP), and Web/App Servers connected to the Hadoop Cluster via Flume, Sqoop, and hadoop fs]


Apache Spark
§ Apache Spark is a fast, general engine for large-scale data processing on a cluster
§ Originally developed at AMPLab at UC Berkeley
§ Open source Apache project
§ Provides several benefits over MapReduce
  – Faster
  – Better suited for iterative algorithms
  – Can hold intermediate data in RAM, resulting in much better performance
  – Easier API
  – Supports Python, Scala, Java
  – Supports real-time streaming data processing

Spark vs Hadoop MapReduce
§ MapReduce
  – Widely used, huge investment already made
  – Supports and supported by many complementary tools
  – Mature, well-tested
§ Spark
  – Flexible
  – Elegant
  – Fast
  – Supports real-time streaming data processing
§ Over time Spark will supplant MapReduce as the general processing framework used by most organizations


Hive and Pig: High Level Data Languages
§ The motivation: MapReduce is powerful but hard to master
§ Even Spark requires a developer who can code in Scala or Python
§ A solution: Hive and Pig
  – Built on top of MapReduce
  – Currently being ported to run on top of Spark for better performance
  – Leverage existing skillsets
    – Data analysts who use SQL
    – Programmers who use scripting languages
  – Open source Apache projects
    – Hive initially developed at Facebook
    – Pig initially developed at Yahoo!

Hive
§ What is Hive?
  – HiveQL: An SQL-like interface to Hadoop
SELECT * FROM purchases WHERE price > 10000 ORDER BY storeid
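
To see what the query above returns, the same statement can be run against a small local table. In this sketch SQLite stands in for Hive purely to show the query's semantics (on a cluster, Hive would run the query over data in HDFS). The column list and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; only `price` and `storeid` are named by the query itself.
conn.execute("CREATE TABLE purchases (itemid INTEGER, price REAL, storeid INTEGER)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [(1, 12000.0, 7), (2, 500.0, 3), (3, 25000.0, 2), (4, 10000.0, 1)],
)

# The statement from the slide, unchanged.
big_purchases = conn.execute(
    "SELECT * FROM purchases WHERE price > 10000 ORDER BY storeid"
).fetchall()
```

The row priced at exactly 10000 is excluded because the comparison is strict, and the remaining rows come back ordered by store.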

Pig
§ What is Pig?
  – Pig Latin: A dataflow language for transforming large data sets
purchases = LOAD '/user/dave/purchases' AS (itemID, price, storeID, purchaserID);
bigticket = FILTER purchases BY price > 10000;
...
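
The two dataflow steps shown above (load records, then filter them) can be mirrored in Python to make the semantics concrete. This is an analogy, not Pig: the helper names `load` and `filter_by` are made up, the sample rows are invented, and comma-separated text is an assumption chosen for readability.

```python
import csv
import io

# Hypothetical sample rows, in the field order the Pig script declares.
RAW = "101,12000,7,900\n102,500,3,901\n103,25000,2,902\n"


def load(text):
    """LOAD ... AS: parse delimited text into records with named fields."""
    fields = ["itemID", "price", "storeID", "purchaserID"]
    for row in csv.reader(io.StringIO(text)):
        yield dict(zip(fields, (int(v) for v in row)))


def filter_by(records, predicate):
    """FILTER ... BY: keep only the records that satisfy the predicate."""
    return (r for r in records if predicate(r))


purchases = load(RAW)
bigticket = list(filter_by(purchases, lambda r: r["price"] > 10000))
```

Each step consumes one relation and produces another, which is what makes the script a dataflow rather than a single declarative query.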

Impala: High Performance Queries
§ High-performance SQL engine for vast amounts of data
  – Similar query language to HiveQL
  – 10 to 50+ times faster than Hive, Pig, or MapReduce
  – Effectively, provides ‘real time’ results
§ Impala runs on Hadoop clusters
  – Data stored in HDFS
  – Does not use MapReduce
§ Developed by Cloudera
  – 100% open source, released under the Apache software license

Which to Choose? (1)
§ Choose the best solution for the given task
  – Mix and match as needed
§ MapReduce
  – Low-level approach offers flexibility, control, and performance
  – More time-consuming and error-prone to write
  – Choose when control and performance are most important
§ Pig, Hive, and Impala
  – Faster to write, test, and deploy than MapReduce
  – Better choice for most analysis and processing tasks

Which to Choose? (2)
§ Use Impala when…
  – You have analysts familiar with SQL
  – You need near real-time responses to ad hoc queries
  – You have structured data with a defined schema
§ Use Hive or Pig when…
  – You need support for custom file types, or complex data types
§ Use Pig when…
  – You have developers experienced with writing scripts
  – Your data is unstructured/multi-structured
§ Use Hive when…
  – Your data is structured and you are performing long-running, batch jobs

Comparing Pig, Hive, and Impala
Description of Feature            Pig        Hive       Impala
SQL-based query language          No         Yes        Yes
Schema                            Optional   Required   Required
Supports user-defined functions   Yes        Yes        Yes
Extensible file format support    Yes        Yes        No
Query speed                       Slow       Slow       Fast
Accessible via ODBC/JDBC          No         Yes        Yes

Do These Replace an RDBMS?
§ Probably not if the RDBMS is used for its intended purpose
§ Relational databases are optimized for:
  – Relatively small amounts of data
  – Immediate results
  – In-place modification of data
§ Pig, Hive, and Impala are optimized for:
  – Large amounts of read-only data
  – Extensive scalability at low cost
§ Pig and Hive are better suited for batch processing
  – Impala and RDBMSs are better for interactive use

Analysis Workflow Example
[Figure: a Hadoop Cluster with Impala, surrounded by these activities]
• Import Transaction Data from RDBMS
• Sessionize Web Log Data with Pig
• Sentiment Analysis on Social Media with Hive
• Analyst using Impala shell for ad hoc queries
• Analyst using Impala via BI tool
• Generate Nightly Reports using Pig, Hive, or Impala


Cloudera Search
§ Real-time, scalable indexing
§ Load any type of data
§ Text and faceted searching


Reducing Complexity With Cloudera Manager
§ Putting Hadoop into production requires stringent uptimes
§ Clusters are made up of a large number of hosts
  – Each host runs multiple Hadoop services
  – Difficult to know the status of everything
§ Inevitable issues will arise with hardware and software
§ Keeping track of the cluster becomes an issue
  – Are all hosts healthy and working?
  – Am I using all of the best practices for the service?
  – Is there a performance issue for a host or service?
  – Is the cluster secure?

What Is Cloudera Manager?
§ Cloudera Manager is a purpose-built application designed to make the administration of Hadoop simple and straightforward
  – Automates the installation of a Hadoop cluster
  – Quickly adds and configures new services on a cluster
  – Provides real-time monitoring of cluster activity
  – Produces reports of cluster usage
  – Manages users and groups who have access to the cluster
  – Integrates with your existing enterprise monitoring tools
§ Cloudera Manager Express Edition
  – Free
§ Cloudera Enterprise
  – Cloudera Manager plus support
  – Contact us for pricing


Conclusion
§ There are several more projects in CDH
  – CDH supports all the key projects you need
§ We haven’t even talked about security!
  – CDH includes Kerberos integration for authentication
  – Cloudera Enterprise provides all the security you need, whatever your industry
  – Recently achieved PCI certification
§ Download the QuickStart VM to get started in a single VM
§ Try Cloudera on a real cluster for free
§ All available at cloudera.com/live
§ Questions?
