1. Hadoop – Taming Big Data
Jax ArcSig, June 2012
Ovidiu Dimulescu
2. About @odimulescu
• Working on the Web since 1997
• Likes stuff well done
• Into engineering cultures and all-around automation
• Speaker at local user groups
• Organizer for the local Mobile User Group, jaxmug.com
4. What is Hadoop?
• Apache Hadoop is an open-source Java software framework for running data-intensive applications on large clusters of commodity hardware
• Created by Doug Cutting (creator of Lucene & Nutch)
• Named after Doug's son's toy elephant
5. What is it solving, and how?
• Processing diverse large datasets in practical time at low cost
• Consolidates data in a distributed file system
• Moves computation to data rather than data to computation
• Simpler programming model
6. Why does it matter?
• Volume, Velocity, Variety and Value
• Datasets do not fit on local HDDs, let alone in RAM
• Data grows at a tremendous pace
• Data is heterogeneous
• Scaling up is expensive (licensing, CPUs, disks, interconnects, etc.)
• Scaling up has a ceiling (physical, technical, etc.)
7. Why does it matter? Data types
• Complex Data (~80%): images, video, logs, documents, call records, sensor data, mail archives
• Structured Data (~20%): user profiles, complex structured records, CRM, HR records
* Chart source: IDC White Paper
8. Why does it matter?
• Need to process a 10TB dataset
• Assume a sustained transfer rate of 75MB/s
• On 1 node, scanning the data takes ~2 days
• On a 10-node cluster, scanning the data takes ~5 hrs
• Low $/TB for commodity drives
• Low-end servers are multicore-capable
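A quick back-of-the-envelope check of these figures (a sketch using only the slide's assumptions of 10TB at 75MB/s; it ignores seeks, replication, and job overhead, which is why the slide's rounded numbers are a bit higher):

```python
DATASET_BYTES = 10 * 10**12   # 10 TB
TRANSFER_RATE = 75 * 10**6    # 75 MB/s sustained

def scan_hours(nodes: int) -> float:
    """Hours to scan the full dataset with `nodes` disks reading in parallel."""
    return DATASET_BYTES / (TRANSFER_RATE * nodes) / 3600

print(round(scan_hours(1), 1))   # ~37 hours, i.e. on the order of 2 days
print(round(scan_hours(10), 1))  # ~3.7 hours, in the ballpark of the quoted 5 hrs
```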
9. Use Cases
• ETL - Extract, Transform, Load
• Recommendation Engines
• Customer Churn Analysis
• Ad Targeting
• Data "sandbox"
10. Use Cases - Typical ETL
Diagram: the Live DB and Logs feed ETL 1, which loads the Data Warehouse; ETL 2 loads the Reporting DB, which serves the BI Applications.
11. Use Cases - Hadoop ETL
Diagram: the Live DB and Logs are loaded into Hadoop, which feeds the Data Warehouse and the Reporting DB serving the BI Applications.
12. Use Cases – Analysis methods
• Pattern recognition
• Index building
• Text mining
• Collaborative filtering
• Prediction models
• Sentiment analysis
• Graph creation and traversal
15. Why use Hadoop?
• Makes it practical to do things that previously were not:
ü Shorter execution time
ü Costs less
ü Simpler programming model
• Open system with greater flexibility
• Large and growing ecosystem
16. Hadoop – Silver bullet?
• Not a database replacement
• Not a data warehouse (it complements one)
• Not for interactive reporting
• Not a general-purpose storage mechanism
• Not for problems that are not parallelizable in a shared-nothing fashion
17. Architecture – Design Axioms
• System Shall Manage and Heal Itself
• Performance Shall Scale Linearly
• Compute Should Move to Data
• Simple Core, Modular and Extensible
18. Architecture – Core Components
HDFS - Distributed filesystem designed for low-cost storage and high-bandwidth access across the cluster.
MapReduce - Programming model for processing and generating large data sets.
21. HDFS - Design
• Based on Google's GFS
• Files are stored as blocks (64MB default size)
• Configurable data replication (3x, rack-aware)
• Fault-tolerant; expects hardware failures
• HUGE files; expects streaming access, not low latency
• Mostly WORM (write once, read many)
22. HDFS - Architecture
Diagram: the client asks the NameNode (NN) for a file; the NN returns the DataNodes (DNs) that host it; the client then asks those DNs (DataNode 1..N) directly for the data.
NameNode - Master
• Filesystem metadata
• Controls read/write access to files
• Manages block replication
• Applies the transaction log on startup
DataNode - Slaves
• Reads/writes blocks to/from clients
• Replicates blocks at the master's request
23. HDFS – Fault tolerance
• DataNode
§ Uses CRC checksums to avoid corruption
§ Data is replicated on other nodes (3x)
• NameNode
§ Checkpoint NameNode
§ Backup NameNode
§ Failover is manual
24. MapReduce - Design
• Based on Google's MapReduce paper
• Borrows from functional programming
• Simpler programming model:
§ map (in_key, in_value) -> (out_key, intermediate_value) list
§ reduce (out_key, intermediate_value list) -> out_value list
• No user-level synchronization or coordination
Input -> Map -> Reduce -> Output
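The signatures above can be illustrated with the canonical word-count example. This is a plain-Python sketch of the semantics only, not Hadoop API code; the sort-and-group step between the phases is what the framework's shuffle does for you:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    # map(in_key, in_value) -> (out_key, intermediate_value) list
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # reduce(out_key, intermediate_value list) -> out_value list
    return [(key, sum(values))]

def run_job(records):
    intermediate = [kv for k, v in records for kv in map_fn(k, v)]
    intermediate.sort(key=itemgetter(0))             # the "shuffle/sort"
    output = []
    for k, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(k, [v for _, v in group]))
    return output

print(run_job([(1, "big data big clusters")]))
# [('big', 2), ('clusters', 1), ('data', 1)]
```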
25. MapReduce - Architecture
Diagram: the client launches a job through the API to the JobTracker (JT), supplying the configuration, mapper, reducer, input, and output; the JT hands tasks to TaskTracker 1..N.
JobTracker - Master
• Accepts MR jobs submitted by clients
• Assigns Map and Reduce tasks to TaskTrackers, data-locality aware
• Monitors tasks and TaskTracker status, re-executes tasks upon failure
• Speculative execution
TaskTracker - Slaves
• Run Map and Reduce tasks received from the JobTracker
• Manage storage and transmission of intermediate output
26. Hadoop - Core Architecture
Diagram: the JobTracker (jobs API) coordinates TaskTracker 1..N, each co-located with DataNode 1..N, all backed by the NameNode (HDFS). Together they act like a mini OS:
• File system
• Scheduler
27. MapReduce – Head First Style
http://www.slideshare.net/esaliya/mapreduce-in-simple-terms
28. MapReduce – Mapper Types
One-to-One: map(k, v) = emit (k, transform(v))
Exploder: map(k, v) = foreach p in v: emit (k, p)
Filter: map(k, v) = if cond(v) then emit (k, v)
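The three patterns can be sketched as plain Python generators (the function names are illustrative, not Hadoop API):

```python
def one_to_one(k, v, transform):
    # One-to-One: emit exactly one transformed pair per input pair.
    yield (k, transform(v))

def exploder(k, v):
    # Exploder: emit one pair per element of the value.
    for p in v:
        yield (k, p)

def filter_map(k, v, cond):
    # Filter: emit the pair only when the condition holds.
    if cond(v):
        yield (k, v)

assert list(one_to_one("k", 2, lambda x: x * 10)) == [("k", 20)]
assert list(exploder("k", [1, 2])) == [("k", 1), ("k", 2)]
assert list(filter_map("k", 5, lambda x: x > 3)) == [("k", 5)]
assert list(filter_map("k", 1, lambda x: x > 3)) == []
```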
29. MapReduce – Reducer Types
Sum Reducer:
reduce(k, vals) =
  sum = 0
  foreach v in vals: sum += v
  emit (k, sum)
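The same reducer, written out as a small Python sketch of the pseudocode above:

```python
def sum_reducer(k, vals):
    # Accumulate all intermediate values for one key, emit a single total.
    total = 0
    for v in vals:
        total += v
    yield (k, total)

print(list(sum_reducer("errors", [1, 3, 2])))  # [('errors', 6)]
```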
32. MapReduce – Combiner Phase
• Optional
• Runs on mapper nodes after the map phase
• "Mini-reduce," only on local map output
• Used to save bandwidth before sending data to the full reducer
• The Reducer can be used as the Combiner if:
1. Output keys and values are the same types as the input keys and values
2. The operation is commutative and associative (SUM, MAX ok, but AVG not)
Diagram: http://developer.yahoo.com/hadoop/tutorial/module4.html
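The commutative/associative requirement is easy to demonstrate: applying SUM per mapper and again at the reducer gives the same answer as summing everything once, while averaging partial averages does not. A sketch, with two hypothetical per-mapper partitions of one key's values:

```python
def avg(vals):
    return sum(vals) / len(vals)

# Values for a single key, split across two mappers:
mapper1, mapper2 = [2, 4, 6], [10]

# SUM as combiner: the sum of partial sums equals the global sum.
assert sum([sum(mapper1), sum(mapper2)]) == sum(mapper1 + mapper2)  # 22 == 22

# AVG as combiner: the average of partial averages is NOT the true average.
print(avg([avg(mapper1), avg(mapper2)]))  # 7.0
print(avg(mapper1 + mapper2))             # 5.5
```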
33. Installation
1. Download & configure a single-node cluster: hadoop.apache.org/common/releases.html
2. Download a demo VM: Cloudera, Hortonworks
3. Use a hosted environment (Amazon's EMR, Azure)
34. Installation – Platform Notes
Production: Linux – official
Development: Linux, OS X, *nix, Windows via Cygwin
35. MapReduce – Client Languages
Java, any JVM language - native:
hadoop jar jar_path main_class input_path output_path
C++ - Pipes framework, socket IO:
hadoop pipes -input path_in -output path_out -program exec_program
Any language - Streaming, stdin/stdout:
hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out
Also: Pig Latin, Hive HQL, C via JNI
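With streaming, the mapper and reducer are just programs that read stdin and write tab-separated key/value lines to stdout. A minimal word-count pair in Python (a sketch of what map_prog and reduce_prog above could be; streaming delivers the reducer's input already sorted by key):

```python
import sys
from itertools import groupby

def mapper(lines):
    # Emit "word<TAB>1" for every word seen.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(lines):
    # Input arrives sorted by key, so groupby sees each key contiguously.
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    step = mapper if role == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```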
36. MapReduce – Client Anatomy
• Main Program (aka Driver): configures the Job, initiates the Job
• Input Location
• Mapper
• Combiner (optional)
• Reducer
• Output Location
44. Summary
Hadoop is an economical, scalable, distributed data-processing system which enables data:
ü Consolidation (structured or not)
ü Query flexibility (any language)
ü Agility (evolving schemas)
46. References
Hadoop at Yahoo!, by the Y! Developer Network
MapReduce in Simple Terms, by Saliya Ekanayake
Hadoop Architecture, by Philippe Julio
10 Hadoop-able Problems, by Cloudera
Hadoop, An Industry Perspective, by Amr Awadallah
Anatomy of a MapReduce Job Run, by Tom White
MapReduce Jobs in Hadoop