Intro to Hadoop Presentation at Carnegie Mellon - Silicon Valley

© 2010 – 2015 Cloudera, Inc. All Rights Reserved

Introduction to Apache Hadoop and its Ecosystem

Mark Grover | Intro to Cloud Computing, Carnegie Mellon SV

github.com/markgrover/hadoop-intro-fast

About Me

•  Committer on Apache Bigtop, committer and PPMC member on Apache Sentry (incubating).
•  Contributor to Apache Hadoop, Hive, Spark, Sqoop, Flume.
•  Software developer at Cloudera
•  @mark_grover
•  www.linkedin.com/in/grovermark

Co-author O’Reilly book

•  @hadooparchbook
•  hadooparchitecturebook.com
•  To be released early 2015

About the Presentation…

•  What’s ahead
    •  Fundamental Concepts
    •  HDFS: The Hadoop Distributed File System
    •  Data Processing with MapReduce
    •  Demo
    •  Conclusion + Q&A

Fundamental Concepts

Why the World Needs Hadoop

What’s the craze about Hadoop?

•  Volume
    •  More and more data being generated
    •  Machine-generated data increasing
•  Velocity
    •  Data coming in at higher speed
•  Variety
    •  Audio, video, images, log files, web pages, social network connections, etc.

We Need a System that Scales

•  Too much data for traditional tools
•  Two key problems
    •  How to reliably store this data at a reasonable cost
    •  How to process all the data we’ve stored

What is Apache Hadoop?

•  Scalable data storage and processing
    •  Distributed and fault-tolerant
    •  Runs on standard hardware
•  Two main components
    •  Storage: Hadoop Distributed File System (HDFS)
    •  Processing: MapReduce
•  Hadoop clusters are composed of computers called nodes
    •  Clusters range from a single node up to several thousand nodes

How Did Apache Hadoop Originate?

•  Heavily influenced by Google’s architecture
    •  Notably, the Google Filesystem and MapReduce papers
•  Other Web companies quickly saw the benefits
    •  Early adoption by Yahoo, Facebook and others

[Timeline figure, 2002 to 2006. Events shown: Nutch spun off from Lucene; Google publishes GFS paper; Google publishes MapReduce paper; Nutch rewritten for MapReduce; Hadoop becomes Lucene subproject]

Comparing Hadoop to Other Systems

•  Monolithic systems don’t scale
•  Modern high-performance computing (HPC) systems are distributed
    •  They spread computations across many machines in parallel
    •  Widely used for scientific applications
    •  Let’s examine how a typical HPC system works

Architecture of a Typical HPC System

[Figure: a Storage System and Compute Nodes, connected by a Fast Network]

Architecture of a Typical HPC System

[Figure: Storage System and Compute Nodes, connected by a Fast Network. Step 1: Copy input data]

Architecture of a Typical HPC System

[Figure: Storage System and Compute Nodes, connected by a Fast Network. Step 2: Process the data]

Architecture of a Typical HPC System

[Figure: Storage System and Compute Nodes, connected by a Fast Network. Step 3: Copy output data]

You Don’t Just Need Speed…

•  The problem is that we have way more data than code

$ du -ks code/
1,087
$ du -ks data/
854,632,947,314
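
To make the imbalance concrete, the short calculation below reuses the two `du -ks` figures from the slide (sizes in kilobytes). The 1 Gbit/s link speed is an assumption chosen only for illustration and does not come from the talk; it also treats a kilobyte as 1000 bytes for simplicity.

```python
# Sizes reported by `du -ks` on the slide, in kilobytes.
code_kb = 1087
data_kb = 854632947314

# How many times larger the data is than the code.
ratio = data_kb // code_kb

# Assumption for illustration only: a single 1 Gbit/s link,
# i.e. 125,000 kilobytes per second (treating 1 KB as 1000 bytes).
link_kb_per_s = 125000
days_to_copy_data = data_kb / link_kb_per_s / 86400

print("data is about %d times larger than code" % ratio)
print("copying the data over one such link: about %.0f days" % days_to_copy_data)
```

Under that assumed link speed, moving the data takes on the order of months, while moving the code is effectively instant.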

You Need Speed At Scale

[Figure: Storage System and Compute Nodes, with the connection between them labeled Bottleneck]

Hadoop Design Fundamental: Data Locality

•  This is a hallmark of Hadoop’s design
    •  Don’t bring the data to the computation
    •  Bring the computation to the data
•  Hadoop uses the same machines for storage and processing
    •  Significantly reduces need to transfer data across network
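
The idea of bringing the computation to the data can be sketched as a toy task-placement rule. This is not Hadoop's scheduler: `pick_node`, the node names, and the `block_locations` mapping are all hypothetical and exist only to illustrate the preference for a machine that already stores the block.

```python
def pick_node(block_id, block_locations, free_nodes):
    """Prefer a free node that already stores the block (a "local" task).

    Falls back to any free node, which would require a network transfer.
    Returns (node, is_local), or (None, False) if no node is free.
    """
    local = [n for n in block_locations.get(block_id, []) if n in free_nodes]
    if local:
        return local[0], True
    if free_nodes:
        return sorted(free_nodes)[0], False
    return None, False


# Hypothetical cluster: which nodes hold a copy of each block.
block_locations = {
    "block-1": ["node-a", "node-b"],
    "block-2": ["node-c"],
}

print(pick_node("block-1", block_locations, {"node-b", "node-d"}))
print(pick_node("block-2", block_locations, {"node-a", "node-d"}))
```

The first call finds a free node that holds the block, so no data moves; the second has to fall back to a remote node.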

Other Hadoop Design Fundamentals

•  Machine failure is unavoidable – embrace it
    •  Build reliability into the system
•  “More” is usually better than “faster”
    •  Throughput matters more than latency

The Hadoop Distributed Filesystem

HDFS

HDFS: Hadoop Distributed File System

•  Inspired by the Google File System
•  Reliable, low-cost storage for massive amounts of data
•  Similar to a UNIX filesystem in some ways
    •  Hierarchical
    •  UNIX-style paths (e.g., /sales/alice.txt)
    •  UNIX-style file ownership and permissions

HDFS: Hadoop Distributed File System

•  There are also some major deviations from UNIX filesystems
    •  Highly optimized for processing data with MapReduce
        •  Designed for sequential access to large files
        •  Cannot modify file content once written
    •  It’s actually a user-space Java process
        •  Accessed using special commands or APIs
    •  No concept of a current working directory

Copying Local Data To and From HDFS

•  Remember that HDFS is distinct from your local filesystem
    •  hadoop fs -put copies local files to HDFS
    •  hadoop fs -get fetches a local copy of a file from HDFS

[Figure: a Client Machine and a Hadoop Cluster, with the two commands below moving a file between them]

$ hadoop fs -put sales.txt /reports
$ hadoop fs -get /reports/sales.txt

HDFS Demo

•  I will now demonstrate the following
    1.  How to list the contents of a directory
    2.  How to create a directory in HDFS
    3.  How to copy a local file to HDFS
    4.  How to display the contents of a file in HDFS
    5.  How to remove a file from HDFS
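
For readers following along without the live demo, the five steps map onto standard `hadoop fs` subcommands. The sketch below only assembles and prints them; the paths are examples reused from the earlier copy slide, and the `run_demo` helper is hypothetical. Executing the commands for real assumes a configured `hadoop` client on the PATH.

```python
import subprocess

# The five demo steps as `hadoop fs` argument lists. Paths are examples only.
DEMO_COMMANDS = [
    ["hadoop", "fs", "-ls", "/"],                       # 1. list a directory
    ["hadoop", "fs", "-mkdir", "/reports"],             # 2. create a directory
    ["hadoop", "fs", "-put", "sales.txt", "/reports"],  # 3. copy a local file to HDFS
    ["hadoop", "fs", "-cat", "/reports/sales.txt"],     # 4. display a file's contents
    ["hadoop", "fs", "-rm", "/reports/sales.txt"],      # 5. remove a file
]


def run_demo(commands, dry_run=True):
    """Print each command; also execute it when dry_run is False."""
    for argv in commands:
        print("$ " + " ".join(argv))
        if not dry_run:
            subprocess.check_call(argv)


run_demo(DEMO_COMMANDS)  # dry run: prints the commands without running them
```

Calling `run_demo(DEMO_COMMANDS, dry_run=False)` on a machine with access to a cluster would run the same sequence for real.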

A Scalable Data Processing Framework

Data Processing with MapReduce

What is MapReduce?

•  MapReduce is a programming model
    •  It’s a way of processing data
    •  You can implement MapReduce in any language

Understanding Map and Reduce

•  You supply two functions to process data: Map and Reduce
    •  Map: typically used to transform, parse, or filter data
    •  Reduce: typically used to summarize results
•  The Map function always runs first
    •  The Reduce function runs afterwards, but is optional
•  Each piece is simple, but can be powerful when combined
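
The division of labor between the two functions can be sketched with a tiny in-memory driver. This is not Hadoop's API: `run_mapreduce`, `map_sale`, `reduce_total`, and the sample records are all hypothetical, chosen to show that Map runs first (parsing and filtering) and that Reduce is an optional summarizing step.

```python
from itertools import groupby


def run_mapreduce(records, map_fn, reduce_fn=None):
    """Tiny in-memory stand-in for a MapReduce job (not Hadoop's API).

    map_fn(record) yields (key, value) pairs.
    reduce_fn(key, values) returns one result per key. If it is omitted,
    the key-sorted map output is returned as-is (a map-only job).
    """
    pairs = [pair for record in records for pair in map_fn(record)]
    pairs.sort(key=lambda kv: kv[0])  # sort by key so equal keys are adjacent
    if reduce_fn is None:
        return pairs
    return [
        reduce_fn(key, [value for _, value in group])
        for key, group in groupby(pairs, key=lambda kv: kv[0])
    ]


def map_sale(record):
    """Map: parse "name amount" and filter out malformed records."""
    fields = record.split()
    if len(fields) == 2:
        yield fields[0], int(fields[1])


def reduce_total(name, amounts):
    """Reduce: summarize all values seen for one key."""
    return name, sum(amounts)


records = ["alice 30", "bob 12", "malformed", "alice 5"]  # made-up sample data
print(run_mapreduce(records, map_sale))                # map-only job
print(run_mapreduce(records, map_sale, reduce_total))  # map, then reduce
```

Neither function knows anything about parallelism or data movement; that separation is what lets a framework run them across many machines.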

MapReduce Benefits

•  Scalability
    •  Hadoop divides the processing job into individual tasks
    •  Tasks execute in parallel (independently) across cluster
•  Simplicity
    •  Processes one record at a time
•  Ease of use
    •  Hadoop provides job scheduling and other infrastructure
    •  Far simpler for developers than typical distributed computing

MapReduce in Hadoop

•  MapReduce processing in Hadoop is batch-oriented
•  A MapReduce job is broken down into smaller tasks
    •  Tasks run concurrently
    •  Each processes a small amount of overall input
•  MapReduce code for Hadoop is usually written in Java
    •  This uses Hadoop’s API directly
•  You can do basic MapReduce in other languages
    •  Using the Hadoop Streaming wrapper program
    •  Some advanced features require Java code

MapReduce Example in Python

•  The following example uses Python
    •  Via Hadoop Streaming
•  It processes log files and summarizes events by type
    •  I’ll explain both the data flow and the code

Job Input

•  Here’s the job input
•  Each map task gets a chunk of this data to process
    •  Typically corresponds to a single block in HDFS

2013-06-29 22:16:49.391 CDT INFO "This can wait"
2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"
2013-06-29 22:16:54.276 CDT WARN "This seems bad"
2013-06-29 22:16:57.471 CDT INFO "More blather"
2013-06-29 22:17:01.290 CDT WARN "Not looking good"
2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"
2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"
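
How input gets divided among map tasks can be approximated with a line-based splitter. This is only a sketch: `split_into_chunks` is a hypothetical helper, and the 160-byte limit is a deliberately tiny stand-in for an HDFS block, which in practice is far larger (commonly 64 MB or 128 MB). Hadoop's own input splitting works on byte ranges and handles records that straddle a block boundary, which this sketch sidesteps by keeping whole lines together.

```python
def split_into_chunks(lines, max_bytes):
    """Group whole lines into chunks of at most max_bytes each.

    A single line longer than max_bytes still gets a chunk of its own.
    """
    chunks, current, size = [], [], 0
    for line in lines:
        line_bytes = len(line.encode("utf-8")) + 1  # +1 for the newline
        if current and size + line_bytes > max_bytes:
            chunks.append(current)
            current, size = [], 0
        current.append(line)
        size += line_bytes
    if current:
        chunks.append(current)
    return chunks


log_lines = [
    '2013-06-29 22:16:49.391 CDT INFO "This can wait"',
    '2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"',
    '2013-06-29 22:16:54.276 CDT WARN "This seems bad"',
    '2013-06-29 22:16:57.471 CDT INFO "More blather"',
    '2013-06-29 22:17:01.290 CDT WARN "Not looking good"',
    '2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"',
    '2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"',
]

chunks = split_into_chunks(log_lines, max_bytes=160)
for task_number, chunk in enumerate(chunks):
    print("map task %d gets %d line(s)" % (task_number, len(chunk)))
```

Each chunk would be handed to a separate map task, and the tasks can then run independently of one another.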

Python Code for Map Function

#!/usr/bin/env python
import sys

# Define list of known log levels
levels = ['TRACE', 'DEBUG', 'INFO',
          'WARN', 'ERROR', 'FATAL']

# Read records from standard input
for line in sys.stdin:
    # Use whitespace to split into fields
    fields = line.split()
    # Extract "level" field and convert to uppercase for consistency
    level = fields[3].upper()
    # If it matches a known level, print it, a tab separator, and the
    # literal value 1 (since the level can only occur once per line)
    if level in levels:
        print("%s\t1" % level)

The “Shuffle and Sort”

•  Hadoop automatically merges, sorts, and groups map output
•  The result is passed as input to the reduce function
•  More on this later…

Map Output

INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1

Shuffle and Sort

Reduce Input

ERROR 1
INFO 1
INFO 1
INFO 1
INFO 1
WARN 1
WARN 1
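
The effect of this step on the map output above can be reproduced in a few lines. This is a single-process sketch only: `shuffle_and_sort` is a hypothetical helper, and on a real cluster Hadoop performs this step itself while moving data between machines.

```python
from itertools import groupby
from operator import itemgetter

# Map output from the slide, as (key, value) pairs.
map_output = [
    ("INFO", 1), ("INFO", 1), ("WARN", 1), ("INFO", 1),
    ("WARN", 1), ("INFO", 1), ("ERROR", 1),
]


def shuffle_and_sort(pairs):
    """Sort (key, value) pairs by key and group the values for each key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


for key, values in shuffle_and_sort(map_output):
    print(key, values)
```

Sorting first is what makes the grouping work, since `groupby` only merges neighbors that share a key.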

Input to Reduce Function

•  Reduce function receives a key and all values for that key
•  Keys are always passed to reducers in sorted order
    •  Although not obvious here, values are unordered

ERROR 1
INFO 1
INFO 1
INFO 1
INFO 1
WARN 1
WARN 1

Python Code for Reduce Function

#!/usr/bin/env python
import sys

# Initialize loop variables
previous_key = None
total = 0

for line in sys.stdin:
    # Extract the key and value passed via standard input
    key, value = line.split()
    if key == previous_key:
        # If key unchanged, increment the count
        total += int(value)
    else:
        # If key changed, print data for old level
        if previous_key is not None:
            print('%s\t%i' % (previous_key, total))
        # Start tracking data for the new record
        previous_key = key
        total = int(value)

# Print data for the final key (skipped if there was no input)
if previous_key is not None:
    print('%s\t%i' % (previous_key, total))

Recap of Data Flow

Map input

2013-06-29 22:16:49.391 CDT INFO "This can wait"
2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"
2013-06-29 22:16:54.276 CDT WARN "This seems bad"
2013-06-29 22:16:57.471 CDT INFO "More blather"
2013-06-29 22:17:01.290 CDT WARN "Not looking good"
2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"
2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"

Map output

INFO 1
INFO 1
WARN 1
INFO 1
WARN 1
INFO 1
ERROR 1

Shuffle and sort

Reduce input

ERROR 1
INFO 1
INFO 1
INFO 1
INFO 1
WARN 1
WARN 1

Reduce output

ERROR 1
INFO 4
WARN 2
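
The whole flow can be checked on one machine before submitting anything to a cluster. The sketch below re-expresses the map and reduce logic as generator functions over tab-separated text lines, the form in which Hadoop Streaming passes records through standard input and output, and uses a plain `sorted` call as a stand-in for shuffle and sort. The function names are hypothetical, and the length check in `mapper` is an added guard against short lines.

```python
LEVELS = ['TRACE', 'DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']


def mapper(lines):
    """Emit "LEVEL<tab>1" for each line whose fourth field is a known level."""
    for line in lines:
        fields = line.split()
        if len(fields) > 3 and fields[3].upper() in LEVELS:
            yield "%s\t1" % fields[3].upper()


def reducer(sorted_lines):
    """Sum the values of consecutive lines that share a key."""
    previous_key, total = None, 0
    for line in sorted_lines:
        key, value = line.split("\t")
        if key != previous_key:
            if previous_key is not None:
                yield "%s\t%d" % (previous_key, total)
            previous_key, total = key, 0
        total += int(value)
    if previous_key is not None:
        yield "%s\t%d" % (previous_key, total)


map_input = [
    '2013-06-29 22:16:49.391 CDT INFO "This can wait"',
    '2013-06-29 22:16:52.143 CDT INFO "Blah blah blah"',
    '2013-06-29 22:16:54.276 CDT WARN "This seems bad"',
    '2013-06-29 22:16:57.471 CDT INFO "More blather"',
    '2013-06-29 22:17:01.290 CDT WARN "Not looking good"',
    '2013-06-29 22:17:03.812 CDT INFO "Fairly unimportant"',
    '2013-06-29 22:17:05.362 CDT ERROR "Out of memory!"',
]

map_output = list(mapper(map_input))
reduce_input = sorted(map_output)  # stands in for shuffle and sort
reduce_output = list(reducer(reduce_input))
print("\n".join(reduce_output))
```

The printed result matches the reduce output shown in the recap: one ERROR, four INFO, and two WARN events.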

The Hadoop Ecosystem

•  "Core Hadoop" consists of HDFS and MapReduce
    •  These are the kernel of a much broader platform
•  Hadoop has many related projects
    •  Some help you integrate Hadoop with other systems
    •  Others help you analyze your data
•  These are not considered “core Hadoop”
    •  Rather, they’re part of the Hadoop ecosystem
    •  Many are also open source Apache projects

Visual Overview of a Complete Workflow

[Figure: a Hadoop Cluster with Impala at the center, surrounded by these activities]

•  Import Transaction Data from RDBMS
•  Sessionize Web Log Data with Pig
•  Sentiment Analysis on Social Media with Hive
•  Analyst uses Impala for business intelligence
•  Generate Nightly Reports using Pig, Hive, or Impala
•  Build product recommendations for Web site

Key Points

•  We’re generating massive volumes of data
    •  This data can be extremely valuable
    •  Companies can now analyze what they previously discarded
•  Hadoop supports large-scale data storage and processing
    •  Heavily influenced by Google's architecture
    •  Already in production by thousands of organizations
    •  HDFS is Hadoop's storage layer
    •  MapReduce is Hadoop's processing framework
•  Many ecosystem projects complement Hadoop
    •  Some help you to integrate Hadoop with existing systems
    •  Others help you analyze the data you’ve stored

Questions?

•  Thank you for attending!
•  I’ll be happy to answer any additional questions now…
•  Demo and slides at github.com/markgrover/hadoop-intro-fast
•  Twitter: mark_grover
•  Survey page: tiny.cloudera.com/mark
