Hadoop Application Architectures tutorial at Big DataService 2015

1
©
Cloudera,
Inc.
All
rights
reserved.

Architectural
Considera8ons

for
Hadoop
Applica8ons

Mark
Grover
|
@mark_grover

Jonathan
Seidman
|
@jseidman

Gwen
Shapira
|
@gwenshap

IEEE
BigDataService
2015
|
March
30th,
2015

8ny.cloudera.com/app-‐arch-‐slides

2
©
Cloudera,
Inc.
All
rights
reserved.

About
the
book

•  @hadooparchbook

•  hadooparchitecturebook.com

•  github.com/hadooparchitecturebook

•  slideshare.com/hadooparchbook

3
©
Cloudera,
Inc.
All
rights
reserved.

About
the
presenters

• Principal
Solu8ons
Architect

at
Cloudera

• Previously,
lead
architect
at

FINRA

• Contributor
to
Apache

Hadoop,
HBase,
Flume,

Avro,
Pig
and
Spark

• Senior
Solu8ons
Architect/
Partner
Enablement
at

Cloudera

• Previously,
Technical
Lead

on
the
big
data
team
at

Orbitz
Worldwide

• Co-‐founder
of
the
Chicago

Hadoop
User
Group
and

Chicago
Big
Data

Ted
Malaska
Jonathan
Seidman

4
©
Cloudera,
Inc.
All
rights
reserved.

About
the
presenters

• Solu8ons
Architect
turned

So]ware
Engineer
at

Cloudera

• Commi^er
on
Apache

Sqoop

• Contributor
to
Apache

Flume
and
Apache
Kaaa

• So]ware
Engineer
at

Cloudera

• Commi^er
on
Apache

Bigtop,
PMC
member
on

Apache
Sentry
(incuba8ng)

• Contributor
to
Apache

Hadoop,
Spark,
Hive,
Sqoop,

Pig
and
Flume

Gwen
Shapira
Mark
Grover

5
©
Cloudera,
Inc.
All
rights
reserved.

Logis8cs

•  Break
at
10:30-‐11:00
PM

•  Ques8ons
at
the
end
of
each
sec8on

6
©
Cloudera,
Inc.
All
rights
reserved.

Case
Study

Clickstream
Analysis

7
©
Cloudera,
Inc.
All
rights
reserved.

Analy8cs

8
©
Cloudera,
Inc.
All
rights
reserved.

Analy8cs

9
©
Cloudera,
Inc.
All
rights
reserved.

Analy8cs

10
©
Cloudera,
Inc.
All
rights
reserved.

Analy8cs

11
©
Cloudera,
Inc.
All
rights
reserved.

Analy8cs

12
©
Cloudera,
Inc.
All
rights
reserved.

Analy8cs

13
©
Cloudera,
Inc.
All
rights
reserved.

Analy8cs

14
©
Cloudera,
Inc.
All
rights
reserved.

Web
Logs
–
Combined
Log
Format

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0"
200 4463 "http://bestcyclingreviews.com/top_online_shops" "Mozilla/
5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/36.0.1944.0 Safari/537.36”
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?
productID=1023 HTTP/1.0" 200 3757 "http://www.casualcyclist.com"
"Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC Vision Build/
GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile
Safari/533.1”

15
©
Cloudera,
Inc.
All
rights
reserved.

Clickstream
Analy8cs

244.157.45.12 - - [17/Oct/
2014:21:08:30 ] "GET /seatposts
HTTP/1.0" 200 4463 "http://
bestcyclingreviews.com/
top_online_shops" "Mozilla/5.0
(Macintosh; Intel Mac OS X 10_9_2)
AppleWebKit/537.36 (KHTML, like
Gecko) Chrome/36.0.1944.0 Safari/
537.36”

16
©
Cloudera,
Inc.
All
rights
reserved.

Similar
use-‐cases

•  Sensors
–
heart,
agriculture,
etc.

•  Casinos
–
session
of
a
person
at
a
table

17
©
Cloudera,
Inc.
All
rights
reserved.

Pre-‐Hadoop
Architecture

Clickstream
Analysis

18
©
Cloudera,
Inc.
All
rights
reserved.

Click
Stream
Analysis
(Before
Hadoop)

Web
logs

(full
ﬁdelity)

(2
weeks)

Data
Warehouse

Transform/
Aggregate
Business

Intelligence

Tape
Archive

19
©
Cloudera,
Inc.
All
rights
reserved.

Problems
with
Pre-‐Hadoop
Architecture

•  Full
fidelity
data
is
stored
for
small
amount
of
8me
(~weeks).

•  Older
data
is
sent
to
tape,
or
even
worse,
deleted!

•  Inflexible
workflow
-‐
think
of
all
aggrega8ons
beforehand

20
©
Cloudera,
Inc.
All
rights
reserved.

Eﬀects
of
Pre-‐Hadoop
Architecture

•  Regenera8ng
aggregates
is
expensive
or
worse,
impossible

•  Can’t
correct
bugs
in
the
workﬂow/aggrega8on
logic

•  Can’t
do
experiments
on
exis8ng
data

21
©
Cloudera,
Inc.
All
rights
reserved.

Why
is
Hadoop
A
Great
Fit?

Clickstream
Analysis

22
©
Cloudera,
Inc.
All
rights
reserved.

Why
is
Hadoop
a
great
ﬁt?

•  Volume
of
clickstream
data
is
huge

•  Velocity
at
which
it
comes
in
is
high

•  Variety
of
data
is
diverse
-‐
semi-‐structured
data

•  Hadoop
enables

• ac8ve
archival
of
data

• Aggrega8on
jobs

• Querying
the
above
aggregates
or
raw
ﬁdelity
data

23
©
Cloudera,
Inc.
All
rights
reserved.

Click
Stream
Analysis
(with
Hadoop)

Web
logs
Hadoop
Business

Intelligence

Ac8ve
archive
(no
tape)

Aggrega8on
engine

Querying
engine

24
©
Cloudera,
Inc.
All
rights
reserved.

Challenges
of
Hadoop
Implementa8on

25
©
Cloudera,
Inc.
All
rights
reserved.

Challenges
of
Hadoop
Implementa8on

26
©
Cloudera,
Inc.
All
rights
reserved.

Other
challenges
-‐
Architectural
Considera8ons

•  Storage
managers?

•  HDFS?
HBase?

•  Data
storage
and
modeling:

•  File
formats?
Compression?
Schema
design?

•  Data
movement

•  How
do
we
actually
get
the
data
into
Hadoop?
How
do
we
get
it
out?

•  Metadata

•  How
do
we
manage
data
about
the
data?

•  Data
access
and
processing

•  How
will
the
data
be
accessed
once
in
Hadoop?
How
can
we
transform
it?
How
do
we
query

it?

•  Orchestra8on

•  How
do
we
manage
the
workﬂow
for
all
of
this?

27
©
Cloudera,
Inc.
All
rights
reserved.

Case
Study
Requirements

Overview
of
Requirements

28
©
Cloudera,
Inc.
All
rights
reserved.

Overview
of
Requirements

Data

Sources

Inges8on

Raw
Data

Storage

(Formats,

Schema)

Processed

Data
Storage

(Formats,

Schema)

Processing

Data

Consump8on

Orchestra8on

(Scheduling,

Managing,

Monitoring)

29
©
Cloudera,
Inc.
All
rights
reserved.

Case
Study
Requirements

Data
Inges8on

30
©
Cloudera,
Inc.
All
rights
reserved.

Data
Inges8on
Requirements

Web
Servers

Web
Servers

Web
Servers

Web
Servers

Web
Servers

Logs

244.157.45.12 - - [17/
Oct/2014:21:08:30 ]
"GET /seatposts HTTP/
1.0" 200 4463 …

CRM
Data

ODS

Web
Servers

Web
Servers

Web
Servers

Web
Servers

Web
Servers

Logs

Hadoop

31
©
Cloudera,
Inc.
All
rights
reserved.

Data
Inges8on
Requirements

•  So
we
need
to
be
able
to
support:

• Reliable
inges8on
of
large
volumes
of
semi-‐structured
event
data
arriving
with

high
velocity
(e.g.
logs).

• Timeliness
of
data
availability
–
data
needs
to
be
available
for
processing
to

meet
business
service
level
agreements.

• Periodic
inges8on
of
data
from
rela8onal
data
stores.

32
©
Cloudera,
Inc.
All
rights
reserved.

Case
Study
Requirements

Data
Storage

33
©
Cloudera,
Inc.
All
rights
reserved.

Data
Storage
Requirements

Store
all
the
data

Make
the
data

accessible
for

processing

Compress
the

data

34
©
Cloudera,
Inc.
All
rights
reserved.

Case
Study
Requirements

Data
Processing

35
©
Cloudera,
Inc.
All
rights
reserved.

Processing
requirements

Be
able
to
answer
ques8ons
like:

•  What
is
my
website’s
bounce
rate?

• i.e.
how
many
%
of
visitors
don’t
go
past
the
landing
page?

•  Which
marke8ng
channels
are
leading
to
most
sessions?

•  Do
a^ribu8on
analysis

• Which
channels
are
responsible
for
most
conversions?

36
©
Cloudera,
Inc.
All
rights
reserved.

Sessioniza8on

Website
visit

Visitor
1

Session
1

Visitor
1

Session
2

Visitor
2

Session
1

>
30
minutes

37
©
Cloudera,
Inc.
All
rights
reserved.

Case
Study
Requirements

Orchestra8on

38
©
Cloudera,
Inc.
All
rights
reserved.

Orchestra8on
is
simple

We
just
need
to
execute
ac8ons

One
a]er
another

39
©
Cloudera,
Inc.
All
rights
reserved.

Actually,

we
also
need
to
handle
errors

And
user
no8ﬁca8ons

….

40
©
Cloudera,
Inc.
All
rights
reserved.

And…

•  Re-‐start
workflows
a]er
errors

•  Reuse
of
ac8ons
in
mul8ple
workflows

•  Complex
workflows
with
decision
points

•  Trigger
ac8ons
based
on
events

•  Tracking
metadata

•  Integra8on
with
enterprise
so]ware

•  Data
lifecycle

•  Data
quality
control

•  Reports

41
©
Cloudera,
Inc.
All
rights
reserved.

OK,
maybe
we
need
a
product

To
help
us
do
all
that

42
©
Cloudera,
Inc.
All
rights
reserved.

Architectural
Considera8ons

Data
Modeling

43
©
Cloudera,
Inc.
All
rights
reserved.

Data
Modeling
Considera8ons

•  We
need
to
consider
the
following
in
our
architecture:

• Storage
layer
–
HDFS?
HBase?
Etc.

• File
system
schemas
–
how
will
we
lay
out
the
data?

• File
formats
–
what
storage
formats
to
use
for
our
data,
both
raw
and

processed
data?

• Data
compression
formats?

44
©
Cloudera,
Inc.
All
rights
reserved.

Architectural
Considera8ons

Data
Modeling
–
Storage
Layer

45
©
Cloudera,
Inc.
All
rights
reserved.

Data
Storage
Layer
Choices

•  Two
likely
choices
for
raw
data:

46
©
Cloudera,
Inc.
All
rights
reserved.

Data
Storage
Layer
Choices

• Stores
data
directly
as
ﬁles

• Fast
scans

• Poor
random
reads/writes

• Stores
data
as
Hﬁles
on
HDFS

• Slow
scans

• Fast
random
reads/writes

47
©
Cloudera,
Inc.
All
rights
reserved.

Data
Storage
–
Storage
Manager
Considera8ons

•  Incoming
raw
data:

• Processing
requirements
call
for
batch
transforma8ons
across
mul8ple
records

–
for
example
sessioniza8on.

•  Processed
data:

• Access
to
processed
data
will
be
via
things
like
analy8cal
queries
–
again

requiring
access
to
mul8ple
records.

•  We
choose
HDFS

• Processing
needs
in
this
case
served
be^er
by
fast
scans.

48
©
Cloudera,
Inc.
All
rights
reserved.

Architectural
Considera8ons

Data
Modeling
–
Raw
Data
Storage

49
©
Cloudera,
Inc.
All
rights
reserved.

Storage
Formats
–
Raw
Data
and
Processed
Data

Processed

Data

Raw

Data

50
©
Cloudera,
Inc.
All
rights
reserved.

Data
Storage
–
Format
Considera8ons

Logs

(plain

text)

51
©
Cloudera,
Inc.
All
rights
reserved.

Data
Storage
–
Format
Considera8ons

Logs

(plain

text)

Logs

(plain

text)

Logs

(plain

text)

Logs

(plain

text)

Logs

(plain

text)

Logs

(plain

text)

Logs

Logs

Logs

Logs

Logs

52
©
Cloudera,
Inc.
All
rights
reserved.

Data
Storage
–
Compression

snappy
Well,
maybe.

But
not
spliâble.

X
Spliâble.
Gewng

beêr…

Hmmm….
Spliâble,
but
no...

53
©
Cloudera,
Inc.
All
rights
reserved.

Raw
Data
Storage
–
More
About
Snappy

• Designed
at
Google
to
provide
high
compression
speeds
with
reasonable

compression.

• Not
the
highest
compression,
but
provides
very
good
performance
for

processing
on
Hadoop.

• Snappy
is
not
spli^able
though,
which
brings
us
to…

54
©
Cloudera,
Inc.
All
rights
reserved.

Hadoop
File
Types

•  Formats
designed
speciﬁcally
to
store
and
process
data
on
Hadoop:

• File
based
–
SequenceFile

• Serializa8on
formats
–
Thri],
Protocol
Buﬀers,
Avro

• Columnar
formats
–
RCFile,
ORC,
Parquet

55
©
Cloudera,
Inc.
All
rights
reserved.

SequenceFile

• Stores
records
as
binary

key/value
pairs.

• SequenceFile
“blocks”
can

be
compressed.

• This
enables
spli^ability

with
non-‐spli^able

compression.

56
©
Cloudera,
Inc.
All
rights
reserved.

Avro

• Kinda
SequenceFile
on

Steroids.

• Self-‐documen8ng
–
stores

schema
in
header.

• Provides
very
eﬃcient

storage.

• Supports
spli^able

compression.

57
©
Cloudera,
Inc.
All
rights
reserved.

Our
Format
Recommenda8ons
for
Raw
Data…

•  Avro
with
Snappy

• Snappy
provides
op8mized
compression.

• Avro
provides
compact
storage,
self-‐documen8ng
ﬁles,
and
supports
schema

evolu8on.

• Avro
also
provides
be^er
failure
handling
than
other
choices.

•  SequenceFiles
would
also
be
a
good
choice,
and
are
directly
supported
by

inges8on
tools
in
the
ecosystem.

• But
only
supports
Java.

58
©
Cloudera,
Inc.
All
rights
reserved.

But
Note…

•  For
simplicity,
we’ll
use
plain
text
for
raw
data
in
our
example.

59
©
Cloudera,
Inc.
All
rights
reserved.

Architectural
Considera8ons

Data
Modeling
–
Processed
Data
Storage

60
©
Cloudera,
Inc.
All
rights
reserved.

Storage
Formats
–
Raw
Data
and
Processed
Data

Processed

Data

Raw

Data

61
©
Cloudera,
Inc.
All
rights
reserved.

Access
to
Processed
Data

Column
A
Column
B
Column
C
Column
D

Value
Value
Value
Value

Value
Value
Value
Value

Value
Value
Value
Value

Value
Value
Value
Value

Analy8cal

Queries

62
©
Cloudera,
Inc.
All
rights
reserved.

Columnar
Formats

•  Eliminates
I/O
for
columns
that
are
not
part
of
a
query.

•  Works
well
for
queries
that
access
a
subset
of
columns.

•  O]en
provide
be^er
compression.

•  These
add
up
to
drama8cally
improved
performance
for
many
queries.

1
2014-‐10-‐13
abc

2
2014-‐10-‐14
def

3
2014-‐10-‐15
ghi

1
2
3

2014-‐10-‐13
2014-‐10-‐14
2014-‐10-‐15

abc
def
ghi

63
©
Cloudera,
Inc.
All
rights
reserved.

Columnar
Choices
–
RCFile

•  Designed
to
provide
eﬃcient
processing
for
Hive
queries.

•  Only
supports
Java.

•  No
Avro
support.

•  Limited
compression
support.

•  Sub-‐op8mal
performance
compared
to
newer
columnar
formats.

64
©
Cloudera,
Inc.
All
rights
reserved.

Columnar
Choices
–
ORC

•  A
be^er
RCFile.

•  Also
designed
to
provide
eﬃcient
processing
of
Hive
queries.

•  Only
supports
Java.

65
©
Cloudera,
Inc.
All
rights
reserved.

Columnar
Choices
–
Parquet

•  Designed
to
provide
eﬃcient
processing
across
Hadoop
programming
interfaces

–
MapReduce,
Hive,
Impala,
Pig.

•  Mul8ple
language
support
–
Java,
C++

•  Good
object
model
support,
including
Avro.

•  Broad
vendor
support.

•  These
features
make
Parquet
a
good
choice
for
our
processed
data.

66
©
Cloudera,
Inc.
All
rights
reserved.

Architectural
Considera8ons

Data
Modeling
–
Schema
Design

67
©
Cloudera,
Inc.
All
rights
reserved.

HDFS
Schema
Design
–
One
Recommenda8on

/etl
–
Data
in
various
stages
of
ETL
workflow

/data
–
processed
data
to
be
shared
data
with
the
en8re
organiza8on

/tmp
–
temp
data
from
tools
or
shared
between
users

/user/<username>
-‐
User
specific
data,
jars,
conf
files

/app
–
Everything
but
data:
UDF
jars,
HQL
files,
Oozie
workflows

68
©
Cloudera,
Inc.
All
rights
reserved.

Par88oning

•  Split
the
dataset
into
smaller
consumable
chunks.

•  Rudimentary
form
of
“indexing”.
Reduces
I/O
needed
to
process
queries.

69
©
Cloudera,
Inc.
All
rights
reserved.

Par88oning

dataset

col=val1/file.txt

col=val2/file.txt

…

col=valn/file.txt

dataset

file1.txt

file2.txt

…

filen.txt

Un-‐par88oned
HDFS

directory
structure

Par88oned
HDFS

directory
structure

70
©
Cloudera,
Inc.
All
rights
reserved.

Par88oning
considera8ons

•  What
column
to
par88on
by?

• Don’t
have
too
many
par88ons
(<10,000)

• Don’t
have
too
many
small
ﬁles
in
the
par88ons

• Good
to
have
par88on
sizes
at
least
~1
GB,
generally
a
mul8ple
of
block
size.

•  We’ll
par88on
by
6mestamp.
This
applies
to
both
our
raw
and
processed
data.

71
©
Cloudera,
Inc.
All
rights
reserved.

Par88oning
For
Our
Case
Study

•  Raw
dataset:

•  /etl/BI/casualcyclist/clicks/rawlogs/year=2014/month=10/day=10!
•  Processed
dataset:

•  /data/bikeshop/clickstream/year=2014/month=10/day=10!

72
©
Cloudera,
Inc.
All
rights
reserved.

Architectural
Considera8ons

Data
Inges8on

73
©
Cloudera,
Inc.
All
rights
reserved.

•  Omniture
data
on
FTP

•  Apps

•  App
Logs

•  RDBMS

Typical
Clickstream
data
sources

74
©
Cloudera,
Inc.
All
rights
reserved.

Gewng
Files
from
FTP

75
©
Cloudera,
Inc.
All
rights
reserved.

curl ftp://myftpsite.com/sitecatalyst/
myreport_2014-10-05.tar.gz
--user name:password | hdfs -put - /etl/clickstream/raw/
sitecatalyst/myreport_2014-10-05.tar.gz
Don’t
over-‐complicate
things

76
©
Cloudera,
Inc.
All
rights
reserved.

Apache
NiFi

77
©
Cloudera,
Inc.
All
rights
reserved.

Reliable,
distributed
and
highly
available
systems

That
allow
streaming
events
to
Hadoop

Event
Streaming
–
Flume
and
Kaaa

78
©
Cloudera,
Inc.
All
rights
reserved.

•  Many
available
data
collec8on
sources

•  Well
integrated
into
Hadoop

•  Supports
ﬁle
transforma8ons

•  Can
implement
complex
topologies

•  Very
low
latency

•  No
programming
required

Flume:

79
©
Cloudera,
Inc.
All
rights
reserved.

“We
just
want
to
grab
data

from
this
directory

and
write
it
to
HDFS”

We
use
Flume
when:

80
©
Cloudera,
Inc.
All
rights
reserved.

•  Very
high-‐throughput
publish-‐subscribe
messaging

•  Highly
available

•  Stores
data
and
can
replay

•  Can
support
many
consumers
with
no
extra
latency

Kaaa
is:

81
©
Cloudera,
Inc.
All
rights
reserved.

“Kaaa
is
awesome.

We
heard
it
cures
cancer”

Use
Kaaa
When:

82
©
Cloudera,
Inc.
All
rights
reserved.

•  Use
Flume
with
a
Kaaa
Source

•  Allows
to
get
data
from
Kaaa,

run
some
transforma8ons

write
to
HDFS,
HBase
or
Solr

Actually,
why
choose?

83
©
Cloudera,
Inc.
All
rights
reserved.

•  We
want
to
ingest
events
from
log
ﬁles

•  Flume’s
Spooling
Directory
source
ﬁts

•  With
HDFS
Sink

•  We
would
have
used
Kaaa
if…

• We
wanted
the
data
in
non-‐Hadoop
systems
too

In
Our
Example…

84
©
Cloudera,
Inc.
All
rights
reserved.

Sources
Interceptors
Selectors
Channels
Sinks

Flume
Agent

Short
Intro
to
Flume

Twi^er,
logs,
JMS,

webserver,
Kaaa

Mask,
re-‐format,

validate…

DR,
cri8cal
Memory,
ﬁle,
Kaaa
HDFS,
HBase,
Solr

85
©
Cloudera,
Inc.
All
rights
reserved.

Configura8on

• Declara8ve

• No
coding
required.

• Configura8on
specifies
how

components
are
wired

together.

86
©
Cloudera,
Inc.
All
rights
reserved.

Interceptors

• Mask
ﬁelds

• Validate
informa8on

against
external
source

• Extract
ﬁelds

• Modify
data
format

• Filter
or
split
events

87
©
Cloudera,
Inc.
All
rights
reserved.

Any
suﬃciently
complex
conﬁgura8on

Is
indis8nguishable
from
code

88
©
Cloudera,
Inc.
All
rights
reserved.

A
Brief
Discussion
of
Flume
Pa^erns
–
Fan-‐in

• Flume
agent
runs
on
each

of
our
servers.

• These
client
agents
send

data
to
mul8ple
agents
to

provide
reliability.

• Flume
provides
support

for
load
balancing.

89
©
Cloudera,
Inc.
All
rights
reserved.

A
Brief
Discussion
of
Flume
Pa^erns
–
Spliwng

• Common
need
is
to
split

data
on
ingest.

• For
example:

• Sending
data
to
mul8ple

clusters
for
DR.

• To
mul8ple
des8na8ons.

• Flume
also
supports

par88oning,
which
is
key
to

our
implementa8on.

90
©
Cloudera,
Inc.
All
rights
reserved.

Flume
Agent

Web
Logs

Spooling
Dir

Source

Timestamp

Interceptor

File

Channel

Avro
Sink

Avro
Sink

Avro
Sink

Flume
Architecture
–
Client
Tier

Web
Server
Flume
Agent

91
©
Cloudera,
Inc.
All
rights
reserved.

Flume
Architecture
–
Collector
Tier

Flume
Agent

Flume
Agent

HDFS

Avro

Source

File

Channel

HDFS
Sink

HDFS
Sink

92
©
Cloudera,
Inc.
All
rights
reserved.

•  Add
Kaaa
producer
to
our
webapp

•  Send
clicks
and
searches
as
messages

•  Flume
can
ingest
events
from
Kaaa

•  We
can
add
a
second
consumer
for
real-‐8me

processing
in
SparkStreaming

•  Another
consumer
for
aler8ng…

•  And
maybe
a
batch
consumer
too

What
if….
We
were
to
use
Kaaa?

93
©
Cloudera,
Inc.
All
rights
reserved.

Channels
Sinks

Flume
Agent

The
Kaaa
Channel

Kafka HDFS,
HBase,
Solr

Producer

A

Producer

B

Producer

C

Kaaa

Producers

94
©
Cloudera,
Inc.
All
rights
reserved.

Sources
Interceptors
Channels

Flume
Agent

The
Kaaa
Channel

Twi^er,
logs,
JMS,

webserver

Mask,
re-‐format,

validate…

Kafka
Consumer
A

Kaaa

Consumers

Consumer
B

Consumer
C

95
©
Cloudera,
Inc.
All
rights
reserved.

Sources
Interceptors
Selectors
Channels
Sinks

Flume
Agent

The
Kaaa
Channel

Twi^er,
logs,
JMS,

webserver

Mask,
re-‐format,

validate…

DR,
cri8cal
Kafka HDFS,
HBase,
Solr

96
©
Cloudera,
Inc.
All
rights
reserved.

Architectural
Considera8ons

Data
Processing
–
Engines

tiny.cloudera.com/app-‐arch-‐slides

97
©
Cloudera,
Inc.
All
rights
reserved.

Processing
Engines

•  MapReduce

•  Abstrac8ons

•  Spark

•  Spark
Streaming

•  Impala

98
©
Cloudera,
Inc.
All
rights
reserved.

Data
Processing
Engines

MapReduce
and
Abstrac8ons

99
©
Cloudera,
Inc.
All
rights
reserved.

MapReduce

•  Oldie
but
goody

•  Restric8ve
Framework
/
Innovated
Work
Around

•  Extreme
Batch

100
©
Cloudera,
Inc.
All
rights
reserved.

MapReduce
Basic
High
Level

Mapper

HDFS

(Replicated)

Na8ve
File
System

Block
of
Data

Temp
Spill

Data

Par88oned

Sorted
Data

Reducer

Reducer
Local

Copy

Output
File

101
©
Cloudera,
Inc.
All
rights
reserved.

MapReduce
Innova8on

•  Mapper
Memory
Joins

•  Reducer
Memory
Joins

•  Buckets
Sorted
Joins

•  Cross
Task
Communica8on

•  Windowing

•  And
Much
More

102
©
Cloudera,
Inc.
All
rights
reserved.

Abstrac8ons

•  SQL

• Hive

•  Script/Code

• Pig:
Pig
La8n

• Crunch:
Java/Scala

• Cascading:
Java/Scala

103
©
Cloudera,
Inc.
All
rights
reserved.

Data
Processing
Engines

Spark
and
Spark
Streaming

104
©
Cloudera,
Inc.
All
rights
reserved.

Spark

•  The
New
Kid
that
isn’t
that
New
Anymore

•  Easily
10x
less
code

•  Extremely
Easy
and
Powerful
API

•  Very
good
for
machine
learning

•  Scala,
Java,
and
Python

•  RDDs

•  DAG
Engine

105
©
Cloudera,
Inc.
All
rights
reserved.

Spark
-‐
DAG

106
©
Cloudera,
Inc.
All
rights
reserved.

Spark
-‐
DAG

Filter
KeyBy

KeyBy

TextFile

TextFile

Join
Filter
Take

107
©
Cloudera,
Inc.
All
rights
reserved.

Spark
-‐
DAG

Filter
KeyBy

KeyBy

TextFile

TextFile

Join
Filter
Take

Good

Good

Good

Good

Good

Good

Good-‐Replay

Good-‐Replay

Good-‐Replay

Good

Good-‐Replay

Good

Good-‐Replay

Lost
Block

Replay

Good-‐Replay

Lost
Block

Good

Future

Future

Future

Future

108
©
Cloudera,
Inc.
All
rights
reserved.

Spark
Streaming

•  Calling
Spark
in
a
Loop

•  Extends
RDDs
with
DStream

•  Very
Li^le
Code
Changes
from
ETL
to
Streaming

109
©
Cloudera,
Inc.
All
rights
reserved.

Spark
Streaming

Single
Pass

Source
Receiver
RDD

Source
Receiver
RDD

RDD

Filter
Count
Print

Source
Receiver
RDD

RDD

RDD

Single
Pass

Filter
Count
Print

Pre-‐ﬁrst

Batch

First

Batch

Second

Batch

110
©
Cloudera,
Inc.
All
rights
reserved.

Spark
Streaming

Single
Pass

Source
Receiver
RDD

Source
Receiver
RDD

RDD

Filter
Count

Print

Source
Receiver
RDD

RDD

RDD

Single
Pass

Filter
Count

Pre-‐ﬁrst

Batch

First

Batch

Second

Batch

Stateful
RDD
1

Print

Stateful
RDD
2

Stateful
RDD
1

111
©
Cloudera,
Inc.
All
rights
reserved.

Data
Processing
Engines

Impala

113
©
Cloudera,
Inc.
All
rights
reserved.

Impala
–
Broadcast
Join

Impala
Daemon

Smaller
Table

Data
Block

100%
Cached

Smaller
Table

Smaller
Table

Data
Block

Impala
Daemon

100%
Cached

Smaller
Table

Impala
Daemon

100%
Cached

Smaller
Table

Impala
Daemon

Hash
Join
Func8on

Bigger
Table
Data

Block

100%
Cached

Smaller
Table

Output

Impala
Daemon

Hash
Join
Func8on

Bigger
Table
Data

Block

100%
Cached

Smaller
Table

Output

Impala
Daemon

Hash
Join
Func8on

Bigger
Table
Data

Block

100%
Cached

Smaller
Table

Output

114
©
Cloudera,
Inc.
All
rights
reserved.

Impala
–
Par88oned
Hash
Join

Impala
Daemon

Smaller
Table

Data
Block

~33%
Cached

Smaller
Table

Smaller
Table

Data
Block

Impala
Daemon

~33%
Cached

Smaller
Table

Impala
Daemon

~33%
Cached

Smaller
Table

Hash
Par88oner
Hash
Par88oner

Impala
Daemon

BiggerTable
Data

Block

Impala
Daemon
Impala
Daemon

Hash
Par88oner

Hash
Join
Func8on

33%
Cached

Smaller
Table

Hash
Join
Func8on

33%
Cached

Smaller
Table

Hash
Join
Func8on

33%
Cached

Smaller
Table

Output
Output
Output

BiggerTable
Data

Block

Hash
Par88oner

BiggerTable
Data

Block

Hash
Par88oner

119
©
Cloudera,
Inc.
All
rights
reserved.

Why
sessionize?

Helps
answers
ques8ons
like:

•  What
is
my
website’s
bounce
rate?

• i.e.
how
many
%
of
visitors
don’t
go
past
the
landing
page?

•  Which
marke8ng
channels
(e.g.
organic
search,
display
ad,
etc.)
are
leading
to

most
sessions?

• Which
ones
of
those
lead
to
most
conversions
(e.g.
people
buying
things,

signing
up,
etc.)

•  Do
a^ribu8on
analysis
–
which
channels
are
responsible
for
most
conversions?

120
©
Cloudera,
Inc.
All
rights
reserved.

Sessioniza8on

244.157.45.12 - - [17/Oct/2014:21:08:30 ] "GET /seatposts HTTP/1.0" 200 4463 "http://
bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X
10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”
244.157.45.12+1413580110
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/
1.0" 200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5;
en-us; HTC Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0
Mobile Safari/533.1” 244.157.45.12+1413583199

121
©
Cloudera,
Inc.
All
rights
reserved.

How
to
Sessionize?

1.  Given
a
list
of
clicks,
determine
which
clicks
came
from

the
same
user
(Par88oning,
ordering)

2.  Given
a
par8cular
user's
clicks,
determine
if
a
given
click

is
a
part
of
a
new
session
or
a
con8nua8on
of
the

previous
session
(Iden8fying
session
boundaries)

122
©
Cloudera,
Inc.
All
rights
reserved.

#1
–
Which
clicks
are
from
same
user?

•  We
can
use:

• IP
address
(244.157.45.12)

• Cookies
(A9A3BECE0563982D)

• IP
address
(244.157.45.12)and
user
agent
string
((KHTML, like
Gecko) Chrome/36.0.1944.0 Safari/537.36")

123
©
Cloudera,
Inc.
All
rights
reserved.

#1
–
Which
clicks
are
from
same
user?

•  We
can
use:

• IP
address
(244.157.45.12)

• Cookies
(A9A3BECE0563982D)

• IP
address
(244.157.45.12)and
user
agent
string
((KHTML, like
Gecko) Chrome/36.0.1944.0 Safari/537.36")

124
©
Cloudera,
Inc.
All
rights
reserved.

#1
–
Which
clicks
are
from
same
user?

bestcyclingreviews.com/top_online_shops" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36”
244.157.45.12 - - [17/Oct/2014:21:59:59 ] "GET /Store/cart.jsp?productID=1023 HTTP/1.0"
200 3757 "http://www.casualcyclist.com" "Mozilla/5.0 (Linux; U; Android 2.3.5; en-us; HTC
Vision Build/GRI40) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1”

126
©
Cloudera,
Inc.
All
rights
reserved.

Sessioniza8on
engine
recommenda8on

•  We
have
sessioniza8on
code
in
MR
and
Spark
on
github.
The
complexity
of
the

code
varies,
depends
on
the
exper8se
in
the
organiza8on.

•  We
choose
MR

• MR
API
is
stable
and
widely
known

• No
Spark
+
Oozie
(orchestra8on
engine)
integra8on
currently

129
©
Cloudera,
Inc.
All
rights
reserved.

Filtering
recommenda8on

•  Bot/Spider
filtering
can
be
done
easily
in
any
of
the
engines

•  Incomplete
records
are
harder
to
filter
in
schema
systems
like
Hive,
Impala,

Pig,
etc.

•  Flume
interceptors
can
also
be
used

•  Pre^y
close
choice
between
MR,
Hive
and
Spark

•  Can
be
done
in
Spark
using
rdd.filter()

•  We
can
simply
embed
this
in
our
MR
sessioniza8on
job

131
©
Cloudera,
Inc.
All
rights
reserved.

Deduplica8on
recommenda8on

•  Can
be
done
in
all
engines.

•  We
already
have
a
Hive
table
with
all
the
columns,
a
simple
DISTINCT
query

will
perform
deduplica8on

•  reduce()
in
spark

•  We
use
Pig

132
©
Cloudera,
Inc.
All
rights
reserved.

BI/Discovery
engine
recommenda8on

•  Main
requirements
for
this
are:

• Low
latency

• SQL
interface
(e.g.
JDBC/ODBC)

• Users
don’t
know
how
to
code

•  We
chose
Impala

• It’s
a
SQL
engine

• Much
faster
than
other
engines

• Provides
standard
JDBC/ODBC
interfaces

135
©
Cloudera,
Inc.
All
rights
reserved.

•  Data
arrives
through
Flume

•  Triggers
a
processing
event:

• Sessionize

• Enrich
–
Loca8on,
marke8ng
channel…

• Store
as
Parquet

•  Each
day
we
process
events
from
the
previous
day

Orchestra8ng
Clickstream

136
©
Cloudera,
Inc.
All
rights
reserved.

•  Workﬂow
is
fairly
simple

•  Need
to
trigger
workﬂow
based
on
data

•  Be
able
to
recover
from
errors

•  Perhaps
no8fy
on
the
status

•  And
collect
metrics
for
repor8ng

Choosing
Right

139
©
Cloudera,
Inc.
All
rights
reserved.

•  Part
of
all
major
Hadoop
distribu8ons

•  Hue
integra8on

•  Built
-‐in
ac8ons
–
Hive,
Sqoop,
MapReduce,
SSH

•  Complex
workﬂows
with
decisions

•  Event
and
8me
based
scheduling

•  No8ﬁca8ons

•  SLA
Monitoring

•  REST
API

Oozie
features

144
©
Cloudera,
Inc.
All
rights
reserved.

•  Workﬂow
is
fairly
simple

•  Need
to
trigger
workﬂow
based
on
data

•  Be
able
to
recover
from
errors

•  Perhaps
no8fy
on
the
status

•  And
collect
metrics
for
repor8ng

Choosing…

Easier
in
Oozie

145
©
Cloudera,
Inc.
All
rights
reserved.

•  Workflow
is
fairly
simple

•  Need
to
trigger
workflow
based
on
data

•  Be
able
to
recover
from
errors

•  Perhaps
no8fy
on
the
status

•  And
collect
metrics
for
repor8ng

Choosing
the
right
Orchestra8on
Tool

Beêr
in
Azkaban

150
©
Cloudera,
Inc.
All
rights
reserved.

Final
Architecture
–
High
Level
Overview

Data

Sources

Inges8on

Raw
Data

Storage

(Formats,

Schema)

Processed

Data
Storage

(Formats,

Schema)

Processing

Data

Consump8on

Orchestra8on

(Scheduling,

Managing,

Monitoring)

151
©
Cloudera,
Inc.
All
rights
reserved.

Final
Architecture
–
High
Level
Overview

Data

Sources

Inges8on

Raw
Data

Storage

(Formats,

Schema)

Processed

Data
Storage

(Formats,

Schema)

Processing

Data

Consump8on

Orchestra8on

(Scheduling,

Managing,

Monitoring)

152
©
Cloudera,
Inc.
All
rights
reserved.

Final
Architecture
–
Inges8on/Storage

Web
Server
Flume
Agent

Web
Server
Flume
Agent

Web
Server
Flume
Agent

Web
Server
Flume
Agent

Web
Server
Flume
Agent

Web
Server
Flume
Agent

Web
Server
Flume
Agent

Web
Server
Flume
Agent

Flume
Agent

Flume
Agent

Flume
Agent

Flume
Agent

Fan-‐in

Pa^ern

Mul8
Agents
for

Failover
and
rolling
restarts

HDFS

/etl/BI/casualcyclist/
clicks/rawlogs/
year=2014/month=10/
day=10!

153
©
Cloudera,
Inc.
All
rights
reserved.

Final
Architecture
–
High
Level
Overview

Data

Sources

Inges8on

Raw
Data

Storage

(Formats,

Schema)

Processed

Data
Storage

(Formats,

Schema)

Processing

Data

Consump8on

Orchestra8on

(Scheduling,

Managing,

Monitoring)

154
©
Cloudera,
Inc.
All
rights
reserved.

Final
Architecture
–
Processing
and
Storage

/etl/BI/
casualcyclist/clicks/
rawlogs/year=2014/
month=10/day=10

…

dedup-‐>ﬁltering-‐
>sessioniza8on

/data/bikeshop/
clickstream/
year=2014/
month=10/day=10

…

parque8ze

155
©
Cloudera,
Inc.
All
rights
reserved.

Final
Architecture
–
High
Level
Overview

Data

Sources

Inges8on

Raw
Data

Storage

(Formats,

Schema)

Processed

Data
Storage

(Formats,

Schema)

Processing

Data

Consump8on

Orchestra8on

(Scheduling,

Managing,

Monitoring)

Hadoop Application Architectures tutorial at Big DataService 2015

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (14)

Similar to Hadoop Application Architectures tutorial at Big DataService 2015

Similar to Hadoop Application Architectures tutorial at Big DataService 2015 (20)

Recently uploaded

Recently uploaded (20)

Hadoop Application Architectures tutorial at Big DataService 2015