This document discusses Hadoop architecture approaches for big data, specifically data lake architecture and Lambda architecture. It provides an overview of these architectures, including their core components and how they handle batch and real-time processing. A data lake architecture uses Hadoop for flexible storage of all data, while a Lambda architecture combines batch and real-time processing to provide views of both old and new data. The document also covers classifying big data by characteristics like processing type, data sources, and format to determine the appropriate architecture.
Difference between data warehouse and data mining (maxonlinetr)
The document discusses data warehousing, online analytical processing (OLAP), and data mining. It describes a data warehouse as a subject-oriented collection of integrated data used to support management decision making. The typical architecture involves extracting, transforming, and loading data from operational systems into a data warehouse for analysis. Dimensional data modeling, including star schemas, is used to design data warehouses to enable efficient ad-hoc querying. OLAP and data mining tools are then used to analyze the data for patterns and insights.
This document provides an overview of data warehousing and data mining. It begins by defining a data warehouse as a system that contains historical and cumulative data from single or multiple sources for simplifying reporting, analysis, and decision making. It describes three common data warehouse architectures and the key components of a data warehouse, including the database, ETL tools, metadata, query tools, and data marts. The document then defines data mining as extracting usable data from raw data using software to analyze patterns. It outlines descriptive and predictive data mining tasks and techniques like clustering, associations, summarization, prediction, and classification. Finally, it provides examples of data mining applications and discusses how AWS services like Amazon Redshift can provide scalable data warehousing
The document discusses key concepts related to data warehousing including:
1) What data warehousing is, its main components, and differences from OLTP systems.
2) The typical architecture of a data warehouse including operational data sources, storage, and end-user access tools.
3) Important considerations like data flows, integration, management of metadata, and tools/technologies used.
4) Additional topics such as benefits, challenges, administration, and data marts.
This PPT contains definitions of data warehouse, data, and warehouse; data modeling; data warehouse architecture and its types; and data warehouse tiers: single-tier, two-tier, and three-tier.
This document outlines the objectives and units of study for a course on data warehousing and mining. The 5 units cover: 1) data warehousing components and architecture; 2) business analysis tools; 3) data mining tasks and techniques; 4) association rule mining and classification; and 5) clustering applications and trends in data mining. Key topics include extracting, transforming, and loading data into a data warehouse; using metadata and query/reporting tools; building dependent data marts; and applying data mining techniques like classification, clustering, and association rule mining. The course aims to introduce these concepts and their real-world implications.
This document discusses big data and Hadoop. It defines big data as high volume data that cannot be easily stored or analyzed with traditional methods. Hadoop is an open-source software framework that can store and process large data sets across clusters of commodity hardware. It has two main components - HDFS for storage and MapReduce for distributed processing. HDFS stores data across clusters and replicates it for fault tolerance, while MapReduce allows data to be mapped and reduced for analysis.
What is a Data Warehouse? OLTP vs. OLAP, Conceptual Modeling of Data Warehouses, Data Warehousing Components, Building a Data Warehouse, Mapping the Data Warehouse to a Multiprocessor Architecture, Database Architectures for Parallel Processing
This document provides an overview of data mining and data warehousing. It discusses the history and evolution of databases from the 1960s to today. Data mining is defined as using automated tools to extract hidden patterns from large databases to address the problem of data explosion. Descriptive and predictive models are used in data mining. Data warehousing involves integrating data from multiple sources into a centralized database to support analysis and decision making.
IRJET - A Prognosis Approach for Stock Market Prediction based on Term Streak... (IRJET Journal)
This document discusses using machine learning algorithms and Hadoop to predict stock market performance based on historical stock data. It proposes a model that would collect stock data from various sources, preprocess the data to clean it, cluster the data using MapReduce, and then use a support vector machine algorithm to analyze the clustered data and generate stock predictions. The model is designed to take advantage of Hadoop's ability to process large datasets in parallel across multiple servers or clusters. The goal is to more accurately predict stock prices and identify market trends based on analyzing huge amounts of historical stock market data.
Iaetsd mapreduce streaming over cassandra datasets (Iaetsd Iaetsd)
This document discusses processing large datasets from Denmark's traffic using Apache Cassandra and MapReduce. It begins with an introduction to big data and how the volume, velocity, and variety of data requires alternative processing methods. Apache Cassandra is introduced as a distributed and scalable NoSQL database for storing large amounts of structured and unstructured data across servers. The document then discusses Cassandra's data model and system architecture. It describes how MapReduce can be used for distributed processing of datasets stored in Cassandra. The paper aims to process traffic datasets from Denmark using Cassandra and MapReduce to help the transportation department monitor traffic.
What is Data Mining? Data Mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is the procedure of mining knowledge from data.
Data mining involves analyzing large amounts of data to discover patterns that can be used for purposes such as increasing sales, reducing costs, or detecting fraud. It allows companies to better understand customer behavior and develop more effective marketing strategies. Common data mining techniques used by retailers include loyalty programs to track purchasing patterns and target customers with personalized coupons. Data mining software uses techniques like classification, clustering, and prediction to analyze data from different perspectives and extract useful information and patterns.
Application of Data Warehousing & Data Mining to Exploitation for Supporting ... (Gihan Wikramanayake)
M G N A S Fernando, G N Wikramanayake (2004). "Application of Data Warehousing and Data Mining to Exploitation for Supporting the Planning of Higher Education System in Sri Lanka". In: 23rd National Information Technology Conference, pp. 114-120. Computer Society of Sri Lanka (CSSL), Colombo, Sri Lanka, Jul 8-9. ISBN: 955-9155-12-1.
The document defines and describes key concepts related to data warehousing. It provides definitions of data warehousing, data warehouse features including being subject-oriented, integrated, and time-variant. It discusses why data warehousing is needed, using scenarios of companies wanting consolidated sales reports. The 3-tier architecture of extraction/transformation, data warehouse storage, and retrieval is covered. Data marts are defined as subsets of the data warehouse. Finally, the document contrasts databases with data warehouses and describes OLAP operations.
This white paper will present the opportunities laid down by the data lake and advanced analytics, as well as the challenges in integrating, mining and analyzing the data collected from these sources. It goes over the important characteristics of the data lake architecture and the Data and Analytics as a Service (DAaaS) model. It also delves into the features of a successful data lake and its optimal design, and goes over the data, applications, and analytics that are strung together to speed up the insight-brewing process for industry improvements with the help of a powerful architecture for mining and analyzing unstructured data: the data lake.
The document discusses data warehousing concepts including:
1) A data warehouse is a subject-oriented, integrated, and non-volatile collection of data used for decision making. It stores historical and current data from multiple sources.
2) The architecture of a data warehouse is typically three-tiered, with an operational data tier, data warehouse/data mart tier for storage, and client access tier. OLAP servers allow analysis of stored data.
3) ROLAP and MOLAP refer to relational and multidimensional approaches for OLAP. ROLAP dynamically generates data cubes from relational databases, while MOLAP pre-calculates and stores aggregated data in multidimensional structures.
A data warehouse consists of several key components:
- Current detail data from operational systems of record which is stored for analysis.
- Integration and transformation programs that convert operational data into a common format for the data warehouse.
- Summarized and archived data used for reporting and analysis over time.
- Metadata that describes the structure and meaning of the data.
Data warehouses are used for standard reporting, queries on summarized data, and data mining of patterns in large datasets to gain business insights.
Data warehousing combines data from multiple sources into a single database to provide businesses with analytics results from data mining, OLAP, scorecarding and reporting. It extracts, transforms and loads data from operational data stores and data marts into a data warehouse and staging area to integrate and store large amounts of corporate data. Data mining analyzes large databases to extract previously unknown and potentially useful patterns and relationships to improve business processes.
The document defines data mining as extracting useful information from large datasets. It discusses two main types of data mining tasks: descriptive tasks like frequent pattern mining and classification/prediction tasks like decision trees. Several data mining techniques are covered, including association, classification, clustering, prediction, sequential patterns, and decision trees. Real-world applications of data mining are also outlined, such as market basket analysis, fraud detection, healthcare, education, and CRM.
Big data analytics tools from vendors like IBM, Tableau, and SAS can help organizations process and analyze big data. For smaller organizations, Excel is often used, while larger organizations employ data mining, predictive analytics, and dashboards. Business intelligence applications include OLAP, data mining, and decision support systems. Big data comes from many sources like web logs, sensors, social networks, and scientific research. It is defined by the volume, variety, velocity, veracity, variability, and value of the data. Hadoop and MapReduce are common technologies for storing and analyzing big data across clusters of machines. Stream analytics is useful for real-time analysis of data like sensor data.
The document provides an overview of the key components and considerations for building a data warehouse. It discusses 7 main components: 1) the data warehouse database, 2) sourcing, acquisition, cleanup and transformation tools, 3) metadata, 4) access (query) tools, 5) data marts, 6) data warehouse administration and management, and 7) information delivery systems. It also outlines important design considerations, technical considerations, and implementation considerations that must be addressed when building a data warehouse environment.
A computer database is a collection of logically related data stored in a computer system, so that a computer program or a person using a query language can use it to answer queries. An operational database (OLTP) contains up-to-date, modifiable, application-specific data. A data warehouse (OLAP) is a subject-oriented, integrated, time-variant and non-volatile collection of data used to make business decisions. The Hadoop Distributed File System (HDFS) allows storing large amounts of data on a cloud of machines. In this paper, we surveyed the literature related to operational databases, data warehouses and Hadoop technology.
Infrastructure Considerations for Analytical Workloads (Cognizant)
Using Apache Hadoop clusters and Mahout for analyzing big data workloads yields extraordinary performance; we offer a detailed comparison of running Hadoop in a physical vs. virtual infrastructure environment.
The document discusses data warehousing, data mining, and business intelligence applications. It explains that data warehousing organizes and structures data for analysis, and that data mining involves preprocessing, characterization, comparison, classification, and forecasting of data to discover knowledge. The final stage is presenting discovered knowledge to end users through visualization and business intelligence applications.
This document defines a data warehouse as a collection of corporate information derived from operational systems and external sources to support business decisions rather than operations. It discusses the purpose of data warehousing to realize the value of data and make better decisions. Key components like staging areas, data marts, and operational data stores are described. The document also outlines evolution of data warehouse architectures and best practices.
This document provides an overview of key concepts related to decision support systems (DSS) and data warehousing. It defines DSS as interactive computer systems that help decision makers use data, documents, models and communication technologies to identify and solve problems. It then discusses operational databases and how they differ from data warehouses in areas like data type, focus, users and more. Finally, it defines key characteristics of a data warehouse as being subject-oriented, integrated, time-variant and non-volatile to support management decision making.
The Big Data Importance – Tools and their Usage (IRJET Journal)
This document discusses big data, tools for analyzing big data, and opportunities that big data analytics provides. It begins by defining big data and its key characteristics of volume, variety and velocity. It then discusses tools for storing, managing and processing big data like Hadoop, MapReduce and HDFS. Finally, it outlines how big data analytics can be applied across different domains to enable new insights and informed decision making through analyzing large datasets.
About Streaming Data Solutions for Hadoop (Lynn Langit)
This document discusses selecting the best approach for fast big data and streaming analytics projects. It describes key considerations for the architectural design phases such as scalable ingestion, real-time ETL, analytics, alerts and actions, and visualization. Component selection factors include the overall architecture, enterprise-grade streaming engine, ease of use and development, and management/DevOps. The document provides definitions of relevant technologies and compares representative solutions to help identify the best fit based on an organization's needs and skills.
Lecture4 big data technology foundations (hktripathy)
The document discusses big data architecture and its components. It explains that big data architecture is needed when analyzing large datasets over 100GB in size or when processing massive amounts of structured and unstructured data from multiple sources. The architecture consists of several layers including data sources, ingestion, storage, physical infrastructure, platform management, processing, query, security, monitoring, analytics and visualization. It provides details on each layer and their functions in ingesting, storing, processing and analyzing large volumes of diverse data.
This document discusses web data extraction and analysis using Hadoop. It begins by explaining that web data extraction involves collecting data from websites using tools like web scrapers or crawlers. Next, it describes that the data extracted is often large in volume and requires processing tools like Hadoop for analysis. The document then provides details about using MapReduce on Hadoop to analyze web data in a parallel and distributed manner by breaking the analysis into mapping and reducing phases.
Big Data Processing with Hadoop: A Review (IRJET Journal)
1. This document provides an overview of big data processing with Hadoop. It defines big data and describes the challenges of volume, velocity, variety and variability.
2. Traditional data processing approaches are inadequate for big data due to its scale. Hadoop provides a distributed file system called HDFS and a MapReduce framework to address this.
3. HDFS uses a master-slave architecture with a NameNode and DataNodes to store and retrieve file blocks. MapReduce allows distributed processing of large datasets across clusters through mapping and reducing functions.
The document discusses Big Data architectures and Oracle's solutions for Big Data. It provides an overview of key components of Big Data architectures, including data ingestion, distributed file systems, data management capabilities, and Oracle's unified reference architecture. It describes techniques for operational intelligence, exploration and discovery, and performance management in Big Data solutions.
BD_Architecture and Charateristics.pptx.pdf (eramfatima43)
A big data architecture handles large and complex data through batch processing, real-time processing, interactive exploration, and predictive analytics. It includes data sources, storage, batch and stream processing, an analytical data store, and analysis/reporting tools. Orchestration tools automate workflows that transform data between components. Consider this architecture for large volumes of data, real-time data streams, and machine learning/AI applications. It provides scalability, performance, and integration with existing solutions, though complexity, security, and specialized skills are challenges.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It allows for the reliable, scalable and distributed processing of large datasets. Hadoop consists of Hadoop Distributed File System (HDFS) for storage and Hadoop MapReduce for processing vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. HDFS stores data reliably across machines in a Hadoop cluster and MapReduce processes data in parallel by breaking the job into smaller fragments of work executed across cluster nodes.
Big data is characterized by 3 V's - volume, velocity, and variety. It refers to large and complex datasets that are difficult to process using traditional database management tools. Key technologies to handle big data include distributed file systems, Apache Hadoop, data-intensive computing, and tools like MapReduce. Common tools used are infrastructure management tools like Chef and Puppet, monitoring tools like Nagios and Ganglia, and analytics platforms like Netezza and Greenplum.
The document discusses big data analysis and provides an introduction to key concepts. It is divided into three parts: Part 1 introduces big data and Hadoop, the open-source software framework for storing and processing large datasets. Part 2 provides a very quick introduction to understanding data and analyzing data, intended for those new to the topic. Part 3 discusses concepts and references to use cases for big data analysis in the airline industry, intended for more advanced readers. The document aims to familiarize business and management users with big data analysis terms and thinking processes for formulating analytical questions to address business problems.
This document provides an overview of big data. It begins with an introduction that defines big data as massive, complex data sets from various sources that are growing rapidly in volume and variety. It then discusses the brief history of big data and provides definitions, describing big data as data that is too large and complex for traditional data management tools. The document outlines key aspects of big data including the sources, types, applications, and characteristics. It discusses how big data is used in business intelligence to help companies make better decisions. Finally, it describes the key aspects a big data platform must address such as handling different data types, large volumes, and analytics.
Enterprise Data Lake: How to Conquer the Data Deluge and Derive Insights that Matters
Data can be traced from various consumer sources, and managing data is one of the most serious challenges faced by organizations today. Organizations are adopting the data lake model because lakes provide raw data that users can use for data experimentation and advanced analytics. A data lake can be a merging point of new and historic data, drawing correlations across all data using advanced analytics. A data lake can also support self-service data practices, tapping undiscovered business value from new as well as existing data sources. Furthermore, a data lake can aid in modernizing data warehousing, analytics, and data integration. However, lakes also face hindrances like immature governance, user skills and security.
Gdpr ccpa automated compliance - spark java application features and functi... (Steven Meister)
This PowerPoint covers the critical aspects and needs present in any project designed to meet regulatory requirements such as GDPR and CCPA.
Big data analytics (BDA) involves examining large, diverse datasets to uncover hidden patterns, correlations, trends, and insights. BDA helps organizations gain a competitive advantage by extracting insights from data to make faster, more informed decisions. It supports a 360-degree view of customers by analyzing both structured and unstructured data sources like clickstream data. Businesses can leverage techniques like machine learning, predictive analytics, and natural language processing on existing and new data sources. BDA requires close collaboration between IT, business users, and data scientists to process and analyze large datasets beyond typical storage and processing capabilities.
DOCUMENT SELECTION USING MAPREDUCE, Yenumula B Reddy and Desmond Hill (ClaraZara1)
Big data refers to structured, unstructured and semi-structured large volumes of data which are difficult to manage and costly to store. Using exploratory analysis techniques to understand such raw data, while carefully balancing the benefits in terms of storage and retrieval techniques, is an essential part of big data. The research discusses MapReduce issues, a framework for the MapReduce programming model, and its implementation. The paper includes the analysis of big data using MapReduce techniques and the identification of a required document from a stream of documents. Identifying a required document is part of security in a stream of documents in the cyber world; the document may be significant in business, medical, social, or terrorism contexts.
This document provides an overview of big data fundamentals and considerations for setting up a big data practice. It discusses key big data concepts like the four V's of big data. It also outlines common big data questions around business context, architecture, skills, and presents sample reference architectures. The document recommends starting a big data practice by identifying use cases, gaining management commitment, and setting up a center of excellence. It provides an example use case of retail web log analysis and presents big data architecture patterns.
Become Data Driven With Hadoop as-a-Service (Mammoth Data)
This presentation gives an overview of what it means to be a data-driven company, the pros and cons of becoming data driven, and a few software tools used in data management.
Stream Meets Batch for Smarter Analytics - Impetus White Paper (Impetus Technologies)
For Impetus' White Papers archive, visit http://www.impetus.com/whitepaper
The paper discusses how the traditional batch and real-time paradigms can work together to deliver smarter, quicker and better insights on large volumes of data by picking the right strategy and the right technology.
This document provides an overview of Oracle's Information Management Reference Architecture. It includes a conceptual view of the main architectural components, several design patterns for implementing different types of information management solutions, a logical view of the components in an information management system, and descriptions of how data flows through ingestion, interpretation, and different data layers.
Table of Contents

Executive Summary
Big Data Classification
Hadoop-based Architecture Approaches
    Data Lake
    Lambda
    Choosing the Correct Architecture
Data Lake Architecture
    Generic Data Lake Architecture
    Steps Involved
Lambda Architecture
    Batch Layer
    Serving Layer
    Speed Layer
    Generic Lambda Architecture
References
EXECUTIVE SUMMARY

Apache Hadoop didn't disrupt the datacenter; the data did. Shortly after corporate IT functions within enterprises adopted large-scale systems to manage data, the Enterprise Data Warehouse (EDW) emerged as the logical home of all enterprise data. Today, every enterprise has a Data Warehouse that serves to model and capture the essence of the business from its enterprise systems.

The explosion of new types of data in recent years, from inputs such as the web and connected devices, or just sheer volumes of records, has put tremendous pressure on the EDW. In response to this disruption, an increasing number of organizations have turned to Apache Hadoop to help manage the enormous increase in data whilst maintaining coherence of the Data Warehouse.

This POV discusses Apache Hadoop and its capabilities as a data platform and processing system, and how the core of Hadoop and its surrounding ecosystem meets the enterprise requirements to integrate alongside the Data Warehouse and other enterprise data systems as part of a modern data architecture: a step on the journey toward delivering an enterprise 'Data Lake' or a Lambda Architecture (immutable data + views).

An enterprise data lake provides the following core benefits to an enterprise:

• New efficiencies for data architecture through a significantly lower cost of storage, and through optimization of data processing workloads such as data transformation and integration.

• New opportunities for business through flexible 'schema-on-read' access to all enterprise data, and through multi-use and multi-workload data processing on the same sets of data, from batch to real time.

Apache Hadoop provides both reliable storage (HDFS) and a processing system (MapReduce) for large data sets across clusters of computers. MapReduce is a batch query processor targeted at long-running background processes. Hadoop can handle Volume. But to handle Velocity, we need real-time processing tools that can compensate for the high latency of batch systems and serve the most recent data continuously, as new data arrives and older data is progressively integrated into the batch framework. The answer to this problem is the Lambda Architecture.
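To make the batch-processing model concrete, here is a minimal sketch of the classic word-count job written for Hadoop Streaming, which lets any executable process HDFS-resident text; the script names are illustrative assumptions, not part of the original paper.

```python
#!/usr/bin/env python3
# mapper.py: Hadoop Streaming mapper; emits "word<TAB>1" for every word on stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py: Hadoop Streaming reducer; input arrives sorted by key, so counts
# for the same word are adjacent and can be summed in a single pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

A job like this is typically submitted with the hadoop-streaming jar that ships with Hadoop (the exact path varies by distribution), pointing -input and -output at HDFS directories. The long job-startup and shuffle phases of such a job are precisely the batch latency that the Lambda Architecture's speed layer is meant to compensate for.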
Big Data Classification

Processing Type: Batch; Near Real Time; Real Time + Batch
Processing Methodology: Prescriptive; Predictive; Diagnostic; Descriptive
Data Frequency: On demand; Continuous; Real Time; Batch
Data Type: Transactional; Historical; Master data; Metadata
Content Format: Structured; Unstructured (images, text, videos, documents, emails, etc.); Semi-structured (XML, JSON)
Data Sources: Machine generated; Web & social media; IoT; Human generated; Transactional data; Via other data providers
It's helpful to look at the characteristics of big data along certain lines: for example, how the data is collected, analyzed, and processed. Once the data and its processing are classified, they can be matched with the appropriate big data analysis architecture:

• Processing type - Whether the data is analyzed in real time or batched for later analysis. Give careful consideration to choosing the analysis type, since it affects several other decisions about products, tools, hardware, data sources, and expected data frequency. A mix of both types ("near real time" or micro-batch) may also be required by the use case.

• Processing methodology - The type of technique to be applied for processing data (e.g., predictive, analytical, ad-hoc query, and reporting). Business requirements determine the appropriate processing methodology, and a combination of techniques can be used. The choice of processing methodology helps identify the appropriate tools and techniques to be used in your big data solution.

• Data frequency and size - How much data is expected, and at what frequency it arrives. Knowing frequency and size helps determine the storage mechanism, storage format, and the necessary preprocessing tools. Data frequency and size depend on data sources: on demand, as with social media data; continuous feed, real time (weather data, transactional data); or time series (time-based data).

• Data type - The type of data to be processed: transactional, historical, master data, and others. Knowing the data type helps segregate the data in storage.

• Content format - The format of incoming data: structured (RDBMS, for example), unstructured (audio, video, and images, for example), or semi-structured. Format determines how the incoming data needs to be processed and is key to choosing tools and techniques and to defining a solution from a business perspective.

• Data source - The sources of the data (where the data is generated): web and social media, machine-generated, human-generated, etc. Identifying all the data sources helps determine the scope from a business perspective.
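As an illustrative sketch only (the categories and rules below are assumptions layered on the paper's classification, not part of it), a classified workload could be matched to an architecture roughly like this:

```python
# Hypothetical rule-of-thumb matcher from classification attributes to an
# architecture; the field values and decision rules are illustrative.
from dataclasses import dataclass

@dataclass
class Workload:
    processing_type: str    # "batch", "near_real_time", or "real_time_plus_batch"
    needs_historical: bool  # queries must span both old and fresh data

def choose_architecture(w: Workload) -> str:
    # Pure batch analysis over raw data fits a plain data lake.
    if w.processing_type == "batch":
        return "data lake"
    # Low-latency views over both fresh and historical data fit Lambda.
    if w.processing_type == "real_time_plus_batch" or w.needs_historical:
        return "lambda"
    # Real-time-only needs can be served by a streaming engine over the lake.
    return "data lake + streaming engine (e.g., Storm)"

print(choose_architecture(Workload("real_time_plus_batch", True)))  # -> lambda
```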
Hadoop-based architecture approaches

Data Lake

A data lake is a set of centralized repositories containing vast amounts of raw data (either structured or unstructured), described by metadata, organized into identifiable data sets, and available on demand. Data in the lake supports discovery, analytics, and reporting, usually by deploying cluster tools like Hadoop.

Lambda

Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. This approach attempts to balance latency, throughput, and fault tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation. The rise of the lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of MapReduce.
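A minimal sketch of the three Lambda layers, assuming a simple page-view counting use case (the data model and function names are illustrative, not from the paper; real systems expire speed-layer state more carefully than this):

```python
# Immutable master dataset, a batch view recomputed from scratch, a speed view
# updated incrementally, and a serving layer that merges both at query time.
from collections import defaultdict

master_dataset = []             # immutable, append-only raw events
batch_view = defaultdict(int)   # precomputed on a schedule by the batch layer
speed_view = defaultdict(int)   # covers events since the last batch run

def run_batch_layer():
    """Recompute the batch view from the entire master dataset."""
    batch_view.clear()
    for event in master_dataset:
        batch_view[event["page"]] += 1
    speed_view.clear()          # recent data is now covered by the batch view

def ingest(event):
    """New data lands in both the master dataset and the speed layer."""
    master_dataset.append(event)
    speed_view[event["page"]] += 1

def query(page):
    """Serving layer: merge the batch and speed views."""
    return batch_view[page] + speed_view[page]

ingest({"page": "/home"})
run_batch_layer()
ingest({"page": "/home"})
print(query("/home"))  # 2: one from the batch view, one from the speed view
```

The key property is that the speed layer only has to cover the window since the last batch run, so its state stays small, while the batch layer periodically recomputes exact views from the immutable master dataset.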
Choosing the correct architecture
7. 6
Parameter
Data
Lake
Lambda
Simultaneous
access
to
Real
time
and
Batch
data
Data
Lake
can
use
real
time
processing
technologies
like
Storm
to
return
real
time
results,
however
in
such
a
scenario
historical
results
cannot
be
made
available.
If
we
use
technologies
like
Spark
to
process
data,
real
time
data
and
historical
data,
on
request
there
can
be
significant
delays
in
response
time
to
clients
as
compared
to
Lambda
architecture.
Lambda
Architecture’s
Serving
Layer
merges
the
output
of
Batch
Layer
and
Speed
Layer,
before
sending
the
results
of
user
queries.
As
data
is
already
processed
into
views
at
both
the
layers,
the
response
time
is
significantly
less.
Latency
Latency
is
high
as
compared
to
Lambda,
as
real
time
data
need
to
be
processed
with
historical
data
on-‐demand
or
as
a
part
of
batch.
Low-‐latency
real
time
results
are
processed
by
Speed
layer
and
Batch
results
are
pre-‐
processed
in
Batch
layer.
On
request,
both
the
results
are
just
merged,
there
by
resulting
low
latency
time
for
real
time
processing.
Ease
of
Data
Governance
Data
lake
is
coined
to
convey
the
concept
of
centralized
repository
containing
virtually
inexhaustible
amounts
of
raw
data
(or
minimally
curated)
data
that
is
readily
made
available
anytime
to
anyone
authorized
to
perform
analytical
activities.
Lambda
architecture’s
serving
layer
gives
access
to
processed
and
analyzed
data.
As
uses
get
access
to
processed
data
directly,
it
can
lead
to
top
down
data
governance
issues.
Updates
in
source
data
As
data
lake
stores
only
raw
data,
updates
are
just
appended
to
raw
data,
thereby
makes
life
of
business
users
difficult
to
write
business
logic,
in
such
a
way
that
latest
updated
records
are
considered
in
calculations.
Batch
Views
are
always
computed
from
starch
in
Lambda
Architecture.
As
a
result,
updates
can
be
easily
incorporated
in
calculated
Views
in
each
reprocess
batch
cycle.
Fault
tolerance
against
human
errors
Data
Scientist
or
business
users,
running
business
logic
on
relevant
raw
data
in
Data
Lake
might
lead
to
human
errors.
Although,
re-‐covering
from
those
errors
is
not
difficult
as
it’s
just
a
matter
of
re-‐running
the
logic.
However,
the
reprocessing
time
for
large
datasets
might
lead
to
some
delays.
Lambda
architecture
assures
fault
tolerance
not
only
against
hardware
failures
but
against
human
errors.
Re-‐computation
of
views
every
time
from
raw
data
in
batch
layer,
insures
that
any
human
errors
in
business
logic
would
not
be
cascaded
to
a
level
where
it’s
unrecoverable.
Ease
of
business
users
Data
is
stored
in
raw
format,
Data
is
processed
and
available
8. 7
with
data
definitions
and
sometime
groomed
to
make
digestible
by
data
management
tools.
At
times,
it
difficult
for
business
users
to
use
data
in
as-‐
is
conditions.
from
Serving
makes
life
easy
for
business
users.
Accuracy
for
real
time
results
Irrespective
of
any
scenario,
users
accessing
data
from
Data
Lake
has
access
to
immutable
raw
data,
they
can
do
exact
computations,
thereby
always
get
the
accurate
results.
In
scenarios,
where
real
time
calculations
need
to
access
historical
data,
which
is
not
possible,
Lambda
architecture
would
return
you
estimated
results.
For
example,
calculation
of
mean
value,
cannot
be
achieved
until
whole
historical
data
and
real
time
data
is
referenced
at
one
go.
In
such
a
scenario,
serving
layer
would
return
estimated
results.
Infrastructure
Cost
Data
lake
architecture
process
the
data
as
and
when
need
and
thereby
the
cluster
cost
can
be
much
less
as
compared
to
Lambda.
Moreover,
it
only
persist
the
raw
data
however
Lambda
architecture
not
only
persist
the
raw
data
but
processed
data
too.
This
leads
to
extra
storage
cost
in
Lambda
architecture.
Lambda
architecture
data
processing
life
cycle
is
designed
in
such
a
fashion
that
as
soon
the
one
cycle
of
batch
process
is
finished,
it
starts
a
new
cycle
of
batch
processing
which
includes
the
recently
inserted
data.
Simultaneously,
the
speed
layer
is
always
processing
the
real
time
data.
OLAP
Data Lake: Unlike data marts, which are optimized for data analysis by storing only some attributes and dropping data below the level of aggregation, a data lake is designed to retain all attributes, especially when you do not yet know what the scope of the data or its use will be.
Lambda: As Lambda exposes the processed views from the serving layer, all the attributes of the data may at times not be available to a data scientist for running analytical queries.
Historical data reference for processing
Data Lake: OLAP and OLTP queries access the raw or groomed data directly from the data lake, making it feasible to access and refer to historical data while processing data for a given time interval.
Lambda: The Speed layer has no reference to the historical data stored in the batch layer, making it difficult to run queries that refer to historical data. For example, 'unique count' type queries cannot return correct results from the Speed layer alone. However, 'calculating average' type queries can be answered easily by the Serving layer, by generating the average of the results returned from the Speed and Batch layers on the fly.
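A small sketch of why averages merge cleanly across layers while unique counts do not; representing each layer's partial result as a (sum, count) pair is our assumption, not something the document prescribes:

```python
# Averages merge across layers because sum and count are both additive.
def merged_average(batch_sum, batch_count, speed_sum, speed_count):
    return (batch_sum + speed_sum) / (batch_count + speed_count)

print(merged_average(9_000, 300, 110, 10))  # exact mean over all the data

# Unique counts do NOT merge this way: the speed layer cannot know which
# of its values were already seen historically, so
#   unique(batch) + unique(speed)
# overcounts every value that appears in both layers.
```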
Slowly Changing Dimensions
Data Lake: Although the data lake has records of the changed dimension attributes, extra business logic needs to be written by business users to cater for them.
Lambda: Lambda architecture can easily cater for slowly changing dimensions by creating surrogate keys parallel to the natural keys whenever a change is detected in dimension attributes during the batch layer processing cycle.
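A hedged sketch of that surrogate-key idea (a Type 2 slowly changing dimension); the table and field names are illustrative, not from the document:

```python
# Each detected change to a dimension row gets a new surrogate key, while
# the natural key stays constant, preserving the change history.
dimension = []          # all dimension versions
next_surrogate = [1]

def upsert_customer(natural_key, city):
    current = next((r for r in dimension
                    if r["natural_key"] == natural_key and r["is_current"]), None)
    if current and current["city"] == city:
        return                          # no change detected
    if current:
        current["is_current"] = False   # close out the previous version
    dimension.append({"surrogate_key": next_surrogate[0],
                      "natural_key": natural_key,
                      "city": city,
                      "is_current": True})
    next_surrogate[0] += 1

upsert_customer("C42", "Pune")
upsert_customer("C42", "Mumbai")   # change detected -> new surrogate key
print(dimension)                   # both versions retained
```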
Slowly Changing Facts
Data Lake: In the data lake, both versions of a fact are available for users to look at; this can lead to good analytical results if the fact life cycle is an attribute in the business logic for data analytics.
Lambda: Although it is easy to change the facts in Lambda architecture, this leads to a loss of information about the fact life cycle. As the previous state of a fact is not available to the data scientist, analytical queries might not give the desired results on the views exposed by the Serving layer.
Frequently changing business logic
Data Lake: Changes in the processing code need to be made, but there is no clear solution for how the historically processed data should be handled.
Lambda: As data is re-processed from scratch, even if the business logic changes frequently, the historical-data problem is resolved automatically.
Implementation lifecycle
Data Lake: A data lake is fast to implement, as it eliminates the dependency on upfront data modeling.
Lambda: Processing logic needs to be implemented at both the batch and speed layers, leading to significant implementation time as compared to a data lake.
Adding new data sources
Data Lake: Very easy to add.
Lambda: New sources need to be incorporated into the processing layers and would require code changes.
"If you think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples."
– James Dixon (Pentaho CTO)
Data Lake Architecture
Much of today's research and decision making is based on knowledge and insight that can be gained from analyzing and contextualizing the vast (and growing) amount of "open" or "raw" data. The concept that the large number of data sources available today facilitates analyses on combinations of heterogeneous information that would not be achievable via "siloed" data maintained in warehouses is very powerful.
The term data lake has been coined to convey the concept of a centralized repository containing virtually inexhaustible amounts of raw (or minimally curated) data that is readily made available anytime to anyone authorized to perform analytical activities.
A data lake is a set of centralized repositories containing vast amounts of raw data (either structured or unstructured), described by metadata, organized into identifiable data sets, and available on demand. Data in the lake supports discovery, analytics, and reporting, usually by deploying cluster tools like Hadoop.
Unlike traditional warehouses, the format of the data is not described (that is, its schema is not available) until the data is needed. By delaying the categorization of data from the point of entry to the point of use, analytical operations that transcend the rigid format of an adopted schema become possible. Query and search operations on the data can be performed using traditional database technologies (when structured), as well as via alternate means such as indexing and NoSQL derivatives.
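A minimal sketch of this "schema on read" idea, assuming raw events stored as JSON lines; the file name and field names are hypothetical:

```python
# Raw events are stored as-is; a schema (the list of wanted fields) is
# applied only at the point of use, not at the point of entry.
import json

def read_events(path, fields):
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            # Missing fields surface as None instead of failing at load time.
            yield {field: record.get(field) for field in fields}

# Two consumers can impose two different schemas on the same raw file:
# list(read_events("events.jsonl", ["user_id", "url"]))
# list(read_events("events.jsonl", ["user_id", "latency_ms"]))
```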
Key
Features
• Stores
Raw
data
–
Single
source
of
truth
• Data
accessible
to
anyone
authorized
• Polyglot
Persistence
• Support
multiple
applications
&
Workloads
• Low
Cost,
High
Performance
storage
• Flexible,
easy
to
use
data
organization
• Self-‐service
end-‐user
• More
Flexible
to
answer
new
questions
• Easy
to
add
new
data
sources
• Loosely
coupled
architecture
–
enables
flexibility
of
analysis
• Eliminating
dependency
of
data
modeling
upfront
–
thereby
fast
to
implement
• Storage
is
highly
optimized
as
raw
data
is
stored
Disadvantages
• High latency for a composite analysis view of both real-time and historical data
• Raw data provides no relational structure, which is unfriendly for on-the-fly business analytics
In a practical sense, a data lake is characterized by three key attributes:
• Collect everything: A data lake contains all data – both raw sources over extended periods of time, and any processed data.
• Dive in anywhere: A data lake enables users across multiple business units to refine, explore and enrich data on their terms.
• Flexible access: A data lake enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory and other processing engines.
Generic Data Lake Architecture
[Diagram: data sources (real-time, micro-batch, and mega-batch feeds from desktop & mobile, social media and cloud, operational systems, and the Internet of Things) enter through an ingestion tier into HDFS storage holding unstructured and structured data. A unified data management tier (data management, data access) and a processing tier (workflow management; in-memory, MapReduce/Hive/MPP) use schematic metadata and grooming to turn raw data into processed data, while a query interface (SQL, NoSQL, external storage) and a centralized management system (system monitoring, system management) deliver real-time, interactive, and batch insights with flexible actions.]
Steps Involved
• Procuring data – the process of obtaining data and metadata and preparing them for eventual inclusion in a data lake.
• Obtaining data – transferring the data physically from the source to the data lake.
• Describing data – a data scientist searching a data lake for useful data must be able to find the data relevant to his or her need; for this, they require metadata about the data. Schematic metadata for a data set would include information about how the data is formatted and about its schema.
• Grooming data – the process by which raw data is made consumable by analytics applications. In some scenarios the grooming process uses schematic metadata to transform raw data into data that can be processed by standard data management tools.
• Provisioning data – the authentication and authorization policies by which consumers take data out of the data lake.
• Preserving data – managing a data lake also requires attention to maintenance issues such as staleness, expiration, decommissions and renewals.
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation.
Lambda Architecture
The Lambda architecture is split into three layers: the batch layer, the serving layer, and the speed layer.
1. Batch layer (Apache Hadoop)
2. Serving layer (Cloudera Impala, Spark)
3. Speed layer (Storm, Spark, Apache HBase, Cassandra)
Key Features
• Low-latency simultaneous analysis of the (near) real-time information extracted from a continuous inflow of data, alongside persistent analysis of a massive volume of data
• Fault tolerant not only against hardware failure but against human error too
• Mistakes are corrected by re-computations
• Storage is highly optimized, as raw data is stored
Batch Layer
The batch layer is responsible for two things. The first is to store the immutable, constantly growing master dataset (HDFS), and the second is to compute arbitrary views from this dataset (MapReduce). Computing the views is a continuous operation, so when new data arrives it will be aggregated into the views when they are recomputed during the next MapReduce iteration. The views should be computed from the entire dataset, and therefore the batch layer is not expected to update the views frequently. Depending on the size of your dataset and cluster, each iteration could take hours.
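As a hedged illustration of "computing views from the entire dataset", the in-process Python below mimics the map/shuffle/reduce shape; a real batch layer would run Hadoop MapReduce over HDFS, and the record fields here are invented:

```python
from collections import defaultdict

def map_phase(records):
    for record in records:            # e.g. one pageview event per record
        yield record["url"], 1

def reduce_phase(pairs):
    totals = defaultdict(int)
    for key, value in pairs:          # shuffle + reduce: sum per key
        totals[key] += value
    return dict(totals)

master_dataset = [{"url": "/home"}, {"url": "/docs"}, {"url": "/home"}]
batch_view = reduce_phase(map_phase(master_dataset))
print(batch_view)                     # {'/home': 2, '/docs': 1}
```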
Serving layer
The output from the batch layer is a set of flat files containing the precomputed views. The serving layer is responsible for indexing and exposing the views so that they can be queried. The batch and serving layers alone do not satisfy any realtime requirement, however, because MapReduce is (by design) high-latency and it could take a few hours for new data to be represented in the views and propagated to the serving layer. This is why we need the speed layer.
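One hedged reading of "indexing and exposing the views": load the flat batch-view files into an in-memory map so point queries avoid file scans. The file name and key,value format below are hypothetical:

```python
import csv

def load_batch_view(path):
    """Index a flat batch-view file (one key,value pair per line)."""
    index = {}
    with open(path) as f:
        for key, value in csv.reader(f):
            index[key] = int(value)
    return index

# view = load_batch_view("pageviews_batch_view.csv")  # hypothetical file
# view.get("/home", 0)                                # served to queries
```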
Speed layer
In essence the speed layer is the same as the batch layer in that it computes views from the data it receives. The speed layer is needed to compensate for the high latency of the batch layer, and it does this by computing realtime views in Storm. The realtime views contain only the delta results to supplement the batch views. Whilst the batch layer is designed to continuously recompute the batch views from scratch, the speed layer uses an incremental model whereby the realtime views are incremented as and when new data is received.
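A minimal sketch of that incremental model (our own illustration, not code from the document): rather than recomputing from scratch, each arriving event mutates the realtime view in place.

```python
from collections import defaultdict

realtime_view = defaultdict(int)

def on_event(event):
    realtime_view[event["url"]] += 1   # O(1) update per event

for e in [{"url": "/home"}, {"url": "/home"}]:
    on_event(e)
print(dict(realtime_view))             # {'/home': 2}

# Once the next batch cycle absorbs these events into the batch view,
# the corresponding entries here can simply be dropped.
```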
What’s clever about the speed layer is that the realtime views are intended to be transient: as soon as the data propagates through the batch and serving layers, the corresponding results in the realtime views can be discarded. This is referred to as "complexity isolation", meaning that the most complex part of the architecture is pushed into the layer whose results are only temporary.
[Diagram: realtime views are discarded once the data they contain is represented in a batch view; along a timeline ending at "now", successive batch views cover the older data while realtime views cover only the most recent window.]
Disadvantages
• Maintaining two copies of code that must produce the same result in two complex distributed systems
• Could return estimated or approximate results
• Expensive full recomputation is required for fault tolerance
• Requires high cluster up-time, as batch data needs to be processed continuously
• Requires more implementation time, as duplicate code needs to be written in separate technologies to process real-time and batch data
• Time taken to process a batch grows linearly with the volume of data
Generic Lambda Architecture
[Diagram: an incoming data stream feeds both the batch layer and the speed layer. The batch layer stores all data (HDFS) and precomputes batch views (MR/Hive/Pig); the serving layer holds the pre-computed views and summarized data for query; the speed layer (Storm or Spark) processes the stream and increments near-real-time views / stream summarization. Queries merge the batch views with the realtime views, under an overall data management & access layer.]