 
White Paper

Achieving Flexible Scalability of Hadoop to Meet Enterprise Workload Requirements

By Nik Rouda, Senior Analyst

December 2014

This ESG White Paper was commissioned by EMC and is distributed under license from ESG.

© 2014 by The Enterprise Strategy Group, Inc. All Rights Reserved.
Contents

Big Data Environments Have Varying Goals and Requirements for Scaling
	Big Data Is Big By Definition
	Hadoop Scales for Big Data, But Scale Isn't Always Easy
	Big Data Needs Scale in Multiple Dimensions
Emerging Choices for Implementation and Scaling of Hadoop Environments
	Independent Scaling of Servers and Storage
	Evaluation of Big Data Solutions
When Scaling Environments, "How to Host Hadoop" Is as Important as "Where to Host Hadoop"
The Bigger Truth
All trademark names are property of their respective companies. Information contained in this publication has been obtained from sources The Enterprise Strategy Group (ESG) considers to be reliable but is not warranted by ESG. This publication may contain opinions of ESG, which are subject to change from time to time. This publication is copyrighted by The Enterprise Strategy Group, Inc. Any reproduction or redistribution of this publication, in whole or in part, whether in hard-copy format, electronically, or otherwise to persons not authorized to receive it, without the express consent of The Enterprise Strategy Group, Inc., is in violation of U.S. copyright law and will be subject to an action for civil damages and, if applicable, criminal prosecution. Should you have any questions, please contact ESG Client Relations at 508.482.0188.
  
White	
  Paper:	
  Achieving	
  Flexible	
  Scalability	
  of	
  Hadoop	
  to	
  Meet	
  Enterprise	
  Workload	
  Requirements	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  3	
  
©	
  2014	
  by	
  The	
  Enterprise	
  Strategy	
  Group,	
  Inc.	
  All	
  Rights	
  Reserved.	
  
Big Data Environments Have Varying Goals and Requirements for Scaling

Big Data Is Big By Definition

The popularity of big data continues to grow with new applications and new enthusiasm in almost every industry. Many organizations are looking for opportunities to transform their business using the possibilities afforded by new data processing and analytics technologies and their promise of new capabilities and improved economics.

The broad Hadoop ecosystem that has developed around the Apache open-source project, along with commercial distributions, is one of the most instrumental forces powering this change in IT. This change is not a result of marketing hype. It is due to Hadoop's suitability for accommodating large, intricate data volumes. Indeed, recently conducted ESG research revealed that the ability to process (54%), store (49%), and run complex queries on (47%) large volumes of diverse data are the three most commonly identified terms or criteria that organizations use to define "big data" (see Figure 1).[1]
Figure 1. Top Three Terms or Criteria that Best Align with Organizations' Definitions of "Big Data"

Source: Enterprise Strategy Group, 2014.
Hadoop Scales for Big Data, But Scale Isn't Always Easy

As Hadoop becomes more popular, a more nuanced understanding of its operational requirements is growing. Some of the more common considerations are the advantages of the platform compared with more traditional data management tools. These advantages include nominally low entry costs, a range of analytics options, support for distributed and parallel jobs, and a simple mechanism for extreme scale-out based on generic hardware.

These pluses are spawning a new community of Hadoop champions, including data scientists and architects who are looking for new technologies capable of supporting their specific big data initiatives. Reflecting this popularity, ESG survey respondents indicated that 39% of organizations are now planning to deploy a new Hadoop environment within the next 12 to 18 months.[2]

With the top three definitions of big data all referencing large volumes of data, it follows that scalability is a priority for most deployments. Fortunately, Hadoop has been developed with the inherent need for scale in mind.
[1] Source: ESG Research Report, Enterprise Data Analytics Trends, May 2014.
[2] Source: ESG Research Report, Enterprise Big Data, Business Intelligence, and Analytics Trends, to be published December 2014.
  
[Figure 1 data: "Which of the following terms or criteria best align with your definition of 'big data'?" (Percent of respondents, N=375, five responses accepted) Ability to process large volumes of diverse data: 54%; ability to store large volumes of diverse data: 49%; ability to run complex queries against large datasets: 47%.]
However, some organizations are discovering that scalability comes with trade-offs, falling generally into three areas:

• Storing and processing extremely large volumes of data. The value of that data—and the insights it may provide—may not always be clear. More data isn't necessarily better. And even with the market expecting that managing this data on open source software and commodity hardware will be nominally less expensive, the cost per gigabyte can still quickly add up.

• Performance at scale. If all the data sets can be stored, can they be analyzed fast enough on demand? Big data won't be particularly helpful if processing a large amount of data takes an unreasonably long time to accomplish. Batch analytics need to return results while the question is relevant and appropriate actions can be taken.

• Diversity of data. The various data sets need to be recognized and reconciled and be made conducive to intricate calculations and advanced modeling techniques. Again, scale can complicate matters in terms of logical design and computation demands.

Failure to address those issues could mean that the big data environment won't live up to the high expectations of IT departments and line-of-business users. Although these challenges can seem trivial or distant during proof-of-concept and pilot programs, most organizations will eventually discover the limitations of their approach after an enterprise-wide production deployment starts to expand in earnest.
Big Data Needs Scale in Multiple Dimensions

Today, the most common model of Hadoop deployments involves clusters of commodity servers with embedded storage. This is a fairly standard approach that would seem to provide incremental scalability for the inevitable increases in data processing, storage, and analysis. However, Hadoop use cases vary tremendously even within a single company or environment. IT infrastructure architects, data scientists, and analytic staff in lines of business need to collaborate to define likely demands and prioritize their design choices. (Figure 2 shows how different workloads need differing types of system resources to be handled most effectively.)

Business adoption of Hadoop is relatively new; many organizations begin their initial experimentation with lower-risk use cases in spite of their enthusiasm for the technology. They may begin with a small cluster by copying multiple internal data sources and capturing external public data as well. Generally, as they gain familiarity and confidence and can start demonstrating success, organizations will expand the environment and realize increasing value—they will find a growing number of use cases and win more converts among their analysts and business users.
Scaling Hadoop for Storage-intensive Use Cases

Many organizations in the early stages of Hadoop implementations may be saving massive quantities of data that are rarely used or that will be valuable only in the future.

This scenario can be seen in geophysical remote sensing and surveying for oil and gas exploration, where seismic, gravimetric, and electrical conductive spatial maps have been defined and captured down to a fine granularity across a vast territory. Today's natural resource extraction methods might not make it economical for an energy company to mine a given oil or gas deposit identified during the exploration process. However, that situation is subject to changes in market conditions (e.g., rising energy prices) or advances in resource extraction technology (e.g., hydraulic fracturing). If conditions change and the energy firm decides to initiate operations on a new deposit, this use case may require deep storage yet minimal server effort (see Figure 2).
Figure 2. Different Workloads Need Different Resources to Be Handled Most Effectively

Source: Enterprise Strategy Group, 2014.
Yet, 28% of IT professionals surveyed by ESG have said that storage costs alone remain too high for the big data archive they'd like to build.[3] This problem remains even when using inexpensive, "slow" internal server hard disk drives instead of more robust external high-capacity, high-performance storage arrays.
Scaling Hadoop for Memory-intensive Use Cases

Some analytics workloads, such as real-time security analytics, need large pools of memory for the fastest possible search, read, and write performance. This need is often dictated by the trend toward delivering the fastest possible response for complex analytics jobs.

Think of on-the-fly customization of displayed products and promotional offers for web-scale e-commerce sites, where a 360-degree understanding of the shopper, inventory, regional pricing, and competitors' activity must all be instantly and simultaneously considered.

In-memory operations will almost certainly outperform those that need to conduct I/O from traditional hard disk or even server-embedded flash storage or SSD drives, simply due to reduced data movement. However, in this scenario, any individual commodity server will have a specific limit on the amount of memory available for analytics.

One way to address this challenge is to spread the workload across multiple parallel servers to increase the total available memory. However, this approach would still add some job coordination overhead (and therefore delay), which could result in unacceptable application performance for business users.

To identify the memory requirements of typical big data environments, ESG surveyed many large enterprises and found that nearly two-thirds of respondents expect to process more than 5TB of data as part of a typical analytics job.[4] This threshold is well above the memory a typical low-end server can handle, even if the right mix of data just happened to be concentrated on that particular node. And often, organizations' queries will need to work with data sets several orders of magnitude larger. Finding a way to bridge the server memory requirements to storage without wasting resources is imperative.
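To make the scale gap concrete, a back-of-the-envelope calculation shows why a 5TB job overwhelms individual low-end servers. This is only a sketch: the 128GB-per-node figure and the 0.7 usable-RAM fraction are illustrative assumptions, not survey data.

```python
import math

# Back-of-the-envelope sizing: how many commodity nodes would it take to hold
# a given working set entirely in memory? All per-node figures here are
# illustrative assumptions, not vendor specifications.

def nodes_for_in_memory(dataset_tb, ram_per_node_gb, usable_fraction=0.7):
    """Estimate node count, reserving (1 - usable_fraction) of each node's
    RAM for the OS, Hadoop daemons, and job overhead."""
    usable_gb = ram_per_node_gb * usable_fraction
    dataset_gb = dataset_tb * 1024
    return math.ceil(dataset_gb / usable_gb)

# A 5TB analytics job (the ESG survey threshold) on hypothetical 128GB nodes:
print(nodes_for_in_memory(5, 128))  # 58 nodes just to hold the data in RAM
```

Even before any computation begins, dozens of nodes are needed simply to stage the working set in memory, which is why the coordination overhead discussed above becomes hard to avoid.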
[3] Source: ESG Research Report, Enterprise Data Analytics Trends, May 2014.
[4] Source: ESG Research Report, Enterprise Big Data, Business Intelligence, and Analytics Trends, to be published December 2014.
Scaling Hadoop for Compute-intensive Use Cases

A third category of use cases will need more compute power for advanced analytics and complex transformations. These jobs can be very processor intensive.

Take, for example, a pharmaceutical research and design study that completes full DNA analyses alongside a variety of environmental factors and unique patient healthcare histories—and much of the data is in an unstructured format such as nurses' patient notes. The vast number of discrete calculations involved in discovery, model fitting, and assessing accuracy can tax even the largest appliances or mainframes. Yet, this work is now being done instead on clusters of low-cost servers.

Large data sets are involved, but they are perhaps more transient in nature, translating into scalability requirements that are focused less on storage capacity and more on handling the demands of real-time streaming. ESG research found that 26% of companies say that data set sizes are now limiting their ability to perform the requisite analytics exercises.[5]
Scaling Hadoop for Geographically Dispersed Use Cases

Another dimension of scaling is focused on how to use Hadoop in a geographically distributed environment. Data sets may be collected and hosted in different data centers globally or around a specific region. For example, clicks and purchase activity from a public cloud-based web application may be retained in the cloud rather than copied back to a company's on-premises cluster. Alternately, human resources data may be managed at corporate headquarters while financial account information is kept in other countries due to local government regulations tied to the exporting of private information.

In these cases, the addition of a replication capability between Hadoop clusters could prove valuable—provided the sensitive data is appropriately masked and encrypted. This functionality is not innately present in Hadoop today, but it could be managed through proprietary scale-out storage features.
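Hadoop does ship a batch copy utility, distcp, that can move data between clusters on a schedule, though it is a bulk transfer rather than the continuous replication described above. A minimal sketch, assuming hypothetical NameNode hostnames `nn-primary` and `nn-dr`:

```shell
# Bulk-copy one day of clickstream data from the primary cluster to a
# disaster-recovery cluster. distcp runs as a MapReduce job, so the copy is
# parallelized across worker nodes. -update skips files that already exist
# unchanged at the destination; -p preserves permissions and timestamps.
# Hostnames and paths are placeholders for illustration only.
hadoop distcp -update -p \
  hdfs://nn-primary:8020/data/clickstream/2014-12-01 \
  hdfs://nn-dr:8020/data/clickstream/2014-12-01
```

For truly continuous, low-lag replication, organizations would still look to the storage-layer features mentioned above.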
Whether for analytics or transaction workloads, a level of concern about disaster recovery and business continuity may also require geographic dispersion, particularly as Hadoop begins to support more mission-critical operations in the enterprise.

Each of the scenarios above is fundamentally related to the scalability of big data, but each shows a different dimension of scalability that may be required to achieve the goals of the initiative. Any given Hadoop cluster may have a different blend of these needs when initially deployed. More challenging is the fact that the characteristics may change radically over time.
Emerging Choices for Implementation and Scaling of Hadoop Environments

As noted, a typical Hadoop environment is a cluster of commodity servers with internal storage that is self-managing and built to the most common denominator of system specifications. For generic workloads, this cost-first approach can be adequate, especially if the analytics aren't particularly intensive or deemed to be mission-critical for the business. However, a small-scale test of the capabilities may not demonstrate the limitations that will emerge as scale inevitably increases. Finding the optimal server specifications for each node in the cluster (or clusters) can be a non-trivial exercise that may not be addressed with one correct solution.
Independent Scaling of Servers and Storage

Given the variety and changing nature of Hadoop implementation requirements (often unknown at the time of deployment), the assumed infrastructure model of commodity servers with embedded storage doesn't always make sense.
[5] Ibid.
This model, in which servers and storage capacity are embedded together as "one size fits all" units and forced to scale linearly in a homogenous cluster, can potentially waste a lot of resources. No single server configuration may be able to handle the various workloads (again noting that particular jobs can be limited by system memory, processor speed, or storage capacity). And of course, increasing any one of these components will affect the overall price of each node. Over-provisioning all three may sound good for performance, but this "bigger hammer" approach will dramatically increase the total cost of the environment and obviate the commodity-scale benefits of Hadoop.
A promising approach is the separation of servers and storage into pools of resources to be drawn on as needed (see Figure 3). This division between scaling storage capacity and computing power enables more targeted scalability to satisfy the specific demands associated with different workloads. The approach may also require the adoption of a shared storage platform, which is not the most common model for Hadoop, but it brings with it many of the key qualities organizations say they want (such as a balance across cost, performance, flexibility, and data protection considerations). And even in a basic MapReduce operation, data is often migrated and joined in performing typical jobs, so shared storage can be accommodated, just with a different route for accessing the data.
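One concrete illustration of that "different route": Hadoop's filesystem layer is pluggable, so pointing `fs.defaultFS` in `core-site.xml` at an external, HDFS-compatible storage service redirects cluster I/O to shared storage without changing job code. This is only a sketch; the hostname below is a placeholder, not a real endpoint.

```xml
<!-- core-site.xml: direct HDFS traffic to an external, HDFS-compatible
     shared storage pool instead of per-node disks.
     storage.example.com is a hypothetical hostname. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://storage.example.com:8020</value>
  </property>
</configuration>
```

Because clients resolve all paths through this setting, compute nodes can then be added or removed without rebalancing data across local drives.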
  
An independent server-scaling strategy complements the many advantages of centralized, shared storage, which ESG previously outlined in a white paper titled EMC Isilon: A Scalable Storage Platform for Big Data (April 2014). Some benefits of this approach include (but are not limited to) multi-protocol access, in-place analytics (i.e., no extract, transform, load [ETL]), and better efficiency and safety.
If one views the Hadoop cluster servers essentially as virtualized resources, this computing power can then be used to access different storage as needed. Effectively, this model can be viewed as a logical independence rather than a necessarily physical distinction, and one needn't assume the physical layer itself is virtualized for the solution to be workable. In fact, this pattern has arisen before in computing history: isolated, locally embedded storage is eventually replaced or augmented with much larger centralized pools of resources shared between servers, proving that hyper-convergence doesn't always lead to optimal utilization.
Figure 3. Diagram of Storage Hosting Options for Hadoop

Source: Enterprise Strategy Group, 2014.
Evaluation of Big Data Solutions

ESG research has found that financial considerations, performance, flexibility, the efficient use of storage resources, and scalability are among the most commonly identified attributes organizations use to evaluate potential big data solutions (see Figure 4).[6] Yet these criteria can be contradictory or even mutually exclusive in the real world. Perhaps the most valuable effect of separately scaling compute resources and storage resources is that it enables organizations to achieve a more optimal blend of the attributes they say they are looking for in a big data solution.
Figure 4. Most Important Solution Evaluation Criteria for New Big Data Solutions

Source: Enterprise Strategy Group, 2014.
For	
  example,	
  an	
  important	
  benefit	
  of	
  separately	
  scaling	
  compute	
  and	
  storage	
  resources	
  is	
  the	
  ability	
  to	
  group	
  
differing	
  classes	
  of	
  servers	
  to	
  meet	
  specific	
  workload	
  requirements:	
  It	
  provides	
  a	
  more	
  cost-­‐effective	
  solution	
  with	
  
better	
  overall	
  performance.	
  In	
  theory,	
  separating	
  compute	
  from	
  storage	
  should	
  also	
  simplify	
  administration	
  while	
  
increasing	
  system	
  reliability	
  and	
  overall	
  recoverability.	
  Although	
  the	
  concept	
  of	
  swapping	
  out	
  inexpensive	
  server	
  
nodes—for	
  example,	
  to	
  replace	
  failing	
  internal	
  hard	
  drives—may	
  sound	
  quite	
  trivial,	
  as	
  organizations	
  expand	
  to	
  
larger	
  cluster	
  environments,	
  they	
  may	
  find	
  that	
  this	
  administrative	
  approach	
  gets	
  more	
  difficult	
  and	
  costly	
  in	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  
6
	
  Source:	
  ESG	
  Research	
  Report,	
  Enterprise	
  Data	
  Analytics	
  Trends,	
  May	
  2014.	
  
10%	
  
11%	
  
13%	
  
13%	
  
14%	
  
15%	
  
16%	
  
18%	
  
18%	
  
20%	
  
21%	
  
22%	
  
26%	
  
26%	
  
0%	
   5%	
   10%	
   15%	
   20%	
   25%	
   30%	
  
Public	
  cloud	
  hoskng	
  opkons	
  
Open	
  standards-­‐based	
  
Reporkng	
  and/or	
  visualizakon	
  
Efficient	
  use	
  of	
  server	
  resources	
  
Ease	
  of	
  administrakon	
  
Scalability	
  
Efficient	
  use	
  of	
  storage	
  resources	
  
Flexibility	
  
Built-­‐in	
  high	
  availability,	
  backup,	
  disaster	
  recovery	
  
capabilikes	
  
Ease	
  of	
  integrakon	
  with	
  other	
  applicakons,	
  APIs	
  
Performance	
  
Reliability	
  
Security	
  
Cost,	
  ROI	
  and/or	
  TCO	
  
Which	
  of	
  the	
  following	
  aZributes	
  are	
  most	
  important	
  to	
  your	
  organizaMon	
  when	
  considering	
  
technology	
  soluMons	
  in	
  the	
  area	
  of	
  business	
  intelligence,	
  analyMcs,	
  and	
  big	
  data?	
  (Percent	
  of	
  
respondents,	
  N=375,	
  three	
  responses	
  accepted)	
  
  
practice. Moving the storage tier to a well-designed, shared storage infrastructure can streamline some of these management and administrative tasks.
  
When Scaling Environments, "How to Host Hadoop" Is as Important as "Where to Host Hadoop"
The idea of independent scalability of servers and storage may be new, but the deployment options organizations are already choosing for net-new big data instances reflect the advantages of this approach. This is shown in organizations' choices about how and where to host the analytics solutions, as discovered by ESG research. Figure 5 depicts a wide distribution of infrastructure preferences for BI/analytics environments.7 Some 18% of respondents indicated that they are looking for simple, one-to-one unvirtualized hardware on-premises. Joining them in the dedicated-resources camp is another 21% of IT teams who reported choosing appliances to meet their specific needs. Appliances are usually selected for their purpose-built high performance and massive size, but they come at a correspondingly high initial price point and with the lumpier granularity of obliging users to add more servers in full- or half-rack configurations.
  
Source: Enterprise Strategy Group, 2014.
  
Supporting the idea of more flexible scaling, 30% of companies said that they are virtualizing on-premises servers and/or storage for more flexible resource allocation and independence. Add to that group the 31% who prefer this virtualization value delivered as public or private cloud, making the consumption and billing also an integral part of the exercise. These implementations are interesting in that they essentially take scalability to a much finer-grained level; if not truly infinite, they are certainly more smoothly extensible.

External public cloud service provider options such as Google or Amazon may serve as proof points for detaching the scale of servers and storage. Consider how Amazon offers EC2 (compute) and S3 (storage) as distinct
components.	
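As a concrete illustration of that separation, a Hadoop job can read its input directly from object storage while the compute tier is sized independently. The sketch below shows hypothetical `core-site.xml` properties for Hadoop's S3A connector; the credential values and the bucket named later are placeholders, not anything from the original paper.

```xml
<!-- core-site.xml (sketch): let Hadoop address S3 object storage directly,
     so data capacity grows in the bucket while compute nodes are sized to
     the job. Both values below are placeholders. -->
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>ACCESS_KEY_PLACEHOLDER</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>SECRET_KEY_PLACEHOLDER</value>
  </property>
</configuration>
```

A job can then name the bucket as an ordinary filesystem path, for example `hadoop distcp s3a://example-bucket/logs/ hdfs:///data/logs/` (the bucket is hypothetical).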
7 Source: ESG Research Report, Enterprise Big Data, Business Intelligence, and Analytics Trends, to be published December 2014.
  
Figure 5. Deployment Models Vary for New Big Data Solutions
	
  
  
Not all organizations are comfortable going off-premises, often for reasons of control, liability, or cost; they instead look to achieve this model of fluidity on-premises. In some ways, the public cloud is analogous to the concept of scaling Hadoop in the manner already discussed in detail: it represents what can potentially be achieved.

Part of the reason for this variety of preferences in hosting Hadoop may be related to the variety of data sources inside and outside of the organization. Increasingly, organizations are looking to perform analytics as close as possible to the data source to avoid the overhead and delay of ETL operations. Sometimes, the data and analytics activity is transient in nature, too, needing to be handled for only a short time. In these cases, more flexibility and the ability to dynamically spin up a Hadoop cluster, run a job, and then dismiss the resources could be quite
valuable.	
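That spin-up/run/dismiss pattern can be sketched with a cloud provider's CLI. The command below is illustrative only: the cluster name, instance sizes, and S3 paths are hypothetical, and `--auto-terminate` releases the resources once the job step finishes.

```shell
# Sketch: provision a transient Hadoop cluster, run one job, then let the
# provider reclaim the resources. All names and paths are illustrative.
aws emr create-cluster \
  --name "transient-analytics" \
  --ami-version 3.3 \
  --instance-type m3.xlarge \
  --instance-count 5 \
  --steps Type=CUSTOM_JAR,Name=NightlyJob,Jar=s3://example-bucket/jobs/analytics.jar,Args=[s3://example-bucket/input,s3://example-bucket/output] \
  --auto-terminate
```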
  
The Bigger Truth
  
Big data is by definition about analytics operations on large quantities of data. To be successful, companies will need to design their computing environments to meet the high demands of business users and their specific applications and workloads, many of which will have different profiles in terms of storage, processor, and memory requirements. Failure to perform at scale will at best introduce significant delays to analytics tasks and, at worst, if results are not returned in a timely manner, negate the value of the big data initiative overall.
  
The economics of new Hadoop implementations based on open source software and commodity hardware promise lower initial cost, but the linear scaling paradigm of adding interlocked servers and embedded storage could unintentionally lead to higher costs and inefficient resource utilization. This consequence can come from rigidly tying compute capacity to storage volumes in each
  server.	
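The cost of that rigid coupling can be made concrete with a small back-of-the-envelope calculation. The node profile and workload figures below are hypothetical, chosen only to illustrate how a storage-heavy workload forces the purchase of idle compute when the two resources must scale in lockstep.

```python
import math

# Hypothetical hardware profile for a commodity Hadoop node (illustrative only).
CORES_PER_NODE = 16
TB_PER_NODE = 24

def coupled_nodes(cores_needed, tb_needed):
    """Nodes required when compute and storage must scale together."""
    return max(math.ceil(cores_needed / CORES_PER_NODE),
               math.ceil(tb_needed / TB_PER_NODE))

def decoupled_nodes(cores_needed):
    """Compute nodes required when storage is provisioned independently."""
    return math.ceil(cores_needed / CORES_PER_NODE)

# A storage-heavy workload: modest compute demand, large data set.
cores, tb = 200, 1000
n_coupled = coupled_nodes(cores, tb)        # 42 nodes, driven by storage
n_decoupled = decoupled_nodes(cores)        # 13 nodes, driven by compute
idle_cores = n_coupled * CORES_PER_NODE - cores  # 472 cores bought but unused
```

Under these assumed figures, the coupled model buys 42 nodes (and 472 idle cores) just to reach the storage target, while the decoupled model needs only 13 compute nodes plus independently sized storage.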
  
New approaches that offer a more flexible scaling of environments are promising, and they suggest that performance can be improved while simultaneously reducing costs. The best model for some may be the independent scaling of server and storage resources. The primary benefits of this model can include an increased ability to handle larger workloads, an increased ability to answer complex analytics and queries in a shorter amount of time, and, in the right circumstances, a lower total cost of ownership.
  
Leading storage companies are articulating an enticing vision of a more flexible, adaptive future for Hadoop-based big data environments. Complementing their core high-performance storage array offerings with extreme scale-out, multi-protocol access, and virtualization may provide the abstraction necessary to support scaling Hadoop environments.
  
The potential advantages of decoupling storage capacity and computing processing power are real, and they should be recognized and considered by customers looking to avoid the common challenge of having mismatched or inadequate resources for the ever-changing requirements of modern big data environments. Customers should work with vendors at the forefront of this approach to identify how they can benefit from an architecture that allows independent scalability of servers and storage.
  
20 Asylum Street | Milford, MA 01757 | Tel: 508.482.0188 Fax: 508.482.0218 | www.esg-global.com
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015
EMC
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015
EMC
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesEMC
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere Environments
EMC
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBook
EMC
 

More from EMC (20)

INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDINDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
 
Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote
 
EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX
 
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOTransforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
 
Citrix ready-webinar-xtremio
Citrix ready-webinar-xtremioCitrix ready-webinar-xtremio
Citrix ready-webinar-xtremio
 
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
 
EMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC with Mirantis Openstack
EMC with Mirantis Openstack
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lake
 
Force Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereForce Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop Elsewhere
 
Pivotal : Moments in Container History
Pivotal : Moments in Container History Pivotal : Moments in Container History
Pivotal : Moments in Container History
 
Data Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewData Lake Protection - A Technical Review
Data Lake Protection - A Technical Review
 
Mobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeMobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or Foe
 
Virtualization Myths Infographic
Virtualization Myths Infographic Virtualization Myths Infographic
Virtualization Myths Infographic
 
Intelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityIntelligence-Driven GRC for Security
Intelligence-Driven GRC for Security
 
The Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeThe Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure Age
 
EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education Services
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere Environments
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBook
 

Recently uploaded

Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 

Recently uploaded (20)

Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 

Achieving Flexible Scalability of Hadoop to Meet Enterprise Workload Requirements

White Paper: Achieving Flexible Scalability of Hadoop to Meet Enterprise Workload Requirements

Contents

Big Data Environments Have Varying Goals and Requirements for Scaling
    Big Data Is Big By Definition
    Hadoop Scales for Big Data, But Scale Isn't Always Easy
    Big Data Needs Scale in Multiple Dimensions
Emerging Choices for Implementation and Scaling of Hadoop Environments
    Independent Scaling of Servers and Storage
    Evaluation of Big Data Solutions
When Scaling Environments, "How to Host Hadoop" Is as Important as "Where to Host Hadoop"
The Bigger Truth

All trademark names are property of their respective companies. Information contained in this publication has been obtained by sources The Enterprise Strategy Group (ESG) considers to be reliable but is not warranted by ESG. This publication may contain opinions of ESG, which are subject to change from time to time. This publication is copyrighted by The Enterprise Strategy Group, Inc. Any reproduction or redistribution of this publication, in whole or in part, whether in hard-copy format, electronically, or otherwise to persons not authorized to receive it, without the express consent of The Enterprise Strategy Group, Inc., is in violation of U.S. copyright law and will be subject to an action for civil damages and, if applicable, criminal prosecution. Should you have any questions, please contact ESG Client Relations at 508.482.0188.
Big Data Environments Have Varying Goals and Requirements for Scaling

Big Data Is Big By Definition

The popularity of big data continues to grow, with new applications and new enthusiasm in almost every industry. Many organizations are looking for opportunities to transform their business using the possibilities afforded by new data processing and analytics technologies and their promise of new capabilities and improved economics.

The broad Hadoop ecosystem that has developed around the Apache open-source project, along with commercial distributions, is one of the most instrumental forces powering this change in IT. This change is not a result of marketing hype; it is due to Hadoop's suitability for accommodating large, intricate data volumes. Indeed, recently conducted ESG research revealed that the ability to process (54%), store (49%), and run complex queries on (47%) large volumes of diverse data are the three most commonly identified terms or criteria that organizations use to define "big data" (see Figure 1).1

Figure 1. Top Three Terms or Criteria that Best Align with Organizations' Definitions of "Big Data" (Source: Enterprise Strategy Group, 2014)

Hadoop Scales for Big Data, But Scale Isn't Always Easy

As Hadoop becomes more popular, a more nuanced understanding of its operational requirements is growing. Some of the more common considerations are the advantages of the platform compared with more traditional data management tools.
These advantages include nominally low entry costs, a range of analytics options, support for distributed and parallel jobs, and a simple mechanism for extreme scale-out based on generic hardware.

These pluses are spawning a new community of Hadoop champions, including data scientists and architects who are looking for new technologies capable of supporting their specific big data initiatives. Reflecting this popularity, ESG survey respondents indicated that 39% of organizations are now planning to deploy a new Hadoop environment within the next 12 to 18 months.2

With the top three definitions of big data all referencing large volumes of data, it follows that scalability is a priority for most deployments. Fortunately, Hadoop has been developed with the inherent need for scale in mind.

Figure 1 survey question: "Which of the following terms or criteria best align with your definition of 'big data'?" (Percent of respondents, N=375, five responses accepted.)

1 Source: ESG Research Report, Enterprise Data Analytics Trends, May 2014.
2 Source: ESG Research Report, Enterprise Big Data, Business Intelligence, and Analytics Trends, to be published December 2014.
However, some organizations are discovering that scalability comes with trade-offs, falling generally into three areas:

• Storing and processing extremely large volumes of data. The value of that data—and the insights it may provide—may not always be clear. More data isn't necessarily better. And even with the market expecting that managing this data on open source software and commodity hardware will be nominally less expensive, the cost per gigabyte can still quickly add up.

• Performance at scale. If all the data sets can be stored, can they be analyzed fast enough on demand? Big data won't be particularly helpful if processing a large amount of data takes an unreasonably long time. Batch analytics need to return results while the question is still relevant and appropriate actions can be taken.

• Diversity of data. The various data sets need to be recognized, reconciled, and made conducive to intricate calculations and advanced modeling techniques. Again, scale can complicate matters in terms of logical design and computation demands.

Failure to address these issues could mean that the big data environment won't live up to the high expectations of IT departments and line-of-business users. Although these challenges can seem trivial or distant during proof-of-concept and pilot programs, most organizations will eventually discover the limitations of their approach after an enterprise-wide production deployment starts to expand in earnest.
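The cost-per-gigabyte concern in the first bullet compounds faster than raw drive prices suggest, because HDFS keeps multiple replicas of every block (three by default) plus working headroom. A back-of-the-envelope sketch, using hypothetical prices and overhead factors rather than any vendor's figures:

```python
# Rough cost model for raw HDFS capacity. The replication factor of 3 is
# the HDFS default; the price and overhead values are hypothetical.

def usable_to_raw_tb(usable_tb, replication=3, overhead=0.25):
    """Raw disk needed for a usable data set, given HDFS replication
    and a reserve for temp/shuffle space and growth headroom."""
    return usable_tb * replication * (1 + overhead)

def storage_cost(usable_tb, cost_per_raw_tb=50.0, replication=3):
    """Total drive cost for a data set of `usable_tb` terabytes."""
    return usable_to_raw_tb(usable_tb, replication) * cost_per_raw_tb

# A "cheap" 100TB archive actually consumes 375TB of raw disk:
raw = usable_to_raw_tb(100)   # 100 * 3 * 1.25 = 375.0
cost = storage_cost(100)      # 375 * $50/TB = $18,750 in drives alone
print(raw, cost)
```

At archive scale this multiplier, not the unit price of a commodity drive, is what drives the budget conversation.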
Big Data Needs Scale in Multiple Dimensions

Today, the most common model of Hadoop deployments involves clusters of commodity servers with embedded storage. This is a fairly standard approach that would seem to provide incremental scalability for the inevitable increases in data processing, storage, and analysis. However, Hadoop use cases vary tremendously, even within a single company or environment. IT infrastructure architects, data scientists, and analytics staff in lines of business need to collaborate to define likely demands and prioritize their design choices. (Figure 2 shows how different workloads need differing types of system resources to be handled most effectively.)

Business adoption of Hadoop is relatively new; many organizations begin their initial experimentation with lower-risk use cases in spite of their enthusiasm for the technology. They may begin with a small cluster, copying multiple internal data sources and capturing external public data as well. Generally, as they gain familiarity and confidence and can start demonstrating success, organizations will expand the environment and realize increasing value—they will find a growing number of use cases and win more converts among their analysts and business users.

Scaling Hadoop for Storage-intensive Use Cases

Many organizations in the early stages of Hadoop implementations may be saving massive quantities of data that are rarely used or that will be valuable only in the future.
This scenario can be seen in geophysical remote sensing and surveying for oil and gas exploration, where seismic, gravimetric, and electrical conductivity spatial maps have been defined and captured down to a fine granularity across a vast territory. Today's natural resource extraction methods might not make it economical for an energy company to mine a given oil or gas deposit identified during the exploration process. However, that situation is subject to changes in market conditions (e.g., rising energy prices) or advances in resource extraction technology (e.g., hydraulic fracturing). If conditions change and the energy firm decides to initiate operations on a new deposit, this use case may require deep storage yet minimal server effort (see Figure 2).
Figure 2. Different Workloads Need Different Resources to Be Handled Most Effectively (Source: Enterprise Strategy Group, 2014)

Yet 28% of IT professionals surveyed by ESG said that storage costs alone remain too high for the big data archive they'd like to build.3 This problem persists even when using inexpensive, "slow" internal server hard disk drives instead of more robust external high-capacity, high-performance storage arrays.

Scaling Hadoop for Memory-intensive Use Cases

Some analytics workloads, such as real-time security analytics, need large pools of memory for the fastest possible search, read, and write performance. This requirement is driven by the trend toward delivering the fastest possible response for complex analytics jobs.

Think of on-the-fly customization of displayed products and promotional offers for web-scale e-commerce sites, where a 360-degree understanding of the shopper, inventory, regional pricing, and competitors' activity must all be instantly and simultaneously considered.

In-memory operations will almost certainly outperform those that need to conduct I/O from traditional hard disk, or even server-embedded flash storage or SSD drives, simply due to reduced data movement. However, in this scenario, any individual commodity server will have a specific limit on the amount of memory available for analytics.

One way to address this challenge is to spread the workload across multiple parallel servers to increase the total available memory.
However, this approach still adds job coordination overhead (and therefore delay), which could result in unacceptable application performance for business users.

To identify the memory requirements of typical big data environments, ESG surveyed many large enterprises and found that nearly two-thirds of respondents expect to process more than 5TB of data as part of a typical analytics job.4 This threshold is well above the memory a typical low-end server can handle, even if the right mix of data just happened to be concentrated on that particular node. And often, organizations' queries will need to work with data sets several orders of magnitude larger. Finding a way to bridge server memory requirements to storage without wasting resources is imperative.

3 Source: ESG Research Report, Enterprise Data Analytics Trends, May 2014.
4 Source: ESG Research Report, Enterprise Big Data, Business Intelligence, and Analytics Trends, to be published December 2014.
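The gap between a 5TB working set and per-node RAM can be made concrete with simple arithmetic. A sketch under stated assumptions: a hypothetical 128GB-per-node commodity server, with a quarter of that memory reserved for the OS, Hadoop daemons, and job overhead.

```python
import math

def nodes_for_in_memory(dataset_tb, ram_gb_per_node=128, usable_fraction=0.75):
    """Minimum node count to hold a data set entirely in cluster RAM.
    `usable_fraction` reserves memory for the OS, daemons, and overhead.
    All parameters here are illustrative assumptions, not measured figures."""
    usable_gb = ram_gb_per_node * usable_fraction  # 96GB usable per node
    return math.ceil(dataset_tb * 1024 / usable_gb)

# A 5TB analytics job needs a sizable cluster just for memory:
print(nodes_for_in_memory(5))    # 5120GB / 96GB -> 54 nodes
# ...and data sets "orders of magnitude larger" scale linearly from there:
print(nodes_for_in_memory(500))  # 512000GB / 96GB -> 5334 nodes
```

The linear growth in node count, each node bringing coordination overhead of its own, is exactly why bridging memory to storage efficiently matters.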
Scaling Hadoop for Compute-intensive Use Cases

A third category of use cases needs more compute power for advanced analytics and complex transformations. These jobs can be very processor intensive. Take, for example, a pharmaceutical research and design study that completes full DNA analyses alongside a variety of environmental factors and unique patient healthcare histories, much of the data being in unstructured formats such as nurses' patient notes. The vast number of discrete calculations involved in discovery, model fitting, and assessing accuracy can tax even the largest appliances or mainframes. Yet this work is now being done instead on clusters of low-cost servers.

Large data sets are involved, but they are often more transient in nature, translating into scalability requirements that are focused less on storage capacity and more on handling the demands of real-time streaming. ESG research found that 26% of companies say that data set sizes are now limiting their ability to perform the requisite analytics exercises.[5]

Scaling Hadoop for Geographically Dispersed Use Cases

Another dimension of scaling is how to use Hadoop in a geographically distributed environment. Data sets may be collected and hosted in different data centers globally or around a specific region. For example, clicks and purchase activity from a public cloud-based web application may be retained in the cloud rather than copied back to a company's on-premises cluster.
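Where data does need to move between clusters in bulk today, Hadoop's built-in distcp tool is the usual mechanism. A sketch might look like the following; the hostnames and paths are hypothetical, while the flags reflect distcp's documented interface.

```shell
# Batch-copy a dataset from one Hadoop cluster to another.
# distcp runs as a MapReduce job, so large copies parallelize across
# the cluster. -update copies only files that have changed since the
# last run; -p preserves permissions and timestamps.
hadoop distcp -update -p \
  hdfs://nn-onprem.example.com:8020/data/clicks \
  hdfs://nn-remote.example.com:8020/data/clicks
```

Note that distcp is a point-in-time batch copy, not continuous replication, which is consistent with the observation below that true replication is not innate to Hadoop.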
Alternatively, human resources data may be managed at corporate headquarters while financial account information is kept in other countries due to local government regulations restricting the export of private information.

In these cases, the addition of a replication capability between Hadoop clusters could prove valuable, provided the sensitive data is appropriately masked and encrypted. This functionality is not innately present in Hadoop today, but it could be managed through proprietary scale-out storage features.

Whether for analytics or transactional workloads, concern about disaster recovery and business continuity may also require geographic dispersion, particularly as Hadoop begins to support more mission-critical operations in the enterprise.

Each of the scenarios above is fundamentally related to the scalability of big data, but each shows a different dimension of scalability that may be required to achieve the goals of the initiative. Any given Hadoop cluster may have a different blend of these needs when initially deployed. More challenging still, these characteristics may change radically over time.

Emerging Choices for Implementation and Scaling of Hadoop Environments

As noted, a typical Hadoop environment is a cluster of commodity servers with internal storage that is self-managing and built to the most common denominator of system specifications. For generic workloads, this cost-first approach can be adequate, especially if the analytics aren't particularly intensive or deemed mission-critical for the business. However, a small-scale test of the capabilities may not reveal the limitations that will emerge as scale inevitably increases.
Finding the optimal server specifications for each node in the cluster (or clusters) can be a non-trivial exercise, and there may be no single correct answer.

Independent Scaling of Servers and Storage

Given the variety and changing nature of Hadoop implementation requirements (often unknown at the time of deployment), the assumed infrastructure model of commodity servers with embedded storage doesn't always make sense.

[5] Ibid.
This model, in which servers and storage capacity are embedded together as "one size fits all" units and forced to scale linearly in a homogeneous cluster, can waste a lot of resources. No single server configuration may be able to handle the various workloads (again noting that particular jobs can be limited by system memory, processor speed, or storage capacity). And of course, increasing any one of these components will affect the overall price of each node. Over-provisioning all three may sound good for performance, but this "bigger hammer" approach will dramatically increase the total cost of the environment and obviate the commodity-scale benefits of Hadoop.

A promising alternative is the separation of servers and storage into pools of resources to be drawn on as needed (see Figure 3). This division between scaling storage capacity and scaling computing power enables more targeted scalability to satisfy the specific demands of different workloads. The approach may also require the adoption of a shared storage platform, which is not the most common model for Hadoop, but it brings with it many of the key qualities organizations say they want (such as a balance across cost, performance, flexibility, and data protection considerations). And even in a basic MapReduce operation, data is often migrated and joined in the course of typical jobs, so shared storage can be accommodated, just with a different route for accessing the data.
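As a hypothetical illustration of how little the compute tier changes, pointing a Hadoop cluster at an external, HDFS-protocol-capable storage system is largely a matter of redirecting the default filesystem in core-site.xml. The property name is standard Hadoop configuration; the hostname below is invented.

```xml
<!-- core-site.xml: compute nodes address a shared storage pool
     instead of node-local HDFS. The hostname is illustrative. -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://shared-storage.example.com:8020</value>
</property>
```

With this arrangement, compute nodes can be added, removed, or re-specified without rebalancing data, because the data never lived on them in the first place.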
An independent server-scaling strategy complements the many advantages of centralized, shared storage, which ESG previously outlined in a white paper titled EMC Isilon: A Scalable Storage Platform for Big Data (April 2014). Benefits of this approach include (but are not limited to) multi-protocol access, in-place analytics (i.e., no extract, transform, and load [ETL] operations), and better efficiency and safety.

If one views the Hadoop cluster servers essentially as virtualized resources, this computing power can then be used to access different storage as needed. Effectively, this model can be viewed as a logical independence rather than a necessarily physical distinction, and one needn't assume the physical layer itself is virtualized for the solution to be workable. In fact, this pattern has arisen before in computing history: isolated, locally embedded storage is eventually replaced or augmented by much larger centralized pools of resources shared between servers, proving that hyper-convergence doesn't always lead to optimal utilization.

Figure 3. Diagram of Storage Hosting Options for Hadoop
Source: Enterprise Strategy Group, 2014.
Evaluation of Big Data Solutions

ESG research has found that financial considerations, performance, flexibility, efficient use of storage resources, and scalability are among the most commonly identified attributes organizations use to evaluate potential big data solutions (see Figure 4).[6] Yet these criteria can be contradictory or even mutually exclusive in the real world. Perhaps the most valuable effect of separately scaling compute resources and storage resources is that it enables organizations to achieve a more optimal blend of the attributes they say they are looking for in a big data solution.

Figure 4. Most Important Solution Evaluation Criteria for New Big Data Solutions
Source: Enterprise Strategy Group, 2014.

For example, an important benefit of separately scaling compute and storage resources is the ability to group differing classes of servers to meet specific workload requirements: It provides a more cost-effective solution with better overall performance. In theory, separating compute from storage should also simplify administration while increasing system reliability and overall recoverability.
Figure 4 data: Which of the following attributes are most important to your organization when considering technology solutions in the area of business intelligence, analytics, and big data? (Percent of respondents, N=375, three responses accepted)

Cost, ROI and/or TCO: 26%
Security: 26%
Reliability: 22%
Performance: 21%
Ease of integration with other applications, APIs: 20%
Built-in high availability, backup, disaster recovery capabilities: 18%
Flexibility: 18%
Efficient use of storage resources: 16%
Scalability: 15%
Ease of administration: 14%
Efficient use of server resources: 13%
Reporting and/or visualization: 13%
Open standards-based: 11%
Public cloud hosting options: 10%

[6] Source: ESG Research Report, Enterprise Data Analytics Trends, May 2014.

Although the concept of swapping out inexpensive server nodes (for example, to replace failing internal hard drives) may sound quite trivial, as organizations expand to larger cluster environments, they may find that this administrative approach gets more difficult and costly in
practice. Moving the storage tier to a well-designed, shared storage infrastructure can streamline some of these management and administrative tasks.

When Scaling Environments, "How to Host Hadoop" Is as Important as "Where to Host Hadoop"

The idea of independent scalability of servers and storage may be new, but the deployment options organizations are already choosing for net-new big data instances reflect the advantages of this approach. This is evident in organizations' choices about how and where to host their analytics solutions, as discovered by ESG research. Figure 5 depicts a wide distribution of infrastructure preferences for BI/analytics environments.[7] Some 18% of respondents indicated that they are looking for simple, one-to-one unvirtualized hardware on-premises. Joining them in the dedicated-resources camp is another 21% of IT teams who reported choosing appliances to meet their specific needs. Appliances are usually selected for their purpose-built high performance and massive size, but they come at a correspondingly high initial price point and with the lumpier granularity of obliging users to add servers in full- or half-rack increments.

Source: Enterprise Strategy Group, 2014.

Supporting the idea of more flexible scaling, 30% of companies said that they are virtualizing on-premises servers and/or storage for more flexible resource allocation and independence.
Add to that group the 31% who prefer this virtualization value delivered as public or private cloud, making consumption and billing an integral part of the exercise as well. These implementations are interesting in that they essentially take scalability to a much finer-grained level; if not truly infinite, they are certainly more smoothly extensible.

External public cloud service providers such as Google and Amazon may serve as proof points for decoupling the scale of servers from storage. Consider how Amazon offers EC2 (compute) and S3 (storage) as distinct components.

[7] Source: ESG Research Report, Enterprise Big Data, Business Intelligence, and Analytics Trends, to be published December 2014.

Figure 5. Deployment Models Vary for New Big Data Solutions
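The EC2/S3 split can be made concrete with a hosted-Hadoop example. The sketch below uses Amazon EMR's documented CLI interface, but the cluster name, bucket names, instance sizing, and job JAR are all hypothetical: a transient cluster is created, reads its input from S3, writes results back to S3, and tears itself down when the job completes.

```shell
# Spin up a short-lived Hadoop (EMR) cluster whose storage lives in S3,
# run one job, and auto-terminate so compute is paid for only while used.
aws emr create-cluster \
  --name "transient-analytics" \
  --release-label emr-5.36.0 \
  --instance-type m5.xlarge \
  --instance-count 5 \
  --auto-terminate \
  --steps Type=CUSTOM_JAR,Name=Job,Jar=s3://example-bucket/jobs/analysis.jar,Args=[s3://example-bucket/input/,s3://example-bucket/output/]
```

Because the data persists in S3 independently of the cluster, the compute fleet can be sized (or discarded) per job, which is exactly the decoupling discussed above.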
Not all organizations are comfortable going off-premises, often for reasons of control, liability, or cost; they instead look to achieve this model of fluidity on-premises. In some ways, the public cloud is analogous to the concept of scaling Hadoop in the manner already discussed in detail: it represents what can potentially be achieved.

Part of the reason for this variety of preferences in hosting Hadoop may be the variety of data sources inside and outside the organization. Increasingly, organizations are looking to perform analytics as close as possible to the data source to avoid the overhead and delay of ETL operations. Sometimes, the data and analytics activity is transient in nature too, needing to be handled for only a short time. In these cases, the flexibility to dynamically spin up a Hadoop cluster, run a job, and then dismiss the resources could be quite valuable.

The Bigger Truth

Big data is by definition about analytics operations on large quantities of data. To be successful, companies will need to design their computing environments to meet the high demands of business users and their specific applications and workloads, many of which will have different profiles in terms of storage, processor, and memory requirements.
Failure to perform at scale will at best introduce significant delays to analytics tasks; at worst, if results are not returned in a timely manner, it will negate the value of the big data initiative overall.

The economics of new Hadoop implementations based on open source software and commodity hardware promise lower initial cost, but the linear scaling paradigm of adding interlocked servers and embedded storage could unintentionally lead to higher costs and inefficient resource utilization. This consequence comes from rigidly tying compute capacity to storage volume in each server.

New approaches that offer more flexible scaling of environments are promising, and they suggest that performance can be improved while costs are simultaneously reduced. The best model for some may be the independent scaling of server and storage resources. The primary benefits of this model can include an increased ability to handle larger workloads, an increased ability to answer complex analytics and queries in a shorter amount of time, and, in the right circumstances, a lower total cost of ownership.

Leading storage companies are articulating an enticing vision of a more flexible, adaptive future for Hadoop-based big data environments. Complementing their core high-performance storage array offerings with extreme scale-out, multi-protocol access, and virtualization may provide the abstraction necessary to support scaling Hadoop environments.
The potential advantages of decoupling storage capacity from computing power are real, and they should be recognized and considered by customers looking to avoid the common challenge of mismatched or inadequate resources for the ever-changing requirements of modern big data environments. Customers should work with vendors at the forefront of this approach to identify how they can benefit from an architecture that allows independent scalability of servers and storage.
20 Asylum Street | Milford, MA 01757 | Tel: 508.482.0188 Fax: 508.482.0218 | www.esg-global.com