 
White Paper

Achieving Flexible Scalability of Hadoop to Meet Enterprise Workload Requirements

By Nik Rouda, Senior Analyst

December 2014

This ESG White Paper was commissioned by EMC and is distributed under license from ESG.

© 2014 by The Enterprise Strategy Group, Inc. All Rights Reserved.
Contents

Big Data Environments Have Varying Goals and Requirements for Scaling
	Big Data Is Big By Definition
	Hadoop Scales for Big Data, But Scale Isn't Always Easy
	Big Data Needs Scale in Multiple Dimensions
Emerging Choices for Implementation and Scaling of Hadoop Environments
	Independent Scaling of Servers and Storage
	Evaluation of Big Data Solutions
When Scaling Environments, "How to Host Hadoop" Is as Important as "Where to Host Hadoop"
The Bigger Truth
All trademark names are property of their respective companies. Information contained in this publication has been obtained from sources The Enterprise Strategy Group (ESG) considers to be reliable but is not warranted by ESG. This publication may contain opinions of ESG, which are subject to change from time to time. This publication is copyrighted by The Enterprise Strategy Group, Inc. Any reproduction or redistribution of this publication, in whole or in part, whether in hard-copy format, electronically, or otherwise to persons not authorized to receive it, without the express consent of The Enterprise Strategy Group, Inc., is in violation of U.S. copyright law and will be subject to an action for civil damages and, if applicable, criminal prosecution. Should you have any questions, please contact ESG Client Relations at 508.482.0188.
  
White	
  Paper:	
  Achieving	
  Flexible	
  Scalability	
  of	
  Hadoop	
  to	
  Meet	
  Enterprise	
  Workload	
  Requirements	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  3	
  
©	
  2014	
  by	
  The	
  Enterprise	
  Strategy	
  Group,	
  Inc.	
  All	
  Rights	
  Reserved.	
  
Big Data Environments Have Varying Goals and Requirements for Scaling

Big Data Is Big By Definition

The popularity of big data continues to grow with new applications and new enthusiasm in almost every industry. Many organizations are looking for opportunities to transform their business using the possibilities afforded by new data processing and analytics technologies and their promise of new capabilities and improved economics.

The broad Hadoop ecosystem that has developed around the Apache open-source project, along with commercial distributions, is one of the most instrumental forces powering this change in IT. This change is not a result of marketing hype. It is due to Hadoop's suitability for accommodating large, intricate data volumes. Indeed, recently conducted ESG research revealed that the ability to process (54%), store (49%), and run complex queries on (47%) large volumes of diverse data are the three most commonly identified terms or criteria that organizations use to define "big data" (see Figure 1).[1]
Figure 1. Top Three Terms or Criteria that Best Align with Organizations' Definitions of "Big Data"

Source: Enterprise Strategy Group, 2014.
Hadoop Scales for Big Data, But Scale Isn't Always Easy

As Hadoop becomes more popular, a more nuanced understanding of its operational requirements is growing. Some of the more common considerations are the advantages of the platform compared with more traditional data management tools. These advantages include nominally low entry costs, a range of analytics options, support for distributed and parallel jobs, and a simple mechanism for extreme scale-out based on generic hardware.

These pluses are spawning a new community of Hadoop champions, including data scientists and architects who are looking for new technologies capable of supporting their specific big data initiatives. Reflecting this popularity, ESG survey respondents indicated that 39% of organizations are now planning to deploy a new Hadoop environment within the next 12 to 18 months.[2]

With the top three definitions of big data all referencing large volumes of data, it follows that scalability is a priority for most deployments. Fortunately, Hadoop has been developed with the inherent need for scale in mind.
[1] Source: ESG Research Report, Enterprise Data Analytics Trends, May 2014.
[2] Source: ESG Research Report, Enterprise Big Data, Business Intelligence, and Analytics Trends, to be published December 2014.
  
[Figure 1 data: "Which of the following terms or criteria best align with your definition of 'big data'?" (Percent of respondents, N=375, five responses accepted) Ability to process large volumes of diverse data: 54%; ability to store large volumes of diverse data: 49%; ability to run complex queries against large datasets: 47%.]
However, some organizations are discovering that scalability comes with trade-offs, falling generally into three areas:

• Storing and processing extremely large volumes of data. The value of that data—and the insights it may provide—may not always be clear. More data isn't necessarily better. And even with the market expecting that managing this data on open source software and commodity hardware will be nominally less expensive, the cost per gigabyte can still quickly add up.

• Performance at scale. If all the data sets can be stored, can they be analyzed fast enough on demand? Big data won't be particularly helpful if processing a large amount of data takes an unreasonably long time to accomplish. Batch analytics need to return results while the question is relevant and appropriate actions can be taken.

• Diversity of data. The various data sets need to be recognized and reconciled and be made conducive to intricate calculations and advanced modeling techniques. Again, scale can complicate matters in terms of logical design and computation demands.

Failure to address those issues could mean that the big data environment won't live up to the high expectations of IT departments and line-of-business users. Although these challenges can seem trivial or distant during proof-of-concept and pilot programs, most organizations will eventually discover the limitations of their approach after an enterprise-wide production deployment starts to expand in earnest.
Big Data Needs Scale in Multiple Dimensions

Today, the most common model of Hadoop deployments involves clusters of commodity servers with embedded storage. This is a fairly standard approach that would seem to provide incremental scalability for the inevitable increases in data processing, storage, and analysis. However, Hadoop use cases vary tremendously even within a single company or environment. IT infrastructure architects, data scientists, and analytic staff in lines of business need to collaborate to define likely demands and prioritize their design choices. (Figure 2 shows how different workloads need differing types of system resources to be handled most effectively.)

Business adoption of Hadoop is relatively new; many organizations begin their initial experimentation with lower-risk use cases in spite of their enthusiasm for the technology. They may begin with a small cluster by copying multiple internal data sources and capturing external public data as well. Generally, as they gain familiarity and confidence and can start demonstrating success, organizations will expand the environment and realize increasing value—they will find a growing number of use cases and win more converts among their analysts and business users.
Scaling Hadoop for Storage-intensive Use Cases

Many organizations in the early stages of Hadoop implementations may be saving massive quantities of data that are rarely used or that will be valuable only in the future.

This scenario can be seen in geophysical remote sensing and surveying for oil and gas exploration, where seismic, gravimetric, and electrical conductive spatial maps have been defined and captured down to a fine granularity across a vast territory. Today's natural resource extraction methods might not make it economical for an energy company to mine a given oil or gas deposit identified during the exploration process. However, that situation is subject to changes in market conditions (e.g., rising energy prices) or advances in resource extraction technology (e.g., hydraulic fracturing). If conditions change and the energy firm decides to initiate operations on a new deposit, this use case may require deep storage yet minimal server effort (see Figure 2).
Figure 2. Different Workloads Need Different Resources to Be Handled Most Effectively

Source: Enterprise Strategy Group, 2014.
Yet, 28% of IT professionals surveyed by ESG have said that storage costs alone remain too high for the big data archive they'd like to build.[3] This problem remains even when using inexpensive, "slow" internal server hard disk drives instead of more robust external high-capacity, high-performance storage arrays.
Scaling Hadoop for Memory-intensive Use Cases

Some analytics workloads, such as real-time security analytics, need large pools of memory for the fastest possible search, read, and write performance. This need is often dictated by the trend toward delivering the fastest possible response for complex analytics jobs.

Think of on-the-fly customization of displayed products and promotional offers for web-scale e-commerce sites, where a 360-degree understanding of the shopper, inventory, regional pricing, and competitors' activity must all be instantly and simultaneously considered.

In-memory operations will almost certainly outperform those that need to conduct I/O from traditional hard disk or even server-embedded flash storage or SSD drives, simply due to reduced data movement. However, in this scenario, any individual commodity server will have a specific limit on the amount of memory available for analytics.

One way to address this challenge is to spread the workload across multiple parallel servers to increase the total available memory. However, this approach would still add some job coordination overhead (and therefore delay), which could result in unacceptable application performance for business users.

To identify the memory requirements of typical big data environments, ESG surveyed many large enterprises and found that nearly two-thirds of respondents expect to process more than 5TB of data as part of a typical analytics job.[4] This threshold is well above the memory a typical low-end server can handle, even if the right mix of data just happened to be concentrated on that particular node. And often, organizations' queries will need to work with data sets several orders of magnitude larger. Finding a way to bridge the server memory requirements to storage without wasting resources is imperative.
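To make the scale gap concrete, a back-of-the-envelope calculation shows why a 5TB job overwhelms individual low-end servers. This is only a sketch: the 128GB-per-node figure and the 0.7 usable-RAM fraction are illustrative assumptions, not survey data.

```python
import math

# Back-of-the-envelope sizing: how many commodity nodes would it take to hold
# a given working set entirely in memory? All per-node figures here are
# illustrative assumptions, not vendor specifications.

def nodes_for_in_memory(dataset_tb, ram_per_node_gb, usable_fraction=0.7):
    """Estimate node count, reserving (1 - usable_fraction) of each node's
    RAM for the OS, Hadoop daemons, and job overhead."""
    usable_gb = ram_per_node_gb * usable_fraction
    dataset_gb = dataset_tb * 1024
    return math.ceil(dataset_gb / usable_gb)

# A 5TB analytics job (the ESG survey threshold) on hypothetical 128GB nodes:
print(nodes_for_in_memory(5, 128))  # 58 nodes just to hold the data in RAM
```

Even before any computation begins, dozens of nodes are needed simply to stage the working set in memory, which is why the coordination overhead discussed above becomes hard to avoid.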
[3] Source: ESG Research Report, Enterprise Data Analytics Trends, May 2014.
[4] Source: ESG Research Report, Enterprise Big Data, Business Intelligence, and Analytics Trends, to be published December 2014.
Scaling Hadoop for Compute-intensive Use Cases

A third category of use cases will need more compute power for advanced analytics and complex transformations. These jobs can be very processor intensive.

Take, for example, a pharmaceutical research and design study that completes full DNA analyses alongside a variety of environmental factors and unique patient healthcare histories—and much of the data is in an unstructured format such as nurses' patient notes. The vast number of discrete calculations involved in discovery, model fitting, and assessing accuracy can tax even the largest appliances or mainframes. Yet, this work is now being done instead on clusters of low-cost servers.

Large data sets are involved, but they are perhaps more transient in nature, translating into scalability requirements that are focused less on storage capacity and more on handling the demands of real-time streaming. ESG research found that 26% of companies say that data set sizes are now limiting their ability to perform the requisite analytics exercises.[5]
Scaling Hadoop for Geographically Dispersed Use Cases

Another dimension of scaling is focused on how to use Hadoop in a geographically distributed environment. Data sets may be collected and hosted in different data centers globally or around a specific region. For example, clicks and purchase activity from a public cloud-based web application may be retained in the cloud rather than copied back to a company's on-premises cluster. Alternately, human resources data may be managed at corporate headquarters while financial account information is kept in other countries due to local government regulations tied to the exporting of private information.

In these cases, the addition of a replication capability between Hadoop clusters could prove valuable—provided the sensitive data is appropriately masked and encrypted. This functionality is not innately present in Hadoop today, but it could be managed through proprietary scale-out storage features.
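Hadoop does ship a batch copy utility, distcp, that can move data between clusters on a schedule, though it is a bulk transfer rather than the continuous replication described above. A minimal sketch, assuming hypothetical NameNode hostnames `nn-primary` and `nn-dr`:

```shell
# Bulk-copy one day of clickstream data from the primary cluster to a
# disaster-recovery cluster. distcp runs as a MapReduce job, so the copy is
# parallelized across worker nodes. -update skips files that already exist
# unchanged at the destination; -p preserves permissions and timestamps.
# Hostnames and paths are placeholders for illustration only.
hadoop distcp -update -p \
  hdfs://nn-primary:8020/data/clickstream/2014-12-01 \
  hdfs://nn-dr:8020/data/clickstream/2014-12-01
```

For truly continuous, low-lag replication, organizations would still look to the storage-layer features mentioned above.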
Whether for analytics or transaction workloads, a level of concern about disaster recovery and business continuity may also require geographic dispersion, particularly as Hadoop begins to support more mission-critical operations in the enterprise.

Each of the scenarios above is fundamentally related to the scalability of big data, but each shows a different dimension of scalability that may be required to achieve the goals of the initiative. Any given Hadoop cluster may have a different blend of these needs when initially deployed. More challenging is the fact that the characteristics may change radically over time.
Emerging Choices for Implementation and Scaling of Hadoop Environments

As noted, a typical Hadoop environment is a cluster of commodity servers with internal storage that is self-managing and built to the most common denominator of system specifications. For generic workloads, this cost-first approach can be adequate, especially if the analytics aren't particularly intensive or deemed to be mission-critical for the business. However, a small-scale test of the capabilities may not demonstrate the limitations that will emerge as scale inevitably increases. Finding the optimal server specifications for each node in the cluster (or clusters) can be a non-trivial exercise that may not be addressed with one correct solution.
Independent Scaling of Servers and Storage

Given the variety and changing nature of Hadoop implementation requirements (often unknown at the time of deployment), the assumed infrastructure model of commodity servers with embedded storage doesn't always make sense.
[5] Ibid.
This model, in which servers and storage capacity are embedded together as "one size fits all" units and forced to scale linearly in a homogenous cluster, can potentially waste a lot of resources. No single server configuration may be able to handle the various workloads (again noting that particular jobs can be limited by system memory, processor speed, or storage capacity). And of course, increasing any one of these components will affect the overall price of each node. Over-provisioning all three may sound good for performance, but this "bigger hammer" approach will dramatically increase the total cost of the environment and obviate the commodity-scale benefits of Hadoop.
A promising approach is the separation of servers and storage into pools of resources to be drawn on as needed (see Figure 3). This division between scaling storage capacity and computing power enables more targeted scalability to satisfy the specific demands associated with different workloads. The approach may also require the adoption of a shared storage platform, which is not the most common model for Hadoop, but it brings with it many of the key qualities organizations say they want (such as a balance across cost, performance, flexibility, and data protection considerations). And even in a basic MapReduce operation, data is often migrated and joined in performing typical jobs, so shared storage can be accommodated, just with a different route for accessing the data.
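One concrete illustration of that "different route": Hadoop's filesystem layer is pluggable, so pointing `fs.defaultFS` in `core-site.xml` at an external, HDFS-compatible storage service redirects cluster I/O to shared storage without changing job code. This is only a sketch; the hostname below is a placeholder, not a real endpoint.

```xml
<!-- core-site.xml: direct HDFS traffic to an external, HDFS-compatible
     shared storage pool instead of per-node disks.
     storage.example.com is a hypothetical hostname. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://storage.example.com:8020</value>
  </property>
</configuration>
```

Because clients resolve all paths through this setting, compute nodes can then be added or removed without rebalancing data across local drives.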
  
An independent server-scaling strategy complements the many advantages of centralized, shared storage, which ESG previously outlined in a white paper titled EMC Isilon: A Scalable Storage Platform for Big Data (April 2014). Some benefits of this approach include (but are not limited to) multi-protocol access, in-place analytics (i.e., no extract, transform, load [ETL]), and better efficiency and safety.
If one views the Hadoop cluster servers essentially as virtualized resources, this computing power can then be used to access different storage as needed. Effectively, this model can be viewed as a logical independence rather than a necessarily physical distinction, and one needn't assume the physical layer itself is virtualized for the solution to be workable. In fact, this pattern has arisen before in computing history: isolated, locally embedded storage is eventually replaced or augmented with much larger centralized pools of resources shared between servers, proving that hyper-convergence doesn't always lead to optimal utilization.
Figure 3. Diagram of Storage Hosting Options for Hadoop

Source: Enterprise Strategy Group, 2014.
Evaluation of Big Data Solutions

ESG research has found that financial considerations, performance, flexibility, the efficient use of storage resources, and scalability are among the most commonly identified attributes organizations use to evaluate potential big data solutions (see Figure 4).[6] Yet these criteria can be contradictory or even mutually exclusive in the real world. Perhaps the most valuable effect of separately scaling compute resources and storage resources is that it enables organizations to achieve a more optimal blend of the attributes they say they are looking for in a big data solution.
Figure 4. Most Important Solution Evaluation Criteria for New Big Data Solutions

Source: Enterprise Strategy Group, 2014.
For	
  example,	
  an	
  important	
  benefit	
  of	
  separately	
  scaling	
  compute	
  and	
  storage	
  resources	
  is	
  the	
  ability	
  to	
  group	
  
differing	
  classes	
  of	
  servers	
  to	
  meet	
  specific	
  workload	
  requirements:	
  It	
  provides	
  a	
  more	
  cost-­‐effective	
  solution	
  with	
  
better	
  overall	
  performance.	
  In	
  theory,	
  separating	
  compute	
  from	
  storage	
  should	
  also	
  simplify	
  administration	
  while	
  
increasing	
  system	
  reliability	
  and	
  overall	
  recoverability.	
  Although	
  the	
  concept	
  of	
  swapping	
  out	
  inexpensive	
  server	
  
nodes—for	
  example,	
  to	
  replace	
  failing	
  internal	
  hard	
  drives—may	
  sound	
  quite	
  trivial,	
  as	
  organizations	
  expand	
  to	
  
larger	
  cluster	
  environments,	
  they	
  may	
  find	
  that	
  this	
  administrative	
  approach	
  gets	
  more	
  difficult	
  and	
  costly	
  in	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  
6
	
  Source:	
  ESG	
  Research	
  Report,	
  Enterprise	
  Data	
  Analytics	
  Trends,	
  May	
  2014.	
  
10%	
  
11%	
  
13%	
  
13%	
  
14%	
  
15%	
  
16%	
  
18%	
  
18%	
  
20%	
  
21%	
  
22%	
  
26%	
  
26%	
  
0%	
   5%	
   10%	
   15%	
   20%	
   25%	
   30%	
  
Public	
  cloud	
  hoskng	
  opkons	
  
Open	
  standards-­‐based	
  
Reporkng	
  and/or	
  visualizakon	
  
Efficient	
  use	
  of	
  server	
  resources	
  
Ease	
  of	
  administrakon	
  
Scalability	
  
Efficient	
  use	
  of	
  storage	
  resources	
  
Flexibility	
  
Built-­‐in	
  high	
  availability,	
  backup,	
  disaster	
  recovery	
  
capabilikes	
  
Ease	
  of	
  integrakon	
  with	
  other	
  applicakons,	
  APIs	
  
Performance	
  
Reliability	
  
Security	
  
Cost,	
  ROI	
  and/or	
  TCO	
  
Which	
  of	
  the	
  following	
  aZributes	
  are	
  most	
  important	
  to	
  your	
  organizaMon	
  when	
  considering	
  
technology	
  soluMons	
  in	
  the	
  area	
  of	
  business	
  intelligence,	
  analyMcs,	
  and	
  big	
  data?	
  (Percent	
  of	
  
respondents,	
  N=375,	
  three	
  responses	
  accepted)	
  
  
practice. Moving the storage tier to a well-designed, shared storage infrastructure can streamline some of these management and administrative tasks.
  
When Scaling Environments, "How to Host Hadoop" Is as Important as "Where to Host Hadoop"
The idea of independent scalability of servers and storage may be new, but the deployment options organizations are already choosing for net-new big data instances reflect the advantages of this approach. This is shown in organizations' choices about how and where to host the analytics solutions, as discovered by ESG research. Figure 5 depicts a wide distribution of infrastructure preferences for BI/analytics environments.7 Some 18% of respondents indicated that they are looking for simple, one-to-one unvirtualized hardware on-premises. Joining them in the dedicated-resources camp is another 21% of IT teams who reported choosing appliances to meet their specific needs. Appliances are usually selected for their purpose-built high performance and massive size, but they come at a correspondingly high initial price point and with the lumpier granularity of obliging users to add more servers in full- or half-rack configurations.
  
Source: Enterprise Strategy Group, 2014.
  
Supporting the idea of more flexible scaling, 30% of companies said that they are virtualizing on-premises servers and/or storage for more flexible resource allocation and independence. Add to that group the 31% who prefer this virtualization value delivered as public or private cloud, making the consumption and billing also an integral part of the exercise. These implementations are interesting in that they essentially take scalability to a much finer-grained level; if not truly infinite, they are certainly more smoothly extensible.

External public cloud service provider options such as Google or Amazon may serve as proof points for detaching the scale of servers and storage. Consider how Amazon offers EC2 (compute) and S3 (storage) as distinct
components.	
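As a concrete illustration of that separation, a Hadoop job can read its input directly from object storage while the compute tier is sized independently. The sketch below shows hypothetical `core-site.xml` properties for Hadoop's S3A connector; the credential values and the bucket named later are placeholders, not anything from the original paper.

```xml
<!-- core-site.xml (sketch): let Hadoop address S3 object storage directly,
     so data capacity grows in the bucket while compute nodes are sized to
     the job. Both values below are placeholders. -->
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>ACCESS_KEY_PLACEHOLDER</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>SECRET_KEY_PLACEHOLDER</value>
  </property>
</configuration>
```

A job can then name the bucket as an ordinary filesystem path, for example `hadoop distcp s3a://example-bucket/logs/ hdfs:///data/logs/` (the bucket is hypothetical).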
7 Source: ESG Research Report, Enterprise Big Data, Business Intelligence, and Analytics Trends, to be published December 2014.
  
Figure 5. Deployment Models Vary for New Big Data Solutions
	
  
  
Not all organizations are comfortable going off-premises, often for reasons of control, liability, or cost; they instead look to achieve this model of fluidity on-premises. In some ways, the public cloud is analogous to the concept of scaling Hadoop in the manner already discussed in detail: it represents what can potentially be achieved.

Part of the reason for this variety of preferences in hosting Hadoop may be related to the variety of data sources inside and outside of the organization. Increasingly, organizations are looking to perform analytics as close as possible to the data source to avoid the overhead and delay of ETL operations. Sometimes, the data and analytics activity is transient in nature, too, needing to be handled for only a short time. In these cases, more flexibility and the ability to dynamically spin up a Hadoop cluster, run a job, and then dismiss the resources could be quite
valuable.	
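That spin-up/run/dismiss pattern can be sketched with a cloud provider's CLI. The command below is illustrative only: the cluster name, instance sizes, and S3 paths are hypothetical, and `--auto-terminate` releases the resources once the job step finishes.

```shell
# Sketch: provision a transient Hadoop cluster, run one job, then let the
# provider reclaim the resources. All names and paths are illustrative.
aws emr create-cluster \
  --name "transient-analytics" \
  --ami-version 3.3 \
  --instance-type m3.xlarge \
  --instance-count 5 \
  --steps Type=CUSTOM_JAR,Name=NightlyJob,Jar=s3://example-bucket/jobs/analytics.jar,Args=[s3://example-bucket/input,s3://example-bucket/output] \
  --auto-terminate
```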
  
The Bigger Truth
  
Big data is by definition about analytics operations on large quantities of data. To be successful, companies will need to design their computing environments to meet the high demands of business users and their specific applications and workloads, many of which will have different profiles in terms of storage, processor, and memory requirements. Failure to perform at scale will at best introduce significant delays to analytics tasks and, at worst, if results are not returned in a timely manner, negate the value of the big data initiative overall.
  
The economics of new Hadoop implementations based on open source software and commodity hardware promise lower initial cost, but the linear scaling paradigm of adding interlocked servers and embedded storage could unintentionally lead to higher costs and inefficient resource utilization. This consequence can come from rigidly tying compute capacity to storage volumes in each
  server.	
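The cost of that rigid coupling can be made concrete with a small back-of-the-envelope calculation. The node profile and workload figures below are hypothetical, chosen only to illustrate how a storage-heavy workload forces the purchase of idle compute when the two resources must scale in lockstep.

```python
import math

# Hypothetical hardware profile for a commodity Hadoop node (illustrative only).
CORES_PER_NODE = 16
TB_PER_NODE = 24

def coupled_nodes(cores_needed, tb_needed):
    """Nodes required when compute and storage must scale together."""
    return max(math.ceil(cores_needed / CORES_PER_NODE),
               math.ceil(tb_needed / TB_PER_NODE))

def decoupled_nodes(cores_needed):
    """Compute nodes required when storage is provisioned independently."""
    return math.ceil(cores_needed / CORES_PER_NODE)

# A storage-heavy workload: modest compute demand, large data set.
cores, tb = 200, 1000
n_coupled = coupled_nodes(cores, tb)        # 42 nodes, driven by storage
n_decoupled = decoupled_nodes(cores)        # 13 nodes, driven by compute
idle_cores = n_coupled * CORES_PER_NODE - cores  # 472 cores bought but unused
```

Under these assumed figures, the coupled model buys 42 nodes (and 472 idle cores) just to reach the storage target, while the decoupled model needs only 13 compute nodes plus independently sized storage.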
  
New approaches that offer a more flexible scaling of environments are promising, and they suggest that performance can be improved while simultaneously reducing costs. The best model for some may be the independent scaling of server and storage resources. The primary benefits of this model can include an increased ability to handle larger workloads, an increased ability to answer complex analytics and queries in a shorter amount of time, and, in the right circumstances, a lower total cost of ownership.
  
Leading storage companies are articulating an enticing vision of a more flexible, adaptive future for Hadoop-based big data environments. Complementing their core high-performance storage array offerings with extreme scale-out, multi-protocol access, and virtualization may provide the abstraction necessary to support scaling Hadoop environments.
  
The potential advantages of decoupling storage capacity and computing processing power are real, and they should be recognized and considered by customers looking to avoid the common challenge of having mismatched or inadequate resources for the ever-changing requirements of modern big data environments. Customers should work with vendors at the forefront of this approach to identify how they can benefit from an architecture that allows independent scalability of servers and storage.
  
20 Asylum Street | Milford, MA 01757 | Tel: 508.482.0188 Fax: 508.482.0218 | www.esg-global.com
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015
EMC
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015
EMC
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesEMC
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere Environments
EMC
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBook
EMC
 

More from EMC (20)

INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUDINDUSTRY-LEADING  TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
INDUSTRY-LEADING TECHNOLOGY FOR LONG TERM RETENTION OF BACKUPS IN THE CLOUD
 
Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote Cloud Foundry Summit Berlin Keynote
Cloud Foundry Summit Berlin Keynote
 
EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX EMC GLOBAL DATA PROTECTION INDEX
EMC GLOBAL DATA PROTECTION INDEX
 
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIOTransforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
Transforming Desktop Virtualization with Citrix XenDesktop and EMC XtremIO
 
Citrix ready-webinar-xtremio
Citrix ready-webinar-xtremioCitrix ready-webinar-xtremio
Citrix ready-webinar-xtremio
 
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
EMC FORUM RESEARCH GLOBAL RESULTS - 10,451 RESPONSES ACROSS 33 COUNTRIES
 
EMC with Mirantis Openstack
EMC with Mirantis OpenstackEMC with Mirantis Openstack
EMC with Mirantis Openstack
 
Modern infrastructure for business data lake
Modern infrastructure for business data lakeModern infrastructure for business data lake
Modern infrastructure for business data lake
 
Force Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop ElsewhereForce Cyber Criminals to Shop Elsewhere
Force Cyber Criminals to Shop Elsewhere
 
Pivotal : Moments in Container History
Pivotal : Moments in Container History Pivotal : Moments in Container History
Pivotal : Moments in Container History
 
Data Lake Protection - A Technical Review
Data Lake Protection - A Technical ReviewData Lake Protection - A Technical Review
Data Lake Protection - A Technical Review
 
Mobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or FoeMobile E-commerce: Friend or Foe
Mobile E-commerce: Friend or Foe
 
Virtualization Myths Infographic
Virtualization Myths Infographic Virtualization Myths Infographic
Virtualization Myths Infographic
 
Intelligence-Driven GRC for Security
Intelligence-Driven GRC for SecurityIntelligence-Driven GRC for Security
Intelligence-Driven GRC for Security
 
The Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure AgeThe Trust Paradox: Access Management and Trust in an Insecure Age
The Trust Paradox: Access Management and Trust in an Insecure Age
 
EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015EMC Technology Day - SRM University 2015
EMC Technology Day - SRM University 2015
 
EMC Academic Summit 2015
EMC Academic Summit 2015EMC Academic Summit 2015
EMC Academic Summit 2015
 
Data Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education ServicesData Science and Big Data Analytics Book from EMC Education Services
Data Science and Big Data Analytics Book from EMC Education Services
 
Using EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere EnvironmentsUsing EMC Symmetrix Storage in VMware vSphere Environments
Using EMC Symmetrix Storage in VMware vSphere Environments
 
Using EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBookUsing EMC VNX storage with VMware vSphereTechBook
Using EMC VNX storage with VMware vSphereTechBook
 

Recently uploaded

Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 

Recently uploaded (20)

Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 

Achieving Flexible Scalability of Hadoop to Meet Enterprise Workload Requirements

White Paper: Achieving Flexible Scalability of Hadoop to Meet Enterprise Workload Requirements

Contents

Big Data Environments Have Varying Goals and Requirements for Scaling
    Big Data Is Big By Definition
    Hadoop Scales for Big Data, But Scale Isn't Always Easy
    Big Data Needs Scale in Multiple Dimensions
Emerging Choices for Implementation and Scaling of Hadoop Environments
    Independent Scaling of Servers and Storage
    Evaluation of Big Data Solutions
When Scaling Environments, "How to Host Hadoop" Is as Important as "Where to Host Hadoop"
The Bigger Truth

All trademark names are property of their respective companies. Information contained in this publication has been obtained by sources The Enterprise Strategy Group (ESG) considers to be reliable but is not warranted by ESG. This publication may contain opinions of ESG, which are subject to change from time to time. This publication is copyrighted by The Enterprise Strategy Group, Inc. Any reproduction or redistribution of this publication, in whole or in part, whether in hard-copy format, electronically, or otherwise to persons not authorized to receive it, without the express consent of The Enterprise Strategy Group, Inc., is in violation of U.S. copyright law and will be subject to an action for civil damages and, if applicable, criminal prosecution. Should you have any questions, please contact ESG Client Relations at 508.482.0188.
Big Data Environments Have Varying Goals and Requirements for Scaling

Big Data Is Big By Definition

The popularity of big data continues to grow, with new applications and new enthusiasm in almost every industry. Many organizations are looking for opportunities to transform their business using the possibilities afforded by new data processing and analytics technologies and their promise of new capabilities and improved economics.

The broad Hadoop ecosystem that has developed around the Apache open-source project, along with commercial distributions, is one of the most instrumental forces powering this change in IT. This change is not a result of marketing hype; it is due to Hadoop's suitability for accommodating large, intricate data volumes. Indeed, recently conducted ESG research revealed that the ability to process (54%), store (49%), and run complex queries on (47%) large volumes of diverse data are the three most commonly identified terms or criteria that organizations use to define "big data" (see Figure 1).1

Figure 1. Top Three Terms or Criteria that Best Align with Organizations' Definitions of "Big Data" (Source: Enterprise Strategy Group, 2014)

Hadoop Scales for Big Data, But Scale Isn't Always Easy

As Hadoop becomes more popular, a more nuanced understanding of its operational requirements is growing. Some of the more common considerations are the advantages of the platform compared with more traditional data management tools.
These advantages include nominally low entry costs, a range of analytics options, support for distributed and parallel jobs, and a simple mechanism for extreme scale-out based on generic hardware.

These pluses are spawning a new community of Hadoop champions, including data scientists and architects who are looking for new technologies capable of supporting their specific big data initiatives. Reflecting this popularity, ESG survey respondents indicated that 39% of organizations are now planning to deploy a new Hadoop environment within the next 12 to 18 months.2

With the top three definitions of big data all referencing large volumes of data, it follows that scalability is a priority for most deployments. Fortunately, Hadoop has been developed with the inherent need for scale in mind.

Figure 1 survey question: "Which of the following terms or criteria best align with your definition of 'big data'?" (Percent of respondents, N=375, five responses accepted.)

1 Source: ESG Research Report, Enterprise Data Analytics Trends, May 2014.
2 Source: ESG Research Report, Enterprise Big Data, Business Intelligence, and Analytics Trends, to be published December 2014.
However, some organizations are discovering that scalability comes with trade-offs, falling generally into three areas:

• Storing and processing extremely large volumes of data. The value of that data—and the insights it may provide—may not always be clear. More data isn't necessarily better. And even with the market expecting that managing this data on open source software and commodity hardware will be nominally less expensive, the cost per gigabyte can still quickly add up.

• Performance at scale. If all the data sets can be stored, can they be analyzed fast enough on demand? Big data won't be particularly helpful if processing a large amount of data takes an unreasonably long time. Batch analytics need to return results while the question is still relevant and appropriate actions can be taken.

• Diversity of data. The various data sets need to be recognized, reconciled, and made conducive to intricate calculations and advanced modeling techniques. Again, scale can complicate matters in terms of logical design and computation demands.

Failure to address these issues could mean that the big data environment won't live up to the high expectations of IT departments and line-of-business users. Although these challenges can seem trivial or distant during proof-of-concept and pilot programs, most organizations will eventually discover the limitations of their approach after an enterprise-wide production deployment starts to expand in earnest.
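The cost-per-gigabyte concern in the first bullet compounds faster than raw drive prices suggest, because HDFS keeps multiple replicas of every block (three by default) plus working headroom. A back-of-the-envelope sketch, using hypothetical prices and overhead factors rather than any vendor's figures:

```python
# Rough cost model for raw HDFS capacity. The replication factor of 3 is
# the HDFS default; the price and overhead values are hypothetical.

def usable_to_raw_tb(usable_tb, replication=3, overhead=0.25):
    """Raw disk needed for a usable data set, given HDFS replication
    and a reserve for temp/shuffle space and growth headroom."""
    return usable_tb * replication * (1 + overhead)

def storage_cost(usable_tb, cost_per_raw_tb=50.0, replication=3):
    """Total drive cost for a data set of `usable_tb` terabytes."""
    return usable_to_raw_tb(usable_tb, replication) * cost_per_raw_tb

# A "cheap" 100TB archive actually consumes 375TB of raw disk:
raw = usable_to_raw_tb(100)   # 100 * 3 * 1.25 = 375.0
cost = storage_cost(100)      # 375 * $50/TB = $18,750 in drives alone
print(raw, cost)
```

At archive scale this multiplier, not the unit price of a commodity drive, is what drives the budget conversation.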
Big Data Needs Scale in Multiple Dimensions

Today, the most common model of Hadoop deployments involves clusters of commodity servers with embedded storage. This is a fairly standard approach that would seem to provide incremental scalability for the inevitable increases in data processing, storage, and analysis. However, Hadoop use cases vary tremendously, even within a single company or environment. IT infrastructure architects, data scientists, and analytics staff in lines of business need to collaborate to define likely demands and prioritize their design choices. (Figure 2 shows how different workloads need differing types of system resources to be handled most effectively.)

Business adoption of Hadoop is relatively new; many organizations begin their initial experimentation with lower-risk use cases in spite of their enthusiasm for the technology. They may begin with a small cluster, copying multiple internal data sources and capturing external public data as well. Generally, as they gain familiarity and confidence and can start demonstrating success, organizations will expand the environment and realize increasing value—they will find a growing number of use cases and win more converts among their analysts and business users.

Scaling Hadoop for Storage-intensive Use Cases

Many organizations in the early stages of Hadoop implementations may be saving massive quantities of data that are rarely used or that will be valuable only in the future.
This scenario can be seen in geophysical remote sensing and surveying for oil and gas exploration, where seismic, gravimetric, and electrical conductivity spatial maps have been defined and captured down to a fine granularity across a vast territory. Today's natural resource extraction methods might not make it economical for an energy company to mine a given oil or gas deposit identified during the exploration process. However, that situation is subject to changes in market conditions (e.g., rising energy prices) or advances in resource extraction technology (e.g., hydraulic fracturing). If conditions change and the energy firm decides to initiate operations on a new deposit, this use case may require deep storage yet minimal server effort (see Figure 2).
Figure 2. Different Workloads Need Different Resources to Be Handled Most Effectively (Source: Enterprise Strategy Group, 2014)

Yet 28% of IT professionals surveyed by ESG said that storage costs alone remain too high for the big data archive they'd like to build.3 This problem persists even when using inexpensive, "slow" internal server hard disk drives instead of more robust external high-capacity, high-performance storage arrays.

Scaling Hadoop for Memory-intensive Use Cases

Some analytics workloads, such as real-time security analytics, need large pools of memory for the fastest possible search, read, and write performance. This requirement is driven by the trend toward delivering the fastest possible response for complex analytics jobs.

Think of on-the-fly customization of displayed products and promotional offers for web-scale e-commerce sites, where a 360-degree understanding of the shopper, inventory, regional pricing, and competitors' activity must all be instantly and simultaneously considered.

In-memory operations will almost certainly outperform those that need to conduct I/O from traditional hard disk, or even server-embedded flash storage or SSD drives, simply due to reduced data movement. However, in this scenario, any individual commodity server will have a specific limit on the amount of memory available for analytics.

One way to address this challenge is to spread the workload across multiple parallel servers to increase the total available memory.
However, this approach still adds job coordination overhead (and therefore delay), which could result in unacceptable application performance for business users.

To identify the memory requirements of typical big data environments, ESG surveyed many large enterprises and found that nearly two-thirds of respondents expect to process more than 5TB of data as part of a typical analytics job.4 This threshold is well above the memory a typical low-end server can handle, even if the right mix of data just happened to be concentrated on that particular node. And often, organizations' queries will need to work with data sets several orders of magnitude larger. Finding a way to bridge server memory requirements to storage without wasting resources is imperative.

3 Source: ESG Research Report, Enterprise Data Analytics Trends, May 2014.
4 Source: ESG Research Report, Enterprise Big Data, Business Intelligence, and Analytics Trends, to be published December 2014.
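The gap between a 5TB working set and per-node RAM can be made concrete with simple arithmetic. A sketch under stated assumptions: a hypothetical 128GB-per-node commodity server, with a quarter of that memory reserved for the OS, Hadoop daemons, and job overhead.

```python
import math

def nodes_for_in_memory(dataset_tb, ram_gb_per_node=128, usable_fraction=0.75):
    """Minimum node count to hold a data set entirely in cluster RAM.
    `usable_fraction` reserves memory for the OS, daemons, and overhead.
    All parameters here are illustrative assumptions, not measured figures."""
    usable_gb = ram_gb_per_node * usable_fraction  # 96GB usable per node
    return math.ceil(dataset_tb * 1024 / usable_gb)

# A 5TB analytics job needs a sizable cluster just for memory:
print(nodes_for_in_memory(5))    # 5120GB / 96GB -> 54 nodes
# ...and data sets "orders of magnitude larger" scale linearly from there:
print(nodes_for_in_memory(500))  # 512000GB / 96GB -> 5334 nodes
```

The linear growth in node count, each node bringing coordination overhead of its own, is exactly why bridging memory to storage efficiently matters.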
Scaling Hadoop for Compute-intensive Use Cases

A third category of use cases needs more compute power for advanced analytics and complex transformations. These jobs can be very processor intensive. Take, for example, a pharmaceutical research and design study that completes full DNA analyses alongside a variety of environmental factors and unique patient healthcare histories, much of the data being in unstructured formats such as nurses' patient notes. The vast number of discrete calculations involved in discovery, model fitting, and assessing accuracy can tax even the largest appliances or mainframes. Yet this work is now being done instead on clusters of low-cost servers.

Large data sets are involved, but they are often more transient in nature, translating into scalability requirements that are focused less on storage capacity and more on handling the demands of real-time streaming. ESG research found that 26% of companies say that data set sizes are now limiting their ability to perform the requisite analytics exercises.[5]

Scaling Hadoop for Geographically Dispersed Use Cases

Another dimension of scaling is how to use Hadoop in a geographically distributed environment. Data sets may be collected and hosted in different data centers globally or around a specific region. For example, clicks and purchase activity from a public cloud-based web application may be retained in the cloud rather than copied back to a company's on-premises cluster.
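Where data does need to move between clusters in bulk today, Hadoop's built-in distcp tool is the usual mechanism. A sketch might look like the following; the hostnames and paths are hypothetical, while the flags reflect distcp's documented interface.

```shell
# Batch-copy a dataset from one Hadoop cluster to another.
# distcp runs as a MapReduce job, so large copies parallelize across
# the cluster. -update copies only files that have changed since the
# last run; -p preserves permissions and timestamps.
hadoop distcp -update -p \
  hdfs://nn-onprem.example.com:8020/data/clicks \
  hdfs://nn-remote.example.com:8020/data/clicks
```

Note that distcp is a point-in-time batch copy, not continuous replication, which is consistent with the observation below that true replication is not innate to Hadoop.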
Alternatively, human resources data may be managed at corporate headquarters while financial account information is kept in other countries due to local government regulations restricting the export of private information.

In these cases, the addition of a replication capability between Hadoop clusters could prove valuable, provided the sensitive data is appropriately masked and encrypted. This functionality is not innately present in Hadoop today, but it could be managed through proprietary scale-out storage features.

Whether for analytics or transactional workloads, concern about disaster recovery and business continuity may also require geographic dispersion, particularly as Hadoop begins to support more mission-critical operations in the enterprise.

Each of the scenarios above is fundamentally related to the scalability of big data, but each shows a different dimension of scalability that may be required to achieve the goals of the initiative. Any given Hadoop cluster may have a different blend of these needs when initially deployed. More challenging still, these characteristics may change radically over time.

Emerging Choices for Implementation and Scaling of Hadoop Environments

As noted, a typical Hadoop environment is a cluster of commodity servers with internal storage that is self-managing and built to the most common denominator of system specifications. For generic workloads, this cost-first approach can be adequate, especially if the analytics aren't particularly intensive or deemed mission-critical for the business. However, a small-scale test of the capabilities may not reveal the limitations that will emerge as scale inevitably increases.
Finding the optimal server specifications for each node in the cluster (or clusters) can be a non-trivial exercise, and there may be no single correct answer.

Independent Scaling of Servers and Storage

Given the variety and changing nature of Hadoop implementation requirements (often unknown at the time of deployment), the assumed infrastructure model of commodity servers with embedded storage doesn't always make sense.

[5] Ibid.
This model, in which servers and storage capacity are embedded together as "one size fits all" units and forced to scale linearly in a homogeneous cluster, can waste a lot of resources. No single server configuration may be able to handle the various workloads (again noting that particular jobs can be limited by system memory, processor speed, or storage capacity). And of course, increasing any one of these components will affect the overall price of each node. Over-provisioning all three may sound good for performance, but this "bigger hammer" approach will dramatically increase the total cost of the environment and obviate the commodity-scale benefits of Hadoop.

A promising alternative is the separation of servers and storage into pools of resources to be drawn on as needed (see Figure 3). This division between scaling storage capacity and scaling computing power enables more targeted scalability to satisfy the specific demands of different workloads. The approach may also require the adoption of a shared storage platform, which is not the most common model for Hadoop, but it brings with it many of the key qualities organizations say they want (such as a balance across cost, performance, flexibility, and data protection considerations). And even in a basic MapReduce operation, data is often migrated and joined in the course of typical jobs, so shared storage can be accommodated, just with a different route for accessing the data.
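As a hypothetical illustration of how little the compute tier changes, pointing a Hadoop cluster at an external, HDFS-protocol-capable storage system is largely a matter of redirecting the default filesystem in core-site.xml. The property name is standard Hadoop configuration; the hostname below is invented.

```xml
<!-- core-site.xml: compute nodes address a shared storage pool
     instead of node-local HDFS. The hostname is illustrative. -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://shared-storage.example.com:8020</value>
</property>
```

With this arrangement, compute nodes can be added, removed, or re-specified without rebalancing data, because the data never lived on them in the first place.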
An independent server-scaling strategy complements the many advantages of centralized, shared storage, which ESG previously outlined in a white paper titled EMC Isilon: A Scalable Storage Platform for Big Data (April 2014). Benefits of this approach include (but are not limited to) multi-protocol access, in-place analytics (i.e., no extract, transform, and load [ETL] operations), and better efficiency and safety.

If one views the Hadoop cluster servers essentially as virtualized resources, this computing power can then be used to access different storage as needed. Effectively, this model can be viewed as a logical independence rather than a necessarily physical distinction, and one needn't assume the physical layer itself is virtualized for the solution to be workable. In fact, this pattern has arisen before in computing history: isolated, locally embedded storage is eventually replaced or augmented by much larger centralized pools of resources shared between servers, proving that hyper-convergence doesn't always lead to optimal utilization.

Figure 3. Diagram of Storage Hosting Options for Hadoop
Source: Enterprise Strategy Group, 2014.
Evaluation of Big Data Solutions

ESG research has found that financial considerations, performance, flexibility, efficient use of storage resources, and scalability are among the most commonly identified attributes organizations use to evaluate potential big data solutions (see Figure 4).[6] Yet these criteria can be contradictory or even mutually exclusive in the real world. Perhaps the most valuable effect of separately scaling compute resources and storage resources is that it enables organizations to achieve a more optimal blend of the attributes they say they are looking for in a big data solution.

Figure 4. Most Important Solution Evaluation Criteria for New Big Data Solutions
Source: Enterprise Strategy Group, 2014.

For example, an important benefit of separately scaling compute and storage resources is the ability to group differing classes of servers to meet specific workload requirements: It provides a more cost-effective solution with better overall performance. In theory, separating compute from storage should also simplify administration while increasing system reliability and overall recoverability.
Figure 4 data: Which of the following attributes are most important to your organization when considering technology solutions in the area of business intelligence, analytics, and big data? (Percent of respondents, N=375, three responses accepted)

Cost, ROI and/or TCO: 26%
Security: 26%
Reliability: 22%
Performance: 21%
Ease of integration with other applications, APIs: 20%
Built-in high availability, backup, disaster recovery capabilities: 18%
Flexibility: 18%
Efficient use of storage resources: 16%
Scalability: 15%
Ease of administration: 14%
Efficient use of server resources: 13%
Reporting and/or visualization: 13%
Open standards-based: 11%
Public cloud hosting options: 10%

[6] Source: ESG Research Report, Enterprise Data Analytics Trends, May 2014.

Although the concept of swapping out inexpensive server nodes (for example, to replace failing internal hard drives) may sound quite trivial, as organizations expand to larger cluster environments, they may find that this administrative approach gets more difficult and costly in
practice. Moving the storage tier to a well-designed, shared storage infrastructure can streamline some of these management and administrative tasks.

When Scaling Environments, "How to Host Hadoop" Is as Important as "Where to Host Hadoop"

The idea of independent scalability of servers and storage may be new, but the deployment options organizations are already choosing for net-new big data instances reflect the advantages of this approach. This is evident in organizations' choices about how and where to host their analytics solutions, as discovered by ESG research. Figure 5 depicts a wide distribution of infrastructure preferences for BI/analytics environments.[7] Some 18% of respondents indicated that they are looking for simple, one-to-one unvirtualized hardware on-premises. Joining them in the dedicated-resources camp is another 21% of IT teams who reported choosing appliances to meet their specific needs. Appliances are usually selected for their purpose-built high performance and massive size, but they come at a correspondingly high initial price point and with the lumpier granularity of obliging users to add servers in full- or half-rack increments.

Source: Enterprise Strategy Group, 2014.

Supporting the idea of more flexible scaling, 30% of companies said that they are virtualizing on-premises servers and/or storage for more flexible resource allocation and independence.
Add to that group the 31% who prefer this virtualization value delivered as public or private cloud, making consumption and billing an integral part of the exercise as well. These implementations are interesting in that they essentially take scalability to a much finer-grained level; if not truly infinite, they are certainly more smoothly extensible.

External public cloud service providers such as Google and Amazon may serve as proof points for decoupling the scale of servers from storage. Consider how Amazon offers EC2 (compute) and S3 (storage) as distinct components.

[7] Source: ESG Research Report, Enterprise Big Data, Business Intelligence, and Analytics Trends, to be published December 2014.

Figure 5. Deployment Models Vary for New Big Data Solutions
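The EC2/S3 split can be made concrete with a hosted-Hadoop example. The sketch below uses Amazon EMR's documented CLI interface, but the cluster name, bucket names, instance sizing, and job JAR are all hypothetical: a transient cluster is created, reads its input from S3, writes results back to S3, and tears itself down when the job completes.

```shell
# Spin up a short-lived Hadoop (EMR) cluster whose storage lives in S3,
# run one job, and auto-terminate so compute is paid for only while used.
aws emr create-cluster \
  --name "transient-analytics" \
  --release-label emr-5.36.0 \
  --instance-type m5.xlarge \
  --instance-count 5 \
  --auto-terminate \
  --steps Type=CUSTOM_JAR,Name=Job,Jar=s3://example-bucket/jobs/analysis.jar,Args=[s3://example-bucket/input/,s3://example-bucket/output/]
```

Because the data persists in S3 independently of the cluster, the compute fleet can be sized (or discarded) per job, which is exactly the decoupling discussed above.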
Not all organizations are comfortable going off-premises, often for reasons of control, liability, or cost; they instead look to achieve this model of fluidity on-premises. In some ways, the public cloud is analogous to the concept of scaling Hadoop in the manner already discussed in detail: it represents what can potentially be achieved.

Part of the reason for this variety of preferences in hosting Hadoop may be the variety of data sources inside and outside the organization. Increasingly, organizations are looking to perform analytics as close as possible to the data source to avoid the overhead and delay of ETL operations. Sometimes, the data and analytics activity is transient in nature too, needing to be handled for only a short time. In these cases, the flexibility to dynamically spin up a Hadoop cluster, run a job, and then dismiss the resources could be quite valuable.

The Bigger Truth

Big data is by definition about analytics operations on large quantities of data. To be successful, companies will need to design their computing environments to meet the high demands of business users and their specific applications and workloads, many of which will have different profiles in terms of storage, processor, and memory requirements.
Failure to perform at scale will at best introduce significant delays to analytics tasks; at worst, if results are not returned in a timely manner, it will negate the value of the big data initiative overall.

The economics of new Hadoop implementations based on open source software and commodity hardware promise lower initial cost, but the linear scaling paradigm of adding interlocked servers and embedded storage could unintentionally lead to higher costs and inefficient resource utilization. This consequence comes from rigidly tying compute capacity to storage volume in each server.

New approaches that offer more flexible scaling of environments are promising, and they suggest that performance can be improved while costs are simultaneously reduced. The best model for some may be the independent scaling of server and storage resources. The primary benefits of this model can include an increased ability to handle larger workloads, an increased ability to answer complex analytics and queries in a shorter amount of time, and, in the right circumstances, a lower total cost of ownership.

Leading storage companies are articulating an enticing vision of a more flexible, adaptive future for Hadoop-based big data environments. Complementing their core high-performance storage array offerings with extreme scale-out, multi-protocol access, and virtualization may provide the abstraction necessary to support scaling Hadoop environments.
The potential advantages of decoupling storage capacity from computing power are real, and they should be recognized and considered by customers looking to avoid the common challenge of mismatched or inadequate resources for the ever-changing requirements of modern big data environments. Customers should work with vendors at the forefront of this approach to identify how they can benefit from an architecture that allows independent scalability of servers and storage.
20 Asylum Street | Milford, MA 01757 | Tel: 508.482.0188 Fax: 508.482.0218 | www.esg-global.com