Is Your Cloud Ready for Big Data?
Richard McDougall
CTO, Storage and Application Services

© 2009 VMware Inc. All rights reserved
Not Just for the Web Giants – The Intelligent Enterprise

2
Real-time analysis allows
instant understanding of
market dynamics.
Retailers can have intimate
understanding of their
customers needs and use
direct targeted marketing.

Market Segment Analysis ! Personalized Customer Targeting`
3
The Emerging Pattern of Big Data Systems: Retail Example

Real-Time
Processing
Data Science

Real-Time
Streams

Machine
Learning

Parallel Data
Processing

Exa-scale Data Store
Cloud Infrastructure

4
Storage: Plan for Peta-scale Data Storage and Processing
Analytics Rapidly Outgrows Traditional Data Size
by 100x

1000

100

10

Online Apps
PB of
Data

Analytics
1

0.1

0.01
2000

5

2003

2006

2009

2012

2015
Unprecedented Scale

We are creating an Exabyte
of data every minute in 2013

Yottabyte by 2030

“Data transparency,
amplified by Social Networks
generates data at a
scale never seen before”
- The Human Face of Big Data
6
A single GE Jet Engine produces
10 Terabytes of data in one hour
– 90 Petabytes per year.
Enabling early detection of
faults, common mode failures,
product engineering feedback.

Post Mortem ! Proactively Maintained Connected Product

7
The Emerging Pattern of Big Data Systems: Manufacturing
Real-Time
Sensor
Analytics

Product
Engineering

Support

Data Science
Real-Time
Processing

Machine
Learning

Parallel Data
Processing

Exa-scale Data Store
Cloud Infrastructure

8
Cloud Platform

© 2009 VMware Inc. All rights reserved
Cloud Platform: Supporting Mixed Big Data Workloads

Machine
Learning
Machine
Learning

Real-Time
Analytics

Real-Time
Analytics

Hadoop

Cloud Infrastructure

Compute
Hadoop

Management

Storage/Availability
Network/Security

10
Cloud Platform: Supporting Multiple Tenants

Web User
Analytics

Historical Customer
Behavior

Financial
Analysis

Cloud Infrastructure

Compute
Management

Storage/Availability
Network/Security

11
What if you can…
Recommendation engine

Production

Test

Ad targeting

Production

One physical platform to support multiple virtual big
data clusters

Test
Experimentation

Production
recommendation engine

Experimentation

12

Experimentation

Test/Dev
Production
Ad Targeting
Values of a Cloud Platform for Big Data

1

Agility / Rapid deployment

2

Isolation for resource control
and security

3

Lower Capex

4

Operational efficiency

Compute
Management

Storage/Availability
Network/Security

13
Hadoop as a Service

Operational Simplicity

High Availability

Elasticity & Multi-tenancy

!  Rapid deployment

!  High availability for
entire Hadoop stack

!  Shrink and expand
cluster on demand

!  One click to setup

!  Independent scaling of
Compute and data

!  One stop command
center
!  Easy to configure/
reconfigure

14

!  Battle-tested
!  Strong multi-tenancy
Self Service Access to Big Data Environments

Developer

Data Scientist

High priority

…

•  3 Hadoop nodes
•  Cloudera, Pivotal
MapR
•  Small VM
•  Local storage
•  No HA
•  …

• 
• 
• 
• 
• 
• 

• 
• 
• 
• 
• 
• 

•  …
•  …

5 Hadoop nodes
Cloudera, Pivotal
Hive, Pig
Medium VM
HA
…

50 Hadoop nodes
Cloudera
Hive, Pig
Large VM
HA
…

Templates for Different Cloud Users

15
Big Data needs a Mix of Workloads

Other

Hadoop
batch analysis

Compute
layer

NoSQL
Big SQL

HBase
real-time queries

Impala,
Pivotal HawQ

Cassandra,
Mongo, etc

Spark,
Shark,
Solr,
Platfora,
Etc,…

File System/Data Store
Platform Virtualization Technology
Host

16

Host

Host

Host

Host

Host

Host
Strong Isolation between Workloads is Key

Hungry
Workload 1

Reckless
Workload 2

Virtualization Platform

17

Nosy
Workload 3
Community activity in Isolation and Resource Management

!  YARN
•  Goal: Support workloads other than M-R on Hadoop
•  Initial need is for MPI/M-R from Yahoo
•  Non-posix File system self selects workload types

!  Mesos
•  Distributed Resource Broker
•  Mixed Workloads with some RM
•  Active project, in use at Twitter
•  Leverages OS Virtualization – e.g. cgroups

!  Virtualization
•  Virtual machine as the primary isolation, resource management and
versioned deployment container

•  Basis for Project Serengeti

18
Use case: Elastic Hadoop with Tiered SLA

•  Production workloads has high priority
•  Experimentation workloads has lower priority
Experimentation
Mapreduce

Compute layer

Dynamic resourcepool

Compute
VM

Compute
VM

Experimentation
Experimentation

Production
Mapreduce
Compute
VM

Compute
VM

Compute
VM

19

Compute
VM

Compute
VM

Production
recommendation engine

Production

vSphere

Data layer

Compute
VM
Cloud Enabled Auto-elastic Hadoop

J
T

N
N

T
T

VHM

T
T

T
T

J
T

T
T

T
T

DATA VM

DATA VM

DATA VM

ESX

ESX

ESX

Local Disks

JT: JobTracker
TT: TaskTracker
NN: NameNode
VHM: Virtual Hadoop Manager

20

T
T

Hadoop HDFS VMs
Hadoop Compute VMs

SAN/NAS

Non-Hadoop VMs
Hadoop Performance with Virtualization
32 hosts/3.6GHz 8 cores/15K RPM 146GB SAS disks/10GbE/72-96GB RAM

(lower is better)

[http://www.vmware.com/resources/techresources/10360, Jeff Buell, Apr 2013]
21
Network Platform

© 2009 VMware Inc. All rights reserved
A Typical Network Architecture
Aggrega7ng%
Switch%

Aggrega7ng%
Switch%

L2%Switch%

L2%Switch%

L2%Switch%

L2%Switch%

Top%of%Rack%Switch%

Top%of%Rack%Switch%

Top%of%Rack%Switch%

Top%of%Rack%Switch%

Host%

Host%

Host%

Host%

Host%

Host%

Host%

Host%

Host%

Host%

Host%

Host%

23

Host%

Host%

Host%

Host%
Traditional Networks: Core Switch is the Choke Point

Core

Aggregation

Network
Topology

Rack

Hosts

Hosts

Non Uniform Bandwidth
10s of Gbits
100s of Gbits
24

Modeled
Bandwidth
Modern Networks: Great for Big Data

Spine

Network
Topology

Leaf

Hosts

Uniform Bandwidth

25

Modeled
Bandwidth
Flat Networks Allow for New Infrastructure Models

Converged
Storage

Separated
Storage

Top%of%Rack%Switch%

Top%of%Rack%Switch%

Host%
Host%

Host%

Compute
Top%of%Rack%Switch%

Host%

Storage
Top%of%Rack%Switch%

Host%

Storage%

Host%

Storage%

Host%

Storage%

Host%

Storage%

Host%
Host%

Host%

Separated
Storage

Host%
Host%
Host%

26

Storage%
Storage Platform

© 2009 VMware Inc. All rights reserved
Use Local Disk where it’s Needed

SAN Storage

Local Storage

$2 - $10/Gigabyte

$1 - $5/Gigabyte

$0.05/Gigabyte

$1M gets:
0.5Petabytes
200,000 IOPS
8Gbyte/sec
28

NAS Filers

$1M gets:
1 Petabyte
200,000 IOPS
10Gbyte/sec

$1M gets:
10 Petabytes
400,000 IOPS
250 Gbytes/sec
Storage Economics: Traditional vs. Scale-out

$5.50

Traditional
SAN/NAS

$5.00
$4.50
$4.00
$3.50
$3.00
Cost per GB

$2.50
$2.00
$1.50

Distributed
Object
Storage

$1.00
$0.50
$0.5

1

2

4

8

16

Petabytes Deployed

Scale-out NAS
Isilon

29

32

64

128

HDFS
MAPR
CEPH
Gluster
Scality
Big Data Storage Architectures
External SAN
With HDFS

Local Disks
With HDFS or
Other

HDFS,
CEPH,
MAPR,
Gluster,
Scality,
…
30

External
Scale-out NAS
Features from New Storage Solutions

Snapshots

Universal File Store

Clones

Geo Replication

Erasure Coding
NFS Access
Posix Support
31

QoS Controls
SSD Capability
Summary

© 2009 VMware Inc. All rights reserved
Customers Winning from Consolidated Big Data Platforms
“Dedicated hardware makes no
sense”
“Software-defined Datacenter
enables rapid deployment
multiple tenants and labs”
“Our mixed workloads include
Hadoop, Database, ETL and
App-servers”

Compute
Management

Storage/Availability
Network/Security

33

“Any performance penalties are
minor”
Cloud Infrastructure is Ready for Big Data – Are you?

Cloud Infrastructure

34
Is Your Cloud Ready for Big Data?
Richard McDougall
CTO, Storage and Application Services
@richardmcdougll

© 2009 VMware Inc. All rights reserved

Is your cloud ready for Big Data? Strata NY 2013

  • 1.
    Is Your CloudReady for Big Data? Richard McDougall CTO, Storage and Application Services © 2009 VMware Inc. All rights reserved
  • 2.
    Not Just forthe Web Giants – The Intelligent Enterprise 2
  • 3.
    Real-time analysis allows instantunderstanding of market dynamics. Retailers can have intimate understanding of their customers needs and use direct targeted marketing. Market Segment Analysis ! Personalized Customer Targeting` 3
  • 4.
    The Emerging Patternof Big Data Systems: Retail Example Real-Time Processing Data Science Real-Time Streams Machine Learning Parallel Data Processing Exa-scale Data Store Cloud Infrastructure 4
  • 5.
    Storage: Plan forPeta-scale Data Storage and Processing Analytics Rapidly Outgrows Traditional Data Size by 100x 1000 100 10 Online Apps PB of Data Analytics 1 0.1 0.01 2000 5 2003 2006 2009 2012 2015
  • 6.
    Unprecedented Scale We arecreating an Exabyte of data every minute in 2013 Yottabyte by 2030 “Data transparency, amplified by Social Networks generates data at a scale never seen before” - The Human Face of Big Data 6
  • 7.
    A single GEJet Engine produces 10 Terabytes of data in one hour – 90 Petabytes per year. Enabling early detection of faults, common mode failures, product engineering feedback. Post Mortem ! Proactively Maintained Connected Product 7
  • 8.
    The Emerging Patternof Big Data Systems: Manufacturing Real-Time Sensor Analytics Product Engineering Support Data Science Real-Time Processing Machine Learning Parallel Data Processing Exa-scale Data Store Cloud Infrastructure 8
  • 9.
    Cloud Platform © 2009VMware Inc. All rights reserved
  • 10.
    Cloud Platform: SupportingMixed Big Data Workloads Machine Learning Machine Learning Real-Time Analytics Real-Time Analytics Hadoop Cloud Infrastructure Compute Hadoop Management Storage/Availability Network/Security 10
  • 11.
    Cloud Platform: SupportingMultiple Tenants Web User Analytics Historical Customer Behavior Financial Analysis Cloud Infrastructure Compute Management Storage/Availability Network/Security 11
  • 12.
    What if youcan… Recommendation engine Production Test Ad targeting Production One physical platform to support multiple virtual big data clusters Test Experimentation Production recommendation engine Experimentation 12 Experimentation Test/Dev Production Ad Targeting
  • 13.
    Values of aCloud Platform for Big Data 1 Agility / Rapid deployment 2 Isolation for resource control and security 3 Lower Capex 4 Operational efficiency Compute Management Storage/Availability Network/Security 13
  • 14.
    Hadoop as aService Operational Simplicity High Availability Elasticity & Multi-tenancy !  Rapid deployment !  High availability for entire Hadoop stack !  Shrink and expand cluster on demand !  One click to setup !  Independent scaling of Compute and data !  One stop command center !  Easy to configure/ reconfigure 14 !  Battle-tested !  Strong multi-tenancy
  • 15.
    Self Service Accessto Big Data Environments Developer Data Scientist High priority … •  3 Hadoop nodes •  Cloudera, Pivotal MapR •  Small VM •  Local storage •  No HA •  … •  •  •  •  •  •  •  •  •  •  •  •  •  … •  … 5 Hadoop nodes Cloudera, Pivotal Hive, Pig Medium VM HA … 50 Hadoop nodes Cloudera Hive, Pig Large VM HA … Templates for Different Cloud Users 15
  • 16.
    Big Data needsa Mix of Workloads Other Hadoop batch analysis Compute layer NoSQL Big SQL HBase real-time queries Impala, Pivotal HawQ Cassandra, Mongo, etc Spark, Shark, Solr, Platfora, Etc,… File System/Data Store Platform Virtualization Technology Host 16 Host Host Host Host Host Host
  • 17.
    Strong Isolation betweenWorkloads is Key Hungry Workload 1 Reckless Workload 2 Virtualization Platform 17 Nosy Workload 3
  • 18.
    Community activity inIsolation and Resource Management !  YARN •  Goal: Support workloads other than M-R on Hadoop •  Initial need is for MPI/M-R from Yahoo •  Non-posix File system self selects workload types !  Mesos •  Distributed Resource Broker •  Mixed Workloads with some RM •  Active project, in use at Twitter •  Leverages OS Virtualization – e.g. cgroups !  Virtualization •  Virtual machine as the primary isolation, resource management and versioned deployment container •  Basis for Project Serengeti 18
  • 19.
    Use case: ElasticHadoop with Tiered SLA •  Production workloads has high priority •  Experimentation workloads has lower priority Experimentation Mapreduce Compute layer Dynamic resourcepool Compute VM Compute VM Experimentation Experimentation Production Mapreduce Compute VM Compute VM Compute VM 19 Compute VM Compute VM Production recommendation engine Production vSphere Data layer Compute VM
  • 20.
    Cloud Enabled Auto-elasticHadoop J T N N T T VHM T T T T J T T T T T DATA VM DATA VM DATA VM ESX ESX ESX Local Disks JT: JobTracker TT: TaskTracker NN: NameNode VHM: Virtual Hadoop Manager 20 T T Hadoop HDFS VMs Hadoop Compute VMs SAN/NAS Non-Hadoop VMs
  • 21.
    Hadoop Performance withVirtualization 32 hosts/3.6GHz 8 cores/15K RPM 146GB SAS disks/10GbE/72-96GB RAM (lower is better) [http://www.vmware.com/resources/techresources/10360, Jeff Buell, Apr 2013] 21
  • 22.
    Network Platform © 2009VMware Inc. All rights reserved
  • 23.
    A Typical NetworkArchitecture Aggrega7ng% Switch% Aggrega7ng% Switch% L2%Switch% L2%Switch% L2%Switch% L2%Switch% Top%of%Rack%Switch% Top%of%Rack%Switch% Top%of%Rack%Switch% Top%of%Rack%Switch% Host% Host% Host% Host% Host% Host% Host% Host% Host% Host% Host% Host% 23 Host% Host% Host% Host%
  • 24.
    Traditional Networks: CoreSwitch is the Choke Point Core Aggregation Network Topology Rack Hosts Hosts Non Uniform Bandwidth 10s of Gbits 100s of Gbits 24 Modeled Bandwidth
  • 25.
    Modern Networks: Greatfor Big Data Spine Network Topology Leaf Hosts Uniform Bandwidth 25 Modeled Bandwidth
  • 26.
    Flat Networks Allowfor New Infrastructure Models Converged Storage Separated Storage Top%of%Rack%Switch% Top%of%Rack%Switch% Host% Host% Host% Compute Top%of%Rack%Switch% Host% Storage Top%of%Rack%Switch% Host% Storage% Host% Storage% Host% Storage% Host% Storage% Host% Host% Host% Separated Storage Host% Host% Host% 26 Storage%
  • 27.
    Storage Platform © 2009VMware Inc. All rights reserved
  • 28.
    Use Local Diskwhere it’s Needed SAN Storage Local Storage $2 - $10/Gigabyte $1 - $5/Gigabyte $0.05/Gigabyte $1M gets: 0.5Petabytes 200,000 IOPS 8Gbyte/sec 28 NAS Filers $1M gets: 1 Petabyte 200,000 IOPS 10Gbyte/sec $1M gets: 10 Petabytes 400,000 IOPS 250 Gbytes/sec
  • 29.
    Storage Economics: Traditionalvs. Scale-out $5.50 Traditional SAN/NAS $5.00 $4.50 $4.00 $3.50 $3.00 Cost per GB $2.50 $2.00 $1.50 Distributed Object Storage $1.00 $0.50 $0.5 1 2 4 8 16 Petabytes Deployed Scale-out NAS Isilon 29 32 64 128 HDFS MAPR CEPH Gluster Scality
  • 30.
    Big Data StorageArchitectures External SAN With HDFS Local Disks With HDFS or Other HDFS, CEPH, MAPR, Gluster, Scality, … 30 External Scale-out NAS
  • 31.
    Features from NewStorage Solutions Snapshots Universal File Store Clones Geo Replication Erasure Coding NFS Access Posix Support 31 QoS Controls SSD Capability
  • 32.
    Summary © 2009 VMwareInc. All rights reserved
  • 33.
    Customers Winning fromConsolidated Big Data Platforms “Dedicated hardware makes no sense” “Software-defined Datacenter enables rapid deployment multiple tenants and labs” “Our mixed workloads include Hadoop, Database, ETL and App-servers” Compute Management Storage/Availability Network/Security 33 “Any performance penalties are minor”
  • 34.
    Cloud Infrastructure isReady for Big Data – Are you? Cloud Infrastructure 34
  • 35.
    Is Your CloudReady for Big Data? Richard McDougall CTO, Storage and Application Services @richardmcdougll © 2009 VMware Inc. All rights reserved