Virtualize Big Data to Make the Elephant Dance
June Yang, Senior Director of Product Management, VMware
Dan Baskette, Senior Consultant Technologist, Pivotal

© Copyright 2013 EMC Corporation. All rights reserved.

Unstructured Data is exploding… Hadoop is driving growth

Hadoop adoption is ramping
[Pie chart: Evaluating 53%, In production 23%, Piloting 18%, Testing 2%, Don't know 2%, Other 2%]

Unstructured data driving growth
[Chart: structured vs. unstructured data volume, 2011 to 2020]
Complex unstructured data forecasted to outpace structured relational data by 10x by 2020

Source: Forrester Survey of 60 CIOs, September 2011

• Unstructured data explosion and Hadoop capabilities causing CIOs to reconsider Enterprise data strategy
• Gartner predicts +800% data growth over next 5 years
• Hadoop’s ability to process raw data at cost presents intriguing value prop for CIOs

Broad Application of Hadoop Technology
Use Cases
• Log Processing / Click Stream Analytics
• Machine Learning / sophisticated data mining
• Web crawling / text processing
• Extract Transform Load (ETL) replacement
• Image / XML message processing
• General archiving / compliance

Vertical Industries
• Financial Services
• Internet Retailer
• Pharmaceutical / Drug Discovery
• Mobile / Telecom
• Scientific Research
• Social Media

Hadoop is a platform that will revolutionize how Enterprises handle data

The Big Data Journey in the Enterprise
Stage 3: Cloud Analytics Platform
• Serve many departments
• Often part of mission critical workflow
• Fully integrated with analytics/BI tools

Stage 2: Hadoop Production
• Serve a few departments
• More use cases
• Growing # and size of clusters
• Core Hadoop + components

Stage 1: Hadoop Piloting
• Often start with line of business
• Try 1 or 2 use cases to explore the value of Hadoop

[Chart axes: Integrated (vertical); Scale (horizontal): 0 node, 10’s, 100’s]
Deploy Hadoop Clusters in Minutes

One click to scale out your cluster on the fly

Customize your Hadoop/HBase Cluster
Customize with Cluster Specification File

Cluster Spec File Details
Storage configuration

Choice of shared storage or Local disk

High availability option

# of Hadoop nodes
Resource configuration


Cluster Specification File
"groups":[
  { "name":"master",
    "roles":[
      "hadoop_namenode",
      "hadoop_jobtracker"],
    "storage": {
      "type": "SHARED", "sizeGB": 20},
    "instance_type":"MEDIUM",
    "instance_num":1,
    "ha":true},
  { "name":"worker",
    "roles":[
      "hadoop_datanode",
      "hadoop_tasktracker"
    ],
    "instance_type":"SMALL",
    "instance_num":5,
    "ha":false
…
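
The fields in a spec like this can be checked before a cluster is deployed. What follows is a minimal sketch, not Serengeti's own tooling: it assumes the fragment on the slide is completed into valid JSON (the closing brackets and the quoted instance types are assumptions, since the slide elides the rest of the file), and `summarize_spec` is a hypothetical helper name.

```python
import json

# Hypothetical completion of the slide's fragment. The slide truncates the
# file, so the outer braces and closing brackets here are assumptions.
SPEC_TEXT = """
{
  "groups": [
    {"name": "master",
     "roles": ["hadoop_namenode", "hadoop_jobtracker"],
     "storage": {"type": "SHARED", "sizeGB": 20},
     "instance_type": "MEDIUM",
     "instance_num": 1,
     "ha": true},
    {"name": "worker",
     "roles": ["hadoop_datanode", "hadoop_tasktracker"],
     "instance_type": "SMALL",
     "instance_num": 5,
     "ha": false}
  ]
}
"""


def summarize_spec(spec_text):
    """Count nodes per role and list the node groups that request HA."""
    spec = json.loads(spec_text)
    summary = {"total_nodes": 0, "ha_groups": [], "roles": {}}
    for group in spec["groups"]:
        count = group["instance_num"]
        summary["total_nodes"] += count
        if group.get("ha"):
            summary["ha_groups"].append(group["name"])
        for role in group["roles"]:
            summary["roles"][role] = summary["roles"].get(role, 0) + count
    return summary


SUMMARY = summarize_spec(SPEC_TEXT)
print(SUMMARY)
```

A summary like this makes the callouts on the slide concrete: the number of Hadoop nodes comes from `instance_num`, and the high availability option from the per-group `ha` flag.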

Your Choice of Hadoop Distributions and Tools
Distributions

Community Projects

• Flexibility to choose and try out major distributions
• Support for multiple projects
• Open architecture to welcome industry participation
• Contributing Hadoop Virtualization Extensions (HVE) to open source community
Proactive monitoring with vCenter Operations Manager (vCOps)
• Proactively monitor through vCOps
• Gain comprehensive visibility
• Eliminate manual processes with intelligent automation
• Proactively manage operations
• Alternatively, use monitoring tools like Nagios, Ganglia

Beyond day 1 - Automation of Hadoop Cluster lifecycle management

[Lifecycle diagram: Deploy, Customize, Scaling, Tune configuration, Load data, Execute jobs, …]

The Big Data Journey in the Enterprise
Stage 2: Hadoop Production
• Serve a few departments
• More use cases
• Growing # and size of clusters
• Core Hadoop + components

Stage 1: Hadoop Piloting
• Rapid deployment
• On the fly cluster resizing
• Choice of Hadoop distros
• Automation of cluster lifecycle

[Chart axes: Integrated (vertical); Scale (horizontal): 0 node, 10’s, 100’s]
Achieve HA for the Entire Hadoop Stack

[Stack diagram components: ZooKeeper (Coordination), Pig (Data Flow), Hive (SQL), HCatalog, MapReduce (Job Scheduling/Execution System), HBase (Key-Value store), HDFS (Hadoop Distributed File System), Jobtracker, Namenode, Hive MetaDB, HCatalog MDB, RDBMS, BI Reporting, ETL Tools, Management Server, Server]

• vSphere HA is battle-tested high availability technology
• Single mechanism to achieve HA for the entire Hadoop stack
• One click to enable HA and/or FT

Challenges of Running Hadoop in Enterprises
Dept A: recommendation engine (Production, Test, Experimentation clusters)
Dept B: ad targeting (Production, Test, Experimentation clusters)
Data sources: Log files, Transaction data, Social data, Historical cust behavior
On the horizon… NoSQL, Real time SQL, …

Pain Points:
1. Cluster sprawling
2. Redundant common data in separate clusters
3. Difficult to use the right tool for the right problem
4. Peak compute and I/O resource is limited to number of nodes in each independent cluster
What if you can…
Recommendation engine: Production, Test, Experimentation
Ad targeting: Production, Test, Experimentation

One physical platform to support multiple virtual big data clusters
• Experimentation
• Production recommendation engine
• Test/Dev
• Production Ad Targeting
Bigger is Better
• Hadoop is linearly scalable: more nodes, better performance. For the same job, it will take:
– 2 hours to complete on a 50 node cluster
– 1 hour to complete on a 100 node cluster
– 30 min to complete on a 200 node cluster
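
The arithmetic behind these figures is plain inverse proportionality. The sketch below assumes ideal linear scaling, which real clusters only approximate, and the function name `ideal_runtime_minutes` is hypothetical, introduced here for illustration.

```python
def ideal_runtime_minutes(baseline_nodes, baseline_minutes, nodes):
    """Runtime under ideal linear scaling: total node-minutes stay constant."""
    if baseline_nodes <= 0 or nodes <= 0:
        raise ValueError("node counts must be positive")
    return baseline_minutes * baseline_nodes / nodes


# Baseline from the slide: 2 hours (120 minutes) on a 50 node cluster.
for cluster_size in (50, 100, 200):
    minutes = ideal_runtime_minutes(50, 120, cluster_size)
    print(f"{cluster_size} nodes: {minutes:.0f} min")
```

Doubling the node count halves the runtime in this model, which is why one consolidated cluster can finish a job sooner than several small independent ones.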

You may ask
• What about differentiated SLAs?
– For production Hadoop jobs, need to ensure high priority
– Lower priority for experimental Hadoop jobs
• Will I have a noisy neighbor problem with a shared infrastructure approach?

VM Containers with Isolation are a Tried and Tested Approach
[Diagram: Hungry Workload 1, Reckless Workload 2, and Noisy Workload 3 running on VMware vSphere + Serengeti across multiple hosts]
Shared infrastructure: Three big types of Isolation are Required

• Resource Isolation
– Control the greedy noisy neighbor
– Reserve resources to meet needs
• Version Isolation
– Allow concurrent OS, App, Distro versions
• Security Isolation
– Provide privacy between users/groups
– Runtime and data privacy required

[Diagram: VMware vSphere + Serengeti across multiple hosts]
With virtualization, you can have your cake and eat it too
• One physical platform to support multiple virtual big data clusters
– Share data to minimize copying
– Single infrastructure to maintain
– Bigger cluster for better performance
– Share hardware resources to achieve higher utilization
• Virtualization ensures strong isolation between clusters
– Resource isolation
– Failure isolation
– Configuration isolation
– Security isolation

[Diagram: Experimentation, Production recommendation engine, Test/Dev, and Production Ad Targeting clusters, marked Low Priority or High Priority, spanning a compute layer and a data layer on VMware vSphere + Serengeti]
Elastic Hadoop with Virtualization
Combined Storage/Compute: unmodified Hadoop node in a VM
• VM lifecycle determined by Datanode
• Limited elasticity

Separate Compute from Storage
• Separate compute from data
• Stateless compute
• Elastic compute

Separate Virtual Compute Clusters per tenant
• Separate virtual compute
• Compute cluster per tenant
• Stronger VM-grade security and resource isolation

[Diagram: a single VM holding a combined Hadoop node; a Compute VM separated from a Storage VM; tenant compute VMs T1 and T2 sharing a Storage VM]
Scale in/out Hadoop dynamically
• Deploy separate compute clusters for different tenants sharing HDFS.
• Commission/decommission task trackers according to priority and available resources
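
One way to picture priority-based commissioning is a greedy allocation over a fixed pool of compute VMs. This is an illustrative sketch only, not how Serengeti or vSphere actually schedules: the `Tenant` fields, the `plan_task_trackers` helper, and the example numbers are all assumptions made for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    name: str
    priority: int  # higher value is served first
    demand: int    # task tracker VMs the tenant would like
    running: int   # task tracker VMs the tenant has now


def plan_task_trackers(tenants, capacity):
    """Return per-tenant deltas: positive commissions VMs, negative decommissions."""
    remaining = capacity
    plan = {}
    for tenant in sorted(tenants, key=lambda t: t.priority, reverse=True):
        target = min(tenant.demand, remaining)  # never exceed what is left
        remaining -= target
        plan[tenant.name] = target - tenant.running
    return plan


# Hypothetical pool of 8 compute VMs shared by two tenants over one HDFS.
TENANTS = [
    Tenant("experimentation", priority=1, demand=4, running=4),
    Tenant("production", priority=10, demand=6, running=4),
]
PLAN = plan_task_trackers(TENANTS, capacity=8)
print(PLAN)
```

In this example the production tenant gains two task trackers and the experimentation tenant gives up two, because the pool cannot satisfy both demands at once and the higher priority is served first.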
[Diagram: compute layer with a Job Tracker per tenant and Compute VMs for Experimentation, Production recommendation engine, and Production, drawn from a dynamic resource pool; shared data layer; all on VMware vSphere + Serengeti]
The Big Data Journey in the Enterprise
Stage 3: Cloud Analytics Platform
• Serve many departments
• Often part of mission critical workflow
• Fully integrated with analytics/BI tools

Stage 2: Hadoop Production
• High Availability
• Consolidation
• Differentiated SLAs
• Elastic Scaling

Stage 1: Hadoop Piloting
• Rapid deployment
• On the fly cluster resizing
• Choice of Hadoop distros
• Automation of cluster lifecycle

[Chart axes: Integrated (vertical); Scale (horizontal): 0 node, 10’s, 100’s]
Cloud Analytics Platform
[Architecture diagram components: Business Intelligence, Machine Learning, Real Time Streams, Data Visualization, …, CETAS, Automated Models, Stream Processing, ETL, Real Time Structured Database, Data Warehouse, Unstructured and Batch Processing, HDFS; Cloud Infrastructure: Compute, Storage, Networking]
Big Data Tools and Characteristics
Framework | Examples | Scale of data | Scale of cluster | Computable data? | Local disks?
Map-reduce | Hadoop | 100s PB | 10s to 1,000s | Yes | Yes, for cost, bandwidth and availability
Big-SQL | HAWQ, Aster Data, Impala, … | PB’s | 10s to 100s | Some | Yes, for cost and bandwidth
No-SQL | Cassandra, HBase, … | Trillions of rows | 10s to 100s | Some | Yes, for cost and availability
In-Memory | Redis, Gemfire, Membase, … | Billions of rows | 10s to 100s | Yes | Primarily memory
Choose a platform that…
• Allows users to pick the right tools at the right time
• Puts resources where needed based on SLA policy

In-house Hadoop as a Service – (Hadoop + Hadoop)
[Diagram: compute layer running Production ETL of log files, Ad hoc data mining, and Production recommendation engine; data layer with two HDFS instances; all on VMware vSphere + Serengeti across multiple hosts]
Integrated Big Data Production – (Mixed big data workloads)
[Diagram: Hadoop batch analysis, HBase real-time queries, NoSQL (Cassandra key-value store), and MPP DBMS (analysis of structured data); compute layer and data layer (HDFS) on VMware vSphere + Serengeti across multiple hosts]
Integrated Hadoop and Webapps – (Big Data + Other Workloads)
[Diagram: Short-lived Hadoop compute cluster, Hadoop compute cluster, and Web servers for ecommerce site; compute layer and data layer (HDFS) on VMware vSphere + Serengeti across multiple hosts]
The Big Data Journey in the Enterprise
Stage 3: Cloud Analytics Platform
• Mixed workloads
• Right tool at the right time
• Flexible and elastic infrastructure

Stage 2: Hadoop Production
• High Availability
• Consolidation
• Differentiated SLAs
• Elastic Scaling

Stage 1: Hadoop Piloting
• Rapid deployment
• On the fly cluster resizing
• Choice of Hadoop distros
• Automation of cluster lifecycle

[Chart axes: Integrated (vertical); Scale (horizontal): 0 node, 10’s, 100’s]
Learn More
• Download and try Serengeti
– projectserengeti.org
• VMware Hadoop site
– vmware.com/hadoop
• Hadoop performance on vSphere white paper
– http://www.vmware.com/files/pdf/techpaper/hadoop-vsphere51-32hosts.pdf
• Hadoop virtualization extensions (HVE) white paper
– http://www.vmware.com/files/pdf/techpaper/hadoop-vsphere51-32hosts.pdf
Thank You!
June Yang, Senior Director, VMware
juneyang@vmware.com

Dan Baskette, Senior Consultant Technologist
dan.baskette@emc.com
Pivotal Sessions at EMC World
Session | Presenter | Dates/Times
The Pivotal Platform: A Purpose-Built Platform for Big-Data-Driven Applications | Josh Klahr | Tue 5:30 - 6:30, Palazzo E; Wed 11:30 - 12:30, Delfino 4005
Pivotal: Data Scientists on the Front Line: Examples of Data Science in Action | Noelle Sio | Tue 10:00 - 11:00, Lando 4205; Thu 8:30 - 9:30, Palazzo F
Pivotal: Operationalizing 1000-node Hadoop Cluster – Analytics Workbench | Clinton Ooi, Bhavin Modi | Tue 11:30 - 12:30, Palazzo L; Thu 10:00 - 11:00 am, Delfino 4001A
Pivotal: for Powerful Processing of Unstructured Data For Valuable Insights | SK Krishnamurthy | Mon 4:00 - 5:00, Lando 4201 A; Tue 4:00 - 5:00, Palazzo M
Pivotal: Big & Fast data – merging real-time data and deep analytics | Michael Crutcher | Mon 1:00 - 2:00, Lando 4201 A; Wed 10:00 - 11:00, Palazzo M
Pivotal: Virtualize Big Data to Make The Elephant Dance | June Yang, Dan Baskette | Mon 11:30 - 12:30, Marcello 4401A; Wed 4:00 - 5:00, Palazzo E
Hadoop Design Patterns | Don Miner | Mon 2:30 - 3:30, Palazzo F; Wed 8:30 - 9:30, Delfino 4005