This talk explains why virtualizing Spark, in-house or elsewhere, is a requirement in today's fast-moving and experimental world of data science and data engineering. Different teams want to spin up a Spark cluster "on the fly" to carry out some research and quickly answer business questions. They are not concerned with the availability of the server hardware – or with what any other team might be doing on it at the time. Virtualization provides a sandbox of your own in which to try out a new query or machine learning algorithm. In-depth performance test results will be shown demonstrating that Spark and ML programs on virtual machines perform as well as native implementations. The talk also introduces the best practices you should follow when deploying this way.
2. Agenda
• Why Virtualize Spark?
• A Review of the Architectures
• What does Virtualized Spark look like?
• Virtualizing Spark in the Private Cloud and Public Cloud – using the same infrastructure
• VMware Cloud on AWS – a platform for Spark in the public cloud
• Performance
• A Key Best Practice
• Conclusions
4. Use Cases: Virtualization of Big Data
• Enterprises have development, test, pre-prod staging, and production clusters that must be separated from each other and provisioned independently
• Organizations need different versions of Spark to be available to different teams – possibly with different services available
• Enterprises do not wish to dedicate a specific set of hardware to each of the requirements above, and want to reduce overall costs
CONFIDENTIAL 4
12. Deploying Spark in the Private Cloud
[Diagram: Spark virtual machines in the compute tier, connected via ENI to a storage tier – a choice of NFS, HDFS, or other systems for data handling – backed by an enterprise storage system holding the data files. Source: AWS presentation at VMworld 2017]
13. Sharing ML Models and Data using Shared Storage
[Diagram: three on-premises vSphere application clusters of Hadoop virtual nodes running on vSAN – Cluster 1: Model Development, Cluster 2: Model Testing/Staging, Cluster 3: Production – all connected via ENI to clustered file services (NFS, GlusterFS, or HDFS). Workflow: 1. Read training data; 2. Store trained ML model; 3. Read trained model and execute on test data; 4. Read trained model and execute on production data.]
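The model-sharing workflow in the figure – one cluster trains and persists a model to shared storage, and the other clusters load it by path – can be sketched independently of Spark. Below is a minimal stand-in using Python's `pickle`; in the talk's architecture this would be a Spark MLlib model saved to an NFS/GlusterFS/HDFS mount, and all names here are illustrative:

```python
import os
import pickle
import tempfile

# Stand-in for a trained ML model (in the talk: a Spark MLlib model).
model = {"weights": [0.3, -1.2, 0.8], "intercept": 0.05}

# Stand-in for the shared storage mount (NFS, GlusterFS, or HDFS).
shared_dir = tempfile.mkdtemp()
model_path = os.path.join(shared_dir, "trained_model.pkl")

# Step 2 (development cluster): store the trained model on shared storage.
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Steps 3-4 (test/production clusters): read the model back and use it.
with open(model_path, "rb") as f:
    loaded = pickle.load(f)

assert loaded == model  # all clusters see the same trained model
```

The point of the pattern is that the clusters never exchange models directly; the shared storage tier is the only coupling between them.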
15. Evolving Cloud Storage for Big Data
[Diagram: a Hadoop virtual node whose application reads input from and writes output to S3, with local storage used for caching. Source: Steve Loughran, Hadoop Summit presentation]
S3 for Big Data:
• S3 is an object storage system rather than a file system
• Moving existing HDFS data to S3 requires care (file-to-object mapping)
• S3 is eventually consistent
• Guard mechanisms such as S3Guard help ensure file consistency
• Caching locally improves the performance of storage access
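The file-to-object mapping concern can be made concrete with a small sketch. S3 has no real directories – "dir/file" is just a key prefix – so an HDFS path maps onto an object key in a bucket. The bucket name and paths below are hypothetical:

```python
# Sketch: mapping HDFS file paths to S3 object keys (illustrative names only).
# S3 "directories" are just key prefixes, so a nested HDFS layout flattens
# into a set of object keys within a single bucket.

def hdfs_to_s3_key(hdfs_path: str, bucket: str = "analytics-data") -> str:
    """Map an hdfs:// path onto an s3a:// object URL (illustrative only)."""
    path = hdfs_path[len("hdfs://"):]        # strip the scheme
    _, _, key = path.partition("/")          # drop the "namenode:8020" authority
    return f"s3a://{bucket}/{key}"

print(hdfs_to_s3_key("hdfs://namenode:8020/warehouse/events/part-00000"))
# -> s3a://analytics-data/warehouse/events/part-00000
```

Operations that are cheap on a file system, such as renaming a directory, become per-object copy-and-delete operations under this mapping, which is one reason the migration requires care.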
16. Deploying Spark in the Public Cloud
[Diagram: Spark virtual machines in the compute tier, connected via ENI to a storage tier of cloud storage – an S3 object store holding the data objects – and to other AWS services. Source: AWS presentation at VMworld 2017]
17. VMware Cloud on AWS
Powered by VMware Cloud Foundation
[Diagram: a customer data center running vSphere, vSAN, NSX, and vCenter, linked through operational management to VMware Cloud on AWS and to native AWS services on the AWS Global Infrastructure]
• ESXi on dedicated hardware
• Support for VMs and containers
• vSAN on flash and EBS storage
• Replication and DR orchestration
• NSX spanning on-premises and cloud
• Advanced networking and security services
Native AWS services available alongside: Amazon EC2, Amazon S3, Amazon RDS, AWS Direct Connect, Amazon DynamoDB, Amazon Redshift
18. Sold as a Service
§ VMware manages hypervisor and management components
§ AWS manages physical resources
§ Customer manages VMs
§ Customer decides how many VMs to run on vSphere
20. Sharing ML Models and Data using Cloud Storage
[Diagram: Application Cluster 1 (Model Development) and Application Cluster 2 (Testing/Staging) run as Hadoop virtual nodes on vSAN in VMware Cloud on AWS, while Application Cluster 3 (Production) runs on on-premises vSphere; all connect via ENI to S3, EC2 instances, and other AWS services. Workflow: 1. Read training data; 2. Store trained ML model; 3. Read trained model and execute on test data; 4. Read trained model and execute on production data.]
25. NUMA and Virtual Machine Placement
• A server with 1 TB of RAM and two sockets has 1024/2 = 512 GB on each NUMA node
• Size each VM at 482 GB of RAM so that it fits within a single NUMA node
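The sizing arithmetic behind the slide can be checked directly. The gap between the 512 GB per NUMA node and the 482 GB per VM is assumed here to be memory reserved for the hypervisor; the exact overhead figure is inferred from the slide's numbers, not stated in it:

```python
# Sketch of NUMA-aware VM sizing for Spark hosts, using the slide's figures.
# Keeping each VM within one NUMA node avoids remote-memory access penalties.

server_ram_gb = 1024                        # 1 TB of RAM on the server
numa_nodes = 2                              # one NUMA node per socket
per_node_gb = server_ram_gb // numa_nodes   # 1024/2 = 512 GB per NUMA node

overhead_gb = 30                            # assumed ESXi/hypervisor reservation
vm_ram_gb = per_node_gb - overhead_gb       # 482 GB for each VM

print(per_node_gb, vm_ram_gb)               # 512 482
assert vm_ram_gb < per_node_gb              # the VM fits within one NUMA node
```

The same calculation applies to any host: divide total RAM by the number of NUMA nodes, subtract the hypervisor's share, and cap each VM's memory at the result.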
26. Virtualizing Spark – Conclusions
vSphere and VMware Cloud on AWS
• Agility – infrastructure on demand; sharing of physical resources rather than dedicated clusters
• Simplified management – centralized data center management; apply virtualization best practices
• Efficiency – resource pooling; server and cluster consolidation
• Performance – equal to, or better than, native Hadoop; no significant overhead
31. Workloads – Spark
• Two standard analytic programs from Spark MLlib (the Machine Learning Library):
  – Support Vector Machine
  – Logistic Regression
• Driven using SparkBench (https://github.com/SparkTC/spark-bench)
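For context, MLlib workloads like these are ultimately launched through `spark-submit`, and the executor sizing flags are where the NUMA guidance above is applied. The jar, class name, paths, and sizing values below are hypothetical; only the `spark-submit` flags themselves are standard:

```shell
# Sketch: launching an MLlib workload on a cluster (hypothetical jar/class/paths).
# --executor-memory and --executor-cores are the knobs to align with the
# NUMA-aware VM sizing discussed earlier in this talk.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.example.LogisticRegressionWorkload \
  --num-executors 10 \
  --executor-cores 8 \
  --executor-memory 56g \
  workload.jar hdfs://namenode:8020/data/training
```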
34. Results – Spark
• The Support Vector Machine workload, which stayed in memory, ran about 10% faster virtualized than on bare metal
• The Logistic Regression workload, which was written to disk at the larger dataset sizes, showed a slight advantage for bare metal:
  – part of the dataset was cached to disk
  – the larger memory of the bare-metal Spark executors may help here
• Both workloads showed linear scaling from 5 to 10 hosts and as dataset size increased
35. Conclusions
§ Spark workloads work very well on VMware vSphere
• Various performance studies have shown that any difference between virtualized and native performance is minimal
• Follow the general best-practice guidelines that VMware has published
• Design patterns such as data-compute separation can be used to provide elasticity for your Spark cluster