SlideShare a Scribd company logo
1 of 36
Download to read offline
Justin Murray,
VMware
Virtualizing Spark
Agenda
• Why Virtualize Spark?
• A Review of the Architectures
• What does Virtualized Spark look like?
• Virtualizing Spark in the Private Cloud and Public Cloud – using
the same infrastructure
• VMware Cloud on AWS – a platform for Spark in the public cloud
• Performance
• A Key Best Practice
• Conclusions
Why Virtualize Spark?
Use Cases : Virtualization of Big Data
•Enterprises have development, test, pre-prod staging and
production clusters that are required to be separated from each
other and provisioned independently
•Organizations need different versions of Spark to be available
to different teams - with possibly different services available
•Enterprises do not wish to dedicate a specific set of hardware to
each different requirement above, and want to reduce overall
costs
CONFIDENTIAL 4
Worker Node 1 Worker Node 2 Worker Node 3
Input File
Traditional Hadoop YARN Architecture
ResourcemanagerJob
Datanode
Nodemanager
AppMaster - 1
Nodemanager Nodemanager
Datanode Datanode
Block 1 Block 2 Block 3
Container - 2 Container - 3
Master Roles
Namenode
Worker Node 1 Worker Node 2 Worker Node 3
Input File
Hadoop/YARN – in Virtual Machines
ResourceManagerJob
Datanode
Nodemanager
AppMaster - 1
Nodemanager Nodemanager
Datanode Datanode
Block 1 Block 2 Block 3
Container - 2 Container - 3
Namenode
Master Roles
High Level View of Spark
Worker Node 1 Worker Node 2 Worker Node 3
Spark Standalone - Virtualized
Driver
Job
Executor
JVM
Executor Executor
JVM JVM
Executor
JVM
Executor
JVM
Executor
JVM
Virtual
Machine
NodemanagerNodemanagerNodemanager
Worker Node 1 Worker Node 2 Worker Node 3
The Spark Architecture (on YARN)
Job
Datanode
AppMaster - 1
Datanode Datanode
Block 1 Block 2 Block 3
Container - 2 Container - 3
Namenode
Driver Executor Executor
Resourcemanager
Virtualizing Spark in the Private
Cloud and Public Cloud
Using the Same Infrastructure
High Level View of Spark
Deploying Spark in the Private Cloud
Source: AWS presentation at VMworld 2017
ENI
Compute Tier- Spark
Virtual Machine
Storage Tier-
choice of NFS or HDFS or others for data
handling
File1
File2
Enterprise
Storage
System
ENI
NFS, GlusterFS or HDFS
Hadoop
Virtual
Node 1
Clustered File Services
Hadoop
Virtual
Node 1
vSAN
On-Premises vSphere
Application Cluster 1 –
Model Development
Hadoop
Virtual
Node 1
vSAN
On-premises vSphere
Application Cluster 2 –
Model Testing/Staging
Hadoop
Virtual
Node 1
vSAN
On-Premises vSphere
Application Cluster 3 -
Production
2. Store
trained
M
L
m
odel
1. Read
training
data
3. Read trained model and execute on test
data
4. Read trained model
and execute on
production data
Sharing ML Models and Data using Shared Storage
Big Data - Storage Evolution
Hadoop
Virtual
Node 1
Input
Application
HDFS
Application
Output Input
Backup Restore Output
Copy
HDFS
HDFS
Application
tmpInput
Hadoop
Virtual
Node 1
Source: Steve Loughran, Hadoop Summit 2016
presentation
Output
Cloud Storage Cloud Storage
Evolving Cloud Storage for Big Data
Output
Local
Storage
Application
Input
Hadoop
Virtual
Node 1
Source: Steve Loughran, Hadoop Summit
presentation
S3
Caching
– S3 for Big Data
• S3 is an Object storage system rather than a file
system
• Moving existing HDFS data to S3 requires care (file->
object mapping)
• S3 is eventually consistent
• Guard mechanisms like S3Guard to ensure file
consistency
• Caching locally to improve performance of storage
access
Deploying Spark in the Public Cloud
Source: AWS presentation at VMworld 2017
S3 Object
Store
ENI
Other AWS Services
Compute Tier- Spark
Virtual Machine
Storage Tier- Cloud Storage
Object1
Object2
VMware Cloud on AWS
Powered by VMware Cloud Foundation
Customer Data Center
vSphere vSAN NSX
Operational
Management
Native AWS Services
…
…
…
…
vCentervCenter
VMware Cloud on AWS
• ESXi on Dedicated
Hardware
• Support for VMs and
Containers
• vSAN on Flash and EBS
Storage
• Replication and DR
Orchestration
• NSX Spanning on-
premises and cloud
• Advanced
Networking, and
Security Services
AWS Global Infrastructure
Amazon
EC2
Amazon
S3
Amazon
RDS
AWS Direct
Connect
Amazon
DynamoDB
Amazon
Redshift
Sold as a Service
§ VMware manages hypervisor and management components
§ AWS manages physical resources
§ Customer manages VMs
§ Customer decides how many VMs to run on vSphere
VMware Cloud on AWS : Integration to AWS Services
Hadoop
Virtual
Node 1
Customer-Owned AWS Account
Source: AWS presentation at VMworld 2017
VMware Cloud on AWS Account
Hadoop
Virtual
Node 1
Logical Network
172.31.20.0/24
VM VM
CGW
SDDC
NSX
VSAN
ESXi
Hadoop
Virtual
Node 1
VPC
AWS
Hadoop
Virtual
Node 1
EC2 Instances
Hadoop
Virtual
Node 1 EC2 Instances
S3 VPC
EndPoint
ENI
Other AWS Services
VPC Subnet –
10.1.1.0/24
VPC Subnet –
10.1.2.0/24
Org
Elastic
Network
Interface
Network
Backbone
Compute
Gateway
ENI
AWS
Hadoop
Virtual
Node 1
EC2 Instances
S3
Other AWS Services
Hadoop
Virtual
Node 1
vSAN
VMware Cloud on AWS
Application Cluster 1 –
Model Development
Hadoop
Virtual
Node 1
vSAN
VMware Cloud on AWS
Application Cluster 2 –
Testing/Staging
Hadoop
Virtual
Node 1
vSAN
On-premises vSphere
Application Cluster 3 -
Production
2. Store
trained
M
L
m
odel
1. Read
training
data
3. Read trained model and execute on test
data
4. Read trained model
and execute on
production data
Sharing ML models and Data using Cloud Storage
Spark Workers in Docker Containers on vSphere
21
Performance
Spark Random Forest Performance
#VIRT1445BU CONFIDENTIAL 23
Spark Logistic Regression Performance
#VIRT1445BU CONFIDENTIAL 24
1 TB RAM
on Server
Each NUMA
Node has 1024/2
512GB
482 GB RAM
for each VM
NUMA and Virtual
Machine Placement
Virtualizing Spark - conclusions
Agility
• Infrastructure on
demand
• Sharing of physical
resources – not
dedicated clusters
Simplified
Management
• Centralized
data center
management
• Apply virtualization
best practices
Efficiency
• Resource pooling
• Server and cluster
consolidation
Performance
• Equal to, or better
performance than
native Hadoop
• No significant
overhead
26
vSphere and VMware Cloud on AWS
Thank You.
Contact jmurray@vmware.com or
bigdata@vmware.com
Virtualization
Host Server
VMDK
Hadoop
Node 1
Virtual
Machine
Datanode
Ext4
Nodemanager
Ext4 Ext4 Ext4
Six or More Local DAS disks per Virtual Machine
VMDK VMDK VMDK VMDK VMDK VMDK VMDK
Hadoop
Node 2
Virtual
Machine
Datanode
Ext4
Nodemanager
Ext4 Ext4 Ext4Ext4
VMDKVMDK VMDKVMDK
Ext4Ext4Ext4
Combined Model: Two Virtual Machines on a Host
#1 Reference Architecture from
Cloudera
Performance
Workloads - Spark
• Two standard analytic programs from the Spark MLLib (Machine Learning
Library)
• Driven using SparkBench (https://github.com/SparkTC/spark-bench)
– Support Vector Machine
– Logistic Regression
CONFIDENTIAL 31
Spark Support Vector Machine
Performance
CONFIDENTIAL 32
Spark Logistic Regression
Performance
CONFIDENTIAL 33
Results - Spark
•Support Vector Machines workload, which stayed in memory, ran
about 10% faster in virtualized form than on bare metal
•Logistic Regression workload, which was written to disk at the
larger dataset sizes, showed a slight advantage to bare metal
•part of the dataset was cached to disk,
•larger memory of the bare metal Spark executors may help
•Both workloads showed linear scaling from 5 to 10 hosts and as
dataset size increased
CONFIDENTIAL 34
§Spark workloads work very well on VMware
vSphere
• Various performance studies have shown that any
difference between virtualized performance and native
performance is minimal
• Follow the general best practice guidelines that VMware
has published
• Design patterns such as data-compute separation can be
used to provide elasticity of your Spark cluster.
Conclusions
Add Slides as Necessary
• Supporting points go here.

More Related Content

What's hot

Jenkins, jclouds, CloudStack, and CentOS by David Nalley
Jenkins, jclouds, CloudStack, and CentOS by David NalleyJenkins, jclouds, CloudStack, and CentOS by David Nalley
Jenkins, jclouds, CloudStack, and CentOS by David Nalleybuildacloud
 
AWS and VMware: How to Architect and Manage Hybrid Environments
AWS and VMware: How to Architect and Manage Hybrid EnvironmentsAWS and VMware: How to Architect and Manage Hybrid Environments
AWS and VMware: How to Architect and Manage Hybrid EnvironmentsRightScale
 
Orchestration & provisioning
Orchestration & provisioningOrchestration & provisioning
Orchestration & provisioningbuildacloud
 
Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Cloud Architecture Tutorial - Platform Component Architecture (2of3)Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Cloud Architecture Tutorial - Platform Component Architecture (2of3)Adrian Cockcroft
 
Hypervisor Selection in Apache CloudStack 4.4
Hypervisor Selection in Apache CloudStack 4.4Hypervisor Selection in Apache CloudStack 4.4
Hypervisor Selection in Apache CloudStack 4.4Tim Mackey
 
CloudStack-Developer-Day
CloudStack-Developer-DayCloudStack-Developer-Day
CloudStack-Developer-DayKimihiko Kitase
 
Docker Based Hadoop Provisioning
Docker Based Hadoop ProvisioningDocker Based Hadoop Provisioning
Docker Based Hadoop ProvisioningDataWorks Summit
 
SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)
SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)
SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)Amazon Web Services
 
Cloud Architecture best practices
Cloud Architecture best practicesCloud Architecture best practices
Cloud Architecture best practicesOmid Vahdaty
 
February 2016 HUG: Running Spark Clusters in Containers with Docker
February 2016 HUG: Running Spark Clusters in Containers with DockerFebruary 2016 HUG: Running Spark Clusters in Containers with Docker
February 2016 HUG: Running Spark Clusters in Containers with DockerYahoo Developer Network
 
Becoming the master of disaster... with asr
Becoming the master of disaster... with asrBecoming the master of disaster... with asr
Becoming the master of disaster... with asrnj-azure
 
CloudStack Architecture Future
CloudStack Architecture FutureCloudStack Architecture Future
CloudStack Architecture FutureKimihiko Kitase
 
Introduction to Apache CloudStack by David Nalley
Introduction to Apache CloudStack by David NalleyIntroduction to Apache CloudStack by David Nalley
Introduction to Apache CloudStack by David Nalleybuildacloud
 
Simple, Scalable and Highly Durable NAS in the Cloud – Amazon EFS
Simple, Scalable and Highly Durable NAS in the Cloud – Amazon EFSSimple, Scalable and Highly Durable NAS in the Cloud – Amazon EFS
Simple, Scalable and Highly Durable NAS in the Cloud – Amazon EFSAmazon Web Services
 
Advanced data migration techniques for Amazon RDS
Advanced data migration techniques for Amazon RDSAdvanced data migration techniques for Amazon RDS
Advanced data migration techniques for Amazon RDSTom Laszewski
 
Deep Dive on Amazon EC2 Instances - AWS Summit Cape Town 2017
Deep Dive on Amazon EC2 Instances - AWS Summit Cape Town 2017Deep Dive on Amazon EC2 Instances - AWS Summit Cape Town 2017
Deep Dive on Amazon EC2 Instances - AWS Summit Cape Town 2017Amazon Web Services
 
Architectures for High Availability - QConSF
Architectures for High Availability - QConSFArchitectures for High Availability - QConSF
Architectures for High Availability - QConSFAdrian Cockcroft
 
AWS vs. Azure vs. Google vs. SoftLayer: Network, Storage and DBaaS
AWS vs. Azure vs. Google vs. SoftLayer: Network, Storage and DBaaSAWS vs. Azure vs. Google vs. SoftLayer: Network, Storage and DBaaS
AWS vs. Azure vs. Google vs. SoftLayer: Network, Storage and DBaaSRightScale
 

What's hot (20)

Jenkins, jclouds, CloudStack, and CentOS by David Nalley
Jenkins, jclouds, CloudStack, and CentOS by David NalleyJenkins, jclouds, CloudStack, and CentOS by David Nalley
Jenkins, jclouds, CloudStack, and CentOS by David Nalley
 
AWS and VMware: How to Architect and Manage Hybrid Environments
AWS and VMware: How to Architect and Manage Hybrid EnvironmentsAWS and VMware: How to Architect and Manage Hybrid Environments
AWS and VMware: How to Architect and Manage Hybrid Environments
 
Orchestration & provisioning
Orchestration & provisioningOrchestration & provisioning
Orchestration & provisioning
 
Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Cloud Architecture Tutorial - Platform Component Architecture (2of3)Cloud Architecture Tutorial - Platform Component Architecture (2of3)
Cloud Architecture Tutorial - Platform Component Architecture (2of3)
 
Hypervisor Selection in Apache CloudStack 4.4
Hypervisor Selection in Apache CloudStack 4.4Hypervisor Selection in Apache CloudStack 4.4
Hypervisor Selection in Apache CloudStack 4.4
 
CloudStack-Developer-Day
CloudStack-Developer-DayCloudStack-Developer-Day
CloudStack-Developer-Day
 
Docker Based Hadoop Provisioning
Docker Based Hadoop ProvisioningDocker Based Hadoop Provisioning
Docker Based Hadoop Provisioning
 
SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)
SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)
SRV401 Deep Dive on Amazon Elastic File System (Amazon EFS)
 
Cloud Architecture best practices
Cloud Architecture best practicesCloud Architecture best practices
Cloud Architecture best practices
 
February 2016 HUG: Running Spark Clusters in Containers with Docker
February 2016 HUG: Running Spark Clusters in Containers with DockerFebruary 2016 HUG: Running Spark Clusters in Containers with Docker
February 2016 HUG: Running Spark Clusters in Containers with Docker
 
Disaster Recovery Synapse
Disaster Recovery SynapseDisaster Recovery Synapse
Disaster Recovery Synapse
 
Becoming the master of disaster... with asr
Becoming the master of disaster... with asrBecoming the master of disaster... with asr
Becoming the master of disaster... with asr
 
CloudStack Architecture Future
CloudStack Architecture FutureCloudStack Architecture Future
CloudStack Architecture Future
 
Introduction to Apache CloudStack by David Nalley
Introduction to Apache CloudStack by David NalleyIntroduction to Apache CloudStack by David Nalley
Introduction to Apache CloudStack by David Nalley
 
Simple, Scalable and Highly Durable NAS in the Cloud – Amazon EFS
Simple, Scalable and Highly Durable NAS in the Cloud – Amazon EFSSimple, Scalable and Highly Durable NAS in the Cloud – Amazon EFS
Simple, Scalable and Highly Durable NAS in the Cloud – Amazon EFS
 
Advanced data migration techniques for Amazon RDS
Advanced data migration techniques for Amazon RDSAdvanced data migration techniques for Amazon RDS
Advanced data migration techniques for Amazon RDS
 
Deep Dive on Amazon EC2 Instances - AWS Summit Cape Town 2017
Deep Dive on Amazon EC2 Instances - AWS Summit Cape Town 2017Deep Dive on Amazon EC2 Instances - AWS Summit Cape Town 2017
Deep Dive on Amazon EC2 Instances - AWS Summit Cape Town 2017
 
OpenStack Report
OpenStack ReportOpenStack Report
OpenStack Report
 
Architectures for High Availability - QConSF
Architectures for High Availability - QConSFArchitectures for High Availability - QConSF
Architectures for High Availability - QConSF
 
AWS vs. Azure vs. Google vs. SoftLayer: Network, Storage and DBaaS
AWS vs. Azure vs. Google vs. SoftLayer: Network, Storage and DBaaSAWS vs. Azure vs. Google vs. SoftLayer: Network, Storage and DBaaS
AWS vs. Azure vs. Google vs. SoftLayer: Network, Storage and DBaaS
 

Similar to Virtualizing Apache Spark and Machine Learning with Justin Murray

Getting started with MariaDB with Docker
Getting started with MariaDB with DockerGetting started with MariaDB with Docker
Getting started with MariaDB with DockerMariaDB plc
 
Getting Started with MariaDB with Docker
Getting Started with MariaDB with DockerGetting Started with MariaDB with Docker
Getting Started with MariaDB with DockerMariaDB plc
 
Morning Coffee - Windows Server 2016
Morning Coffee - Windows Server 2016Morning Coffee - Windows Server 2016
Morning Coffee - Windows Server 2016Primend
 
Virtualizing Apache Spark with Justin Murray
Virtualizing Apache Spark with Justin MurrayVirtualizing Apache Spark with Justin Murray
Virtualizing Apache Spark with Justin MurrayDatabricks
 
SharePoint on Microsoft Azure
SharePoint on Microsoft AzureSharePoint on Microsoft Azure
SharePoint on Microsoft AzureK.Mohamed Faizal
 
Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk Eran Gampel
 
Windows Server 2012 R2 Jump Start - Intro
Windows Server 2012 R2 Jump Start - IntroWindows Server 2012 R2 Jump Start - Intro
Windows Server 2012 R2 Jump Start - IntroPaulo Freitas
 
MariaDB on Docker
MariaDB on DockerMariaDB on Docker
MariaDB on DockerMariaDB plc
 
Cloud Foundry for Spring Developers
Cloud Foundry for Spring DevelopersCloud Foundry for Spring Developers
Cloud Foundry for Spring DevelopersGunnar Hillert
 
Docker - Portable Deployment
Docker - Portable DeploymentDocker - Portable Deployment
Docker - Portable Deploymentjavaonfly
 
Usman Shakeel - Cloud Rendering at Scale :: AWS Rendering Seminar
Usman Shakeel - Cloud Rendering at Scale :: AWS Rendering SeminarUsman Shakeel - Cloud Rendering at Scale :: AWS Rendering Seminar
Usman Shakeel - Cloud Rendering at Scale :: AWS Rendering SeminarAmazon Web Services Korea
 
1. beyond mission critical virtualizing big data and hadoop
1. beyond mission critical   virtualizing big data and hadoop1. beyond mission critical   virtualizing big data and hadoop
1. beyond mission critical virtualizing big data and hadoopChiou-Nan Chen
 
Best Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker ContainersBest Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker ContainersBlueData, Inc.
 
State of the Container Ecosystem
State of the Container EcosystemState of the Container Ecosystem
State of the Container EcosystemVinay Rao
 
windows server 2012 R2
windows server 2012 R2windows server 2012 R2
windows server 2012 R2Gol D Roger
 
vCloud Technical deck - cb.ppt
vCloud Technical deck - cb.pptvCloud Technical deck - cb.ppt
vCloud Technical deck - cb.pptjuergenJaeckel
 
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 GamingAmazon Web Services Korea
 
VMworld 2013: Three Advantages of Running Cloud Foundry in a VMware Private C...
VMworld 2013: Three Advantages of Running Cloud Foundry in a VMware Private C...VMworld 2013: Three Advantages of Running Cloud Foundry in a VMware Private C...
VMworld 2013: Three Advantages of Running Cloud Foundry in a VMware Private C...VMworld
 
(CMP404) Cloud Rendering at Walt Disney Animation Studios
(CMP404) Cloud Rendering at Walt Disney Animation Studios(CMP404) Cloud Rendering at Walt Disney Animation Studios
(CMP404) Cloud Rendering at Walt Disney Animation StudiosAmazon Web Services
 

Similar to Virtualizing Apache Spark and Machine Learning with Justin Murray (20)

Getting started with MariaDB with Docker
Getting started with MariaDB with DockerGetting started with MariaDB with Docker
Getting started with MariaDB with Docker
 
Getting Started with MariaDB with Docker
Getting Started with MariaDB with DockerGetting Started with MariaDB with Docker
Getting Started with MariaDB with Docker
 
Morning Coffee - Windows Server 2016
Morning Coffee - Windows Server 2016Morning Coffee - Windows Server 2016
Morning Coffee - Windows Server 2016
 
Virtualizing Apache Spark with Justin Murray
Virtualizing Apache Spark with Justin MurrayVirtualizing Apache Spark with Justin Murray
Virtualizing Apache Spark with Justin Murray
 
SharePoint on Microsoft Azure
SharePoint on Microsoft AzureSharePoint on Microsoft Azure
SharePoint on Microsoft Azure
 
Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk
 
Windows Server 2012 R2 Jump Start - Intro
Windows Server 2012 R2 Jump Start - IntroWindows Server 2012 R2 Jump Start - Intro
Windows Server 2012 R2 Jump Start - Intro
 
MariaDB on Docker
MariaDB on DockerMariaDB on Docker
MariaDB on Docker
 
Cloud Foundry for Spring Developers
Cloud Foundry for Spring DevelopersCloud Foundry for Spring Developers
Cloud Foundry for Spring Developers
 
Docker - Portable Deployment
Docker - Portable DeploymentDocker - Portable Deployment
Docker - Portable Deployment
 
Usman Shakeel - Cloud Rendering at Scale :: AWS Rendering Seminar
Usman Shakeel - Cloud Rendering at Scale :: AWS Rendering SeminarUsman Shakeel - Cloud Rendering at Scale :: AWS Rendering Seminar
Usman Shakeel - Cloud Rendering at Scale :: AWS Rendering Seminar
 
1. beyond mission critical virtualizing big data and hadoop
1. beyond mission critical   virtualizing big data and hadoop1. beyond mission critical   virtualizing big data and hadoop
1. beyond mission critical virtualizing big data and hadoop
 
Best Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker ContainersBest Practices for Running Kafka on Docker Containers
Best Practices for Running Kafka on Docker Containers
 
Enterprise Workloads on AWS
Enterprise Workloads on AWSEnterprise Workloads on AWS
Enterprise Workloads on AWS
 
State of the Container Ecosystem
State of the Container EcosystemState of the Container Ecosystem
State of the Container Ecosystem
 
windows server 2012 R2
windows server 2012 R2windows server 2012 R2
windows server 2012 R2
 
vCloud Technical deck - cb.ppt
vCloud Technical deck - cb.pptvCloud Technical deck - cb.ppt
vCloud Technical deck - cb.ppt
 
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
 
VMworld 2013: Three Advantages of Running Cloud Foundry in a VMware Private C...
VMworld 2013: Three Advantages of Running Cloud Foundry in a VMware Private C...VMworld 2013: Three Advantages of Running Cloud Foundry in a VMware Private C...
VMworld 2013: Three Advantages of Running Cloud Foundry in a VMware Private C...
 
(CMP404) Cloud Rendering at Walt Disney Animation Studios
(CMP404) Cloud Rendering at Walt Disney Animation Studios(CMP404) Cloud Rendering at Walt Disney Animation Studios
(CMP404) Cloud Rendering at Walt Disney Animation Studios
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...ThinkInnovation
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 

Recently uploaded (20)

Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
Predictive Analysis - Using Insight-informed Data to Determine Factors Drivin...
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 

Virtualizing Apache Spark and Machine Learning with Justin Murray

  • 2. Agenda • Why Virtualize Spark? • A Review of the Architectures • What does Virtualized Spark look like? • Virtualizing Spark in the Private Cloud and Public Cloud – using the same infrastructure • VMware Cloud on AWS – a platform for Spark in the public cloud • Performance • A Key Best Practice • Conclusions
  • 4. Use Cases : Virtualization of Big Data •Enterprises have development, test, pre-prod staging and production clusters that are required to be separated from each other and provisioned independently •Organizations need different versions of Spark to be available to different teams - with possibly different services available •Enterprises do not wish to dedicate a specific set of hardware to each different requirement above, and want to reduce overall costs CONFIDENTIAL 4
  • 5. Worker Node 1 Worker Node 2 Worker Node 3 Input File Traditional Hadoop YARN Architecture ResourcemanagerJob Datanode Nodemanager AppMaster - 1 Nodemanager Nodemanager Datanode Datanode Block 1 Block 2 Block 3 Container - 2 Container - 3 Master Roles Namenode
  • 6. Worker Node 1 Worker Node 2 Worker Node 3 Input File Hadoop/YARN – in Virtual Machines ResourceManagerJob Datanode Nodemanager AppMaster - 1 Nodemanager Nodemanager Datanode Datanode Block 1 Block 2 Block 3 Container - 2 Container - 3 Namenode Master Roles
  • 7. High Level View of Spark
  • 8. Worker Node 1 Worker Node 2 Worker Node 3 Spark Standalone - Virtualized Driver Job Executor JVM Executor Executor JVM JVM Executor JVM Executor JVM Executor JVM Virtual Machine
  • 9. NodemanagerNodemanagerNodemanager Worker Node 1 Worker Node 2 Worker Node 3 The Spark Architecture (on YARN) Job Datanode AppMaster - 1 Datanode Datanode Block 1 Block 2 Block 3 Container - 2 Container - 3 Namenode Driver Executor Executor Resourcemanager
  • 10. Virtualizing Spark in the Private Cloud and Public Cloud Using the Same Infrastructure
  • 11. High Level View of Spark
  • 12. Deploying Spark in the Private Cloud Source: AWS presentation at VMworld 2017 ENI Compute Tier- Spark Virtual Machine Storage Tier- choice of NFS or HDFS or others for data handling File1 File2 Enterprise Storage System
  • 13. ENI NFS, GlusterFS or HDFS Hadoop Virtual Node 1 Clustered File Services Hadoop Virtual Node 1 vSAN On-Premises vSphere Application Cluster 1 – Model Development Hadoop Virtual Node 1 vSAN On-premises vSphere Application Cluster 2 – Model Testing/Staging Hadoop Virtual Node 1 vSAN On-Premises vSphere Application Cluster 3 - Production 2. Store trained M L m odel 1. Read training data 3. Read trained model and execute on test data 4. Read trained model and execute on production data Sharing ML Models and Data using Shared Storage
  • 14. Big Data - Storage Evolution Hadoop Virtual Node 1 Input Application HDFS Application Output Input Backup Restore Output Copy HDFS HDFS Application tmpInput Hadoop Virtual Node 1 Source: Steve Loughran, Hadoop Summit 2016 presentation Output Cloud Storage Cloud Storage
  • 15. Evolving Cloud Storage for Big Data Output Local Storage Application Input Hadoop Virtual Node 1 Source: Steve Loughran, Hadoop Summit presentation S3 Caching – S3 for Big Data • S3 is an Object storage system rather than a file system • Moving existing HDFS data to S3 requires care (file-> object mapping) • S3 is eventually consistent • Guard mechanisms like S3Guard to ensure file consistency • Caching locally to improve performance of storage access
  • 16. Deploying Spark in the Public Cloud Source: AWS presentation at VMworld 2017 S3 Object Store ENI Other AWS Services Compute Tier- Spark Virtual Machine Storage Tier- Cloud Storage Object1 Object2
  • 17. VMware Cloud on AWS Powered by VMware Cloud Foundation Customer Data Center vSphere vSAN NSX Operational Management Native AWS Services … … … … vCentervCenter VMware Cloud on AWS • ESXi on Dedicated Hardware • Support for VMs and Containers • vSAN on Flash and EBS Storage • Replication and DR Orchestration • NSX Spanning on- premises and cloud • Advanced Networking, and Security Services AWS Global Infrastructure Amazon EC2 Amazon S3 Amazon RDS AWS Direct Connect Amazon DynamoDB Amazon Redshift
  • 18. Sold as a Service § VMware manages hypervisor and management components § AWS manages physical resources § Customer manages VMs § Customer decides how many VMs to run on vSphere
  • 19. VMware Cloud on AWS : Integration to AWS Services Hadoop Virtual Node 1 Customer-Owned AWS Account Source: AWS presentation at VMworld 2017 VMware Cloud on AWS Account Hadoop Virtual Node 1 Logical Network 172.31.20.0/24 VM VM CGW SDDC NSX VSAN ESXi Hadoop Virtual Node 1 VPC AWS Hadoop Virtual Node 1 EC2 Instances Hadoop Virtual Node 1 EC2 Instances S3 VPC EndPoint ENI Other AWS Services VPC Subnet – 10.1.1.0/24 VPC Subnet – 10.1.2.0/24 Org Elastic Network Interface Network Backbone Compute Gateway
  • 20. ENI AWS Hadoop Virtual Node 1 EC2 Instances S3 Other AWS Services Hadoop Virtual Node 1 vSAN VMware Cloud on AWS Application Cluster 1 – Model Development Hadoop Virtual Node 1 vSAN VMware Cloud on AWS Application Cluster 2 – Testing/Staging Hadoop Virtual Node 1 vSAN On-premises vSphere Application Cluster 3 - Production 2. Store trained M L m odel 1. Read training data 3. Read trained model and execute on test data 4. Read trained model and execute on production data Sharing ML models and Data using Cloud Storage
  • 21. Spark Workers in Docker Containers on vSphere 21
  • 23. Spark Random Forest Performance #VIRT1445BU CONFIDENTIAL 23
  • 24. Spark Logistic Regression Performance #VIRT1445BU CONFIDENTIAL 24
  • 25. 1 TB RAM on Server Each NUMA Node has 1024/2 512GB 482 GB RAM for each VM NUMA and Virtual Machine Placement
  • 26. Virtualizing Spark - conclusions Agility • Infrastructure on demand • Sharing of physical resources – not dedicated clusters Simplified Management • Centralized data center management • Apply virtualization best practices Efficiency • Resource pooling • Server and cluster consolidation Performance • Equal to, or better performance than native Hadoop • No significant overhead 26 vSphere and VMware Cloud on AWS
  • 27. Thank You. Contact jmurray@vmware.com or bigdata@vmware.com
  • 28. Virtualization Host Server VMDK Hadoop Node 1 Virtual Machine Datanode Ext4 Nodemanager Ext4 Ext4 Ext4 Six or More Local DAS disks per Virtual Machine VMDK VMDK VMDK VMDK VMDK VMDK VMDK Hadoop Node 2 Virtual Machine Datanode Ext4 Nodemanager Ext4 Ext4 Ext4Ext4 VMDKVMDK VMDKVMDK Ext4Ext4Ext4 Combined Model: Two Virtual Machines on a Host
  • 29. #1 Reference Architecture from Cloudera
  • 31. Workloads - Spark • Two standard analytic programs from the Spark MLLib (Machine Learning Library) • Driven using SparkBench (https://github.com/SparkTC/spark-bench) – Support Vector Machine – Logistic Regression CONFIDENTIAL 31
  • 32. Spark Support Vector Machine Performance CONFIDENTIAL 32
  • 34. Results - Spark •Support Vector Machines workload, which stayed in memory, ran about 10% faster in virtualized form than on bare metal •Logistic Regression workload, which was written to disk at the larger dataset sizes, showed a slight advantage to bare metal •part of the dataset was cached to disk, •larger memory of the bare metal Spark executors may help •Both workloads showed linear scaling from 5 to 10 hosts and as dataset size increased CONFIDENTIAL 34
  • 35. §Spark workloads work very well on VMware vSphere • Various performance studies have shown that any difference between virtualized performance and native performance is minimal • Follow the general best practice guidelines that VMware has published • Design patterns such as data-compute separation can be used to provide elasticity of your Spark cluster. Conclusions
  • 36. Add Slides as Necessary • Supporting points go here.