SlideShare a Scribd company logo
Spark Autotuning
Lawrence Spracklen
Alpine Data
Overview
•  Motivation
•  Spark Autotuning
•  Future enhancements
Motivation
We use Spark
•  End-2-end support
–  Problem inception through model operationalization
•  Extensive use of Spark at all stages
–  ETL
–  Feature engineering
–  Model training
–  Model scoring
Alpine UI
Spark Configuration
Data scientists must set:
•  Size of driver
•  Size of executors
•  Number of executors
•  Number of partitions
Inefficient
•  Manual Spark tuning is a time consuming and inefficient
process
–  Frequently results in “trial-and-error” approach
–  Can be hours (or more) before OOM fails occur
•  Less problematic for unchanging operationalized jobs
–  Run same analysis every hour/day/week on new data
–  Worth spending time & resources to fine tune
AutoML
Data Set
Alg #1
Alg #2
Alg #3
Alg #N
Alg #1
Alg #N
1)Investigate N
ML algorithms
2) Tune top
performing
algorithms
Feature
engineering
Alg #2
Alg #1
Alg #N
2) Feature
elimination
BI Integration
•  Sometimes there is no data scientist…
–  ML behind the scenes deep in the stack
Example Algorithm
Spark Architecture
•  Driver can be client or cluster resident
It depends….
•  Size & complexity of data
•  Complexity of algorithm(s)
•  Nuances of the implementation
•  Cluster size
•  YARN configuration
•  Cluster utilization
Default settings?
•  Spark defaults are targeted at toy data sets & small
clusters
–  Not applicable to real-world data science
•  Resource computations often assume a dedicated
cluster
–  Looking at total vCPU and memory resources
•  Enterprise clusters are typically shared
–  Applying dedicated computations is wasteful
Spark Overheads
https://0x0fff.com/spark-memory-management/
Challenges
•  Algorithm needs to
–  Determine compute requirements for current analysis
–  Reconcile requirements with available resources
–  Allow users to set specific variables
•  Changing one parameter will generally impact others
1. Increase memory per executor
èDecrease number of executors
2. Increase cores per executor
èDecrease memory per core
[N.B. Not just a single “correct” configuration!!!]
Understanding the inputs
•  Sample from input data set(s)
•  Estimate
–  Schema inference
–  Dimensionality
–  Cardinality
–  Row count
•  Remember about compressed formats
•  Offline background scans feasible
–  Retain snapshots in metastore
•  Take into account impact of downstream operations
Consider the entire flow
NRV Norm LoR
NRV PCA LoR
NRV 1-Hot LoR
csv
csv
csv
Algorithmic complexity
•  Each operation has unique requirements
–  Wide range in memory and compute requirements across
algorithms
•  Provide levers for algorithms to provide input on
resource asks
–  Resource heavy or resource light (Memory & CPU)
•  Per algorithm modifiers computed for all algorithms in
Application toolbox
Wrapping OSS
•  Autotuning APIs are made available to user & OSS
algorithms wrapped using our extensions SDK
–  H20
–  MLlib
–  Etc.
•  Provide input on relative memory requirements and
computational complexity
YARN Queues
•  Queues frequently used in enterprise settings
–  Control resource utilization
•  Variety of queuing strategies
–  Users bound to queues
–  Groups bound to queues
–  Applications bound to queues
•  Queues can have own maximum container sizes
•  Queues can have different scheduling polices
Dynamic allocation
•  Spark provides support for auto-scaling
–  Dynamically add/remove executors according to demand
•  Is preemption also enabled?
–  Minimize wait time for other users
•  Still need to determine optimal executor and driver
sizing
Query cluster
•  Query YARN Resource Manager to get cluster state
–  Kerberised REST API
•  Determine key resource considerations
–  Preemption enabled?
–  Queue mapping
–  Queue type
–  Queue resources
–  Real-time node utilization
Executor Sizing
•  Try to make efficient use of the available resources
–  Start with a balanced strategy
•  Consume resources based on their relative abundance
MemPerCore= QueueMem
QueueCores *MemHungry
CoresPerExecutor = min(MaxUsableMemoryPerContainer
MemoryPerCore ,5)
MemPerExecutor = max(1GB,MemPerCore*CoresPerExecutor)
Number of executors (1/2)
•  Determine how many executors can be accommodated
per node
•  Consider queue constraints
•  Consider testing per node with different core values
AvailableExecutors = Min( AvailableNodeMemory
MemoryPerExecutor+Overheads
n=0
ActiveNodes
∑ , AvailableNodeCores
CoresPerExecutor )
UseableExecutors = Min(AvailableExecutors, QueueMemory
MemoryPerExecutor , QueueCores
CoresPerExecutor )
Number of executors (2/2)
•  Compute executor count required for memory
requirements
•  Determine final executor count
NeededExecutors = CacheSize
SparkMemPerExecutor*SparkStorageFraction
FinalExecutors = min(UseableExecutors, NeededExecutors*ComputeHungry)
Data Partitioning
•  Minimally want a partition per executor core
•  Increase number of partitions if memory constrained
MinPartitions = NumExecutors*CoresPerExecutor
MemPerTask = SparkMemPerExecutor*(1−SparkStorageFraction)
CoresPerExecutor
MemPartitions = CacheSize
MemPerTask
FinalPartitions = max(MinPartitions,MemPartitions)
Driver sizing
•  Leverage executor size as baseline
•  Increase driver memory if analysis requires
–  Some algorithms very driver memory heavy
•  Reduce executor count if needed to reflect
presence of cluster-side driver
Sanity checks
•  Make sure none of the chosen parameters look
crazy!
–  Massive partition counts
–  Crazy resourcing for small files
–  Complex jobs squeezed for overwhelmed clusters
•  Think about edge cases
•  Iterate if necessary!
Future Improvements
Future opportunities
Cloud Scaling
•  Autoscale cloud EMR clusters
–  Leverage understanding of required resources to dynamically
manage the number of compute nodes
–  Significant performance benefits and cost savings
Iterative improvements
•  Retain decisions from prior runs in metastore
•  Harvest additional sizing and runtime information
–  Spark REST APIs
•  Leverage data to fine tune future runs
Conclusions
•  Enterprise clusters are noisy, shared environments
•  Configuring Spark can be labor intensive & error prone
–  Significant trial and error process
•  Important to automate the Spark configuration process
–  Remove requirement for data scientists to understand Spark and
cluster configuration details
•  Software automatically reconciles analysis requirements
with available resources
•  Even simple estimators can do a good job
More details?
•  Too little time…
–  Will blog more details, example resourcing and scala
code fragments
alpinedata.com/blog
Acknowledgements
•  Rachel Warren – Lead Developer
Thank You.
Questions? Come by Booth K3
lawrence@alpinenow.com
@spracklen

More Related Content

What's hot

Spark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca Canali
Spark Summit
 
Apache Spark & MLlib
Apache Spark & MLlibApache Spark & MLlib
Apache Spark & MLlib
Grigory Sapunov
 
Reactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark StreamingReactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark Streaming
Spark Summit
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
Jen Aman
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Jen Aman
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Summit
 
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Databricks
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on Mesos
Jen Aman
 
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd MostakLeveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
Databricks
 
Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...
Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...
Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...
Spark Summit
 
Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)
Ilya Ganelin
 
Feature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick PentreathFeature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick Pentreath
Spark Summit
 
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenRandom Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Databricks
 
Cleveland HUG - Storm
Cleveland HUG - StormCleveland HUG - Storm
Cleveland HUG - Storm
justinjleet
 
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Spark Summit
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Spark Summit
 
Spark Summit EU talk by Mikhail Semeniuk Hollin Wilkins
Spark Summit EU talk by Mikhail Semeniuk Hollin WilkinsSpark Summit EU talk by Mikhail Semeniuk Hollin Wilkins
Spark Summit EU talk by Mikhail Semeniuk Hollin Wilkins
Spark Summit
 
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungScalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Spark Summit
 
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production ScaleGPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
sparktc
 

What's hot (19)

Spark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca Canali
 
Apache Spark & MLlib
Apache Spark & MLlibApache Spark & MLlib
Apache Spark & MLlib
 
Reactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark StreamingReactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark Streaming
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence SpracklenSpark Autotuning: Spark Summit East talk by Lawrence Spracklen
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen
 
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
Machine Learning as a Service: Apache Spark MLlib Enrichment and Web-Based Co...
 
Spark on Mesos
Spark on MesosSpark on Mesos
Spark on Mesos
 
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd MostakLeveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
Leveraging GPU-Accelerated Analytics on top of Apache Spark with Todd Mostak
 
Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...
Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...
Supporting Highly Multitenant Spark Notebook Workloads with Craig Ingram and ...
 
Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)Stream Computing (The Engineer's Perspective)
Stream Computing (The Engineer's Perspective)
 
Feature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick PentreathFeature Hashing for Scalable Machine Learning with Nick Pentreath
Feature Hashing for Scalable Machine Learning with Nick Pentreath
 
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min ShenRandom Walks on Large Scale Graphs with Apache Spark with Min Shen
Random Walks on Large Scale Graphs with Apache Spark with Min Shen
 
Cleveland HUG - Storm
Cleveland HUG - StormCleveland HUG - Storm
Cleveland HUG - Storm
 
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
 
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
Monitoring the Dynamic Resource Usage of Scala and Python Spark Jobs in Yarn:...
 
Spark Summit EU talk by Mikhail Semeniuk Hollin Wilkins
Spark Summit EU talk by Mikhail Semeniuk Hollin WilkinsSpark Summit EU talk by Mikhail Semeniuk Hollin Wilkins
Spark Summit EU talk by Mikhail Semeniuk Hollin Wilkins
 
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix CheungScalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung
 
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production ScaleGPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
GPU Support in Spark and GPU/CPU Mixed Resource Scheduling at Production Scale
 

Similar to Spark Autotuning - Spark Summit East 2017

Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
Oh Chan Kwon
 
Spark cep
Spark cepSpark cep
Spark cep
Byungjin Kim
 
How to create innovative architecture using VisualSim?
How to create innovative architecture using VisualSim?How to create innovative architecture using VisualSim?
How to create innovative architecture using VisualSim?
Deepak Shankar
 
How to create innovative architecture using ViualSim?
How to create innovative architecture using ViualSim?How to create innovative architecture using ViualSim?
How to create innovative architecture using ViualSim?
Deepak Shankar
 
How to create innovative architecture using VisualSim?
How to create innovative architecture using VisualSim?How to create innovative architecture using VisualSim?
How to create innovative architecture using VisualSim?
Deepak Shankar
 
How to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache SparkHow to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache Spark
Databricks
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
Adarsh Pannu
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
MongoDB
 
Project Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxProject Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptx
AkshitAgiwal1
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
Stavros Kontopoulos
 
Spark 1.0
Spark 1.0Spark 1.0
Spark 1.0
Jatin Arora
 
Visual Studio 2013 Profiling
Visual Studio 2013 ProfilingVisual Studio 2013 Profiling
Visual Studio 2013 ProfilingDenis Dudaev
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Spark Summit
 
Devnexus 2018
Devnexus 2018Devnexus 2018
Devnexus 2018
Roy Russo
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Spark Summit
 
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Spark Summit
 
Unit 2 processor&memory-organisation
Unit 2 processor&memory-organisationUnit 2 processor&memory-organisation
Unit 2 processor&memory-organisation
Pavithra S
 
Unit 1 processormemoryorganisation
Unit 1 processormemoryorganisationUnit 1 processormemoryorganisation
Unit 1 processormemoryorganisation
Karunamoorthy B
 
참여기관_발표자료-국민대학교 201301 정기회의
참여기관_발표자료-국민대학교 201301 정기회의참여기관_발표자료-국민대학교 201301 정기회의
참여기관_발표자료-국민대학교 201301 정기회의DzH QWuynh
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Databricks
 

Similar to Spark Autotuning - Spark Summit East 2017 (20)

Extending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event ProcessingExtending Spark Streaming to Support Complex Event Processing
Extending Spark Streaming to Support Complex Event Processing
 
Spark cep
Spark cepSpark cep
Spark cep
 
How to create innovative architecture using VisualSim?
How to create innovative architecture using VisualSim?How to create innovative architecture using VisualSim?
How to create innovative architecture using VisualSim?
 
How to create innovative architecture using ViualSim?
How to create innovative architecture using ViualSim?How to create innovative architecture using ViualSim?
How to create innovative architecture using ViualSim?
 
How to create innovative architecture using VisualSim?
How to create innovative architecture using VisualSim?How to create innovative architecture using VisualSim?
How to create innovative architecture using VisualSim?
 
How to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache SparkHow to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache Spark
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
Hardware Provisioning
Hardware ProvisioningHardware Provisioning
Hardware Provisioning
 
Project Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxProject Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptx
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
Spark 1.0
Spark 1.0Spark 1.0
Spark 1.0
 
Visual Studio 2013 Profiling
Visual Studio 2013 ProfilingVisual Studio 2013 Profiling
Visual Studio 2013 Profiling
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
 
Devnexus 2018
Devnexus 2018Devnexus 2018
Devnexus 2018
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
Migrating Complex Data Aggregation from Hadoop to Spark-(Ashish Singh andPune...
 
Unit 2 processor&memory-organisation
Unit 2 processor&memory-organisationUnit 2 processor&memory-organisation
Unit 2 processor&memory-organisation
 
Unit 1 processormemoryorganisation
Unit 1 processormemoryorganisationUnit 1 processormemoryorganisation
Unit 1 processormemoryorganisation
 
참여기관_발표자료-국민대학교 201301 정기회의
참여기관_발표자료-국민대학교 201301 정기회의참여기관_발표자료-국민대학교 201301 정기회의
참여기관_발표자료-국민대학교 201301 정기회의
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 

More from Alpine Data

Big Data Day LA 2017
Big Data Day LA 2017Big Data Day LA 2017
Big Data Day LA 2017
Alpine Data
 
Operationalizing Data Science using Cloud Foundry
Operationalizing Data Science using Cloud FoundryOperationalizing Data Science using Cloud Foundry
Operationalizing Data Science using Cloud Foundry
Alpine Data
 
Think Like Spark
Think Like SparkThink Like Spark
Think Like Spark
Alpine Data
 
UC Berkeley Data Science Webinar
UC Berkeley Data Science WebinarUC Berkeley Data Science Webinar
UC Berkeley Data Science Webinar
Alpine Data
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
Alpine Data
 
Spark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System AdministratorsSpark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System Administrators
Alpine Data
 
Real Time Visualization with Spark
Real Time Visualization with SparkReal Time Visualization with Spark
Real Time Visualization with Spark
Alpine Data
 
Harnessing Big Data with Spark
Harnessing Big Data with SparkHarnessing Big Data with Spark
Harnessing Big Data with Spark
Alpine Data
 

More from Alpine Data (8)

Big Data Day LA 2017
Big Data Day LA 2017Big Data Day LA 2017
Big Data Day LA 2017
 
Operationalizing Data Science using Cloud Foundry
Operationalizing Data Science using Cloud FoundryOperationalizing Data Science using Cloud Foundry
Operationalizing Data Science using Cloud Foundry
 
Think Like Spark
Think Like SparkThink Like Spark
Think Like Spark
 
UC Berkeley Data Science Webinar
UC Berkeley Data Science WebinarUC Berkeley Data Science Webinar
UC Berkeley Data Science Webinar
 
Enterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using SparkEnterprise Scale Topological Data Analysis Using Spark
Enterprise Scale Topological Data Analysis Using Spark
 
Spark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System AdministratorsSpark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System Administrators
 
Real Time Visualization with Spark
Real Time Visualization with SparkReal Time Visualization with Spark
Real Time Visualization with Spark
 
Harnessing Big Data with Spark
Harnessing Big Data with SparkHarnessing Big Data with Spark
Harnessing Big Data with Spark
 

Recently uploaded

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 

Recently uploaded (20)

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 

Spark Autotuning - Spark Summit East 2017

  • 2. Overview •  Motivation •  Spark Autotuning •  Future enhancements
  • 4. We use Spark •  End-2-end support –  Problem inception through model operationalization •  Extensive use of Spark at all stages –  ETL –  Feature engineering –  Model training –  Model scoring
  • 6. Spark Configuration Data scientists must set: •  Size of driver •  Size of executors •  Number of executors •  Number of partitions
  • 7. Inefficient •  Manual Spark tuning is a time consuming and inefficient process –  Frequently results in “trial-and-error” approach –  Can be hours (or more) before OOM fails occur •  Less problematic for unchanging operationalized jobs –  Run same analysis every hour/day/week on new data –  Worth spending time & resources to fine tune
  • 8. AutoML Data Set Alg #1 Alg #2 Alg #3 Alg #N Alg #1 Alg #N 1)Investigate N ML algorithms 2) Tune top performing algorithms Feature engineering Alg #2 Alg #1 Alg #N 2) Feature elimination
  • 9. BI Integration •  Sometimes there is no data scientist… –  ML behind the scenes deep in the stack
  • 11. Spark Architecture •  Driver can be client or cluster resident
  • 12. It depends…. •  Size & complexity of data •  Complexity of algorithm(s) •  Nuances of the implementation •  Cluster size •  YARN configuration •  Cluster utilization
  • 13. Default settings? •  Spark defaults are targeted at toy data sets & small clusters –  Not applicable to real-world data science •  Resource computations often assume a dedicated cluster –  Looking at total vCPU and memory resources •  Enterprise clusters are typically shared –  Applying dedicated computations is wasteful
  • 15. Challenges •  Algorithm needs to –  Determine compute requirements for current analysis –  Reconcile requirements with available resources –  Allow users to set specific variables •  Changing one parameter will generally impact others 1. Increase memory per executor èDecrease number of executors 2. Increase cores per executor èDecrease memory per core [N.B. Not just a single “correct” configuration!!!]
  • 16. Understanding the inputs •  Sample from input data set(s) •  Estimate –  Schema inference –  Dimensionality –  Cardinality –  Row count •  Remember about compressed formats •  Offline background scans feasible –  Retain snapshots in metastore •  Take into account impact of downstream operations
  • 17. Consider the entire flow NRV Norm LoR NRV PCA LoR NRV 1-Hot LoR csv csv csv
  • 18. Algorithmic complexity •  Each operation has unique requirements –  Wide range in memory and compute requirements across algorithms •  Provide levers for algorithms to provide input on resource asks –  Resource heavy or resource light (Memory & CPU) •  Per algorithm modifiers computed for all algorithms in Application toolbox
  • 19. Wrapping OSS •  Autotuning APIs are made available to user & OSS algorithms wrapped using our extensions SDK –  H20 –  MLlib –  Etc. •  Provide input on relative memory requirements and computational complexity
  • 20. YARN Queues •  Queues frequently used in enterprise settings –  Control resource utilization •  Variety of queuing strategies –  Users bound to queues –  Groups bound to queues –  Applications bound to queues •  Queues can have own maximum container sizes •  Queues can have different scheduling polices
  • 21. Dynamic allocation •  Spark provides support for auto-scaling –  Dynamically add/remove executors according to demand •  Is preemption also enabled? –  Minimize wait time for other users •  Still need to determine optimal executor and driver sizing
  • 22. Query cluster •  Query YARN Resource Manager to get cluster state –  Kerberised REST API •  Determine key resource considerations –  Preemption enabled? –  Queue mapping –  Queue type –  Queue resources –  Real-time node utilization
  • 23. Executor Sizing •  Try to make efficient use of the available resources –  Start with a balanced strategy •  Consume resources based on their relative abundance MemPerCore= QueueMem QueueCores *MemHungry CoresPerExecutor = min(MaxUsableMemoryPerContainer MemoryPerCore ,5) MemPerExecutor = max(1GB,MemPerCore*CoresPerExecutor)
  • 24. Number of executors (1/2) •  Determine how many executors can be accommodated per node •  Consider queue constraints •  Consider testing per node with different core values AvailableExecutors = Min( AvailableNodeMemory MemoryPerExecutor+Overheads n=0 ActiveNodes ∑ , AvailableNodeCores CoresPerExecutor ) UseableExecutors = Min(AvailableExecutors, QueueMemory MemoryPerExecutor , QueueCores CoresPerExecutor )
  • 25. Number of executors (2/2) •  Compute executor count required for memory requirements •  Determine final executor count NeededExecutors = CacheSize SparkMemPerExecutor*SparkStorageFraction FinalExecutors = min(UseableExecutors, NeededExecutors*ComputeHungry)
  • 26. Data Partitioning •  Minimally want a partition per executor core •  Increase number of partitions if memory constrained MinPartitions = NumExecutors*CoresPerExecutor MemPerTask = SparkMemPerExecutor*(1−SparkStorageFraction) CoresPerExecutor MemPartitions = CacheSize MemPerTask FinalPartitions = max(MinPartitions,MemPartitions)
  • 27. Driver sizing •  Leverage executor size as baseline •  Increase driver memory if analysis requires –  Some algorithms very driver memory heavy •  Reduce executor count if needed to reflect presence of cluster-side driver
  • 28. Sanity checks •  Make sure none of the chosen parameters look crazy! –  Massive partition counts –  Crazy resourcing for small files –  Complex jobs squeezed for overwhelmed clusters •  Think about edge cases •  Iterate if necessary!
  • 30. Future opportunities Cloud Scaling •  Autoscale cloud EMR clusters –  Leverage understanding of required resources to dynamically manage the number of compute nodes –  Significant performance benefits and cost savings Iterative improvements •  Retain decisions from prior runs in metastore •  Harvest additional sizing and runtime information –  Spark REST APIs •  Leverage data to fine tune future runs
  • 31. Conclusions •  Enterprise clusters are noisy, shared environments •  Configuring Spark can be labor intensive & error prone –  Significant trial and error process •  Important to automate the Spark configuration process –  Remove requirement for data scientists to understand Spark and cluster configuration details •  Software automatically reconciles analysis requirements with available resources •  Even simple estimators can do a good job
  • 32. More details? •  Too little time… –  Will blog more details, example resourcing and scala code fragments alpinedata.com/blog
  • 34. Thank You. Questions? Come by Booth K3 lawrence@alpinenow.com @spracklen