SlideShare a Scribd company logo
1 of 52
Download to read offline
Re-Architecting Apache Spark for
Performance Understandability
Kay Ousterhout
Joint work with Christopher Canel, Max Wolffe,
Sylvia Ratnasamy, Scott Shenker
About Me
PhD candidate at UC Berkeley
Thesis work on performance of large-scale distributed
systems
Apache Spark PMC member
About this talk
Future architecture for systems like Spark
Implementation is API-compatible with Spark
Major change to Spark’s internals (~20K lines of code)
Spark cluster is a black box, runs the job fast
rdd.groupBy(…)…
Idealistic view: Spark cluster is a black box, runs the job fast
rdd.groupBy(…)…
Idealistic view: Spark cluster is a black box, runs the job fast
rdd.groupBy(…)…
rdd.groupBy(…)…
rdd.reduceByKey(…)…	
Configuration:	
spark.serializer	KryoSerializer	
spark.executor.cores	8	
Idealistic view: Spark cluster is a black box, runs the job fast
rdd.groupBy(…)…
rdd.reduceByKey(…)…	
Realistic view: user uses performance characteristics to tune
job, configuration, hardware, etc.
Configuration:	
spark.serializer	KryoSerializer	
spark.executor.cores	8
rdd.groupBy(…)…
rdd.reduceByKey(…)…	
Realistic view: user uses performance characteristics to tune
job, configuration, hardware, etc.
Configuration:	
spark.serializer	KryoSerializer	
spark.executor.cores	8	
Users need to be able to reason about
performance
Reasoning about Spark Performance
Widely accepted that network and disk I/O are bottlenecks
CPU (not I/O) typically the bottleneck
network optimizations can improve job completion time by at most 2%
Reasoning about Spark Performance
Spark Summit 2015:
CPU (not I/O) often
the bottleneck
Project Tungsten:
initiative to optimize Spark’s CPU use,
driven in part by our measurements
Reasoning about Spark Performance
Spark Summit 2015:
CPU (not I/O) often
the bottleneck
Project Tungsten:
initiative to optimize
Spark’s CPU use
Spark 2.0:
Some evidence that I/O
is again the bottleneck
[HotCloud ’16]
Users need to understand performance to extract the best
runtimes
Reasoning about performance is currently difficult
Software and hardware are constantly evolving,
so performance is always in flux
Vision: jobs report high-level performance metrics
Vision: jobs report high-level performance metrics
Vision: jobs report high-level performance metrics
Vision: jobs report high-level performance metrics
How can we achieve this vision?
Spark overview
Reasoning about Spark’s performance: why it’s hard
New architecture: monotasks
Reasoning about monotasks performance: why it’s easy
Monotasks in action (results)
Example Spark Job:
Read remote data
Filter records
Write result to disk
Task 1:
Read and filter block 1
write result to disk
Task 2:
Read and filter block 2
write result to disk
Task 3:
Read and filter block 3
write result to disk
Task 4:
Read and filter block 4
write result to disk
Task 5:
Read and filter block 5
write result to disk
Task 6:
Read and filter block 6
write result to disk
Task n:
Read and filter block n
write result to disk
…
Example Spark Job:
Read remote data
Filter records
Write result to disk
time
Task 1 Task 9
Task 2
Task 3
Task 4
Task 11
Task 10
Task 12
Worker 1
Task 5 Task 13
Task 6
Task 7
Task 8
Task 15
Task 14
Task 16
Worker 2
…	
Fixed number
of “slots”
Task 17
Task 18
Task 19
Task 1 Task 9
Task 2
Task 3
Task 4
Task 11
Task 10
Task 12
Worker 1
Task 5 Task 13
Task 6
Task 7
Task 8
Task 15
Task 14
Task 16
Worker 2
…
Task 17
Task 18
Task 19
Task 1:
Read block 1, filter, write result to disk
Task 1
time
Network read
CPU (filter)
Disk write
How can we achieve this vision?
Spark overview
Reasoning about Spark’s performance: why it’s hard
New architecture: monotasks
Reasoning about monotasks performance: why it’s easy
Monotasks in action (results)
Task 18
Task 19
Task 1
Network read
CPU (filter)
Disk write
Challenge 1: Task pipelines multiple resources,
bottlenecked on different resources at different times
Task
bottlenecked
on disk write
Task
bottlenecked
on network
read
Task 1
Task 2
Task 9
Task 3
Task 4
Task 11
Task 10
Task 12
time
4 concurrent tasks
on Worker 1
Task 1
Task 2
Task 9
Task 3
Task 4
Task 11
Task 10
Task 12
time
Challenge 2:
Concurrent tasks may
contend for
the same resource
(e.g., network)
Blocked time analysis: how quickly
could a job have completed if a resource
were infinitely fast? (upper bound)
Spark Summit 2015:
Result of ~1 year of adding metrics to Spark!
Task 1
Task 2
Task 9
Task 3
Task 4
Task 11
Task 10
Task 12
How much faster
would the job be with
2x disk throughput?
How would runtimes for these
disk writes change?
How would that change timing of
(and contention for) other resources?
Challenges to reasoning about performance
Tasks bottleneck on different resources at different times
Concurrent tasks on a machine may contend for resources
No model for performance
How can we achieve this vision?
Spark overview
Reasoning about Spark’s performance: why it’s hard
New architecture: monotasks
Reasoning about monotasks performance: why it’s easy
Monotasks in action (results)
Spark:
Tasks
bottleneck on
different
resources
Concurrent
tasks may
contend
No model for
performance
Spark:
Tasks
bottleneck on
different
resources
Concurrent
tasks may
contend
No model for
performance
Monotasks: Each task uses one resource
Example Spark Job:
Read remote data
Filter records
Write result to disk
Network monotasks:
Each read one remote block
…	
CPU monotasks:
Each filter one block,
generate serialized output
…	
Disk monotasks:
Each writes one block to disk
…
Spark:
Tasks
bottleneck on
different
resources
Concurrent
tasks may
contend
No model for
performance
Monotasks:
Each task uses
one resource
Dedicated schedulers control contention
…	…	…	
Network
scheduler
time
CPU
scheduler
…	
…	
Disk
scheduler
…	
One task per CPU core
One task per disk
Spark:
Tasks
bottleneck on
different
resources
Concurrent
tasks may
contend
No model for
performance
Monotasks:
Each task uses
one resource
Monotask times can be used to model
performance
Dedicated
schedulers control
contention
Ideal CPU time: total CPU monotask time / # CPU cores
Spark:
Tasks
bottleneck on
different
resources
Concurrent
tasks may
contend
No model for
performance
Monotasks:
Each task uses
one resource
Monotask times can be used to model
performance
Dedicated
schedulers control
contention
Ideal CPU time:
total CPU monotask
time / # CPU cores
Ideal network runtime
Ideal disk runtime
Job runtime:
max of ideal
times
Spark:
Tasks
bottleneck on
different
resources
Concurrent
tasks may
contend
No model for
performance
Monotasks:
Each task uses
one resource
How much faster would the job be with 2x
disk throughput?
Dedicated
schedulers control
contention
Ideal CPU time:
total CPU monotask
time / # CPU cores
Ideal network runtime
Ideal disk runtime
Monotask times can
be used to model
performance
Spark:
Tasks
bottleneck on
different
resources
Concurrent
tasks may
contend
No model for
performance
Monotasks:
Each task uses
one resource
How much faster would the job be with 2x
disk throughput?
Dedicated
schedulers control
contention
Ideal CPU time:
total CPU monotask
time / # CPU cores
Ideal network runtime
Ideal disk runtime
(2x disk concurrency)
Monotask times can
be used to model
performance
New job runtime
Spark:
Tasks
bottleneck on
different
resources
Concurrent
tasks may
contend
No model for
performance
Monotasks:
Each task uses
one resource
Dedicated
schedulers control
contention
Monotask times can
be used to model
performance
Task 1
Task 2
Task 9
Task 3
Task 4
Task 11
Task 10
Task 12
…	
…	
…	
Compute monotasks:
one task per CPU core
Disk monotasks: one
per disk
Network monotasks
How does this decomposition work?
How does this decomposition work?
Task 1
Network read
CPU
Disk write
Monotasks
Network
monotask
Disk monotaskCompute
monotask
Implementation
API-compatible with Apache Spark
Workloads can be run on monotasks without re-compiling
Monotasks decomposition handled by Spark internals
Monotasks works at the application level
No operating system changes
How can we achieve this vision?
Spark overview
Reasoning about Spark’s performance: why it’s hard
New architecture: monotasks
Reasoning about monotasks performance: why it’s easy
Monotasks in action (results)
Performance on-par with Apache Spark
Big data benchmark
(SQL workload)
�
���
���
���
���
���
���
���
���
�� �� �� �� �� �� �� �� �� �
����������
�����
�����
���������
Performance on-par with Apache Spark
Sort
(600 GB, 20 machines)
Block coordinate descent
(Matrix workload used in ML applications)
16 machines
��
���
���
���
���
���
���
���
���
���
����
����
������
��������
����������� ������
��������
�����
�����������������������
�����
���������
��
����
����
����
����
����
����
����
����
����
�����
��� ������ �����
�����������������������
�����
���������
Monotasks in action
Modeling performance
Leveraging performance clarity to optimize performance
How much faster would jobs run if each machine
had 2 disks instead of 1?
��
����
����
����
����
����
����
����
����
����
��������� ��������� ���������
�����������
��������
����������������
���������������������
������������������
Predictions for different hardware
within 10% of the actual runtime
How much faster would job run if data were de-
serialized and in memory?
Eliminates disk time to read input data
Eliminates CPU time to de-serialize data
How much faster would job run if data were de-
serialized and in memory?
Task
18
Spark Task
Network
CPU
Disk write
: (de)serialization time
Measuring (de) serialization time with Spark
(De) serialization pipelined with
other processing for each record
Application-level measurement
incurs high overhead
How much faster would job run if data were de-
serialized and in memory?
: (de)serialization time
Measuring (de) serialization time with Monotasks
Eliminating fine-grained
pipelining enables
measurement!
Original compute monotask
Un-rolled monotask
How much faster would job run if data were de-
serialized and in memory?
��
���
���
���
���
���
���
�����������������������
����������������
���������������������
������������������
Eliminate disk monotask time
Eliminate CPU monotask time
spent (de)serialiazing
Re-run model
Sort
Leveraging Performance Clarity to Improve
Performance
Schedulers have complete
visibility over resource use
Framework can configure for
best performance
…	…	…	
Network
time
CPU
…	
…	
Disk
…
Configuring the number of concurrent tasks
		
��
���
���
���
���
���
���
���
���
������ ������ ������ ������ ����
��������������������������
��������
�����������
����������
Spark with different numbers of
concurrent tasks
Monotasks better than any
configuration:
per-resource schedulers
automatically schedule with the
ideal concurrency
Future Work
Leveraging performance clarity to improve performance
– Use resource queue lengths to dynamically adjust job
Automatically configuring for multi-tenancy
– Don’t need jobs to specify resource requirements
– Can achieve higher utilization: no multi-resource bin packing
Monotasks:
Each task uses
one resource
Dedicated
schedulers control
contention
Monotask times can
be used to model
performance
…	
…	
…	
Compute monotasks:
one task per CPU core
Disk monotasks: one
per disk
Network monotasks
Vision:
Spark always reports
bottleneck information
Challenging with existing
architecture
Interested? Have a job whose performance
you can’t figure out? Email me:
keo@cs.berkeley.edu

More Related Content

What's hot

Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.Shelan Perera
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Spark Summit
 
Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at FacebookDatabricks
 
Spark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Josef A. Habdank
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web ServiceEvan Chan
 
Reactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark StreamingReactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark StreamingSpark Summit
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesJen Aman
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Databricks
 
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Spark Summit
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache SparkWes McKinney
 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckData Con LA
 
Spark Summit EU talk by Jorg Schad
Spark Summit EU talk by Jorg SchadSpark Summit EU talk by Jorg Schad
Spark Summit EU talk by Jorg SchadSpark Summit
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedDatabricks
 
Tracing the Breadcrumbs: Apache Spark Workload Diagnostics
Tracing the Breadcrumbs: Apache Spark Workload DiagnosticsTracing the Breadcrumbs: Apache Spark Workload Diagnostics
Tracing the Breadcrumbs: Apache Spark Workload DiagnosticsDatabricks
 
Spark at Bloomberg: Dynamically Composable Analytics
Spark at Bloomberg:  Dynamically Composable Analytics Spark at Bloomberg:  Dynamically Composable Analytics
Spark at Bloomberg: Dynamically Composable Analytics Jen Aman
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesJen Aman
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangSpark Summit
 

What's hot (20)

Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.Apache Flink vs Apache Spark - Reproducible experiments on cloud.
Apache Flink vs Apache Spark - Reproducible experiments on cloud.
 
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
Towards True Elasticity of Spark-(Michael Le and Min Li, IBM)
 
Spark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni SchieferSpark Summit EU talk by Berni Schiefer
Spark Summit EU talk by Berni Schiefer
 
Scaling Apache Spark at Facebook
Scaling Apache Spark at FacebookScaling Apache Spark at Facebook
Scaling Apache Spark at Facebook
 
Spark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca CanaliSpark Summit EU talk by Luca Canali
Spark Summit EU talk by Luca Canali
 
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
Extreme Apache Spark: how in 3 months we created a pipeline that can process ...
 
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service700 Updatable Queries Per Second: Spark as a Real-Time Web Service
700 Updatable Queries Per Second: Spark as a Real-Time Web Service
 
Reactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark StreamingReactive Streams, Linking Reactive Application To Spark Streaming
Reactive Streams, Linking Reactive Application To Spark Streaming
 
Deep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best PracticesDeep Learning on Apache® Spark™ : Workflows and Best Practices
Deep Learning on Apache® Spark™ : Workflows and Best Practices
 
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric...
 
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
Tagging and Processing Data in Real Time-(Hari Shreedharan and Siddhartha Jai...
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuck
 
Spark Summit EU talk by Jorg Schad
Spark Summit EU talk by Jorg SchadSpark Summit EU talk by Jorg Schad
Spark Summit EU talk by Jorg Schad
 
A Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons LearnedA Journey into Databricks' Pipelines: Journey and Lessons Learned
A Journey into Databricks' Pipelines: Journey and Lessons Learned
 
Tracing the Breadcrumbs: Apache Spark Workload Diagnostics
Tracing the Breadcrumbs: Apache Spark Workload DiagnosticsTracing the Breadcrumbs: Apache Spark Workload Diagnostics
Tracing the Breadcrumbs: Apache Spark Workload Diagnostics
 
Spark at Bloomberg: Dynamically Composable Analytics
Spark at Bloomberg:  Dynamically Composable Analytics Spark at Bloomberg:  Dynamically Composable Analytics
Spark at Bloomberg: Dynamically Composable Analytics
 
Efficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out DatabasesEfficient State Management With Spark 2.0 And Scale-Out Databases
Efficient State Management With Spark 2.0 And Scale-Out Databases
 
Using Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene PangUsing Spark with Tachyon by Gene Pang
Using Spark with Tachyon by Gene Pang
 

Viewers also liked

A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat DetectionJen Aman
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Jen Aman
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkJen Aman
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersJen Aman
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLJen Aman
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsJen Aman
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkJen Aman
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkJen Aman
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development KitJen Aman
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonJen Aman
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousJen Aman
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityJen Aman
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkLivy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkJen Aman
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibJen Aman
 

Viewers also liked (14)

A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
 
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
Massive Simulations In Spark: Distributed Monte Carlo For Global Health Forec...
 
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In SparkYggdrasil: Faster Decision Trees Using Column Partitioning In Spark
Yggdrasil: Faster Decision Trees Using Column Partitioning In Spark
 
Time-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity ClustersTime-Evolving Graph Processing On Commodity Clusters
Time-Evolving Graph Processing On Commodity Clusters
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
 
RISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time DecisionsRISELab:Enabling Intelligent Real-Time Decisions
RISELab:Enabling Intelligent Real-Time Decisions
 
Spatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using SparkSpatial Analysis On Histological Images Using Spark
Spatial Analysis On Histological Images Using Spark
 
EclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache SparkEclairJS = Node.Js + Apache Spark
EclairJS = Node.Js + Apache Spark
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 
GPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And PythonGPU Computing With Apache Spark And Python
GPU Computing With Apache Spark And Python
 
Spark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 FuriousSpark And Cassandra: 2 Fast, 2 Furious
Spark And Cassandra: 2 Fast, 2 Furious
 
Re-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance UnderstandabilityRe-Architecting Spark For Performance Understandability
Re-Architecting Spark For Performance Understandability
 
Livy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache SparkLivy: A REST Web Service For Apache Spark
Livy: A REST Web Service For Apache Spark
 
Elasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlibElasticsearch And Apache Lucene For Apache Spark And MLlib
Elasticsearch And Apache Lucene For Apache Spark And MLlib
 

Similar to Re-Architecting Apache Spark for Performance Understandability

Apache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and PresentApache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and PresentDatabricks
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache SparkDatabricks
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Spark Summit
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hoodAdarsh Pannu
 
Spark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovMaksud Ibrahimov
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with SparkRoger Rafanell Mas
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopDatabricks
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localyticsandrew311
 
Accelerating apache spark with rdma
Accelerating apache spark with rdmaAccelerating apache spark with rdma
Accelerating apache spark with rdmainside-BigData.com
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterDatabricks
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Olalekan Fuad Elesin
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingDemi Ben-Ari
 
Stream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaStream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaItai Yaffe
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...Chris Fregly
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit
 

Similar to Re-Architecting Apache Spark for Performance Understandability (20)

Apache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and PresentApache Spark Performance: Past, Future and Present
Apache Spark Performance: Past, Future and Present
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Spark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan PuSpark Summit EU talk by Qifan Pu
Spark Summit EU talk by Qifan Pu
 
Spark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud IbrahimovSpark performance tuning - Maksud Ibrahimov
Spark performance tuning - Maksud Ibrahimov
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
 
Optimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at LocalyticsOptimizing MongoDB: Lessons Learned at Localytics
Optimizing MongoDB: Lessons Learned at Localytics
 
Accelerating apache spark with rdma
Accelerating apache spark with rdmaAccelerating apache spark with rdma
Accelerating apache spark with rdma
 
How to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOSHow to deploy Apache Spark 
to Mesos/DCOS
How to deploy Apache Spark 
to Mesos/DCOS
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2 Introduction to Apache Spark :: Lagos Scala Meetup session 2
Introduction to Apache Spark :: Lagos Scala Meetup session 2
 
Exploiting GPUs in Spark
Exploiting GPUs in SparkExploiting GPUs in Spark
Exploiting GPUs in Spark
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computing
 
Stream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaStream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and Kafka
 
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...Advanced Apache Spark Meetup:  How Spark Beat Hadoop @ 100 TB Daytona GraySor...
Advanced Apache Spark Meetup: How Spark Beat Hadoop @ 100 TB Daytona GraySor...
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
 

More from Jen Aman

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaJen Aman
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéJen Aman
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesJen Aman
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To ProductionJen Aman
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On SparkJen Aman
 
Scalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduScalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduJen Aman
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersJen Aman
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Jen Aman
 
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...Jen Aman
 
Utilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine LearningUtilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine LearningJen Aman
 

More from Jen Aman (10)

Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei ZahariaDeep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
Deep Learning and Streaming in Apache Spark 2.x with Matei Zaharia
 
Snorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher RéSnorkel: Dark Data and Machine Learning with Christopher Ré
Snorkel: Dark Data and Machine Learning with Christopher Ré
 
Deep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best PracticesDeep Learning on Apache® Spark™: Workflows and Best Practices
Deep Learning on Apache® Spark™: Workflows and Best Practices
 
Spark: Interactive To Production
Spark: Interactive To ProductionSpark: Interactive To Production
Spark: Interactive To Production
 
High-Performance Python On Spark
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On Spark
 
Scalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In BaiduScalable Deep Learning Platform On Spark In Baidu
Scalable Deep Learning Platform On Spark In Baidu
 
Scaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of ParametersScaling Machine Learning To Billions Of Parameters
Scaling Machine Learning To Billions Of Parameters
 
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
Embrace Sparsity At Web Scale: Apache Spark MLlib Algorithms Optimization For...
 
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
Temporal Operators For Spark Streaming And Its Application For Office365 Serv...
 
Utilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine LearningUtilizing Human Data Validation For KPI Analysis And Machine Learning
Utilizing Human Data Validation For KPI Analysis And Machine Learning
 

Recently uploaded

Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...Pooja Nehwal
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknowmakika9823
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一ffjhghh
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 

Recently uploaded (20)

Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...{Pooja:  9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
{Pooja: 9892124323 } Call Girl in Mumbai | Jas Kaur Rate 4500 Free Hotel Del...
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service LucknowAminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
Aminabad Call Girl Agent 9548273370 , Call Girls Service Lucknow
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
E-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptxE-Commerce Order PredictionShraddha Kamble.pptx
E-Commerce Order PredictionShraddha Kamble.pptx
 
Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 

Re-Architecting Apache Spark for Performance Understandability

  • 1. Re-Architecting Apache Spark for Performance Understandability Kay Ousterhout Joint work with Christopher Canel, Max Wolffe, Sylvia Ratnasamy, Scott Shenker
  • 2. About Me PhD candidate at UC Berkeley Thesis work on performance of large-scale distributed systems Apache Spark PMC member
  • 3. About this talk Future architecture for systems like Spark Implementation is API-compatible with Spark Major change to Spark’s internals (~20K lines of code)
  • 4. Spark cluster is a black box, runs the job fast rdd.groupBy(…)…
  • 5. Idealistic view: Spark cluster is a black box, runs the job fast rdd.groupBy(…)…
  • 6. Idealistic view: Spark cluster is a black box, runs the job fast rdd.groupBy(…)…
  • 8. rdd.groupBy(…)… rdd.reduceByKey(…)… Realistic view: user uses performance characteristics to tune job, configuration, hardware, etc. Configuration: spark.serializer KryoSerializer spark.executor.cores 8
  • 9. rdd.groupBy(…)… rdd.reduceByKey(…)… Realistic view: user uses performance characteristics to tune job, configuration, hardware, etc. Configuration: spark.serializer KryoSerializer spark.executor.cores 8 Users need to be able to reason about performance
  • 10. Reasoning about Spark Performance Widely accepted that network and disk I/O are bottlenecks CPU (not I/O) typically the bottleneck network optimizations can improve job completion time by at most 2%
  • 11. Reasoning about Spark Performance Spark Summit 2015: CPU (not I/O) often the bottleneck Project Tungsten: initiative to optimize Spark’s CPU use, driven in part by our measurements
  • 12. Reasoning about Spark Performance Spark Summit 2015: CPU (not I/O) often the bottleneck Project Tungsten: initiative to optimize Spark’s CPU use Spark 2.0: Some evidence that I/O is again the bottleneck [HotCloud ’16]
  • 13. Users need to understand performance to extract the best runtimes Reasoning about performance is currently difficult Software and hardware are constantly evolving, so performance is always in flux
  • 14. Vision: jobs report high-level performance metrics
  • 15. Vision: jobs report high-level performance metrics
  • 16. Vision: jobs report high-level performance metrics
  • 17. Vision: jobs report high-level performance metrics
  • 18. How can we achieve this vision? Spark overview Reasoning about Spark’s performance: why it’s hard New architecture: monotasks Reasoning about monotasks performance: why it’s easy Monotasks in action (results)
  • 19. Example Spark Job: Read remote data Filter records Write result to disk Task 1: Read and filter block 1 write result to disk Task 2: Read and filter block 2 write result to disk Task 3: Read and filter block 3 write result to disk Task 4: Read and filter block 4 write result to disk Task 5: Read and filter block 5 write result to disk Task 6: Read and filter block 6 write result to disk Task n: Read and filter block n write result to disk …
  • 20. Example Spark Job: Read remote data Filter records Write result to disk time Task 1 Task 9 Task 2 Task 3 Task 4 Task 11 Task 10 Task 12 Worker 1 Task 5 Task 13 Task 6 Task 7 Task 8 Task 15 Task 14 Task 16 Worker 2 … Fixed number of “slots” Task 17 Task 18 Task 19
  • 21. Task 1 Task 9 Task 2 Task 3 Task 4 Task 11 Task 10 Task 12 Worker 1 Task 5 Task 13 Task 6 Task 7 Task 8 Task 15 Task 14 Task 16 Worker 2 … Task 17 Task 18 Task 19 Task 1: Read block 1, filter, write result to disk Task 1 time Network read CPU (filter) Disk write
  • 22. How can we achieve this vision? Spark overview Reasoning about Spark’s performance: why it’s hard New architecture: monotasks Reasoning about monotasks performance: why it’s easy Monotasks in action (results)
  • 23. Task 18 Task 19 Task 1 Network read CPU (filter) Disk write Challenge 1: Task pipelines multiple resources, bottlenecked on different resources at different times Task bottlenecked on disk write Task bottlenecked on network read
  • 24. Task 1 Task 2 Task 9 Task 3 Task 4 Task 11 Task 10 Task 12 time 4 concurrent tasks on Worker 1
  • 25. Task 1 Task 2 Task 9 Task 3 Task 4 Task 11 Task 10 Task 12 time Challenge 2: Concurrent tasks may contend for the same resource (e.g., network)
  • 26. Blocked time analysis: how quickly could a job have completed if a resource were infinitely fast? (upper bound) Spark Summit 2015: Result of ~1 year of adding metrics to Spark!
  • 27. Task 1 Task 2 Task 9 Task 3 Task 4 Task 11 Task 10 Task 12 How much faster would the job be with 2x disk throughput? How would runtimes for these disk writes change? How would that change timing of (and contention for) other resources?
  • 28. Challenges to reasoning about performance Tasks bottleneck on different resources at different times Concurrent tasks on a machine may contend for resources No model for performance
  • 29. How can we achieve this vision? Spark overview Reasoning about Spark’s performance: why it’s hard New architecture: monotasks Reasoning about monotasks performance: why it’s easy Monotasks in action (results)
  • 31. Spark: Tasks bottleneck on different resources Concurrent tasks may contend No model for performance Monotasks: Each task uses one resource Example Spark Job: Read remote data Filter records Write result to disk Network monotasks: Each read one remote block … CPU monotasks: Each filter one block, generate serialized output … Disk monotasks: Each writes one block to disk …
  • 32. Spark: Tasks bottleneck on different resources Concurrent tasks may contend No model for performance Monotasks: Each task uses one resource Dedicated schedulers control contention … … … Network scheduler time CPU scheduler … … Disk scheduler … One task per CPU core One task per disk
  • 33. Spark: Tasks bottleneck on different resources Concurrent tasks may contend No model for performance Monotasks: Each task uses one resource Monotask times can be used to model performance Dedicated schedulers control contention Ideal CPU time: total CPU monotask time / # CPU cores
  • 34. Spark: Tasks bottleneck on different resources Concurrent tasks may contend No model for performance Monotasks: Each task uses one resource Monotask times can be used to model performance Dedicated schedulers control contention Ideal CPU time: total CPU monotask time / # CPU cores Ideal network runtime Ideal disk runtime Job runtime: max of ideal times
  • 35. Spark: Tasks bottleneck on different resources Concurrent tasks may contend No model for performance Monotasks: Each task uses one resource How much faster would the job be with 2x disk throughput? Dedicated schedulers control contention Ideal CPU time: total CPU monotask time / # CPU cores Ideal network runtime Ideal disk runtime Monotask times can be used to model performance
  • 36. Spark: Tasks bottleneck on different resources Concurrent tasks may contend No model for performance Monotasks: Each task uses one resource How much faster would the job be with 2x disk throughput? Dedicated schedulers control contention Ideal CPU time: total CPU monotask time / # CPU cores Ideal network runtime Ideal disk runtime (2x disk concurrency) Monotask times can be used to model performance New job runtime
  • 37. Spark: Tasks bottleneck on different resources Concurrent tasks may contend No model for performance Monotasks: Each task uses one resource Dedicated schedulers control contention Monotask times can be used to model performance Task 1 Task 2 Task 9 Task 3 Task 4 Task 11 Task 10 Task 12 … … … Compute monotasks: one task per CPU core Disk monotasks: one per disk Network monotasks How does this decomposition work?
  • 38. How does this decomposition work? Task 1 Network read CPU Disk write Monotasks Network monotask Disk monotaskCompute monotask
  • 39. Implementation API-compatible with Apache Spark Workloads can be run on monotasks without re-compiling Monotasks decomposition handled by Spark internals Monotasks works at the application level No operating system changes
  • 40. How can we achieve this vision? Spark overview Reasoning about Spark’s performance: why it’s hard New architecture: monotasks Reasoning about monotasks performance: why it’s easy Monotasks in action (results)
  • 41. Performance on-par with Apache Spark Big data benchmark (SQL workload) � ��� ��� ��� ��� ��� ��� ��� ��� �� �� �� �� �� �� �� �� �� � ���������� ����� ����� ���������
  • 42. Performance on-par with Apache Spark Sort (600 GB, 20 machines) Block coordinate descent (Matrix workload used in ML applications) 16 machines �� ��� ��� ��� ��� ��� ��� ��� ��� ��� ���� ���� ������ �������� ����������� ������ �������� ����� ����������������������� ����� ��������� �� ���� ���� ���� ���� ���� ���� ���� ���� ���� ����� ��� ������ ����� ����������������������� ����� ���������
  • 43. Monotasks in action Modeling performance Leveraging performance clarity to optimize performance
  • 44. How much faster would jobs run if each machine had 2 disks instead of 1? �� ���� ���� ���� ���� ���� ���� ���� ���� ���� ��������� ��������� ��������� ����������� �������� ���������������� ��������������������� ������������������ Predictions for different hardware within 10% of the actual runtime
  • 45. How much faster would job run if data were de- serialized and in memory? Eliminates disk time to read input data Eliminates CPU time to de-serialize data
  • 46. How much faster would job run if data were de- serialized and in memory? Task 18 Spark Task Network CPU Disk write : (de)serialization time Measuring (de) serialization time with Spark (De) serialization pipelined with other processing for each record Application-level measurement incurs high overhead
  • 47. How much faster would job run if data were de- serialized and in memory? : (de)serialization time Measuring (de) serialization time with Monotasks Eliminating fine-grained pipelining enables measurement! Original compute monotask Un-rolled monotask
  • 48. How much faster would job run if data were de- serialized and in memory? �� ��� ��� ��� ��� ��� ��� ����������������������� ���������������� ��������������������� ������������������ Eliminate disk monotask time Eliminate CPU monotask time spent (de)serialiazing Re-run model Sort
  • 49. Leveraging Performance Clarity to Improve Performance Schedulers have complete visibility over resource use Framework can configure for best performance … … … Network time CPU … … Disk …
  • 50. Configuring the number of concurrent tasks �� ��� ��� ��� ��� ��� ��� ��� ��� ������ ������ ������ ������ ���� �������������������������� �������� ����������� ���������� Spark with different numbers of concurrent tasks Monotasks better than any configuration: per-resource schedulers automatically schedule with the ideal concurrency
  • 51. Future Work Leveraging performance clarity to improve performance – Use resource queue lengths to dynamically adjust job Automatically configuring for multi-tenancy – Don’t need jobs to specify resource requirements – Can achieve higher utilization: no multi-resource bin packing
  • 52. Monotasks: Each task uses one resource Dedicated schedulers control contention Monotask times can be used to model performance … … … Compute monotasks: one task per CPU core Disk monotasks: one per disk Network monotasks Vision: Spark always reports bottleneck information Challenging with existing architecture Interested? Have a job whose performance you can’t figure out? Email me: keo@cs.berkeley.edu