Profiling & Testing with Spark
Apache Spark 2.0 Improvements, Flame Graphs & Testing
Outline
Overview
Spark 2.0 Improvements
Profiling with Flame Graphs
How-to Flame Graphs
Testing in Spark
Overview
Apache Spark™ is a fast and general engine for large-scale data processing
Speed: In-memory computing, up to 100x faster than MapReduce
Ease of Use: Bindings for Java, Scala, Python and R
Generality: Enabled for SQL, Streaming and complex analytics (ML)
Portable: Runs on YARN, Mesos, standalone or in the cloud
Overview (Big Picture)
Overview (architecture)
Overview (code sample)
Monte Carlo π calculation
“This code estimates π by "throwing darts" at a circle. We pick random
points in the unit square ((0, 0) to (1,1)) and see how many fall in the unit
circle. The fraction should be π / 4, so we use this to get our estimate.”
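The dart-throwing estimate described above can be sketched in plain Python. This is not the Spark code from the slide, only a single-process illustration of the same arithmetic; the function name `estimate_pi` and the fixed seed are assumptions made for the example.

```python
import random


def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Estimate pi by sampling random points in the unit square."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps runs repeatable
    inside = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        # A point falls inside the circle of radius 1 when x^2 + y^2 < 1.
        if x * x + y * y < 1.0:
            inside += 1
    # The fraction of hits approximates pi / 4.
    return 4.0 * inside / num_samples
```

Calling `estimate_pi(200_000)` returns a value close to π. In Spark the sampling loop would be spread over partitions and the hit counts summed.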
Main Takeaway
Spark SQL:
Provides parallelism, affordable at scale
Scale out SQL on storage for Big Data volumes
Scale out on CPU for memory-intensive queries
Offloading reports from RDBMS becomes attractive
Spark 2.0 improvements:
Considerable speedup of CPU-intensive queries
Spark 2.0 Improvements
SQL Queries
sqlContext.sql("""
SELECT a.bucket, sum(a.val2) tot
FROM t1 a, t1 b
WHERE a.bucket=b.bucket and
a.val1+b.val1<1000
GROUP BY a.bucket
ORDER BY a.bucket""").show()
Complex and resource-intensive SELECT statement:
EXPLAIN directive (execution plan)
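To see why this SELECT is CPU-intensive, here is a plain-Python sketch of what the query computes. It is not how Spark executes it. The `Row` type and `run_query` function are hypothetical names, and NULL handling is ignored. The self-join compares every pair of rows that share a bucket, so the work grows quadratically with bucket size.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Row(NamedTuple):
    bucket: int
    val1: int
    val2: int


def run_query(t1: List[Row]) -> List[Tuple[int, int]]:
    """Evaluate the self-join, filter, GROUP BY and ORDER BY of the query."""
    by_bucket: Dict[int, List[Row]] = defaultdict(list)
    for row in t1:
        by_bucket[row.bucket].append(row)

    totals: Dict[int, int] = defaultdict(int)
    for bucket, rows in by_bucket.items():
        # Join condition a.bucket = b.bucket: pair rows within one bucket.
        for a in rows:
            for b in rows:
                if a.val1 + b.val1 < 1000:
                    totals[bucket] += a.val2  # sum(a.val2) per matching pair
    # ORDER BY a.bucket; buckets with no matching pair produce no output row.
    return sorted(totals.items())
```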
Execution Plan
The execution plan:
First instrumentation point for SQL tuning
Shows how Spark wants to execute the query (break-down)
Main players:
Catalyst: the query optimizer
Catalyst (query optimizer)
Logical Plan:
Describes computation on data sets without defining how to conduct it
Physical Plan:
Defines how to conduct the computation on each dataset
Project Tungsten (Goal)
“Improves the memory and CPU efficiency of Spark backend
execution by pushing performance close to the limits of
modern hardware.”
Project Tungsten
Perform manual memory management instead of relying on Java objects:
Reduce memory footprint
Eliminate garbage collection overheads
Use sun.misc.Unsafe and off-heap memory
Code generation for expression evaluation:
Reduce virtual function calls and interpretation overhead (JVM)
Project Tungsten (Code-Gen)
The Volcano Iterator Model:
Standard for 30 years: almost all databases do it.
Each operator is an “iterator” that consumes records from its input operator
Project Tungsten (Code-Gen)
Downsides of the Volcano Iterator Model:
Too many virtual function calls
at least 3 calls for each row in Aggregate phase
Can’t take advantage of modern CPU features
pipelining, prefetching, branch prediction
Project Tungsten (Code-Gen)
What if we hire a college freshman to implement this query in Java in 10 mins?
Whole-stage Code-Gen: Spark as a “Compiler”
Project Tungsten (Code-Gen)
A student beating 30 years of science ...
Project Tungsten (Code-Gen)
Volcano
● Many virtual function calls
● Data in memory (or cache)
● No loop unrolling, SIMD
Hand-written code
● No virtual function calls
● Data in CPU registers
● Exploit compiler optimizations
○ loop unrolling, SIMD, pipelining
Take advantage of all the information that is known after query compilation
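The contrast between the two styles can be sketched in Python. This is a structural illustration only: Python gains no SIMD or register benefits, and the class names, `count_volcano`, `count_fused` and the filter predicate are assumptions for the example, not Spark internals. The Volcano version pays several method calls per row, while the hand-written version is a single loop.

```python
from typing import Callable, List, Optional


class Operator:
    """Volcano-style operator: each next() call returns one row, or None at the end."""

    def next(self) -> Optional[int]:
        raise NotImplementedError


class Scan(Operator):
    def __init__(self, rows: List[int]) -> None:
        self._rows = iter(rows)

    def next(self) -> Optional[int]:
        return next(self._rows, None)


class Filter(Operator):
    def __init__(self, child: Operator, predicate: Callable[[int], bool]) -> None:
        self._child = child
        self._predicate = predicate

    def next(self) -> Optional[int]:
        # Pull rows from the child until one passes the predicate or input ends.
        while True:
            row = self._child.next()
            if row is None or self._predicate(row):
                return row


class Count(Operator):
    def __init__(self, child: Operator) -> None:
        self._child = child
        self._done = False

    def next(self) -> Optional[int]:
        if self._done:
            return None
        count = 0
        while self._child.next() is not None:
            count += 1
        self._done = True
        return count


def count_volcano(rows: List[int], threshold: int) -> int:
    """Count rows above the threshold through a chain of iterator operators."""
    plan = Count(Filter(Scan(rows), lambda value: value > threshold))
    result = plan.next()
    assert result is not None
    return result


def count_fused(rows: List[int], threshold: int) -> int:
    """Hand-written equivalent: one loop, no per-row operator calls."""
    count = 0
    for value in rows:
        if value > threshold:
            count += 1
    return count
```

Both functions return the same count. Whole-stage code generation aims to produce code shaped like the second function from a plan shaped like the first.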
Execution plan comparison (legacy vs whole stage code-gen)
WholeStageCodeGen
Profiling with Flame Graphs
Root Cause Analysis
Benchmarking:
Run the workload and measure it with the relevant diagnostic tools
Goals: understand the bottleneck(s) and find root causes
Limitations:
Our tools & time available for analysis are limiting factors
Profiling CPU-Bound workloads
Flame graph visualization of stack profiles:
● Brainchild of Brendan Gregg (Dec 2011)
● Code: https://github.com/brendangregg/FlameGraph
● Now very popular, available for many languages, also for JVM
Shows which parts of the code are hot
● Very useful to understand where CPU cycles are spent
Flame Graph Visualization
Recipe:
● Gather multiple stack traces
● Aggregate them by sorting alphabetically by function/method name
● Visualization using stacked colored boxes
● Width of each box is proportional to time spent there
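The aggregation step of the recipe can be sketched as follows. Each sampled stack is a sequence of frame names, listed root first. Identical stacks are merged into "folded" lines of the form `frame;frame;frame count`, which is the input the FlameGraph scripts consume. The function names and the sample data are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence


def fold_stacks(samples: Iterable[Sequence[str]]) -> Dict[str, int]:
    """Merge identical stack traces and count how often each one was sampled."""
    return dict(Counter(";".join(frames) for frames in samples))


def folded_lines(counts: Dict[str, int]) -> List[str]:
    """Render counts as folded lines, sorted alphabetically by stack."""
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]
```

The sample count of a folded stack determines the width of its box in the rendered graph.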
Flame Graph (Spark 1.6)
Flame Graph (Spark 2.0)
Spark CodeGen vs. Volcano
Code generation improves CPU-intensive workloads
● Replaces loops and virtual function calls with code generated for the query
● The use of vector operations (e.g. SIMD) is also beneficial
● Codegen is crucial for modern in-memory DBs
Commercial RDBMS engines
● Typically use the slower volcano model (with loops and virtual function calls)
● In the past, optimizing for I/O latency was more important; now CPU cycles matter more
Flame Graphs
Pros: good to understand where CPU cycles are spent
● Useful for performance troubleshooting
● Functions at the top of the graph are the ones using CPU
● Parent methods/functions provide context
Limitations:
● Off-CPU and wait time not charted (experimental)
● Interpretation of flame graphs requires experience/knowledge
● Not included in Spark monitoring suite
How-to Flame Graphs
CERN Java Flight Recorder Approach (1/2)
Enable Java Flight Recorder (JFR)
● Extra options in spark-defaults.conf or CLI. Example:
Collect data with jcmd:
● Example, sampling for 10 sec:
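The original examples for these two steps are not in the transcript. The helpers below are a hedged sketch of what such commands could look like, under two assumptions: an Oracle JDK 8 runtime, where JFR requires the commercial-feature flags, and hypothetical helper names, pid and output path. Newer JDKs do not need the unlock flag.

```python
from typing import List

# Assumption: Oracle JDK 8, where JFR is enabled through these two JVM flags.
JFR_JVM_OPTIONS = "-XX:+UnlockCommercialFeatures -XX:+FlightRecorder"


def spark_conf_args() -> List[str]:
    """Arguments to append to spark-submit or spark-shell to enable JFR."""
    return [
        "--conf", f"spark.driver.extraJavaOptions={JFR_JVM_OPTIONS}",
        "--conf", f"spark.executor.extraJavaOptions={JFR_JVM_OPTIONS}",
    ]


def jcmd_start_args(pid: int, seconds: int, output_path: str) -> List[str]:
    """Command line asking a running JVM to record for a fixed duration."""
    return [
        "jcmd", str(pid), "JFR.start",
        f"duration={seconds}s",
        f"filename={output_path}",
    ]
```

For a 10-second sample, `jcmd_start_args(1234, 10, "/tmp/out.jfr")` yields the argument list for `jcmd 1234 JFR.start duration=10s filename=/tmp/out.jfr`, which can be passed to `subprocess.run`.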
CERN Java Flight Recorder Approach (2/2)
Process the jfr file:
● From .jfr to merged stacks
● Produce the .svg file with the flame graph
● Find details in Kay Ousterhout’s article:
https://gist.github.com/kayousterhout/7008a8ebf2babeedc7ce6f8723fd1bf4
PayPal Approach
https://www.paypal-engineering.com/2016/09/08/spark-in-flames-profiling-spark-applications-using-flame-graphs/
CERN HProfiler Approach
HProfiler (CERN home-built tool)
● Automates collection and aggregation of stack traces into flame graphs for distributed applications
● Integrates with YARN to identify the processes to trace across the cluster
Based on Linux perf_events stack sampling (bare metal)
Experimental tool
● Author Joeri Hermans @ CERN
● https://github.com/cerndb/Hadoop-Profiler
● Hadoop-performance-troubleshooting-stack-tracing
Testing in Spark
● Why run Spark outside of a cluster
● What to test
● Running Local
● Running as a Unit Test
● Data Structures
Testing in Spark
Why run Spark outside of a cluster
● Time
● Trusted Deployment
● Money
Testing in Spark
What to test
● Experiments
● Complex logic
● Data samples
● Business generated scenarios
Testing in Spark (Running Local)
Running Local
● A test doesn’t always need to be a unit test
● UIs like Zeppelin are OK for quick feedback but lack IDE features
● Running local in your IDE is priceless
Testing in Spark (Running Local)
Example
● Use runLocal flag to set a local SparkContext
● Separate out testable work from driver code
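The two points above can be sketched in Python (the original example is not in the transcript, so this is an illustration under assumptions). The record logic lives in plain functions that need no Spark, and the driver decides on a local master from a hypothetical `runLocal` argument. The names `parse_record`, `master_for` and `main`, the `key,value` input format and the `local[2]` setting are all assumptions.

```python
from typing import List, Optional, Tuple


def parse_record(line: str) -> Optional[Tuple[str, int]]:
    """Testable work: parse 'key,value' into a pair, or None if malformed."""
    parts = line.split(",")
    if len(parts) != 2:
        return None
    key, raw = parts[0].strip(), parts[1].strip()
    if not key:
        return None
    try:
        return key, int(raw)
    except ValueError:
        return None


def master_for(run_local: bool) -> Optional[str]:
    """Local master URL when run_local is set; None defers to spark-submit."""
    return "local[2]" if run_local else None


def main(argv: List[str]) -> None:
    """Driver code: the only part that touches Spark."""
    # Imported here so the functions above stay usable without Spark installed.
    from pyspark import SparkConf, SparkContext

    run_local = len(argv) > 2 and argv[1] == "runLocal"  # hypothetical flag
    input_path = argv[-1]
    conf = SparkConf().setAppName("parse-records")
    master = master_for(run_local)
    if master is not None:
        conf = conf.setMaster(master)
    sc = SparkContext(conf=conf)
    try:
        pairs = sc.textFile(input_path).map(parse_record)
        valid = pairs.filter(lambda pair: pair is not None)
        print(valid.reduceByKey(lambda a, b: a + b).collect())
    finally:
        sc.stop()
```

An entry point would call `main(sys.argv)`. Tests and IDE sessions can exercise `parse_record` directly, without starting a SparkContext.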
Testing in Spark (Unit Testing)
Example
FunSuite: TDD unit testing suite for Scala
Testing in Spark (Data Structures)
Working with “hand-written” DataFrames:
Testing in Spark (Hive)
Testing with Hive:
● Spin up a docker-hive container for Apache Hive (Big Data Europe)
● Enables real interaction, allowing you to:
○ create, delete, write, ...
Testing in Spark (Hive)
Putting Hive + Spark together:
● Create a custom hive-site.xml
● Start Spark with the provided hive-site.xml
○ spark-shell --files /PATH/hive-site.xml
Testing in Spark (Hive)
Start Spark with the provided hive-site.xml:
Testing in Spark (Mini-Clusters)
Mini-Clusters
● Hadoop-mini-cluster
● Spark-unit-testing-with-hdfs
● Support for:
○ HBase & Hive
○ Kafka & Storm
○ Zookeeper
○ HDFS
○ ... access HDFS files & test code
copy files from localFS to HDFS
Conclusions
Apache Spark 2.0 Improvements (HDP 2.5 in tech preview)
● Scalability and performance on commodity HW
● Spark SQL useful for offloading queries from traditional RDBMS
● Code generation gives speedups of up to one order of magnitude on CPU-bound workloads
Diagnostics
● Profiling tools are important in the MPP world
● Execution plans analyzed with flame graphs
● Cons: Very immature solutions
Testing
● Testing locally saves time, money and takes advantage of the IDE features
● Elegant ways to test code by using a local SparkContext
● Easy ways to recreate environments for testing real interactions (such as Hadoop)
Profiling & Testing with Spark
THANK YOU!
References
● Deep-dive-into-catalyst-apache-spark-2.0
● http://es.slideshare.net/databricks/spark-performance-whats-next
● https://paperhub.s3.amazonaws.com/dace52a42c07f7f8348b08dc2b186061.pdf
● http://www.brendangregg.com/flamegraphs.html
● http://db-blog.web.cern.ch/
● http://www.slideshare.net/SparkSummit/spark-summit-eu-talk-by-ted-malaska
Q & A

More Related Content

What's hot

[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
NAVER D2
 
The Year of JRuby - RubyC 2018
The Year of JRuby - RubyC 2018The Year of JRuby - RubyC 2018
The Year of JRuby - RubyC 2018
Charles Nutter
 
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
Amazon Web Services
 
Debugging & Tuning in Spark
Debugging & Tuning in SparkDebugging & Tuning in Spark
Debugging & Tuning in Spark
Shiao-An Yuan
 
Performance Profiling in Rust
Performance Profiling in RustPerformance Profiling in Rust
Performance Profiling in Rust
InfluxData
 
ELK: Moose-ively scaling your log system
ELK: Moose-ively scaling your log systemELK: Moose-ively scaling your log system
ELK: Moose-ively scaling your log system
Avleen Vig
 
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
Spark Summit
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
Spark Summit
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
Sandy Ryza
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
Kazuaki Ishizaki
 
Tale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark StreamingTale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark Streaming
Sigmoid
 
Apache Kafka DC Meetup: Replicating DB Binary Logs to Kafka
Apache Kafka DC Meetup: Replicating DB Binary Logs to KafkaApache Kafka DC Meetup: Replicating DB Binary Logs to Kafka
Apache Kafka DC Meetup: Replicating DB Binary Logs to Kafka
Mark Bittmann
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
Wayne Chen
 
10 tips to improve the performance of your AWS application
10 tips to improve the performance of your AWS application10 tips to improve the performance of your AWS application
10 tips to improve the performance of your AWS application
Amazon Web Services
 
PostgreSQL with OpenCL
PostgreSQL with OpenCLPostgreSQL with OpenCL
PostgreSQL with OpenCL
Muhaza Liebenlito
 
GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)
Kohei KaiGai
 
icpe2019_ishizaki_public
icpe2019_ishizaki_publicicpe2019_ishizaki_public
icpe2019_ishizaki_public
Kazuaki Ishizaki
 
Build a Complex, Realtime Data Management App with Postgres 14!
Build a Complex, Realtime Data Management App with Postgres 14!Build a Complex, Realtime Data Management App with Postgres 14!
Build a Complex, Realtime Data Management App with Postgres 14!
Jonathan Katz
 
Java profiling Do It Yourself (jug.msk.ru 2016)
Java profiling Do It Yourself (jug.msk.ru 2016)Java profiling Do It Yourself (jug.msk.ru 2016)
Java profiling Do It Yourself (jug.msk.ru 2016)
aragozin
 
Chainer ui v0.3 and imagereport
Chainer ui v0.3 and imagereportChainer ui v0.3 and imagereport
Chainer ui v0.3 and imagereport
Preferred Networks
 

What's hot (20)

[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
[2C1] 아파치 피그를 위한 테즈 연산 엔진 개발하기 최종
 
The Year of JRuby - RubyC 2018
The Year of JRuby - RubyC 2018The Year of JRuby - RubyC 2018
The Year of JRuby - RubyC 2018
 
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
(PFC306) Performance Tuning Amazon EC2 Instances | AWS re:Invent 2014
 
Debugging & Tuning in Spark
Debugging & Tuning in SparkDebugging & Tuning in Spark
Debugging & Tuning in Spark
 
Performance Profiling in Rust
Performance Profiling in RustPerformance Profiling in Rust
Performance Profiling in Rust
 
ELK: Moose-ively scaling your log system
ELK: Moose-ively scaling your log systemELK: Moose-ively scaling your log system
ELK: Moose-ively scaling your log system
 
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
Taming GC Pauses for Humongous Java Heaps in Spark Graph Computing-(Eric Kacz...
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Enabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache SparkEnabling Vectorized Engine in Apache Spark
Enabling Vectorized Engine in Apache Spark
 
Tale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark StreamingTale of Kafka Consumer for Spark Streaming
Tale of Kafka Consumer for Spark Streaming
 
Apache Kafka DC Meetup: Replicating DB Binary Logs to Kafka
Apache Kafka DC Meetup: Replicating DB Binary Logs to KafkaApache Kafka DC Meetup: Replicating DB Binary Logs to Kafka
Apache Kafka DC Meetup: Replicating DB Binary Logs to Kafka
 
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰HadoopCon 2016  - 用 Jupyter Notebook Hold 住一個上線 Spark  Machine Learning 專案實戰
HadoopCon 2016 - 用 Jupyter Notebook Hold 住一個上線 Spark Machine Learning 專案實戰
 
10 tips to improve the performance of your AWS application
10 tips to improve the performance of your AWS application10 tips to improve the performance of your AWS application
10 tips to improve the performance of your AWS application
 
PostgreSQL with OpenCL
PostgreSQL with OpenCLPostgreSQL with OpenCL
PostgreSQL with OpenCL
 
GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)GPGPU Accelerates PostgreSQL (English)
GPGPU Accelerates PostgreSQL (English)
 
icpe2019_ishizaki_public
icpe2019_ishizaki_publicicpe2019_ishizaki_public
icpe2019_ishizaki_public
 
Build a Complex, Realtime Data Management App with Postgres 14!
Build a Complex, Realtime Data Management App with Postgres 14!Build a Complex, Realtime Data Management App with Postgres 14!
Build a Complex, Realtime Data Management App with Postgres 14!
 
Java profiling Do It Yourself (jug.msk.ru 2016)
Java profiling Do It Yourself (jug.msk.ru 2016)Java profiling Do It Yourself (jug.msk.ru 2016)
Java profiling Do It Yourself (jug.msk.ru 2016)
 
Chainer ui v0.3 and imagereport
Chainer ui v0.3 and imagereportChainer ui v0.3 and imagereport
Chainer ui v0.3 and imagereport
 

Similar to Profiling & Testing with Spark

SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
zmhassan
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
Databricks
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Databricks
 
Boosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesBoosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of Techniques
Ahsan Javed Awan
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Databricks
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
Adarsh Pannu
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark Performance
Tim Ellison
 
BigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkBigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for Spark
DESMOND YUEN
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Databricks
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp API
shareddatamsft
 
Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark faster
Tim Ellison
 
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData
 
Apache Spark Performance Observations
Apache Spark Performance ObservationsApache Spark Performance Observations
Apache Spark Performance Observations
Adam Roberts
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
Spark Summit
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
Holden Karau
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
Adarsh Pannu
 
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
J On The Beach
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Databricks
 
Stream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaStream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and Kafka
Itai Yaffe
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
Yousun Jeong
 

Similar to Profiling & Testing with Spark (20)

SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
Apache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and SmarterApache Spark 2.0: Faster, Easier, and Smarter
Apache Spark 2.0: Faster, Easier, and Smarter
 
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDPBuild Large-Scale Data Analytics and AI Pipeline Using RayDP
Build Large-Scale Data Analytics and AI Pipeline Using RayDP
 
Boosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of TechniquesBoosting spark performance: An Overview of Techniques
Boosting spark performance: An Overview of Techniques
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark Performance
 
BigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for SparkBigDL webinar - Deep Learning Library for Spark
BigDL webinar - Deep Learning Library for Spark
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Seattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp APISeattle Spark Meetup Mobius CSharp API
Seattle Spark Meetup Mobius CSharp API
 
Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark faster
 
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14thSnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
SnappyData Ad Analytics Use Case -- BDAM Meetup Sept 14th
 
Apache Spark Performance Observations
Apache Spark Performance ObservationsApache Spark Performance Observations
Apache Spark Performance Observations
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
 
Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
 
Stream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and KafkaStream, stream, stream: Different streaming methods with Spark and Kafka
Stream, stream, stream: Different streaming methods with Spark and Kafka
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 

More from Roger Rafanell Mas

How to build a self-service data platform and what it can do for your business?
How to build a self-service data platform and what it can do for your business?How to build a self-service data platform and what it can do for your business?
How to build a self-service data platform and what it can do for your business?
Roger Rafanell Mas
 
Activate 2019 - Search and relevance at scale for online classifieds
Activate 2019 - Search and relevance at scale for online classifiedsActivate 2019 - Search and relevance at scale for online classifieds
Activate 2019 - Search and relevance at scale for online classifieds
Roger Rafanell Mas
 
Pensamiento lateral
Pensamiento lateralPensamiento lateral
Pensamiento lateral
Roger Rafanell Mas
 
Storm distributed cache workshop
Storm distributed cache workshopStorm distributed cache workshop
Storm distributed cache workshop
Roger Rafanell Mas
 
IS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorialIS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorial
Roger Rafanell Mas
 
MRI Energy-Efficient Cloud Computing
MRI Energy-Efficient Cloud ComputingMRI Energy-Efficient Cloud Computing
MRI Energy-Efficient Cloud ComputingRoger Rafanell Mas
 
EEDC Intelligent Placement of Datacenters
EEDC Intelligent Placement of DatacentersEEDC Intelligent Placement of Datacenters
EEDC Intelligent Placement of DatacentersRoger Rafanell Mas
 

More from Roger Rafanell Mas (13)

How to build a self-service data platform and what it can do for your business?
How to build a self-service data platform and what it can do for your business?How to build a self-service data platform and what it can do for your business?
How to build a self-service data platform and what it can do for your business?
 
Activate 2019 - Search and relevance at scale for online classifieds
Activate 2019 - Search and relevance at scale for online classifiedsActivate 2019 - Search and relevance at scale for online classifieds
Activate 2019 - Search and relevance at scale for online classifieds
 
Pensamiento lateral
Pensamiento lateralPensamiento lateral
Pensamiento lateral
 
Storm distributed cache workshop
Storm distributed cache workshopStorm distributed cache workshop
Storm distributed cache workshop
 
IS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorialIS-ENES COMP Superscalar tutorial
IS-ENES COMP Superscalar tutorial
 
MRI Energy-Efficient Cloud Computing
MRI Energy-Efficient Cloud ComputingMRI Energy-Efficient Cloud Computing
MRI Energy-Efficient Cloud Computing
 
SDS Amazon RDS
SDS Amazon RDSSDS Amazon RDS
SDS Amazon RDS
 
EEDC Programming Models
EEDC Programming ModelsEEDC Programming Models
EEDC Programming Models
 
EEDC Intelligent Placement of Datacenters
EEDC Intelligent Placement of DatacentersEEDC Intelligent Placement of Datacenters
EEDC Intelligent Placement of Datacenters
 
EEDC Everthing as a Service
EEDC Everthing as a ServiceEEDC Everthing as a Service
EEDC Everthing as a Service
 
EEDC Apache Pig Language
EEDC Apache Pig LanguageEEDC Apache Pig Language
EEDC Apache Pig Language
 
EEDC Distributed Systems
EEDC Distributed SystemsEEDC Distributed Systems
EEDC Distributed Systems
 
EEDC SOAP vs REST
EEDC SOAP vs RESTEEDC SOAP vs REST
EEDC SOAP vs REST
 

Recently uploaded

Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdfEnhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Enhancing Project Management Efficiency_ Leveraging AI Tools like ChatGPT.pdf
Jay Das
 
Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024Globus Compute Introduction - GlobusWorld 2024
Globus Compute Introduction - GlobusWorld 2024
Globus
 
Enterprise Resource Planning System in Telangana
Enterprise Resource Planning System in TelanganaEnterprise Resource Planning System in Telangana
Enterprise Resource Planning System in Telangana
NYGGS Automation Suite
 
RISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent EnterpriseRISE with SAP and Journey to the Intelligent Enterprise
RISE with SAP and Journey to the Intelligent Enterprise
Srikant77
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
takuyayamamoto1800
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
WSO2
 
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, BetterWebinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better
XfilesPro
 
Prosigns: Transforming Business with Tailored Technology Solutions


Profiling & Testing with Spark

  • 1. Profiling & Testing with Spark Apache Spark 2.0 Improvements, Flame Graphs & Testing
  • 2. Outline: Overview; Spark 2.0 Improvements; Profiling with Flame Graphs; How-to Flame Graphs; Testing in Spark
  • 3. Overview: Apache Spark™ is a fast and general engine for large-scale data processing. Speed: in-memory computing, up to 100x faster than MapReduce. Ease of use: bindings for Java, Scala, Python and R. Generality: supports SQL, streaming and complex analytics (ML). Portability: runs on YARN, Mesos, standalone or in the cloud.
  • 6. Overview (code sample): Monte Carlo π calculation. “This code estimates π by "throwing darts" at a circle. We pick random points in the unit square ((0, 0) to (1,1)) and see how many fall in the unit circle. The fraction should be π / 4, so we use this to get our estimate.”
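The dart-throwing estimate quoted above can be sketched in plain Python without a cluster. This is a minimal single-process sketch, not the Spark code shown on the slide; the function name `estimate_pi` and the fixed seed are illustrative assumptions. In Spark, the same per-sample test would be distributed across partitions and the hits summed with a reduce.

```python
import random


def estimate_pi(num_samples: int, seed: int = 42) -> float:
    """Estimate pi by sampling points uniformly in the unit square."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps runs repeatable
    hits = 0
    for _ in range(num_samples):
        x, y = rng.random(), rng.random()
        # The point falls inside the circle of radius 1 if x^2 + y^2 <= 1.
        if x * x + y * y <= 1.0:
            hits += 1
    # hits / num_samples approximates pi / 4, so scale by 4.
    return 4.0 * hits / num_samples


if __name__ == "__main__":
    print(estimate_pi(100_000))
```

The statistical error shrinks roughly with the square root of the sample count, which is why the workload benefits from being spread over many cores.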
  • 7. Main Takeaway. Spark SQL: provides parallelism, affordable at scale; scales out SQL on storage for Big Data volumes; scales out on CPU for memory-intensive queries; makes offloading reports from RDBMS attractive. Spark 2.0 improvements: considerable speedup of CPU-intensive queries.
  • 9. SQL Queries: sqlContext.sql("SELECT a.bucket, sum(a.val2) tot FROM t1 a, t1 b WHERE a.bucket=b.bucket and a.val1+b.val1<1000 GROUP BY a.bucket ORDER BY a.bucket").show() This is a complex and resource-intensive SELECT statement; use the EXPLAIN directive to see its execution plan.
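To make the semantics of this self-join concrete, the sketch below runs the same statement against a tiny in-memory table using Python's built-in `sqlite3` module instead of Spark. The engine swap, the column types and the four sample rows are assumptions made for illustration; only the query text comes from the slide.

```python
import sqlite3

QUERY = """
SELECT a.bucket, sum(a.val2) tot
FROM t1 a, t1 b
WHERE a.bucket = b.bucket AND a.val1 + b.val1 < 1000
GROUP BY a.bucket
ORDER BY a.bucket
"""

# Hypothetical sample data: (bucket, val1, val2)
SAMPLE_ROWS = [(0, 100, 1.0), (0, 950, 2.0), (1, 400, 3.0), (1, 500, 4.0)]


def run_query(rows):
    """Load rows into an in-memory table t1 and run the self-join."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t1 (bucket INTEGER, val1 INTEGER, val2 REAL)")
        conn.executemany("INSERT INTO t1 VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for bucket, tot in run_query(SAMPLE_ROWS):
        print(bucket, tot)
```

Every row of `a` is paired with every row of `b` in the same bucket, so the work grows quadratically with bucket size; that pairing is what makes the statement CPU-intensive.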
  • 10. Execution Plan. The execution plan is the first instrumentation point for SQL tuning and shows how Spark wants to execute the query (break-down). Main players: Catalyst, the query optimizer.
  • 11. Catalyst (query optimizer). Logical Plan: describes computation on data sets without defining how to conduct it. Physical Plan: defines which computation to conduct on each dataset.
  • 12. Project Tungsten (Goal) “Improves the memory and CPU efficiency of Spark backend execution by pushing performance close to the limits of modern hardware.”
  • 13. Project Tungsten. Perform manual memory management instead of relying on Java objects: reduce memory footprint; eliminate garbage collection overheads; use sun.misc.Unsafe and off-heap memory. Code generation for expression evaluation: reduce virtual function calls and interpretation overhead (JVM).
  • 15. Project Tungsten (Code-Gen). The Volcano Iterator Model: the standard for 30 years; almost all databases use it. Each operator is an “iterator” that consumes records from its input operator.
  • 16. Project Tungsten (Code-Gen). Downsides of the Volcano Iterator Model: too many virtual function calls (at least 3 calls for each row in the Aggregate phase); can’t take advantage of modern CPU features (pipelining, prefetching, branch prediction, ...).
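The iterator model and its per-row call overhead can be sketched as follows. The operator classes `Scan`, `Filter` and `Count` are hypothetical names chosen for illustration, not Spark classes; each exposes a `next()` method that pulls one row from its child and returns `None` once the input is exhausted.

```python
from typing import Any, Callable, Iterable, Optional


class Scan:
    """Leaf operator: hands out rows from an in-memory collection."""

    def __init__(self, rows: Iterable[Any]) -> None:
        self._rows = iter(rows)

    def next(self) -> Optional[Any]:
        return next(self._rows, None)


class Filter:
    """Pulls rows from its child until one satisfies the predicate."""

    def __init__(self, child, predicate: Callable[[Any], bool]) -> None:
        self._child = child
        self._predicate = predicate

    def next(self) -> Optional[Any]:
        while True:
            row = self._child.next()  # one call into the child per input row
            if row is None or self._predicate(row):
                return row


class Count:
    """Aggregate operator: consumes its whole input and emits one count."""

    def __init__(self, child) -> None:
        self._child = child
        self._done = False

    def next(self) -> Optional[int]:
        if self._done:
            return None
        total = 0
        while self._child.next() is not None:  # one call per surviving row
            total += 1
        self._done = True
        return total


if __name__ == "__main__":
    plan = Count(Filter(Scan(range(10)), lambda row: row % 2 == 0))
    print(plan.next())
```

Each row crosses several `next()` boundaries plus a predicate call before it is counted, which is the overhead the slide refers to.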
  • 17. Project Tungsten (Code-Gen) What if we hire a college freshman to implement this query in Java in 10 mins?
  • 18. Whole-stage Code-Gen: Spark as a “Compiler”
  • 19. Project Tungsten (Code-Gen) A student beating 30 years of science ...
  • 20. Project Tungsten (Code-Gen). Volcano: ● Many virtual function calls ● Data in memory (or cache) ● No loop unrolling, SIMD. Hand-written code: ● No virtual function calls ● Data in CPU registers ● Exploit compiler optimizations ○ loop unrolling, SIMD, pipelining. Take advantage of all the information that is known after query compilation.
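As a toy illustration of the hand-written style, the sketch below generates the source of a single fused loop for a filter-and-count query and compiles it with `exec`. It is only an analogy: Spark's whole-stage code generation emits Java source, and the function name `generate_count_where` and the `row` variable convention are assumptions made here.

```python
from typing import Callable, Iterable, Tuple


def generate_count_where(predicate_src: str) -> Tuple[Callable[[Iterable], int], str]:
    """Build and compile a fused loop that counts rows matching a predicate.

    predicate_src is a trusted Python expression over the variable `row`.
    """
    source = (
        "def generated(rows):\n"
        "    count = 0\n"
        "    for row in rows:\n"
        f"        if {predicate_src}:\n"
        "            count += 1\n"
        "    return count\n"
    )
    namespace: dict = {}
    exec(source, namespace)  # compile the generated text into a function
    return namespace["generated"], source


if __name__ == "__main__":
    count_even, src = generate_count_where("row % 2 == 0")
    print(src)
    print(count_even(range(10)))
```

Because the predicate is inlined into the loop body, no operator boundary is crossed per row; the whole query is one function specialised to what is known once the query has been compiled.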
  • 21. Execution plan comparison (legacy vs whole stage code-gen) WholeStageCodeGen
  • 23. Root Cause Analysis. Benchmarking: run the workload and measure it with the relevant diagnostic tools. Goals: understand the bottleneck(s) and find root causes. Limitations: our tools and the time available for analysis are limiting factors.
  • 24. Profiling CPU-Bound workloads. Flame graph visualization of stack profiles: ● Brainchild of Brendan Gregg (Dec 2011) ● Code: https://github.com/brendangregg/FlameGraph ● Now very popular, available for many languages, also for the JVM. Shows which parts of the code are hot: ● Very useful to understand where CPU cycles are spent
  • 25. Flame Graph Visualization Recipe: ● Gather multiple stack traces ● Aggregate them by sorting alphabetically by function/method name ● Visualization using stacked colored boxes ● Length of the box proportional to time spent there
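The aggregation steps of this recipe can be sketched in a few lines. The sample stacks and frame names below are hypothetical, and the helper names `fold_stacks` and `box_widths` are assumptions; the `frame;frame;... count` line layout follows the folded-stack format that the FlameGraph scripts take as input.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def fold_stacks(samples: Sequence[Sequence[str]]) -> List[str]:
    """Aggregate sampled stacks (root first) into sorted folded lines."""
    counts = Counter(";".join(stack) for stack in samples)
    # Alphabetical sorting puts stacks with a common prefix next to each other.
    return [f"{stack} {count}" for stack, count in sorted(counts.items())]


def box_widths(samples: Sequence[Sequence[str]], depth: int) -> Dict[Tuple[str, ...], float]:
    """Fraction of all samples covered by each box at the given depth."""
    counts: Counter = Counter()
    for stack in samples:
        if len(stack) > depth:
            counts[tuple(stack[: depth + 1])] += 1
    return {prefix: n / len(samples) for prefix, n in counts.items()}


# Hypothetical stack samples, one tuple per sample, root frame first.
SAMPLES = [
    ("main", "runJob", "aggregate"),
    ("main", "runJob", "aggregate"),
    ("main", "runJob", "sort"),
    ("main", "gc"),
]

if __name__ == "__main__":
    print("\n".join(fold_stacks(SAMPLES)))
    print(box_widths(SAMPLES, depth=1))
```

Box width is a share of samples rather than a direct time measurement, so the picture is only as accurate as the sampling behind it.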
  • 28. Spark CodeGen vs. Volcano. Code generation improves CPU-intensive workloads: ● Replaces loops and virtual function calls with code generated for the query ● The use of vector operations (e.g. SIMD) is also beneficial ● Codegen is crucial for modern in-memory DBs. Commercial RDBMS engines: ● Typically use the slower Volcano model (with loops and virtual function calls) ● In the past, optimizing for I/O latency was more important; now CPU cycles matter more
  • 29. Flame Graphs. Pros: good for understanding where CPU cycles are spent ● Useful for performance troubleshooting ● Functions at the top of the graph are the ones using CPU ● Parent methods/functions provide context. Limitations: ● Off-CPU and wait time are not charted (experimental) ● Interpretation of flame graphs requires experience/knowledge ● Not included in the Spark monitoring suite
  • 31. CERN Java Flight Recorder Approach (1/2) Enable Java Flight Recorder (JFR) ● Extra options in spark-defaults.conf or CLI. Example: Collect data with jcmd: ● Example, sampling for 10 sec:
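The slide's own examples were not captured in this transcript. As a hedged sketch of what such commands can look like, the helpers below only assemble argument lists: the JVM flags are the ones that enabled Flight Recorder on Oracle JDK 8 (an assumption about the JDK in use), `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` are standard Spark properties, and the process id and output path are hypothetical.

```python
from typing import List

# Assumption: Oracle JDK 8, where JFR had to be unlocked explicitly.
JFR_JVM_FLAGS = "-XX:+UnlockCommercialFeatures -XX:+FlightRecorder"


def spark_submit_conf_args() -> List[str]:
    """--conf arguments that pass the JFR flags to driver and executors."""
    return [
        "--conf", f"spark.driver.extraJavaOptions={JFR_JVM_FLAGS}",
        "--conf", f"spark.executor.extraJavaOptions={JFR_JVM_FLAGS}",
    ]


def jcmd_start_recording(pid: int, seconds: int, out_file: str) -> List[str]:
    """jcmd invocation that records the given JVM for a fixed duration."""
    return [
        "jcmd", str(pid), "JFR.start",
        f"duration={seconds}s", f"filename={out_file}",
    ]


if __name__ == "__main__":
    print(" ".join(spark_submit_conf_args()))
    # Hypothetical pid and path, sampling for 10 seconds.
    print(" ".join(jcmd_start_recording(1234, 10, "/tmp/spark.jfr")))
```

The same two properties could instead be set in spark-defaults.conf; which route is preferable depends on whether profiling should be on for every job or only for a single submission.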
  • 32. CERN Java Flight Recorder Approach (2/2) Process the jfr file: ● From .jfr to merged stacks ● Produce the .svg file with the flame graph ● Find details in Kay Ousterhout’s article: https://gist.github.com/kayousterhout/7008a8ebf2babeedc7ce6f8723fd1bf4
  • 34. CERN HProfiler Approach. HProfiler (CERN home-built tool): ● Automates collection and aggregation of stack traces into flame graphs for distributed applications ● Integrates with YARN to identify the processes to trace across the cluster. Based on Linux perf_events stack sampling (bare metal). Experimental tool: ● Author: Joeri Hermans @ CERN ● https://github.com/cerndb/Hadoop-Profiler ● Hadoop-performance-troubleshooting-stack-tracing
  • 36. Testing in Spark ● Why run Spark outside of a cluster ● What to test ● Running Local ● Running as a Unit Test ● Data Structures
  • 37. Testing in Spark: Why run Spark outside of a cluster ● Time ● Trusted Deployment ● Money
  • 38. Testing in Spark What to test ● Experiments ● Complex logic ● Data samples ● Business generated scenarios
  • 39. Testing in Spark (Running Local) Running Local ● A test doesn’t always need to be a unit test ● UIs like Zeppelin are OK for quick feedback but lack IDE features ● Running locally in your IDE is priceless
  • 40. Testing in Spark (Running Local) Example ● Use runLocal flag to set a local SparkContext ● Separate out testable work from driver code
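The two points above can be sketched as follows, assuming a word-count job. The per-record logic lives in a plain function that can be tested without Spark, while the driver decides from a `runLocal` argument whether to set a local master. The CLI layout, the `local[2]` master and the function names are assumptions made for this sketch; `pyspark` is imported lazily so the testable part runs without a Spark installation.

```python
import sys
from typing import List


def tokenize(line: str) -> List[str]:
    """Testable per-record logic: no Spark types involved."""
    return [word.lower() for word in line.split()]


def build_context(run_local: bool, app_name: str = "word-count"):
    """Driver-side setup; only needed when the job actually runs."""
    from pyspark import SparkConf, SparkContext  # lazy import

    conf = SparkConf().setAppName(app_name)
    if run_local:
        conf = conf.setMaster("local[2]")  # two local threads, no cluster
    return SparkContext(conf=conf)


def main(argv: List[str]) -> None:
    # Hypothetical CLI: script.py <input-path> [runLocal]
    if len(argv) < 2:
        print("usage: script.py <input-path> [runLocal]")
        return
    run_local = len(argv) > 2 and argv[2] == "runLocal"
    sc = build_context(run_local)
    try:
        counts = (
            sc.textFile(argv[1])
            .flatMap(tokenize)
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b)
        )
        for word, count in counts.collect():
            print(word, count)
    finally:
        sc.stop()


if __name__ == "__main__":
    main(sys.argv)
```

Keeping `tokenize` free of Spark types means it can be exercised in an IDE or a unit test in milliseconds, and the driver code shrinks to wiring.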
  • 41. Testing in Spark (Unit Testing) Example: FunSuite, the TDD-style unit testing suite from ScalaTest for Scala
  • 42. Testing in Spark (Data Structures) Working with “hand-written” DataFrames:
  • 43. Testing in Spark (Hive) Testing with Hive: ● Spin up a docker-hive container for Apache Hive (Big Data Europe) ● Enables real interaction, allowing you to: ○ create, delete, write, ...
  • 44. Testing in Spark (Hive) Putting Hive + Spark together: ● Create a custom hive-site.xml ● Start Spark with the provided hive-site.xml ○ spark-shell --files /PATH/hive-site.xml
  • 45. Testing in Spark (Hive) Start Spark with the provided hive-site.xml:
  • 46. Testing in Spark (Mini-Clusters) Mini-Clusters ● Hadoop-mini-cluster ● Spark-unit-testing-with-hdfs ● Support for: ○ HBase & Hive ○ Kafka & Storm ○ Zookeeper ○ HDFS ○ ... Examples: access HDFS files & test code; copy files from local FS to HDFS
  • 48. Conclusions. Apache Spark 2.0 Improvements (HDP 2.5 in tech preview): ● Scalability and performance on commodity HW ● Spark SQL is useful for offloading queries from traditional RDBMS ● Code generation gives speedups of up to one order of magnitude on CPU-bound workloads. Diagnostics: ● Profiling tools are important in the MPP world ● Execution plans analyzed with flame graphs ● Cons: very immature solutions. Testing: ● Testing locally saves time and money and takes advantage of IDE features ● Elegant ways to test code by using a local SparkContext ● Easy ways to recreate environments for testing real interactions (such as Hadoop)
  • 49. Profiling & Testing with Spark THANK YOU!
  • 50. References ● Deep-dive-into-catalyst-apache-spark-2.0 ● http://es.slideshare.net/databricks/spark-performance-whats-next ● https://paperhub.s3.amazonaws.com/dace52a42c07f7f8348b08dc2b186061.pdf ● http://www.brendangregg.com/flamegraphs.html ● http://db-blog.web.cern.ch/ ● http://www.slideshare.net/SparkSummit/spark-summit-eu-talk-by-ted-malaska
  • 51. Q & A

Editor's Notes

  1. Could be provided in the Maven POM as: <scope>test</scope>
  2. MPP = Massively Parallel Processing