SlideShare a Scribd company logo
1 of 40
Download to read offline
© 2015 IBM Corporation
A Javatm
Implementer's Guide to
Better Apache Sparktm
Performance
Tim Ellison
IBM Runtimes Team, Hursley, UK
tellison
@tpellison
© 2016 IBM Corporation2
Apache Spark is a fast, general
purpose cluster computing platform
© 2016 IBM Corporation3
Apache Spark is a fast, general
purpose cluster computing platform
© 2016 IBM Corporation4
SQL Streaming
Machine
Learning Graph
Core
Data Frames
Machine Learning
Pipelines
Low-level language optimizations
Run the existing code quicker!
© 2016 IBM Corporation5
SQL Streaming
Machine
Learning Graph
Core
Data Frames
Machine Learning
Pipelines
Higher-level platform optimizations
Write quicker code!
© 2016 IBM Corporation6
Apache Spark Platform Optimizations
 A computing platform wants to offer high-level APIs to understand
“intent” of application and optimize it's implementation
–Apache Spark platform optimizes use of Scala/Java platform
–Scala/Java platform optimizes use of machine hardware capabilities
© 2016 IBM Corporation
IBM Java SE Platform Optimizations
IBM
Runtime
Technology
Centre
Quality
Engineering
Security fixes
Additional Testing
Customer Feedback
Production
Requirements
Performance
Reliability
Scalability
Monitoring and Management
AWT Swing Java2d
XML Crypto Corba
Core libraries
“J9” virtual machine
and JIT compiler
 Deep integration with hardware and software innovation
 Heuristics optimised for workload scenarios
© 2016 IBM Corporation
Low-level Profiling of Apache Spark Workloads
 Tools like HealthCenter and tprof show us how the JVM is performing and the effect on the
CPU.
© 2016 IBM Corporation9
A few things we can do with the JVM to enhance the performance of
Apache Spark!
1) JIT compiler enhancements, and writing JIT-friendly code
2) Faster IO – networking and storage
3) Offloading tasks to co-processors
© 2016 IBM Corporation10
Adaptive Compilation : trading-off effort and benefit
1) The VM searches the JAR, loads and verifies bytecodes, converts
to internal representation (IR), runs bytecode order directly.
2) After many invocations (or via sampling) code gets compiled at
‘cold’ or ‘warm’ level. Modest, local optimizations performed.
3) A low overhead sampling thread is used to identify frequently used
methods within the application.
4) Methods may get recompiled at ‘hot’ or ‘scorching’ levels for more
intensive optimizations.
5) Transition to ‘scorching’ goes through a temporary profiling step, to
produce application specific, globally optimized code.
cold
hot
scorching
profiling
interpreter
warm
Java's bytecodes are compiled as required and based on runtime profiling
- dynamic compilation to the target machine capabilities and application demands
The JIT takes a holistic view of the application, looking for global optimizations
based on actual usage patterns, and speculative assumptions.
© 2016 IBM Corporation11 11
;
Language → IR Generators
Compile
Time
Connectors
Local and Global Optimization Techniques
Block Hoisting
Inliner
Value Propagation
Block Hoisting
CSE
Loop Versioner
PRE
Escape Analysis
Redundant Monitor Elimination
Cold block Outlining
Loop Unroller
Copy Propagation
Optimal Store Placement
Simplifier
Dead Tree Removal
Switch Analyzer
Async Check Removal
cold warm hot FSDsmall profiling customscorching AOT
Profiler
CPU-specific Code Generators
Optimizer
• Multiple optimization
strategies for different
code quality
compile-time
tradeoffs
• Spend compile time
where it makes
biggest difference
•Easy to adopt the
latest optimizations.
•Embodies many
years of Java
experience.
JIT Optimization Techniques
© 2016 IBM Corporation
JIT Optimizations are Limited to Preserving Semantic Equivalence
 Compilers thrive on proving properties of code
– Uncertainty makes code slower
 Problem: Java has some inherent uncertainty
– virtual calls
– many small methods
– dynamic class loading
 Problem: Scala has even more inherent uncertainty
– field accesses become method calls
– many layers of indirection to facilitate multiple inheritance
– frequent use of interface calls
 JIT can speculate to overcome some of these
© 2016 IBM Corporation
Compiler Techniques to Cope with Uncertainty
 Challenge: dynamic class loading means application shape changes
 Approach: JIT speculates about invariance of code, and copes with failure
 Challenge: lots of small methods
 Approach: JIT aggregates small methods into larger ones to avoid call overhead
 Challenge: lots of virtual method calls
 Approach: JIT rewites virtual calls to direct calls through flow analysis
 Challenge: objects are heap-allocated, and must be GC'd
 Approach: JIT converts variables to stack allocation where possible
13
© 2016 IBM Corporation
Programmers Can Help by Reducing Uncertainty
 Locals are faster than globals
– Can prove closed set of storage readers / modifers.
– Fields and statics slow; parameters and locals fast.
 Constants are faster than variables
– Can copy constants inline or across memory caches.
– Java’s final and Scala’s val are your friends.
 private is faster than public
– Private methods can't be dynamically redefined.
– protected and “package private” just as slow as public.
 Small methods (≤100 bytecodes) are good
– More opportunities to in-line them.
 Simple is faster than complex
– Easier for the JIT to reason about the effects.
 Limit extension points and architectural complexity when practical
– Makes call sites concrete.
14
© 2016 IBM Corporation
 Scala is a language with many, many features.
– Some features are for convenience others are for performance
– The JVM is not optimized for these patterns and idioms.
 Understand how the features you are using perform
 Idiomatic:
 Scala & Java library implementations
are standard imperative
 Idiomatic is convenient, but slow!
Example source & performance from https://jazzy.id.au/2012/10/16/benchmarking_scala_against_java.html
Running Scala code on the Java VM
© 2016 IBM Corporation16
Example: Exploiting JIT optimizations to reduce overhead of logging
checks
 Tip: Check for the non-null value of a static field ref to instance of a logging class singleton
– e.g.
– Uses the JIT's speculative optimization to avoid the explicit test for logging being enabled;
instead it ...
1) Generates an internal JIT runtime assumption (e.g. InfoLogger.class is undefined),
2) NOPs the test for trace enablement
3) Uses a class initialization hook for the InfoLogger.class (already necessary for instantiating the class)
4) The JIT will regenerate the test code if the class event is fired
 Spark's logging calls are gated on the checks of a static boolean value
trait Logging
Spark
© 2016 IBM Corporation17
Style: Judicious use of polymorphism
 Spark / Scala has a number of highly polymorphic interface call sites and high fan-in (several calling
contexts invoking the same callee method) in map, reduce, filter, flatMap, ...
– e.g. ExternalSorter.insertAll is very hot (drains an iterator using hasNext/next calls)
 Pattern #1:
– InterruptibleIterator → Scala's mapIterator → Scala's filterIterator → …
 Pattern #2:
– InterruptibleIterator → Scala's filterIterator → Scala's mapIterator → …
 The JIT can only choose one pattern to in-line!
– Makes JIT devirtualization and speculation more risky; using profiling information from a different
context could lead to incorrect devirtualization.
– More conservative speculation, or good phase change detection and recovery are needed in the JIT
compiler to avoid getting it wrong.
 Lambdas and functions as arguments, by definition, introduce different code flow targets
– Passing in widely implemented interfaces produce many different bytecode sequences
– When we in-line we have to put runtime checks ahead of in-lined method bodies to make sure we are
going to run the right method!
– Often specialized classes are used only in a very limited number of places, but the majority of the code
does not use these classes and pays a heavy penalty
– e.g. Scala's attempt to specialize Tuple2 Int argument does more harm than good!
 Tip: Use polymorphism sparingly, use the same order / patterns for nested & wrappered code, and
keep call sites homogeneous.
© 2016 IBM Corporation
Example: Use of Scala Functions as Values
 Scala has first class functions – they can be treated as values
val square = (x: Int) => { x * x }
 Implemented by scalac using generic interfaces!
val square = new Function2[Int, Int] {
def apply(x: Int): Int = x * x
}
 Calls to function values site is highly uncertain
– Program contains many different classes that implement Function2
– Hard to tell which Function2 implementer to inline
def filter(values: List[A], checkOp: A => Boolean): List[A] = {
var result: List[A] = Nil
for (v <- values) { if (checkOp(v)) { result = result :: v } }
return result
}
© 2016 IBM Corporation
JIT compiler enhancements, and writing JIT-friendly code
 Understand the implementation of Scala language features – use them
judiciously.
 Try to reduce uncertainty for the compiler in your coding style.
 Stick to common coding patterns - the JIT is tuned for them.
– As new workloads emerge the latest JITs will change too
 Focus on performance hotspots in
your application
– Too much emphasis on performance
can compromise maintainability
– Too much emphasis on maintainability
can compromise performance!
#WOCinTech Chat
© 2016 IBM Corporation20
Effect of adjusting JIT heuristics for Apache Spark
IBM JDK8 SR3
(tuned)
IBM JDK8 SR3
(out of the box)
PageRank 160% 148%
Sleep 187% 113%
Sort 103% 147%
WordCount 130% 146%
Bayes 100% 91%
Terasort 160% 131%
1/Geometric mean of HiBench time on zLinux 32 cores, 25G heap
Improvements in successive IBM Java 8 releases Performance compared with OpenJDK 8
HiBench huge, Spark 2.0.1, Linux Power8 12 core * 8-way SMT
1.35x
© 2016 IBM Corporation21
Faster IO – networking and storage
© 2016 IBM Corporation22
Performance of the Apache Spark Runtime Core
 Executing tasks
– How quickly can a worker sort, compute,
transform, … the data in this partition?
– Where is the best place to send work to have it
completed quickest?
“Narrow” RDD dependencies e.g. map()
pipeline-able
“Wide” RDD dependencies e.g. reduce()
shuffles
RDD1
partition1
RDD1
partition 2
RDD1
partition 1
RDD1
partition 3
RDD1
partition n...
RDD1
partition1
RDD2
partition 2
RDD2
partition 1
RDD2
partition 3
RDD2
partition n...
RDD1
partition1
RDD3
partition 2
RDD3
partition 1
RDD3
partition 3
RDD3
partition n...
RDD1
partition1
RDD1
partition 2
RDD1
partition 1
RDD1
partition 3
RDD1
partition n...
RDD1
partition1
RDD2
partition 2
RDD2
partition 1
 Moving data blocks
– How quickly can a worker get the data needed
for a task?
– How quickly can a worker persist the results if
required?
© 2016 IBM Corporation
Remote Direct Memory Access (RDMA) Networking
Spark VM
Buffer
Off
Heap
Buffer
Spark VM
Buffer
Off
Heap
Buffer
Ether/IB
SwitchRDMA NIC/HCA RDMA NIC/HCA
OS OS
DMA DMA
(Z-Copy) (Z-Copy)
(B-Copy)(B-Copy)
Acronyms:
Z-Copy – Zero Copy
B-Copy – Buffer Copy
IB – InfiniBand
Ether - Ethernet
NIC – Network Interface Card
HCA – Host Control Adapter
●
Low-latency, high-throughput networking
●
Direct 'application to application' memory pointer exchange between remote hosts
●
Off-load network processing to RDMA NIC/HCA – OS/Kernel Bypass (zero-copy)
●
Introduces new IO characteristics that can influence the Apache Spark transfer plan
Spark node #1 Spark node #2
© 2016 IBM Corporation24
TCP/IP
RDMA
RDMA exhibits improved throughput and reduced latency.
Our JVM makes RDMA available transparently via java.net.Socket APIs (JSoR),
or explicitly via com.ibm jVerbs calls.
Characteristics of RDMA Networking
© 2016 IBM Corporation
Modifications for an RDMA-enabled Apache Spark
25
New dynamic transfer plan that adapts to the load and
responsiveness of the remote hosts.
New “RDMA” shuffle IO mode with lower latency and
higher throughput.
JVM-agnostic
IBM JVM only
JVM-agnostic
IBM JVM only
IBM JVM only
Block manipulation (i.e., RDD partitions)
High-level API
JVM-agnostic working prototype
with RDMA
© 2016 IBM Corporation
TCP-H benchmark 100Gb
30% improvement in database
operations
Shuffle-intensive benchmarks show 30% - 40% better performance
with RDMA
HiBench PageRank 3Gb
40% faster, lower CPU usage
32 cores
(1 master, 4 nodes x 8 cores/node,
32GB Mem/node)
© 2016 IBM Corporation27
Time to complete TeraSort benchmark
32 cores (1 master, 4 nodes x 8 cores/node, 32GB Mem/node), IBM Java 8
0 100 200 300 400 500 600
Spark HiBench TeraSort [30GB]
Execution Time (sec)
556s
159s
TCP/IP
JSoR
© 2016 IBM Corporation28
Faster storage IO with POWER CAPI/Flash
 POWER8 architecture offers a 40Tb Flash drive attached
via Coherent Accelerator Processor Interface (CAPI)
– Provides simple coherent block IO APIs
– No file system overhead
 Power Service Layer (PSL)
– Performs Address Translations
– Maintains Cache
– Simple, but powerful interface to the Accelerator unit
 Coherent Accelerator Processor Proxy (CAPP)
– Maintains directory of cache lines held by Accelerator
– Snoops PowerBus on behalf of Accelerator

© 2016 IBM Corporation29
Faster disk IO with CAPI/Flash-enabled Spark
 When under memory pressure, Spark spills RDDs to disk.
– Happens in ExternalAppendOnlyMap and ExternalSorter
 We have modified Spark to spill to the high-bandwidth, coherently-attached Flash device instead.
– Replacement for DiskBlockManager
– New FlashBlockManager handles spill to/from flash
 Making this pluggable requires some further abstraction in Spark:
– Spill code assumes using disks, and depends on DiskBlockManger
– We are spilling without using a file system layer
 Dramatically improves performance of executors under memory pressure.
 Allows to reach similar performance with much less memory (denser cloud deployments).
IBM Flash System 840Power8 + CAPI
© 2016 IBM Corporation30
e.g. using CAPI Flash for RDD
caching allows for 4X memory
reduction while maintaining equal
performance
© 2016 IBM Corporation31
Offloading tasks to graphics co-processors
© 2016 IBM Corporation32
JIT optimized GPU acceleration
 Comes with caveats
– Recognize a limited set of operations within the lambda expressions,
• notably no object references maintained on GPU
– Default grid dimensions and operating parameters for the GPU
workload
– Redundant/pessimistic data transfer between host and device
• Not using GPU shared memory
– Limited heuristics about when to invoke the GPU and when to
generate CPU instructions
 As the JIT compiles a stream expression we can identify candidates for GPU off-loading
– Arrays copied to and from the device implicitly
– Java operations mapped to GPU kernel operations
– Preserves the standard Java syntax and semantics bytecodes
intermediate
representation
optimizer
CPU GPU
code generator
code
generator
PTX ISACPU native
© 2016 IBM Corporation33
GPU-enabled Standard Java APIs
 Some Arrays.sort() methods will offload work to GPUs today
– e.g. sorting large arrays of ints
© 2016 IBM Corporation34
GPU optimization of Loops and Lambda expressions
Speed-up factor when run on a GPU enabled host
0.00
0.01
0.10
1.00
10.00
100.00
1000.00
auto-SIMD parallel forEach on CPU
parallel forEach on GPU
matrix size
The JIT can recognize parallel stream
code, and automatically compile down to
the GPU.
© 2016 IBM Corporation
DataFrames and DataSets Codegen Optimizations
User's SQL operators
Arbitrary user code
 Uses data schema to generate code
– Builds logical and physical execution plans.
– Loop through rows in the data set to:
●
Access serialized storage, convert data to VM objects, call function |
execute SQL plan
– Code is generated and compiled at runtime
© 2016 IBM Corporation
Storage Layout Optimization
 UnsafeRow representation
– Data laid out in a heap byte array
– Accessed via sun.misc.Unsafe method calls
 Conversion to Columunar Format
– Allows for compression and better vector processing
© 2016 IBM Corporation
/* 049 */ while (inputadapter_input.hasNext()) {
/* 050 */ InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
/* 051 */ boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
/* 052 */ int inputadapter_value = inputadapter_isNull ? -1 : (inputadapter_row.getInt(0));
/* 053 */
/* 054 */ if (!(!(inputadapter_isNull))) continue;
/* 055 */
/* 056 */ boolean filter_isNull2 = false;
/* 057 */
/* 058 */ boolean filter_value2 = false;
/* 059 */ filter_value2 = inputadapter_value > 8;
/* 060 */ if (!filter_value2) continue;
/* 061 */ boolean inputadapter_isNull1 = inputadapter_row.isNullAt(1);
/* 062 */ double inputadapter_value1 = inputadapter_isNull1 ? -1.0 : (inputadapter_row.getDouble(1));
/* 063 */
/* 064 */ if (!(!(inputadapter_isNull1))) continue;
/* 065 */
/* 066 */ boolean filter_isNull7 = false;
/* 067 */
/* 068 */ boolean filter_value7 = false;
/* 069 */ filter_value7 = org.apache.spark.util.Utils.nanSafeCompareDoubles(inputadapter_value1, 24.0D) > 0;
/* 070 */ if (!filter_value7) continue;
/* 071 */
/* 072 */ filter_numOutputRows.add(1);
/* 073 */
/* 074 */ // do aggregate
/* 075 */ // common sub-expressions
/* 076 */
/* 077 */ // evaluate aggregate function
/* 078 */ boolean agg_isNull1 = false;
/* 079 */
/* 080 */ long agg_value1 = -1L;
/* 081 */ agg_value1 = agg_bufValue + 1L;
/* 082 */ // update aggregation buffer
/* 083 */ agg_bufIsNull = false;
/* 084 */ agg_bufValue = agg_value1;
/* 085 */ if (shouldStop()) return;
/* 086 */ }
Code generation is shared between columnar and
row based storage.
●
this introduces huge overhead.
●
the complex loop is not a counted loop, contains
numerous branches (due to nullability checks)
and calls etc
Generated Code
© 2016 IBM Corporation
●
ColumnarBatch format instead of CachedBatch for in-memory cache.
●
Common compression scheme (e.g. lz4) for all of the data types in a ColumnVector
Optimizing Spark's Generated Code
/* 073 */ while (inmemorytablescan_batch != null) {
/* 074 */ int numRows = inmemorytablescan_batch.numRows();
/* 075 */ while (inmemorytablescan_batchIdx < numRows) {
/* 076 */ int inmemorytablescan_rowIdx = inmemorytablescan_batchIdx++;
/* 077 */ boolean inmemorytablescan_isNull = inmemorytablescan_colInstance0.isNullAt(inmemorytablescan_rowIdx);
/* 078 */ int inmemorytablescan_value = inmemorytablescan_isNull ? -1 : (inmemorytablescan_colInstance0.getInt(inmemorytablescan_rowIdx));
/* 079 */
/* 080 */ // filtering out values and counting rows similar to the old code
/* 081 */ ...
/* 112 */ }
/* 113 */ inmemorytablescan_batch = null;
/* 114 */ inmemorytablescan_nextBatch();
/* 115 */ }
●
Simple and specialized runtime code for building
a concrete instance of the in-memory cache
●
Read data directly from ColumnVector in the
whole-stage codegen.
3.4x performance
improvement
© 2016 IBM Corporation39
Summary
 We are focused on Core runtime performance to get a multiplier up the Apache Spark stack.
– More efficient code, more efficient memory usage/spilling, more efficient serialization &
networking, etc.
 There are hardware and software technologies we can bring to the party.
– We can tune the stack from hardware to high level structures for running Apache Spark.
 Apache Spark and Scala developers can help themselves by their style of coding.
 All the changes are being made in the Java runtime or
being pushed out to the Apache Spark community.
 There is lots more stuff I don't have time to talk about, like GC optimizations, object layout,
monitoring VM/Apache Spark events, hardware compression, security, etc. etc.
– mailto:
http://ibm.biz/spark­kit
© 2016 IBM Corporation40
Questions?

More Related Content

What's hot

Compiler Construction | Lecture 1 | What is a compiler?
Compiler Construction | Lecture 1 | What is a compiler?Compiler Construction | Lecture 1 | What is a compiler?
Compiler Construction | Lecture 1 | What is a compiler?Eelco Visser
 
Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...
Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...
Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...Thomas Wuerthinger
 
Classification of Compilers
Classification of CompilersClassification of Compilers
Classification of CompilersSarmad Ali
 
Cd ch1 - introduction
Cd   ch1 - introductionCd   ch1 - introduction
Cd ch1 - introductionmengistu23
 
Ijaprr vol1-2-13-60-64tejinder
Ijaprr vol1-2-13-60-64tejinderIjaprr vol1-2-13-60-64tejinder
Ijaprr vol1-2-13-60-64tejinderijaprr_editor
 
Just-in-time compiler (March, 2017)
Just-in-time compiler (March, 2017)Just-in-time compiler (March, 2017)
Just-in-time compiler (March, 2017)Rachel M. Carmena
 
JVM JIT compilation overview by Vladimir Ivanov
JVM JIT compilation overview by Vladimir IvanovJVM JIT compilation overview by Vladimir Ivanov
JVM JIT compilation overview by Vladimir IvanovZeroTurnaround
 
Compiler Design Introduction
Compiler Design IntroductionCompiler Design Introduction
Compiler Design IntroductionKuppusamy P
 
Compiler Design Lecture Notes
Compiler Design Lecture NotesCompiler Design Lecture Notes
Compiler Design Lecture NotesFellowBuddy.com
 
JavaScript for ABAP Programmers - 3/7 Syntax
JavaScript for ABAP Programmers - 3/7 SyntaxJavaScript for ABAP Programmers - 3/7 Syntax
JavaScript for ABAP Programmers - 3/7 SyntaxChris Whealy
 
Syntax directed-translation
Syntax directed-translationSyntax directed-translation
Syntax directed-translationJunaid Lodhi
 
Introduction to course
Introduction to courseIntroduction to course
Introduction to coursenikit meshram
 
compiler and their types
compiler and their typescompiler and their types
compiler and their typespatchamounika7
 

What's hot (20)

JVM++: The Graal VM
JVM++: The Graal VMJVM++: The Graal VM
JVM++: The Graal VM
 
Compiler Construction | Lecture 1 | What is a compiler?
Compiler Construction | Lecture 1 | What is a compiler?Compiler Construction | Lecture 1 | What is a compiler?
Compiler Construction | Lecture 1 | What is a compiler?
 
Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...
Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...
Graal and Truffle: Modularity and Separation of Concerns as Cornerstones for ...
 
Types of Compilers
Types of CompilersTypes of Compilers
Types of Compilers
 
Classification of Compilers
Classification of CompilersClassification of Compilers
Classification of Compilers
 
Cd ch1 - introduction
Cd   ch1 - introductionCd   ch1 - introduction
Cd ch1 - introduction
 
Ijaprr vol1-2-13-60-64tejinder
Ijaprr vol1-2-13-60-64tejinderIjaprr vol1-2-13-60-64tejinder
Ijaprr vol1-2-13-60-64tejinder
 
Just-in-time compiler (March, 2017)
Just-in-time compiler (March, 2017)Just-in-time compiler (March, 2017)
Just-in-time compiler (March, 2017)
 
Compiler type
Compiler typeCompiler type
Compiler type
 
Compilers
CompilersCompilers
Compilers
 
JVM JIT compilation overview by Vladimir Ivanov
JVM JIT compilation overview by Vladimir IvanovJVM JIT compilation overview by Vladimir Ivanov
JVM JIT compilation overview by Vladimir Ivanov
 
Compiler Design Introduction
Compiler Design IntroductionCompiler Design Introduction
Compiler Design Introduction
 
Compiler Design Lecture Notes
Compiler Design Lecture NotesCompiler Design Lecture Notes
Compiler Design Lecture Notes
 
JavaScript for ABAP Programmers - 3/7 Syntax
JavaScript for ABAP Programmers - 3/7 SyntaxJavaScript for ABAP Programmers - 3/7 Syntax
JavaScript for ABAP Programmers - 3/7 Syntax
 
Syntax directed-translation
Syntax directed-translationSyntax directed-translation
Syntax directed-translation
 
Introduction to course
Introduction to courseIntroduction to course
Introduction to course
 
Compiler Construction
Compiler ConstructionCompiler Construction
Compiler Construction
 
pravesh_kumar
pravesh_kumarpravesh_kumar
pravesh_kumar
 
compiler and their types
compiler and their typescompiler and their types
compiler and their types
 
Compiler design
Compiler designCompiler design
Compiler design
 

Similar to Apache Big Data Europe 2016

A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.J On The Beach
 
Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterTim Ellison
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceTim Ellison
 
IBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkIBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkAdamRobertsIBM
 
Apache Spark Performance Observations
Apache Spark Performance ObservationsApache Spark Performance Observations
Apache Spark Performance ObservationsAdam Roberts
 
Java performance tuning
Java performance tuningJava performance tuning
Java performance tuningJerry Kurian
 
What's new in Java 8
What's new in Java 8What's new in Java 8
What's new in Java 8jclingan
 
Usemon; Building The Big Brother Of The Java Virtual Machinve
Usemon; Building The Big Brother Of The Java Virtual MachinveUsemon; Building The Big Brother Of The Java Virtual Machinve
Usemon; Building The Big Brother Of The Java Virtual MachinvePaul René Jørgensen
 
Interactive Analytics using Apache Spark
Interactive Analytics using Apache SparkInteractive Analytics using Apache Spark
Interactive Analytics using Apache SparkSachin Aggarwal
 
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...Spark Summit
 
What's New in IBM Java 8 SE?
What's New in IBM Java 8 SE?What's New in IBM Java 8 SE?
What's New in IBM Java 8 SE?Tim Ellison
 
Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Julien SIMON
 
compressed.tracemonkey-pldi-09.pdf
compressed.tracemonkey-pldi-09.pdfcompressed.tracemonkey-pldi-09.pdf
compressed.tracemonkey-pldi-09.pdfGinaMartinezTacuchi
 
compressed.tracemonkey-pldi-09.pdf
compressed.tracemonkey-pldi-09.pdfcompressed.tracemonkey-pldi-09.pdf
compressed.tracemonkey-pldi-09.pdfMaherEmad1
 
Cast Iron Cloud Integration Best Practices
Cast Iron Cloud Integration Best PracticesCast Iron Cloud Integration Best Practices
Cast Iron Cloud Integration Best PracticesSarath Ambadas
 
"JIT compiler overview" @ JEEConf 2013, Kiev, Ukraine
"JIT compiler overview" @ JEEConf 2013, Kiev, Ukraine"JIT compiler overview" @ JEEConf 2013, Kiev, Ukraine
"JIT compiler overview" @ JEEConf 2013, Kiev, UkraineVladimir Ivanov
 
Real World Java Compatibility
Real World Java CompatibilityReal World Java Compatibility
Real World Java CompatibilityTim Ellison
 
JVM JIT-compiler overview @ JavaOne Moscow 2013
JVM JIT-compiler overview @ JavaOne Moscow 2013JVM JIT-compiler overview @ JavaOne Moscow 2013
JVM JIT-compiler overview @ JavaOne Moscow 2013Vladimir Ivanov
 

Similar to Apache Big Data Europe 2016 (20)

A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison.
 
Five cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark fasterFive cool ways the JVM can run Apache Spark faster
Five cool ways the JVM can run Apache Spark faster
 
A Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark PerformanceA Java Implementer's Guide to Better Apache Spark Performance
A Java Implementer's Guide to Better Apache Spark Performance
 
IBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache SparkIBM Runtimes Performance Observations with Apache Spark
IBM Runtimes Performance Observations with Apache Spark
 
Apache Spark Performance Observations
Apache Spark Performance ObservationsApache Spark Performance Observations
Apache Spark Performance Observations
 
Java performance tuning
Java performance tuningJava performance tuning
Java performance tuning
 
Java 8
Java 8Java 8
Java 8
 
What's new in Java 8
What's new in Java 8What's new in Java 8
What's new in Java 8
 
Usemon; Building The Big Brother Of The Java Virtual Machinve
Usemon; Building The Big Brother Of The Java Virtual MachinveUsemon; Building The Big Brother Of The Java Virtual Machinve
Usemon; Building The Big Brother Of The Java Virtual Machinve
 
Interactive Analytics using Apache Spark
Interactive Analytics using Apache SparkInteractive Analytics using Apache Spark
Interactive Analytics using Apache Spark
 
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...
 
What's New in IBM Java 8 SE?
What's New in IBM Java 8 SE?What's New in IBM Java 8 SE?
What's New in IBM Java 8 SE?
 
Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)Building Machine Learning Inference Pipelines at Scale (July 2019)
Building Machine Learning Inference Pipelines at Scale (July 2019)
 
Apache Drill (ver. 0.2)
Apache Drill (ver. 0.2)Apache Drill (ver. 0.2)
Apache Drill (ver. 0.2)
 
compressed.tracemonkey-pldi-09.pdf
compressed.tracemonkey-pldi-09.pdfcompressed.tracemonkey-pldi-09.pdf
compressed.tracemonkey-pldi-09.pdf
 
compressed.tracemonkey-pldi-09.pdf
compressed.tracemonkey-pldi-09.pdfcompressed.tracemonkey-pldi-09.pdf
compressed.tracemonkey-pldi-09.pdf
 
Cast Iron Cloud Integration Best Practices
Cast Iron Cloud Integration Best PracticesCast Iron Cloud Integration Best Practices
Cast Iron Cloud Integration Best Practices
 
"JIT compiler overview" @ JEEConf 2013, Kiev, Ukraine
"JIT compiler overview" @ JEEConf 2013, Kiev, Ukraine"JIT compiler overview" @ JEEConf 2013, Kiev, Ukraine
"JIT compiler overview" @ JEEConf 2013, Kiev, Ukraine
 
Real World Java Compatibility
Real World Java CompatibilityReal World Java Compatibility
Real World Java Compatibility
 
JVM JIT-compiler overview @ JavaOne Moscow 2013
JVM JIT-compiler overview @ JavaOne Moscow 2013JVM JIT-compiler overview @ JavaOne Moscow 2013
JVM JIT-compiler overview @ JavaOne Moscow 2013
 

More from Tim Ellison

The Extraordinary World of Quantum Computing
The Extraordinary World of Quantum ComputingThe Extraordinary World of Quantum Computing
The Extraordinary World of Quantum ComputingTim Ellison
 
Java on zSystems zOS
Java on zSystems zOSJava on zSystems zOS
Java on zSystems zOSTim Ellison
 
Inside IBM Java 7
Inside IBM Java 7Inside IBM Java 7
Inside IBM Java 7Tim Ellison
 
Secure Engineering Practices for Java
Secure Engineering Practices for JavaSecure Engineering Practices for Java
Secure Engineering Practices for JavaTim Ellison
 
Securing Java in the Server Room
Securing Java in the Server RoomSecuring Java in the Server Room
Securing Java in the Server RoomTim Ellison
 
Modules all the way down: OSGi and the Java Platform Module System
Modules all the way down: OSGi and the Java Platform Module SystemModules all the way down: OSGi and the Java Platform Module System
Modules all the way down: OSGi and the Java Platform Module SystemTim Ellison
 
Virtualization aware Java VM
Virtualization aware Java VMVirtualization aware Java VM
Virtualization aware Java VMTim Ellison
 
Using GPUs to Handle Big Data with Java
Using GPUs to Handle Big Data with JavaUsing GPUs to Handle Big Data with Java
Using GPUs to Handle Big Data with JavaTim Ellison
 

More from Tim Ellison (8)

The Extraordinary World of Quantum Computing
The Extraordinary World of Quantum ComputingThe Extraordinary World of Quantum Computing
The Extraordinary World of Quantum Computing
 
Java on zSystems zOS
Java on zSystems zOSJava on zSystems zOS
Java on zSystems zOS
 
Inside IBM Java 7
Inside IBM Java 7Inside IBM Java 7
Inside IBM Java 7
 
Secure Engineering Practices for Java
Secure Engineering Practices for JavaSecure Engineering Practices for Java
Secure Engineering Practices for Java
 
Securing Java in the Server Room
Securing Java in the Server RoomSecuring Java in the Server Room
Securing Java in the Server Room
 
Modules all the way down: OSGi and the Java Platform Module System
Modules all the way down: OSGi and the Java Platform Module SystemModules all the way down: OSGi and the Java Platform Module System
Modules all the way down: OSGi and the Java Platform Module System
 
Virtualization aware Java VM
Virtualization aware Java VMVirtualization aware Java VM
Virtualization aware Java VM
 
Using GPUs to Handle Big Data with Java
Using GPUs to Handle Big Data with JavaUsing GPUs to Handle Big Data with Java
Using GPUs to Handle Big Data with Java
 

Recently uploaded

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 

Apache Big Data Europe 2016

  • 1. © 2015 IBM Corporation A Javatm Implementer's Guide to Better Apache Sparktm Performance Tim Ellison IBM Runtimes Team, Hursley, UK tellison @tpellison
  • 2. © 2016 IBM Corporation2 Apache Spark is a fast, general purpose cluster computing platform
  • 3. © 2016 IBM Corporation3 Apache Spark is a fast, general purpose cluster computing platform
  • 4. © 2016 IBM Corporation4 SQL Streaming Machine Learning Graph Core Data Frames Machine Learning Pipelines Low-level language optimizations Run the existing code quicker!
  • 5. © 2016 IBM Corporation5 SQL Streaming Machine Learning Graph Core Data Frames Machine Learning Pipelines Higher-level platform optimizations Write quicker code!
  • 6. © 2016 IBM Corporation6 Apache Spark Platform Optimizations  A computing platform wants to offer high-level APIs to understand “intent” of application and optimize it's implementation –Apache Spark platform optimizes use of Scala/Java platform –Scala/Java platform optimizes use of machine hardware capabilities
  • 7. © 2016 IBM Corporation IBM Java SE Platform Optimizations IBM Runtime Technology Centre Quality Engineering Security fixes Additional Testing Customer Feedback Production Requirements Performance Reliability Scalability Monitoring and Management AWT Swing Java2d XML Crypto Corba Core libraries “J9” virtual machine and JIT compiler  Deep integration with hardware and software innovation  Heuristics optimised for workload scenarios
  • 8. © 2016 IBM Corporation Low-level Profiling of Apache Spark Workloads  Tools like HealthCenter and tprof show us how the JVM is performing and the effect on the CPU.
  • 9. © 2016 IBM Corporation9 A few things we can do with the JVM to enhance the performance of Apache Spark! 1) JIT compiler enhancements, and writing JIT-friendly code 2) Faster IO – networking and storage 3) Offloading tasks to co-processors
  • 10. © 2016 IBM Corporation10 Adaptive Compilation : trading-off effort and benefit 1) The VM searches the JAR, loads and verifies bytecodes, converts to internal representation (IR), runs bytecode order directly. 2) After many invocations (or via sampling) code gets compiled at ‘cold’ or ‘warm’ level. Modest, local optimizations performed. 3) A low overhead sampling thread is used to identify frequently used methods within the application. 4) Methods may get recompiled at ‘hot’ or ‘scorching’ levels for more intensive optimizations. 5) Transition to ‘scorching’ goes through a temporary profiling step, to produce application specific, globally optimized code. cold hot scorching profiling interpreter warm Java's bytecodes are compiled as required and based on runtime profiling - dynamic compilation to the target machine capabilities and application demands The JIT takes a holistic view of the application, looking for global optimizations based on actual usage patterns, and speculative assumptions.
  • 11. © 2016 IBM Corporation11 11 ; Language → IR Generators Compile Time Connectors Local and Global Optimization Techniques Block Hoisting Inliner Value Propagation Block Hoisting CSE Loop Versioner PRE Escape Analysis Redundant Monitor Elimination Cold block Outlining Loop Unroller Copy Propagation Optimal Store Placement Simplifier Dead Tree Removal Switch Analyzer Async Check Removal cold warm hot FSDsmall profiling customscorching AOT Profiler CPU-specific Code Generators Optimizer • Multiple optimization strategies for different code quality compile-time tradeoffs • Spend compile time where it makes biggest difference •Easy to adopt the latest optimizations. •Embodies many years of Java experience. JIT Optimization Techniques
  • 12. © 2016 IBM Corporation JIT Optimizations are Limited to Preserving Semantic Equivalence  Compilers thrive on proving properties of code – Uncertainty makes code slower  Problem: Java has some inherent uncertainty – virtual calls – many small methods – dynamic class loading  Problem: Scala has even more inherent uncertainty – field accesses become method calls – many layers of indirection to facilitate multiple inheritance – frequent use of interface calls  JIT can speculate to overcome some of these
  • 13. © 2016 IBM Corporation Compiler Techniques to Cope with Uncertainty  Challenge: dynamic class loading means application shape changes  Approach: JIT speculates about invariance of code, and copes with failure  Challenge: lots of small methods  Approach: JIT aggregates small methods into larger ones to avoid call overhead  Challenge: lots of virtual method calls  Approach: JIT rewites virtual calls to direct calls through flow analysis  Challenge: objects are heap-allocated, and must be GC'd  Approach: JIT converts variables to stack allocation where possible 13
  • 14. © 2016 IBM Corporation Programmers Can Help by Reducing Uncertainty  Locals are faster than globals – Can prove closed set of storage readers / modifers. – Fields and statics slow; parameters and locals fast.  Constants are faster than variables – Can copy constants inline or across memory caches. – Java’s final and Scala’s val are your friends.  private is faster than public – Private methods can't be dynamically redefined. – protected and “package private” just as slow as public.  Small methods (≤100 bytecodes) are good – More opportunities to in-line them.  Simple is faster than complex – Easier for the JIT to reason about the effects.  Limit extension points and architectural complexity when practical – Makes call sites concrete. 14
  • 15. © 2016 IBM Corporation  Scala is a language with many, many features. – Some features are for convenience others are for performance – The JVM is not optimized for these patterns and idioms.  Understand how the features you are using perform  Idiomatic:  Scala & Java library implementations are standard imperative  Idiomatic is convenient, but slow! Example source & performance from https://jazzy.id.au/2012/10/16/benchmarking_scala_against_java.html Running Scala code on the Java VM
  • 16. © 2016 IBM Corporation16 Example: Exploiting JIT optimizations to reduce overhead of logging checks  Tip: Check for the non-null value of a static field ref to instance of a logging class singleton – e.g. – Uses the JIT's speculative optimization to avoid the explicit test for logging being enabled; instead it ... 1) Generates an internal JIT runtime assumption (e.g. InfoLogger.class is undefined), 2) NOPs the test for trace enablement 3) Uses a class initialization hook for the InfoLogger.class (already necessary for instantiating the class) 4) The JIT will regenerate the test code if the class event is fired  Spark's logging calls are gated on the checks of a static boolean value trait Logging Spark
  • 17. © 2016 IBM Corporation17 Style: Judicious use of polymorphism  Spark / Scala has a number of highly polymorphic interface call sites and high fan-in (several calling contexts invoking the same callee method) in map, reduce, filter, flatMap, ... – e.g. ExternalSorter.insertAll is very hot (drains an iterator using hasNext/next calls)  Pattern #1: – InterruptibleIterator → Scala's mapIterator → Scala's filterIterator → …  Pattern #2: – InterruptibleIterator → Scala's filterIterator → Scala's mapIterator → …  The JIT can only choose one pattern to in-line! – Makes JIT devirtualization and speculation more risky; using profiling information from a different context could lead to incorrect devirtualization. – More conservative speculation, or good phase change detection and recovery are needed in the JIT compiler to avoid getting it wrong.  Lambdas and functions as arguments, by definition, introduce different code flow targets – Passing in widely implemented interfaces produce many different bytecode sequences – When we in-line we have to put runtime checks ahead of in-lined method bodies to make sure we are going to run the right method! – Often specialized classes are used only in a very limited number of places, but the majority of the code does not use these classes and pays a heavy penalty – e.g. Scala's attempt to specialize Tuple2 Int argument does more harm than good!  Tip: Use polymorphism sparingly, use the same order / patterns for nested & wrappered code, and keep call sites homogeneous.
  • 18. © 2016 IBM Corporation Example: Use of Scala Functions as Values  Scala has first class functions – they can be treated as values val square = (x: Int) => { x * x }  Implemented by scalac using generic interfaces! val square = new Function2[Int, Int] { def apply(x: Int): Int = x * x }  Calls to function values site is highly uncertain – Program contains many different classes that implement Function2 – Hard to tell which Function2 implementer to inline def filter(values: List[A], checkOp: A => Boolean): List[A] = { var result: List[A] = Nil for (v <- values) { if (checkOp(v)) { result = result :: v } } return result }
  • 19. © 2016 IBM Corporation JIT compiler enhancements, and writing JIT-friendly code  Understand the implementation of Scala language features – use them judiciously.  Try to reduce uncertainty for the compiler in your coding style.  Stick to common coding patterns - the JIT is tuned for them. – As new workloads emerge the latest JITs will change too  Focus on performance hotspots in your application – Too much emphasis on performance can compromise maintainability – Too much emphasis on maintainability can compromise performance! #WOCinTech Chat
  • 20. © 2016 IBM Corporation20 Effect of adjusting JIT heuristics for Apache Spark IBM JDK8 SR3 (tuned) IBM JDK8 SR3 (out of the box) PageRank 160% 148% Sleep 187% 113% Sort 103% 147% WordCount 130% 146% Bayes 100% 91% Terasort 160% 131% 1/Geometric mean of HiBench time on zLinux 32 cores, 25G heap Improvements in successive IBM Java 8 releases Performance compared with OpenJDK 8 HiBench huge, Spark 2.0.1, Linux Power8 12 core * 8-way SMT 1.35x
  • 21. © 2016 IBM Corporation21 Faster IO – networking and storage
  • 22. © 2016 IBM Corporation22 Performance of the Apache Spark Runtime Core  Executing tasks – How quickly can a worker sort, compute, transform, … the data in this partition? – Where is the best place to send work to have it completed quickest? “Narrow” RDD dependencies e.g. map() pipeline-able “Wide” RDD dependencies e.g. reduce() shuffles RDD1 partition1 RDD1 partition 2 RDD1 partition 1 RDD1 partition 3 RDD1 partition n... RDD1 partition1 RDD2 partition 2 RDD2 partition 1 RDD2 partition 3 RDD2 partition n... RDD1 partition1 RDD3 partition 2 RDD3 partition 1 RDD3 partition 3 RDD3 partition n... RDD1 partition1 RDD1 partition 2 RDD1 partition 1 RDD1 partition 3 RDD1 partition n... RDD1 partition1 RDD2 partition 2 RDD2 partition 1  Moving data blocks – How quickly can a worker get the data needed for a task? – How quickly can a worker persist the results if required?
  • 23. © 2016 IBM Corporation Remote Direct Memory Access (RDMA) Networking Spark VM Buffer Off Heap Buffer Spark VM Buffer Off Heap Buffer Ether/IB SwitchRDMA NIC/HCA RDMA NIC/HCA OS OS DMA DMA (Z-Copy) (Z-Copy) (B-Copy)(B-Copy) Acronyms: Z-Copy – Zero Copy B-Copy – Buffer Copy IB – InfiniBand Ether - Ethernet NIC – Network Interface Card HCA – Host Control Adapter ● Low-latency, high-throughput networking ● Direct 'application to application' memory pointer exchange between remote hosts ● Off-load network processing to RDMA NIC/HCA – OS/Kernel Bypass (zero-copy) ● Introduces new IO characteristics that can influence the Apache Spark transfer plan Spark node #1 Spark node #2
  • 24. © 2016 IBM Corporation24 TCP/IP RDMA RDMA exhibits improved throughput and reduced latency. Our JVM makes RDMA available transparently via java.net.Socket APIs (JSoR), or explicitly via com.ibm jVerbs calls. Characteristics of RDMA Networking
  • 25. © 2016 IBM Corporation Modifications for an RDMA-enabled Apache Spark 25 New dynamic transfer plan that adapts to the load and responsiveness of the remote hosts. New “RDMA” shuffle IO mode with lower latency and higher throughput. JVM-agnostic IBM JVM only JVM-agnostic IBM JVM only IBM JVM only Block manipulation (i.e., RDD partitions) High-level API JVM-agnostic working prototype with RDMA
  • 26. © 2016 IBM Corporation TCP-H benchmark 100Gb 30% improvement in database operations Shuffle-intensive benchmarks show 30% - 40% better performance with RDMA HiBench PageRank 3Gb 40% faster, lower CPU usage 32 cores (1 master, 4 nodes x 8 cores/node, 32GB Mem/node)
  • 27. © 2016 IBM Corporation27 Time to complete TeraSort benchmark 32 cores (1 master, 4 nodes x 8 cores/node, 32GB Mem/node), IBM Java 8 0 100 200 300 400 500 600 Spark HiBench TeraSort [30GB] Execution Time (sec) 556s 159s TCP/IP JSoR
  • 28. © 2016 IBM Corporation28 Faster storage IO with POWER CAPI/Flash  POWER8 architecture offers a 40Tb Flash drive attached via Coherent Accelerator Processor Interface (CAPI) – Provides simple coherent block IO APIs – No file system overhead  Power Service Layer (PSL) – Performs Address Translations – Maintains Cache – Simple, but powerful interface to the Accelerator unit  Coherent Accelerator Processor Proxy (CAPP) – Maintains directory of cache lines held by Accelerator – Snoops PowerBus on behalf of Accelerator 
  • 29. © 2016 IBM Corporation29 Faster disk IO with CAPI/Flash-enabled Spark  When under memory pressure, Spark spills RDDs to disk. – Happens in ExternalAppendOnlyMap and ExternalSorter  We have modified Spark to spill to the high-bandwidth, coherently-attached Flash device instead. – Replacement for DiskBlockManager – New FlashBlockManager handles spill to/from flash  Making this pluggable requires some further abstraction in Spark: – Spill code assumes using disks, and depends on DiskBlockManger – We are spilling without using a file system layer  Dramatically improves performance of executors under memory pressure.  Allows to reach similar performance with much less memory (denser cloud deployments). IBM Flash System 840Power8 + CAPI
  • 30. © 2016 IBM Corporation30 e.g. using CAPI Flash for RDD caching allows for 4X memory reduction while maintaining equal performance
  • 31. © 2016 IBM Corporation31 Offloading tasks to graphics co-processors
  • 32. © 2016 IBM Corporation32 JIT optimized GPU acceleration  Comes with caveats – Recognize a limited set of operations within the lambda expressions, • notably no object references maintained on GPU – Default grid dimensions and operating parameters for the GPU workload – Redundant/pessimistic data transfer between host and device • Not using GPU shared memory – Limited heuristics about when to invoke the GPU and when to generate CPU instructions  As the JIT compiles a stream expression we can identify candidates for GPU off-loading – Arrays copied to and from the device implicitly – Java operations mapped to GPU kernel operations – Preserves the standard Java syntax and semantics bytecodes intermediate representation optimizer CPU GPU code generator code generator PTX ISACPU native
  • 33. © 2016 IBM Corporation33 GPU-enabled Standard Java APIs  Some Arrays.sort() methods will offload work to GPUs today – e.g. sorting large arrays of ints
  • 34. © 2016 IBM Corporation34 GPU optimization of Loops and Lambda expressions Speed-up factor when run on a GPU enabled host 0.00 0.01 0.10 1.00 10.00 100.00 1000.00 auto-SIMD parallel forEach on CPU parallel forEach on GPU matrix size The JIT can recognize parallel stream code, and automatically compile down to the GPU.
  • 35. © 2016 IBM Corporation DataFrames and DataSets Codegen Optimizations User's SQL operators Arbitrary user code  Uses data schema to generate code – Builds logical and physical execution plans. – Loop through rows in the data set to: ● Access serialized storage, convert data to VM objects, call function | execute SQL plan – Code is generated and compiled at runtime
  • 36. © 2016 IBM Corporation Storage Layout Optimization  UnsafeRow representation – Data laid out in a heap byte array – Accessed via sun.misc.Unsafe method calls  Conversion to Columunar Format – Allows for compression and better vector processing
  • 37. © 2016 IBM Corporation /* 049 */ while (inputadapter_input.hasNext()) { /* 050 */ InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); /* 051 */ boolean inputadapter_isNull = inputadapter_row.isNullAt(0); /* 052 */ int inputadapter_value = inputadapter_isNull ? -1 : (inputadapter_row.getInt(0)); /* 053 */ /* 054 */ if (!(!(inputadapter_isNull))) continue; /* 055 */ /* 056 */ boolean filter_isNull2 = false; /* 057 */ /* 058 */ boolean filter_value2 = false; /* 059 */ filter_value2 = inputadapter_value > 8; /* 060 */ if (!filter_value2) continue; /* 061 */ boolean inputadapter_isNull1 = inputadapter_row.isNullAt(1); /* 062 */ double inputadapter_value1 = inputadapter_isNull1 ? -1.0 : (inputadapter_row.getDouble(1)); /* 063 */ /* 064 */ if (!(!(inputadapter_isNull1))) continue; /* 065 */ /* 066 */ boolean filter_isNull7 = false; /* 067 */ /* 068 */ boolean filter_value7 = false; /* 069 */ filter_value7 = org.apache.spark.util.Utils.nanSafeCompareDoubles(inputadapter_value1, 24.0D) > 0; /* 070 */ if (!filter_value7) continue; /* 071 */ /* 072 */ filter_numOutputRows.add(1); /* 073 */ /* 074 */ // do aggregate /* 075 */ // common sub-expressions /* 076 */ /* 077 */ // evaluate aggregate function /* 078 */ boolean agg_isNull1 = false; /* 079 */ /* 080 */ long agg_value1 = -1L; /* 081 */ agg_value1 = agg_bufValue + 1L; /* 082 */ // update aggregation buffer /* 083 */ agg_bufIsNull = false; /* 084 */ agg_bufValue = agg_value1; /* 085 */ if (shouldStop()) return; /* 086 */ } Code generation is shared between columnar and row based storage. ● this introduces huge overhead. ● the complex loop is not a counted loop, contains numerous branches (due to nullability checks) and calls etc Generated Code
  • 38. © 2016 IBM Corporation ● ColumnarBatch format instead of CachedBatch for in-memory cache. ● Common compression scheme (e.g. lz4) for all of the data types in a ColumnVector Optimizing Spark's Generated Code /* 073 */ while (inmemorytablescan_batch != null) { /* 074 */ int numRows = inmemorytablescan_batch.numRows(); /* 075 */ while (inmemorytablescan_batchIdx < numRows) { /* 076 */ int inmemorytablescan_rowIdx = inmemorytablescan_batchIdx++; /* 077 */ boolean inmemorytablescan_isNull = inmemorytablescan_colInstance0.isNullAt(inmemorytablescan_rowIdx); /* 078 */ int inmemorytablescan_value = inmemorytablescan_isNull ? -1 : (inmemorytablescan_colInstance0.getInt(inmemorytablescan_rowIdx)); /* 079 */ /* 080 */ // filtering out values and counting rows similar to the old code /* 081 */ ... /* 112 */ } /* 113 */ inmemorytablescan_batch = null; /* 114 */ inmemorytablescan_nextBatch(); /* 115 */ } ● Simple and specialized runtime code for building a concrete instance of the in-memory cache ● Read data directly from ColumnVector in the whole-stage codegen. 3.4x performance improvement
  • 39. © 2016 IBM Corporation39 Summary  We are focused on Core runtime performance to get a multiplier up the Apache Spark stack. – More efficient code, more efficient memory usage/spilling, more efficient serialization & networking, etc.  There are hardware and software technologies we can bring to the party. – We can tune the stack from hardware to high level structures for running Apache Spark.  Apache Spark and Scala developers can help themselves by their style of coding.  All the changes are being made in the Java runtime or being pushed out to the Apache Spark community.  There is lots more stuff I don't have time to talk about, like GC optimizations, object layout, monitoring VM/Apache Spark events, hardware compression, security, etc. etc. – mailto: http://ibm.biz/spark­kit
  • 40. © 2016 IBM Corporation40 Questions?