LLJVM: Bitcode to JVM bytecode
A lightweight library to inject LLVM bitcode into JVMs
▪ LLJVM
  ▪ https://github.com/maropu/lljvm-translator
  ▪ Originally authored by David Roberts, but currently unmaintained
    ▪ https://github.com/davidar/lljvm
  ▪ Apache License, Version 2.0
▪ Main target
  ▪ Compile Python functions and then run them on JVMs
  ▪ A restricted translation, not a full-fledged one
[Figure: language-dependent functions (.cc, .py, .go, .rs, …) → .bc → translate by LLJVM → .class → load & run on the JVM]
* Numba: A High Performance Python Compiler, http://numba.pydata.org/
Motivation: PySpark UDF Overhead
▪ UDFs are expressive and powerful in real use cases
  ▪ Domain-specific transformations, e.g., feature engineering in ML
  ▪ Billions of daily UDF invocations in Microsoft Azure*
▪ Several sources of UDF overhead in Spark
  ▪ Interpreter execution in Python
  ▪ (De-)serialization between Spark executors and Python workers
  ▪ Whole-stage codegen breaker
* Ramachandra et al., Froid: Optimization of Imperative Programs in a Relational Database, Proceedings of the VLDB Endowment, Volume 11, Issue 4, Pages 432-444, 2017.
Vectorized UDFs in PySpark
▪ Efficient (de-)serialization by Apache Arrow
▪ Vector computation in Python
  ▪ pd.DataFrame ⇒ pd.DataFrame
* Improving Python and Spark Performance and Interoperability with Apache Arrow, Spark Summit 2017, https://bit.ly/2DEkhHC
[Figure: previous Spark (UDF: scalar ⇒ scalar) vs. Spark v2.3+ (UDF: pd.DataFrame ⇒ pd.DataFrame, exchanging immutable Arrow batches)]
Vectorized UDFs Performance
* Introducing Pandas UDF for PySpark, Spark Summit 2017, https://bit.ly/2A6hAdZ
▪ 3x to over 100x faster than row-at-a-time UDFs
Whole-Stage Codegen Breaker
▪ The plan difference between Scala and Python UDFs in Spark v2.4
[Figure: codegen'd plans for a Scala UDF and for a Python UDF; the Python UDF splits the codegen'd plan into two parts]
PySpark UDF Chain Hell...
▪ No function composition in Spark v2.4 Catalyst
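To make the missing optimization concrete, here is a minimal plain-Python sketch of what composing a chain of functions into a single callable means. It does not use any Spark API; `compose` and the three small functions are hypothetical names chosen for illustration only.

```python
from functools import reduce


def compose(*funcs):
    """Return a single function such that compose(f, g, h)(x) == f(g(h(x)))."""
    def composed(x):
        # Apply the right-most function first, then feed each result leftwards.
        return reduce(lambda acc, fn: fn(acc), reversed(funcs), x)
    return composed


def add_one(x):
    return x + 1


def double(x):
    return x * 2


def square(x):
    return x * x


# One fused callable instead of three separately invoked functions.
fused = compose(square, double, add_one)
```

Calling `fused(x)` gives the same result as `square(double(add_one(x)))`, but the chain is handed over as one unit rather than as three independent calls.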
Related Research Work: TUPLEWARE
▪ In the UDF compilation process, TUPLEWARE gathers input data statistics and applies low-level, data-dependent optimizations, e.g., a no-branch strategy
* Andrew Crotty, et al., An Architecture for Compiling UDF-centric Workflows, Proceedings of the VLDB Endowment, Volume 8, Issue 12, Pages 1466-1477, 2015.
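As an illustration of the no-branch idea, the sketch below shows one common form of it for a selection: every value is written unconditionally, and the output index advances only when the predicate holds. This is an assumption about what such a strategy looks like in general, not code from TUPLEWARE, and the function names are hypothetical. In Python it shows the logic only; any benefit comes from applying the same pattern in generated low-level code.

```python
def select_branching(values, threshold):
    """Selection with a data-dependent branch per element."""
    out = []
    for v in values:
        if v > threshold:
            out.append(v)
    return out


def select_no_branch(values, threshold):
    """Selection without a data-dependent branch in the loop body."""
    out = [0] * len(values)
    idx = 0
    for v in values:
        out[idx] = v                # unconditional write
        idx += int(v > threshold)   # advance only if the predicate is true
    return out[:idx]
```

Both functions return the same result; which one is faster in compiled code depends on the selectivity of the predicate, which is why input data statistics matter for the choice.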
Related Spark JIRA discussion
▪ SPARK-14083: Analyze JVM bytecode and turn closures into Catalyst expressions
▪ This closure transformation brings the benefits of many Spark optimization rules, e.g., Filter Pushdown
LLJVM Approach
▪ LLJVM supports only simple UDF logic
  ▪ Limited instruction support: LLVM v7.0 has 63 instructions, and LLJVM supports only 49 of them
  ▪ Simple LLVM data type support
    ▪ Complicated aggregate types are unsupported, e.g., { i32, { i64, double }* }
  ▪ …
▪ Focus on Numba-generated LLVM bitcode
  ▪ LLJVM provides internal functions that the bitcode uses, e.g., math functions and matrix manipulation
Numba and LLJVM
▪ Numba: High Performance Python Compiler
▪ Specialized code for CPUs and GPUs
▪ LLJVM provides a new option for JVMs in Numba
[Figure: Numba compiles .py code into specialized code for CPUs and GPUs, with JVMs as a new target via LLJVM]
How-to-Use LLJVM
▪ Load an LLVM bitcode file via a custom class loader and then run it using Java runtime reflection
▪ In unsupported cases (e.g., when unsupported LLVM instructions are found), it throws an LLJVMException
* Example code for the LLJVM translator, https://github.com/maropu/lljvm-example
Translation Example 1: plus
Translation Example 2: log10
JVM assembly code appears on the next slide...
Translation Example 2: log10 (cont.)
Translation Example 3: array sum
[Figure: a NumPy array and the Numba-internal array format (1-d data array plus shape/stride)]
JVM assembly code appears on the next slide...
Translation Example 3: array sum (cont.)
▪ C-like pointer argument passing (the address of the Java array is passed)
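To illustrate what a data pointer plus shape/stride means for an array sum, here is a minimal sketch that reads 64-bit floats out of a raw byte buffer using a start offset, an element count, and a stride in bytes. It does not reproduce Numba's actual internal struct or LLJVM's generated code; `strided_sum` and its parameters are hypothetical names used only for this example.

```python
import struct


def strided_sum(buffer: bytes, offset: int, shape: int, stride: int) -> float:
    """Sum `shape` little-endian float64 values, `stride` bytes apart."""
    total = 0.0
    for i in range(shape):
        # Element i lives at: start address + i * stride (C-like pointer math).
        (value,) = struct.unpack_from("<d", buffer, offset + i * stride)
        total += value
    return total


# Four contiguous float64 values: 1.0, 2.0, 3.0, 4.0
data = struct.pack("<4d", 1.0, 2.0, 3.0, 4.0)
```

With a stride of 8 bytes the function walks every element; with a stride of 16 bytes it visits every other element, which is how one flat data array can back different views.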
Address of Java Arrays
▪ Super hacky calculation in OpenJDK 8 (64-bit)
  ▪ Java object addresses in OpenJDK are compressed internally as Ordinary Object Pointers (OOPs)
  ▪ OOP decompression depends on shift and base values:
    [address of Java object] := base + ([OOP address] << shift)
  ▪ These values cannot be referenced at runtime, so LLJVM infers the two values by comparing OOP and raw addresses
See: https://github.com/maropu/lljvm-translator/blob/master/core/src/main/java/io/github/maropu/lljvm/util/ArrayUtils.java
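The decompression formula and the idea of inferring its two parameters can be sketched as follows. This is an illustration under stated assumptions, not the logic in ArrayUtils.java: it assumes at least two (OOP, raw address) sample pairs with distinct OOPs are available and that the shift is a small non-negative integer. All function names are hypothetical.

```python
def decompress_oop(oop: int, base: int, shift: int) -> int:
    """[address of Java object] := base + ([OOP address] << shift)."""
    return base + (oop << shift)


def infer_base_and_shift(samples, max_shift=8):
    """Find the (base, shift) pair consistent with all (oop, raw_address) samples.

    For each candidate shift, the base is derived from the first sample and
    then checked against every sample.
    """
    if len({oop for oop, _ in samples}) < 2:
        raise ValueError("need at least two samples with distinct OOPs")
    first_oop, first_raw = samples[0]
    for shift in range(max_shift + 1):
        base = first_raw - (first_oop << shift)
        if base < 0:
            continue
        if all(decompress_oop(oop, base, shift) == raw for oop, raw in samples):
            return base, shift
    raise ValueError("no (base, shift) pair matches the samples")
```

Two samples with distinct OOPs are enough to pin down the shift, because the difference of the raw addresses equals the difference of the OOPs shifted left by that amount.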
Use Case: Compile PySpark UDF
▪ PySpark UDF compilation flow
  ▪ 1. Compile Python code into LLVM bitcode on the driver side
    ▪ LLVM bitcode is a byte array, so it is serializable
  ▪ 2. Transfer the bitcode to the executors
  ▪ 3. Load it into JVMs and run it
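Because the bitcode is just a byte array, step 2 amounts to shipping bytes. The sketch below shows that idea with the standard library only: a driver-side function packs a named byte array, and an executor-side function unpacks it and checks the raw LLVM bitcode magic number ('BC' 0xC0DE) before handing it on. The compile and load steps (Numba, LLJVM) are deliberately left out, the payload format is an assumption, and `pack_udf` / `unpack_udf` are hypothetical names.

```python
import pickle

# Raw LLVM bitcode files start with the bytes 'B', 'C', 0xC0, 0xDE.
LLVM_BITCODE_MAGIC = b"BC\xc0\xde"


def pack_udf(name: str, bitcode: bytes) -> bytes:
    """Driver side: wrap a UDF name and its bitcode into one serialized payload."""
    if not bitcode.startswith(LLVM_BITCODE_MAGIC):
        raise ValueError("payload does not look like raw LLVM bitcode")
    return pickle.dumps({"name": name, "bitcode": bitcode})


def unpack_udf(payload: bytes):
    """Executor side: recover the name and bitcode, re-checking the magic number."""
    message = pickle.loads(payload)
    bitcode = message["bitcode"]
    if not bitcode.startswith(LLVM_BITCODE_MAGIC):
        raise ValueError("payload does not look like raw LLVM bitcode")
    return message["name"], bitcode


# Placeholder bytes standing in for real compiler output.
fake_bitcode = LLVM_BITCODE_MAGIC + b"\x00" * 8
```

A round trip through `pack_udf` and `unpack_udf` returns the same bytes, which is all the transfer step needs to guarantee.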
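UDF: plus(x, y) => x + y
▪ Even for a simple UDF, the compiled UDF is ~50x faster than the vectorized UDF
[Figure: performance comparison of UDFs, vectorized UDFs, and compiled UDFs; compiled UDFs are ~50x faster than vectorized UDFs]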
Experimental Release: v0.1.0
▪ Supports OpenJDK 8 (64-bit) only
▪ Bundles x86_64 native binaries for Linux/Mac
  ▪ For Linux, it is built with clang++ v3.6.2
  ▪ For Mac, it is built with Apple clang++ v900.0.39.2
▪ LLVM v5.0.2 is used internally
  ▪ In master, the latest v7.0.0 (2018.11.19) is used
<dependency>
  <groupId>io.github.maropu</groupId>
  <artifactId>lljvm-core</artifactId>
  <version>0.1.0-EXPERIMENTAL</version>
</dependency>
…but it still has many bugs now
Wrap Up
▪ LLJVM: Translate LLVM bitcode into JVM bytecode
  ▪ Currently, it focuses on the Numba integration
▪ UDF optimization is technically challenging and brings performance benefits in real use cases
  ▪ Users love writing code on structured data
▪ If you're interested in this, please give it a GitHub star!
  ▪ https://github.com/maropu/lljvm-translator
