Efficient Distributed R Dataframes on Apache Flink
Andreas Kunft, Jens Meiners, Tilmann Rabl, Volker Markl
• R has gained huge traction
• Open source
• Rich support for analytics & statistics
• But standalone R is not well suited for out-of-core data loads
• Multiple extensions for distributed execution
• Hadoop + R
• Spark + R
• SystemML
Our Goals
1. Provide an API that feels natural to R users
   • df$km <- df$miles * 1.6
   • df <- select(df, f = df$flights, df$distance)
   • df <- apply(df, key = id, aggFunc)
2. Achieve performance comparable to the native dataflow system
General Approach
• Represent an R dataframe(T1,T2,…,TN) as a DataSet<TupleN<T1,T2,…,TN>>
• Create an execution plan
• Map R dataframe functions to the native API whenever possible, e.g., select to projections (a sketch of this dispatch follows below)
• Call user-defined R functions within the worker nodes
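The operator-mapping step can be pictured as a small dispatcher. A minimal sketch in plain R, assuming hypothetical helper names (native_ops, plan_node) that are not part of the talk's implementation:

native_ops <- c(select = "projection", filter = "filter")

plan_node <- function(op, args) {
  # Use the native dataflow operator when a direct counterpart exists,
  # otherwise fall back to a node that runs the R UDF on the workers.
  if (op %in% names(native_ops)) {
    list(kind = "native", operator = native_ops[[op]], args = args)
  } else {
    list(kind = "r_udf", fun = args$fun)
  }
}

plan_node("select", list(cols = c("flights", "distance")))  # -> native projection
plan_node("apply",  list(fun = function(x) sum(x)))         # -> R UDF on the workers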
Handling user-defined R functions
Inter Process Communication
[Figure: the client submits the job to the Job Manager; each Task Manager runs several tasks, and each task communicates with its own external R process]
Inter Process Communication
Drawbacks of external R processes:
• Communication and serialization overhead (see the round-trip sketch below)
• Java and R compete for memory

[Figure: a filter task inside the Task Manager ships its UDF and tuples to the external R process, which runs:]

filter <- function(df) {
  df$language == 'R'
}
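The cost of this design is the per-tuple round trip between the JVM and the external R process. Here is a minimal sketch of that round trip using only base R serialization; the actual channel between Flink and R is not shown in the talk and is not modeled here:

filter_udf <- function(df) df$language == "R"

# Bytes that would cross the process boundary: the UDF plus one row.
payload <- serialize(list(fun = filter_udf,
                          row = list(id = 1L, language = "R")),
                     connection = NULL)

msg <- unserialize(payload)        # received on the R-process side
msg$fun(as.data.frame(msg$row))    # TRUE; the result is serialized back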
Source-to-Source Translation
• Translate a restricted set of operations to the native dataflow API
• These operations are then executed natively

df <- filter(df, df$language == 'R')    ->   val df = df.filter($"language" === "R")
df$km <- df$miles * 1.6                 ->   val df = df.withColumn("km", $"miles" * 1.6)
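To make the translation concrete, here is a toy source-to-source rule in plain R. The function name and the emitted string are illustrative assumptions, not the project's actual translator:

translate_filter <- function(expr) {
  # Rewrite a column comparison like df$language == "R" into a
  # Table-API-style filter expression string.
  stopifnot(identical(expr[[1]], as.name("==")))
  col <- as.character(expr[[2]][[3]])   # df$language -> "language"
  lit <- as.character(expr[[3]])        # "R"
  sprintf('df.filter($"%s" === "%s")', col, lit)
}

translate_filter(quote(df$language == "R"))
# returns the string: df.filter($"language" === "R")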
Flink + fastR
Truffle/Graal
[Figure, built up over four slides: the HotSpot runtime JIT-compiles bytecode; the Graal compiler replaces the HotSpot JIT; Truffle sits on top of Graal and hosts AST interpreters for guest languages, forming the GraalVM. Source code (*.java, *.R, *.js) is compiled to bytecode by javac or parsed into ASTs interpreted by TruffleR (fastR) and TruffleJS, which run alongside Graal, the interpreter, and the GC inside the HotSpot runtime.]
Figure based on: Grimmer, Matthias, et al. "High-performance cross-language interoperability in a multi-language runtime." ACM SIGPLAN Notices, Vol. 51, No. 2. ACM, 2015.
Flink + fastR
fastR: an R implementation on top of Truffle/Graal
• Allows us to execute R code in the same VM as Flink
• Infer result types of R functions (one possible approach is sketched below)
• Access Java (Flink) data types from R
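One simple way to infer the result type is to evaluate the UDF on a sample row and map the R class to an engine type. A minimal sketch under that assumption (fastR/Truffle can expose richer type information; the type names below are made up):

infer_type <- function(udf, sample_row) {
  switch(class(udf(sample_row))[1],
         integer   = "INT",
         numeric   = "DOUBLE",
         character = "STRING",
         logical   = "BOOLEAN",
         "UNKNOWN")
}

wordcount <- function(row) length(strsplit(row$body, " ")[[1]])
infer_type(wordcount, list(id = "1", body = "hello flink"))  # "INT"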
Client:
1. Dataframe rows to Flink tuples
2. Determine return types of UDFs
3. Create execution plan
[Figure: the client creates the execution plan and submits it to the Job Manager; the Task Managers run the generated map operators]
flink.init(SERVER, PORT)
flink.parallelism(DOP)
df <- flink.readdf(SOURCE,
                   list("id", "body", …),
                   list(character, character, …))
df$wordcount <- length(strsplit(df$body, " ")[[1]])
flink.writeAsText(df, SINK)
flink.execute()
The client rewrites a UDF over the dataframe proxy into a function over Flink tuples:

df$wordcount <- length(strsplit(df$body, " ")[[1]])

becomes

function(tuple) {
  .fun <- function(tuple) { length(strsplit(tuple[[2]], " ")[[1]]) }
  flink.tuple(tuple[[1]], tuple[[2]], .fun(tuple))
}

1. The dataframe proxy keeps track of columns and provides efficient access (a minimal proxy sketch follows below)
2. It can be extended with new columns (here, wordcount)
3. The UDF is rewritten to operate directly on Flink tuples
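A minimal sketch of such a proxy, assuming a plain name-to-position mapping; the helpers make_proxy, column_index, and add_column are illustrative, not the project's API:

make_proxy <- function(cols) {
  # Record each column name and its tuple position so that df$body can be
  # rewritten into the positional access tuple[[2]].
  proxy <- new.env()
  proxy$positions <- stats::setNames(seq_along(cols), cols)
  proxy
}

column_index <- function(proxy, name) proxy$positions[[name]]

add_column <- function(proxy, name) {
  # New columns such as wordcount simply extend the mapping.
  proxy$positions[[name]] <- length(proxy$positions) + 1L
  invisible(proxy)
}

df <- make_proxy(c("id", "body"))
column_index(df, "body")       # 2 -> tuple[[2]] in the rewritten UDF
add_column(df, "wordcount")
column_index(df, "wordcount")  # 3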
Task Manager:
• Evaluate the R UDF and execute it for every tuple (a conceptual R-side view follows below)

[Figure: the same client / Job Manager / Task Manager layout as above; each map operator invokes the R function per tuple]

map { tuple =>
  executeRFunction(func, tuple)
}
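Conceptually, each map task just applies the rewritten R function to every tuple of its partition. A plain-R sketch of that loop; c(...) stands in for the flink.tuple result and the partition data is made up:

rewritten_udf <- function(tuple) {
  .fun <- function(tuple) length(strsplit(tuple[[2]], " ")[[1]])
  c(tuple[[1]], tuple[[2]], .fun(tuple))  # id, body, wordcount
}

partition <- list(list("1", "hello apache flink"),
                  list("2", "r dataframes on flink"))
lapply(partition, rewritten_udf)  # what executeRFunction does per tuple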
[Benchmark plots: Local - 1.4 GB, Local - 14 GB, Local - 1 GB, Cluster - 10 GB]
fastR + Flink
• R dataframe abstraction for distributed computation
• Performance gains even on single node (local mode)
• Approaches native performance even for R UDFs
• Interesting opportunities for:
• Streaming
• Other dynamic languages
• Dynamic Re-optimization
Thank you for your attention!
