Efficient Distributed R Dataframes on Apache Flink
Andreas Kunft, Jens Meiners, Tilmann Rabl, Volker Markl
• R has gained huge traction
• Open source
• Rich support for analytics & statistics
• But standalone R is not well suited for out-of-core data loads
• Multiple extensions for distributed execution
• Hadoop + R
• Spark + R
• SystemML
Our Goals
1. Provide an API that feels natural to R users
   • df$km <- df$miles * 1.6
   • df <- select(df, f = df$flights, df$distance)
   • df <- apply(df, key = id, aggFunc)
2. Achieve performance comparable to the native dataflow system
General Approach
• Represent an R dataframe(T1,T2,…,TN) as a DataSet<TupleN<T1,T2,…,TN>>
• Create an execution plan
• Map R dataframe functions to the native API whenever possible, e.g., select to projections (a sketch of this dispatch follows below)
• Call user-defined R functions within the worker nodes
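The operator-mapping step can be pictured as a small dispatcher. A minimal sketch in plain R, assuming hypothetical helper names (native_ops, plan_node) that are not part of the talk's implementation:

native_ops <- c(select = "projection", filter = "filter")

plan_node <- function(op, args) {
  # Use the native dataflow operator when a direct counterpart exists,
  # otherwise fall back to a node that runs the R UDF on the workers.
  if (op %in% names(native_ops)) {
    list(kind = "native", operator = native_ops[[op]], args = args)
  } else {
    list(kind = "r_udf", fun = args$fun)
  }
}

plan_node("select", list(cols = c("flights", "distance")))  # -> native projection
plan_node("apply",  list(fun = function(x) sum(x)))         # -> R UDF on the workers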
Handling user-defined R functions
Inter Process Communication
[Figure: the client submits the job to the Job Manager; each Task Manager runs several tasks, and each task communicates with its own external R process]
Inter Process Communication
Drawbacks of external R processes:
• Communication and serialization overhead (see the round-trip sketch below)
• Java and R compete for memory

[Figure: a filter task inside the Task Manager ships its UDF and tuples to the external R process, which runs:]

filter <- function(df) {
  df$language == 'R'
}
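The cost of this design is the per-tuple round trip between the JVM and the external R process. Here is a minimal sketch of that round trip using only base R serialization; the actual channel between Flink and R is not shown in the talk and is not modeled here:

filter_udf <- function(df) df$language == "R"

# Bytes that would cross the process boundary: the UDF plus one row.
payload <- serialize(list(fun = filter_udf,
                          row = list(id = 1L, language = "R")),
                     connection = NULL)

msg <- unserialize(payload)        # received on the R-process side
msg$fun(as.data.frame(msg$row))    # TRUE; the result is serialized back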
Source-to-Source Translation
• Translate a restricted set of operations to the native dataflow API
• These operations are then executed natively

df <- filter(df, df$language == 'R')    ->   val df = df.filter($"language" === "R")
df$km <- df$miles * 1.6                 ->   val df = df.withColumn("km", $"miles" * 1.6)
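To make the translation concrete, here is a toy source-to-source rule in plain R. The function name and the emitted string are illustrative assumptions, not the project's actual translator:

translate_filter <- function(expr) {
  # Rewrite a column comparison like df$language == "R" into a
  # Table-API-style filter expression string.
  stopifnot(identical(expr[[1]], as.name("==")))
  col <- as.character(expr[[2]][[3]])   # df$language -> "language"
  lit <- as.character(expr[[3]])        # "R"
  sprintf('df.filter($"%s" === "%s")', col, lit)
}

translate_filter(quote(df$language == "R"))
# returns the string: df.filter($"language" === "R")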
Flink + fastR
Truffle/Graal
[Figure, built up over four slides: the HotSpot runtime JIT-compiles bytecode; the Graal compiler replaces the HotSpot JIT; Truffle sits on top of Graal and hosts AST interpreters for guest languages, forming the GraalVM. Source code (*.java, *.R, *.js) is compiled to bytecode by javac or parsed into ASTs interpreted by TruffleR (fastR) and TruffleJS, which run alongside Graal, the interpreter, and the GC inside the HotSpot runtime.]
Figure based on: Grimmer, Matthias, et al. "High-performance cross-language interoperability in a multi-language runtime." ACM SIGPLAN Notices, Vol. 51, No. 2. ACM, 2015.
Flink + fastR
fastR: an R implementation on top of Truffle/Graal
• Allows us to execute R code in the same VM as Flink
• Infer result types of R functions (one possible approach is sketched below)
• Access Java (Flink) data types from R
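One simple way to infer the result type is to evaluate the UDF on a sample row and map the R class to an engine type. A minimal sketch under that assumption (fastR/Truffle can expose richer type information; the type names below are made up):

infer_type <- function(udf, sample_row) {
  switch(class(udf(sample_row))[1],
         integer   = "INT",
         numeric   = "DOUBLE",
         character = "STRING",
         logical   = "BOOLEAN",
         "UNKNOWN")
}

wordcount <- function(row) length(strsplit(row$body, " ")[[1]])
infer_type(wordcount, list(id = "1", body = "hello flink"))  # "INT"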
Client:
1. Dataframe rows to Flink tuples
2. Determine return types of UDFs
3. Create execution plan
[Figure: the client creates the execution plan and submits it to the Job Manager; the Task Managers run the generated map operators]
flink.init(SERVER, PORT)
flink.parallelism(DOP)
df <- flink.readdf(SOURCE,
                   list("id", "body", …),
                   list(character, character, …))
df$wordcount <- length(strsplit(df$body, " ")[[1]])
flink.writeAsText(df, SINK)
flink.execute()
The client rewrites a UDF over the dataframe proxy into a function over Flink tuples:

df$wordcount <- length(strsplit(df$body, " ")[[1]])

becomes

function(tuple) {
  .fun <- function(tuple) { length(strsplit(tuple[[2]], " ")[[1]]) }
  flink.tuple(tuple[[1]], tuple[[2]], .fun(tuple))
}

1. The dataframe proxy keeps track of columns and provides efficient access (a minimal proxy sketch follows below)
2. It can be extended with new columns (here, wordcount)
3. The UDF is rewritten to operate directly on Flink tuples
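A minimal sketch of such a proxy, assuming a plain name-to-position mapping; the helpers make_proxy, column_index, and add_column are illustrative, not the project's API:

make_proxy <- function(cols) {
  # Record each column name and its tuple position so that df$body can be
  # rewritten into the positional access tuple[[2]].
  proxy <- new.env()
  proxy$positions <- stats::setNames(seq_along(cols), cols)
  proxy
}

column_index <- function(proxy, name) proxy$positions[[name]]

add_column <- function(proxy, name) {
  # New columns such as wordcount simply extend the mapping.
  proxy$positions[[name]] <- length(proxy$positions) + 1L
  invisible(proxy)
}

df <- make_proxy(c("id", "body"))
column_index(df, "body")       # 2 -> tuple[[2]] in the rewritten UDF
add_column(df, "wordcount")
column_index(df, "wordcount")  # 3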
Task Manager:
• Evaluate the R UDF and execute it for every tuple (a conceptual R-side view follows below)

[Figure: the same client / Job Manager / Task Manager layout as above; each map operator invokes the R function per tuple]

map { tuple =>
  executeRFunction(func, tuple)
}
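Conceptually, each map task just applies the rewritten R function to every tuple of its partition. A plain-R sketch of that loop; c(...) stands in for the flink.tuple result and the partition data is made up:

rewritten_udf <- function(tuple) {
  .fun <- function(tuple) length(strsplit(tuple[[2]], " ")[[1]])
  c(tuple[[1]], tuple[[2]], .fun(tuple))  # id, body, wordcount
}

partition <- list(list("1", "hello apache flink"),
                  list("2", "r dataframes on flink"))
lapply(partition, rewritten_udf)  # what executeRFunction does per tuple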
[Benchmark plots: Local - 1.4 GB, Local - 14 GB, Local - 1 GB, Cluster - 10 GB]
fastR + Flink
• R dataframe abstraction for distributed computation
• Performance gains even on single node (local mode)
• Approaches native performance even for R UDFs
• Interesting opportunities for:
• Streaming
• Other dynamic languages
• Dynamic Re-optimization
Thank you for your attention!
