Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes on Flink

  1. Efficient Distributed R Dataframes on Apache Flink
     Andreas Kunft, Jens Meiners, Tilmann Rabl, Volker Markl
  2. • R has gained huge traction
     • Open source
     • Rich support for analytics & statistics
     • But standalone R is not well suited for out-of-core data loads
     • Multiple extensions for distributed execution:
       • Hadoop + R
       • Spark + R
       • SystemML
  3. Our Goals
     1. Provide an API with a natural feel
        • df$km <- df$miles * 1.6
        • df <- select(df, f = df$flights, df$distance)
        • df <- apply(df, key = id, aggFunc)
     2. Achieve performance comparable to a native dataflow system
  4. General Approach
     • R dataframe(T1,T2,…,TN) as DataSet<TupleN<T1,T2,…,TN>>
     • Create execution plan
     • Map R dataframe functions to the native API whenever possible, e.g., select to projections
     • Call user-defined R functions within the worker nodes
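
The mapping of select to a projection named in the General Approach slide can be sketched in plain Java, without a Flink dependency. This is only an illustration under assumptions: rows are modeled as Object[] rather than Flink TupleN, and the names ProjectionSketch, resolve, and project are hypothetical, not part of the presented system.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ProjectionSketch {
    // Resolve selected column names to positions in the row (hypothetical helper).
    public static int[] resolve(List<String> schema, List<String> selected) {
        int[] idx = new int[selected.size()];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = schema.indexOf(selected.get(i));
            if (idx[i] < 0) {
                throw new IllegalArgumentException("unknown column: " + selected.get(i));
            }
        }
        return idx;
    }

    // Keep only the fields at the given positions, in the given order.
    public static List<Object[]> project(List<Object[]> rows, int[] idx) {
        List<Object[]> out = new ArrayList<>();
        for (Object[] row : rows) {
            Object[] projected = new Object[idx.length];
            for (int i = 0; i < idx.length; i++) {
                projected[i] = row[idx[i]];
            }
            out.add(projected);
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed example schema and data, loosely following the select example in the slides.
        List<String> schema = Arrays.asList("id", "flights", "distance");
        List<Object[]> rows = new ArrayList<>();
        rows.add(new Object[] {"a", 3, 120.0});
        rows.add(new Object[] {"b", 7, 80.5});

        int[] idx = resolve(schema, Arrays.asList("flights", "distance"));
        for (Object[] r : project(rows, idx)) {
            System.out.println(Arrays.toString(r));
        }
    }
}
```

Resolving names to positions once, on the client, means the per-row work is plain index access, which is the kind of operation a dataflow engine can run natively.
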
  6. Handling user-defined R functions
  7. Inter-Process Communication
     [Diagram labels: Client; Job Manager; two Task Managers with two Tasks each; four R Processes]
  8. Inter-Process Communication
     • Communication + serialization
     • Java and R compete for memory
     [Diagram labels: Task Manager with a filter task; R Process]
     filter <- function(df) { df$language == 'R' }
  9. Source-to-Source Translation
     • Translate a restricted set of operations to the native dataflow API
     • Operations are executed natively
     df <- filter(df, df$language == 'R')   →   val df = df.filter($"language" === "R")
     df$km <- df$miles * 1.6                →   val df = df.withColumn("km", $"miles" * 1.6)
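
To make the idea of translating a restricted set of operations concrete, here is a minimal sketch in plain Java. It assumes a single recognized shape, df$<column> == '<string literal>', and turns it into a native predicate; anything else is reported as not translatable. The class name FilterTranslationSketch, the regular expression, and the Map-based row model are all assumptions for illustration, not the translator used by any existing system.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FilterTranslationSketch {
    // The only supported shape in this sketch: df$<column> == '<string literal>'
    private static final Pattern EQ_STRING =
            Pattern.compile("\\s*df\\$(\\w+)\\s*==\\s*'([^']*)'\\s*");

    // Returns a native predicate if the R expression is in the restricted set,
    // or Optional.empty() if it is not.
    public static Optional<Predicate<Map<String, Object>>> translate(String rExpr) {
        Matcher m = EQ_STRING.matcher(rExpr);
        if (!m.matches()) {
            return Optional.empty();
        }
        final String column = m.group(1);
        final String literal = m.group(2);
        Predicate<Map<String, Object>> predicate = row -> literal.equals(row.get(column));
        return Optional.of(predicate);
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("language", "R");

        Optional<Predicate<Map<String, Object>>> supported = translate("df$language == 'R'");
        System.out.println(supported.isPresent() && supported.get().test(row));

        // An expression outside the restricted set cannot be translated.
        System.out.println(translate("nchar(df$body) > 3").isPresent());
    }
}
```

The empty result marks the limit of this approach: an expression outside the recognized set has no native counterpart and cannot be executed natively by the translator alone.
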
  10. Flink + fastR
  11. Truffle/Graal [Diagram labels: HotSpot, JIT, Bytecode]
  12. Truffle/Graal [Diagram labels: HotSpot, JIT, Bytecode, Graal]
  13. Truffle/Graal [Diagram labels: HotSpot, Graal, Truffle, GraalVM]
  14. Truffle/Graal [Diagram labels: GraalVM; HotSpot Runtime (Interpreter, GC, …); Graal; Truffle; AST Interpreter: TruffleR (fastR), TruffleJS; Source Code: *.java (javac), *.js, *.R]
      Figure based on: Grimmer, Matthias, et al. "High-performance cross-language interoperability in a multi-language runtime." ACM SIGPLAN Notices. Vol. 51. No. 2. ACM, 2015.
  15. Flink + fastR
      fastR: R implementation on top of Truffle/Graal
      • Allows us to execute R code in the same VM as Flink
      • Infer result types of R functions
      • Access Java (Flink) data types in R
  16. Client:
      1. Dataframe rows to Flink tuples
      2. Determine return types of UDFs
      3. Create execution plan
      [Diagram labels: Client; Job Manager; two Task Managers with two map tasks each]
      flink.init(SERVER, PORT)
      flink.parallelism(DOP)
      df <- flink.readdf(SOURCE,
        list("id", "body", …),
        list(character, character, …)
      )
      df$wordcount <- length(strsplit(df$body, " ")[[1]])
      flink.writeAsText(df, SINK)
      flink.execute()
  17. df$wordcount <- length(strsplit(df$body, " ")[[1]])
      is rewritten to:
      function(tuple) {
        .fun <- function(tuple) {
          length(strsplit(tuple[[2]], " ")[[1]])
        }
        flink.tuple(tuple[[1]], tuple[[2]], .fun(tuple))
      }
      • Dataframe proxy keeps track of columns, provides efficient access
      • Can be extended with new columns
      • Rewrite to directly use Flink tuples
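
The rewrite from named column access to positional tuple access can be sketched as a small schema tracker in plain Java. This is an assumption-laden illustration: the class ColumnRewriteSketch and its methods are hypothetical, the rewrite is a textual regex substitution rather than a transformation of the R syntax tree, and column names are restricted to word characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ColumnRewriteSketch {
    private static final Pattern COLUMN_ACCESS = Pattern.compile("\\bdf\\$(\\w+)");

    // Column names in tuple order; stands in for the dataframe proxy's bookkeeping.
    private final List<String> columns = new ArrayList<>();

    public ColumnRewriteSketch(List<String> initialColumns) {
        columns.addAll(initialColumns);
    }

    // Register a new column and return its 1-based tuple position.
    public int addColumn(String name) {
        columns.add(name);
        return columns.size();
    }

    // Replace every df$<name> with tuple[[<position>]] (1-based, as in R).
    public String rewrite(String rSource) {
        Matcher m = COLUMN_ACCESS.matcher(rSource);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int position = columns.indexOf(m.group(1)) + 1;
            if (position == 0) {
                throw new IllegalArgumentException("unknown column: " + m.group(1));
            }
            m.appendReplacement(out, Matcher.quoteReplacement("tuple[[" + position + "]]"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        ColumnRewriteSketch proxy = new ColumnRewriteSketch(Arrays.asList("id", "body"));
        System.out.println(proxy.rewrite("length(strsplit(df$body, \" \")[[1]])"));
        // prints: length(strsplit(tuple[[2]], " ")[[1]])
        System.out.println(proxy.addColumn("wordcount")); // prints: 3
    }
}
```
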
  18. Task Manager: Evaluate R UDF & Execute
      [Diagram labels: Client; Job Manager; two Task Managers with two map tasks each]
      map { tuple => executeRFunction(func, tuple) }
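
The per-tuple step on the task manager, which applies the UDF and appends its result as a new field, might look roughly like the following in plain Java. Everything here is a stand-in: the UDF is a Java lambda instead of R code evaluated by fastR, tuples are Object[] rather than Flink tuples, and the names UdfMapSketch, WORD_COUNT, and mapWithUdf are hypothetical. The Java split call also only approximates R's strsplit (the two differ, for example, on empty input).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class UdfMapSketch {
    // Java stand-in for the R word-count UDF; reads the second field (the body).
    public static final Function<Object[], Object> WORD_COUNT =
            tuple -> Integer.valueOf(((String) tuple[1]).split(" ").length);

    // Apply the UDF to each tuple and append the result as a new last field.
    public static List<Object[]> mapWithUdf(List<Object[]> tuples,
                                            Function<Object[], Object> udf) {
        List<Object[]> out = new ArrayList<>();
        for (Object[] tuple : tuples) {
            Object[] extended = Arrays.copyOf(tuple, tuple.length + 1);
            extended[tuple.length] = udf.apply(tuple);
            out.add(extended);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object[]> tuples = new ArrayList<>();
        tuples.add(new Object[] {"1", "R on Flink"});
        for (Object[] t : mapWithUdf(tuples, WORD_COUNT)) {
            System.out.println(Arrays.toString(t));
        }
    }
}
```

Because the function runs in the same process as the map operator, no tuple has to be serialized and shipped to a separate R process.
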
  19. Local - 1.4GB
  20. Local - 14GB
  21. Local - 1GB
  22. Cluster - 10GB
  23. fastR + Flink
      • R dataframe abstraction for distributed computation
      • Performance gains even on single node (local mode)
      • Approaches native performance even for R UDFs
      • Interesting opportunities for:
        • Streaming
        • Other dynamic languages
        • Dynamic re-optimization
      Thank you for your attention!