Interoperating a Zoo of Data Processing Platforms Using Rheem with Sebastian Kruse and Yasser Idris

We are witnessing a proliferation of big data, which has led to a zoo of data processing systems, each providing a different set of features. For example, Spark scales analytic tasks to large datasets, whereas Java 8 Streams offers low latency. Furthermore, complex applications, such as ETL and ML, now require a mixture of platforms to perform their tasks efficiently. In such complex data analytics pipelines, multiple data processing systems are used not only for performance reasons but also because of data diversity: datasets often natively reside in different data formats and storage engines. Unfortunately, developers are left alone with the challenging tasks of (1) choosing the right platform for their applications and (2) performing tedious and costly data migration and integration tasks to obtain the results.

In this talk, we will present Rheem, an open source, scalable cross-platform system that frees developers from these burdens. Rheem provides an abstraction layer on top of Spark (and other processing platforms) with the aim of enabling cross-platform optimization and interoperability. It automatically selects the best data processing platforms for a given task and also handles the cross-platform execution. In particular, we will discuss how Rheem allows Spark to work in tandem with other platforms in order to achieve higher performance. We will also show how easily a developer can write complex applications on top of Rheem that seamlessly use multiple data processing platforms according to the tasks at hand. Using Rheem, developers do not have to worry about integration or data migration between Spark and other platforms.
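
To make the idea of automatic platform selection concrete, here is a minimal toy sketch in Python. It is not Rheem's API and not its cost model: the `Platform` class, the `choose_platform` function, and every cost figure are hypothetical, invented purely for illustration. It only shows why a low-latency in-process engine can win on small inputs while a distributed engine wins on large ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    """Hypothetical platform description; all numbers are made up."""
    name: str
    startup_ms: float      # fixed cost to launch a job on this platform
    per_record_ms: float   # marginal cost per input record

    def estimated_ms(self, num_records: int) -> float:
        """Linear runtime estimate: fixed overhead plus per-record work."""
        return self.startup_ms + self.per_record_ms * num_records


# Illustrative assumptions, not measurements of the real systems.
PLATFORMS = [
    Platform("java-streams", startup_ms=5.0, per_record_ms=0.01),
    Platform("spark", startup_ms=2000.0, per_record_ms=0.001),
]


def choose_platform(num_records: int) -> Platform:
    """Pick the platform with the lowest estimated runtime."""
    return min(PLATFORMS, key=lambda p: p.estimated_ms(num_records))


if __name__ == "__main__":
    for n in (1_000, 10_000_000):
        print(n, "->", choose_platform(n).name)
```

Under these assumed numbers the small input goes to the in-process engine and the large one to the distributed engine; with different constants the break-even point moves accordingly.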

Published in: Data & Analytics

  1. 1. INTEROPERATING A ZOO OF DATA PROCESSING PLATFORMS YASSER IDRIS, SEBASTIAN KRUSE
  2. 2. Spark Stack LFS S3 HDFS Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch
  3. 3. Apps On Spark Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch LFS S3 HDFS Batch App DB App Stream App ML App Graph App
  4. 4. Beyond a Single Platform Complex Apps Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch LFS S3 HDFS BI App
  5. 5. Beyond a Single Platform Big vs. Small Data Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch LFS S3 HDFS ML App
  6. 6. Beyond a Single Platform Big vs. Small Data Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch LFS S3 HDFS ML App Java Streams
  7. 7. RHEEM Complex Apps BI App RHEEM Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch Java Streams LFS S3 HDFS
  8. 8. RHEEM Big vs. Small Data ML App RHEEM Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch LFS S3 HDFS Java Streams
  9. 9. RHEEM Big vs. Small Data ML App RHEEM Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch LFS S3 HDFS Java Streams
  10. 10. ML App RHEEM Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch LFS S3 HDFS RHEEM Big vs. Small Data Java Streams CROSS-PLATFORM USE CASES
  11. 11. Platform Independence Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch Postgres LFS S3 HDFS DB App RHEEM GraphChi SELECT SUM(O_TOTALPRICE) FROM ORDERS; Java Streams
  12. 12. Platform Independence Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch Postgres LFS S3 HDFS DB App RHEEM GraphChi Java Streams SELECT SUM(O_TOTALPRICE) FROM ORDERS;
  13. 13. Platform Independence Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch Postgres LFS S3 HDFS DB App RHEEM GraphChi Java Streams SELECT SUM(O_TOTALPRICE) FROM ORDERS;
  14. 14. Platform Independence Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch Postgres LFS S3 HDFS DB App RHEEM GraphChi Java Streams SELECT SUM(O_TOTALPRICE) FROM ORDERS;
  15. 15. Opportune Multi-Platform Execution Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch Postgres LFS S3 HDFS SGD RHEEM Java Streams GraphChi read update transform sample compute
  16. 16. Opportune Multi-Platform Execution Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch Postgres LFS S3 HDFS SGD RHEEM Java Streams GraphChi read update transform sample compute
  17. 17. Opportune Multi-Platform Execution Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch Postgres LFS S3 HDFS SGD RHEEM Java Streams GraphChi read update transform sample compute
  18. 18. Polystore Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch Postgres LFS S3 HDFS TPC-H Q5 RHEEM Java Streams GraphChi LINEITEM CUSTOMER ORDERS SUPPLIER NATION REGION Postgres Local FS HDFS
  19. 19. Polystore Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch Postgres LFS S3 HDFS TPC-H Q5 RHEEM Java Streams GraphChi LINEITEM CUSTOMER ORDERS SUPPLIER NATION REGION Postgres Local FS HDFS
  20. 20. Polystore Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch Postgres LFS S3 HDFS TPC-H Q5 RHEEM Java Streams GraphChi LINEITEM CUSTOMER ORDERS SUPPLIER NATION REGION Postgres Local FS HDFS
  21. 21. Mandatory Multi-Platform Execution Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch Postgres LFS S3 HDFS CroCoPR RHEEM GraphChi Intersect PageRank Java Streams
  22. 22. Mandatory Multi-Platform Execution Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch Postgres LFS S3 HDFS CroCoPR RHEEM GraphChi Intersect PageRank Java Streams
  23. 23. Mandatory Multi-Platform Execution Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch Postgres LFS S3 HDFS CroCoPR RHEEM GraphChi Intersect PageRank Java Streams
  24. 24. RHEEM INTERNALS Let’s have a look under the hood…
  25. 25. RHEEM in a Nutshell Rheem App Apache Spark Java Streams GraphChi Postgres HDFS LFS S3 Java Driver Rheem API Spark Driver Pg Driver Monitor Cross-Platform Optimizer Cost Learner Cross-Platform Executor
  26. 26. Fine-grained Platform Selection Table source Filter Map Group by Collect Table source Filter Map Group by Collect Get real and predicted weights from the year 2017 Calculate MSE Group by Airline Output results Logical plan RHEEM plan Execution plan
  27. 27. Automatic Data Movement Table source Filter Map Group by Collect Table source Filter Map Group by Collect Get real and predicted weights from the year 2017 Calculate MSE Group by Airline Output results Stage 2 Spark Stage 1 Postgres Logical plan RHEEM plan Execution plan
  28. 28. Automatic Data Movement Table source Filter Map Group by Collect Table source Filter Map Group by Collect Get real and predicted weights from the year 2017 Calculate MSE Group by Airline Output results SQL 2 RDD Stage 2 Spark Stage 1 Postgres Logical plan RHEEM plan Execution plan
  29. 29. Rheem Extensibility Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch Postgres LFS S3 HDFS Java Streams GraphChi Hadoop Map Filter Reduce Count Approx. … … … Map Filter Reduce Map … Filter Page Rank … Map … Count Approx. Page Rank
  30. 30. Scala REST PIGLATIN Python Java Multiple APIs Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch Postgres LFS S3 HDFS Cross-Platform Apps RHEEM Java Streams GraphChi
  31. 31. Rheem Studio
  32. 32. DEMO Let’s play…
  33. 33. Scala REST PIGLATIN Python Java The dRHEEM goes on… Spark Spark SQL MLlib Spark Streaming GraphX Spark Batch Postgres LFS S3 HDFS Cross-Platform Apps RHEEM Java Streams GraphChi Add more platforms Integrate with resource managers Enhance data exchange paths Continuously improve optimizer Re-use intermediate datasets across jobs
  34. 34. We want you! Website: http://da.qcri.org/rheem/ GitHub: https://github.com/rheem-ecosystem Apache Incubator Very Soon! Interested? Then, become a dRHEEMer! Gitter: https://gitter.im/rheem-ecosystem/Lobby
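
The "Fine-grained Platform Selection" and "Automatic Data Movement" slides (26 to 28) show a linear plan (Table source, Filter, Map, Group by, Collect) split into a Postgres stage and a Spark stage, with a conversion step between them. The following Python sketch illustrates how such a split could fall out of a cost-based choice. It is an assumption-laden toy, not Rheem's optimizer: the slides name a Cross-Platform Optimizer and a Cost Learner but do not describe their algorithms, so the dynamic program, the `OP_COST` and `MOVE_COST` tables, and all numbers here are hypothetical.

```python
from typing import Dict, List, Tuple

# Operators of the linear plan from the slides.
PLAN = ["table_source", "filter", "map", "group_by", "collect"]

# Hypothetical per-operator cost on each platform (arbitrary units).
# A missing entry means the platform cannot run that operator here.
OP_COST: Dict[str, Dict[str, float]] = {
    "table_source": {"postgres": 1.0},  # the table lives in Postgres
    "filter":       {"postgres": 1.0, "spark": 4.0},
    "map":          {"postgres": 8.0, "spark": 2.0},
    "group_by":     {"postgres": 6.0, "spark": 2.0},
    "collect":      {"postgres": 1.0, "spark": 1.0},
}

# Hypothetical cost of moving data between platforms
# (for example, turning a SQL result into an RDD).
MOVE_COST: Dict[Tuple[str, str], float] = {
    ("postgres", "spark"): 3.0,
    ("spark", "postgres"): 5.0,
}


def move_cost(src: str, dst: str) -> float:
    """Data movement is free when both operators share a platform."""
    return 0.0 if src == dst else MOVE_COST[(src, dst)]


def assign_platforms(plan: List[str]) -> Tuple[float, List[str]]:
    """Dynamic program over a linear plan.

    Minimises the sum of operator costs and data movement costs and
    returns (total cost, platform chosen for each operator).
    """
    # best[p] = cheapest (cost, choices) for the prefix ending on platform p
    best = {p: (c, [p]) for p, c in OP_COST[plan[0]].items()}
    for op in plan[1:]:
        nxt = {}
        for p, c in OP_COST[op].items():
            nxt[p] = min(
                (prev_cost + move_cost(q, p) + c, path + [p])
                for q, (prev_cost, path) in best.items()
            )
        best = nxt
    return min(best.values())


def to_stages(plan: List[str], platforms: List[str]) -> List[Tuple[str, List[str]]]:
    """Group consecutive operators on the same platform into stages."""
    stages: List[Tuple[str, List[str]]] = []
    for op, platform in zip(plan, platforms):
        if stages and stages[-1][0] == platform:
            stages[-1][1].append(op)
        else:
            stages.append((platform, [op]))
    return stages


if __name__ == "__main__":
    cost, platforms = assign_platforms(PLAN)
    print("total cost:", cost)
    for platform, ops in to_stages(PLAN, platforms):
        print(platform, "->", ops)
```

With these assumed costs, the source and filter stay on Postgres and the remaining operators run on Spark, mirroring the two stages on the slide; the single Postgres-to-Spark move is where a conversion such as "SQL 2 RDD" would be inserted. A real optimizer would also have to handle non-linear plans and costs that are estimated rather than given.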
