JT@UCSB - On-Demand Data Streaming from Sensor Nodes and A quick overview of Apache Flink

This slide set was presented at UCSB on Sep. 30, 2017.
The talk covers an extended version of the slides from SoCC 2017 plus a quick overview of Apache Flink.

Published in: Software

  1. On-Demand Data Streaming from Sensor Nodes (ACM SoCC 2017) and a quick overview of Apache Flink. Presentation on Sep. 30, 2017, University of California, Santa Barbara
  2. About me: Jonas Traub (jon@s-traub.com, Jonas.traub@tu-berlin.de) • Researcher and PhD candidate at – Technische Universität Berlin (DIMA) – German Research Center for Artificial Intelligence (DFKI) / (IAM) • Working with Volker Markl • Before – Master’s degree in Computer Science (KTH Stockholm and TU Berlin) – Bachelor’s degree in Applied Computer Science (DHBW Stuttgart) – Four years at IBM in Germany and the USA
  3. Traub et al., Optimized On-Demand Data Streaming from Sensor Nodes, SoCC ’17. Optimized On-Demand Data Streaming from Sensor Nodes. Jonas Traub, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, Volker Markl. Extended Talk for . . Santa Clara, California, September 25-27, 2017
  4-8. The Sensor Cloud: Real-time insights. Billions of sensor nodes form a sensor cloud and provide data streams to analysis systems.
  9-11. The Sensor Cloud – Problems: Real-time insights. Billions of sensor nodes form a sensor cloud and provide data streams to analysis systems. Streaming all data from billions of sensors to all applications at maximal frequencies is impossible. Increasing data rates require expensive system scale-out.
  12-15. The Sensor Cloud – Solutions: Tailor Data Streams to the Demand of Applications • Provide an abstraction to define the data demand of applications: User-Defined Sampling Functions (UDSFs). • Optimize communication costs while maintaining the result accuracy: Read-Time Optimization. • Share sensor reads and data transfer among users and queries: Multi-Query / Multi-User Optimization.
  16-24. A Motivating Example. Different Data Demands: • Query 1 adaptively increases sampling rates when accelerating or braking. • Query 2 requires a sample at least every 20 meters. • Query 3 requires a sample at least every 0.3s.
  25-28. A Motivating Example - Evaluation. [Figure annotations: -57%; -72%; 1/3 because 3 values per tuple]
  29-33. Architecture Overview
  34. Sensor Read Scheduling
  35-37. User-Defined Sampling Functions. Input: sensor read time and value. Output: next sensor read request. UDSFs enable adaptive sampling techniques to reduce data transmission, e.g., Adam [Trihinas ’15], FAST [Fan ’14], L-SIP [Gaura ’13]
  38-40. User-Defined Sampling Functions - Examples
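
The slides describe a UDSF only by its contract: the time and value of a sensor read go in, the next sensor read request comes out. A minimal Python sketch of that contract, applied to the three queries of the motivating example, could look like this. The `ReadRequest` fields, all function names, the tolerance values, and the assumption that the sensor value is a speed in m/s are illustrative choices, not the API from the paper.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ReadRequest:
    """Hypothetical read request: a desired time inside a tolerated window (seconds)."""
    earliest: float
    desired: float
    latest: float


# A UDSF maps the latest sensor read (time, value) to the next read request.
UDSF = Callable[[float, float], ReadRequest]


def periodic_udsf(period_s: float, tolerance_s: float) -> UDSF:
    """Like Query 3: ask for a sample at least every `period_s` seconds."""
    def udsf(read_time: float, value: float) -> ReadRequest:
        latest = read_time + period_s
        return ReadRequest(max(read_time, latest - tolerance_s), latest, latest)
    return udsf


def distance_udsf(distance_m: float, tolerance_s: float) -> UDSF:
    """Like Query 2: ask for a sample at least every `distance_m` meters.

    Assumes the sensor value is the current speed in m/s.
    """
    def udsf(read_time: float, speed_mps: float) -> ReadRequest:
        # Time needed to cover the distance at the current speed;
        # the lower bound on speed avoids a division by zero at standstill.
        dt = distance_m / max(speed_mps, 1e-9)
        latest = read_time + dt
        return ReadRequest(max(read_time, latest - tolerance_s), latest, latest)
    return udsf


def adaptive_udsf(base_period_s: float, fast_period_s: float,
                  accel_threshold: float) -> UDSF:
    """Like Query 1: sample faster while accelerating or braking."""
    last = {}  # time and speed of the previous read, kept between calls

    def udsf(read_time: float, speed_mps: float) -> ReadRequest:
        period = base_period_s
        if last and read_time > last["t"]:
            accel = abs(speed_mps - last["v"]) / (read_time - last["t"])
            if accel > accel_threshold:
                period = fast_period_s
        last["t"], last["v"] = read_time, speed_mps
        latest = read_time + period
        # Any time up to `latest` is acceptable for this query.
        return ReadRequest(read_time, latest, latest)
    return udsf
```

For example, `distance_udsf(20.0, 0.04)(0.0, 80.0)` asks for the next read no later than 0.25 s after the current one, because 20 m at 80 m/s takes 0.25 s.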
  41-42. Sensor Read Fusion. 1) Minimize Sensor Reads and Data Transfer: Latest possible read time
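
Reading at the latest possible time matches the classic greedy rule for covering a set of time windows with as few points as possible. The Python sketch below shows that rule on a fixed batch of requests. The `(earliest, latest)` tuple layout and the function name are assumptions for illustration; the scheduler in the paper runs online and may differ in detail.

```python
def fuse_reads(requests):
    """Return read times such that every (earliest, latest) window contains one.

    Greedy rule: wait until the earliest deadline among the unserved
    requests, read at that moment, and let that single read serve every
    request whose window is already open.
    """
    read_times = []
    last_read = None
    # Process requests in order of their deadlines.
    for earliest, latest in sorted(requests, key=lambda r: r[1]):
        if last_read is not None and earliest <= last_read <= latest:
            continue  # already served by the previous read
        last_read = latest  # postpone the read as long as allowed
        read_times.append(last_read)
    return read_times
```

With five overlapping windows such as `[(0.0, 1.0), (0.5, 1.2), (0.9, 2.0), (1.5, 2.5), (3.0, 3.1)]`, three reads are enough, because the first read at time 1.0 serves the first three requests.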
  43. Read Time Optimization. 2) Optimize Sensor Read Times: ● Minimize the penalty while executing only the minimum number of sensor reads ● Challenge: assign read requests to sensor reads
  44-47. Assigning Read Requests to Sensor Reads: Postpone, or Assign to next Read
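
Once a set of requests is assigned to one sensor read, the read time can still move within the window that all of them share. The Python sketch below assumes the penalty is the sum of absolute deviations from each request's desired time. This penalty is an assumption for illustration, since the slides do not define it. Under that assumption the best read time is the median of the desired times, clamped to the shared window.

```python
from statistics import median


def optimize_read_time(assigned):
    """Pick one read time for requests given as (earliest, desired, latest).

    Returns None if the windows do not overlap, which means the requests
    cannot share one sensor read and some must be postponed to a later one.
    """
    lo = max(earliest for earliest, _, _ in assigned)
    hi = min(latest for _, _, latest in assigned)
    if lo > hi:
        return None
    # The median minimizes the sum of absolute deviations; because that sum
    # is convex, clamping the median to [lo, hi] gives the constrained optimum.
    best = median(desired for _, desired, _ in assigned)
    return min(max(best, lo), hi)


def penalty(read_time, assigned):
    """Assumed penalty: total absolute deviation from the desired read times."""
    return sum(abs(read_time - desired) for _, desired, _ in assigned)
```

The number of sensor reads does not change in this step; only the position of each read inside its feasible window does.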
  48-50. Local Filtering ● Enable adaptive filtering in combination with adaptive sampling ● Enable model-driven data acquisition
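
The slides name adaptive filtering and model-driven data acquisition without further detail. As one simple instance of filtering on the sensor node, the Python sketch below uses a dead-band filter: a value is transmitted only if it differs from the last transmitted value by more than a threshold. The class name and the threshold are hypothetical, and this is not claimed to be the filter used in the paper.

```python
class DeadBandFilter:
    """Suppress values that stay within `threshold` of the last sent value."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.last_sent = None  # nothing transmitted yet

    def should_send(self, value):
        # Always send the first value; afterwards send only significant changes.
        if self.last_sent is None or abs(value - self.last_sent) > self.threshold:
            self.last_sent = value
            return True
        return False
```

The receiver can treat the last received value as its model of the sensor, which is then off by at most the threshold between transmissions.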
  51. Evaluation ● Replay sensor data: - from a football match [DEBS Grand Challenge ’13] - Formula 1 telemetry data ● Random UDSFs: - Reads follow a Poisson process (which also simulates load peaks) - On average 1 read per query per second - Exponentially distributed read time tolerance - high probability for small tolerances - small probability for large tolerances - On average 0.04s read time tolerance
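
The random workload described above can be approximated with a few lines of Python. The sketch draws exponentially distributed gaps between reads (which yields a Poisson process) and an exponentially distributed tolerance per read. The function name, the output layout, and the fixed-duration cut-off are assumptions; only the rate of 1 read per second and the mean tolerance of 0.04 s come from the slide.

```python
import random


def random_requests(duration_s, rate_per_s=1.0, mean_tolerance_s=0.04, seed=None):
    """Generate (desired_time, tolerance) pairs for one query."""
    rng = random.Random(seed)
    requests = []
    t = 0.0
    while True:
        # Exponential inter-arrival times produce a Poisson arrival process.
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            return requests
        # expovariate takes the rate, i.e. 1 / mean.
        tolerance = rng.expovariate(1.0 / mean_tolerance_s)
        requests.append((t, tolerance))
```

Because the tolerance is exponentially distributed, most requests get a tolerance well below the mean and a few get a much larger one, matching the "high probability for small tolerances" remark.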
  52. Increasing the number of concurrent queries • On-Demand scheduling reduces sensor reads and data transfer by up to 87%. • The # of reads and transfers increases sub-linearly with the # of queries.
  53. Increasing the number of concurrent queries • Our read-time optimizer reduces the deviation from desired read times by up to 69% (preserving the min. # of reads and transfers).
  54-55. Increasing read time tolerances
  56. Query Prioritization (1/2)
  57. Query Prioritization (2/2)
  58. Slack Robustness of Adaptive Sampling
  59. Optimized On-Demand Data Streaming from Sensor Nodes. Wrap-Up: Tailor Data Streams to the Demand of Applications • Define data demand: User-Defined Sampling Functions • Schedule sensor reads and data transfer on-demand • Optimize read times globally, for all users and queries. Jonas Traub, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, Volker Markl
  60. A quick overview of Apache Flink - research summary - Jonas Traub visiting September 30, 2017
  61. Outline: Apache Flink Primer • Stratosphere – The origin of Apache Flink • What is Apache Flink? – Basic System Internals • The Flink Community. An Apache Flink Research Summary
  62-63. © Volker Markl. Stratosphere: General Purpose Programming + Database Execution. Draws on Database Technology: • Relational Algebra • Declarativity • Query Optimization • Robust Out-of-core. Draws on MapReduce Technology: • Scalability • User-defined Functions • Complex Data Types • Schema on Read. Adds: • Iterations • Advanced Dataflows • General APIs • Native Streaming
  64. What is Apache Flink? Apache Flink is an open source platform for scalable batch and stream data processing (http://flink.apache.org). • A distributed system that you can use to process data. Like a DBMS, but not exactly a DBMS. • What kind of data? Data that comes in the form of streams. • What kind of processing? Quite flexible: you can use Java/Scala APIs similar to programming with Java collections, the new SQL API, etc. • Distributed: runs on many (1000s of) machines and hides this complexity from the user.
  65. Basic application architecture: sources of data (e.g., sensors, logs, …) → event log (a replayable log of events with pub/sub functionality) → processing of events (apps, each with its app state) → storage and query systems (query service). By courtesy of Kostas Tzoumas
  66-69. Technology inside Flink (© 2013 Berlin Big Data Center • All Rights Reserved). Program:
        // Needs: import org.apache.flink.api.scala._
        // `edges` is assumed to be a DataSet[Path] defined elsewhere.
        case class Path(from: Long, to: Long)

        // Transitive closure: extend every known path by one edge, up to 10 times.
        val tc = edges.iterate(10) { paths: DataSet[Path] =>
          val next = paths
            .join(edges)
            .where("to")
            .equalTo("from") { (path, edge) =>
              Path(path.from, edge.to)
            }
            .union(paths)
            .distinct()
          next
        }
      Pre-flight (Client): cost-based optimizer, type extraction stack. The program becomes a Dataflow Graph [figure: example plan with DataSource (orders.tbl), Filter, Map, DataSource (lineitem.tbl), a hybrid hash Join (build HT / probe, hash-partitioned on [0]), and a sorted GroupReduce with forward shipping]. Master: task scheduling, recovery metadata; deploys operators and tracks intermediate results. Workers.
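
To make the dataflow of the Scala program on these slides easier to follow, here is a plain-Python analogue that works on in-memory sets instead of Flink `DataSet`s. It repeats the same three steps (join paths with edges where `to` equals `from`, union with the existing paths, remove duplicates) for a bounded number of iterations. The function name and the tuple representation are made up for illustration; the sketch does not use or represent any Flink API.

```python
def transitive_closure(edges, max_iterations=10):
    """Compute reachable (from, to) pairs by repeatedly extending paths."""
    edges = set(edges)
    paths = set(edges)  # the iteration starts from the edges themselves
    for _ in range(max_iterations):
        # Join: a path (a, b) and an edge (b, c) give the new path (a, c).
        joined = {
            (p_from, e_to)
            for (p_from, p_to) in paths
            for (e_from, e_to) in edges
            if p_to == e_from
        }
        # Union with the old paths; the set type removes duplicates.
        next_paths = paths | joined
        if next_paths == paths:
            break  # nothing new: further iterations would not change the result
        paths = next_paths
    return paths
```

Each iteration lengthens the longest known path by one edge, so ten iterations cover paths of up to eleven edges; the early exit is a convenience of this sketch.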
  70. Flink community: Project Statistics (Updated: Sep 29, 2017). [Charts: Number of Contributors, Stars on GitHub, and Forks on GitHub, each shown for Feb 15, Dec 15, and Dec 16.] By courtesy of Kostas Tzoumas
  71. Companies Using Flink
  72-74. Apache Flink - Related Publications. System Paper 2015: System Paper 2014:
  75. State Management: VLDB 2017
  76-77. Iterative Processing: VLDB 2012, SIGMOD 2013
  78-81. Fault Tolerance
  82. Streaming Window Aggregation
  83. Visualization of Streaming Data
  84. On-Demand Data Streaming from Sensor Nodes (ACM SoCC 2017) and a quick overview of Apache Flink. Presentation on Sep. 30, 2017, University of California, Santa Barbara
