
Introduction to Structured Streaming


Spark Streaming 2.0 - Structured Streaming


  1. Agenda (© 2015 IBM Corporation)
     - Spark Streaming 1.X
       - Features
       - Areas for Improvement
     - Spark Streaming 2.0 – Structured Streaming
       - Addressing the Improvement Areas
       - API
       - Fault Tolerance
       - Event Time
       - Managing Streaming Queries
     - Structured Streaming Examples: https://github.com/agsachin/spark-meetup/tree/master/sparkStructuredStreaming
     - Summary thoughts
  2. Spark Streaming 1.X: Features of Spark Streaming (Apache Hadoop Day 2015)
     - High-level API (stateful operations, joins, aggregates, windows, etc.)
       - Overlaps with the RDD (batch) API
     - Fault tolerant (exactly-once semantics achievable)
     - Back pressure
     - Deep integration with the Spark ecosystem (MLlib, SQL, GraphX, etc.)
  3. Spark Streaming 1.X – Areas of improvement
     - Fault tolerance
       - For end-to-end exactly-once guarantees, the user needs to do all the heavy lifting in the Sink.
       - Can that be handled in a very simple way for the end user?
  4. Fault-Tolerant Semantics
     - Pipeline stages: Source → Receiver → Transforming → Outputting → Sink
     - Exactly once, if outputs are idempotent or transactional
     - Exactly once, as long as received data is not lost
     - Exactly once needs re-playable sources (e.g. Kafka Direct)
  5. Spark Streaming 1.X – Areas of improvement
     - Fault tolerance
       - For end-to-end exactly-once guarantees, the user needs to do all the heavy lifting in the Sink.
     - API
       - Requests for a more seamless API between batch and stream
       - Reduce the complexities of a streaming app
     - No event-time support
       - Hard to support when processing time/batch time is exposed externally
     - Streaming query management
     - Micro-batch
  6. Spark Streaming 2.0 API
     - Built on top of the Spark SQL engine
     - Implicit benefits
       - Extends the primary batch API to streaming
       - Gains an optimizer and all other enhancements made in Spark SQL
     - Challenge
       - Remove streaming complexities, or keep them to a minimum
  7. Let's Dive In
  8. SQL Batch vs SQL Streaming - Conceptually
  9. Batch vs Streaming - Programmatically
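
The slide itself is an image that the transcript did not capture. As a rough illustration of the idea, here is a minimal PySpark sketch, assuming a directory of text files as input: the same transformation function serves both the batch and the streaming path, and only the read and write calls differ. The function names and the `path` argument are hypothetical, and `pyspark` is imported lazily so the module loads even without a Spark installation.

```python
def word_counts(lines):
    """Shared logic: works on a batch DataFrame and on a streaming one."""
    from pyspark.sql import functions as F  # lazy import (assumes pyspark is installed)

    # Text sources expose a single string column named "value".
    words = lines.select(F.explode(F.split(F.col("value"), " ")).alias("word"))
    return words.groupBy("word").count()


def run_batch(spark, path):
    """Batch: read everything once and return a static result DataFrame."""
    return word_counts(spark.read.text(path))


def start_streaming(spark, path):
    """Streaming: same query, but the Result table is updated as files arrive."""
    counts = word_counts(spark.readStream.text(path))
    return (
        counts.writeStream
        .outputMode("complete")  # aggregate query, so the whole table is rewritten
        .format("console")       # debug sink, used here only for illustration
        .start()                 # returns a StreamingQuery handle
    )
```

Calling `start_streaming(spark, "/some/dir")` returns a handle to the running query, while `run_batch(spark, "/some/dir").show()` prints the one-off result.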
  10. Output Modes - Sink
      - The output mode defines what gets written from the Result table to external storage (the Sink).
      - Output modes
        - Complete: the entire updated Result table is written to external storage.
        - Append: only the new rows added to the Result table since the last incremental query execution are written to external storage.
        - Update: only the rows updated in the Result table since the last incremental query execution are written to external storage.
      - It is up to the storage connector implementation to decide how to write.
      - Note: aggregate queries support only Complete mode, and non-aggregate queries support only Append mode.
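
The three modes can be compared with a small toy model in plain Python. This is not Spark code: the Result table is modelled as a dictionary, the function name `rows_to_write` is hypothetical, and the sketch assumes that newly added rows also count as "updated" in Update mode.

```python
def rows_to_write(previous, current, mode):
    """Return the Result-table rows a sink would receive after one trigger.

    `previous` and `current` map a row key to its value before and after
    the incremental query execution (toy model, not the Spark API).
    """
    if mode == "complete":
        # The entire updated Result table.
        return dict(current)
    if mode == "append":
        # Only rows that did not exist before this execution.
        return {k: v for k, v in current.items() if k not in previous}
    if mode == "update":
        # Rows that are new or whose value changed (assumption: new rows count).
        return {k: v for k, v in current.items()
                if k not in previous or previous[k] != v}
    raise ValueError(f"unknown output mode: {mode}")


previous = {"spark": 1, "kafka": 2}
current = {"spark": 1, "kafka": 3, "hdfs": 1}

complete_rows = rows_to_write(previous, current, "complete")
append_rows = rows_to_write(previous, current, "append")
update_rows = rows_to_write(previous, current, "update")
```

Here Complete yields all three rows, Append yields only the `hdfs` row, and Update yields the changed `kafka` row plus the new `hdfs` row.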
  11. Supported Sinks & Modes in 2.0 (table not captured in the transcript; two entries are marked *DEBUG ONLY)
  12. Windowing in Structured Streaming
  13. Window operations
      - Continuous time-based aggregations are the most common in streaming applications.
        - Sliding windows and tumbling windows
        - E.g. top x hashtags on Twitter in the last half hour, every 5 minutes
      - New function that treats windowing as a regular aggregation
        - Used in a Group By clause
        - Can be used in batch as well
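
A minimal PySpark sketch of the hashtag example, using the `window` function from `pyspark.sql.functions` inside a `groupBy`. The column names `event_time` and `hashtag` and the function name are assumptions made for illustration; ranking the "top x" is left out to keep the sketch short.

```python
def hashtag_counts(events):
    """Count hashtags over 30-minute windows that slide every 5 minutes.

    `events` is assumed to be a DataFrame (batch or streaming) with a
    timestamp column `event_time` and a string column `hashtag`.
    """
    from pyspark.sql import functions as F  # lazy import (assumes pyspark is installed)

    return (
        events.groupBy(
            # The window is just another grouping expression.
            F.window(F.col("event_time"), "30 minutes", "5 minutes"),
            F.col("hashtag"),
        )
        .count()
    )
```

Dropping the third argument of `window` gives a tumbling window, where the slide interval equals the window length.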
  14. Event Time Windows
      - Event time is the time embedded within the data itself.
        - It is not the time Spark received the data.
      - What about processing-time windows, if you want them?
  15. Handling Late Arrival in Event-Time
      - Since the 'Result' table is updated by Spark, late data is put in its correct window group.
      - Use a normal filter in the SQL?
      - Watermarks
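
The effect of late data and of a watermark can be sketched with a toy model in plain Python. This is a simplification rather than Spark's actual algorithm: event times are plain integers, windows are tumbling, and the lateness check is applied per event instead of per trigger. All names are hypothetical.

```python
from collections import defaultdict


def window_start(event_time, width):
    """Start of the tumbling window that contains `event_time`."""
    return event_time - (event_time % width)


def count_by_window(event_times, width, allowed_lateness=None):
    """Count events per event-time window, in arrival order.

    With `allowed_lateness` set, an event older than
    (largest event time seen so far - allowed_lateness) is dropped.
    """
    counts = defaultdict(int)
    max_seen = None
    for event_time in event_times:
        max_seen = event_time if max_seen is None else max(max_seen, event_time)
        if allowed_lateness is not None and event_time < max_seen - allowed_lateness:
            continue  # too late: behind the watermark, so it is ignored
        # Late but accepted events still land in their own window group.
        counts[window_start(event_time, width)] += 1
    return dict(counts)


arrivals = [1, 12, 3, 25, 4]  # events 3 and 4 arrive late
no_watermark = count_by_window(arrivals, width=10)
with_watermark = count_by_window(arrivals, width=10, allowed_lateness=15)
```

Without a watermark both late events update window 0. With a lateness bound of 15, event 4 arrives after event 25 has been seen and is dropped, which is the trade-off a watermark makes to bound state.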
  16. Fault Tolerance
      - Why care?
      - Different guarantees for data loss
        - At least once
        - Exactly once
      - What can fail?
        - Driver
        - Executor
  17. Spark 1.x Best Fault Tolerance - Kafka Direct API
      - Benefits of this approach
        - Simplified parallelism
        - Less storage needed
        - Exactly-once semantics for source and processing
  18. Fault Tolerance in Structured Streaming (diagram: active driver checkpoints to HDFS)
      - Structured Streaming checkpointing
        - The offset ranges decided for a trigger interval are logged to the checkpoint directory *before* any processing is started for that trigger.
        - The Nth entry in the log indicates data that is currently being processed.
        - The (N-1)th entry in the log indicates offsets that have been idempotently written to the Sink.
        - Log entries are numbered with monotonically increasing integers.
      - On recovery
        - Restart processing of the Nth entry in the WAL (write-ahead log).
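
The logging and recovery steps above can be sketched as a small plain-Python simulation. It is a toy model under stated assumptions, not Spark's implementation: the log is an in-memory list instead of files on HDFS, the source is a replayable list, and the sink is a dictionary keyed by batch id so that rewriting a batch is idempotent. All class and function names are hypothetical.

```python
class OffsetLog:
    """Toy stand-in for the offset log kept in the checkpoint directory."""

    def __init__(self):
        self.entries = []  # list index = batch id, value = (start, end) offsets

    def plan_next_batch(self, latest_offset):
        # Decide the offset range and log it *before* any processing starts.
        start = self.entries[-1][1] if self.entries else 0
        self.entries.append((start, latest_offset))
        return len(self.entries) - 1


def process_batch(log, source, sink, batch_id):
    start, end = log.entries[batch_id]
    # Keying the output by batch id makes a repeated write overwrite, not duplicate.
    sink[batch_id] = source[start:end]


def recover(log, source, sink):
    # The last logged entry may be only partly done, so it is simply run again.
    if log.entries:
        process_batch(log, source, sink, len(log.entries) - 1)


source = list(range(10))  # replayable source: offsets index into this list
log, sink = OffsetLog(), {}

process_batch(log, source, sink, log.plan_next_batch(4))  # batch 0 completes
log.plan_next_batch(7)      # batch 1 is logged, then the driver "crashes"
recover(log, source, sink)  # restart: batch 1 is replayed from the log
recover(log, source, sink)  # a second replay changes nothing

delivered = [row for batch_id in sorted(sink) for row in sink[batch_id]]
```

After the two recoveries, `delivered` holds offsets 0 through 6 exactly once, which is the combination the slide relies on: a replayable source, offsets logged ahead of processing, and an idempotent sink.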
  19. Fault Tolerance in Structured Streaming
      - End-to-end exactly-once guarantees with
        - Idempotent Sinks (built in for commonly used sinks, e.g. Files / JDBC)
        - Built-in Sources will *mostly* be the only ones that support replay
      - https://issues.apache.org/jira/browse/SPARK-15842
  20. Managing Streaming Queries
      - Streaming in 1.x was definitely lacking in
        - Starting / stopping individual streaming queries
        - Changing the computation done in a query
        - Handling abnormal termination of a streaming query more gracefully than an app crash
  21. Managing Streaming Queries
  22. Managing Streaming Queries
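
Slides 21 and 22 are images that the transcript did not capture. As a hedged illustration of what query management can look like in PySpark, the sketch below stops one query by name through `spark.streams.active` and reports how a query ended instead of letting the application crash. The helper names are hypothetical, and `pyspark` is imported lazily so the module loads without a Spark installation.

```python
def stop_query_by_name(spark, name):
    """Stop a single running streaming query, leaving the others untouched."""
    for query in spark.streams.active:  # all active queries in this session
        if query.name == name:
            query.stop()
            return True
    return False


def wait_and_report(query, timeout_seconds):
    """Wait for a query and describe its state rather than crash the app."""
    from pyspark.sql.utils import StreamingQueryException  # lazy import

    try:
        # With a timeout, awaitTermination returns whether the query ended.
        finished = query.awaitTermination(timeout_seconds)
    except StreamingQueryException as err:
        # Raised when the query terminated with an error.
        print(f"query {query.name} failed: {err}")
        return "failed"
    return "finished" if finished else "running"
```

Here `query` is the `StreamingQuery` handle returned by `DataStreamWriter.start()`, and the name is the one given through `queryName(...)` when the query was started.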
  23. Summary
      - Overall, it has a good set of features
        - Easier code sharing between batch and streaming (no different type hierarchies)
        - Window not tied to the batch interval
        - No streaming context
        - Optimizer now available for your queries
      - Getting started
        - The combination of three things (output mode, sink type, and query type) takes some time to wrap your head around, and there is not much control over them.
        - You only get runtime exceptions when you get that combination wrong.
      - How does it compare to Apache Beam?
  24. For Each Sink
  25. Thank You
