Understanding time in structured streaming

Introduction to Time and Window API in Structured Streaming

Published in: Data & Analytics

  1. Understanding Time in Structured Streaming: Time and Window API ● https://github.com/phatak-dev/spark2.0-examples/tree/master/src/main/scala/com/madhukaraphatak/examples/sparktwo/streaming
  2. ● Madhukara Phatak ● Team Lead at Tellius and part-time consultant at datamantra.io ● Consults in Hadoop, Spark and Scala ● www.madhukaraphatak.com
  3. Agenda ● Evolution of Time in Stream Processing ● Introduction to Structured Streaming ● Different Time Abstractions ● Window API ● Emulating Process Time ● Working With Ingestion Time ● Event Time Abstraction ● Watermarks ● Beyond Time Windows
  4. Evolution of Time in Stream Processing
  5. Time is King ● Time plays a major role in stream processing ● Latency dictates the kind of operations users want to do ● Window time dictates the state users want to maintain in the stream processor ● Batch time dictates the rate at which users want to process ● Most business questions asked in stream processing are also time based
  6. View of Time in Stream Processing ● Most early-generation stream processing systems were optimized for latency ● Latency was what differentiated stream processing from batch processing ● Latency informed the window time and batch time ● So many early-generation stream processing systems had only one concept of time ● That is not good enough for new-generation systems
  7. Need for different time abstractions ● In a streaming system, there is ○ Source - the system where events are generated, such as sensors ○ Ingestion System - temporary storage like Kafka ○ Processing System - Structured Streaming ● Each of these systems has its own time ● Typically users want to analyse using a different system's time rather than depending on the processing system's time
  8. Different Time Abstractions
  9. Process Time ● Time is tracked using a clock run by the processing engine ● Default abstraction in most stream processing engines, such as the DStream API ● Last 10 seconds means the records that arrived for processing in the last 10 seconds ● Easy to implement in the framework but hard for application developers to reason about
  10. Event Time ● Event time is the birth time of an event at the source ● Event time is the time embedded in the data coming into the system ● Last 10 seconds means all the records generated in those 10 seconds at the source ● This time is independent of the clock kept by the processing engine ● Hard to implement in the framework but easy for application developers to reason about
  11. Ingestion Time ● Ingestion time is the time when events are ingested into the system ● This time sits between event time and processing time ● In processing time, each machine in the cluster assigns timestamps from its own clock to track events ● In ingestion time, the timestamp is assigned at ingestion so that all the machines in the cluster have exactly the same view ● Source dependent
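
To make the three abstractions concrete, here is a minimal plain-Scala sketch (no Spark involved). The `Reading` record, its field names and the sample values are hypothetical, chosen only to show that "last 10 seconds" selects different records depending on which clock is used.

```scala
// Hypothetical record carrying all three timestamps, in epoch milliseconds.
final case class Reading(sensor: String, eventTime: Long, ingestionTime: Long, processingTime: Long)

object TimeAbstractions {
  // Keep readings whose chosen timestamp falls in the 10 seconds before `now`,
  // i.e. in the half-open interval [now - 10s, now).
  def lastTenSeconds(readings: Seq[Reading], now: Long)(time: Reading => Long): Seq[Reading] =
    readings.filter(r => time(r) >= now - 10000L && time(r) < now)

  // Illustrative data: "a" was processed late, "b" was delayed before ingestion,
  // "c" moved through the pipeline quickly.
  val readings: Seq[Reading] = Seq(
    Reading("a", eventTime = 1000L, ingestionTime = 2000L, processingTime = 16000L),
    Reading("b", eventTime = 5000L, ingestionTime = 17000L, processingTime = 18000L),
    Reading("c", eventTime = 20000L, ingestionTime = 20500L, processingTime = 21000L)
  )

  def main(args: Array[String]): Unit = {
    val now = 25000L
    println("event time:      " + lastTenSeconds(readings, now)(_.eventTime).map(_.sensor))
    println("ingestion time:  " + lastTenSeconds(readings, now)(_.ingestionTime).map(_.sensor))
    println("processing time: " + lastTenSeconds(readings, now)(_.processingTime).map(_.sensor))
  }
}
```

The same three records produce three different "last 10 seconds" result sets, which is why the choice of time abstraction matters for the analysis.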
  12. Introduction to Structured Streaming
  13. Structured Streaming ● Structured Streaming is a new streaming API introduced in Spark 2.0 ● In structured streaming, a stream is modeled as an infinite table, aka an infinite Dataset ● As it uses a structured abstraction, it's called the structured streaming API ● All input sources, stream transformations and output sinks are modeled as Datasets ● Stream transformations are expressed using SQL and the Dataset DSL
  14. Advantages of Stream as infinite table ● Structured data analysis is first class, not layered over an unstructured runtime ● Easy to combine with batch data as both use the same Dataset abstraction ● Can use the full power of the SQL language to express stateful stream operations ● Benefits from SQL optimisations learnt over decades ● Easy to learn and maintain
  15. Window API
  16. Window API from Spark SQL ● Supporting multiple time abstractions in a single API is tricky ● The Flink API makes the application's default time abstraction an environment setting: env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) ● Spark takes the route of an explicit time column, which is inspired by the Spark SQL API ● So the Spark API is optimised for event time by default
  17. Window API ● The window API in structured streaming is part of the group-by operation on Dataset/DataFrame ● val windowedCount = wordsDs.groupBy(window($"processingTime", "15 seconds")) ● It takes three parameters ○ Time Column - the time column to window on ○ Window Time - how long the window is ○ Slide Time - an optional parameter specifying the slide interval
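
The three parameters can be tried out on a small batch DataFrame, since the same `window` function is used for batch and streaming. This is a sketch under assumptions: the column names `eventTime` and `word`, the sample rows and the object name are made up for illustration and are not taken from the deck's repository.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, window}

object WindowApiSketch {
  // Tumbling window: only the time column and the window length are given.
  def tumblingCounts(words: DataFrame): DataFrame =
    words.groupBy(window(col("eventTime"), "15 seconds")).count()

  // Sliding window: the optional third argument is the slide interval.
  def slidingCounts(words: DataFrame): DataFrame =
    words.groupBy(window(col("eventTime"), "15 seconds", "5 seconds")).count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("WindowApiSketch").getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows; in a streaming job these would come from readStream.
    val words = Seq(
      (Timestamp.valueOf("2017-01-01 10:00:03"), "spark"),
      (Timestamp.valueOf("2017-01-01 10:00:07"), "scala"),
      (Timestamp.valueOf("2017-01-01 10:00:16"), "spark")
    ).toDF("eventTime", "word")

    tumblingCounts(words).show(truncate = false)
    slidingCounts(words).show(truncate = false)
    spark.stop()
  }
}
```

With a slide interval shorter than the window length, each record falls into several overlapping windows, so the sliding result has more rows than the tumbling one.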
  18. Window on Processing Time ● Useful for porting existing DStream code to the structured streaming API ● By default, the window API doesn't support processing time ● But we can emulate processing time by adding a time column derived from processing time ● We use the current_timestamp() function of Spark SQL to generate the time column ● Ex: ProcessingTimeWindow
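
The emulation described above could look roughly like the following. It is a sketch, not the deck's ProcessingTimeWindow example: the socket host and port, the console sink and the object name are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, current_timestamp, window}

object ProcessingTimeWindowSketch {
  // Add a time column stamped by the processing engine's clock.
  def withProcessingTime(lines: DataFrame): DataFrame =
    lines.withColumn("processingTime", current_timestamp())

  // Window on the emulated processing-time column.
  def windowedCounts(lines: DataFrame): DataFrame =
    withProcessingTime(lines)
      .groupBy(window(col("processingTime"), "15 seconds"))
      .count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("ProcessingTimeWindowSketch").getOrCreate()

    // Assumed source: a text socket on localhost:50050 (e.g. started with `nc -lk 50050`).
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 50050)
      .load()

    val query = windowedCounts(lines).writeStream
      .format("console")
      .option("truncate", "false")
      .outputMode("complete")
      .start()

    query.awaitTermination()
  }
}
```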
  19. Window on Ingestion Time ● The ingestion time abstraction is useful when each batch of data is captured in real time but takes a considerable amount of time to process ● Ingestion time gives better results than processing time without having to worry about out-of-order events as in event time ● Ingestion time support depends on the source ● In our example, we use the socket stream, which supports it ● Ex: IngestionTimeWindow
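
For the socket source, ingestion time is available through the `includeTimestamp` option, which adds a `timestamp` column next to `value`. The sketch below assumes a socket on localhost:50050 and a console sink; those details and the object name are illustrative, not taken from the deck's IngestionTimeWindow example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, window}

object IngestionTimeWindowSketch {
  // Window on the `timestamp` column that the source attached at ingestion.
  def windowedCounts(lines: DataFrame): DataFrame =
    lines.groupBy(window(col("timestamp"), "15 seconds")).count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("IngestionTimeWindowSketch").getOrCreate()

    // Assumed source: a text socket on localhost:50050.
    // includeTimestamp = true makes the source emit (value, timestamp) rows.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 50050)
      .option("includeTimestamp", true)
      .load()

    val query = windowedCounts(lines).writeStream
      .format("console")
      .option("truncate", "false")
      .outputMode("complete")
      .start()

    query.awaitTermination()
  }
}
```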
  20. Window on Event Time
  21. Importance of Event Time ● Event time is the source of truth for all events ● As more and more stream processing is sensitive to the time of capture, event time plays a big role ● Event time helps developers correlate events from various sources easily ● Correlating events within and across sources helps developers build interesting streaming applications ● For this reason, event time is the default abstraction supported in structured streaming
  22. Challenges of Event Time ● Event time is cool, but it complicates the design of stream processing frameworks ● Time passed at the source may differ from time passed in the processing engine ● How do you handle out-of-order events, and how long do you wait? ● How do you correlate events from sources that run at their own speeds? ● How do you reconcile event time with processing time?
  23. Window on Event Time ● Event time is a column embedded in the data itself ● The default window API is built for this use case ● Windowing on event time makes sure that, even when there is network latency, we process by the actual time at the source rather than by the speed of the processing engine ● In our example, we analyse Apple stock data, which embeds the tick time ● Ex: EventTimeExample
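
A minimal sketch of windowing on an embedded event-time column follows. The line format (`<epoch millis>,<symbol>,<price>`), the `Stock` case class, the 10-second window and the socket settings are assumptions for illustration; the deck's EventTimeExample may differ.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, window}

// Hypothetical tick record; `time` is the event time embedded in the data.
final case class Stock(time: Timestamp, symbol: String, value: Double)

object EventTimeWindowSketch {
  // Assumed line format: "<epoch millis>,<symbol>,<price>".
  def parse(line: String): Stock = {
    val cols = line.split(",")
    Stock(new Timestamp(cols(0).trim.toLong), cols(1).trim, cols(2).trim.toDouble)
  }

  // Sum the tick values per 10-second window of event time.
  def windowedSum(stocks: Dataset[Stock]): DataFrame =
    stocks.groupBy(window(col("time"), "10 seconds")).sum("value")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("EventTimeWindowSketch").getOrCreate()
    import spark.implicits._

    // Assumed source: a text socket on localhost:50050 emitting one tick per line.
    val stocks = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 50050)
      .load()
      .as[String]
      .map(line => parse(line))

    val query = windowedSum(stocks).writeStream
      .format("console")
      .option("truncate", "false")
      .outputMode("complete")
      .start()

    query.awaitTermination()
  }
}
```

Because the window is keyed on `time` from the record itself, a tick that arrives late is still counted in the window in which it was generated.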
  24. Late events ● Whenever we use event time, the challenge is how to handle late events ● By default, an event time window in Spark is kept forever, which means we can handle late events forever ● From the application's point of view, this is great because we never miss any event ● Ex: EventTimeExample
  25. Need for Watermarks ● Keeping windows around forever is great for the logic but problematic from a resource point of view ● As each window creates state in Spark, the state keeps growing as time passes ● This state uses more and more memory and makes recovery more difficult ● So we need a mechanism to restrict how long windows are kept around ● This mechanism is known as watermarks
  26. Watermarks ● A watermark is a threshold that defines how long we wait for late events ● Using a watermark with event time makes sure Spark drops the window state once this threshold has passed at the source ● For a window ending at time T, Spark maintains the state and allows late data to update it until (max event time seen by the engine - late threshold > T) ● Ex: WaterMarkExample
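
The drop condition can be worked through with plain arithmetic, alongside the `withWatermark` call that sets the threshold. This is a sketch: `isWindowDropped` is a hypothetical helper that only restates the rule from the slide (the engine advances its watermark between batches, so actual cleanup may happen somewhat later), and the `time` and `value` column names and the 500 ms threshold are assumptions.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}

object WatermarkSketch {
  // The rule from the slide, in epoch milliseconds: a window ending at `windowEnd`
  // stops accepting late data once (max event time seen - late threshold) > windowEnd.
  def isWindowDropped(maxEventTimeSeen: Long, lateThreshold: Long, windowEnd: Long): Boolean =
    maxEventTimeSeen - lateThreshold > windowEnd

  // Declare how late data may be before windowing on the same event-time column.
  // Assumes `stocks` has a timestamp column `time` and a numeric column `value`.
  def windowedSumWithWatermark(stocks: DataFrame): DataFrame =
    stocks
      .withWatermark("time", "500 milliseconds")
      .groupBy(window(col("time"), "10 seconds"))
      .sum("value")

  def main(args: Array[String]): Unit = {
    val windowEnd = 10000L // window [0s, 10s)
    val threshold = 500L   // wait 500 ms for late events
    // Latest event seen at 10.4 s: 10400 - 500 = 9900, not past the window end yet.
    println(isWindowDropped(10400L, threshold, windowEnd))
    // Latest event seen at 10.6 s: 10600 - 500 = 10100, past the window end.
    println(isWindowDropped(10600L, threshold, windowEnd))
  }
}
```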
  27. Beyond Time Windows
  28. Need for timeless windows ● Most streaming applications use time as the criterion for most of their analysis ● But there are streaming use cases where the state is not bounded by time ● In these scenarios, we need a mechanism to define a window using a non-time part of the data ● In the DStream API this was tricky, but with structured streaming we can define it easily
  29. Sessionization ● A session is often a period of time that captures a user's different interactions with an application ● In an online portal, a session normally starts when the user logs into the application and is torn down when the user logs out, or it expires when there is no activity for some time ● A session is not a purely time-based interaction, as different sessions can last for different lengths of time
  30. Session Window ● A session window is a window that lets us group the records from the stream that belong to a specific session ● The window starts when the session starts and is evaluated when the session ends ● The window also supports tracking multiple sessions at the same time ● Session windows are often used to analyse user behaviour across multiple interactions bounded by a session
  31. Implementing Session Window
  32. Custom State Management ● There is no direct API to define non-time-based windows in Structured Streaming ● As a window is internally represented using state, we need to use custom state management for non-time windows ● In structured streaming, the mapGroupsWithState API allows developers to do custom state management ● This API behaves similarly to mapWithState from the DStream API
  33. Modeling User Session ● case class Session(sessionId: String, value: Double, endSignal: Option[String]) ● sessionId uniquely identifies the given session ● value is the data captured for the given session ● endSignal is the explicit signal from the application that the session has ended ● This endSignal can be a log-out event, completion of a transaction, etc. ● Timeout is not part of the record
  34. State Management Models ● Whenever we do custom state management, we need to define two different models ● SessionInfo keeps the overall state tracked for a session: case class SessionInfo(totalSum: Double) ● SessionUpdate communicates the updates for each batch: case class SessionUpdate(id: String, totalSum: Double, expired: Boolean)
  35. State Management ● We group records by sessionId ● We use the mapGroupsWithState API to go through each record from the batch belonging to a specific session id ● For each group, we check from the data whether it has expired ● If expired, we call state.remove to drop the state ● If not expired, we call state.update to update the state with the new data ● Ex: SessionisationExample
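
The steps above can be sketched as follows, using the three case classes from the slides. This is not the deck's SessionisationExample: the `step` helper and the object name are hypothetical, and the code assumes Spark 2.2 or later, where the state handle is called `GroupState`.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

final case class Session(sessionId: String, value: Double, endSignal: Option[String])
final case class SessionInfo(totalSum: Double)
final case class SessionUpdate(id: String, totalSum: Double, expired: Boolean)

object SessionisationSketch {
  // Pure step (hypothetical helper): fold one batch of a session's events into
  // the running total and flag the session as expired if any event carries an end signal.
  def step(id: String, events: Seq[Session], previous: Option[SessionInfo]): SessionUpdate = {
    val total = previous.map(_.totalSum).getOrElse(0.0) + events.map(_.value).sum
    SessionUpdate(id, total, expired = events.exists(_.endSignal.isDefined))
  }

  // Group by sessionId and manage the per-session state explicitly.
  def sessionUpdates(events: Dataset[Session])(implicit spark: SparkSession): Dataset[SessionUpdate] = {
    import spark.implicits._
    events
      .groupByKey(_.sessionId)
      .mapGroupsWithState[SessionInfo, SessionUpdate](GroupStateTimeout.NoTimeout()) {
        (id: String, batch: Iterator[Session], state: GroupState[SessionInfo]) =>
          val update = step(id, batch.toSeq, state.getOption)
          // Drop the state when the session has ended, otherwise carry the new total forward.
          if (update.expired) state.remove() else state.update(SessionInfo(update.totalSum))
          update
      }
  }
}
```

When `sessionUpdates` is applied to a streaming Dataset, the query has to be started with `outputMode("update")`, since `mapGroupsWithState` is only supported in that mode for streaming queries.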
  36. References ● http://blog.madhukaraphatak.com/categories/introduction-structured-streaming/ ● https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html ● https://flink.apache.org/news/2016/05/24/stream-sql.html