Introduction to Structured Streaming

Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine.


  1. Introduction to Structured Streaming. Manish Mishra, Software Consultant, Knoldus Software LLP
  2. Agenda
     ● What is Structured Streaming?
     ● How is it different from the previous streaming engine?
     ● The Structured Streaming programming model
     ● Basic operations: selection, projection, and aggregation
     ● Window operations on event time
     ● Example demo
  3. What is Structured Streaming?
  4. What is Structured Streaming?
     ● A scalable and fault-tolerant stream processing engine built on the Spark SQL engine.
     ● Introduced as part of the Spark 2.0 release.
     ● A unified API for streams, which can combine stream computation and batch processing.
     ● Computations can be expressed as SQL-like Dataset/DataFrame queries over streaming DataFrames.
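To illustrate the last point, a streaming DataFrame can be registered as a temporary view and queried with plain SQL. A minimal sketch, assuming an active SparkSession named `spark` and a socket source on localhost:9000 (both assumptions, matching the word-count example later in the deck):

```scala
// Sketch only: assumes a running SparkSession (spark) and a local socket source.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9000)
  .load()

// Register the unbounded DataFrame as a temp view and query it with SQL;
// the same query text would work on a static table.
lines.createOrReplaceTempView("lines")
val counts = spark.sql("SELECT value, count(*) AS cnt FROM lines GROUP BY value")
```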
  5. What is new with this streaming engine?
  6. What is new in Structured Streaming
     ● The entry point of a streaming app is a SparkSession, instead of the earlier StreamingContext.
     ● Unlike DStreams, a stream is modeled as an unbounded DataFrame.
     ● It is interoperable with DStreams.
     ● It can harness the power of the Catalyst optimizer to improve query performance without changing the query semantics.
     ● The unified API makes the developer's task easy: no one has to reason about how a streaming computation differs from a normal batch computation.
  7. Structured Streaming programming model
     ● A live data stream is treated as an unbounded table.
     ● Any streaming computation can be expressed as a batch-like query on a static table.
     ● Spark runs this computation internally as an incremental query.
     ● The result of the computation depends on the output mode specified in the streaming query.
  8. Structured Streaming programming model (image source: Apache Spark documentation)
  9. Structured Streaming programming model (image source: Apache Spark documentation)
  10. Structured Streaming programming model: output modes
     There are three output modes, which decide what result output goes into the sink:
     ● Complete mode
     ● Update mode
     ● Append mode
  11. Output modes: Complete mode
     ● The entire updated Result Table is written to the sink; it is up to the storage connector to decide how to handle writing the whole table.
     ● It is specified with outputMode("complete") when starting the streaming query.
  12. Output modes: Append mode (default)
     ● Only the new rows appended to the Result Table since the last trigger are written to the external storage.
     ● Applicable only to queries where existing rows in the Result Table are not expected to change.
  13. Output modes: Update mode
     ● Only the rows updated in the Result Table since the last trigger are written to the external storage. Unlike Complete mode, unchanged rows are not output.
     ● Note: this mode was not yet implemented as of Spark 2.0.
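The mode is chosen when the streaming query is started. A minimal sketch, reusing the `lines` and `wordCounts` names from the word-count example elsewhere in this deck:

```scala
// Complete mode: the whole Result Table is rewritten to the sink each trigger.
// Suits running aggregations such as wordCounts.
val completeQuery = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

// Append mode (the default): only rows added since the last trigger are emitted.
// It would be invalid for wordCounts, whose rows keep changing, but fits
// non-aggregating queries such as plain selections over the input lines.
val appendQuery = lines.writeStream
  .outputMode("append")
  .format("console")
  .start()
```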
  14. Example: running word count
     // Create a DataFrame representing the stream of input lines from a connection to localhost:9000
     val lines = spark.readStream
       .format("socket")
       .option("host", "localhost")
       .option("port", 9000)
       .load()

     // Split the lines into words
     val words = lines.as[String].flatMap(_.split(" "))

     // Generate a running word count
     val wordCounts = words.groupBy("value").count()
  15. Example: running word count
     // Start running the query that prints the running counts to the console
     val query = wordCounts.writeStream
       .outputMode("complete")
       .format("console")
       .start()

     query.awaitTermination()
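To feed the socket source in this example, a quick local TCP server is needed before the query starts; the Spark documentation uses netcat for this:

```shell
# Start a local TCP server on port 9000; each line typed here
# becomes one input row of the streaming DataFrame.
nc -lk 9000
```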
  16. Basic operations: selection, projection, aggregation
  17. Basic operations: selection, projection, aggregation
     // Note: the field is named deviceType because `type` is a reserved word in Scala
     case class DeviceData(device: String, deviceType: String, signal: Double, time: java.sql.Timestamp)

     // Streaming DataFrame with IoT device data, with schema
     // { device: string, deviceType: string, signal: double, time: timestamp }
     val df: DataFrame = ...
     val ds: Dataset[DeviceData] = df.as[DeviceData] // streaming Dataset with IoT device data

     // Select the devices which have signal more than 10
     df.select("device").where("signal > 10") // using untyped APIs
     ds.filter(_.signal > 10).map(_.device)   // using typed APIs
  18. Basic operations: selection, projection, aggregation
     // Running count of the number of updates for each device type
     df.groupBy("deviceType").count() // using untyped API

     // Running average signal for each device type
     import org.apache.spark.sql.expressions.scalalang.typed
     ds.groupByKey(_.deviceType).agg(typed.avg(_.signal)) // using typed API
  19. Window operations on event time
  20. Window operations on event time
     import spark.implicits._

     // Streaming DataFrame of schema { timestamp: Timestamp, word: String }
     val words = ...

     // Group the data by window and word, and compute the count of each group
     val windowedCounts = words.groupBy(
       window($"timestamp", "10 minutes", "5 minutes"),
       $"word"
     ).count()
  21. Window operations on event time (image source: Apache Spark documentation)
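As a back-of-the-envelope check of the sliding-window semantics above, here is a hypothetical pure-Scala helper (not a Spark API) that lists the window start offsets, in minutes past the hour, containing a given event time, assuming windows of a given length sliding at a given interval:

```scala
// Hypothetical helper, not part of Spark: given an event time in minutes,
// a window length, and a slide interval, return the start offset of every
// window that contains the event.
def windowsFor(eventMin: Int, lengthMin: Int, slideMin: Int): Seq[Int] =
  (0 to eventMin by slideMin).filter(start => eventMin < start + lengthMin)

// An event at minute 7 with 10-minute windows sliding every 5 minutes
// falls into the windows starting at minutes 0 and 5.
windowsFor(7, 10, 5)  // Seq(0, 5)
```

This mirrors the slide's `window($"timestamp", "10 minutes", "5 minutes")` grouping: each event is counted once per overlapping window.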
  22. References
     ● Structured Streaming Programming Guide: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
  23. Thanks!!
