Structured streaming in Spark


How the Structured Streaming feature works in Apache Spark.

Published in: Engineering
  1. Structured Streaming
     Spark Streaming 2.0
     Giri R Varatharajan
  2. What is Structured Streaming in Apache Spark
     ● Continuous data flow programming model, introduced in Spark 2.0
     ● Low-latency, high-throughput system
     ● Exactly-once semantics: no duplicates
     ● Stateful aggregation over time, events, windows, and records
     ● A streaming platform built on top of Spark SQL
     ● Express your streaming computation the same way you would express a batch computation on Spark SQL DataFrames
     ● Alpha release shipped with Spark 2.0
     ● Supports HDFS and S3 now; support for Kafka, Kinesis, and other sources is expected very soon
  3. Spark Streaming < 2.0 Behavior
     ● Micro-batching: streams are called Discretized Streams (DStreams)
     ● Running aggregations must be specified with the updateStateByKey method
     ● Requires careful construction of fault tolerance
     Diagram: Micro Batching
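To make the pre-2.0 approach on slide 3 concrete, here is a minimal sketch of a running word count with DStreams. The object name, the `updateCount` helper, the host and port, and the checkpoint path are illustrative assumptions, not taken from the deck.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamRunningWordCount {
  // Pure update function: merges the counts seen in this micro-batch
  // into the running total kept for one key.
  def updateCount(newValues: Seq[Int], running: Option[Int]): Option[Int] =
    Some(running.getOrElse(0) + newValues.sum)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamRunningWordCount")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches

    // Stateful operations need a checkpoint directory (hypothetical path).
    ssc.checkpoint("/tmp/dstream-checkpoint")

    // Assumes a text server on localhost:9999, e.g. `nc -lk 9999`.
    val lines = ssc.socketTextStream("localhost", 9999)
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

    // The running aggregation has to be spelled out explicitly.
    val runningCounts = pairs.updateStateByKey[Int](updateCount _)
    runningCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The state handling lives entirely in `updateCount`, which is the piece Structured Streaming no longer asks you to write.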
  4. Streaming Model
     ● Live data streams keep appending rows to a DataFrame called the unbounded table.
     ● Incremental aggregates run on the unbounded table.
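The unbounded-table idea on slide 4 can be modeled with plain Scala collections. This is only a conceptual sketch with no Spark involved; `ResultTable`, `update`, and the sample rows are hypothetical names and data chosen for illustration.

```scala
// Conceptual model only: plain Scala collections, no Spark.
object UnboundedTableSketch {
  type ResultTable = Map[String, Long]

  // Fold one newly arrived batch of rows into the previous result table,
  // instead of recomputing the aggregate over every row seen so far.
  def update(previous: ResultTable, newRows: Seq[String]): ResultTable =
    newRows.foldLeft(previous) { (acc, word) =>
      acc.updated(word, acc.getOrElse(word, 0L) + 1L)
    }

  def main(args: Array[String]): Unit = {
    // Each inner Seq stands for the rows appended at one trigger.
    val arrivals = Seq(Seq("cat", "dog"), Seq("dog", "owl"), Seq("cat"))

    // scanLeft yields the result table after every trigger.
    val results = arrivals.scanLeft(Map.empty[String, Long])(update)
    results.tail.foreach(println)
  }
}
```

The incremental result after the last trigger equals what a batch count over all rows would give, which is the guarantee the streaming model aims for.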
  5. Spark Streaming 2.0 Behavior + Demo
     ● Continuous data flow: streams are appended to an unbounded table that exposes the DataFrame APIs.
     ● No need to specify a special method for running aggregates over time, windows, or records.
     ● See the network socket word count program.
     ● Streaming runs in Complete, Append, or Update output mode.
     Diagram: Continuous Data Flow (lines = input table, wordCounts = result table)
  6. Streaming Model
     // Socket stream: rows are read as they arrive on the NetCat channel
     val lines = spark.readStream
       .format("socket")
       .option("host", "localhost")
       .option("port", 9999)
       .load()
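The socket reader on slide 6 is only the input half of the word count demo. A minimal end-to-end sketch might look like the following; the object name and the `countWords` helper are assumptions for illustration, and a text server on localhost:9999 (for example `nc -lk 9999`) is assumed to be running.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object StructuredNetworkWordCount {
  // The same transformation works on batch and streaming Datasets.
  def countWords(lines: Dataset[String]): DataFrame = {
    import lines.sparkSession.implicits._
    lines.flatMap(_.split(" ")).groupBy("value").count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[2]")
      .appName("StructuredNetworkWordCount")
      .getOrCreate()
    import spark.implicits._

    // Input table: one string column named "value", one row per line received.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Result table: running count per word.
    val wordCounts = countWords(lines.as[String])

    // Print the whole result table to the console after every trigger.
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```

Keeping the transformation in `countWords` is a design choice for this sketch: it lets the same logic be checked against a static Dataset before it is pointed at a stream.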
  7. Streaming Model
     val windowedCounts = words.groupBy(
       window($"timestamp", windowDuration, slideDuration),
       $"word"
     ).count().orderBy("window")
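Slide 7 leaves `windowDuration` and `slideDuration` undefined, so it may help to see how a single event maps to sliding windows. The sketch below is a plain-Scala conceptual model, not Spark's implementation; it assumes windows are aligned to multiples of the slide interval and that timestamps are non-negative. `windowsFor` is a hypothetical helper.

```scala
// Conceptual model only: which sliding windows contain a given event time.
object SlidingWindowSketch {
  // Returns every [start, end) window, aligned to multiples of `slide`,
  // that contains `event`. All three arguments use the same time unit.
  def windowsFor(event: Long, window: Long, slide: Long): List[(Long, Long)] = {
    // Latest aligned window start that is not after the event.
    val lastStart = event - (event % slide)
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(start => start + window > event) // window must still cover the event
      .map(start => (start, start + window))
      .toList
      .reverse
  }

  def main(args: Array[String]): Unit = {
    // 10-unit windows sliding every 5 units, event at time 7.
    println(windowsFor(event = 7, window = 10, slide = 5))
  }
}
```

With a 10-minute window sliding every 5 minutes, an event at minute 7 is counted in both [0, 10) and [5, 15), which is why one input row can contribute to several rows of `windowedCounts`.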
  8. Create / Read Streams
     SparkSession.readStream()
     ● File source (HDFS, S3; text, Parquet, CSV, JSON, etc.)
     ● Socket stream (NetCat)
     ● Kafka, Kinesis, and other input sources are still under research, so cross your fingers.
     ● DataStreamReader API ( .html#org.apache.spark.sql.streaming.DataStreamReader)
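For the file source listed on slide 8, a minimal sketch of reading CSV files as a stream could look like this. The schema, the column names, and the `readCsvStream` helper are hypothetical; the input directory is supplied by the caller.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

object FileSourceSketch {
  // Hypothetical schema: streaming file sources expect one to be declared up front.
  val userSchema: StructType = new StructType()
    .add("name", StringType)
    .add("age", IntegerType)

  // Treats every CSV file that lands in `dir` as new rows of the input table.
  def readCsvStream(spark: SparkSession, dir: String): DataFrame =
    spark.readStream
      .schema(userSchema)
      .option("sep", ",")
      .csv(dir)
}
```

The returned DataFrame reports `isStreaming == true`; nothing is read until a query is started on it with `writeStream`.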
  9. Outputting Streams
     Dataset.writeStream()
     Output sink types:
     ● Parquet sink: HDFS, S3, Parquet
     ● Console sink: terminal
     ● Memory sink: in-memory table that can be queried interactively over time
     ● Foreach sink
     ● DataStreamWriter API ( reaming.DataStreamWriter)
     Output modes:
     ● Append mode (default)
       ○ Only new rows are appended
       ○ Applicable only to non-aggregated queries (select, where, filter, join, etc.)
     ● Complete mode
       ○ Outputs the whole result table to the sink
       ○ Applicable only to aggregated queries (groupBy, etc.)
     ● Update mode
       ○ Only rows whose values were updated are written to the output sink.
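The difference between the three output modes on slide 9 is easiest to see by comparing the result table before and after a trigger. The following plain-Scala sketch is a conceptual model only, not Spark code; `Table` and the three functions are hypothetical names, and the append case assumes existing rows never change, as in non-aggregated queries.

```scala
// Conceptual model only: what each output mode would hand to the sink
// given the result table before and after one trigger.
object OutputModeSketch {
  type Table = Map[String, Long]

  // Complete: the whole current result table.
  def complete(previous: Table, current: Table): Table = current

  // Update: only rows that are new or whose value changed.
  def update(previous: Table, current: Table): Table =
    current.filter { case (key, value) => !previous.get(key).contains(value) }

  // Append: only rows whose key did not exist before.
  def append(previous: Table, current: Table): Table =
    current.filter { case (key, _) => !previous.contains(key) }

  def main(args: Array[String]): Unit = {
    val before = Map("cat" -> 1L, "dog" -> 1L)
    val after = Map("cat" -> 1L, "dog" -> 2L, "owl" -> 1L)
    println(complete(before, after))
    println(update(before, after))
    println(append(before, after))
  }
}
```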
  10. Checkpointing
      ● After a failure, recover the progress and state of the previous query and continue where it left off.
      ● Configure a checkpoint location on the DataStreamWriter returned by writeStream.
      ● Must be configured for the Parquet sink and the file sink.
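As a sketch of the checkpoint configuration described on slide 10, the helper below starts a Parquet sink with a checkpoint location. `startParquetSink` and both directory parameters are hypothetical; the caller supplies a streaming DataFrame and the paths.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.StreamingQuery

object CheckpointSketch {
  // outputDir and checkpointDir are hypothetical paths supplied by the caller.
  def startParquetSink(events: DataFrame, outputDir: String, checkpointDir: String): StreamingQuery =
    events.writeStream
      .format("parquet")
      .outputMode("append")                        // file sinks write new rows only
      .option("path", outputDir)                   // where the Parquet files land
      .option("checkpointLocation", checkpointDir) // progress and state for recovery
      .start()
}
```

Restarting the query with the same `checkpointDir` is what lets it continue from where the previous run stopped.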
  11. Operations Not Yet Supported
      ● Sort, limit (first N rows), and distinct on input streams
      ● Joins between two streaming Datasets
      ● Outer joins (full, left, right) between two streaming Datasets
      ● ds.count() ⇒ use ds.groupBy().count() instead
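The last workaround on slide 11 can be sketched as follows. `totalRows` is a hypothetical helper name; the point is that the count is expressed as an aggregation that returns a DataFrame rather than as an action that returns a single number.

```scala
import org.apache.spark.sql.DataFrame

object StreamingCountSketch {
  // An empty groupBy() aggregates over all rows, so the result is a
  // one-row table with a "count" column that can be updated per trigger.
  def totalRows(ds: DataFrame): DataFrame =
    ds.groupBy().count()
}
```

Because the result is itself a DataFrame, it can be written out with `writeStream` in complete mode like any other aggregate.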
  12. Key Takeaways
      ● Structured Streaming is still experimental, but please try it out.
      ● Streaming events are gathered and appended to an infinite DataFrame (the unbounded table), and queries run on top of it.
      ● Development is very similar to working with the static DataFrame/Dataset APIs.
      ● Execute ad hoc queries, run aggregates, update databases, track session data, prepare dashboards, etc.
      ● readStream(): the schema of a streaming DataFrame is checked only at run time, hence it is untyped.
      ● writeStream() offers several output modes and output sinks. Always remember which output mode applies to which kind of query.
      ● Kafka and Kinesis sources, MLlib integration, sessionization, and watermarks are upcoming features under development in the open source community.
      ● Structured Streaming is not recommended for production workloads at this point, even for file or socket streaming.
  13. Thank You
      Spark code is available in my GitHub: /src/main/scala/structStreaming
      Other Spark-related repositories:
      My blogs and learning in Spark: