Structured streaming in Spark

How the Structured Streaming feature works in Apache Spark.

1. Structured Streaming - Spark Streaming 2.0
https://hadoopist.wordpress.com
Giri R Varatharajan
https://www.linkedin.com/in/girivaratharajan
2. What is Structured Streaming in Apache Spark
● Continuous data flow programming model introduced in Spark 2.0
● Low-latency, fault-tolerant, high-throughput system
● Exactly-once semantics - no duplicates
● Stateful aggregation over time, event, window, and record
● A streaming platform built on top of Spark SQL
● Express your streaming computation the same way you would express a batch computation on Spark SQL DataFrames
● Alpha release shipped with Spark 2.0
● Supports HDFS and S3 today; support for Kafka, Kinesis, and other sources is coming very soon
3. Spark Streaming < 2.0 Behavior
● Micro-batching: streams are modeled as Discretized Streams (DStreams)
● Running aggregations must be specified explicitly via the updateStateByKey method
● Requires careful construction for fault tolerance
(Diagram: micro-batching)
4. Streaming Model
● Live data streams keep appending to a DataFrame called the unbounded table
● Runs incremental aggregates on the unbounded table
5. Spark Streaming 2.0 Behavior + Demo
● Continuous data flow: streams are appended to an unbounded table with the DataFrame API on top of it
● No need to specify any special method for running aggregates over time, window, or record
● Look at the network socket word count program on the next slides
● Streaming is performed in Complete, Append, or Update mode
(Diagram: continuous data flow - lines = input table, wordCounts = result table)
6. Streaming Model
// Socket stream - read lines as and when they arrive on the NetCat channel
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()
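The slide shows only the read side of the demo. Below is a minimal sketch of the rest of the socket word count referenced on slide 5, assuming the SparkSession `spark` and the `lines` stream above; the console sink and complete mode mirror the output options described on slide 9.

// Split each line into words (a Dataset[String] exposes a single column named "value")
import spark.implicits._
val words = lines.as[String].flatMap(_.split(" "))

// Running word count over the unbounded input table
val wordCounts = words.groupBy("value").count()

// Complete mode: emit the full result table to the console after every trigger
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()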
7. Streaming Model
// Windowed word count: group by a sliding event-time window and by word
val windowedCounts = words.groupBy(
  window($"timestamp", windowDuration, slideDuration), $"word"
).count().orderBy("window")
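For this query to run, the words stream needs a timestamp column and the two durations need values; neither is shown in the deck, so the following is a sketch with assumed names and durations, using the socket source's includeTimestamp option.

import org.apache.spark.sql.functions._
import spark.implicits._

// Socket source that also tags each line with its arrival time
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .option("includeTimestamp", true)
  .load()

// One (word, timestamp) row per word in each line
val words = lines.as[(String, java.sql.Timestamp)]
  .flatMap { case (line, ts) => line.split(" ").map(word => (word, ts)) }
  .toDF("word", "timestamp")

val windowDuration = "10 minutes"  // window length (assumed)
val slideDuration = "5 minutes"    // slide interval (assumed)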
8. Create/Read Streams
SparkSession.readStream()
● File source (HDFS, S3; text, Parquet, CSV, JSON, etc.)
● Socket stream (NetCat)
● Kafka, Kinesis, and other input sources are still under research, so cross your fingers
● DataStreamReader API (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.streaming.DataStreamReader)
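A minimal sketch of the file-source variant, assuming a JSON input whose schema and directory are illustrative choices rather than anything from the deck (file sources require the schema to be supplied up front):

import org.apache.spark.sql.types._

val schema = new StructType()
  .add("user", StringType)
  .add("action", StringType)
  .add("ts", TimestampType)

// New files dropped into the directory are picked up as they arrive
val events = spark.readStream
  .schema(schema)
  .json("hdfs:///data/incoming/events")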
9. Outputting Streams
DataFrame.writeStream()
Output sink types:
● Parquet sink - HDFS, S3, Parquet files
● Console sink - terminal
● Memory sink - in-memory table that can be queried interactively over time
● Foreach sink
● DataStreamWriter API (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.streaming.DataStreamWriter)
Output modes:
● Append mode (default)
  ○ Only new rows are appended
  ○ Applicable only to non-aggregated queries (select, where, filter, join, etc.)
● Complete mode
  ○ The whole result table is output to the sink
  ○ Applicable only to aggregated queries (groupBy, etc.)
● Update mode
  ○ Only rows whose attributes were updated are written to the output sink
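As a small illustration of the memory sink described above, here is a sketch assuming the wordCounts aggregate from the earlier word count sketch; the table name is illustrative.

val memQuery = wordCounts.writeStream
  .format("memory")
  .queryName("word_counts")  // name of the in-memory table
  .outputMode("complete")    // aggregated query, so complete mode
  .start()

// The in-memory table can be queried interactively while the stream runs
spark.sql("SELECT * FROM word_counts ORDER BY count DESC").show()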
10. CheckPointing
● In case of failure, recover the progress and state of the previous query and continue where it left off
● Configure a checkpoint location in the writeStream method of DataStreamWriter
● Must be configured for the Parquet sink and other file sinks
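A sketch of a checkpointed Parquet sink, reusing the events stream from the file-source sketch above; the output and checkpoint paths are illustrative assumptions.

val fileQuery = events.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/output/events")                // where the Parquet files land
  .option("checkpointLocation", "hdfs:///checkpoints/events")  // progress and state used for recovery
  .outputMode("append")
  .start()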
11. Unsupported Operations (yet)
● Sort, limit/take of first N rows, and distinct on input streams
● Joins between two streaming Datasets
● Outer joins (full, left, right) between two streaming Datasets
● ds.count() ⇒ use ds.groupBy().count() instead
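The count workaround in the last bullet can be sketched as follows, assuming the words stream from the earlier sketches: a direct count() is unsupported on a streaming Dataset, but an empty groupBy turns it into an incremental aggregate.

val totalCount = words.groupBy().count()  // single running count over the whole stream

val countQuery = totalCount.writeStream
  .outputMode("complete")
  .format("console")
  .start()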
12. Key Takeaways
● Structured Streaming is still experimental, but please try it out
● Streaming events are gathered and appended to an infinite DataFrame series (the unbounded table), and queries run on top of it
● Development is very similar to development against Spark's static DataFrame/Dataset APIs
● Execute ad-hoc queries, run aggregates, update databases, track session data, prepare dashboards, etc.
● readStream() - the schema of a streaming DataFrame is checked only at run time, hence it is untyped
● writeStream() offers various output modes and output sinks; always remember which output mode to use for which kind of query
● Kafka, Kinesis, and MLlib integrations, sessionization, and watermarks are upcoming features being developed in the open source community
● Structured Streaming is not recommended for production workloads at this point, even for file or socket streaming
13. Thank You
Spark code is available on my GitHub:
https://github.com/vgiri2015/Spark2.0-and-greater/tree/master/src/main/scala/structStreaming
Other Spark-related repositories:
https://github.com/vgiri2015/spark-latest-v1
My blogs and learnings in Spark:
https://hadoopist.wordpress.com/category/apache-spark/
