2. What is Structured Streaming?
Structured Streaming is a scalable and fault-tolerant stream processing engine
built on the Spark SQL engine.
Express streaming computation the same way as a batch computation on static
data.
The Spark SQL engine takes care of running it incrementally and continuously and
updating the final result as streaming data continues to arrive.
Use the Dataset/DataFrame API to express streaming aggregations, event-time
windows, stream-to-batch joins, etc.
Structured Streaming is still ALPHA in Spark 2.1 and the APIs are still
experimental.
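As a sketch of the batch-like programming model, here is a streaming word count expressed with the ordinary DataFrame API. The local SparkSession and the socket source on port 9999 are illustrative assumptions, not part of the original example.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical setup: a local SparkSession and a socket text source on port 9999
val spark = SparkSession.builder
  .appName("StructuredWordCount")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// lines is an unbounded streaming DataFrame, but the code reads like a batch query
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// The same Dataset operations you would use on static data
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()
```

The Spark SQL engine turns this logical query into an incremental plan and keeps the running counts up to date as new lines arrive.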
4. Output modes
There are three output modes in Spark Structured Streaming:
Append mode - Only the new rows added to the Result Table since the last
trigger are written to the sink. This is supported only for queries where
rows added to the Result Table will never change.
Complete mode - The whole Result Table is written to the sink after
every trigger. This is supported for aggregation queries.
Update mode - Only the rows in the Result Table that were updated since the
last trigger are written to the sink.
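The output mode is selected on the `writeStream` builder when the query is started. A minimal sketch, assuming `counts` is some streaming aggregation DataFrame (a hypothetical name):

```scala
// counts is assumed to be a streaming aggregation DataFrame (hypothetical)
val query = counts.writeStream
  .outputMode("complete")   // or "append" / "update"
  .format("console")        // print each trigger's result, for debugging
  .start()

query.awaitTermination()
```

Passing an unsupported combination (e.g. append mode on an aggregation with no watermark) fails at `start()` with an `AnalysisException`, so the check happens before any data is processed.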
6. Windowing
import spark.implicits._
val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
// Group the data by window and word and compute the count of each group
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()
Currently Spark supports the following output sinks:
File sink
Foreach sink
Console sink (for debugging)
Memory sink
Append mode is used for Parquet writing and for file sinks in general.
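A file-sink sketch, assuming `words` is the streaming DataFrame from the windowing example; the output and checkpoint paths are hypothetical. File sinks only support append mode:

```scala
// File sinks support append mode only; both paths here are hypothetical
val fileQuery = words.writeStream
  .outputMode("append")
  .format("parquet")
  .option("path", "/tmp/streaming-output")
  .option("checkpointLocation", "/tmp/streaming-checkpoint")
  .start()
```

The checkpoint location is required for file sinks: it records which files have been committed, so a restarted query resumes exactly where it left off.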
Imagine our quick example is modified so that the stream now contains lines along with the time when each line was generated. Instead of running word counts over the whole stream, we want to count words within 10-minute windows, updating every 5 minutes. That is, word counts over the windows 12:00 - 12:10, 12:05 - 12:15, 12:10 - 12:20, and so on.
Now consider what happens if one of the events arrives late to the application. For example, a word generated at 12:04 (i.e. event time) could be received by the application at 12:11. The application should use the time 12:04, not 12:11, to update the older counts for the window 12:00 - 12:10. This occurs naturally in our window-based grouping: Structured Streaming maintains the intermediate state for partial aggregates for a long period of time, so that late data can correctly update aggregates of old windows.
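Keeping that intermediate state forever would grow without bound, so Spark 2.1 introduced watermarking to tell the engine how late data may be before its window state can be dropped. A sketch, reusing the `words` DataFrame from the windowing example:

```scala
// Declare that events arriving more than 10 minutes past the max seen
// event time may be dropped, letting Spark discard old window state
val watermarkedCounts = words
  .withWatermark("timestamp", "10 minutes")
  .groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"word"
  )
  .count()
```

With a watermark in place, a windowed aggregation can also run in append mode, since a window's result becomes final once the watermark passes its end.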