This Knolx session was about Spark Structured Streaming. The focus was on the differences between the three APIs (RDD, DataFrame, and Dataset) and on some key concepts of Structured Streaming, such as schemas, output modes, basic operations like selection, projection, and aggregation, and windowing.
2. Agenda
● Streaming – What and Why?
● RDDs vs DataFrames vs Datasets
● Programming Model
● Streaming DataFrames and Datasets
● Defining Schema
● Output Modes
● Basic operations
● Window operations on event time*
3. RDD
Some key features of RDDs:
● Resilient
● Type Safe
● Immutable
● Lazy evaluation
and many more
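The properties above can be made concrete with a minimal sketch, assuming a local SparkSession (the session name and sample data are illustrative, not from the slides):

```scala
import org.apache.spark.sql.SparkSession

// Local session just for illustration.
val spark = SparkSession.builder().appName("rdd-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val numbers = sc.parallelize(1 to 5)    // resilient, immutable, typed: RDD[Int]
val doubled = numbers.map(_ * 2)        // lazy: no work happens yet
val result  = doubled.collect().toList  // collect() is the action that triggers evaluation
```

Note that `doubled` is a new RDD; `numbers` itself is never modified, and nothing runs on the cluster until `collect()` is called.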
4. Problems with RDDs
● They express the ‘how’ of a solution better than the ‘what’. The RDD library is a bit opaque.
● They cannot be optimized by Spark.
● It’s too easy to build an inefficient RDD transformation chain.
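A classic example of such an inefficient chain, sketched with illustrative data: `groupByKey` followed by a sum produces the same answer as `reduceByKey`, but shuffles every value across the network first, and Spark cannot rewrite one into the other for you.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-chain").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Easy to write, but shuffles every individual value before summing.
val inefficient = pairs.groupByKey().mapValues(_.sum).collect().toMap

// Same result; combines values map-side first, so far less data is shuffled.
val efficient = pairs.reduceByKey(_ + _).collect().toMap
```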
5. DataFrames
● DataFrame API provides a higher-level abstraction,
allowing you to use a query language to manipulate
data.
● Makes SQL functionality available.
● Focus on the ‘what’ rather than the ‘how’.
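A small sketch of the ‘what, not how’ idea, using illustrative data: the same question can be asked through the DataFrame API or plain SQL, and in both cases you declare the result you want rather than the steps to compute it.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("df-demo").master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")

// Declarative: describe the result, let Spark plan the execution.
val names = people.filter($"age" > 30).select("name").as[String].collect().toList

// The same query through SQL.
people.createOrReplaceTempView("people")
val sqlNames = spark.sql("SELECT name FROM people WHERE age > 30").as[String].collect().toList
```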
6. DataFrames
“Let Spark figure out how to do it for you. As in an RDBMS, we just fire SQL queries and are not concerned with how the query brings out the data; we care about the result, not the process.”
7. DataFrames
● A query goes through three logical plans and then a physical plan:
● Parsed logical plan: the query is parsed; names are not yet checked.
● Analyzed logical plan: column names, table names, etc. are resolved and validated against the catalog.
● Optimized logical plan: the Catalyst optimizer performs its optimizations.
● Physical plan (actual RDDs): the plan that is executed.
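All four plans can be inspected for any query with `explain(extended = true)`; a minimal sketch with illustrative data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("plans-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "label")
val query = df.filter($"id" > 1)

// Prints the Parsed Logical Plan, Analyzed Logical Plan,
// Optimized Logical Plan, and the Physical Plan to stdout.
query.explain(extended = true)

val rows = query.collect()
```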
10. Solution ?
● We can convert DataFrames back into RDDs, but then we lose the optimizations.
● We would like to get back our compile-time safety without giving up the optimizations.
11. Datasets
● An extension of the DataFrame API.
● Conceptually similar to RDDs (you can actually operate on objects).
● Interoperates easily with the DataFrame API.
● Like an RDD, a Dataset has a type.
● Datasets use Encoders for serialization/deserialization, which are quite fast compared to Java or Kryo serialization.
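A sketch of a typed Dataset, with an illustrative case class: `toDS()` derives an Encoder for `Person` at compile time, and the lambdas below operate on real objects, so a typo in a field name fails to compile rather than at runtime.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ds-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative domain type; the Encoder[Person] comes from spark.implicits._
case class Person(name: String, age: Long)

val ds = Seq(Person("alice", 34), Person("bob", 29)).toDS()

// Typed operations on objects: compile-time checked, unlike column-name strings.
val adults = ds.filter(_.age > 30).map(_.name).collect().toList
```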
15. Output Modes
● The “Output” is defined as what gets written out to the external storage. The output can be defined in different modes:
● Complete Mode: the entire updated Result Table will be written to the external storage.
● Append Mode: only the new rows appended to the Result Table since the last trigger will be written to the external storage. This is applicable only to queries where existing rows in the Result Table are not expected to change.
● Update Mode: only the rows that were updated in the Result Table since the last trigger will be written to the external storage. Note that this differs from Complete Mode in that it only outputs the rows that have changed since the last trigger. If the query doesn’t contain aggregations, it is equivalent to Append Mode.
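A runnable sketch of Complete Mode, using Spark's test-only `MemoryStream` source and the in-memory sink (the query name `words` is ours): because the query aggregates, every trigger rewrites the full Result Table to the sink.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream

val spark = SparkSession.builder().appName("modes-demo").master("local[*]").getOrCreate()
import spark.implicits._
implicit val sqlCtx: org.apache.spark.sql.SQLContext = spark.sqlContext

// MemoryStream is an internal, test-only source; handy for demos.
val source = MemoryStream[String]
val counts = source.toDF().groupBy("value").count()

val query = counts.writeStream
  .outputMode("complete")   // the full Result Table on every trigger
  .format("memory")         // in-memory sink, readable as a table
  .queryName("words")
  .start()

source.addData("a", "b", "a")
query.processAllAvailable()

val result = spark.table("words").as[(String, Long)].collect().toMap
query.stop()
```

Switching `outputMode("complete")` to `"update"` would emit only the changed counts per trigger; `"append"` would be rejected here, since aggregated rows can change.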
16. Streaming DataFrames and
Datasets
● Streaming DataFrames can be created through the DataStreamReader interface returned by SparkSession.readStream().
● For example, defining a streaming DataFrame with Kafka as the data source.
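The slide's code snippet did not survive the export; a sketch of the usual Kafka source definition, where the broker address and topic name are placeholders and the spark-sql-kafka connector is assumed on the classpath:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-demo").getOrCreate()

// Placeholder broker and topic.
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic1")
  .load()

// Kafka keys and values arrive as binary; cast them before use.
val messages = kafkaDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```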
17. Defining Schema
● Structured Streaming requires you to specify the schema.
There are two ways you can specify the schema:
1. You can build your schema manually
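Building a schema manually means composing a `StructType` from `StructField`s; a sketch with illustrative field names:

```scala
import org.apache.spark.sql.types._

// Hand-built schema; field names are illustrative.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age",  IntegerType, nullable = true)
))

// A streaming reader would then take it, e.g.:
//   spark.readStream.schema(schema).json("path/to/input")
```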
19. Operations on streaming
DataFrames/Datasets
● You can apply various kinds of operations on streaming DataFrames/Datasets.
● Some of the basic operations are selection, projection, and aggregation.
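A sketch of these operations, using a batch DataFrame as a stand-in (the same untyped operations apply unchanged to a streaming DataFrame; the device/signal columns are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ops-demo").master("local[*]").getOrCreate()
import spark.implicits._

val readings = Seq(("dev1", 12.0), ("dev2", 8.0), ("dev1", 15.0)).toDF("device", "signal")

// Selection (where) + projection (select).
val strong = readings.where("signal > 10").select("device")
val strongCount = strong.count()

// Aggregation.
val perDevice = readings.groupBy("device").count()
val counts = perDevice.as[(String, Long)].collect().toMap
```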