This Knolx session was about Spark Structured Streaming. The focus was on the differences between the three APIs (RDD, DataFrame, and Dataset) and on some key concepts of Structured Streaming, such as schemas, output modes, basic operations like selection, projection, and aggregation, and windowing.
2. Agenda
● Streaming – What and Why?
● RDDs vs DataFrames vs Datasets
● Programming Model
● Streaming DataFrames and Datasets
● Defining Schema
● Output Modes
● Basic operations
● Window operations on event time*
3. RDD
Some key features of RDDs:
● Resilient
● Type Safe
● Immutable
● Lazy evaluation
and many more
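The properties above can be made concrete with a minimal sketch, assuming a local SparkSession (the session name and sample data are illustrative, not from the slides):

```scala
import org.apache.spark.sql.SparkSession

// Local session just for illustration.
val spark = SparkSession.builder().appName("rdd-demo").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val numbers = sc.parallelize(1 to 5)    // resilient, immutable, typed: RDD[Int]
val doubled = numbers.map(_ * 2)        // lazy: no work happens yet
val result  = doubled.collect().toList  // collect() is the action that triggers evaluation
```

Note that `doubled` is a new RDD; `numbers` itself is never modified, and nothing runs on the cluster until `collect()` is called.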
4. Problems with RDDs
● They express the ‘how’ of a solution better than the ‘what’. The RDD library is a bit opaque.
● They cannot be optimized by Spark.
● It’s too easy to build an inefficient RDD transformation chain.
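A classic example of such an inefficient chain, sketched with illustrative data: `groupByKey` followed by a sum produces the same answer as `reduceByKey`, but shuffles every value across the network first, and Spark cannot rewrite one into the other for you.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-chain").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Easy to write, but shuffles every individual value before summing.
val inefficient = pairs.groupByKey().mapValues(_.sum).collect().toMap

// Same result; combines values map-side first, so far less data is shuffled.
val efficient = pairs.reduceByKey(_ + _).collect().toMap
```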
5. DataFrames
● DataFrame API provides a higher-level abstraction,
allowing you to use a query language to manipulate
data.
● Makes SQL functionality available.
● Focus on the ‘what’ rather than the ‘how’.
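A small sketch of the ‘what, not how’ idea, using illustrative data: the same question can be asked through the DataFrame API or plain SQL, and in both cases you declare the result you want rather than the steps to compute it.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("df-demo").master("local[*]").getOrCreate()
import spark.implicits._

val people = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")

// Declarative: describe the result, let Spark plan the execution.
val names = people.filter($"age" > 30).select("name").as[String].collect().toList

// The same query through SQL.
people.createOrReplaceTempView("people")
val sqlNames = spark.sql("SELECT name FROM people WHERE age > 30").as[String].collect().toList
```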
6. DataFrames
“Let Spark figure out how to do it for you. As in an RDBMS, we just fire SQL queries and are not concerned with how the query brings out the data; we care about the result, not the process.”
7. DataFrames
● A query goes through three logical plans and then a physical plan:
● Parsed logical plan: the query is parsed; names are not yet checked.
● Analyzed logical plan: column names, table names, etc. are resolved and validated against the catalog.
● Optimized logical plan: the Catalyst optimizer performs its optimizations.
● Physical plan (actual RDDs): the plan that is executed.
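All four plans can be inspected for any query with `explain(extended = true)`; a minimal sketch with illustrative data:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("plans-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "label")
val query = df.filter($"id" > 1)

// Prints the Parsed Logical Plan, Analyzed Logical Plan,
// Optimized Logical Plan, and the Physical Plan to stdout.
query.explain(extended = true)

val rows = query.collect()
```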
10. Solution ?
● We can convert DataFrames back into RDDs, but then we lose the optimizations.
● We would like to get back our compile-time safety without giving up the optimizations.
11. Datasets
● An extension of the DataFrame API.
● Conceptually similar to RDDs (you can actually operate on objects).
● Interoperates easily with the DataFrame API.
● Like an RDD, a Dataset has a type.
● Datasets use Encoders for serialization/deserialization, which are quite fast compared to Java or Kryo serialization.
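A sketch of a typed Dataset, with an illustrative case class: `toDS()` derives an Encoder for `Person` at compile time, and the lambdas below operate on real objects, so a typo in a field name fails to compile rather than at runtime.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ds-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Illustrative domain type; the Encoder[Person] comes from spark.implicits._
case class Person(name: String, age: Long)

val ds = Seq(Person("alice", 34), Person("bob", 29)).toDS()

// Typed operations on objects: compile-time checked, unlike column-name strings.
val adults = ds.filter(_.age > 30).map(_.name).collect().toList
```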
15. Output Modes
● The “Output” is defined as what gets written out to the external storage. The output can be defined in different modes:
● Complete Mode: the entire updated Result Table will be written to the external storage.
● Append Mode: only the new rows appended to the Result Table since the last trigger will be written to the external storage. This is applicable only to queries where existing rows in the Result Table are not expected to change.
● Update Mode: only the rows that were updated in the Result Table since the last trigger will be written to the external storage. Note that this differs from Complete Mode in that it only outputs the rows that have changed since the last trigger. If the query doesn’t contain aggregations, it is equivalent to Append Mode.
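A runnable sketch of Complete Mode, using Spark's test-only `MemoryStream` source and the in-memory sink (the query name `words` is ours): because the query aggregates, every trigger rewrites the full Result Table to the sink.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream

val spark = SparkSession.builder().appName("modes-demo").master("local[*]").getOrCreate()
import spark.implicits._
implicit val sqlCtx: org.apache.spark.sql.SQLContext = spark.sqlContext

// MemoryStream is an internal, test-only source; handy for demos.
val source = MemoryStream[String]
val counts = source.toDF().groupBy("value").count()

val query = counts.writeStream
  .outputMode("complete")   // the full Result Table on every trigger
  .format("memory")         // in-memory sink, readable as a table
  .queryName("words")
  .start()

source.addData("a", "b", "a")
query.processAllAvailable()

val result = spark.table("words").as[(String, Long)].collect().toMap
query.stop()
```

Switching `outputMode("complete")` to `"update"` would emit only the changed counts per trigger; `"append"` would be rejected here, since aggregated rows can change.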
16. Streaming DataFrames and
Datasets
● Streaming DataFrames can be created through the DataStreamReader interface returned by SparkSession.readStream().
● For example, defining a streaming DataFrame with Kafka as the data source.
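The slide's code snippet did not survive the export; a sketch of the usual Kafka source definition, where the broker address and topic name are placeholders and the spark-sql-kafka connector is assumed on the classpath:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-demo").getOrCreate()

// Placeholder broker and topic.
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic1")
  .load()

// Kafka keys and values arrive as binary; cast them before use.
val messages = kafkaDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```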
17. Defining Schema
● Structured Streaming requires you to specify the schema.
There are two ways you can specify the schema:
1. You can build your schema manually
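Building a schema manually means composing a `StructType` from `StructField`s; a sketch with illustrative field names:

```scala
import org.apache.spark.sql.types._

// Hand-built schema; field names are illustrative.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age",  IntegerType, nullable = true)
))

// A streaming reader would then take it, e.g.:
//   spark.readStream.schema(schema).json("path/to/input")
```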
19. Operations on streaming
DataFrames/Datasets
● You can apply various kinds of operations on streaming DataFrames/Datasets.
● Some of the basic operations are selection, projection, and aggregation.
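A sketch of these operations, using a batch DataFrame as a stand-in (the same untyped operations apply unchanged to a streaming DataFrame; the device/signal columns are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ops-demo").master("local[*]").getOrCreate()
import spark.implicits._

val readings = Seq(("dev1", 12.0), ("dev2", 8.0), ("dev1", 15.0)).toDF("device", "signal")

// Selection (where) + projection (select).
val strong = readings.where("signal > 10").select("device")
val strongCount = strong.count()

// Aggregation.
val perDevice = readings.groupBy("device").count()
val counts = perDevice.as[(String, Long)].collect().toMap
```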