Alberto Paro
 Master's Degree in Computer Science Engineering at Politecnico di Milano
 Big Data Practice Leader at NTTDATA Italia
 Author of 4 books about ElasticSearch (versions 1.x to 7.x), plus 6 tech reviews
 Big Data trainer, developer, and consultant on big data technologies (Akka,
Play Framework, Apache Spark, Reactive Programming) and NoSQL (Accumulo,
 HBase, Cassandra, ElasticSearch, Kafka, and MongoDB)
 Evangelist for the Scala and Scala.js languages
SUMMARY
• Why?
• Architectures
• Message Brokers
• Streaming Frameworks
• Streaming Libraries
THE START OF THE JOURNEY
WHY STREAM PROCESSING
NEED FOR STREAMING
Business:
• Real-time/unbounded data processing is a key winning factor (e.g. banking, finance, sports).
• No longer tied to nightly batch processing.
• Real-time processing reduces time-to-market.
• Fast feedback on customers (e.g. campaign monitoring).
• Real-time balancing of resources (demand-response).
Technical:
• Many of the systems we want to monitor and understand produce continuous streams of events, such as heartbeats, machine metrics, and GPS signals.
• Distribute data processing in time (no more big batch jobs, where possible).
• Reduce the processing power needed in big data environments.
• Application decoupling: separation of concerns on data (the Kafka way).
• Manage backpressure in the application flow.
STANDARD STREAMING FLOW
• Source
• Message Broker
• Streaming Engine
• Destination
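To make the four stages concrete, here is a purely illustrative Scala sketch; all type and method names are hypothetical, not from any specific library:

```scala
// Hypothetical types modeling the four stages of a streaming flow.
final case class Event(payload: String)

trait EventSource     { def poll(): Iterator[Event] }                 // sensors, logs, ...
trait MessageBroker   { def publish(topic: String, e: Event): Unit
                        def subscribe(topic: String): Iterator[Event] }
trait StreamingEngine { def process(in: Iterator[Event]): Iterator[Event] }
trait Destination     { def write(e: Event): Unit }                   // DB, index, ...

// Ingest from the source into the broker, then process and sink.
def run(src: EventSource, broker: MessageBroker,
        engine: StreamingEngine, dst: Destination): Unit = {
  src.poll().foreach(broker.publish("events", _))
  engine.process(broker.subscribe("events")).foreach(dst.write)
}
```

In real deployments the broker (e.g. Kafka, RabbitMQ, Pulsar) decouples the source from the engine, so each stage can scale and fail independently.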
CONFLUENT KAFKA-LIKE ARCHITECTURE
TOP THREE
MESSAGE BROKERS
RABBITMQ
• RabbitMQ is an open-source message-broker software
(sometimes called message-oriented middleware) that
originally implemented the Advanced Message Queuing
Protocol (AMQP) and has since been extended with a plug-in
architecture to support Streaming Text Oriented Messaging
Protocol (STOMP), MQ Telemetry Transport (MQTT), and
other protocols.
• The RabbitMQ server program is written in the Erlang
programming language and is built on the Open Telecom
Platform framework for clustering and failover.
• Client libraries to interface with the broker are available for all
major programming languages.
• Rabbit Technologies Ltd. originally developed RabbitMQ.
Rabbit Technologies started as a joint venture between LShift
and CohesiveFT in 2007, and was acquired in April 2010 by
SpringSource, a division of VMware. The project became part
of Pivotal Software in May 2013.
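As a minimal illustration, here is a hedged sketch of publishing a message from Scala with the official Java client (com.rabbitmq:amqp-client); the host and queue name are placeholders:

```scala
import com.rabbitmq.client.ConnectionFactory

object RabbitPublish extends App {
  val factory = new ConnectionFactory()
  factory.setHost("localhost")                 // placeholder broker host

  val connection = factory.newConnection()
  val channel    = connection.createChannel()

  // Declare a non-durable, non-exclusive, non-auto-delete queue.
  channel.queueDeclare("hello", false, false, false, null)
  channel.basicPublish("", "hello", null, "Hello, AMQP!".getBytes("UTF-8"))

  channel.close()
  connection.close()
}
```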
APACHE KAFKA
• Kafka Streams is a client library for real-time stream
processing and analysis of data stored in Kafka brokers.
• The Streams API allows an application to act as a stream
processor, consuming an input stream from one or more
topics and producing an output stream to one or more output
topics, effectively transforming the input streams to output
streams.
• In Kafka a stream processor is anything that takes continual
streams of data from input topics, performs some processing
on this input, and produces continual streams of data to
output topics.
• It is possible to do simple processing directly with the
producer and consumer APIs.
• The Streams API, however, allows building applications that do
non-trivial processing, such as computing aggregations over
streams or joining streams together (see the sketch below).
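A minimal word-count topology with the Kafka Streams Scala DSL (API as of Kafka 2.x; the topic names are placeholders) looks roughly like this:

```scala
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder

object WordCountApp extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-demo")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder
  builder.stream[String, String]("input-topic")   // consume an input topic
    .flatMapValues(_.toLowerCase.split("\\s+"))   // transform: split into words
    .groupBy((_, word) => word)
    .count()                                      // stateful aggregation
    .toStream
    .to("word-counts")                            // produce to an output topic

  new KafkaStreams(builder.build(), props).start()
}
```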
APACHE PULSAR
• Apache Pulsar is an open-source distributed pub-sub messaging
system originally created at Yahoo.
• Like Kafka, Pulsar uses the concept of topics and subscriptions to
create order from large amounts of streaming data in a scalable
and low-latency manner. In addition to publish and subscribe,
Pulsar can support point-to-point message queuing from a single
API. Like Kafka, the project relies on ZooKeeper (for coordination
and metadata), and it uses Apache BookKeeper for durable, ordered log storage.
• The creators of Pulsar say they developed it to address several
shortcomings of existing open source messaging systems. It has
been running in production at Yahoo since 2014 and was open
sourced in 2016. Pulsar is backed by a commercial open source
outfit called Streamlio.
• Pulsar’s strengths include multi-tenancy, geo-replication, and
strong durability guarantees, high message throughput, as well as a
single API for both queuing and publish-subscribe messaging.
Scaling a Pulsar cluster is as easy as adding additional nodes.
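A hedged sketch of that single API, using the Pulsar Java client from Scala (the service URL, topic, and subscription names are placeholders): a Shared subscription gives queue-like point-to-point semantics, while Exclusive/Failover give classic pub-sub semantics.

```scala
import org.apache.pulsar.client.api.{PulsarClient, Schema, SubscriptionType}

object PulsarDemo extends App {
  val client = PulsarClient.builder()
    .serviceUrl("pulsar://localhost:6650")        // placeholder broker URL
    .build()

  val producer = client.newProducer(Schema.STRING).topic("my-topic").create()
  producer.send("hello pulsar")

  // SubscriptionType.Shared = point-to-point queuing;
  // Exclusive/Failover = pub-sub streaming -- same API either way.
  val consumer = client.newConsumer(Schema.STRING)
    .topic("my-topic")
    .subscriptionName("demo-subscription")
    .subscriptionType(SubscriptionType.Shared)
    .subscribe()

  val msg = consumer.receive()
  println(msg.getValue)
  consumer.acknowledge(msg)                        // ack so it is not redelivered
  client.close()
}
```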
ONLY THE MOST USED
STREAMING FRAMEWORKS
FRAMEWORKS – A QUICK LIST OF 10
 Apache Spark
 Apache Flink
 Apache Samza
 Apache Storm
 Apache Kafka
 Apache Flume
 Apache NiFi
 Apache Ignite
 Apache Apex
 Apache Beam
SPARK STREAMING – OLD STYLE
• Stream converted into micro-batches
• No watermarks
• No event-time-based processing
• Loses data in several ways
Old school (see the DStream sketch below)
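For reference, a minimal old-style DStream job (classic Spark Streaming API; host and port are placeholders). Note the fixed micro-batch interval and the ingestion-time-only view of the data:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-demo")
    // The stream is chopped into 5-second micro-batches; each batch is an RDD.
    val ssc = new StreamingContext(conf, Seconds(5))

    ssc.socketTextStream("localhost", 9999)        // e.g. fed by `nc -lk 9999`
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)                          // per-batch counts only
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```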
SPARK STRUCTURED STREAMING
• Real streaming (windowing, triggers,
watermarks)
• Natively DataFrames (no RDDs)
• Processes with event time, handling late
data
• End-to-end exactly-once guarantees
Introduced in Spark 2.x, matured in 2.4 (see the sketch below)
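A minimal sketch of event-time windowing with a watermark (Spark 2.4-era API; the built-in rate source is used so the example is self-contained):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object EventTimeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("event-time-demo")
      .master("local[*]")
      .getOrCreate()

    // The built-in "rate" source emits rows of (timestamp, value).
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 10)
      .load()

    // Watermark: rows up to 30 seconds late are still assigned to their
    // event-time window; anything older is dropped from the state.
    val counts = events
      .withWatermark("timestamp", "30 seconds")
      .groupBy(window(col("timestamp"), "10 seconds"))
      .count()

    counts.writeStream
      .outputMode("update")        // emit only windows updated in each trigger
      .format("console")
      .start()
      .awaitTermination()
  }
}
```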
APACHE FLINK
• Flink relies on a streaming execution model, which is an
intuitive fit for processing unbounded datasets.
• Streaming execution is continuous processing of data that is
continuously produced; this alignment between the type of
dataset and the type of execution model offers many
advantages with regard to accuracy and performance.
• It provides results that are accurate, even in the case of out-
of-order or late-arriving data.
• It is stateful and fault-tolerant and can seamlessly recover
from failures while maintaining exactly-once application state.
• It performs at large scale, running on thousands of nodes with
very good throughput and latency characteristics.
• Flink guarantees exactly-once semantics for stateful
computations.
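A minimal continuous word count with Flink's Scala DataStream API (Flink 1.x; host and port are placeholders): each record updates the running state as it arrives, with no micro-batching.

```scala
import org.apache.flink.streaming.api.scala._

object FlinkWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    env.socketTextStream("localhost", 9999)  // unbounded source (`nc -lk 9999`)
      .flatMap(_.toLowerCase.split("\\s+"))
      .map((_, 1))
      .keyBy(_._1)                           // partition the stream by word
      .sum(1)                                // stateful running count per word
      .print()

    env.execute("flink-wordcount")
  }
}
```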
APACHE NIFI
• Apache NiFi supports powerful and scalable directed graphs
of data routing, transformation, and system mediation logic.
• Web-based user interface:
▪ Seamless experience between design, control, feedback,
and monitoring.
• Highly configurable:
▪ Loss tolerance vs. guaranteed delivery
▪ Low latency vs. high throughput
▪ Dynamic prioritization
▪ Flows can be modified at runtime
▪ Back pressure
• Designed for extension:
▪ Build your own processors and more; enables rapid
development and effective testing.
• Security:
▪ SSL, SSH, HTTPS, encrypted content, etc.
▪ Multi-tenant authorization and internal
authorization/policy management
APACHE IGNITE
• Apache Ignite In-Memory Data Fabric is a high-performance,
integrated, and distributed in-memory platform for computing
and transacting on large-scale data sets in real time, orders of
magnitude faster than is possible with traditional disk-based or
flash-based technologies.
• You can view Ignite as a collection of independent, well-
integrated, in-memory components geared to improve
performance and scalability of your application. Some of these
components include:
• Advanced Clustering, Data Grid
• SQL Grid, Streaming & CEP
• Compute Grid, Service Grid
• Ignite File System, Distributed Data Structures
• Distributed Messaging, Distributed Events
• Hadoop Accelerator, Spark Shared RDDs
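As a small taste of the Data Grid component, a hedged sketch of starting an Ignite node and using a distributed cache through the Java API from Scala (default configuration; the cache name is a placeholder):

```scala
import org.apache.ignite.Ignition

object IgniteDemo extends App {
  // Starts an in-memory node with the default configuration.
  val ignite = Ignition.start()

  // A distributed key-value cache, partitioned across the cluster nodes.
  val cache = ignite.getOrCreateCache[Int, String]("demo-cache")
  cache.put(1, "hello ignite")
  println(cache.get(1))

  ignite.close()
}
```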
APACHE BEAM
• Apache Beam is an open source, unified programming model that
you can use to create a data processing pipeline.
• You start by building a program that defines the pipeline using one
of the open source Beam SDKs.
• The pipeline is then executed by one of Beam’s supported
distributed processing back-ends, which include Apache Apex,
Apache Flink, Apache Spark, and Google Cloud Dataflow.
• Apache Beam provides an advanced unified programming model,
allowing you to implement batch and streaming data processing
jobs that can run on any execution engine.
• Apache Beam is:
▪ UNIFIED - Use a single programming model for both batch
and streaming use cases.
▪ PORTABLE - Execute pipelines on multiple execution
environments, including Apache Apex, Apache Flink, Apache
Spark, and Google Cloud Dataflow.
▪ EXTENSIBLE - Write and share new SDKs, IO connectors, and
transformation libraries.
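Beam's canonical SDKs are Java, Python, and Go; from Scala, a common choice is Scio, Spotify's Scala API on top of Beam. A minimal word-count sketch (Scio 0.8-era API; input and output paths are placeholders), runnable on any supported Beam back-end:

```scala
import com.spotify.scio._

object BeamWordCount {
  def main(cmdlineArgs: Array[String]): Unit = {
    // The runner (Direct, Flink, Spark, Dataflow, ...) is chosen via options.
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    sc.textFile(args.getOrElse("input", "input.txt"))
      .flatMap(_.split("\\s+").filter(_.nonEmpty))
      .countByValue
      .map { case (word, n) => s"$word: $n" }
      .saveAsTextFile(args.getOrElse("output", "wordcount-out"))

    sc.run()
  }
}
```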
ONLY THE MOST USED
STREAMING LIBRARIES
REACTIVE STREAMS
"Reactive Streams is an initiative to provide a standard for asynchronous stream processing with non-blocking back pressure."
In the future, all data processing will be managed by streams.
Adopters:
 Akka Streams
 MongoDB
 Ratpack
 Reactive Rabbit – driver for RabbitMQ/AMQP
 Spring and Pivotal Project Reactor
 Netflix RxJava
 Slick 3.0
 Vert.x 3.0
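The whole standard is four interfaces (Publisher, Subscriber, Subscription, Processor). A minimal Subscriber sketch showing how demand expresses non-blocking back pressure (org.reactivestreams 1.0 API):

```scala
import org.reactivestreams.{Subscriber, Subscription}

// A Publisher may never emit more elements than the Subscriber has
// requested -- that bounded demand is the back-pressure contract.
class PrintSubscriber extends Subscriber[Int] {
  private var subscription: Subscription = _

  override def onSubscribe(s: Subscription): Unit = {
    subscription = s
    s.request(1)                     // ask for the first element only
  }

  override def onNext(element: Int): Unit = {
    println(s"got $element")
    subscription.request(1)          // pull the next one when ready
  }

  override def onError(t: Throwable): Unit = t.printStackTrace()
  override def onComplete(): Unit = println("done")
}
```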
BACK PRESSURE CONCEPTS
 The main players in managing flow are Publishers and
Subscribers (Consumers)
BACK PRESSURE CONCEPTS
 Dropping (discard elements the consumer cannot absorb)
 Buffer overflow (unbounded buffering until memory runs out)
BACK PRESSURE CONCEPTS
https://doc.akka.io/docs/alpakka/current/
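With Akka Streams (the library behind Alpakka) these choices are explicit. A minimal sketch, assuming Akka 2.6: an OverflowStrategy decides between back-pressuring, dropping, or failing when a buffer fills up.

```scala
import akka.actor.ActorSystem
import akka.stream.OverflowStrategy
import akka.stream.scaladsl.Source
import scala.concurrent.duration._

object BackpressureDemo extends App {
  implicit val system: ActorSystem = ActorSystem("backpressure-demo")
  import system.dispatcher

  Source(1 to 100)                              // fast producer
    // When the 16-element buffer fills: backpressure slows the producer;
    // dropHead/dropTail/dropNew would drop elements; fail would abort.
    .buffer(16, OverflowStrategy.backpressure)
    .throttle(10, 1.second)                     // simulate a slow consumer
    .runForeach(n => println(s"consumed $n"))
    .onComplete(_ => system.terminate())
}
```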
ZIO STREAM
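A minimal ZStream pipeline, assuming the ZIO 1.0 API: the stream is a pure value, nothing runs until the runtime interprets it, and back pressure falls out of the pull-based evaluation.

```scala
import zio._
import zio.console._
import zio.stream._

object ZioStreamDemo extends App {
  // A description of the pipeline -- no side effects happen here.
  val pipeline =
    ZStream
      .fromIterable(1 to 10)
      .map(_ * 2)
      .filter(_ % 3 == 0)
      .foreach(n => putStrLn(s"element: $n"))   // effectful sink

  override def run(args: List[String]): URIO[ZEnv, ExitCode] =
    pipeline.exitCode
}
```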
LIBRARY STREAM COMPARISON
Editor's Notes

  • #7 Data streaming is real-time / unbounded data processing.
 ▪ Real-time processing and analytics bear the promise of making organizations more …
 ▪ Many of the systems we want to monitor and understand produce continuous streams of events like heartbeats, ocean currents, machine metrics, and GPS signals.
 ▪ Even analysis of sporadic events, such as website traffic, can benefit from a streaming data approach.
 ▪ There are many potential advantages of handling data as streams, but until recently this method was somewhat difficult to do well.
 ▪ Streaming data and real-time analytics formed a fairly specialized undertaking rather than a widespread approach.
  • #17 Apache Spark:
 ▪ Apache Spark is a lightning-fast cluster computing technology, designed for fast computation.
 ▪ It is based on Hadoop MapReduce and extends the MapReduce model to use it efficiently for more types of computation, including interactive queries and stream processing.
 ▪ The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
 ▪ Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming.
 ▪ Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.
  • #18 Spark Streaming is a separate library in Spark to process continuously flowing streaming data. It provides the DStream API, which is powered by Spark RDDs: DStreams hand us the data divided into chunks (RDDs) received from the streaming source, to be processed and, after processing, sent to the destination. Cool, right?!
From the Spark 2.x release onwards, Structured Streaming came into the picture. Built on the Spark SQL library, Structured Streaming is another way to handle streaming with Spark, based on the DataFrame and Dataset APIs: with it we can easily apply any SQL query (using the DataFrame API) or Scala operations (using the Dataset API) on streaming data. That was the summarized theory for both ways of streaming in Spark; now we need to compare the two.
Distinctions:
1. Real streaming. Real streaming implies data which is unbounded and processed upon being received from the source. Spark Streaming does not satisfy this: it works on micro-batches. The stream pipeline is registered with some operations, Spark polls the source after every batch duration (defined in the application), and a batch is created from the received data; each incoming record belongs to a DStream batch, and each batch represents an RDD. Structured Streaming works on the same architecture of polling the data after some duration, based on your trigger interval, but it has no batch concept: the data received in a trigger is appended to the continuously flowing data stream, each row is processed, and the result is updated into the unbounded result table. How you get your result (updated, new results only, or all results) depends on the mode of your operations (Complete, Update, Append). Winner of this round: Structured Streaming.
2. RDDs vs. DataFrames/Datasets. Spark Streaming works on the DStream API, which internally uses RDDs, while Structured Streaming uses the DataFrame and Dataset APIs, so it is a straight comparison between RDDs and DataFrames. The comparisons all lead to one result: DataFrames are more optimized in terms of processing and provide more options for aggregations and other operations, with a wide variety of functions available (many more supported natively in Spark 2.4). So Structured Streaming wins here with flying colors.
3. Processing with event time, handling late data. One big issue in the streaming world is how to process data according to event time, the time when the event actually happened. The source need not deliver data in real time; there may be latencies in data generation and in handing the data over to the processing engine. Spark Streaming has no option to work on event time: it only works with the timestamp at which the data is received by Spark, and based on that ingestion timestamp it puts the data in a batch even if the event was generated earlier and belonged to an earlier batch, which may yield less accurate information, effectively equal to data loss. Structured Streaming, on the other hand, can process data on the basis of event time when the event's timestamp is included in the received data. This major feature provides a different way of processing data according to the time of its generation in the real world, so we can handle late-arriving data and get more accurate results. With event-time handling of late data, Structured Streaming outweighs Spark Streaming.
4. End-to-end guarantees. Every application requires fault tolerance and end-to-end guarantees of data delivery: whenever it fails, it must restart from the point of failure to avoid data loss and duplication. Both Spark Streaming and Structured Streaming use checkpointing to save job progress, but that approach alone still has holes which may cause data loss. Structured Streaming adds two conditions for recovering from any error: the source must be replayable, and the sinks must support idempotent operations so reprocessing after failures is safe. See the docs to learn more. With such restricted sinks, Structured Streaming provides end-to-end exactly-once semantics. Way to go, Structured Streaming!
5. Restricted or flexible sinks. The sink is the destination of a streaming operation: external storage, simple console output, or any action. Spark Streaming places no restriction on the sink: the foreachRDD method returns the RDDs created by each batch one by one, and we can perform any action over them, like saving to storage or running computations; we can also cache an RDD and perform multiple actions on it (even sending the data to multiple databases). In Structured Streaming, until v2.3, there was a limited number of output sinks, only one operation could be performed per sink, and output could not be saved to multiple external stores; a custom sink required implementing ForeachWriter. Spark 2.4 introduced a new sink, foreachBatch, which hands us each resultant output table as a DataFrame for custom operations. With this new sink, the "restricted" Structured Streaming is now more flexible, closing the gap with Spark Streaming's flexible sink model.
  • #25–#33 Spark SQL (DB, JSON, case class)