Alberto Paro
 Master's Degree in Computer Science Engineering at Politecnico di Milano
 Big Data Practice Leader at NTTDATA Italia
 Author of 4 books about ElasticSearch (versions 1.x to 7.x), plus 6 tech reviews
 Big Data trainer, developer, and consultant on big data technologies (Akka,
Play Framework, Apache Spark, Reactive Programming) and NoSQL (Accumulo,
 HBase, Cassandra, ElasticSearch, Kafka, and MongoDB)
 Evangelist for the Scala and Scala.js languages
SUMMARY
• Why?
• Architectures
• Message Brokers
• Streaming Frameworks
• Streaming Libraries
THE START OF THE JOURNEY
WHY STREAM PROCESSING
NEED FOR STREAMING
Business:
• Real-time/unbounded data processing is a key winning factor (e.g. banking, finance, sports).
• No longer tied to nightly batch processing.
• Real-time processing reduces time-to-market.
• Fast feedback on customers (e.g. campaign monitoring).
• Real-time balancing of resources (demand-response).
Technical:
• Many of the systems we want to monitor and understand produce continuous streams of events, such as heartbeats, machine metrics, and GPS signals.
• Distribute data processing in time (no more big batch jobs, where possible).
• Reduce the processing power needed in big data environments.
• Application decoupling: separation of concerns on data (the Kafka way).
• Manage backpressure in the application flow.
STANDARD STREAMING FLOW
• Source
• Message Broker
• Streaming Engine
• Destination
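To make the four stages concrete, here is a purely illustrative Scala sketch; all type and method names are hypothetical, not from any specific library:

```scala
// Hypothetical types modeling the four stages of a streaming flow.
final case class Event(payload: String)

trait EventSource     { def poll(): Iterator[Event] }                 // sensors, logs, ...
trait MessageBroker   { def publish(topic: String, e: Event): Unit
                        def subscribe(topic: String): Iterator[Event] }
trait StreamingEngine { def process(in: Iterator[Event]): Iterator[Event] }
trait Destination     { def write(e: Event): Unit }                   // DB, index, ...

// Ingest from the source into the broker, then process and sink.
def run(src: EventSource, broker: MessageBroker,
        engine: StreamingEngine, dst: Destination): Unit = {
  src.poll().foreach(broker.publish("events", _))
  engine.process(broker.subscribe("events")).foreach(dst.write)
}
```

In real deployments the broker (e.g. Kafka, RabbitMQ, Pulsar) decouples the source from the engine, so each stage can scale and fail independently.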
CONFLUENT KAFKA-LIKE ARCHITECTURE
TOP THREE
MESSAGE BROKERS
RABBITMQ
• RabbitMQ is an open-source message-broker software
(sometimes called message-oriented middleware) that
originally implemented the Advanced Message Queuing
Protocol (AMQP) and has since been extended with a plug-in
architecture to support Streaming Text Oriented Messaging
Protocol (STOMP), MQ Telemetry Transport (MQTT), and
other protocols.
• The RabbitMQ server program is written in the Erlang
programming language and is built on the Open Telecom
Platform framework for clustering and failover.
• Client libraries to interface with the broker are available for all
major programming languages.
• Rabbit Technologies Ltd. originally developed RabbitMQ.
Rabbit Technologies started as a joint venture between LShift
and CohesiveFT in 2007, and was acquired in April 2010 by
SpringSource, a division of VMware. The project became part
of Pivotal Software in May 2013.
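As a minimal illustration, here is a hedged sketch of publishing a message from Scala with the official Java client (com.rabbitmq:amqp-client); the host and queue name are placeholders:

```scala
import com.rabbitmq.client.ConnectionFactory

object RabbitPublish extends App {
  val factory = new ConnectionFactory()
  factory.setHost("localhost")                 // placeholder broker host

  val connection = factory.newConnection()
  val channel    = connection.createChannel()

  // Declare a non-durable, non-exclusive, non-auto-delete queue.
  channel.queueDeclare("hello", false, false, false, null)
  channel.basicPublish("", "hello", null, "Hello, AMQP!".getBytes("UTF-8"))

  channel.close()
  connection.close()
}
```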
APACHE KAFKA
• Kafka Streams is a client library for real-time stream
processing and analysis of data stored in Kafka brokers.
• The Streams API allows an application to act as a stream
processor, consuming an input stream from one or more
topics and producing an output stream to one or more output
topics, effectively transforming the input streams to output
streams.
• In Kafka a stream processor is anything that takes continual
streams of data from input topics, performs some processing
on this input, and produces continual streams of data to
output topics.
• It is possible to do simple processing directly with the
producer and consumer APIs.
• The Streams API, however, allows building applications that do
non-trivial processing, such as computing aggregations over
streams or joining streams together (see the sketch below).
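A minimal word-count topology with the Kafka Streams Scala DSL (API as of Kafka 2.x; the topic names are placeholders) looks roughly like this:

```scala
import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder

object WordCountApp extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-demo")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder
  builder.stream[String, String]("input-topic")   // consume an input topic
    .flatMapValues(_.toLowerCase.split("\\s+"))   // transform: split into words
    .groupBy((_, word) => word)
    .count()                                      // stateful aggregation
    .toStream
    .to("word-counts")                            // produce to an output topic

  new KafkaStreams(builder.build(), props).start()
}
```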
APACHE PULSAR
• Apache Pulsar is an open-source distributed pub-sub messaging
system originally created at Yahoo.
• Like Kafka, Pulsar uses the concept of topics and subscriptions to
create order from large amounts of streaming data in a scalable
and low-latency manner. In addition to publish and subscribe,
Pulsar can support point-to-point message queuing from a single
API. Like Kafka, the project relies on ZooKeeper (for coordination
and metadata), and it uses Apache BookKeeper for durable, ordered log storage.
• The creators of Pulsar say they developed it to address several
shortcomings of existing open source messaging systems. It has
been running in production at Yahoo since 2014 and was open
sourced in 2016. Pulsar is backed by a commercial open source
outfit called Streamlio.
• Pulsar’s strengths include multi-tenancy, geo-replication, and
strong durability guarantees, high message throughput, as well as a
single API for both queuing and publish-subscribe messaging.
Scaling a Pulsar cluster is as easy as adding additional nodes.
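A hedged sketch of that single API, using the Pulsar Java client from Scala (the service URL, topic, and subscription names are placeholders): a Shared subscription gives queue-like point-to-point semantics, while Exclusive/Failover give classic pub-sub semantics.

```scala
import org.apache.pulsar.client.api.{PulsarClient, Schema, SubscriptionType}

object PulsarDemo extends App {
  val client = PulsarClient.builder()
    .serviceUrl("pulsar://localhost:6650")        // placeholder broker URL
    .build()

  val producer = client.newProducer(Schema.STRING).topic("my-topic").create()
  producer.send("hello pulsar")

  // SubscriptionType.Shared = point-to-point queuing;
  // Exclusive/Failover = pub-sub streaming -- same API either way.
  val consumer = client.newConsumer(Schema.STRING)
    .topic("my-topic")
    .subscriptionName("demo-subscription")
    .subscriptionType(SubscriptionType.Shared)
    .subscribe()

  val msg = consumer.receive()
  println(msg.getValue)
  consumer.acknowledge(msg)                        // ack so it is not redelivered
  client.close()
}
```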
ONLY THE MOST USED
STREAMING FRAMEWORKS
FRAMEWORKS – A QUICK LIST OF 10
 Apache Spark
 Apache Flink
 Apache Samza
 Apache Storm
 Apache Kafka
 Apache Flume
 Apache NiFi
 Apache Ignite
 Apache Apex
 Apache Beam
SPARK STREAMING – OLD STYLE
• Stream converted into micro-batches
• No watermarks
• No event-time-based processing
• Loses data in several ways
Old school (see the DStream sketch below)
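For reference, a minimal old-style DStream job (classic Spark Streaming API; host and port are placeholders). Note the fixed micro-batch interval and the ingestion-time-only view of the data:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-demo")
    // The stream is chopped into 5-second micro-batches; each batch is an RDD.
    val ssc = new StreamingContext(conf, Seconds(5))

    ssc.socketTextStream("localhost", 9999)        // e.g. fed by `nc -lk 9999`
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)                          // per-batch counts only
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```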
SPARK STRUCTURED STREAMING
• Real streaming (windowing, triggers,
watermarks)
• Natively DataFrames (no RDDs)
• Processes with event time, handling late
data
• End-to-end exactly-once guarantees
Introduced in Spark 2.x, matured in 2.4 (see the sketch below)
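A minimal sketch of event-time windowing with a watermark (Spark 2.4-era API; the built-in rate source is used so the example is self-contained):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object EventTimeDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("event-time-demo")
      .master("local[*]")
      .getOrCreate()

    // The built-in "rate" source emits rows of (timestamp, value).
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", 10)
      .load()

    // Watermark: rows up to 30 seconds late are still assigned to their
    // event-time window; anything older is dropped from the state.
    val counts = events
      .withWatermark("timestamp", "30 seconds")
      .groupBy(window(col("timestamp"), "10 seconds"))
      .count()

    counts.writeStream
      .outputMode("update")        // emit only windows updated in each trigger
      .format("console")
      .start()
      .awaitTermination()
  }
}
```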
APACHE FLINK
• Flink relies on a streaming execution model, which is an
intuitive fit for processing unbounded datasets.
• Streaming execution is continuous processing of data that is
continuously produced; this alignment between the type of
dataset and the type of execution model offers many
advantages with regard to accuracy and performance.
• It provides results that are accurate, even in the case of out-
of-order or late-arriving data.
• It is stateful and fault-tolerant and can seamlessly recover
from failures while maintaining exactly-once application state.
• It performs at large scale, running on thousands of nodes with
very good throughput and latency characteristics.
• Flink guarantees exactly-once semantics for stateful
computations.
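A minimal continuous word count with Flink's Scala DataStream API (Flink 1.x; host and port are placeholders): each record updates the running state as it arrives, with no micro-batching.

```scala
import org.apache.flink.streaming.api.scala._

object FlinkWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    env.socketTextStream("localhost", 9999)  // unbounded source (`nc -lk 9999`)
      .flatMap(_.toLowerCase.split("\\s+"))
      .map((_, 1))
      .keyBy(_._1)                           // partition the stream by word
      .sum(1)                                // stateful running count per word
      .print()

    env.execute("flink-wordcount")
  }
}
```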
APACHE NIFI
• Apache NiFi supports powerful and scalable directed graphs
of data routing, transformation, and system mediation logic.
• Web-based user interface:
▪ Seamless experience between design, control, feedback,
and monitoring.
• Highly configurable:
▪ Loss tolerance vs. guaranteed delivery
▪ Low latency vs. high throughput
▪ Dynamic prioritization
▪ Flows can be modified at runtime
▪ Back pressure
• Designed for extension:
▪ Build your own processors and more; enables rapid
development and effective testing.
• Security:
▪ SSL, SSH, HTTPS, encrypted content, etc.
▪ Multi-tenant authorization and internal
authorization/policy management
APACHE IGNITE
• Apache Ignite In-Memory Data Fabric is a high-performance,
integrated, and distributed in-memory platform for computing
and transacting on large-scale data sets in real time, orders of
magnitude faster than is possible with traditional disk-based or
flash-based technologies.
• You can view Ignite as a collection of independent, well-
integrated, in-memory components geared to improve
performance and scalability of your application. Some of these
components include:
• Advanced Clustering, Data Grid
• SQL Grid, Streaming & CEP
• Compute Grid, Service Grid
• Ignite File System, Distributed Data Structures
• Distributed Messaging, Distributed Events
• Hadoop Accelerator, Spark Shared RDDs
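As a small taste of the Data Grid component, a hedged sketch of starting an Ignite node and using a distributed cache through the Java API from Scala (default configuration; the cache name is a placeholder):

```scala
import org.apache.ignite.Ignition

object IgniteDemo extends App {
  // Starts an in-memory node with the default configuration.
  val ignite = Ignition.start()

  // A distributed key-value cache, partitioned across the cluster nodes.
  val cache = ignite.getOrCreateCache[Int, String]("demo-cache")
  cache.put(1, "hello ignite")
  println(cache.get(1))

  ignite.close()
}
```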
APACHE BEAM
• Apache Beam is an open source, unified programming model that
you can use to create a data processing pipeline.
• You start by building a program that defines the pipeline using one
of the open source Beam SDKs.
• The pipeline is then executed by one of Beam’s supported
distributed processing back-ends, which include Apache Apex,
Apache Flink, Apache Spark, and Google Cloud Dataflow.
• Apache Beam provides an advanced unified programming model,
allowing you to implement batch and streaming data processing
jobs that can run on any execution engine.
• Apache Beam is:
▪ UNIFIED - Use a single programming model for both batch
and streaming use cases.
▪ PORTABLE - Execute pipelines on multiple execution
environments, including Apache Apex, Apache Flink, Apache
Spark, and Google Cloud Dataflow.
▪ EXTENSIBLE - Write and share new SDKs, IO connectors, and
transformation libraries.
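Beam's canonical SDKs are Java, Python, and Go; from Scala, a common choice is Scio, Spotify's Scala API on top of Beam. A minimal word-count sketch (Scio 0.8-era API; input and output paths are placeholders), runnable on any supported Beam back-end:

```scala
import com.spotify.scio._

object BeamWordCount {
  def main(cmdlineArgs: Array[String]): Unit = {
    // The runner (Direct, Flink, Spark, Dataflow, ...) is chosen via options.
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    sc.textFile(args.getOrElse("input", "input.txt"))
      .flatMap(_.split("\\s+").filter(_.nonEmpty))
      .countByValue
      .map { case (word, n) => s"$word: $n" }
      .saveAsTextFile(args.getOrElse("output", "wordcount-out"))

    sc.run()
  }
}
```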
ONLY THE MOST USED
STREAMING LIBRARIES
REACTIVE STREAMS
"Reactive Streams is an initiative to provide a standard for asynchronous stream processing with non-blocking back pressure."
In the future, all data processing will be managed by streams.
Adopters:
 Akka Streams
 MongoDB
 Ratpack
 Reactive Rabbit – driver for RabbitMQ/AMQP
 Spring and Pivotal Project Reactor
 Netflix RxJava
 Slick 3.0
 Vert.x 3.0
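The whole standard is four interfaces (Publisher, Subscriber, Subscription, Processor). A minimal Subscriber sketch showing how demand expresses non-blocking back pressure (org.reactivestreams 1.0 API):

```scala
import org.reactivestreams.{Subscriber, Subscription}

// A Publisher may never emit more elements than the Subscriber has
// requested -- that bounded demand is the back-pressure contract.
class PrintSubscriber extends Subscriber[Int] {
  private var subscription: Subscription = _

  override def onSubscribe(s: Subscription): Unit = {
    subscription = s
    s.request(1)                     // ask for the first element only
  }

  override def onNext(element: Int): Unit = {
    println(s"got $element")
    subscription.request(1)          // pull the next one when ready
  }

  override def onError(t: Throwable): Unit = t.printStackTrace()
  override def onComplete(): Unit = println("done")
}
```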
BACK PRESSURE CONCEPTS
 The main players in managing flow are Publishers and
Subscribers (Consumers)
BACK PRESSURE CONCEPTS
 Dropping (discard elements the consumer cannot absorb)
 Buffer overflow (unbounded buffering until memory runs out)
BACK PRESSURE CONCEPTS
https://doc.akka.io/docs/alpakka/current/
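With Akka Streams (the library behind Alpakka) these choices are explicit. A minimal sketch, assuming Akka 2.6: an OverflowStrategy decides between back-pressuring, dropping, or failing when a buffer fills up.

```scala
import akka.actor.ActorSystem
import akka.stream.OverflowStrategy
import akka.stream.scaladsl.Source
import scala.concurrent.duration._

object BackpressureDemo extends App {
  implicit val system: ActorSystem = ActorSystem("backpressure-demo")
  import system.dispatcher

  Source(1 to 100)                              // fast producer
    // When the 16-element buffer fills: backpressure slows the producer;
    // dropHead/dropTail/dropNew would drop elements; fail would abort.
    .buffer(16, OverflowStrategy.backpressure)
    .throttle(10, 1.second)                     // simulate a slow consumer
    .runForeach(n => println(s"consumed $n"))
    .onComplete(_ => system.terminate())
}
```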
ZIO STREAM
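A minimal ZStream pipeline, assuming the ZIO 1.0 API: the stream is a pure value, nothing runs until the runtime interprets it, and back pressure falls out of the pull-based evaluation.

```scala
import zio._
import zio.console._
import zio.stream._

object ZioStreamDemo extends App {
  // A description of the pipeline -- no side effects happen here.
  val pipeline =
    ZStream
      .fromIterable(1 to 10)
      .map(_ * 2)
      .filter(_ % 3 == 0)
      .foreach(n => putStrLn(s"element: $n"))   // effectful sink

  override def run(args: List[String]): URIO[ZEnv, ExitCode] =
    pipeline.exitCode
}
```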
LIBRARY STREAM COMPARISON
Editor's Notes

  • #7 Data streaming is real-time / unbounded data processing.
 ▪ Real-time processing and analytics bear the promise of making organizations more …
 ▪ Many of the systems we want to monitor and understand produce continuous streams of events like heartbeats, ocean currents, machine metrics, and GPS signals.
 ▪ Even analysis of sporadic events, such as website traffic, can benefit from a streaming data approach.
 ▪ There are many potential advantages of handling data as streams, but until recently this method was somewhat difficult to do well.
 ▪ Streaming data and real-time analytics formed a fairly specialized undertaking rather than a widespread approach.
  • #17 Apache Spark:
 ▪ Apache Spark is a lightning-fast cluster computing technology, designed for fast computation.
 ▪ It is based on Hadoop MapReduce and extends the MapReduce model to use it efficiently for more types of computation, including interactive queries and stream processing.
 ▪ The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application.
 ▪ Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries, and streaming.
 ▪ Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.
  • #18 Spark Streaming is a separate library in Spark to process continuously flowing streaming data. It provides the DStream API, which is powered by Spark RDDs: DStreams hand us the data divided into chunks (RDDs) received from the streaming source, to be processed and, after processing, sent to the destination. Cool, right?!
From the Spark 2.x release onwards, Structured Streaming came into the picture. Built on the Spark SQL library, Structured Streaming is another way to handle streaming with Spark, based on the DataFrame and Dataset APIs: with it we can easily apply any SQL query (using the DataFrame API) or Scala operations (using the Dataset API) on streaming data. That was the summarized theory for both ways of streaming in Spark; now we need to compare the two.
Distinctions:
1. Real streaming. Real streaming implies data which is unbounded and processed upon being received from the source. Spark Streaming does not satisfy this: it works on micro-batches. The stream pipeline is registered with some operations, Spark polls the source after every batch duration (defined in the application), and a batch is created from the received data; each incoming record belongs to a DStream batch, and each batch represents an RDD. Structured Streaming works on the same architecture of polling the data after some duration, based on your trigger interval, but it has no batch concept: the data received in a trigger is appended to the continuously flowing data stream, each row is processed, and the result is updated into the unbounded result table. How you get your result (updated, new results only, or all results) depends on the mode of your operations (Complete, Update, Append). Winner of this round: Structured Streaming.
2. RDDs vs. DataFrames/Datasets. Spark Streaming works on the DStream API, which internally uses RDDs, while Structured Streaming uses the DataFrame and Dataset APIs, so it is a straight comparison between RDDs and DataFrames. The comparisons all lead to one result: DataFrames are more optimized in terms of processing and provide more options for aggregations and other operations, with a wide variety of functions available (many more supported natively in Spark 2.4). So Structured Streaming wins here with flying colors.
3. Processing with event time, handling late data. One big issue in the streaming world is how to process data according to event time, the time when the event actually happened. The source need not deliver data in real time; there may be latencies in data generation and in handing the data over to the processing engine. Spark Streaming has no option to work on event time: it only works with the timestamp at which the data is received by Spark, and based on that ingestion timestamp it puts the data in a batch even if the event was generated earlier and belonged to an earlier batch, which may yield less accurate information, effectively equal to data loss. Structured Streaming, on the other hand, can process data on the basis of event time when the event's timestamp is included in the received data. This major feature provides a different way of processing data according to the time of its generation in the real world, so we can handle late-arriving data and get more accurate results. With event-time handling of late data, Structured Streaming outweighs Spark Streaming.
4. End-to-end guarantees. Every application requires fault tolerance and end-to-end guarantees of data delivery: whenever it fails, it must restart from the point of failure to avoid data loss and duplication. Both Spark Streaming and Structured Streaming use checkpointing to save job progress, but that approach alone still has holes which may cause data loss. Structured Streaming adds two conditions for recovering from any error: the source must be replayable, and the sinks must support idempotent operations so reprocessing after failures is safe. See the docs to learn more. With such restricted sinks, Structured Streaming provides end-to-end exactly-once semantics. Way to go, Structured Streaming!
5. Restricted or flexible sinks. The sink is the destination of a streaming operation: external storage, simple console output, or any action. Spark Streaming places no restriction on the sink: the foreachRDD method returns the RDDs created by each batch one by one, and we can perform any action over them, like saving to storage or running computations; we can also cache an RDD and perform multiple actions on it (even sending the data to multiple databases). In Structured Streaming, until v2.3, there was a limited number of output sinks, only one operation could be performed per sink, and output could not be saved to multiple external stores; a custom sink required implementing ForeachWriter. Spark 2.4 introduced a new sink, foreachBatch, which hands us each resultant output table as a DataFrame for custom operations. With this new sink, the "restricted" Structured Streaming is now more flexible, closing the gap with Spark Streaming's flexible sink model.
  • #25–#33 Spark SQL (DB, JSON, case class)