Streaming data involves the continuous analysis of data as it is generated in real time. It allows data to be processed and transformed in memory before being stored. Popular streaming technologies include Apache Storm, Apache Flink, and Apache Spark Streaming, which process streams of data across clusters. Each technology has its own approach, such as micro-batching, but all aim to enable real-time analysis of high-velocity data streams.
Data Streaming For Big Data
CMP652 Next Generation Database
Systems
Seval Çapraz
Content
•
1. What, Why, How of Streaming Big Data
•
2. Overview of Data Management Systems
– Vendors, Architectures, Ecosystem
•
3. The Most Popular Streaming Technologies
– Apache Storm, Apache Flink, Spark Streaming
•
Summary
•
Questions and Answers
•
References
What is streaming data?
•
Streaming data is an analytic computing platform that is focused on
speed.
•
By streaming, data can be continuously analyzed and transformed in
memory before it is stored on a disk.
•
It is a real time processing technique.
● All definitions are taken from reference [1]
Why Streaming Data?
•
Businesses are dealing with a lot of data that needs to be
processed and analyzed in real time.
•
Therefore, the physical environment that supports this level of
responsiveness is critical.
•
Streaming data environments typically require a clustered
hardware solution, and sometimes a massively parallel processing
approach will be required to handle the analysis.
•
Defining properties or dimensions of big data are volume, variety,
and velocity. Streaming technology can cover these three.
● All definitions are taken from reference [1]
How Stream Processing?
•
Stream processing is a computer programming paradigm, equivalent to
dataflow programming, event stream processing, and reactive
programming.
•
It is the real-time processing of data continuously, concurrently, and in a
record-by-record fashion.
•
Processing streams of data works by processing “time windows” of data in
memory across a cluster of servers.
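The "time windows" idea can be sketched in a few lines of Python. This is a minimal, single-process illustration of fixed (tumbling) windows, not the implementation of any particular engine; the function name `tumbling_window_counts`, the `(timestamp, key)` event shape, and the sample events are assumptions made for the example. A real engine would spread the same grouping across a cluster of servers.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[float, str]], window_seconds: float
) -> Dict[int, Dict[str, int]]:
    """Count events per key inside fixed, non-overlapping time windows."""
    windows: Dict[int, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        # Each event falls into exactly one window, identified by its index.
        window_index = int(timestamp // window_seconds)
        windows[window_index][key] += 1
    # Convert the nested defaultdicts into plain dicts for the caller.
    return {index: dict(counts) for index, counts in windows.items()}


# Hypothetical sample events: (timestamp in seconds, event type).
events = [(0.5, "click"), (1.2, "view"), (4.9, "click"), (5.1, "click"), (9.8, "view")]
counts = tumbling_window_counts(events, window_seconds=5.0)
print(counts)
```

With a 5-second window, the first three events land in window 0 and the last two in window 1, so each window can be summarized and released from memory independently.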
When to use streaming?
•
Some key principles define when using streams is most appropriate:
– When it is necessary to determine a retail buying opportunity
at the point of engagement, either via social media or via
permission-based messaging
– Collecting information about the movement around a secure
site
– To be able to react to an event that needs an immediate
response, such as a service outage or a change in a patient’s
medical condition
– Real-time calculation of costs that are dependent on variables
such as usage and available resources
● All definitions are taken from reference [1]
Single-pass Analysis
•
One important factor about streaming data analysis is the fact
that it is a single-pass analysis.
•
In other words, the analyst cannot reanalyze the data after it is
streamed.
•
This is common in applications where you are looking for the
absence of data.
•
If several passes are required, the data will have to be put into
some sort of warehouse where additional analysis can be
performed.
● All definitions are taken from reference [1]
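As an illustration of single-pass analysis, the sketch below keeps a running mean and variance that are updated once per record and never revisit earlier data. It uses Welford's online algorithm, which is chosen here only as a well-known example and is not taken from the references; the class name `RunningStats` and the sample readings are assumptions.

```python
class RunningStats:
    """Running mean and population variance, updated one record at a time."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        # Welford's update: each value is seen once and then discarded.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        return self._m2 / self.count if self.count else 0.0


stats = RunningStats()
for reading in [2, 4, 4, 4, 5, 5, 7, 9]:  # hypothetical sensor readings
    stats.update(reading)
print(stats.count, stats.mean, stats.variance)
```

Only three numbers are held in memory regardless of how long the stream runs. Any question that needs the raw records again would require storing them first.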
Streaming data vs. Hadoop
•
Analyzing streaming data is similar to the approach
used for managing data at rest with
Hadoop.
•
The primary difference is the issue of velocity.
•
In the Hadoop cluster, data is collected in
batch mode and then processed.
● All definitions are taken from reference [1]
Speed matters less in Hadoop than it does in data streaming.
An architecture of a big data processing service
● All images are taken from reference [3]
Big Data Analytics Ecosystem
•
Recently, each architectural layer changed dramatically in terms of
the software stack when services such as Yahoo!, Twitter, and LinkedIn
released open source solutions for dealing with big data.
•
The new architecture:
– Apache Kafka serves as a high-throughput distributed in-memory
messaging system in the data ingestion layer,
– Apache Storm as a distributed and fault-tolerant real-time
computation system in the data analytics layer,
– Apache Cassandra as a NoSQL database in the data storage layer.
● All definitions are taken from reference [3]
A simple instance of a large-scale data stream-processing service
● All images are taken from reference [3]
Capability Analysis of Recent Open Source Stream-Processing Systems
[13] L. Neumeyer et al., “S4: Distributed Stream Computing Platform,” Proc. IEEE Int’l Conf. on Data Mining Workshops, 2010,
pp. 170–177.
● Table is taken from reference [3]
[12] M. Zaharia et al., “Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters,”
Proc. 4th Usenix Conf. Hot Topics in Cloud Computing, 2012.
Some Streaming Computation Engines
•
Three open-source streaming engines:
– Apache Storm
– Apache Flink
– Apache Spark Streaming
● All definitions and images are taken from reference [4]
Apache Storm
● All definitions and images are taken from reference [4]
•
Apache Storm is a free and open source distributed realtime
computation system.
•
Apache Storm has the TopologyBuilder API to create a directed graph
(topology) through which streams of data flow.
•
“Spouts” are the entry point to the graph, and “bolts” perform the
processing.
•
Data flows through the system as individual tuples.
•
Graphs are not necessarily acyclic (although that is often the case)
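The spout/bolt model can be mimicked with plain Python generators. The sketch below is only a conceptual stand-in: it does not use Storm's TopologyBuilder API, runs in one process, and forms a simple linear chain. The names `sentence_spout`, `split_bolt`, and `count_bolt`, and the sample sentences, are all hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def sentence_spout() -> Iterator[Tuple[str]]:
    """Spout: entry point of the graph, emitting one tuple per sentence."""
    for sentence in ["the quick fox", "the lazy dog"]:
        yield (sentence,)


def split_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt: turn each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str, int]]:
    """Bolt: keep a running count per word and emit (word, count) tuples."""
    counts: Counter = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


# Wire the graph: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout()))
results = list(topology)
print(results)
```

Each tuple moves through the chain individually, so the second occurrence of "the" is emitted as ("the", 2) as soon as it arrives rather than at the end of a batch.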
● All definitions are taken from reference [6]
● All images are taken from reference [4]
•
Storm is fast: a benchmark clocked it at over a million tuples
processed per second per node.
•
A Storm topology consumes streams of data and processes those
streams in arbitrarily complex ways, repartitioning the streams
between each stage of the computation however needed.
Apache Flink
•
Apache Flink is an open-source stream processing framework for
distributed, high-performing, always-available, and accurate data
streaming applications.[7]
•
Apache Flink has the DataStream API to perform operations on
streams of data. (map, filter, reduce, join, etc.)
•
These operations are turned into a graph at job submission time by
Flink.
•
It works similarly to Storm’s model.
•
Flink also supports a Storm-compatible API.
● All definitions and images are taken from reference [4]
● All definitions and images are taken from reference [4]
•
Flink is designed to run on large-scale clusters with many thousands
of nodes, in a standalone cluster mode as well as on cluster
resource managers such as YARN.
•
Flink’s core is a distributed streaming dataflow engine, meaning that
data is processed one event at a time rather than as a series of
batches.
Apache Spark Streaming
•
Apache Spark is a fast and general engine for large-scale data
processing.
•
Apache Spark has the DStream API to perform operations on streams
of data (map, filter, reduce, join, etc.). It is based on Spark’s RDD
(Resilient Distributed Dataset) abstraction.
•
The API is similar to Flink’s. However, streaming is accomplished
through micro-batches.
•
A Spark Streaming job consists of one small batch after another.
● All definitions and images are taken from reference [4]
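The micro-batching idea can be shown without Spark itself. The sketch below groups an incoming stream into small batches and processes each batch as a unit. It is a simplification under stated assumptions: it does not use the DStream API, and batches are cut by a fixed record count (the hypothetical `batch_size` parameter) so the result is deterministic, whereas Spark Streaming cuts batches by a time interval.

```python
from typing import Iterable, Iterator, List


def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group a stream into consecutive small batches of at most batch_size."""
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


# Hypothetical stream of seven numeric events, summed one batch at a time.
batch_sums = [sum(batch) for batch in micro_batches(range(1, 8), batch_size=3)]
print(batch_sums)
```

The trade-off is visible even at this scale: no result is available for an event until its batch closes, which adds latency compared with event-at-a-time processing but lets each batch be handled with ordinary batch operations.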
● All definitions and images are taken from reference [4]
•
A Resilient Distributed Dataset (RDD) is the basic abstraction in
Spark.
•
Using RDDs, Spark hides data partitioning and provides a parallel
computation framework with an API for four mainstream
programming languages.
Comparison of Streaming Technologies
[Figure: 99th percentile latency versus throughput rate (events/sec) for Storm 0.10, Storm 0.11, Storm 0.11 (no ack), Flink, and Spark]
•
Benchmark is taken from reference [4].
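For readers unfamiliar with the latency metric used in this comparison, the 99th percentile is the value below which 99% of measured latencies fall. The sketch below computes it with the nearest-rank method. The method, the function name `percentile_nearest_rank`, and the sample latencies are assumptions for illustration and do not reproduce how the benchmark in reference [4] measured its results.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], percentile: int) -> float:
    """Return the given percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Nearest rank: smallest position covering `percentile` percent of values.
    rank = math.ceil(percentile * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies of 1 ms through 100 ms, one per event.
latencies_ms = list(range(1, 101))
p99 = percentile_nearest_rank(latencies_ms, 99)
print(p99)
```

Tail percentiles are preferred over averages in latency benchmarks because a small share of very slow events can be hidden by a good mean.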
Summary
•
Streaming data processing is beneficial in most scenarios where new,
dynamic data is generated on a continual basis. It applies to most of
the industry segments and big data use cases.[5]
•
Stream processing requires ingesting a sequence of data, and
incrementally updating metrics, reports, and summary statistics in
response to each arriving data record. It is better suited for real-time
monitoring and response functions.[5]
•
There are a few popular streaming data platforms, such as Apache
Storm, Apache Flink, and Apache Spark Streaming.
•
Each of the streaming platforms has its own advantages and
disadvantages. Active communities for big data processing projects
continue to innovate and benefit from each other’s advancements.
References
•
[1] Judith Hurwitz, Alan Nugent, Fern Halper, Marcia Kaufman, "How to Use Data Streaming
For Big Data", Dummies.com, 2017.
•
[2] Sanjai Marimadaiah (CA Technologies), “Big Data, Big Opportunity: A Primer for
Understanding The Big Data Frontier”, CA World 2015.
•
[3] Rajiv Ranjan, “Streaming Big Data Processing in Datacenter Clouds”, IEEE Cloud
Computing, vol. 1, no. 1, pp. 73-83, 2014.
•
[4] Reza Farivar, Kyle Knusbaum, “Performance Comparison of Streaming Big Data
Platforms”, DataWorks Summit/Hadoop Summit, 2016.
•
[5] “What is Streaming Data?”, https://aws.amazon.com/streaming-data/
•
[6] “Why use Storm?”, http://storm.apache.org/
•
[7] “Introduction to Flink”, https://flink.apache.org/