CS8091 / Big Data Analytics
III Year / VI Semester
UNIT IV - STREAM MEMORY
Introduction to Streams Concepts – Stream Data Model and
Architecture - Stream Computing, Sampling Data in a Stream –
Filtering Streams – Counting Distinct Elements in a Stream –
Estimating moments – Counting oneness in a Window – Decaying
Window – Real time Analytics Platform(RTAP) applications - Case
Studies - Real Time Sentiment Analysis, Stock Market Predictions.
Using Graph Analytics for Big Data: Graph Analytics.
Stream Computing
 A high-performance computer system that analyzes
multiple data streams from many sources.
 Stream computing means pulling in streams of
data, processing the data, and streaming it back out as a
single flow.
 It uses software algorithms that analyze the data in real
time as it streams in, to increase speed and accuracy when
handling and analyzing data.
 Stream computing delivers real-time analytic
processing on constantly changing data in
motion.
 It allows you to capture and analyze all data,
all the time, just in time.
Stream computing analyzes data before you store it.
 Analyze data that is in motion (Velocity)
 Process any type of data (Variety)
 Streams is designed to scale to process any size of
data, from terabytes to zettabytes per day.
 Store less
 Analyze more
 Make better decisions, faster
 Data Stream processing platforms:
 Many of these are open source solutions.
These platforms facilitate the construction of real-time
applications, in particular message-oriented or event-
driven applications which support ingress of messages
or events at a very high rate, transfer to subsequent
processing, and generation of alerts.
 These platforms are mostly focused on supporting
event-driven data flow through nodes in a distributed
system or within a cloud infrastructure platform.
 The Hadoop ecosystem covers a family of projects
that fall under the umbrella of infrastructure for
distributed computing and large data processing.
 Hadoop includes a number of components, listed below:
 MapReduce, a distributed data processing model and
execution environment that runs on large clusters of
commodity machines.
 Hadoop Distributed File System (HDFS), a distributed file
system that runs on large clusters of commodity machines
 ZooKeeper, a distributed, highly available coordination service,
providing primitives such as distributed locks that can be used for
building distributed applications.
 Pig, a dataflow language and execution environment for exploring
very large datasets. Pig runs on HDFS and MapReduce clusters.
 Hive, a distributed data warehouse.
 Hadoop was developed to support processing large sets of
structured, unstructured, and semi-structured data,
but it was designed as a batch processing system.
 Data Stream processing platforms – SPARK:
 Apache Spark is a more recent framework that combines an
engine for distributing programs across clusters of
machines with a model for writing programs on top of it.
 It is aimed at addressing the needs of the data scientist
community, in particular supporting the Read-Evaluate-Print
Loop (REPL) approach for playing with data interactively.
 Spark maintains MapReduce’s linear scalability and
fault tolerance, but extends it in three important ways:
 First, rather than relying on a rigid map-then-reduce format,
its engine can execute a more general directed acyclic graph
(DAG) of operators. This means that in situations where
MapReduce must write out intermediate results to the
distributed file system, Spark can pass them directly to the
next step in the pipeline.
 Second, it complements this capability with a rich set of
transformations that enable users to express computation more
naturally.
 Third, Spark supports in-memory processing across a cluster of
machines, thus not relying on the use of storage for recording
intermediate data, as in MapReduce.
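The difference between a rigid map-then-reduce cycle and a general operator pipeline can be sketched in plain Python. This is a conceptual illustration, not Spark itself; the operator names (`map_op`, `filter_op`, `reduce_op`) are invented for the sketch. The point is that each stage streams its results directly into the next, with no intermediate materialization to storage:

```python
# Conceptual sketch: chaining operators as a lazy pipeline, so intermediate
# results flow directly between steps instead of being written out to a
# distributed file system between each map/reduce cycle.

def map_op(records, fn):
    # Lazily apply fn to each record; nothing is materialized.
    return (fn(r) for r in records)

def filter_op(records, pred):
    return (r for r in records if pred(r))

def reduce_op(records, fn, init):
    acc = init
    for r in records:
        acc = fn(acc, r)
    return acc

# Pipeline: square -> keep evens -> sum. Each stage streams into the next.
data = range(10)
stage1 = map_op(data, lambda x: x * x)
stage2 = filter_op(stage1, lambda x: x % 2 == 0)
total = reduce_op(stage2, lambda a, b: a + b, 0)
print(total)  # sum of the even squares of 0..9: 120
```

In MapReduce terms, each arrow between stages would be a write to and read from HDFS; in a DAG engine the records pass through in memory.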
 Spark integrates with a variety of tools in the
Hadoop ecosystem.
 It can read and write data in all of the data formats supported by
MapReduce.
 It can read from and write to NoSQL databases like HBase and
Cassandra.
 It is well suited for real-time processing and analysis, supporting
scalable, high throughput, and fault-tolerant processing of live data
streams.
 Spark Streaming generates a discretized stream
(DStream) as a continuous stream of data.
 On the input side, Spark Streaming receives live
input data streams through a receiver and divides the data
into micro-batches, which are then processed by the
Spark engine to generate the final stream of results in
batches.
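The micro-batching idea above can be sketched in plain Python. This is a conceptual illustration, not the Spark Streaming API; the names `micro_batches` and `process` are invented for the sketch:

```python
# Conceptual sketch of micro-batching: a receiver collects incoming records
# and cuts the (potentially unbounded) stream into fixed-size micro-batches,
# each handed as a unit to a processing engine.

from itertools import islice

def micro_batches(stream, batch_size):
    """Yield successive micro-batches from a record stream."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def process(batch):
    # Stand-in for the engine: compute one result per batch.
    return sum(batch)

live_input = range(1, 11)  # pretend these records arrive continuously
results = [process(b) for b in micro_batches(live_input, 3)]
print(results)  # [6, 15, 24, 10]
```

In Spark Streaming the batch boundary is a time interval rather than a record count, but the effect is the same: the engine sees a sequence of small, bounded batches instead of one unbounded stream.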
 Spark Streaming uses a small, deterministic batch
interval (on the order of seconds) to divide the stream
into processable units.
 The size of the interval dictates throughput and
latency, so the larger the interval, the higher the
throughput and the latency.
 Because the Spark core framework exploits main
memory (as opposed to Storm, which relies on
ZooKeeper), its mini-batch processing can appear as
fast as the one-at-a-time processing adopted in
Storm, despite the fact that RDD units are
larger than Storm tuples.
 The benefit of mini-batching is improved throughput
in the internal engine: it reduces data-shipping
overhead, such as per-message overhead at the
ISO/OSI transport layer, which allows threads
to concentrate on computation.
 Spark was written in Scala, but it comes with libraries
and wrappers that allow the use of R or Python.
 Data Stream processing platforms – Storm:
 Storm is a distributed real-time computation system
for processing large volumes of high-velocity data.
 It makes it easy to reliably process unbounded streams
of data and has a relatively simple processing model
owing to the use of powerful abstractions.
 A spout is a source of streams in a computation.
 Typically, a spout reads from a queuing broker, such as
RabbitMQ, or Kafka, but a spout can also generate its own
stream or read from somewhere like the Twitter streaming
API.
 Spout implementations already exist for most queuing
systems.
 A bolt processes any number of input streams and
produces any number of new output streams.
 Bolts are event-driven components and cannot be used to
read data; that is what spouts are designed for.
 Most of the logic of a computation goes into bolts, such as
functions, filters, streaming joins, streaming aggregations,
talking to databases, and so on.
 A topology is a DAG of spouts and bolts, with each
edge in the DAG representing a bolt subscribing to the
output stream of some other spout or bolt.
 A topology is an arbitrarily complex multistage stream
computation; topologies run indefinitely when deployed.
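The spout/bolt model can be sketched in plain Python. This is a conceptual illustration, not the real Storm API; the names `word_spout`, `split_bolt`, and `count_bolt` are invented for the sketch:

```python
# Conceptual sketch of a Storm-style topology: a spout emits tuples, and
# each bolt subscribes to the output stream of an upstream spout or bolt,
# forming a small DAG: word_spout -> split_bolt -> count_bolt.

from collections import Counter

def word_spout():
    # Source of the stream; a real spout would read from a broker
    # such as Kafka or RabbitMQ.
    for sentence in ["to be or not to be", "that is the question"]:
        yield sentence

def split_bolt(stream):
    # Bolt: consumes one input stream, emits a new stream of words.
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    # Bolt: a streaming aggregation over its input stream.
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts

# Wire the topology by subscribing each stage to the previous one.
counts = count_bolt(split_bolt(word_spout()))
print(counts["to"], counts["be"])  # 2 2
```

Unlike this sketch, a real topology is distributed, runs indefinitely, and processes tuples as they arrive rather than draining a finite source.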
 Trident provides a set of high-level abstractions in Storm
that were developed to facilitate programming of real-time
applications on top of Storm infrastructure.
 It supports joins, aggregations, grouping, functions, and
filters. In addition to these, Trident adds primitives for
doing stateful, incremental processing on top of any
database or persistence store.
 Data Stream processing platforms – KAFKA:
 Kafka is an open source message broker project
developed by the Apache Software Foundation and
written in Scala.
 The project aims to provide a unified, high-
throughput, low-latency platform for handling real-
time data feeds.
 A single Kafka broker can handle hundreds of
megabytes of reads and writes per second from
thousands of clients.
 In order to support high availability and horizontal
scalability, data streams are partitioned and spread
over a cluster of machines.
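The partitioning idea can be sketched in plain Python. This is a conceptual illustration, not the real Kafka API; the byte-sum hash and the `partition_for` name are invented stand-ins (Kafka's default partitioner hashes the record key modulo the partition count):

```python
# Conceptual sketch of key-based stream partitioning: records with the same
# key always land in the same partition, so the stream can be spread over a
# cluster of machines while preserving per-key ordering.

def partition_for(key, num_partitions):
    # Stand-in for a key hash; deterministic so the demo is reproducible.
    return sum(key.encode()) % num_partitions

NUM_PARTITIONS = 3
partitions = {p: [] for p in range(NUM_PARTITIONS)}

for key, value in [("user1", "click"), ("user2", "view"),
                   ("user1", "buy"), ("user3", "click")]:
    partitions[partition_for(key, NUM_PARTITIONS)].append((key, value))

# All records for "user1" land in the same partition, in arrival order.
p = partition_for("user1", NUM_PARTITIONS)
print([v for k, v in partitions[p] if k == "user1"])  # ['click', 'buy']
```

In a real cluster each partition lives on a broker (with replicas), and consumers scale horizontally by reading different partitions in parallel.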
 Kafka depends on Zookeeper from the Hadoop
ecosystem for coordination of processing nodes.
 Kafka is mainly used where applications need
very high throughput for message processing
while meeting low-latency, high-availability,
and high-scalability requirements.
 Data Stream processing platforms – Flume:
 Flume is a distributed, reliable, and available service
for efficiently collecting, aggregating, and moving
large amounts of log data.
 It is robust and fault tolerant, with tunable reliability
mechanisms and many failover and recovery
mechanisms. It uses a simple, extensible data model
that allows for online analytic applications.
 While Flume and Kafka both can act as the event
backbone for real-time event processing, they have
different characteristics.
 Flume is better suited in cases when one needs to
support data ingestion and simple event processing.
 Data Stream processing platforms – Amazon Kinesis:
 Amazon Kinesis is a cloud-based service for real-time data
processing over large, distributed data streams.
 Amazon Kinesis can continuously capture and store
terabytes of data per hour from hundreds of thousands of
sources such as website clickstreams, financial
transactions, social media feeds, IT logs, and location-
tracking events.
 Kinesis allows integration with Storm, as it provides a
Kinesis Storm Spout that fetches data from a Kinesis
stream and emits it as tuples.
 Including this Kinesis component in a Storm topology
provides a reliable and scalable stream capture,
storage, and replay service.