CS8091 / Big Data Analytics
III Year / VI Semester
UNIT IV - STREAM MEMORY
Introduction to Streams Concepts – Stream Data Model and
Architecture - Stream Computing, Sampling Data in a Stream –
Filtering Streams – Counting Distinct Elements in a Stream –
Estimating moments – Counting oneness in a Window – Decaying
Window – Real time Analytics Platform(RTAP) applications - Case
Studies - Real Time Sentiment Analysis, Stock Market Predictions.
Using Graph Analytics for Big Data: Graph Analytics.
Stream Computing
 A high-performance computer system that analyzes
multiple data streams from many sources.
 Stream computing means pulling in streams of
data, processing the data, and streaming it back out as a
single flow.
 It uses software algorithms that analyze the data in real
time as it streams in, to increase speed and accuracy when
handling and analyzing data.
 Stream computing delivers real-time analytic
processing on constantly changing data in
motion.
 It allows you to capture and analyze all data,
all the time, just in time.
Stream computing analyzes data before you store it.
 Analyze data that is in motion (Velocity)
 Process any type of data (Variety)
 Streams is designed to scale to process any size of
data, from terabytes to zettabytes per day.
 Store less
 Analyze more
 Make better decisions, faster
 Data Stream processing platforms:
 Many of these are open source solutions.
These platforms facilitate the construction of real-time
applications, in particular message-oriented or event-
driven applications which support ingress of messages
or events at a very high rate, transfer to subsequent
processing, and generation of alerts.
 These platforms are mostly focused on supporting
event-driven data flow through nodes in a distributed
system or within a cloud infrastructure platform.
 The Hadoop ecosystem covers a family of projects
that fall under the umbrella of infrastructure for
distributed computing and large data processing.
 Hadoop includes a number of components, listed below:
 MapReduce, a distributed data processing model and
execution environment that runs on large clusters of
commodity machines.
 Hadoop Distributed File System (HDFS), a distributed file
system that runs on large clusters of commodity machines
 ZooKeeper, a distributed, highly available coordination service,
providing primitives such as distributed locks that can be used for
building distributed applications.
 Pig, a dataflow language and execution environment for exploring
very large datasets. Pig runs on HDFS and MapReduce clusters.
 Hive, a distributed data warehouse.
 Hadoop was developed to support processing large sets of
structured, unstructured, and semi-structured data,
but it was designed as a batch processing system.
 Data Stream processing platforms – SPARK:
 Apache Spark is a more recent framework that combines an
engine for distributing programs across clusters of
machines with a model for writing programs on top of it.
 It is aimed at addressing the needs of the data scientist
community, in particular supporting the Read-Evaluate-Print
Loop (REPL) approach for playing with data interactively.
 Spark maintains MapReduce’s linear scalability and
fault tolerance, but extends it in three important ways:
 First, rather than relying on a rigid map-then-reduce format,
its engine can execute a more general directed acyclic graph
(DAG) of operators. This means that in situations where
MapReduce must write out intermediate results to the
distributed file system, Spark can pass them directly to the
next step in the pipeline.
 Second, it complements this capability with a rich set of
transformations that enable users to express computation more
naturally.
 Third, Spark supports in-memory processing across a cluster of
machines, thus not relying on the use of storage for recording
intermediate data, as in MapReduce.
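The difference between a rigid map-then-reduce cycle and a general operator pipeline can be sketched in plain Python. This is a conceptual illustration, not Spark itself; the operator names (`map_op`, `filter_op`, `reduce_op`) are invented for the sketch. The point is that each stage streams its results directly into the next, with no intermediate materialization to storage:

```python
# Conceptual sketch: chaining operators as a lazy pipeline, so intermediate
# results flow directly between steps instead of being written out to a
# distributed file system between each map/reduce cycle.

def map_op(records, fn):
    # Lazily apply fn to each record; nothing is materialized.
    return (fn(r) for r in records)

def filter_op(records, pred):
    return (r for r in records if pred(r))

def reduce_op(records, fn, init):
    acc = init
    for r in records:
        acc = fn(acc, r)
    return acc

# Pipeline: square -> keep evens -> sum. Each stage streams into the next.
data = range(10)
stage1 = map_op(data, lambda x: x * x)
stage2 = filter_op(stage1, lambda x: x % 2 == 0)
total = reduce_op(stage2, lambda a, b: a + b, 0)
print(total)  # sum of the even squares of 0..9: 120
```

In MapReduce terms, each arrow between stages would be a write to and read from HDFS; in a DAG engine the records pass through in memory.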
 Spark integrates with a variety of tools in the
Hadoop ecosystem.
 It can read and write data in all of the data formats supported by
MapReduce.
 It can read from and write to NoSQL databases like HBase and
Cassandra.
 It is well suited for real-time processing and analysis, supporting
scalable, high throughput, and fault-tolerant processing of live data
streams.
 Spark Streaming generates a discretized stream
(DStream) as a continuous stream of data.
 On the input side, Spark Streaming receives live
input data streams through a receiver and divides the data
into micro-batches, which are then processed by the
Spark engine to generate the final stream of results in
batches.
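The micro-batching idea above can be sketched in plain Python. This is a conceptual illustration, not the Spark Streaming API; the names `micro_batches` and `process` are invented for the sketch:

```python
# Conceptual sketch of micro-batching: a receiver collects incoming records
# and cuts the (potentially unbounded) stream into fixed-size micro-batches,
# each handed as a unit to a processing engine.

from itertools import islice

def micro_batches(stream, batch_size):
    """Yield successive micro-batches from a record stream."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def process(batch):
    # Stand-in for the engine: compute one result per batch.
    return sum(batch)

live_input = range(1, 11)  # pretend these records arrive continuously
results = [process(b) for b in micro_batches(live_input, 3)]
print(results)  # [6, 15, 24, 10]
```

In Spark Streaming the batch boundary is a time interval rather than a record count, but the effect is the same: the engine sees a sequence of small, bounded batches instead of one unbounded stream.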
 Spark Streaming uses a small, deterministic batch
interval (on the order of seconds) to divide the stream
into processable units.
 The size of the interval dictates throughput and
latency, so the larger the interval, the higher the
throughput and the latency.
 Because the Spark core framework exploits main
memory (as opposed to Storm, which relies on
ZooKeeper), its mini-batch processing can appear as
fast as the one-at-a-time processing adopted in
Storm, despite the fact that RDD units are
larger than Storm tuples.
 The benefit of mini-batching is improved throughput
in the internal engine: it reduces data-shipping
overhead, such as per-message overhead at the
ISO/OSI transport layer, which allows threads
to concentrate on computation.
 Spark was written in Scala, but it comes with libraries
and wrappers that allow the use of R or Python.
 Data Stream processing platforms – Storm:
 Storm is a distributed real-time computation system
for processing large volumes of high-velocity data.
 It makes it easy to reliably process unbounded streams
of data and has a relatively simple processing model
owing to the use of powerful abstractions.
 A spout is a source of streams in a computation.
 Typically, a spout reads from a queuing broker, such as
RabbitMQ, or Kafka, but a spout can also generate its own
stream or read from somewhere like the Twitter streaming
API.
 Spout implementations already exist for most queuing
systems.
 A bolt processes any number of input streams and
produces any number of new output streams.
 Bolts are event-driven components and cannot be used to
read data; that is what spouts are designed for.
 Most of the logic of a computation goes into bolts, such as
functions, filters, streaming joins, streaming aggregations,
talking to databases, and so on.
 A topology is a DAG of spouts and bolts, with each
edge in the DAG representing a bolt subscribing to the
output stream of some other spout or bolt.
 A topology is an arbitrarily complex multistage stream
computation; topologies run indefinitely when deployed.
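The spout/bolt model can be sketched in plain Python. This is a conceptual illustration, not the real Storm API; the names `word_spout`, `split_bolt`, and `count_bolt` are invented for the sketch:

```python
# Conceptual sketch of a Storm-style topology: a spout emits tuples, and
# each bolt subscribes to the output stream of an upstream spout or bolt,
# forming a small DAG: word_spout -> split_bolt -> count_bolt.

from collections import Counter

def word_spout():
    # Source of the stream; a real spout would read from a broker
    # such as Kafka or RabbitMQ.
    for sentence in ["to be or not to be", "that is the question"]:
        yield sentence

def split_bolt(stream):
    # Bolt: consumes one input stream, emits a new stream of words.
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    # Bolt: a streaming aggregation over its input stream.
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts

# Wire the topology by subscribing each stage to the previous one.
counts = count_bolt(split_bolt(word_spout()))
print(counts["to"], counts["be"])  # 2 2
```

Unlike this sketch, a real topology is distributed, runs indefinitely, and processes tuples as they arrive rather than draining a finite source.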
 Trident provides a set of high-level abstractions in Storm
that were developed to facilitate programming of real-time
applications on top of Storm infrastructure.
 It supports joins, aggregations, grouping, functions, and
filters. In addition to these, Trident adds primitives for
doing stateful, incremental processing on top of any
database or persistence store.
 Data Stream processing platforms – KAFKA:
 Kafka is an open source message broker project
developed by the Apache Software Foundation and
written in Scala.
 The project aims to provide a unified, high-
throughput, low-latency platform for handling real-
time data feeds.
 A single Kafka broker can handle hundreds of
megabytes of reads and writes per second from
thousands of clients.
 In order to support high availability and horizontal
scalability, data streams are partitioned and spread
over a cluster of machines.
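The partitioning idea can be sketched in plain Python. This is a conceptual illustration, not the real Kafka API; the byte-sum hash and the `partition_for` name are invented stand-ins (Kafka's default partitioner hashes the record key modulo the partition count):

```python
# Conceptual sketch of key-based stream partitioning: records with the same
# key always land in the same partition, so the stream can be spread over a
# cluster of machines while preserving per-key ordering.

def partition_for(key, num_partitions):
    # Stand-in for a key hash; deterministic so the demo is reproducible.
    return sum(key.encode()) % num_partitions

NUM_PARTITIONS = 3
partitions = {p: [] for p in range(NUM_PARTITIONS)}

for key, value in [("user1", "click"), ("user2", "view"),
                   ("user1", "buy"), ("user3", "click")]:
    partitions[partition_for(key, NUM_PARTITIONS)].append((key, value))

# All records for "user1" land in the same partition, in arrival order.
p = partition_for("user1", NUM_PARTITIONS)
print([v for k, v in partitions[p] if k == "user1"])  # ['click', 'buy']
```

In a real cluster each partition lives on a broker (with replicas), and consumers scale horizontally by reading different partitions in parallel.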
 Kafka depends on Zookeeper from the Hadoop
ecosystem for coordination of processing nodes.
 Kafka is mainly used where applications need
very high throughput for message processing
while meeting low-latency, high-availability,
and high-scalability requirements.
 Data Stream processing platforms – Flume:
 Flume is a distributed, reliable, and available service
for efficiently collecting, aggregating, and moving
large amounts of log data.
 It is robust and fault tolerant, with tunable reliability
mechanisms and many failover and recovery
mechanisms. It uses a simple, extensible data model
that allows for online analytic applications.
 While Flume and Kafka both can act as the event
backbone for real-time event processing, they have
different characteristics.
 Flume is better suited in cases when one needs to
support data ingestion and simple event processing.
 Data Stream processing platforms – Amazon Kinesis:
 Amazon Kinesis is a cloud-based service for real-time data
processing over large, distributed data streams.
 Amazon Kinesis can continuously capture and store
terabytes of data per hour from hundreds of thousands of
sources such as website clickstreams, financial
transactions, social media feeds, IT logs, and location-
tracking events.
 Kinesis allows integration with Storm, as it provides a
Kinesis Storm Spout that fetches data from a Kinesis
stream and emits it as tuples.
 Including this Kinesis component in a Storm topology
provides a reliable and scalable stream capture,
storage, and replay service.