CS8091 / Big Data Analytics
III Year / VI Semester
UNIT IV - STREAM MEMORY
Introduction to Streams Concepts – Stream Data Model and
Architecture - Stream Computing, Sampling Data in a Stream –
Filtering Streams – Counting Distinct Elements in a Stream –
Estimating moments – Counting Ones in a Window – Decaying
Window – Real time Analytics Platform (RTAP) applications - Case
Studies - Real Time Sentiment Analysis, Stock Market Predictions.
Using Graph Analytics for Big Data: Graph Analytics.
Stream Computing
 Stream computing uses a high-performance computer system to analyze
multiple data streams from many sources.
 The term means pulling in streams of data, processing the data,
and streaming it back out as a single flow.
 It uses software algorithms that analyze the data in real time as
it streams in, increasing speed and accuracy in data handling and
analysis.
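The "pull in, process, stream back out as a single flow" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event layout `(timestamp, source, reading)`, the function names, and the threshold rule are all hypothetical, and each input stream is assumed to be already ordered by timestamp.

```python
import heapq
from typing import Iterable, Iterator, Tuple

# Hypothetical event layout: (timestamp, source name, reading)
Event = Tuple[int, str, float]


def merge_streams(*streams: Iterable[Event]) -> Iterator[Event]:
    """Pull in several time-ordered streams and yield one time-ordered flow."""
    return heapq.merge(*streams, key=lambda event: event[0])


def flag_high_readings(events: Iterable[Event],
                       threshold: float) -> Iterator[Tuple[int, str, float, bool]]:
    """Process each event as it arrives and emit it with an alert flag."""
    for timestamp, source, reading in events:
        yield (timestamp, source, reading, reading > threshold)


# Two small in-memory lists stand in for live sources.
sensor_a = [(1, "a", 20.5), (4, "a", 31.0)]
sensor_b = [(2, "b", 19.0), (3, "b", 35.2)]

single_flow = list(flag_high_readings(merge_streams(sensor_a, sensor_b),
                                      threshold=30.0))
for record in single_flow:
    print(record)
```

Both functions are lazy generators, so each event is processed as it is pulled rather than after the whole input has been collected.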
 Stream computing delivers real-time analytic
processing on constantly changing data in
motion.
 It allows all data to be captured and analyzed, all
the time, just in time.
 Stream computing analyzes data before it is stored.
 Analyze data that is in motion (Velocity)
 Process any type of data (Variety)
 Streams is designed to scale to process any volume of
data, from terabytes to zettabytes per day.
 Store less
 Analyze more
 Make better decisions, faster
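One way to picture "store less, analyze more" is a statistic that is updated per element and never keeps the raw data. The sketch below uses Welford's online algorithm for mean and variance; the class name `RunningStats` and the sample readings are hypothetical, chosen only for illustration.

```python
class RunningStats:
    """Mean and population variance updated one value at a time (Welford)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        """Fold one new stream element into the summary; the value is not kept."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        """Population variance of everything seen so far."""
        return self._m2 / self.count if self.count else 0.0


stats = RunningStats()
for reading in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):
    stats.update(reading)

print(stats.count, stats.mean, stats.variance)
```

The memory used stays constant no matter how many elements arrive, which is the property a stream summary needs.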
 Data Stream processing platforms:
 Many of these are open source solutions.
These platforms facilitate the construction of real-time
applications, in particular message-oriented or event-
driven applications which support ingress of messages
or events at a very high rate, transfer to subsequent
processing, and generation of alerts.
 These platforms are mostly focused on supporting
event-driven data flow through nodes in a distributed
system or within a cloud infrastructure platform.
 The Hadoop ecosystem covers a family of projects
that fall under the umbrella of infrastructure for
distributed computing and large data processing.
 Hadoop includes a number of components, among them:
 MapReduce, a distributed data processing model and
execution environment that runs on large clusters of
commodity machines.
 Hadoop Distributed File System (HDFS), a distributed file
system that runs on large clusters of commodity machines.
 ZooKeeper, a distributed, highly available coordination service,
providing primitives such as distributed locks that can be used for
building distributed applications.
 Pig, a dataflow language and execution environment for exploring
very large datasets. Pig runs on HDFS and MapReduce clusters.
 Hive, a distributed data warehouse.
 Hadoop was developed to support processing of large sets of
structured, unstructured, and semi-structured data,
but it was designed as a batch processing system.
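The batch character of the MapReduce model can be illustrated with a word count written in plain Python. This is a single-process sketch, not Hadoop code: the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical, and the whole input is assumed to be available before the job starts.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values of one key into a single count."""
    return (key, sum(values))


def word_count(lines: List[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce over a complete, finite input."""
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["big data stream", "stream data", "data"])
print(counts)
```

The reduce step cannot start until the shuffle has seen every pair, which is why this model fits finite datasets better than unbounded streams.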
 Data Stream processing platforms – SPARK:
 Apache Spark is a more recent framework that combines an
engine for distributing programs across clusters of
machines with a model for writing programs on top of it.
 It is aimed at the needs of the data science community,
in particular by supporting a Read-Evaluate-Print Loop (REPL)
approach for exploring data interactively.
 Spark maintains MapReduce’s linear scalability and
fault tolerance, but extends it in three important ways:
 First, rather than relying on a rigid map-then-reduce format,
its engine can execute a more general directed acyclic graph
(DAG) of operators. This means that in situations where
MapReduce must write out intermediate results to the
distributed file system, Spark can pass them directly to the
next step in the pipeline.
 Second, it complements this capability with a rich set of
transformations that enable users to express computation more
naturally.
 Third, Spark supports in-memory processing across a cluster of
machines, so it does not rely on storage for recording
intermediate data, as MapReduce does.
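The idea of chaining operators and handing intermediate results straight to the next step can be mimicked in plain Python with generators. The `Pipeline` class below is a hypothetical toy, not the Spark API, and it only models a linear chain (the simplest case of a DAG) on a single machine.

```python
from typing import Callable, Iterable, Iterator, List, Optional

Operator = Callable[[Iterator], Iterator]


class Pipeline:
    """A lazy chain of operators; nothing runs until collect() is called."""

    def __init__(self, source: Iterable,
                 operators: Optional[List[Operator]] = None) -> None:
        self._source = source
        self._operators: List[Operator] = operators or []

    def map(self, fn: Callable) -> "Pipeline":
        """Record a transformation step without executing it."""
        step = lambda items: (fn(x) for x in items)
        return Pipeline(self._source, self._operators + [step])

    def filter(self, predicate: Callable) -> "Pipeline":
        """Record a filtering step without executing it."""
        step = lambda items: (x for x in items if predicate(x))
        return Pipeline(self._source, self._operators + [step])

    def collect(self) -> list:
        """Wire the generators together and pull the results through in memory."""
        items: Iterator = iter(self._source)
        for operator in self._operators:
            items = operator(items)
        return list(items)


result = (Pipeline(range(10))
          .map(lambda x: x * x)
          .filter(lambda x: x % 2 == 0)
          .collect())
print(result)
```

Each element flows through every step before the next element is read, so no intermediate collection is written out between steps.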
 Spark supports integration with a variety of tools in the
Hadoop ecosystem.
 It can read and write data in all of the data formats supported by
MapReduce.
 It can read from and write to NoSQL databases like HBase and
Cassandra.
 It is well suited for real-time processing and analysis, supporting
scalable, high throughput, and fault-tolerant processing of live data
streams.
 Spark Streaming represents a continuous stream of data
as a discretized stream (DStream).
 On the input side, Spark Streaming receives live
input data streams through a receiver and divides the data
into micro-batches, which are then processed by the
Spark engine to generate the final stream of results in
batches.
 Spark Streaming uses small, deterministic batch intervals
(on the order of seconds) to split the stream into processable
units.
 The size of the interval dictates throughput and
latency: the larger the interval, the higher both the
throughput and the latency.
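The effect of the batch interval can be seen in a small simulation that groups timestamped events into fixed intervals. This is plain Python rather than Spark Streaming code; the function name `micro_batches`, the event layout `(timestamp_seconds, payload)`, and the sample timestamps are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def micro_batches(events: Iterable[Tuple[float, str]],
                  interval: float) -> Dict[int, List[str]]:
    """Assign each event to the batch whose time interval contains it."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for timestamp, payload in events:
        batch_index = int(timestamp // interval)
        batches[batch_index].append(payload)
    return dict(batches)


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (3.7, "e")]

small = micro_batches(events, interval=1.0)  # more, smaller batches
large = micro_batches(events, interval=2.0)  # fewer, larger batches

print(small)
print(large)
```

With the larger interval there are fewer batches to schedule, which favors throughput, but an event arriving at the start of an interval waits longer before its batch is processed, which raises latency.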
 Because the Spark core framework exploits main
memory (as opposed to Storm, which uses
ZooKeeper), its micro-batch processing can appear as
fast as the “one at a time” processing adopted in
Storm, despite the fact that the RDD units are
larger than Storm tuples.
 The benefit of micro-batching is higher throughput
in the internal engine through reduced data
shipping overhead, such as lower overhead for
ISO/OSI transport layer headers, which allows the
threads to concentrate on computation.
 Spark was written in Scala, but it comes with libraries
and wrappers that allow the use of R or Python.
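The shipping-overhead argument for batching made above can be made concrete with a little arithmetic in Python. The model is deliberately simplified and the numbers are assumptions: one fixed-size header per message, 100-byte records, a 20-byte header, and no segmentation of large messages or overhead from other protocol layers.

```python
def header_overhead_fraction(record_bytes: int,
                             header_bytes: int,
                             records_per_message: int) -> float:
    """Share of the bytes on the wire spent on the header rather than on data."""
    payload = record_bytes * records_per_message
    return header_bytes / (header_bytes + payload)


# Assumed sizes, for illustration only.
one_at_a_time = header_overhead_fraction(record_bytes=100, header_bytes=20,
                                         records_per_message=1)
batched = header_overhead_fraction(record_bytes=100, header_bytes=20,
                                   records_per_message=50)

print(f"one record per message: {one_at_a_time:.3%} header overhead")
print(f"50 records per message: {batched:.3%} header overhead")
```

Under these assumptions the header share shrinks as more records travel together, which is the effect the batching argument relies on.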
 Data Stream processing platforms – Storm:
 Storm is a distributed real-time computation system
for processing large volumes of high-velocity data.
 It makes it easy to reliably process unbounded streams
of data and has a relatively simple processing model
owing to the use of powerful abstractions.
 A spout is a source of streams in a computation.
 Typically, a spout reads from a queuing broker, such as
RabbitMQ, or Kafka, but a spout can also generate its own
stream or read from somewhere like the Twitter streaming
API.
 Spout implementations already exist for most queuing
systems.
 A bolt processes any number of input streams and
produces any number of new output streams.
 Bolts are event-driven components and cannot be used to
read data; that is what spouts are designed for.
 Most of the logic of a computation goes into bolts, such as
functions, filters, streaming joins, streaming aggregations,
talking to databases, and so on.
 A topology is a DAG of spouts and bolts, with each
edge in the DAG representing a bolt subscribing to the
output stream of some other spout or bolt.
 A topology is an arbitrarily complex multistage stream
computation; topologies run indefinitely when deployed.
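The spout, bolt, and topology roles can be mimicked with Python generators. This is a conceptual sketch only and does not use the Storm API: the names `sentence_spout`, `split_bolt`, and `count_bolt` are hypothetical, the spout reads a fixed list instead of a queuing broker, and the stream is finite so that the example terminates.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple


def sentence_spout() -> Iterator[str]:
    """Spout: the source of the stream (a fixed list stands in for a broker)."""
    yield from ["storm processes streams",
                "streams of tuples",
                "storm topology"]


def split_bolt(sentences: Iterable[str]) -> Iterator[str]:
    """Bolt: consume sentences and emit one word per output tuple."""
    for sentence in sentences:
        yield from sentence.split()


def count_bolt(words: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Bolt: keep a running count and emit the updated count for each word."""
    counts: Counter = Counter()
    for word in words:
        counts[word] += 1
        yield (word, counts[word])


# Topology: each edge is one component subscribing to another's output.
# spout -> split_bolt -> count_bolt
topology = count_bolt(split_bolt(sentence_spout()))

emitted = list(topology)
for item in emitted:
    print(item)
```

Each bolt reacts to one incoming tuple at a time, which mirrors the event-driven, one-at-a-time processing model attributed to Storm.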
 Trident provides a set of high-level abstractions in Storm
that were developed to facilitate programming of real-time
applications on top of Storm infrastructure.
 It supports joins, aggregations, grouping, functions, and
filters. In addition to these, Trident adds primitives for
doing stateful incremental processing on top of any
database or persistence store.
 Data Stream processing platforms – KAFKA:
 Kafka is an open source message broker project
developed by the Apache Software Foundation and
written in Scala and Java.
 The project aims to provide a unified, high-
throughput, low-latency platform for handling real-
time data feeds.
 A single Kafka broker can handle hundreds of
megabytes of reads and writes per second from
thousands of clients.
 In order to support high availability and horizontal
scalability, data streams are partitioned and spread
over a cluster of machines.
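Partitioning a stream by key can be sketched as a hash of the key taken modulo the number of partitions. The code below is not Kafka client code: CRC32 is only a stand-in for whatever hash a real partitioner uses, and the function names, the record layout `(key, value)`, and the partition count are assumptions.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, str]  # hypothetical (key, value) layout


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition; equal keys always land in the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign(records: Iterable[Record],
           num_partitions: int) -> Dict[int, List[Record]]:
    """Spread records over partitions while preserving per-key order."""
    partitions: Dict[int, List[Record]] = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return dict(partitions)


records = [("user-1", "login"), ("user-2", "click"),
           ("user-1", "click"), ("user-3", "login"),
           ("user-1", "logout")]

layout = assign(records, num_partitions=3)
for partition, items in sorted(layout.items()):
    print(partition, items)
```

Because the partition depends only on the key, all events for one key stay in order within a single partition, while different partitions can live on different machines.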
 Kafka depends on ZooKeeper from the Hadoop
ecosystem for coordination of processing nodes.
 The main uses of Kafka are in situations when
applications need a very high throughput for message
processing, while meeting low latency, high
availability, and high scalability requirements.
 Data Stream processing platforms – Flume:
 Flume is a distributed, reliable, and available service
for efficiently collecting, aggregating, and moving
large amounts of log data.
 It is robust and fault tolerant with tunable reliability
mechanisms and many failover and recovery
mechanisms. It uses a simple extensible data model
that allows for online analytic applications.
 While Flume and Kafka can both act as the event
backbone for real-time event processing, they have
different characteristics.
 Flume is better suited in cases when one needs to
support data ingestion and simple event processing.
 Data Stream processing platforms – Amazon Kinesis:
 Amazon Kinesis is a cloud-based service for real-time data
processing over large, distributed data streams.
 Amazon Kinesis can continuously capture and store
terabytes of data per hour from hundreds of thousands of
sources such as website clickstreams, financial
transactions, social media feeds, IT logs, and location-
tracking events.
 Kinesis allows integration with Storm, as it provides a
Kinesis Storm Spout that fetches data from a Kinesis
stream and emits it as tuples.
 Including this Kinesis component in a Storm
topology provides a reliable and scalable stream capture,
storage, and replay service.

CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
 
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdfGoverning Equations for Fundamental Aerodynamics_Anderson2010.pdf
Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf
 
Runway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptxRunway Orientation Based on the Wind Rose Diagram.pptx
Runway Orientation Based on the Wind Rose Diagram.pptx
 
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
road safety engineering r s e unit 3.pdf
road safety engineering  r s e unit 3.pdfroad safety engineering  r s e unit 3.pdf
road safety engineering r s e unit 3.pdf
 
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 

CS8091_BDA_Unit_IV_Stream_Computing

  • 1. CS8091 / Big Data Analytics III Year / VI Semester
  • 2. UNIT IV - STREAM MEMORY: Introduction to Streams Concepts – Stream Data Model and Architecture - Stream Computing, Sampling Data in a Stream – Filtering Streams – Counting Distinct Elements in a Stream – Estimating moments – Counting oneness in a Window – Decaying Window – Real Time Analytics Platform (RTAP) applications - Case Studies - Real Time Sentiment Analysis, Stock Market Predictions. Using Graph Analytics for Big Data: Graph Analytics.
  • 3. Stream Computing
      ◦ Stream computing refers to a high-performance computer system that analyzes multiple data streams from many sources.
      ◦ The term describes pulling in streams of data, processing the data, and streaming the results back out as a single flow.
      ◦ It uses software algorithms that analyze the data in real time as it streams in, to increase speed and accuracy in data handling and analysis.
  • 5. Stream Computing
      ◦ Stream computing delivers real-time analytic processing on constantly changing data in motion.
      ◦ It allows all data to be captured and analyzed all the time, just in time.
  • 6. Stream Computing
      ◦ Stream computing analyzes data before you store it:
          ▪ Analyze data that is in motion (Velocity).
          ▪ Process any type of data (Variety).
          ▪ Scale to process any volume of data, from terabytes to zettabytes per day.
  • 7. Stream Computing
      ◦ Store less
      ◦ Analyze more
      ◦ Make better decisions, faster
  • 8. Stream Computing
      ◦ Data stream processing platforms:
          ▪ Many of these are open source solutions.
          ▪ These platforms facilitate the construction of real-time applications, in particular message-oriented or event-driven applications that support ingress of messages or events at a very high rate, transfer to subsequent processing, and generation of alerts.
  • 9. Stream Computing
      ◦ Data stream processing platforms:
          ▪ These platforms are mostly focused on supporting event-driven data flow through nodes in a distributed system or within a cloud infrastructure platform.
          ▪ The Hadoop ecosystem covers a family of projects that fall under the umbrella of infrastructure for distributed computing and large data processing.
  • 10. Stream Computing
      ◦ Data stream processing platforms:
          ▪ Hadoop includes a number of components:
          ▪ MapReduce, a distributed data processing model and execution environment that runs on large clusters of commodity machines.
          ▪ Hadoop Distributed File System (HDFS), a distributed file system that runs on large clusters of commodity machines.
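The MapReduce processing model named above can be illustrated with a minimal single-process sketch. This is plain Python, not the Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical, and a real cluster would run the map and reduce steps in parallel on many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce over a list of text records."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `word_count(["big data stream", "stream of data"])` counts each word across both records; the whole input must be available before the result is produced, which is what makes this a batch model.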
  • 11. Stream Computing
      ◦ Data stream processing platforms:
          ▪ Hadoop includes a number of components (continued):
          ▪ ZooKeeper, a distributed, highly available coordination service, providing primitives such as distributed locks that can be used for building distributed applications.
          ▪ Pig, a dataflow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
          ▪ Hive, a distributed data warehouse.
  • 12. Stream Computing
      ◦ Data stream processing platforms:
          ▪ Hadoop was developed to support processing of large sets of structured, unstructured, and semi-structured data, but it was designed as a batch processing system.
  • 13. Stream Computing
      ◦ Data stream processing platforms – Spark:
          ▪ Apache Spark is a more recent framework that combines an engine for distributing programs across clusters of machines with a model for writing programs on top of it.
          ▪ It is aimed at addressing the needs of the data science community, in particular by supporting a Read-Evaluate-Print Loop (REPL) approach for working with data interactively.
  • 14. Stream Computing
      ◦ Data stream processing platforms – Spark:
          ▪ Spark maintains MapReduce’s linear scalability and fault tolerance, but extends it in three important ways:
          ▪ First, rather than relying on a rigid map-then-reduce format, its engine can execute a more general directed acyclic graph (DAG) of operators. This means that in situations where MapReduce must write out intermediate results to the distributed file system, Spark can pass them directly to the next step in the pipeline.
  • 15. Stream Computing
      ◦ Data stream processing platforms – Spark:
          ▪ Spark maintains MapReduce’s linear scalability and fault tolerance, but extends it in three important ways (continued):
          ▪ Second, it complements this capability with a rich set of transformations that enable users to express computation more naturally.
          ▪ Third, Spark supports in-memory processing across a cluster of machines, thus not relying on the use of storage for recording intermediate data, as in MapReduce.
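The idea of chaining operators so that intermediate results are passed directly to the next step, rather than written out between stages, can be sketched in plain Python. This is not the Spark API: the `Pipeline` class and its methods are hypothetical, it runs in a single process, and it models only a linear chain rather than a general DAG.

```python
from functools import reduce


class Pipeline:
    """A toy chain of operators evaluated lazily, one record at a time."""

    def __init__(self, source, stages=()):
        self._source = source
        self._stages = tuple(stages)

    def map(self, fn):
        """Transformation: record the operator, do no work yet."""
        return Pipeline(self._source, self._stages + (("map", fn),))

    def filter(self, predicate):
        """Transformation: record the operator, do no work yet."""
        return Pipeline(self._source, self._stages + (("filter", predicate),))

    def _iterate(self):
        records = iter(self._source)
        for kind, fn in self._stages:
            # Each stage wraps the previous iterator; no intermediate
            # collection is built or stored between stages.
            records = map(fn, records) if kind == "map" else filter(fn, records)
        return records

    def reduce(self, fn, initial):
        """Action: only now do records flow through the recorded stages."""
        return reduce(fn, self._iterate(), initial)

    def collect(self):
        """Action: materialize the final result as a list."""
        return list(self._iterate())
```

For example, `Pipeline(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0).reduce(lambda a, b: a + b, 0)` streams each number through both operators in memory and only the final sum is kept.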
  • 16. Stream Computing
      ◦ Data stream processing platforms – Spark:
          ▪ Spark supports integration with a variety of tools in the Hadoop ecosystem.
          ▪ It can read and write data in all of the data formats supported by MapReduce.
          ▪ It can read from and write to NoSQL databases like HBase and Cassandra.
          ▪ It is well suited for real-time processing and analysis, supporting scalable, high-throughput, and fault-tolerant processing of live data streams.
  • 17. Stream Computing
      ◦ Data stream processing platforms – Spark:
          ▪ Spark Streaming represents a continuous stream of data as a discretized stream (DStream).
          ▪ On the input side, Spark Streaming receives live input data streams through a receiver and divides the data into micro-batches, which are then processed by the Spark engine to generate the final stream of results in batches.
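Dividing a live stream into micro-batches can be illustrated with a small plain-Python generator. This is a sketch of the idea only, not Spark Streaming code: the function name `micro_batches` is hypothetical, and events are assumed to be `(timestamp, value)` pairs that arrive in timestamp order.

```python
def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive fixed-length intervals.

    Yields (batch_start_time, [values]) for every interval between the first
    and the last event, including intervals in which nothing arrived.
    """
    batch_start = None
    batch = []
    for timestamp, value in events:
        if batch_start is None:
            # Align the first batch to a multiple of the interval.
            batch_start = (timestamp // interval) * interval
        # Close every interval that ended before this event arrived.
        while timestamp >= batch_start + interval:
            yield batch_start, batch
            batch_start += interval
            batch = []
        batch.append(value)
    if batch_start is not None:
        # Flush the last, possibly partial, batch.
        yield batch_start, batch
```

Each yielded batch can then be handed to an ordinary batch computation (a count, a sum, a join), which is how a batch engine is reused for stream processing.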
  • 18. Stream Computing
      ◦ Data stream processing platforms – Spark:
          ▪ Spark Streaming uses a small-interval (on the order of seconds) deterministic batch to separate the stream into processable units.
          ▪ The size of the interval dictates throughput and latency: the larger the interval, the higher both the throughput and the latency.
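The throughput and latency trade-off can be made concrete with a toy cost model. Everything here is an assumption made for illustration, not a measurement of Spark: a fixed overhead per batch, a constant cost per record, and records arriving at a constant rate. The function name and all parameters are hypothetical.

```python
def batch_interval_model(interval_s, arrival_rate, batch_overhead_s, record_cost_s):
    """Toy cost model for one micro-batch interval (all inputs hypothetical).

    interval_s:        length of the batch interval in seconds
    arrival_rate:      records arriving per second
    batch_overhead_s:  fixed scheduling cost paid once per batch
    record_cost_s:     processing cost per record
    """
    records_per_batch = arrival_rate * interval_s
    processing_s = batch_overhead_s + records_per_batch * record_cost_s
    # A record waits half an interval on average, then its batch is processed.
    avg_latency_s = interval_s / 2 + processing_s
    # Highest arrival rate for which a batch still finishes within one interval.
    max_rate = (interval_s - batch_overhead_s) / (record_cost_s * interval_s)
    return {"avg_latency_s": avg_latency_s, "max_sustainable_rate": max_rate}
```

Under this model, moving from a 1-second to a 5-second interval spreads the fixed overhead over more records, so the sustainable rate rises, while every record also waits longer before its batch is processed.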
  • 19. Stream Computing
      ◦ Data stream processing platforms – Spark:
          ▪ Since the Spark core framework exploits main memory (as opposed to Storm, which uses ZooKeeper), its mini-batch processing can appear as fast as the “one at a time” processing adopted in Storm, despite the fact that the RDD units are larger than Storm tuples.
  • 20. Stream Computing
      ◦ Data stream processing platforms – Spark:
          ▪ The benefit of the mini-batch is higher throughput in the internal engine, achieved by reducing data shipping overhead (for example, lower overhead for ISO/OSI transport layer headers), which allows the threads to concentrate on computation.
          ▪ Spark is written in Scala, but it comes with libraries and wrappers that allow the use of R or Python.
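The header-overhead argument can be checked with simple arithmetic. The sketch below assumes, as a simplification, that each shipped message carries exactly one header and ignores the fact that a large batch may be split into several segments on the wire. The function name is hypothetical; the 20-byte figure used in the example is the minimum TCP header size.

```python
import math


def header_bytes_sent(num_records, header_bytes, batch_size):
    """Total header bytes when records are shipped batch_size at a time."""
    num_messages = math.ceil(num_records / batch_size)
    return num_messages * header_bytes
```

With 1000 records and a 20-byte header, shipping one record per message costs `header_bytes_sent(1000, 20, 1)` bytes of headers, while shipping 100 records per message costs `header_bytes_sent(1000, 20, 100)`, a hundredfold reduction under these assumptions.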
  • 21. Stream Computing
      ◦ Data stream processing platforms – Storm:
          ▪ Storm is a distributed real-time computation system for processing large volumes of high-velocity data.
          ▪ It makes it easy to reliably process unbounded streams of data and has a relatively simple processing model owing to the use of powerful abstractions.
  • 22. Stream Computing
      ◦ Data stream processing platforms – Storm:
          ▪ A spout is a source of streams in a computation.
          ▪ Typically, a spout reads from a queuing broker such as RabbitMQ or Kafka, but a spout can also generate its own stream or read from somewhere like the Twitter streaming API.
          ▪ Spout implementations already exist for most queuing systems.
  • 23. Stream Computing
      ◦ Data stream processing platforms – Storm:
          ▪ A bolt processes any number of input streams and produces any number of new output streams.
          ▪ Bolts are event-driven components and cannot be used to read data from a source; that is what spouts are designed for.
          ▪ Most of the logic of a computation goes into bolts: functions, filters, streaming joins, streaming aggregations, talking to databases, and so on.
  • 24. Stream Computing
      ◦ Data stream processing platforms – Storm:
          ▪ A topology is a DAG of spouts and bolts, with each edge in the DAG representing a bolt subscribing to the output stream of some other spout or bolt.
          ▪ A topology is an arbitrarily complex multistage stream computation; topologies run indefinitely when deployed.
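The relationship between spouts, bolts, and a topology can be sketched with Python generators. This is an illustration of the concepts only and not the Storm API: all function names are hypothetical, the spout reads from a fixed list instead of a queuing broker, and the whole thing runs in one process and terminates, whereas a deployed topology is distributed and runs indefinitely.

```python
from collections import Counter


def sentence_spout():
    """Spout: the source of the stream (here a fixed list, not a real queue)."""
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consume sentence tuples and emit one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keep a running count per word and emit (word, count) updates."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology():
    """Wire spout -> split bolt -> count bolt and return the final counts."""
    final = {}
    for word, count in count_bolt(split_bolt(sentence_spout())):
        final[word] = count
    return final
```

Each function call in `run_topology` corresponds to one edge of the DAG: `split_bolt` subscribes to the spout's output, and `count_bolt` subscribes to the output of `split_bolt`.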
  • 25. Stream Computing
      ◦ Data stream processing platforms – Storm:
          ▪ Trident provides a set of high-level abstractions in Storm that were developed to facilitate programming of real-time applications on top of the Storm infrastructure.
          ▪ It supports joins, aggregations, grouping, functions, and filters. In addition to these, Trident adds primitives for doing stateful incremental processing on top of any database or persistence store.
  • 26. Stream Computing
      ◦ Data stream processing platforms – Kafka:
          ▪ Kafka is an open source message broker project developed by the Apache Software Foundation and written in Scala and Java.
          ▪ The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
  • 27. Stream Computing
      ◦ Data stream processing platforms – Kafka:
          ▪ A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients.
          ▪ In order to support high availability and horizontal scalability, data streams are partitioned and spread over a cluster of machines.
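Partitioning a stream by message key can be sketched as follows. This is plain Python rather than a Kafka client: the function names are hypothetical, and MD5 is used only as a convenient deterministic hash for the illustration, not as the hash Kafka itself applies.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a message key to a partition index in [0, num_partitions)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def assign(messages, num_partitions):
    """Spread (key, value) messages over partitions, preserving per-key order."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions
```

Because the partition depends only on the key, all messages with the same key land in the same partition in their original order, while different keys can be served by different machines.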
  • 28. Stream Computing
      ◦ Data stream processing platforms – Kafka:
          ▪ Kafka depends on ZooKeeper from the Hadoop ecosystem for coordination of processing nodes.
          ▪ Kafka is mainly used where applications need very high throughput for message processing while meeting low latency, high availability, and high scalability requirements.
  • 29. Stream Computing
      ◦ Data stream processing platforms – Flume:
          ▪ Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
          ▪ It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.
          ▪ It uses a simple extensible data model that allows for online analytic applications.
  • 30. Stream Computing
      ◦ Data stream processing platforms – Flume:
          ▪ While both Flume and Kafka can act as the event backbone for real-time event processing, they have different characteristics.
          ▪ Flume is better suited to cases where one needs to support data ingestion and simple event processing.
  • 31. Stream Computing
      ◦ Data stream processing platforms – Amazon Kinesis:
          ▪ Amazon Kinesis is a cloud-based service for real-time data processing over large, distributed data streams.
          ▪ Amazon Kinesis can continuously capture and store terabytes of data per hour from hundreds of thousands of sources such as website clickstreams, financial transactions, social media feeds, IT logs, and location-tracking events.
  • 32. Stream Computing
      ◦ Data stream processing platforms – Amazon Kinesis:
          ▪ Kinesis allows integration with Storm, as it provides a Kinesis Storm Spout that fetches data from a Kinesis stream and emits it as tuples.
          ▪ The inclusion of this Kinesis component in a Storm topology provides a reliable and scalable stream capture, storage, and replay service.