Real-time stream processing
for Big Data
Presented by Luay AL-Assadi
INTRODUCTION
Rise of Web 2.0 and the Internet of Things.
• Huge amounts of data (e.g., sensors, social media, online marketing).
• Much of this information is valuable only for a short time and therefore has to be processed immediately.
• Example: monitoring user activity to optimize product or video recommendations for the current user context.
Traditional batch-oriented approaches.
• Complex Event Processing (CEP) engines and DBMSs.
Distributed data processing.
• MapReduce.
Real-time analytics: Big Data in motion
• Real-time data infrastructure:
  • Built from distributed components.
  • Components communicate over an asynchronous network.
  • Engineered on top of the JVM (Java Virtual Machine).
• Basic real-time Big Data architecture model:
  • Collect data from various sources.
  • Move data to the streaming layer.
  • Analyze data in the stream processor.
  • Forward outputs to the serving layer.
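The four stages above can be sketched as a chain of Python generators. This is only an illustration: the function names, the sample events, and the per-key counting logic are assumptions made for the example, not part of the slides.

```python
from collections import deque

def collect(sources):
    """Collection: gather raw events from several (hypothetical) sources."""
    for source in sources:
        yield from source

def to_stream(events):
    """Streaming layer: buffer events in arrival order (stand-in for a message queue)."""
    queue = deque(events)
    while queue:
        yield queue.popleft()

def process(stream):
    """Stream processor: keep a running count per event key."""
    counts = {}
    for key in stream:
        counts[key] = counts.get(key, 0) + 1
        yield key, counts[key]

def serve(results):
    """Serving layer: keep only the latest value per key for lookups."""
    view = {}
    for key, value in results:
        view[key] = value
    return view

sensor_events = ["temp", "temp", "humidity"]
click_events = ["click", "temp"]
view = serve(process(to_stream(collect([sensor_events, click_events]))))
print(view)
```

Each stage only consumes the output of the previous one, which mirrors how the layers in the architecture model are decoupled.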
Real-time analytics: Big Data in motion
• Big Data architecture model: Lambda Architecture
  Collecting Data -> Streaming Data
  Streaming Data -> Store Data -> Batch processing -> Serving Layer
  Streaming Data -> Stream processing -> Serving Layer
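A minimal sketch of the Lambda idea, assuming a simple event-counting workload: a batch view is recomputed from stored data, a real-time view covers events that arrived since, and the serving layer merges both. All names and data here are hypothetical.

```python
from collections import Counter

master_dataset = ["a", "b", "a"]   # events already stored and covered by the last batch run
recent_events = ["a", "c"]         # events that arrived after that batch run

def batch_view(events):
    """Batch path: recompute the view from all stored data (slow, complete)."""
    return Counter(events)

def realtime_view(events):
    """Stream path: incrementally count only the recent events (fast, partial)."""
    view = Counter()
    for event in events:
        view[event] += 1
    return view

def query(key, batch, realtime):
    """Serving layer: merge both views to answer a query."""
    return batch[key] + realtime[key]

batch = batch_view(master_dataset)
realtime = realtime_view(recent_events)
print(query("a", batch, realtime))
```

The cost of this design is that the same logic has to be maintained in two code paths, one per processing layer.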
Real-time analytics: Big Data in motion
• Big Data architecture model: Kappa Architecture
  Collecting Data -> Streaming Data (store, retain data) -> Stream processing -> Serving Layer
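A rough sketch of why the Kappa model needs the streaming layer to store and retain data: when the processing logic changes, a new job replays the retained log instead of running a separate batch system. The log contents and the `run_job` helper are assumptions for illustration.

```python
retained_log = [3, 5, 8, 13]  # the streaming layer stores and retains every event

def run_job(log, transform, start_offset=0):
    """Stream job: apply `transform` to every event from `start_offset` onward."""
    return [transform(event) for event in log[start_offset:]]

# Version 1 of the processing logic feeds the serving layer.
serving_v1 = run_job(retained_log, lambda x: x * 2)

# To change the logic, start a second job that replays the retained log from
# offset 0, then switch the serving layer to its output once it has caught up.
serving_v2 = run_job(retained_log, lambda x: x * 2 + 1)

print(serving_v1, serving_v2)
```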
Real-time streamers
• RabbitMQ.
  • Broker-centric, with message acknowledgements.
  • Focused on delivery guarantees between producers and consumers.
  • Can fall over if consumers are too slow.
  Message flow: Producer -> Broker -> Consumer (message); Consumer -> Broker (ack).
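The message/ack flow can be illustrated with a toy in-memory broker. This is not the RabbitMQ API; the `Broker` class and its methods are hypothetical and only show the bookkeeping behind acknowledgements and redelivery.

```python
from collections import deque

class Broker:
    """Toy broker: a message stays tracked until the consumer acknowledges it."""

    def __init__(self):
        self.ready = deque()      # messages waiting for delivery
        self.unacked = {}         # delivery tag -> message awaiting an ack
        self.next_tag = 0

    def publish(self, message):
        self.ready.append(message)

    def deliver(self):
        """Hand the next message to a consumer and remember it until it is acked."""
        message = self.ready.popleft()
        tag = self.next_tag
        self.next_tag += 1
        self.unacked[tag] = message
        return tag, message

    def ack(self, tag):
        """Consumer confirms processing; the broker may now forget the message."""
        del self.unacked[tag]

    def requeue_unacked(self):
        """Consumer failed without acking: put its messages back in the queue."""
        for tag in sorted(self.unacked):
            self.ready.append(self.unacked[tag])
        self.unacked.clear()

broker = Broker()
broker.publish("m1")
broker.publish("m2")

tag1, msg1 = broker.deliver()
broker.ack(tag1)                  # m1 processed and acknowledged

tag2, msg2 = broker.deliver()     # m2 delivered, but the consumer crashes before acking
broker.requeue_unacked()
tag3, msg3 = broker.deliver()     # m2 is delivered again
```

Because the broker holds every message that is not yet acknowledged, a slow consumer makes `ready` and `unacked` grow, which is one way to read the "fall over" bullet above.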
Real-time streamers
• Kafka.
  • Producer-centric.
  • Supports both online and offline consumers.
  • Uses ZooKeeper to reliably maintain its state across a cluster.
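One way to picture online and offline consumers is an append-only log where each consumer reads from its own offset. The sketch below is a simplification of that model, not Kafka's client API; the `Log` class and its methods are hypothetical.

```python
class Log:
    """Toy append-only log; messages stay in the log, consumers track offsets."""

    def __init__(self):
        self.messages = []

    def append(self, message):
        self.messages.append(message)
        return len(self.messages) - 1   # offset of the new message

    def read(self, offset, max_messages=10):
        """Return up to `max_messages` starting at `offset`, plus the next offset."""
        batch = self.messages[offset:offset + max_messages]
        return batch, offset + len(batch)

log = Log()
for event in ["e0", "e1", "e2", "e3"]:
    log.append(event)

# An "online" consumer reads continuously, a few messages at a time.
online_batch, online_offset = log.read(0, max_messages=2)

# An "offline" consumer connects later and reads everything it missed in one go.
offline_batch, offline_offset = log.read(0)
```

Since reading does not remove messages, a slow or disconnected consumer does not put pressure on the producer side.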
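Real-time processors:
Latency vs. Throughput & Efficiency
• Low latency: handle data items immediately as they arrive.
• High throughput: buffering items and processing them in batches increases efficiency.
Processing models, from low latency to high throughput:
• Stream: STORM, SAMZA
• Micro-batch: Trident, SPARK Streaming
• Batch: SPARK
Micro-batching sits between the two extremes: a stream processor reaches it by grouping tuples into batches, a batch processor by restricting the batch size.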
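Grouping tuples into size-limited batches can be sketched in a few lines. The `micro_batches` helper and the sample tuples are assumptions for illustration and do not reflect how Trident or Spark Streaming implement batching internally.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group an (unbounded) iterator of tuples into batches of at most `batch_size`."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

tuples = [("user1", 1), ("user2", 1), ("user1", 1), ("user3", 1), ("user2", 1)]

# One-at-a-time: each tuple is handled as soon as it arrives (lowest latency).
one_at_a_time = [[t] for t in tuples]

# Micro-batch: tuples wait until a batch is full, then are processed together.
batched = list(micro_batches(tuples, batch_size=2))
```

A larger `batch_size` amortizes per-batch overhead over more tuples but makes the first tuple of each batch wait longer, which is the latency/throughput trade-off in miniature.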
Real-time processors
• STORM
  • Developed by Nathan Marz at BackType, which was acquired by Twitter in 2011.
  • Initially promoted as the “Hadoop of real-time”.
  • A Storm deployment relies on a ZooKeeper cluster for reliable coordination.
Real-time processors
• STORM
  Topology: a network of spouts and bolts, comparable to a Hadoop MapReduce job.
  Stream: an unbounded sequence of tuples.
  Spouts & bolts: spouts receive data continuously, turn it into a stream of tuples, and send the tuples to the bolts to be processed.
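The spout/bolt structure can be mimicked with Python generators. This is a conceptual sketch only and does not use Storm's API; the function names and the word-count logic are chosen for illustration.

```python
def sentence_spout():
    """Spout: turns an external source (here a fixed list) into a stream of tuples."""
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield (sentence,)

def split_bolt(stream):
    """Bolt: emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    """Bolt: emits a running count for each word."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])

# Wire the topology: spout -> split bolt -> count bolt.
topology = count_bolt(split_bolt(sentence_spout()))
results = list(topology)
```

In a real deployment the spout would never run dry, and each bolt could run as many parallel tasks spread over worker nodes.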
Real-time processors
• STORM
  Nodes
  Master node: runs a daemon called ‘Nimbus’, which is similar to the ‘JobTracker’ of a Hadoop cluster.
  • Assigns jobs.
  • Monitors performance.
Real-time processors
• STORM
  Nodes
  Worker node: runs a daemon called ‘Supervisor’, which runs one or more worker processes on its node.
  Apache ZooKeeper facilitates communication between Nimbus and the Supervisors by means of message acknowledgements and processing status.
Real-time processors
• SAMZA
  • Initially created at LinkedIn and submitted to the Apache Incubator in July 2013.
  • Co-developed with the queueing system Kafka.
  • Takes a little more work to deploy than Storm: it not only depends on a ZooKeeper cluster but also runs on top of Hadoop YARN.
Real-time processors
• SAMZA - YARN
  YARN is a cluster scheduler. It lets you allocate a number of containers (processes) in a cluster of machines and execute arbitrary commands on them. The Samza client uses YARN to run a Samza job.
  NodeManager: responsible for launching processes on its machine.
  ResourceManager: talks to all of the NodeManagers to tell them what to run.
  ApplicationMaster: responsible for managing the application’s workload, asking for containers, and handling notifications when one of its containers fails.
Real-time processors
• SAMZA
  • Decouples individual processing steps.
  • Buffering data between processing steps makes (intermediate) results available to unrelated parties.
  • Prevents data loss by periodically checkpointing current progress and, after a failure, reprocessing all data from the last checkpoint.
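Checkpointing and replay can be sketched as follows. The `run` function, the state layout, and the simulated crash are assumptions for illustration rather than Samza's actual mechanism; the point is that restarting from the last checkpoint loses no data but may process some items twice.

```python
def run(stream, state, fail_at=None, checkpoint_every=2):
    """Process `stream` from the checkpointed offset; optionally crash at `fail_at`."""
    offset = state["checkpoint"]["offset"]
    total = state["checkpoint"]["total"]
    for i in range(offset, len(stream)):
        if i == fail_at:
            raise RuntimeError("simulated crash")
        total += stream[i]
        state["processed"].append(stream[i])      # side effect visible downstream
        if (i + 1) % checkpoint_every == 0:
            state["checkpoint"] = {"offset": i + 1, "total": total}
    return total

stream = [1, 2, 3, 4, 5]
state = {"checkpoint": {"offset": 0, "total": 0}, "processed": []}

try:
    run(stream, state, fail_at=3)      # crashes before handling index 3
except RuntimeError:
    pass

total = run(stream, state)             # restart: resume from the last checkpoint
```

After the restart the running total is correct, but the item at index 2 appears twice in `state["processed"]` because it was handled after the last checkpoint and then replayed.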
Real-time processors
• SPARK
  • A batch-processing framework often described as the unofficial successor of Hadoop, as it offers several benefits in comparison.
  • Significant performance improvements through in-memory caching.
  • Provides a variety of machine learning algorithms out of the box through the MLlib library.
Real-time processors
 SPARK – Architecture
Discussion
                      STORM           SAMZA                      SPARK
Achievable latency    << 100 ms       < 100 ms                   < 1 s
Processing model      one-at-a-time   one-at-a-time              micro-batch
Ordering guarantees   No              within stream partitions   between batches
Elasticity            Yes, Yes, No
Taken together, these systems show that low latency has to be traded off against other desirable properties such as throughput, fault tolerance, reliability (processing guarantees), and ease of development.
References
• https://www.quora.com/What-are-the-differences-between-Apache-Spark-Storm-Samza-Flink-Beam-Apex
• https://www.quora.com/What-are-the-differences-between-batch-processing-and-stream-processing-systems
• https://samza.apache.org/learn/documentation/0.10/introduction/architecture.html
• https://dzone.com/articles/streaming-big-data-storm-spark
• Paper: Real-time stream processing for Big Data
THANKS
