SlideShare a Scribd company logo
Streaming Data Analytics
Gizem Akman | Software Infrastructure
Nov. 21, 2017
Between 2017 and 2022, Gartner estimates that the market for event
stream processing (ESP) platforms will grow 15% year over year
(compound annual growth rate).
Streaming Analytics – What?
Analytics
Real Time
«Data in motion»
Batch
«Data at rest»
Streaming Analytics – What?
Software that can filter, aggregate, enrich, and analyze a high
throughput of data from multiple, disparate live data sources and in
any data format to identify simple and complex patterns to provide
applications with context to detect opportune situations, automate
immediate actions, and dynamically adapt.
What is Real Time?
Paradigms – Processing Types
• Atomic
• Micro Batching
• Windowing
Paradigms – Data Handling Guarantees
• At Most Once
• At Least Once
• Exactly Once
Requirements & Characteristics
• Data must be real time (current)
• High volume & High Velocity Data
• «Perishable Data»
• Analytic logic must be predefined
• Ultra-High performance messaging
• Unbounded - Execution never stops
Streaming Analytics – How?
• Analytic logic must be predefined
• in-memory
• Parallel – scale out
• faster chips or GPUs
• efficient algorithms (ex. minimizing context switches)
• Leveraging innovative data architectures (ex. hashing)
• Compromise on flexibility ( such as limiting random data access)
Zero-Copy
Popular Platforms
Open Source
Apache Storm
Apache Flink
Spark Streaming
Apache Samza
Vendors
IBM Streams
Software AG – Apama
Streaming Analytics
Azure Stream Analytics
SAP Event Stream Processor
Oracle Stream Analytics
SAS Event Stream Processing
Cisco Streaming Analytics
Amazon Kinesis
Google Cloud Dataflow
TIBCO Event Analytics
Informatica
Striim
DataTorrent
StreamAnalytix
SQLStream Blaze
Data Artisans
Impetus Technologies
EsperTech
Basics
• Free & open source distributed realtime computing
engine «Hadoop for real time»
• Fast over a 1M tuples processed per second per node
Architecture
Data Model
Groupings
Data Processing Guarantee
Fault Tolerance
«fail-fast, auto restart»
• When a worker dies: the supervisor will restart it. If it continuously fails on
startup and is unable to heartbeat to Nimbus, Nimbus will reschedule the
worker.
• When a node dies: The tasks assigned to that machine will time-out and
Nimbus will reassign those tasks to other machines.
• When Nimbus dies: The Nimbus is fail-fast (process self-destructs
whenever any unexpected situation is encountered) and stateless (all state
is kept in Zookeeper or on disk, so restart like nothing happened.
• If you lose the Nimbus node, the workers will still continue to function.
Additionally, supervisors will continue to restart workers if they die.
However, without Nimbus, workers won't be reassigned to other machines
when necessary (like if you lose a worker machine).
Parallelism
Parallelism
Integration
• Apache Kafka
• Apache Hbase
• Apache HDFS
• Apache Hive
• Apache Solr
• Apache Cassandra
• JDBC
• JMS
• Redis
• Event Hubs
• Elasticsearch
• MQTT
• Mongodb
• OpenTSDB
• Kinesis
• Druid
• Kestrel
With External Systems,
and Other Libraries
With Containers,
and Resource Management Systems
• YARN
• Mesos
• Docker
• Kubernetes
And many others>> http://storm.apache.org/Powered-By.html
Stream Analytics

More Related Content

What's hot

Zipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkZipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering Framework
Databricks
 
Real-Time Analytics with Spark and MemSQL
Real-Time Analytics with Spark and MemSQLReal-Time Analytics with Spark and MemSQL
Real-Time Analytics with Spark and MemSQL
SingleStore
 
Rapid Data Analytics @ Netflix
Rapid Data Analytics @ NetflixRapid Data Analytics @ Netflix
Rapid Data Analytics @ Netflix
Data Con LA
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
Jen Aman
 
Azure stream analytics by Nico Jacobs
Azure stream analytics by Nico JacobsAzure stream analytics by Nico Jacobs
Azure stream analytics by Nico Jacobs
ITProceed
 
Building Software to Scale
Building Software to Scale Building Software to Scale
Building Software to Scale
SingleStore
 
Microsoft cosmos
Microsoft cosmosMicrosoft cosmos
Microsoft cosmos
Karthik Murugesan
 
[2C6]Everyplay_Big_Data
[2C6]Everyplay_Big_Data[2C6]Everyplay_Big_Data
[2C6]Everyplay_Big_Data
NAVER D2
 
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Big Data Spain
 
Battling Model Decay with Deep Learning and Gamification
Battling Model Decay with Deep Learning and GamificationBattling Model Decay with Deep Learning and Gamification
Battling Model Decay with Deep Learning and Gamification
Databricks
 
Machine Learning on Distributed Systems by Josh Poduska
Machine Learning on Distributed Systems by Josh PoduskaMachine Learning on Distributed Systems by Josh Poduska
Machine Learning on Distributed Systems by Josh Poduska
Data Con LA
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
Josh Patterson
 
Spark and the Enterprise by Tony Baer
Spark and the Enterprise by Tony BaerSpark and the Enterprise by Tony Baer
Spark and the Enterprise by Tony Baer
Spark Summit
 
Building near real-time HTAP solutions using Synapse Link for Azure Cosmos DB
Building near real-time HTAP solutions using Synapse Link for Azure Cosmos DBBuilding near real-time HTAP solutions using Synapse Link for Azure Cosmos DB
Building near real-time HTAP solutions using Synapse Link for Azure Cosmos DB
Timothy McAliley
 
Rapid Data Analytics @ Netflix
Rapid Data Analytics @ NetflixRapid Data Analytics @ Netflix
Rapid Data Analytics @ Netflix
Monisha Kanoth
 
Puree through Trillion of clicks in seconds using Interana
Puree through Trillion of clicks in seconds using InteranaPuree through Trillion of clicks in seconds using Interana
Puree through Trillion of clicks in seconds using Interana
Jagjit Srawan
 
(BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems
(BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems(BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems
(BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems
Amazon Web Services
 
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
Albert Wong
 
Overkill Analytics Seattle Spark Meetup
Overkill Analytics Seattle Spark MeetupOverkill Analytics Seattle Spark Meetup
Overkill Analytics Seattle Spark Meetup
Claudiu Barbura
 
Fikrimuhal TRHUG 2016 Machine Learning
Fikrimuhal TRHUG 2016 Machine LearningFikrimuhal TRHUG 2016 Machine Learning
Fikrimuhal TRHUG 2016 Machine Learning
Sukru Hasdemir
 

What's hot (20)

Zipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkZipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering Framework
 
Real-Time Analytics with Spark and MemSQL
Real-Time Analytics with Spark and MemSQLReal-Time Analytics with Spark and MemSQL
Real-Time Analytics with Spark and MemSQL
 
Rapid Data Analytics @ Netflix
Rapid Data Analytics @ NetflixRapid Data Analytics @ Netflix
Rapid Data Analytics @ Netflix
 
Building Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemMLBuilding Custom Machine Learning Algorithms With Apache SystemML
Building Custom Machine Learning Algorithms With Apache SystemML
 
Azure stream analytics by Nico Jacobs
Azure stream analytics by Nico JacobsAzure stream analytics by Nico Jacobs
Azure stream analytics by Nico Jacobs
 
Building Software to Scale
Building Software to Scale Building Software to Scale
Building Software to Scale
 
Microsoft cosmos
Microsoft cosmosMicrosoft cosmos
Microsoft cosmos
 
[2C6]Everyplay_Big_Data
[2C6]Everyplay_Big_Data[2C6]Everyplay_Big_Data
[2C6]Everyplay_Big_Data
 
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
Are we reaching a Data Science Singularity? How Cognitive Computing is emergi...
 
Battling Model Decay with Deep Learning and Gamification
Battling Model Decay with Deep Learning and GamificationBattling Model Decay with Deep Learning and Gamification
Battling Model Decay with Deep Learning and Gamification
 
Machine Learning on Distributed Systems by Josh Poduska
Machine Learning on Distributed Systems by Josh PoduskaMachine Learning on Distributed Systems by Josh Poduska
Machine Learning on Distributed Systems by Josh Poduska
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
 
Spark and the Enterprise by Tony Baer
Spark and the Enterprise by Tony BaerSpark and the Enterprise by Tony Baer
Spark and the Enterprise by Tony Baer
 
Building near real-time HTAP solutions using Synapse Link for Azure Cosmos DB
Building near real-time HTAP solutions using Synapse Link for Azure Cosmos DBBuilding near real-time HTAP solutions using Synapse Link for Azure Cosmos DB
Building near real-time HTAP solutions using Synapse Link for Azure Cosmos DB
 
Rapid Data Analytics @ Netflix
Rapid Data Analytics @ NetflixRapid Data Analytics @ Netflix
Rapid Data Analytics @ Netflix
 
Puree through Trillion of clicks in seconds using Interana
Puree through Trillion of clicks in seconds using InteranaPuree through Trillion of clicks in seconds using Interana
Puree through Trillion of clicks in seconds using Interana
 
(BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems
(BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems(BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems
(BDT207) Real-Time Analytics In Service Of Self-Healing Ecosystems
 
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent)
 
Overkill Analytics Seattle Spark Meetup
Overkill Analytics Seattle Spark MeetupOverkill Analytics Seattle Spark Meetup
Overkill Analytics Seattle Spark Meetup
 
Fikrimuhal TRHUG 2016 Machine Learning
Fikrimuhal TRHUG 2016 Machine LearningFikrimuhal TRHUG 2016 Machine Learning
Fikrimuhal TRHUG 2016 Machine Learning
 

Similar to Stream Analytics

Streamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache Pulsar
Streamlio
 
Traitement d'événements
Traitement d'événementsTraitement d'événements
Traitement d'événements
Amazon Web Services
 
Stream dataprocessing101
Stream dataprocessing101Stream dataprocessing101
Stream dataprocessing101
Sotaro Kimura
 
Getting Started with Real-time Analytics
Getting Started with Real-time AnalyticsGetting Started with Real-time Analytics
Getting Started with Real-time Analytics
Amazon Web Services
 
Building a data warehouse with Amazon Redshift … and a quick look at Amazon ...
Building a data warehouse  with Amazon Redshift … and a quick look at Amazon ...Building a data warehouse  with Amazon Redshift … and a quick look at Amazon ...
Building a data warehouse with Amazon Redshift … and a quick look at Amazon ...
Julien SIMON
 
Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Skynet project: Monitor, analyze, scale, and maintain a system in the CloudSkynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Sylvain Kalache
 
RedisConf17 - Redis in High Traffic Adtech Stack
RedisConf17 - Redis in High Traffic Adtech StackRedisConf17 - Redis in High Traffic Adtech Stack
RedisConf17 - Redis in High Traffic Adtech Stack
Redis Labs
 
Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex
DataWorks Summit/Hadoop Summit
 
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexHadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Apache Apex
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Precisely
 
Tapping the cloud for real time data analytics
 Tapping the cloud for real time data analytics Tapping the cloud for real time data analytics
Tapping the cloud for real time data analytics
Amazon Web Services
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
John Adams
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
DataWorks Summit
 
Distributesd Tracing in Serverless Systems - Shannon Hogue, Epsagon - Cloud N...
Distributesd Tracing in Serverless Systems - Shannon Hogue, Epsagon - Cloud N...Distributesd Tracing in Serverless Systems - Shannon Hogue, Epsagon - Cloud N...
Distributesd Tracing in Serverless Systems - Shannon Hogue, Epsagon - Cloud N...
Cloud Native Day Tel Aviv
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
Dr. Shikha Mehta
 
Introduction to near real time computing
Introduction to near real time computingIntroduction to near real time computing
Introduction to near real time computing
Tao Li
 
Real-Time Analytics with Confluent and MemSQL
Real-Time Analytics with Confluent and MemSQLReal-Time Analytics with Confluent and MemSQL
Real-Time Analytics with Confluent and MemSQL
SingleStore
 
Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?
Crate.io
 
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in ProductionCassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
DataStax Academy
 
Cassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in ProductionCassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in Production
DataStax Academy
 

Similar to Stream Analytics (20)

Streamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache Pulsar
 
Traitement d'événements
Traitement d'événementsTraitement d'événements
Traitement d'événements
 
Stream dataprocessing101
Stream dataprocessing101Stream dataprocessing101
Stream dataprocessing101
 
Getting Started with Real-time Analytics
Getting Started with Real-time AnalyticsGetting Started with Real-time Analytics
Getting Started with Real-time Analytics
 
Building a data warehouse with Amazon Redshift … and a quick look at Amazon ...
Building a data warehouse  with Amazon Redshift … and a quick look at Amazon ...Building a data warehouse  with Amazon Redshift … and a quick look at Amazon ...
Building a data warehouse with Amazon Redshift … and a quick look at Amazon ...
 
Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Skynet project: Monitor, analyze, scale, and maintain a system in the CloudSkynet project: Monitor, analyze, scale, and maintain a system in the Cloud
Skynet project: Monitor, analyze, scale, and maintain a system in the Cloud
 
RedisConf17 - Redis in High Traffic Adtech Stack
RedisConf17 - Redis in High Traffic Adtech StackRedisConf17 - Redis in High Traffic Adtech Stack
RedisConf17 - Redis in High Traffic Adtech Stack
 
Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex
 
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexHadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
Tapping the cloud for real time data analytics
 Tapping the cloud for real time data analytics Tapping the cloud for real time data analytics
Tapping the cloud for real time data analytics
 
John adams talk cloudy
John adams   talk cloudyJohn adams   talk cloudy
John adams talk cloudy
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
 
Distributesd Tracing in Serverless Systems - Shannon Hogue, Epsagon - Cloud N...
Distributesd Tracing in Serverless Systems - Shannon Hogue, Epsagon - Cloud N...Distributesd Tracing in Serverless Systems - Shannon Hogue, Epsagon - Cloud N...
Distributesd Tracing in Serverless Systems - Shannon Hogue, Epsagon - Cloud N...
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Introduction to near real time computing
Introduction to near real time computingIntroduction to near real time computing
Introduction to near real time computing
 
Real-Time Analytics with Confluent and MemSQL
Real-Time Analytics with Confluent and MemSQLReal-Time Analytics with Confluent and MemSQL
Real-Time Analytics with Confluent and MemSQL
 
Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?Webinar: SQL for Machine Data?
Webinar: SQL for Machine Data?
 
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in ProductionCassandra Day Atlanta 2015: Diagnosing Problems in Production
Cassandra Day Atlanta 2015: Diagnosing Problems in Production
 
Cassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in ProductionCassandra Day Chicago 2015: Diagnosing Problems in Production
Cassandra Day Chicago 2015: Diagnosing Problems in Production
 

More from Software Infrastructure

Kotlin
KotlinKotlin
NoSql
NoSqlNoSql
Quartz Scheduler
Quartz SchedulerQuartz Scheduler
Quartz Scheduler
Software Infrastructure
 
Test Driven Development
Test Driven DevelopmentTest Driven Development
Test Driven Development
Software Infrastructure
 
Deep Learning
Deep Learning Deep Learning
Deep Learning
Software Infrastructure
 
Progressive Web Apps
Progressive Web AppsProgressive Web Apps
Progressive Web Apps
Software Infrastructure
 
Java9
Java9Java9
Machine learning
Machine learningMachine learning
Machine learning
Software Infrastructure
 
Raspberry PI
Raspberry PIRaspberry PI
Golang
GolangGolang
Codename one
Codename oneCodename one
Hazelcast sunum
Hazelcast sunumHazelcast sunum
Hazelcast sunum
Software Infrastructure
 
Microsoft bot framework
Microsoft bot frameworkMicrosoft bot framework
Microsoft bot framework
Software Infrastructure
 
Blockchain use cases
Blockchain use casesBlockchain use cases
Blockchain use cases
Software Infrastructure
 
The Fintechs
The FintechsThe Fintechs
Server Side Swift
Server Side SwiftServer Side Swift
Server Side Swift
Software Infrastructure
 
Push Notification
Push NotificationPush Notification
Push Notification
Software Infrastructure
 
.Net Core
.Net Core.Net Core
Java Batch
Java BatchJava Batch
Big Data & Hadoop
Big Data & HadoopBig Data & Hadoop
Big Data & Hadoop
Software Infrastructure
 

More from Software Infrastructure (20)

Kotlin
KotlinKotlin
Kotlin
 
NoSql
NoSqlNoSql
NoSql
 
Quartz Scheduler
Quartz SchedulerQuartz Scheduler
Quartz Scheduler
 
Test Driven Development
Test Driven DevelopmentTest Driven Development
Test Driven Development
 
Deep Learning
Deep Learning Deep Learning
Deep Learning
 
Progressive Web Apps
Progressive Web AppsProgressive Web Apps
Progressive Web Apps
 
Java9
Java9Java9
Java9
 
Machine learning
Machine learningMachine learning
Machine learning
 
Raspberry PI
Raspberry PIRaspberry PI
Raspberry PI
 
Golang
GolangGolang
Golang
 
Codename one
Codename oneCodename one
Codename one
 
Hazelcast sunum
Hazelcast sunumHazelcast sunum
Hazelcast sunum
 
Microsoft bot framework
Microsoft bot frameworkMicrosoft bot framework
Microsoft bot framework
 
Blockchain use cases
Blockchain use casesBlockchain use cases
Blockchain use cases
 
The Fintechs
The FintechsThe Fintechs
The Fintechs
 
Server Side Swift
Server Side SwiftServer Side Swift
Server Side Swift
 
Push Notification
Push NotificationPush Notification
Push Notification
 
.Net Core
.Net Core.Net Core
.Net Core
 
Java Batch
Java BatchJava Batch
Java Batch
 
Big Data & Hadoop
Big Data & HadoopBig Data & Hadoop
Big Data & Hadoop
 

Recently uploaded

Event Management System Vb Net Project Report.pdf
Event Management System Vb Net  Project Report.pdfEvent Management System Vb Net  Project Report.pdf
Event Management System Vb Net Project Report.pdf
Kamal Acharya
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
Jayaprasanna4
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
gdsczhcet
 
Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.
PrashantGoswami42
 
LIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.pptLIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.ppt
ssuser9bd3ba
 
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfCOLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
Kamal Acharya
 
Democratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek AryaDemocratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek Arya
abh.arya
 
TECHNICAL TRAINING MANUAL GENERAL FAMILIARIZATION COURSE
TECHNICAL TRAINING MANUAL   GENERAL FAMILIARIZATION COURSETECHNICAL TRAINING MANUAL   GENERAL FAMILIARIZATION COURSE
TECHNICAL TRAINING MANUAL GENERAL FAMILIARIZATION COURSE
DuvanRamosGarzon1
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
Robbie Edward Sayers
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
AhmedHussein950959
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
Neometrix_Engineering_Pvt_Ltd
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
Divya Somashekar
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
fxintegritypublishin
 
Vaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdfVaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdf
Kamal Acharya
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
SamSarthak3
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
AafreenAbuthahir2
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
Kamal Acharya
 
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
ankuprajapati0525
 

Recently uploaded (20)

Event Management System Vb Net Project Report.pdf
Event Management System Vb Net  Project Report.pdfEvent Management System Vb Net  Project Report.pdf
Event Management System Vb Net Project Report.pdf
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
 
Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.
 
LIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.pptLIGA(E)11111111111111111111111111111111111111111.ppt
LIGA(E)11111111111111111111111111111111111111111.ppt
 
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfCOLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
 
Democratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek AryaDemocratizing Fuzzing at Scale by Abhishek Arya
Democratizing Fuzzing at Scale by Abhishek Arya
 
TECHNICAL TRAINING MANUAL GENERAL FAMILIARIZATION COURSE
TECHNICAL TRAINING MANUAL   GENERAL FAMILIARIZATION COURSETECHNICAL TRAINING MANUAL   GENERAL FAMILIARIZATION COURSE
TECHNICAL TRAINING MANUAL GENERAL FAMILIARIZATION COURSE
 
HYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generationHYDROPOWER - Hydroelectric power generation
HYDROPOWER - Hydroelectric power generation
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
 
Standard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - NeometrixStandard Reomte Control Interface - Neometrix
Standard Reomte Control Interface - Neometrix
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
 
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdfHybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
Hybrid optimization of pumped hydro system and solar- Engr. Abdul-Azeez.pdf
 
Vaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdfVaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdf
 
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
Final project report on grocery store management system..pdf
Final project report on grocery store management system..pdfFinal project report on grocery store management system..pdf
Final project report on grocery store management system..pdf
 
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
 

Stream Analytics

  • 1. Streaming Data Analytics Gizem Akman | Software Infrastructure Nov. 21, 2017
  • 2. Between 2017 and 2022, Gartner estimates that the market for event stream processing (ESP) platforms will grow 15% year over year (compound annual growth rate).
  • 3. Streaming Analytics – What? Analytics Real Time «Data in motion» Batch «Data at rest»
  • 4.
  • 5. Streaming Analytics – What? Software that can filter, aggregate, enrich, and analyze a high throughput of data from multiple, disparate live data sources and in any data format to identify simple and complex patterns to provide applications with context to detect opportune situations, automate immediate actions, and dynamically adapt.
  • 6. What is Real Time?
  • 7.
  • 8.
  • 9. Paradigms – Processing Types • Atomic • Micro Batching • Windowing
  • 10. Paradigms – Data Handling Guarantees • At Most Once • At Least Once • Exactly Once
  • 11. Requirements & Characteristics • Data must be real time (current) • High volume & High Velocity Data • «Perishable Data» • Analytic logic must be predefined • Ultra-High performance messaging • Unbounded - Execution never stops
  • 12. Streaming Analytics – How? • Analytic logic must be predefined • in-memory • Parallel – scale out • faster chips or GPUs • efficient algorithms (ex. minimizing context switches) • Leveraging innovative data architectures (ex. hashing) • Compromise on flexibility ( such as limiting random data access)
  • 14. Popular Platforms Open Source Apache Storm Apache Flink Spark Streaming Apache Samza Vendors IBM Streams Software AG – Apama Streaming Analytics Azure Stream Analytics SAP Event Stream Processor Oracle Stream Analytics SAS Event Stream Processing Cisco Streaming Analytics Amazon Kinesis Google Cloud Dataflow TIBCO Event Analytics Informatica Striim DataTorrent StreamAnalytix SQLStream Blaze Data Artisans Impetus Technologies EsperTech
  • 15.
  • 16. Basics • Free & open source distributed realtime computing engine «Hadoop for real time» • Fast over a 1M tuples processed per second per node
  • 21. Fault Tolerance «fail-fast, auto restart» • When a worker dies: the supervisor will restart it. If it continuously fails on startup and is unable to heartbeat to Nimbus, Nimbus will reschedule the worker. • When a node dies: The tasks assigned to that machine will time-out and Nimbus will reassign those tasks to other machines. • When Nimbus dies: The Nimbus is fail-fast (process self-destructs whenever any unexpected situation is encountered) and stateless (all state is kept in Zookeeper or on disk, so restart like nothing happened. • If you lose the Nimbus node, the workers will still continue to function. Additionally, supervisors will continue to restart workers if they die. However, without Nimbus, workers won't be reassigned to other machines when necessary (like if you lose a worker machine).
  • 24. Integration • Apache Kafka • Apache Hbase • Apache HDFS • Apache Hive • Apache Solr • Apache Cassandra • JDBC • JMS • Redis • Event Hubs • Elasticsearch • MQTT • Mongodb • OpenTSDB • Kinesis • Druid • Kestrel With External Systems, and Other Libraries With Containers, and Resource Management Systems • YARN • Mesos • Docker • Kubernetes
  • 25. And many others>> http://storm.apache.org/Powered-By.html

Editor's Notes

  1. *visualize real time *detect urgent situations *automate immediate actions Normal Programming >> code execution controls data Streaming Applications >> incoming data controls the code
  2. «loosely real time»
  3. pipeline
  4. Atomic:  (aka one-key-value-at-a-time) Processes each inbound data event as a separate element.  Seems logical but this is the most computationally expensive design.  For example, it’s used to guarantee fastest processing of individual events with least delay in transmitting the event to the subscriber.  Seen often for customer transactional inputs so that if some element of the event block fails the entire block is not deleted but moved to a bad record file that can later be processed further.  Apache Storm uses this paradigm. Micro batching:  Yes these are batches of events but typically those that occur within only a few milliseconds.  You can adjust the time window.  This makes the process somewhat more efficient.  Spark Streaming uses this paradigm. Windowing:  Similar to batching, Windowing allows the design of micro batches that may be simple time-based batches, but also allows for many more sophisticated interpretations such as sliding windows (e.g. everything that occurred in the last X period of time).  This can be very useful for aggregating events or determining outliers when compared to averages or standard deviation.
  5. At most once: data loss possible At least once: no data loss- redelivery attemps made, duplication possible Exactly once: typically require use of idempotent updates
  6. Compute Intensity, the number of arithmetic operations per I/O or global memory reference. In many signal processing applications today it is well over 50:1 and increasing with algorithmic complexity. Data Parallelism exists in a kernel if the same function is applied to all records of an input stream and a number of records can be processed simultaneously without waiting for results from previous records. Data Locality is a specific type of temporal locality common in signal and media processing applications where data is produced once, read once or twice later in the application, and never read again. Intermediate streams passed between kernels as well as intermediate data within kernel functions can capture this locality directly using the stream processing programming model.
  7. Most of these can be hidden within modern analytics products so that the user does not have to be aware of exactly how they are being used. Analytic logic must be predefined – daha önceki datalara dayanmamalı, independent belirlenmiş kurallar pipelineından geçiyor Stream processing is essentially a compromise, driven by a data-centric model that works very well for traditional DSP or GPU-type applications (such as image, video and digital signal processing) but less so for general purpose processing with more randomized data access (such as databases). By sacrificing some flexibility in the model, the implications allow easier, faster and more efficient execution. Depending on the context, processor design may be tuned for maximum efficiency or a trade-off for flexibility.
  8. Zero-copy Kafka neden hızlı? **Arada context switch yok**   Zero Copy - basically it calls the OS kernel direct rather than at the application layer to move data fast. Batch Data in Chunks - Kafka is all about batching the data into chunks. This minimises cross machine latency with all the buffering/copying that accompanies this. Avoids Random Disk Access - as Kafka is an immutable commit log it does not need to rewind the disk and do many random I/O operations and can just access the disk in a sequential manner. This enables it to get similar speeds from a physical disk compared with memory. Can Scale Horizontally - The ability to have thousands of partitions for a single topic spread among thousands of machines means Kafka can handle huge loads
  9. Storm is designed for massive scalability, supports fault-tolerance with a “fail fast, auto restart” approach to processes, and offers a strong guarantee that every tuple will be processed. Storm defaults to an “at least once” guarantee for messages, but offers the ability to implement “exactly once” processing as well. Storm is written primarily in Clojure and is designed to support wiring “spouts” (think input streams) and “bolts” (processing and output modules) together as a directed acyclic graph (DAG) called a topology. Storm topologies run on clusters and the Storm scheduler distributes work to nodes around the cluster, based on the topology configuration. You can think of topologies as roughly analogous to a MapReduce job in Hadoop, except that given Storm’s focus on real-time, stream-based processing, topologies default to running forever or until manually terminated. Once a topology is started, the spouts bring data into the system and hand the data off to bolts (which may in turn hand data to subsequent bolts) where the main computational work is done. As processing progresses, one or more bolts may write data out to a database or file system, send a message to another external system, or otherwise make the results of the computation available to the users. Hadoop: Store & Query. Store the data first, and indefinitely. Analyse when you like. Storm: Analyze while data is being produced. No need to store anything at all.
  10. Spouttan her tuple unique bir messageid ile çıkıyor ve bu messageidnin ack veya fail edilmesi gerekiyor. Ack edilmeyenler failed sayılıp spouttan tekrar tetikleniyor. Her boltun işlediği tupleı acklemesi lazım.
  11. Storm  cbdevapp02 Redis  nsidevapp01 redis/redis-4.0.2/src/redis-cli psubscribe WordCountTopology redis/redis-4.0.2/src/redis-server /WebSphere85/AppServer/java_1.8_64/bin/java -jar storm/Storm.jar 1>/dev/null 2>&1