16BIT IITR
Data Collection Module
Streaming Data Processing with Apache Storm
Data Stream Processing
Slides @ https://goo.gl/BJRf9A
Overview
• Streaming Data Processing
• What is Apache Storm?
• Storm Architecture and Key Concepts
• Monitoring of Storm Cluster
• Development of Storm Apps
• Comparison with other software
• Resources
Types of Processing of Big Data
• Batch Processing
Takes a large amount of data at a time, analyzes it, and produces a large output.
• Real-Time Processing
Collects, analyzes, and produces output in real time.
Streaming Data Processing
• Today, most data is continuously produced
user activity logs, web logs, sensors, database transactions, social data…
• The common approach to analyzing such data so far:
o Record data stream to stable storage (DBMS, HDFS, …)
o Periodically analyze data with batch processing engine
(DBMS, MapReduce, ...)
• Stream processing engines analyze data as it arrives
Why do Stream Processing?
• Decreases the overall latency to obtain results
o No need to persist data in stable storage
o No periodic batch analysis jobs
• Simplifies the data infrastructure
o Fewer moving parts to be maintained and coordinated
• Makes time dimension of data explicit
o Each event has a timestamp
o Data can be processed based on timestamps
What are the Requirements?
• Large Scale
• Low latency
Results in milliseconds
• High throughput
Millions of events per second
• Exactly-once consistency
Correct results in case of failures
• Out-of-order events
Process events based on their associated time
• Intuitive APIs
Streaming Data Architecture
Stream Processing Technologies
Apache Storm
• Distributed, fault-tolerant and real-time computation
• Originated at BackType/Twitter, open sourced in late 2011
• Implemented in Clojure, some Java
• Supports APIs in many languages, including Java, Python, and Scala
Use Cases of Apache Storm
Storm Cluster Architecture
[Diagram: a Nimbus master node coordinating several Supervisor nodes through a ZooKeeper ensemble]
Nimbus – The Master Node
o Distributes code around the cluster
o Assigns tasks to machines/supervisors
o Monitors for failures
o Stateless
Apache ZooKeeper
o Highly robust
o Provides service discovery and coordination
o Stores the cluster state
Supervisor
o Listens for work assigned to its machine
o Starts and stops worker processes based on instructions
from Nimbus
o Stateless
Storm Architecture – Fault Tolerance
• What happens when Nimbus dies (master node)?
o If Nimbus is run under process supervision as recommended (e.g. via supervisord), it will restart like
nothing happened.
o While Nimbus is down:
Existing topologies will continue to run, but you cannot submit new topologies.
Running worker processes will not be affected. Also, Supervisors will restart their (local) workers if needed.
However, failed tasks will not be reassigned to other machines, as this is the responsibility of Nimbus.
• What happens when a Supervisor dies (slave node)?
o If a Supervisor is run under process supervision as recommended (e.g. via supervisord), it will restart like
nothing happened.
o Running worker processes will not be affected.
• What happens when a worker process dies?
Its parent Supervisor will restart it. If it continuously fails on startup and is unable to heartbeat to Nimbus,
Nimbus will reassign the worker to another machine.
Key Concepts – Data Model
Data Stream: an unbounded sequence of tuples
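Conceptually, a tuple is an ordered list of values with named fields. The plain-Java sketch below illustrates the idea only; it is not Storm's actual org.apache.storm.tuple.Tuple class, and the class and field names are invented for illustration:

```java
import java.util.Arrays;
import java.util.List;

// Conceptual sketch of Storm's data model: a tuple is an ordered
// list of values paired with declared field names.
public class TupleSketch {
    private final List<String> fields;
    private final List<Object> values;

    public TupleSketch(List<String> fields, List<Object> values) {
        if (fields.size() != values.size()) {
            throw new IllegalArgumentException("fields and values must align");
        }
        this.fields = fields;
        this.values = values;
    }

    // Look a value up by its declared field name.
    public Object getValueByField(String field) {
        return values.get(fields.indexOf(field));
    }

    public static void main(String[] args) {
        TupleSketch t = new TupleSketch(
            Arrays.asList("word", "count"),
            Arrays.asList("storm", 3));
        System.out.println(t.getValueByField("count")); // prints 3
    }
}
```

A data stream is then simply an unbounded sequence of such tuples, all sharing the same declared fields.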
Key Concepts – Spouts and Bolts
Spouts
• Source of data streams
Example: Connect to the Twitter API and emit a stream of tweets.
[Diagram: Spout 1 → Bolt 1]
Bolts
• Consume streams and potentially produce new streams
• Can do anything: run functions, filter tuples, perform joins, talk to databases, etc.
• Complex stream transformations often require multiple steps and thus multiple bolts.
[Diagram: Spout 1 → Bolt 1 → Bolt 2]
Key Concepts – Topology
• Network of Spouts and Bolts
• Wires data and functions via a DAG.
• Executes forever and on many machines.
[Diagram: data flowing from Spouts 1–2 through Bolts 1–4 in a DAG]
Deploying a Storm Cluster
• http://storm.apache.org/about/deployment.html
• http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/
• http://knowm.org/how-to-install-a-distributed-apache-storm-cluster/
• https://github.com/nathanmarz/storm-deploy
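Whichever guide you follow, deployment boils down to pointing every node at the same ZooKeeper ensemble and Nimbus. A minimal storm.yaml sketch (hostnames, paths, and ports are placeholders; key names follow Storm 1.x+ defaults):

```yaml
# storm.yaml -- minimal sketch; hostnames and paths are placeholders
storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"
storm.local.dir: "/var/storm"          # scratch dir for Nimbus/Supervisor state
nimbus.seeds: ["nimbus1.example.com"]  # nimbus.host in pre-1.0 releases
supervisor.slots.ports:                # one worker process per listed port
  - 6700
  - 6701
```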
Monitoring your Storm Cluster
• Storm UI
Developing Storm Apps
A trivial “Hello, Storm” topology
[Diagram: Spout “emit random number < 100” → Bolt “multiply by 2”, e.g. (74) → (148)]
Developing Storm Apps – Writing Spouts
• Several built-in spouts are available for connecting to various kinds of streams.
Example: a basic spout that generates data by itself.
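A self-generating spout along the lines of the “Hello, Storm” example might look like the sketch below. It assumes the org.apache.storm Java API and the storm-core dependency on the classpath; the class and field names are illustrative:

```java
import java.util.Map;
import java.util.Random;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

// Illustrative spout that generates its own data: random numbers below 100.
public class RandomNumberSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random random;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        // Called once when the spout task starts on a worker JVM.
        this.collector = collector;
        this.random = new Random();
    }

    @Override
    public void nextTuple() {
        // Storm calls this in a loop; emit one tuple per call.
        Utils.sleep(100);                              // throttle emission
        collector.emit(new Values(random.nextInt(100)));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("number"));
    }
}
```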
Developing Storm Apps – Writing Bolts
• Two main options for JVM users:
o Implement the IRichBolt or IBasicBolt interfaces
o Extend the BaseRichBolt or BaseBasicBolt abstract classes
• BaseRichBolt
o You must – and are able to – manually ack() an incoming tuple.
• BaseBasicBolt
o Auto-acks the incoming tuple at the end of its execute() method.
o These bolts are typically simple functions or filters.
Extending BaseRichBolt
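A sketch of such a bolt, continuing the “multiply by 2” example (illustrative names; assumes the org.apache.storm Java API and the storm-core dependency):

```java
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Illustrative bolt: consumes a "number" field, emits it doubled.
public class MultiplyByTwoBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        // The "second constructor": runs on the target JVM after deserialization.
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        int n = input.getIntegerByField("number");
        collector.emit(input, new Values(n * 2)); // anchor to input for reliability
        collector.ack(input);                     // BaseRichBolt: you must ack manually
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Must match what execute() actually emits.
        declarer.declare(new Fields("doubled"));
    }
}
```

Had this bolt extended BaseBasicBolt instead, the ack() call would be unnecessary: the tuple is auto-acked when execute() returns.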
execute() is the heart of the bolt.
This is where you will focus most of your attention when implementing your bolt or when trying to
understand somebody else’s bolt.
prepare() acts as a “second constructor” for the bolt’s class.
Because of Storm’s distributed execution model and serialization, prepare() is often needed to fully
initialize the bolt on the target JVM.
declareOutputFields() tells downstream bolts about this bolt’s output. What you declare must
match what you actually emit().
Developing Storm Apps – Writing a Topology
• When creating a topology you’re essentially defining the DAG – that is, which spouts and bolts to use,
and how they interconnect.
• You must specify the initial parallelism of the topology
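For example, wiring up the spout and bolt sketched earlier (the component classes and names are illustrative; assumes the org.apache.storm Java API):

```java
import org.apache.storm.topology.TopologyBuilder;

// Illustrative topology definition: defines the DAG and initial parallelism.
public class DoublerTopologyDefinition {
    public static TopologyBuilder build() {
        TopologyBuilder builder = new TopologyBuilder();
        // Two executors for the spout, four for the bolt (initial parallelism).
        builder.setSpout("numbers", new RandomNumberSpout(), 2);
        builder.setBolt("doubler", new MultiplyByTwoBolt(), 4)
               .shuffleGrouping("numbers"); // randomly distribute "numbers" tuples
        return builder;
    }
}
```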
Developing Storm Apps – Submitting and Running a Topology
• You submit a topology either to a “local” cluster or to a real cluster.
• To run a topology you must first package your code into a “fat jar”.
o Include all of your code’s dependencies, but:
o Exclude the Storm dependency itself, as the Storm cluster will provide this.
Note: You may need to tweak your build script so that your local tests do include the Storm dependency.
See e.g. assembly.sbt in kafka-storm-starter for an example.
• A topology is run via the storm jar command.
o This will connect to Nimbus, upload your jar, and run the topology.
o Use any machine that can run "storm jar" and talk to Nimbus' Thrift port.
o The configuration of the machine the topology is deployed from is passed through a separate
config file
$ storm jar all-my-code.jar com.miguno.MyTopology arg1 arg2
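The main() that `storm jar` invokes typically chooses between the two submission modes. A sketch, assuming the org.apache.storm Java API (topology wiring and names are illustrative):

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class MyTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... setSpout()/setBolt() wiring omitted for brevity ...
        Config conf = new Config();
        conf.setNumWorkers(2);

        if (args.length == 0) {
            // "Local" cluster: Nimbus, ZooKeeper, and workers run in-process,
            // which is handy for development and testing.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("my-topology", conf, builder.createTopology());
        } else {
            // Real cluster: this branch runs when submitted via `storm jar`.
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        }
    }
}
```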
Many stream processing frameworks: which one to use?
The choice would depend on your use cases.
Resources
• A few Storm books are already available.
• Storm documentation
http://storm.incubator.apache.org/documentation/Home.html
• Storm-kafka
https://github.com/apache/incubator-storm/tree/master/external/storm-kafka
• Mailing lists
http://storm.incubator.apache.org/community.html
• Code examples
https://github.com/apache/incubator-storm/tree/master/examples/storm-starter
https://github.com/miguno/kafka-storm-starter/
Thank You!
Extra Slides
Use Cases of Apache Storm
• Stream processing:
Storm processes streams of data and updates a variety of databases in real time;
the processing speed needs to match the input data rate.
• Continuous computation:
Storm can run continuous computations on data streams and stream the results to clients in
real time.
• Distributed RPC:
Storm can parallelize an intense query so that you can compute it in real time.
• Real-time analytics:
Storm can analyze and respond to data from different sources as it arrives, in real time.
What can I do with Wirbelsturm?
• Get a first impression of Storm
• Test-drive your topologies
• Test failure handling
• Stop/kill Nimbus, check what happens to Supervisors.
• Stop/kill ZooKeeper instances, check what happens to topology.
• Use as sandbox environment to test/validate deployments
• “What will actually happen when I deactivate this topology?”
• “Will my Hiera changes actually work?”
• Reproduce production issues, share results with Dev
• Also helpful when reporting back to Storm project and mailing lists.
• Any further cool ideas?
33

More Related Content

What's hot

S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
Flink Forward
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
Spark Summit
 
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
Open Analytics
 

What's hot (20)

S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
IT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntop
IT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntopIT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntop
IT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntop
 
Assaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleAssaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at Scale
 
Designing and Implementing your IOT Solutions with Open Source
Designing and Implementing your IOT Solutions with Open SourceDesigning and Implementing your IOT Solutions with Open Source
Designing and Implementing your IOT Solutions with Open Source
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
FIWARE Wednesday Webinars - Architecting Your Smart Solution Powered by FIWARE
FIWARE Wednesday Webinars - Architecting Your Smart Solution Powered by FIWAREFIWARE Wednesday Webinars - Architecting Your Smart Solution Powered by FIWARE
FIWARE Wednesday Webinars - Architecting Your Smart Solution Powered by FIWARE
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex
 
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
 
Suneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JSuneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4J
 
Hadoop at Lookout
Hadoop at LookoutHadoop at Lookout
Hadoop at Lookout
 
Juggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary dataJuggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary data
 
Drag and Drop Open Source GeoTools ETL with Apache NiFi
Drag and Drop Open Source GeoTools ETL with Apache NiFiDrag and Drop Open Source GeoTools ETL with Apache NiFi
Drag and Drop Open Source GeoTools ETL with Apache NiFi
 
The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...
The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...
The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...
 
STORM as an ETL Engine to HADOOP
STORM as an ETL Engine to HADOOPSTORM as an ETL Engine to HADOOP
STORM as an ETL Engine to HADOOP
 
Development and Applications of Distributed IoT Sensors for Intermittent Conn...
Development and Applications of Distributed IoT Sensors for Intermittent Conn...Development and Applications of Distributed IoT Sensors for Intermittent Conn...
Development and Applications of Distributed IoT Sensors for Intermittent Conn...
 
Experience with Kafka & Storm
Experience with Kafka & StormExperience with Kafka & Storm
Experience with Kafka & Storm
 
NetApp Fabric Pool Deck
NetApp Fabric Pool DeckNetApp Fabric Pool Deck
NetApp Fabric Pool Deck
 
CodeStock - Exploring .NET memory management - a trip down memory lane
CodeStock - Exploring .NET memory management - a trip down memory laneCodeStock - Exploring .NET memory management - a trip down memory lane
CodeStock - Exploring .NET memory management - a trip down memory lane
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scale
 

Viewers also liked

Tupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FBTupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FB
Docker, Inc.
 
Resume For Gary W Mitchell (021516)
Resume For Gary W Mitchell (021516)  Resume For Gary W Mitchell (021516)
Resume For Gary W Mitchell (021516)
Gary Mitchell
 
CSEC630_TeamAssignment_TeamBlazer_FINAL
CSEC630_TeamAssignment_TeamBlazer_FINALCSEC630_TeamAssignment_TeamBlazer_FINAL
CSEC630_TeamAssignment_TeamBlazer_FINAL
Ronald Jackson, Jr
 
Resume_Pawan Jadhav_Testing
Resume_Pawan Jadhav_TestingResume_Pawan Jadhav_Testing
Resume_Pawan Jadhav_Testing
Pawan Jadhav
 
Dick Kramer 2016 Resume
Dick Kramer 2016 ResumeDick Kramer 2016 Resume
Dick Kramer 2016 Resume
Dick Kramer
 

Viewers also liked (16)

Product profile energi efisiensi
Product profile energi efisiensiProduct profile energi efisiensi
Product profile energi efisiensi
 
Tupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FBTupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FB
 
Introduction to Kafka connect
Introduction to Kafka connectIntroduction to Kafka connect
Introduction to Kafka connect
 
Resume_PintuChattaraj
Resume_PintuChattarajResume_PintuChattaraj
Resume_PintuChattaraj
 
Benefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a ServiceBenefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a Service
 
2 - SWOON BROCHURE
2 - SWOON BROCHURE2 - SWOON BROCHURE
2 - SWOON BROCHURE
 
RAGHUNATH_GORLA_RESUME
RAGHUNATH_GORLA_RESUMERAGHUNATH_GORLA_RESUME
RAGHUNATH_GORLA_RESUME
 
Cloud Price Comparison - AWS vs Azure vs Google
Cloud Price Comparison - AWS vs Azure vs GoogleCloud Price Comparison - AWS vs Azure vs Google
Cloud Price Comparison - AWS vs Azure vs Google
 
yasmin said
yasmin saidyasmin said
yasmin said
 
Resume For Gary W Mitchell (021516)
Resume For Gary W Mitchell (021516)  Resume For Gary W Mitchell (021516)
Resume For Gary W Mitchell (021516)
 
CSEC630_TeamAssignment_TeamBlazer_FINAL
CSEC630_TeamAssignment_TeamBlazer_FINALCSEC630_TeamAssignment_TeamBlazer_FINAL
CSEC630_TeamAssignment_TeamBlazer_FINAL
 
Wynia CV
Wynia CVWynia CV
Wynia CV
 
Resume_Pawan Jadhav_Testing
Resume_Pawan Jadhav_TestingResume_Pawan Jadhav_Testing
Resume_Pawan Jadhav_Testing
 
MySQL运维那些事
MySQL运维那些事 MySQL运维那些事
MySQL运维那些事
 
Rajni CV - PM-Final
Rajni CV - PM-FinalRajni CV - PM-Final
Rajni CV - PM-Final
 
Dick Kramer 2016 Resume
Dick Kramer 2016 ResumeDick Kramer 2016 Resume
Dick Kramer 2016 Resume
 

Similar to Workshop slides

MongoDB at MapMyFitness from a DevOps Perspective
MongoDB at MapMyFitness from a DevOps PerspectiveMongoDB at MapMyFitness from a DevOps Perspective
MongoDB at MapMyFitness from a DevOps Perspective
MongoDB
 
Tornado Web Server Internals
Tornado Web Server InternalsTornado Web Server Internals
Tornado Web Server Internals
Praveen Gollakota
 

Similar to Workshop slides (20)

Making Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout SoftwareMaking Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout Software
 
Using a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming AggregationsUsing a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming Aggregations
 
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...
 
Master-Master Replication and Scaling of an Application Between Each of the I...
Master-Master Replication and Scaling of an Application Between Each of the I...Master-Master Replication and Scaling of an Application Between Each of the I...
Master-Master Replication and Scaling of an Application Between Each of the I...
 
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S... New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
 
Deep Dive Time Series Anomaly Detection in Azure with dotnet
Deep Dive Time Series Anomaly Detection in Azure with dotnetDeep Dive Time Series Anomaly Detection in Azure with dotnet
Deep Dive Time Series Anomaly Detection in Azure with dotnet
 
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
 
Circonus: Design failures - A Case Study
Circonus: Design failures - A Case StudyCirconus: Design failures - A Case Study
Circonus: Design failures - A Case Study
 
MongoDB at MapMyFitness from a DevOps Perspective
MongoDB at MapMyFitness from a DevOps PerspectiveMongoDB at MapMyFitness from a DevOps Perspective
MongoDB at MapMyFitness from a DevOps Perspective
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
 
MongoDB at MapMyFitness
MongoDB at MapMyFitnessMongoDB at MapMyFitness
MongoDB at MapMyFitness
 
Genomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker SwarmGenomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker Swarm
 
Tornado Web Server Internals
Tornado Web Server InternalsTornado Web Server Internals
Tornado Web Server Internals
 
From Device to Data Center to Insights
From Device to Data Center to InsightsFrom Device to Data Center to Insights
From Device to Data Center to Insights
 
Manta Unleashed BigDataSG talk 2 July 2013
Manta Unleashed BigDataSG talk 2 July 2013Manta Unleashed BigDataSG talk 2 July 2013
Manta Unleashed BigDataSG talk 2 July 2013
 
Model driven telemetry
Model driven telemetryModel driven telemetry
Model driven telemetry
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 

Recently uploaded

ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdfONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
Kamal Acharya
 
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
Lovely Professional University
 

Recently uploaded (20)

ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdfONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
ONLINE VEHICLE RENTAL SYSTEM PROJECT REPORT.pdf
 
Attraction and Repulsion type Moving Iron Instruments.pptx
Attraction and Repulsion type Moving Iron Instruments.pptxAttraction and Repulsion type Moving Iron Instruments.pptx
Attraction and Repulsion type Moving Iron Instruments.pptx
 
Filters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility ApplicationsFilters for Electromagnetic Compatibility Applications
Filters for Electromagnetic Compatibility Applications
 
Object Oriented Programming OOP Lab Manual.docx
Object Oriented Programming OOP Lab Manual.docxObject Oriented Programming OOP Lab Manual.docx
Object Oriented Programming OOP Lab Manual.docx
 
RESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdf
RESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdfRESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdf
RESORT MANAGEMENT AND RESERVATION SYSTEM PROJECT REPORT.pdf
 
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWINGBRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
BRAKING SYSTEM IN INDIAN RAILWAY AutoCAD DRAWING
 
Artificial Intelligence Bayesian Reasoning
Artificial Intelligence Bayesian ReasoningArtificial Intelligence Bayesian Reasoning
Artificial Intelligence Bayesian Reasoning
 
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and VisualizationKIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
KIT-601 Lecture Notes-UNIT-5.pdf Frame Works and Visualization
 
Arduino based vehicle speed tracker project
Arduino based vehicle speed tracker projectArduino based vehicle speed tracker project
Arduino based vehicle speed tracker project
 
ANSI(ST)-III_Manufacturing-I_05052020.pdf
ANSI(ST)-III_Manufacturing-I_05052020.pdfANSI(ST)-III_Manufacturing-I_05052020.pdf
ANSI(ST)-III_Manufacturing-I_05052020.pdf
 
ROAD CONSTRUCTION PRESENTATION.PPTX.pptx
ROAD CONSTRUCTION PRESENTATION.PPTX.pptxROAD CONSTRUCTION PRESENTATION.PPTX.pptx
ROAD CONSTRUCTION PRESENTATION.PPTX.pptx
 
Dairy management system project report..pdf
Dairy management system project report..pdfDairy management system project report..pdf
Dairy management system project report..pdf
 
The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...
The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...
The battle for RAG, explore the pros and cons of using KnowledgeGraphs and Ve...
 
Theory for How to calculation capacitor bank
Theory for How to calculation capacitor bankTheory for How to calculation capacitor bank
Theory for How to calculation capacitor bank
 
Low rpm Generator for efficient energy harnessing from a two stage wind turbine
Low rpm Generator for efficient energy harnessing from a two stage wind turbineLow rpm Generator for efficient energy harnessing from a two stage wind turbine
Low rpm Generator for efficient energy harnessing from a two stage wind turbine
 
Online book store management system project.pdf
Online book store management system project.pdfOnline book store management system project.pdf
Online book store management system project.pdf
 
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
Activity Planning: Objectives, Project Schedule, Network Planning Model. Time...
 
Software Engineering - Modelling Concepts + Class Modelling + Building the An...
Software Engineering - Modelling Concepts + Class Modelling + Building the An...Software Engineering - Modelling Concepts + Class Modelling + Building the An...
Software Engineering - Modelling Concepts + Class Modelling + Building the An...
 
Multivibrator and its types defination and usges.pptx
Multivibrator and its types defination and usges.pptxMultivibrator and its types defination and usges.pptx
Multivibrator and its types defination and usges.pptx
 
BURGER ORDERING SYSYTEM PROJECT REPORT..pdf
BURGER ORDERING SYSYTEM PROJECT REPORT..pdfBURGER ORDERING SYSYTEM PROJECT REPORT..pdf
BURGER ORDERING SYSYTEM PROJECT REPORT..pdf
 

Workshop slides

  • 1. 16BIT IITR Data Collection ModuleData Collection Module Streaming Data Processing with Apache Storm Data Stream Processing Slides @ https://goo.gl/BJRf9A
  • 2. 16BIT IITR Data Collection ModuleData Collection Module Overview Data Stream Processing • Streaming Data Processing • What is Apache Storm? • Storm Architecture and Key Concepts • Monitoring of Storm Cluster • Development of Storm Apps • Comparison with other softwares • Resources
  • 3. 16BIT IITR Data Collection ModuleData Collection Module Types of Processing of Big Data Data Stream Processing • Batch Processing Takes large amount Data at a time, analyzes it and produces a large output. • Real-Time Processing Collects, analyzes and produces output in Real time.
  • 4. 16BIT IITR Data Collection ModuleData Collection Module Streaming Data Processing Data Stream Processing • Today, most data is continuously produced user activity logs, web logs, sensors, database transactions, social data… • The common approach to analyze such data so far o Record data stream to stable storage (DBMS, HDFS, …) o Periodically analyze data with batch processing engine (DBMS, MapReduce, ...) • Streaming processing engines analyze data while it arrives
  • 5. 16BIT IITR Data Collection ModuleData Collection Module Why do Stream Processing? Data Stream Processing • Decreases the overall latency to obtain results o No need to persist data in stable storage o No periodic batch analysis jobs • Simplifies the data infrastructure o Fewer moving parts to be maintained and coordinated • Makes time dimension of data explicit o Each event has a timestamp o Data can be processed based on timestamps
  • 6. 16BIT IITR Data Collection ModuleData Collection Module What are the Requirements? Data Stream Processing • Large Scale • Low latency Results in millisecond • High throughput Millions of events per second • Exactly-once consistency Correct results in case of failures • Out-of-order events Process events based on their associated time • Intuitive APIs
  • 7. 16BIT IITR Data Collection ModuleData Collection Module Streaming Data Architecture Data Stream Processing
  • 8. 16BIT IITR Data Collection ModuleData Collection Module Stream Processing Technologies Data Stream Processing
  • 9. 16BIT IITR Data Collection ModuleData Collection Module Apache Storm Data Stream Processing • Distributed, fault-tolerant and real-time computation • Originated at BackType/Twitter, open sourced in late 2011 • Implemented in Clojure, some Java • Supports APIs in many languages including Java, Python, Scala etc.
• 10. Use Cases of Apache Storm
• 11. Storm Cluster Architecture
  [Diagram: a Nimbus node and a ZooKeeper ensemble coordinating several Supervisor nodes]
  • Nimbus – the master node
    o Distributes code around the cluster
    o Assigns tasks to machines/supervisors
    o Monitors for failures
    o Stateless
  • Apache ZooKeeper
    o Highly robust
    o Provides service discovery and coordination
    o Stores the cluster state
  • Supervisor
    o Listens for work assigned to its machine
    o Starts and stops worker processes based on instructions from Nimbus
    o Stateless
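The three roles are wired together through Storm's configuration file. A minimal storm.yaml sketch (the hostnames are placeholders; the keys shown are standard Storm configuration entries):

```yaml
# storm.yaml sketch -- hostnames are placeholders, adjust to your cluster
storm.zookeeper.servers:        # the ZooKeeper ensemble that stores cluster state
  - "zk1.example.com"
  - "zk2.example.com"
  - "zk3.example.com"
nimbus.seeds: ["nimbus.example.com"]   # where Supervisors find the master node
supervisor.slots.ports:         # each port is one worker-process slot on this Supervisor
  - 6700
  - 6701
  - 6702
  - 6703
```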
• 12. Storm Architecture – Fault Tolerance
  • What happens when Nimbus dies (master node)?
    o If Nimbus is run under process supervision as recommended (e.g. via supervisord), it will restart like nothing happened.
    o While Nimbus is down: existing topologies will continue to run, but you cannot submit new topologies. Running worker processes will not be affected, and Supervisors will restart their (local) workers if needed. However, failed tasks will not be reassigned to other machines, as this is the responsibility of Nimbus.
  • What happens when a Supervisor dies (slave node)?
    o If the Supervisor is run under process supervision as recommended (e.g. via supervisord), it will restart like nothing happened.
    o Running worker processes will not be affected.
  • What happens when a worker process dies?
    o Its parent Supervisor will restart it. If it continuously fails on startup and is unable to heartbeat to Nimbus, Nimbus will reassign the worker to another machine.
• 13. Key Concepts – Data Model
  • A data stream is an unbounded sequence of tuples.
• 14. Key Concepts – Spouts and Bolts
  • Spouts
    o Source of data streams
    o Example: connect to the Twitter API and emit a stream of tweets.
  • Bolts
    o Consume streams and potentially produce new streams
    o Can do anything: run functions, filter tuples, perform joins, talk to a DB, etc.
    o Complex stream transformations often require multiple steps and thus multiple bolts.
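Conceptually, a spout is a generator of tuples and a bolt is a function from a tuple stream to a tuple stream. A plain-Python sketch of the idea, with no Storm dependency (all names here are ours, chosen for illustration):

```python
import itertools

def sentence_spout():
    """Spout sketch: an unbounded source of one-field tuples."""
    for i in itertools.count():
        yield ("sentence %d is short" % i,)

def split_bolt(stream):
    """Bolt sketch: consumes a stream of sentences, emits a stream of words."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)

def count_bolt(stream):
    """Second bolt in the chain: pairs each word with its running count."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])

# Chain the bolts and take a finite slice of the (conceptually unbounded) stream:
pipeline = count_bolt(split_bolt(sentence_spout()))
first = list(itertools.islice(pipeline, 5))
print(first)  # -> [('sentence', 1), ('0', 1), ('is', 1), ('short', 1), ('sentence', 2)]
```

The chaining of `count_bolt(split_bolt(...))` mirrors how multi-step stream transformations end up as multiple bolts in a topology.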
• 15. Key Concepts – Topology
  [Diagram: two spouts feeding data into a network of four interconnected bolts]
  • Network of spouts and bolts
  • Wires data and functions via a DAG (directed acyclic graph)
  • Executes forever and on many machines
• 16. Deploying a Storm Cluster
  • http://storm.apache.org/about/deployment.html
  • http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/
  • http://knowm.org/how-to-install-a-distributed-apache-storm-cluster/
  • https://github.com/nathanmarz/storm-deploy
• 17. Monitoring your Storm Cluster
  • Storm UI
• 18. Developing Storm Apps
  • A trivial “Hello, Storm” topology:
    a Spout that emits random numbers < 100, wired to a Bolt that multiplies each by 2, e.g. (74) becomes (148).
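The data flow of this “Hello, Storm” example can be mimicked in plain Python (a sketch of the idea only, not the Storm API; names are ours):

```python
import random

def number_spout(rng, n):
    """Spout sketch: emit n random numbers below 100, as one-field tuples."""
    for _ in range(n):
        yield (rng.randrange(100),)

def doubler_bolt(stream):
    """Bolt sketch: multiply each incoming number by 2, e.g. (74,) -> (148,)."""
    for (x,) in stream:
        yield (2 * x,)

rng = random.Random(0)  # fixed seed so the demo run is reproducible
out = [y for (y,) in doubler_bolt(number_spout(rng, 5))]
assert all(y % 2 == 0 and 0 <= y < 200 for y in out)
print(out)
```

In real Storm the spout never stops after `n` tuples; it keeps emitting and the bolt keeps consuming, forever.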
• 19. Developing Storm Apps – Writing Spouts
  • Several kinds of built-in spouts are available to connect to various kinds of streams
  • Example: a basic spout that generates data by itself
• 20. Developing Storm Apps – Writing Bolts
  • Two main options for JVM users:
    o Implement the IRichBolt or IBasicBolt interfaces
    o Extend the BaseRichBolt or BaseBasicBolt abstract classes
  • BaseRichBolt
    o You must – and are able to – manually ack() an incoming tuple.
  • BaseBasicBolt
    o Auto-acks the incoming tuple at the end of its execute() method.
    o These bolts are typically simple functions or filters.
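The difference between the two base classes boils down to who calls ack(). A plain-Python sketch of that contract (these are not Storm's actual classes; all names are illustrative):

```python
acked = []

def ack(tup):
    """Stand-in for the output collector's ack(): records a finished tuple."""
    acked.append(tup)

def rich_style_execute(tup):
    """BaseRichBolt style: the bolt body must ack the tuple itself."""
    result = tup * 2
    ack(tup)   # forget this and the tuple eventually times out and is replayed
    return result

def basic_style(execute_body):
    """BaseBasicBolt style: a wrapper auto-acks after the body returns."""
    def wrapped(tup):
        result = execute_body(tup)
        ack(tup)   # done by the framework, not by the bolt author
        return result
    return wrapped

doubler = basic_style(lambda tup: tup * 2)   # a simple function-like bolt
rich_style_execute(1)
doubler(2)
assert acked == [1, 2]
```

This is why simple functions and filters fit BaseBasicBolt well, while bolts that emit asynchronously or hold tuples across calls need the manual control of BaseRichBolt.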
• 21. Developing Storm Apps – Writing Bolts
  • Extending BaseRichBolt
• 22. Developing Storm Apps – Writing Bolts
  • execute() is the heart of the bolt. This is where you will focus most of your attention when implementing your bolt or when trying to understand somebody else’s bolt.
• 23. Developing Storm Apps – Writing Bolts
  • prepare() acts as a “second constructor” for the bolt’s class. Because of Storm’s distributed execution model and serialization, prepare() is often needed to fully initialize the bolt on the target JVM.
• 24. Developing Storm Apps – Writing Bolts
  • declareOutputFields() tells downstream bolts about this bolt’s output. What you declare must match what you actually emit().
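Why the declaration must match: downstream bolts address values by the declared field names, roughly as if the names were zipped with the emitted values. A plain-Python sketch of that pairing (names and helper are illustrative, not Storm API):

```python
DECLARED_FIELDS = ("word", "count")   # what declareOutputFields() would announce

def emit(values):
    """Sketch: pair emitted values with the declared field names, the way a
    downstream bolt looks them up by field name."""
    if len(values) != len(DECLARED_FIELDS):
        raise ValueError("emitted %d values but declared %d fields"
                         % (len(values), len(DECLARED_FIELDS)))
    return dict(zip(DECLARED_FIELDS, values))

tup = emit(("storm", 3))
assert tup["word"] == "storm" and tup["count"] == 3
```

Emitting a tuple whose arity disagrees with the declaration (here, `emit(("storm",))`) breaks every downstream lookup, which is the failure the rule on this slide guards against.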
• 25. Developing Storm Apps – Writing a Topology
  • When creating a topology you’re essentially defining the DAG – that is, which spouts and bolts to use, and how they interconnect.
• 26. Developing Storm Apps – Writing a Topology
  • You must also specify the initial parallelism of the topology.
• 27. Developing Storm Apps – Submitting and Running a Topology
  • You submit a topology either to a “local” cluster or to a real cluster.
  • To run a topology you must first package your code into a “fat jar”.
    o You must include all of your code’s dependencies, but:
    o Exclude the Storm dependency itself, as the Storm cluster will provide it.
    o Note: you may need to tweak your build script so that your local tests do include the Storm dependency. See e.g. assembly.sbt in kafka-storm-starter for an example.
  • A topology is run via the storm jar command.
    o It will connect to Nimbus, upload your jar, and run the topology.
    o Use any machine that can run “storm jar” and talk to Nimbus’ Thrift port.
    o The configuration of the machine on which the topology is deployed is passed through another config file.
    $ storm jar all-my-code.jar com.miguno.MyTopology arg1 arg2
• 28. Many stream processing frameworks – which one to use?
  • The choice depends on your use cases.
• 29. Resources
  • A few Storm books are already available.
  • Storm documentation: http://storm.incubator.apache.org/documentation/Home.html
  • storm-kafka: https://github.com/apache/incubator-storm/tree/master/external/storm-kafka
  • Mailing lists: http://storm.incubator.apache.org/community.html
  • Code examples:
    https://github.com/apache/incubator-storm/tree/master/examples/storm-starter
    https://github.com/miguno/kafka-storm-starter/
• 30. Thank You!
• 31. Extra Slides
• 32. Use Cases of Apache Storm
  • Stream processing: Storm is used to process streams of data and update a variety of databases in real time. This processing occurs in real time, and the processing speed needs to match the input data speed.
  • Continuous computation: Storm can do continuous computation on data streams and stream the results to clients in real time.
  • Distributed RPC (DRPC): Storm can parallelize an intense query so that you can compute it in real time.
  • Real-time analytics: Storm can analyze and respond to data coming from different data sources as it happens, in real time.
• 33. What can I do with Wirbelsturm?
  • Get a first impression of Storm
  • Test-drive your topologies
  • Test failure handling
    o Stop/kill Nimbus, check what happens to the Supervisors.
    o Stop/kill ZooKeeper instances, check what happens to the topology.
  • Use as a sandbox environment to test/validate deployments
    o “What will actually happen when I deactivate this topology?”
    o “Will my Hiera changes actually work?”
  • Reproduce production issues, share results with Dev
    o Also helpful when reporting back to the Storm project and mailing lists.
  • Any further cool ideas?

Editor's Notes

  1. You will use this information in downstream bolts to “extract” the data from the emitted tuples. If your bolt only performs side effects (e.g. talks to a DB) but does not emit an actual tuple, override this method with an empty {} method.