16BIT IITR
Data Collection Module
Streaming Data Processing with Apache Storm
Data Stream Processing
Slides @ https://goo.gl/BJRf9A
Overview
• Streaming Data Processing
• What is Apache Storm?
• Storm Architecture and Key Concepts
• Monitoring of Storm Cluster
• Development of Storm Apps
• Comparison with other software
• Resources
Types of Processing of Big Data
• Batch Processing
Takes a large amount of data at a time, analyzes it, and produces a large output.
• Real-Time Processing
Collects and analyzes data and produces output in real time.
Streaming Data Processing
• Today, most data is continuously produced
user activity logs, web logs, sensors, database transactions, social data…
• The common approach to analyzing such data so far
o Record the data stream to stable storage (DBMS, HDFS, …)
o Periodically analyze the data with a batch processing engine (DBMS, MapReduce, ...)
• Stream processing engines analyze data as it arrives
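To make the contrast concrete, here is a minimal plain-Python sketch (not Storm code). The event list and the function names are invented for illustration. The batch function needs the complete recorded data set before it can answer, while the streaming function updates its result as each event arrives.

```python
from collections import Counter
from typing import Iterable, Iterator, List

# Hypothetical page-view events; in practice these would arrive continuously.
EVENTS = ["home", "cart", "home", "checkout", "home"]


def batch_count(stored_events: List[str]) -> Counter:
    """Batch style: analyze the complete, already stored data set in one go."""
    return Counter(stored_events)


def streaming_count(events: Iterable[str]) -> Iterator[Counter]:
    """Streaming style: update the result per event and yield a snapshot."""
    counts: Counter = Counter()
    for event in events:
        counts[event] += 1
        yield Counter(counts)  # copy, so earlier snapshots stay unchanged


# The last streaming snapshot equals the batch result, but every
# intermediate snapshot was available without waiting for the full data set.
snapshots = list(streaming_count(EVENTS))
print(batch_count(EVENTS) == snapshots[-1])
```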
Why do Stream Processing?
• Decreases the overall latency to obtain results
o No need to persist data in stable storage
o No periodic batch analysis jobs
• Simplifies the data infrastructure
o Fewer moving parts to be maintained and coordinated
• Makes the time dimension of data explicit
o Each event has a timestamp
o Data can be processed based on timestamps
What are the Requirements?
• Large Scale
• Low latency
Results in milliseconds
• High throughput
Millions of events per second
• Exactly-once consistency
Correct results in case of failures
• Out-of-order events
Process events based on their associated time
• Intuitive APIs
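The out-of-order requirement above can be sketched in plain Python (not Storm code). The sample events, the window size, and the function name are assumptions made for illustration. Each event is assigned to a window by its own timestamp, so arrival order does not affect the result. Real engines additionally have to decide how long to wait for late events, which this sketch ignores.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (event_time_in_seconds, value) pairs; note that they arrive out of order.
ARRIVALS = [(1, 10), (12, 5), (3, 20), (11, 1), (9, 7)]


def tumbling_window_sums(
    events: Iterable[Tuple[int, int]], window_size: int
) -> Dict[int, int]:
    """Sum values per fixed-size window, keyed by the window's start time.

    The window is chosen from the event's own timestamp, so the order in
    which events arrive does not change the result.
    """
    sums: Dict[int, int] = defaultdict(int)
    for event_time, value in events:
        window_start = (event_time // window_size) * window_size
        sums[window_start] += value
    return dict(sums)


print(tumbling_window_sums(ARRIVALS, window_size=10))  # {0: 37, 10: 6}
```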
Streaming Data Architecture
Stream Processing Technologies
Apache Storm
• Distributed, fault-tolerant, real-time computation system
• Originated at BackType/Twitter, open sourced in late 2011
• Implemented in Clojure, with some Java
• Supports APIs in many languages, including Java, Python, and Scala
Use Cases of Apache Storm
Storm Cluster Architecture
[Figure: cluster diagram with one Nimbus node, three ZooKeeper nodes, and five Supervisor nodes]
Nimbus – The Master Node
o Distributes code around the cluster
o Assigns tasks to machines/supervisors
o Monitors for failures
o Stateless
Apache ZooKeeper
o Highly robust
o Provides service discovery and coordination
o Stores the cluster state
Supervisor
o Listens for work assigned to its machine
o Starts and stops worker processes based on instructions from Nimbus
o Stateless
Storm Architecture – Fault Tolerance
• What happens when Nimbus dies (master node)?
o If Nimbus is run under process supervision as recommended (e.g. via supervisord), it will restart as if nothing happened.
o While Nimbus is down:
Existing topologies will continue to run, but you cannot submit new topologies.
Running worker processes will not be affected. Also, Supervisors will restart their (local) workers if needed.
However, failed tasks will not be reassigned to other machines, as this is the responsibility of Nimbus.
• What happens when a Supervisor dies (slave node)?
o If the Supervisor is run under process supervision as recommended (e.g. via supervisord), it will restart as if nothing happened.
o Running worker processes will not be affected.
• What happens when a worker process dies?
o Its parent Supervisor will restart it. If it continuously fails on startup and is unable to heartbeat to Nimbus, Nimbus will reassign the worker to another machine.
Key Concepts – Data Model
Data Stream
Unbounded Sequence of Tuples
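As a rough illustration of "unbounded sequence of tuples", the following plain-Python sketch (not Storm code) models a stream as a generator that never ends. The sensor names and readings are made up. A consumer can only ever take a finite prefix of such a stream.

```python
import itertools
from typing import Iterator, Tuple


def sensor_stream() -> Iterator[Tuple[str, int]]:
    """An unbounded stream: yields (sensor_id, reading) tuples forever."""
    for n in itertools.count():
        yield (f"sensor-{n % 3}", n * 2)


# Take only the first four tuples; the stream itself has no end.
first_four = list(itertools.islice(sensor_stream(), 4))
print(first_four)
# [('sensor-0', 0), ('sensor-1', 2), ('sensor-2', 4), ('sensor-0', 6)]
```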
Key Concepts – Spouts and Bolts
Spouts
• Source of data streams
Example: Connect to the Twitter API and emit a stream of tweets.
Bolts
• Consume streams and potentially produce new streams
• Can do anything: run functions, filter tuples, perform joins, talk to databases, etc.
• Complex stream transformations often require multiple steps and thus multiple bolts.
[Figure: Spout 1 feeding Bolt 1, which feeds Bolt 2]
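The spout and bolt roles can be modeled with plain Python generators. This is a conceptual sketch, not Storm's API; the sentences, function names, and the length threshold are invented for illustration. One spout produces the stream, and two bolts each perform one step of the transformation.

```python
from typing import Iterable, Iterator, Tuple


def sentence_spout() -> Iterator[Tuple[str]]:
    """Spout: the source of the stream (a fixed list stands in for a live feed)."""
    for sentence in ["storm processes streams", "bolts transform streams"]:
        yield (sentence,)


def split_bolt(stream: Iterable[Tuple[str]]) -> Iterator[Tuple[str]]:
    """Bolt 1: consumes sentences and produces a new stream of words."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def filter_bolt(stream: Iterable[Tuple[str]], min_length: int) -> Iterator[Tuple[str]]:
    """Bolt 2: keeps only words with at least min_length characters."""
    for (word,) in stream:
        if len(word) >= min_length:
            yield (word,)


# Wire the steps together: Spout -> Bolt 1 -> Bolt 2.
words = list(filter_bolt(split_bolt(sentence_spout()), min_length=6))
print(words)
# [('processes',), ('streams',), ('transform',), ('streams',)]
```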
Key Concepts - Topology
• Network of Spouts and Bolts
• Wires data and functions via a DAG.
• Executes forever and on many machines.
[Figure: example topology in which data flows from Spout 1 and Spout 2 through Bolts 1 to 4]
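One way to picture a topology as a DAG is a mapping from each component to the components it reads from. The sketch below is plain Python, not Storm code, and the wiring is an assumption chosen for illustration, since the connections in the slide's diagram are not recoverable. The standard-library `graphlib` module then yields an order in which every producer precedes its consumers, and rejects wiring that contains a cycle.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set

# Assumed wiring, for illustration only.
# Each key lists the components it receives tuples from.
UPSTREAM: Dict[str, Set[str]] = {
    "bolt1": {"spout1"},
    "bolt2": {"spout1", "spout2"},
    "bolt3": {"spout2"},
    "bolt4": {"bolt2", "bolt3"},
}


def processing_order(upstream: Dict[str, Set[str]]) -> List[str]:
    """Return the components in an order where every producer comes first.

    Raises graphlib.CycleError if the wiring is not a DAG.
    """
    return list(TopologicalSorter(upstream).static_order())


order = processing_order(UPSTREAM)
print(order)  # spouts appear before the bolts that consume from them
```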
Deploying a Storm Cluster
• http://storm.apache.org/about/deployment.html
• http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/
• http://knowm.org/how-to-install-a-distributed-apache-storm-cluster/
• https://github.com/nathanmarz/storm-deploy
Monitoring your Storm Cluster
• Storm UI
Developing Storm Apps
A trivial “Hello, Storm” topology
[Figure: a Spout (“emit random number < 100”) feeds a Bolt (“multiply by 2”); for example, the tuple (74) becomes (148)]
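The data flow of this "Hello, Storm" topology can be mimicked in a few lines of plain Python. This is only a sketch of the logic, not the Storm implementation; the `count` and `seed` parameters are assumptions added to keep the example finite and repeatable.

```python
import random
from typing import Iterable, Iterator


def random_number_spout(count: int, seed: int = 0) -> Iterator[int]:
    """Spout: emits `count` random integers in the range 0..99.

    A real spout would emit forever; count and seed exist only to keep
    this sketch finite and repeatable.
    """
    rng = random.Random(seed)
    for _ in range(count):
        yield rng.randrange(100)


def multiply_by_two_bolt(numbers: Iterable[int]) -> Iterator[int]:
    """Bolt: consumes each number and emits it multiplied by 2."""
    for number in numbers:
        yield number * 2


inputs = list(random_number_spout(5))
outputs = list(multiply_by_two_bolt(inputs))
print(list(zip(inputs, outputs)))  # pairs of (emitted number, doubled number)
```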
Developing Storm Apps – Writing Spouts
• Several built-in spouts are available for connecting to various kinds of streams
Example of a basic spout that generates its own data
Developing Storm Apps – Writing Bolts
• Two main options for JVM users:
o Implement the IRichBolt or IBasicBolt interfaces
o Extend the BaseRichBolt or BaseBasicBolt abstract classes
• BaseRichBolt
o You must – and are able to – manually ack() an incoming tuple.
• BaseBasicBolt
o Auto-acks the incoming tuple at the end of its execute() method.
o These bolts are typically simple functions or filters.
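The difference between the two acking styles can be modeled in plain Python. The classes below are stand-ins invented for this sketch and are not Storm's classes. The manual style lets a bolt hold tuples back and ack them only after a whole batch has been handled; the automatic style acks each tuple as soon as the user logic returns.

```python
class Collector:
    """Records which tuples have been acknowledged (a stand-in, not Storm's API)."""

    def __init__(self):
        self.acked = []

    def ack(self, tup):
        self.acked.append(tup)


class BatchingBolt:
    """Manual-ack style: holds tuples back and acks them only after a flush."""

    def __init__(self, collector, batch_size):
        self.collector = collector
        self.batch_size = batch_size
        self.pending = []

    def execute(self, tup):
        self.pending.append(tup)
        if len(self.pending) >= self.batch_size:
            # A real bolt would write the batch somewhere before acking.
            for done in self.pending:
                self.collector.ack(done)
            self.pending = []


class AutoAckBolt:
    """Auto-ack style: the wrapper acks right after the user logic returns."""

    def __init__(self, collector):
        self.collector = collector

    def process(self, tup):
        """User logic only; no acking here."""
        return tup[0] * 2

    def execute(self, tup):
        self.process(tup)
        self.collector.ack(tup)  # done by the wrapper, not by user code
```

The manual style is what makes patterns such as "ack only after the batch is safely written" possible, at the cost of having to remember the ack.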
Extending BaseRichBolt
execute() is the heart of the bolt.
This is where you will focus most of your attention when implementing your bolt or when trying to understand somebody else’s bolt.
prepare() acts as a “second constructor” for the bolt’s class.
Because of Storm’s distributed execution model and serialization, prepare() is often needed to fully initialize the bolt on the target JVM.
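The "second constructor" idea can be illustrated with Python's `pickle` as an analogy for shipping a bolt to another JVM. This is not Storm code; the class, its fields, and the method bodies are invented for the sketch. The constructor stores only serializable settings, and resources that cannot be serialized (here a lock) are created in prepare() after the object has arrived.

```python
import pickle
import threading


class CounterBolt:
    """Constructor keeps only serializable settings; prepare() builds the rest."""

    def __init__(self, step):
        self.step = step   # plain data: survives serialization
        self.lock = None   # created later, on the machine that runs the bolt
        self.total = 0

    def prepare(self):
        # A lock cannot be pickled, so it must not be created in __init__.
        self.lock = threading.Lock()

    def execute(self, value):
        with self.lock:
            self.total += value * self.step
        return self.total


# "Submit": serialize the configured bolt, as if shipping it to a worker.
shipped = pickle.dumps(CounterBolt(step=2))

# "Worker": deserialize, then finish initialization with prepare().
bolt = pickle.loads(shipped)
bolt.prepare()
print(bolt.execute(5))  # 10
```

Had the lock been created in the constructor, the serialization step would have failed, which is the situation prepare() is meant to avoid.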
declareOutputFields() tells downstream bolts about this bolt’s output. What you declare must match what you actually emit().
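Why the declaration and the emitted values must agree can be shown with a small plain-Python model. The class below is hypothetical and is not Storm's API; it simply pairs declared field names with emitted values and rejects tuples whose length does not match the declaration.

```python
from typing import List, Sequence, Tuple


class DeclaringEmitter:
    """Checks every emitted tuple against the declared output fields."""

    def __init__(self, declared_fields: Sequence[str]):
        self.declared_fields = tuple(declared_fields)
        self.emitted: List[dict] = []

    def emit(self, values: Tuple) -> None:
        if len(values) != len(self.declared_fields):
            raise ValueError(
                f"declared {len(self.declared_fields)} fields "
                f"{self.declared_fields}, but emitted {len(values)} values"
            )
        # Downstream code can now look values up by field name.
        self.emitted.append(dict(zip(self.declared_fields, values)))


emitter = DeclaringEmitter(["word", "count"])
emitter.emit(("storm", 3))  # matches the declaration
print(emitter.emitted)      # [{'word': 'storm', 'count': 3}]
```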
Developing Storm Apps – Writing a Topology
• When creating a topology you’re essentially defining the DAG – that is, which spouts and bolts to use, and how they interconnect.
• You must specify the initial parallelism of the topology
Developing Storm Apps – Submitting and Running a Topology
• You submit a topology either to a “local” cluster or to a real cluster.
• To run a topology you must first package your code into a “fat jar”.
o You must include all your code’s dependencies, but exclude the Storm dependency itself, as the Storm cluster will provide this.
Note: You may need to tweak your build script so that your local tests do include the Storm dependency.
See e.g. assembly.sbt in kafka-storm-starter for an example.
• A topology is run via the storm jar command.
o It connects to Nimbus, uploads your jar, and runs the topology.
o Use any machine that can run "storm jar" and talk to Nimbus' Thrift port.
o The configuration of the machine on which the topology is deployed is passed through another config file.
$ storm jar all-my-code.jar com.miguno.MyTopology arg1 arg2
Many stream processing tools: which one to use?
The choice depends on your use case.
Resources
• A few Storm books are already available.
• Storm documentation
http://storm.incubator.apache.org/documentation/Home.html
• Storm-kafka
https://github.com/apache/incubator-storm/tree/master/external/storm-kafka
• Mailing lists
http://storm.incubator.apache.org/community.html
• Code examples
https://github.com/apache/incubator-storm/tree/master/examples/storm-starter
https://github.com/miguno/kafka-storm-starter/
Thank You!
Extra Slides
Use Cases of Apache Storm
• Stream processing:
Storm is used to process a stream of data and update a variety of databases in real time. The processing speed needs to keep up with the rate of the input data.
• Continuous computation:
Storm can do continuous computation on data streams and stream the results to clients in real time.
• Distributed RPC:
Storm can parallelize an intense query so that you can compute it in real time.
• Real-time analytics:
Storm can analyze and respond to data from different sources as it arrives, in real time.
What can I do with Wirbelsturm?
• Get a first impression of Storm
• Test-drive your topologies
• Test failure handling
o Stop/kill Nimbus, check what happens to Supervisors.
o Stop/kill ZooKeeper instances, check what happens to topology.
• Use as sandbox environment to test/validate deployments
o “What will actually happen when I deactivate this topology?”
o “Will my Hiera changes actually work?”
• Reproduce production issues, share results with Dev
o Also helpful when reporting back to Storm project and mailing lists.
• Any further cool ideas?
33

More Related Content

What's hot

S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionFlink Forward
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
IT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntop
IT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntopIT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntop
IT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntopInfluxData
 
Assaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleAssaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleFlink Forward
 
Designing and Implementing your IOT Solutions with Open Source
Designing and Implementing your IOT Solutions with Open SourceDesigning and Implementing your IOT Solutions with Open Source
Designing and Implementing your IOT Solutions with Open SourceDataWorks Summit/Hadoop Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
FIWARE Wednesday Webinars - Architecting Your Smart Solution Powered by FIWARE
FIWARE Wednesday Webinars - Architecting Your Smart Solution Powered by FIWAREFIWARE Wednesday Webinars - Architecting Your Smart Solution Powered by FIWARE
FIWARE Wednesday Webinars - Architecting Your Smart Solution Powered by FIWAREFIWARE
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Apache Apex
 
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)Open Analytics
 
Suneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JSuneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JFlink Forward
 
Juggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary dataJuggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary dataFabian Hueske
 
Drag and Drop Open Source GeoTools ETL with Apache NiFi
Drag and Drop Open Source GeoTools ETL with Apache NiFiDrag and Drop Open Source GeoTools ETL with Apache NiFi
Drag and Drop Open Source GeoTools ETL with Apache NiFi"Constantin \"Cristi\"" Stanca
 
The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...
The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...
The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...Spark Summit
 
STORM as an ETL Engine to HADOOP
STORM as an ETL Engine to HADOOPSTORM as an ETL Engine to HADOOP
STORM as an ETL Engine to HADOOPDataWorks Summit
 
Development and Applications of Distributed IoT Sensors for Intermittent Conn...
Development and Applications of Distributed IoT Sensors for Intermittent Conn...Development and Applications of Distributed IoT Sensors for Intermittent Conn...
Development and Applications of Distributed IoT Sensors for Intermittent Conn...InfluxData
 
Experience with Kafka & Storm
Experience with Kafka & StormExperience with Kafka & Storm
Experience with Kafka & StormOtto Mok
 
NetApp Fabric Pool Deck
NetApp Fabric Pool DeckNetApp Fabric Pool Deck
NetApp Fabric Pool DeckAlex Tsui
 
CodeStock - Exploring .NET memory management - a trip down memory lane
CodeStock - Exploring .NET memory management - a trip down memory laneCodeStock - Exploring .NET memory management - a trip down memory lane
CodeStock - Exploring .NET memory management - a trip down memory laneMaarten Balliauw
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleItai Yaffe
 

What's hot (20)

S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data CompanionS. Bartoli & F. Pompermaier – A Semantic Big Data Companion
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
IT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntop
IT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntopIT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntop
IT Monitoring in the Era of Containers | Luca Deri Founder & Project Lead | ntop
 
Assaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at ScaleAssaf Araki – Real Time Analytics at Scale
Assaf Araki – Real Time Analytics at Scale
 
Designing and Implementing your IOT Solutions with Open Source
Designing and Implementing your IOT Solutions with Open SourceDesigning and Implementing your IOT Solutions with Open Source
Designing and Implementing your IOT Solutions with Open Source
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
FIWARE Wednesday Webinars - Architecting Your Smart Solution Powered by FIWARE
FIWARE Wednesday Webinars - Architecting Your Smart Solution Powered by FIWAREFIWARE Wednesday Webinars - Architecting Your Smart Solution Powered by FIWARE
FIWARE Wednesday Webinars - Architecting Your Smart Solution Powered by FIWARE
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex
 
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
MOLOCH: Search for Full Packet Capture (OA Cyber Summit)
 
Suneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JSuneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4J
 
Hadoop at Lookout
Hadoop at LookoutHadoop at Lookout
Hadoop at Lookout
 
Juggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary dataJuggling with Bits and Bytes - How Apache Flink operates on binary data
Juggling with Bits and Bytes - How Apache Flink operates on binary data
 
Drag and Drop Open Source GeoTools ETL with Apache NiFi
Drag and Drop Open Source GeoTools ETL with Apache NiFiDrag and Drop Open Source GeoTools ETL with Apache NiFi
Drag and Drop Open Source GeoTools ETL with Apache NiFi
 
The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...
The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...
The Next CERN Accelerator Logging Service—A Road to Big Data with Jakub Wozni...
 
STORM as an ETL Engine to HADOOP
STORM as an ETL Engine to HADOOPSTORM as an ETL Engine to HADOOP
STORM as an ETL Engine to HADOOP
 
Development and Applications of Distributed IoT Sensors for Intermittent Conn...
Development and Applications of Distributed IoT Sensors for Intermittent Conn...Development and Applications of Distributed IoT Sensors for Intermittent Conn...
Development and Applications of Distributed IoT Sensors for Intermittent Conn...
 
Experience with Kafka & Storm
Experience with Kafka & StormExperience with Kafka & Storm
Experience with Kafka & Storm
 
NetApp Fabric Pool Deck
NetApp Fabric Pool DeckNetApp Fabric Pool Deck
NetApp Fabric Pool Deck
 
CodeStock - Exploring .NET memory management - a trip down memory lane
CodeStock - Exploring .NET memory management - a trip down memory laneCodeStock - Exploring .NET memory management - a trip down memory lane
CodeStock - Exploring .NET memory management - a trip down memory lane
 
Our journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scaleOur journey with druid - from initial research to full production scale
Our journey with druid - from initial research to full production scale
 

Viewers also liked

Product profile energi efisiensi
Product profile energi efisiensiProduct profile energi efisiensi
Product profile energi efisiensiwidyanto
 
Tupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FBTupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FBDocker, Inc.
 
Introduction to Kafka connect
Introduction to Kafka connectIntroduction to Kafka connect
Introduction to Kafka connectKnoldus Inc.
 
Cloud Price Comparison - AWS vs Azure vs Google
Cloud Price Comparison - AWS vs Azure vs GoogleCloud Price Comparison - AWS vs Azure vs Google
Cloud Price Comparison - AWS vs Azure vs GoogleRightScale
 
Resume For Gary W Mitchell (021516)
Resume For Gary W Mitchell (021516)  Resume For Gary W Mitchell (021516)
Resume For Gary W Mitchell (021516) Gary Mitchell
 
CSEC630_TeamAssignment_TeamBlazer_FINAL
CSEC630_TeamAssignment_TeamBlazer_FINALCSEC630_TeamAssignment_TeamBlazer_FINAL
CSEC630_TeamAssignment_TeamBlazer_FINALRonald Jackson, Jr
 
Resume_Pawan Jadhav_Testing
Resume_Pawan Jadhav_TestingResume_Pawan Jadhav_Testing
Resume_Pawan Jadhav_TestingPawan Jadhav
 
MySQL运维那些事
MySQL运维那些事 MySQL运维那些事
MySQL运维那些事 Leo Zhou
 
Dick Kramer 2016 Resume
Dick Kramer 2016 ResumeDick Kramer 2016 Resume
Dick Kramer 2016 ResumeDick Kramer
 

Viewers also liked (16)

Product profile energi efisiensi
Product profile energi efisiensiProduct profile energi efisiensi
Product profile energi efisiensi
 
Tupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FBTupperware: Containerized Deployment at FB
Tupperware: Containerized Deployment at FB
 
Introduction to Kafka connect
Introduction to Kafka connectIntroduction to Kafka connect
Introduction to Kafka connect
 
Resume_PintuChattaraj
Resume_PintuChattarajResume_PintuChattaraj
Resume_PintuChattaraj
 
Benefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a ServiceBenefits of Hadoop as Platform as a Service
Benefits of Hadoop as Platform as a Service
 
2 - SWOON BROCHURE
2 - SWOON BROCHURE2 - SWOON BROCHURE
2 - SWOON BROCHURE
 
RAGHUNATH_GORLA_RESUME
RAGHUNATH_GORLA_RESUMERAGHUNATH_GORLA_RESUME
RAGHUNATH_GORLA_RESUME
 
Cloud Price Comparison - AWS vs Azure vs Google
Cloud Price Comparison - AWS vs Azure vs GoogleCloud Price Comparison - AWS vs Azure vs Google
Cloud Price Comparison - AWS vs Azure vs Google
 
yasmin said
yasmin saidyasmin said
yasmin said
 
Resume For Gary W Mitchell (021516)
Resume For Gary W Mitchell (021516)  Resume For Gary W Mitchell (021516)
Resume For Gary W Mitchell (021516)
 
CSEC630_TeamAssignment_TeamBlazer_FINAL
CSEC630_TeamAssignment_TeamBlazer_FINALCSEC630_TeamAssignment_TeamBlazer_FINAL
CSEC630_TeamAssignment_TeamBlazer_FINAL
 
Wynia CV
Wynia CVWynia CV
Wynia CV
 
Resume_Pawan Jadhav_Testing
Resume_Pawan Jadhav_TestingResume_Pawan Jadhav_Testing
Resume_Pawan Jadhav_Testing
 
MySQL运维那些事
MySQL运维那些事 MySQL运维那些事
MySQL运维那些事
 
Rajni CV - PM-Final
Rajni CV - PM-FinalRajni CV - PM-Final
Rajni CV - PM-Final
 
Dick Kramer 2016 Resume
Dick Kramer 2016 ResumeDick Kramer 2016 Resume
Dick Kramer 2016 Resume
 

Similar to Streaming Data Processing with Apache Storm

Making Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout SoftwareMaking Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout SoftwareData Con LA
 
Using a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming AggregationsUsing a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming AggregationsVoltDB
 
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop PlatformApache Apex
 
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...In-Memory Computing Summit
 
Master-Master Replication and Scaling of an Application Between Each of the I...
Master-Master Replication and Scaling of an Application Between Each of the I...Master-Master Replication and Scaling of an Application Between Each of the I...
Master-Master Replication and Scaling of an Application Between Each of the I...vsoshnikov
 
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
 New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S... New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...Big Data Spain
 
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ...Landon Robinson
 
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringApache Spark Listeners: A Crash Course in Fast, Easy Monitoring
Apache Spark Listeners: A Crash Course in Fast, Easy MonitoringDatabricks
 
Deep Dive Time Series Anomaly Detection in Azure with dotnet
Deep Dive Time Series Anomaly Detection in Azure with dotnetDeep Dive Time Series Anomaly Detection in Azure with dotnet
Deep Dive Time Series Anomaly Detection in Azure with dotnetMarco Parenzan
 
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015
Apache Spark™ + IBM Watson + Twitter DataPalooza SF 2015Mike Broberg
 
Circonus: Design failures - A Case Study
Circonus: Design failures - A Case StudyCirconus: Design failures - A Case Study
Circonus: Design failures - A Case StudyHeinrich Hartmann
 
MongoDB at MapMyFitness from a DevOps Perspective
MongoDB at MapMyFitness from a DevOps PerspectiveMongoDB at MapMyFitness from a DevOps Perspective
MongoDB at MapMyFitness from a DevOps PerspectiveMongoDB
 
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)Spark Summit
 
MongoDB at MapMyFitness
MongoDB at MapMyFitnessMongoDB at MapMyFitness
MongoDB at MapMyFitnessMapMyFitness
 
Genomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker SwarmGenomic Computation at Scale with Serverless, StackStorm and Docker Swarm
Genomic Computation at Scale with Serverless, StackStorm and Docker SwarmDmitri Zimine
 
Tornado Web Server Internals
Tornado Web Server InternalsTornado Web Server Internals
Tornado Web Server InternalsPraveen Gollakota
 
Manta Unleashed BigDataSG talk 2 July 2013
Manta Unleashed BigDataSG talk 2 July 2013Manta Unleashed BigDataSG talk 2 July 2013
Manta Unleashed BigDataSG talk 2 July 2013Christopher Hogue
 
Model driven telemetry
Model driven telemetryModel driven telemetry
Model driven telemetryCisco Canada
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixC4Media
 

Similar to Streaming Data Processing with Apache Storm (20)

Making Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout SoftwareMaking Hadoop Realtime by Dr. William Bain of Scaleout Software
Making Hadoop Realtime by Dr. William Bain of Scaleout Software
 
Using a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming AggregationsUsing a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming Aggregations
 
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
IoT Ingestion & Analytics using Apache Apex - A Native Hadoop Platform
 
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...
IMCSummit 2015 - Day 1 Developer Track - Implementing Operational Intelligenc...
 
Streaming Data Processing with Apache Storm

  • 1. 16BIT IITR, Data Collection Module: Streaming Data Processing with Apache Storm (Data Stream Processing)
      Slides @ https://goo.gl/BJRf9A
  • 2. Overview
      • Streaming Data Processing
      • What is Apache Storm?
      • Storm Architecture and Key Concepts
      • Monitoring of Storm Cluster
      • Development of Storm Apps
      • Comparison with other software
      • Resources
  • 3. Types of Processing of Big Data
      • Batch Processing: takes a large amount of data at a time, analyzes it, and produces a large output.
      • Real-Time Processing: collects and analyzes data and produces output in real time.
  • 4. Streaming Data Processing
      • Today, most data is produced continuously: user activity logs, web logs, sensors, database transactions, social data…
      • The common approach to analyzing such data so far:
          o Record the data stream to stable storage (DBMS, HDFS, …)
          o Periodically analyze the data with a batch processing engine (DBMS, MapReduce, ...)
      • Stream processing engines analyze data as it arrives
  • 5. Why do Stream Processing?
      • Decreases the overall latency to obtain results
          o No need to persist data in stable storage
          o No periodic batch analysis jobs
      • Simplifies the data infrastructure
          o Fewer moving parts to be maintained and coordinated
      • Makes the time dimension of data explicit
          o Each event has a timestamp
          o Data can be processed based on timestamps
  • 6. What are the Requirements?
      • Large scale
      • Low latency: results in milliseconds
      • High throughput: millions of events per second
      • Exactly-once consistency: correct results in case of failures
      • Out-of-order events: process events based on their associated time
      • Intuitive APIs
  • 7. Streaming Data Architecture
  • 8. Stream Processing Technologies
  • 9. Apache Storm
      • Distributed, fault-tolerant and real-time computation
      • Originated at BackType/Twitter, open sourced in late 2011
      • Implemented in Clojure, some Java
      • Supports APIs in many languages, including Java, Python, and Scala
  • 10. Use Cases of Apache Storm
  • 11. Storm Cluster Architecture
      [Diagram: one Nimbus, three ZooKeeper nodes, five Supervisors]
      • Nimbus (the master node)
          o Distributes code around the cluster
          o Assigns tasks to machines/supervisors
          o Monitors for failures
          o Stateless
      • Apache ZooKeeper
          o Highly robust
          o Provides service discovery and coordination
          o Stores the cluster state
      • Supervisor
          o Listens for work assigned to its machine
          o Starts and stops worker processes based on instructions from Nimbus
          o Stateless
  • 12. Storm Architecture – Fault Tolerance
      • What happens when Nimbus dies (master node)?
          o If Nimbus is run under process supervision as recommended (e.g. via supervisord), it will restart like nothing happened.
          o While Nimbus is down: existing topologies will continue to run, but you cannot submit new topologies. Running worker processes will not be affected. Supervisors will also restart their (local) workers if needed. However, failed tasks will not be reassigned to other machines, as this is the responsibility of Nimbus.
      • What happens when a Supervisor dies (slave node)?
          o If the Supervisor is run under process supervision as recommended (e.g. via supervisord), it will restart like nothing happened.
          o Running worker processes will not be affected.
      • What happens when a worker process dies?
          o Its parent Supervisor will restart it. If it continuously fails on startup and is unable to heartbeat to Nimbus, Nimbus will reassign the worker to another machine.
  • 13. Key Concepts – Data Model
      • Data stream: an unbounded sequence of tuples
  • 14. Key Concepts – Spouts and Bolts
      • Spouts
          o Source of data streams
          o Example: connect to the Twitter API and emit a stream of tweets.
      • Bolts
          o Consume streams and potentially produce new streams
          o Can do anything: run functions, filter tuples, perform joins, talk to a DB, etc.
          o Complex stream transformations often require multiple steps and thus multiple bolts.
      [Diagrams: Spout 1 → Bolt 1; Spout 1 → Bolt 1 → Bolt 2]
  • 15. Key Concepts – Topology
      • Network of spouts and bolts
      • Wires data and functions via a DAG.
      • Executes forever and on many machines.
      [Diagram: data flowing through Spout 1, Spout 2, Bolt 1, Bolt 2, Bolt 3, Bolt 4]
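The "wires data and functions via a DAG" point can be illustrated with a small plain-Java sketch. It does not use any Storm classes: the class name `TopologyDagSketch`, the method `topologicalOrder`, and the edges between the spouts and bolts are all assumptions made for illustration, since the slide's diagram does not survive in this transcript. The sketch only checks that a given wiring is acyclic and returns one valid processing order.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a topology's wiring; none of these types are Storm classes.
final class TopologyDagSketch {

    // Returns the components in a topological order (Kahn's algorithm),
    // or throws if the wiring contains a cycle and is therefore not a DAG.
    static List<String> topologicalOrder(Map<String, List<String>> edges) {
        // Count incoming edges for every component that appears in the wiring.
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : edges.entrySet()) {
            inDegree.putIfAbsent(entry.getKey(), 0);
            for (String target : entry.getValue()) {
                inDegree.merge(target, 1, Integer::sum);
            }
        }

        // Components with no incoming edges play the role of spouts.
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> entry : inDegree.entrySet()) {
            if (entry.getValue() == 0) {
                ready.add(entry.getKey());
            }
        }

        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.remove();
            order.add(node);
            for (String target : edges.getOrDefault(node, Collections.<String>emptyList())) {
                if (inDegree.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }

        if (order.size() != inDegree.size()) {
            throw new IllegalStateException("wiring contains a cycle");
        }
        return order;
    }

    public static void main(String[] args) {
        // Hypothetical wiring; the real edges of the slide's diagram are unknown.
        Map<String, List<String>> edges = new LinkedHashMap<>();
        edges.put("spout1", Arrays.asList("bolt1"));
        edges.put("spout2", Arrays.asList("bolt3"));
        edges.put("bolt1", Arrays.asList("bolt2"));
        edges.put("bolt2", Arrays.asList("bolt4"));
        edges.put("bolt3", Arrays.asList("bolt4"));
        System.out.println(topologicalOrder(edges));
    }
}
```

In this model a bolt with two incoming edges, such as the hypothetical `bolt4`, becomes ready only after both of its upstream components have been ordered.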
  • 16. Deploying a Storm Cluster
      • http://storm.apache.org/about/deployment.html
      • http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/
      • http://knowm.org/how-to-install-a-distributed-apache-storm-cluster/
      • https://github.com/nathanmarz/storm-deploy
  • 17. Monitoring your Storm Cluster
      • Storm UI
  • 18. Developing Storm Apps
      • A trivial “Hello, Storm” topology
          o Spout: “emit random number < 100”, e.g. (74)
          o Bolt: “multiply by 2”, e.g. (148)
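The data flow of the “Hello, Storm” topology can be sketched in plain Java. This is a minimal model under stated assumptions, not Storm code: `HelloStormSketch`, `randomNumberSpout`, `multiplyByTwoBolt`, and `run` are hypothetical names, and the spout and bolt are represented by standard functional interfaces rather than Storm's own classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.IntSupplier;
import java.util.function.IntUnaryOperator;

// Plain-Java model of the "Hello, Storm" pipeline; none of these types are Storm classes.
final class HelloStormSketch {

    // Hypothetical stand-in for the spout: emits a random integer in [0, 100).
    static IntSupplier randomNumberSpout(Random rng) {
        return () -> rng.nextInt(100);
    }

    // Hypothetical stand-in for the bolt: multiplies each incoming value by 2.
    static IntUnaryOperator multiplyByTwoBolt() {
        return value -> value * 2;
    }

    // Pulls `count` values from the spout and pushes each one through the bolt.
    // A real topology runs forever; the bound here only keeps the sketch finite.
    static List<Integer> run(IntSupplier spout, IntUnaryOperator bolt, int count) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            out.add(bolt.applyAsInt(spout.getAsInt()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> results = run(randomNumberSpout(new Random()), multiplyByTwoBolt(), 5);
        System.out.println(results);
    }
}
```

Feeding the slide's example value through the bolt, `multiplyByTwoBolt().applyAsInt(74)`, gives 148, matching the (74) and (148) tuples shown on the slide.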
  • 19. Developing Storm Apps – Writing Spouts
      • Several kinds of built-in spouts are available to connect to various kinds of streams
      • Example of a basic spout that generates its own data
  • 20. Developing Storm Apps – Writing Bolts
      • Two main options for JVM users:
          o Implement the IRichBolt or IBasicBolt interfaces
          o Extend the BaseRichBolt or BaseBasicBolt abstract classes
      • BaseRichBolt
          o You must (and are able to) manually ack() an incoming tuple.
      • BaseBasicBolt
          o Auto-acks the incoming tuple at the end of its execute() method.
          o These bolts are typically simple functions or filters.
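The difference between manual acking and auto-acking described above can be modeled in a few lines of plain Java. The sketch deliberately avoids the Storm classes: `AckSketch`, `RecordingCollector`, `RichStyleBolt`, `BasicStyleBolt`, and `autoAcking` are hypothetical names that only mirror the behavior the slide describes.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the two acking styles; these are not the Storm classes.
final class AckSketch {

    // Hypothetical collector that records which tuples were acknowledged.
    static final class RecordingCollector {
        final List<String> acked = new ArrayList<>();

        void ack(String tuple) {
            acked.add(tuple);
        }
    }

    // "Rich" style: execute() is responsible for calling ack() itself.
    interface RichStyleBolt {
        void execute(String tuple, RecordingCollector collector);
    }

    // "Basic" style: execute() only does the work; the wrapper acks afterwards.
    interface BasicStyleBolt {
        void execute(String tuple);
    }

    // Adapts a basic-style bolt so that every tuple is acked once execute() returns.
    static RichStyleBolt autoAcking(BasicStyleBolt bolt) {
        return (tuple, collector) -> {
            bolt.execute(tuple);
            collector.ack(tuple);
        };
    }

    public static void main(String[] args) {
        RecordingCollector collector = new RecordingCollector();

        // Manual acking: this bolt chooses to ack only non-empty tuples.
        RichStyleBolt manual = (tuple, c) -> {
            if (!tuple.isEmpty()) {
                c.ack(tuple);
            }
        };
        manual.execute("a", collector);
        manual.execute("", collector);

        // Auto acking: the wrapper acks on the bolt's behalf.
        autoAcking(tuple -> { }).execute("b", collector);

        System.out.println(collector.acked);
    }
}
```

The rich style leaves the decision of when to ack to the bolt author, which is why it suits bolts that need that control, while the basic style fits the simple functions and filters the slide mentions.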
  • 21. Developing Storm Apps – Writing Bolts
      • Extending BaseRichBolt
  • 22. Developing Storm Apps – Writing Bolts
      • execute() is the heart of the bolt. This is where you will focus most of your attention when implementing your bolt or when trying to understand somebody else’s bolt.
  • 23. Developing Storm Apps – Writing Bolts
      • prepare() acts as a “second constructor” for the bolt’s class. Because of Storm’s distributed execution model and serialization, prepare() is often needed to fully initialize the bolt on the target JVM.
  • 24. Developing Storm Apps – Writing Bolts
      • declareOutputFields() tells downstream bolts about this bolt’s output. What you declare must match what you actually emit().
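Why the declared fields must match the emitted values can be shown with a short plain-Java sketch. It does not call Storm: `OutputFieldsSketch` and `toNamedTuple` are hypothetical names, and the field names "number" and "doubled" are assumptions chosen to fit the multiply-by-2 example. The sketch pairs field names with values by position, which is the kind of lookup a downstream consumer relies on when extracting data by field name.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of declared fields versus emitted values; not a Storm class.
final class OutputFieldsSketch {

    // Pairs declared field names with emitted values by position.
    // Throws if the counts differ, which is the mismatch the slide warns about.
    static Map<String, Object> toNamedTuple(List<String> declaredFields, List<Object> emittedValues) {
        if (declaredFields.size() != emittedValues.size()) {
            throw new IllegalArgumentException(
                "declared " + declaredFields.size() + " fields but emitted "
                    + emittedValues.size() + " values");
        }
        Map<String, Object> tuple = new LinkedHashMap<>();
        for (int i = 0; i < declaredFields.size(); i++) {
            tuple.put(declaredFields.get(i), emittedValues.get(i));
        }
        return tuple;
    }

    public static void main(String[] args) {
        // Hypothetical field names for a bolt that emits a number and its double.
        List<String> fields = Arrays.asList("number", "doubled");
        Map<String, Object> tuple = toNamedTuple(fields, Arrays.<Object>asList(74, 148));

        // A downstream consumer extracts a value by its declared field name.
        System.out.println(tuple.get("doubled"));
    }
}
```

If a bolt declared only "number" but emitted two values, this model would fail immediately, which makes the mismatch easy to spot.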
  • 25. Developing Storm Apps – Writing a Topology
      • When creating a topology you’re essentially defining the DAG – that is, which spouts and bolts to use, and how they interconnect.
  • 26. Developing Storm Apps – Writing a Topology
      • You must specify the initial parallelism of the topology.
  • 27. Developing Storm Apps – Submitting and Running a Topology
      • You submit a topology either to a “local” cluster or to a real cluster.
      • To run a topology you must first package your code into a “fat jar”.
          o You must include all your code’s dependencies, but:
          o Exclude the Storm dependency itself, as the Storm cluster will provide this.
          Note: You may need to tweak your build script so that your local tests do include the Storm dependency. See e.g. assembly.sbt in kafka-storm-starter for an example.
      • A topology is run via the storm jar command.
          $ storm jar all-my-code.jar com.miguno.MyTopology arg1 arg2
          o It will connect to Nimbus, upload your jar, and run the topology.
          o Use any machine that can run "storm jar" and talk to Nimbus' Thrift port.
          o The configuration of the machine on which the topology is deployed is passed through another config file.
  • 28. Many stream processing software options: which one to use?
      • The choice depends on your use case.
  • 29. Resources
      • A few Storm books are already available.
      • Storm documentation: http://storm.incubator.apache.org/documentation/Home.html
      • Storm-kafka: https://github.com/apache/incubator-storm/tree/master/external/storm-kafka
      • Mailing lists: http://storm.incubator.apache.org/community.html
      • Code examples:
          o https://github.com/apache/incubator-storm/tree/master/examples/storm-starter
          o https://github.com/miguno/kafka-storm-starter/
  • 30. Thank You!
  • 31. Extra Slides
  • 32. Use Cases of Apache Storm
      • Stream processing: Storm processes a stream of data and updates a variety of databases in real time. The processing speed needs to match the input data speed.
      • Continuous computation: Storm can do continuous computation on data streams and stream the results to clients in real time.
      • Distributed RPC: Storm can parallelize an intense query so that you can compute it in real time.
      • Real-time analytics: Storm can analyze and respond to data from different data sources as it arrives, in real time.
  • 33. What can I do with Wirbelsturm?
      • Get a first impression of Storm
      • Test-drive your topologies
      • Test failure handling
          o Stop/kill Nimbus, check what happens to Supervisors.
          o Stop/kill ZooKeeper instances, check what happens to topology.
      • Use as sandbox environment to test/validate deployments
          o “What will actually happen when I deactivate this topology?”
          o “Will my Hiera changes actually work?”
      • Reproduce production issues, share results with Dev
          o Also helpful when reporting back to Storm project and mailing lists.
      • Any further cool ideas?

Editor's Notes

  1. You will use this information in downstream bolts to “extract” the data from the emitted tuples. If your bolt only performs side effects (e.g. talk to a DB) but does not emit an actual tuple, override this method with an empty {} method.