Apache Storm
Presented By: Rajind Ruparathna (179349M)
Outline
• What is Storm?
• Who uses Storm?
• Storm Vs Hadoop
• Storm Components
• Storm Topology
• Storm Primitives
• Why is Storm ideal for Real-Time Processing?
What is Storm?
• Apache Storm is a free and open source distributed real-time computation
system.
• Storm makes it easy to reliably process unbounded streams of data.
• Storm does for real-time processing what Hadoop did for batch processing.
• Simple: it can be used with any programming language.
Who uses Storm?
Storm Vs Hadoop
Storm is used for real-time computation, whereas Hadoop is used for batch
computation.
Storm Vs Hadoop contd.
              Hadoop            Storm
Components    JobTracker        Nimbus
              TaskTracker       Supervisor
              Child             Worker
Applications  Job               Topology
Primitives    Mapper/Reducer    Spout/Bolt
Batch Data Processing Architecture
Data Processing Architecture: Batch and Real-time
Storm Components
A Storm cluster has three sets of nodes:
Nimbus Node (Master)
• Uploads computations for execution
• Distributes code across the cluster
• Launches workers across the cluster
• Monitors computation and reallocates workers as needed
Storm Components contd.
ZooKeeper Nodes
• Coordinate the Storm cluster
Supervisor Nodes
• Communicate with Nimbus through ZooKeeper; start and stop workers
according to signals from Nimbus
Storm Topology
• The work is delegated to different types of components that are each responsible
for a simple specific processing task.
• The input stream of a Storm cluster is handled by a component called a spout.
Storm Topology contd.
• The spout passes the data to a component called a bolt, which transforms it in
some way.
• A bolt either persists the data in some sort of storage, or passes it to some
other bolt.
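The spout-to-bolt flow described above can be sketched in plain Python. This is not the Storm API: generators stand in for components, and the names reading_spout, parse_bolt and store_bolt, as well as the sample readings, are hypothetical.

```python
from typing import Iterable, Iterator, List


def reading_spout() -> Iterator[dict]:
    """Hypothetical spout: emits raw readings as tuples (dicts here)."""
    for raw in ["21.5", "22.0", "bad", "23.1"]:
        yield {"raw": raw}


def parse_bolt(stream: Iterable[dict]) -> Iterator[dict]:
    """Transforming bolt: converts raw text to a float and passes it on."""
    for tup in stream:
        try:
            yield {"celsius": float(tup["raw"])}
        except ValueError:
            continue  # malformed tuple: nothing is emitted downstream


def store_bolt(stream: Iterable[dict], storage: List[float]) -> None:
    """Terminal bolt: persists each value instead of emitting it further."""
    for tup in stream:
        storage.append(tup["celsius"])


storage: List[float] = []
store_bolt(parse_bolt(reading_spout()), storage)
print(storage)
```

The first bolt passes its output to another bolt, while the last one only writes to storage (a list in this sketch).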
Storm Primitives
• Streams
• Spouts
• Bolts
• Topologies
Storm Primitives contd.
Streams - Unbounded sequence of tuples
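As a rough illustration of "unbounded", a Python generator can model a stream that never ends. This is only an analogy, not Storm code; number_stream and its (sequence, value) tuple layout are made up for the example.

```python
from itertools import count, islice
from typing import Iterator, Tuple


def number_stream() -> Iterator[Tuple[int, int]]:
    """Yield (sequence_id, value) tuples forever; the stream has no end."""
    for seq in count():
        yield (seq, seq * seq)


# A consumer can only ever look at a finite slice of an unbounded stream.
first_three = list(islice(number_stream(), 3))
print(first_three)
```

Because the stream never finishes, any processing has to work tuple by tuple or over a window, not over "the whole input".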
Storm Primitives contd.
Spouts - Sources of streams
● Read from a Kestrel/Kafka queue. {tuples = events}
● Read from an HTTP server log. {tuples = HTTP requests}
● Read from the Twitter Streaming API. {tuples = tweets}
Storm Primitives contd.
Bolts - Process input streams and produce output
streams
● Filtering tuples in a stream
● Aggregation of tuples
● Joining multiple streams
● Arbitrary functions on streams
● Communication with external caches/databases
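Three of the bolt roles listed above (filtering, aggregation, joining) can be sketched in plain Python. The function names and sample data are hypothetical, and the input is a small finite batch; a real bolt would update its totals continuously as tuples arrive.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def filter_bolt(stream: Iterable[Tuple[str, int]], min_amount: int) -> Iterator[Tuple[str, int]]:
    """Filtering: drop purchases below min_amount."""
    for user, amount in stream:
        if amount >= min_amount:
            yield (user, amount)


def sum_bolt(stream: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Aggregation: total amount per user."""
    totals: Dict[str, int] = defaultdict(int)
    for user, amount in stream:
        totals[user] += amount
    return dict(totals)


def join_bolt(totals: Dict[str, int], countries: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str, int]]:
    """Joining: combine per-user totals with a second stream of (user, country)."""
    for user, country in countries:
        if user in totals:
            yield (user, country, totals[user])


purchases = [("ann", 5), ("bob", 50), ("ann", 30), ("bob", 20)]
countries = [("ann", "LK"), ("bob", "US"), ("eve", "DE")]

totals = sum_bolt(filter_bolt(purchases, min_amount=10))
joined = list(join_bolt(totals, countries))
print(joined)
```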
Storm Primitives contd.
Topology - Directed acyclic graph (DAG) of spouts
and bolts
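One way to picture the DAG is as a dictionary mapping each component to the components it receives tuples from. The sketch below uses Python's standard graphlib module (Python 3.9+) to order the components and to detect a cycle. The component names are invented, and this is not how Storm itself represents a topology.

```python
from graphlib import CycleError, TopologicalSorter

# Each component maps to the set of components it receives tuples from.
topology = {
    "sentence-spout": set(),
    "split-bolt": {"sentence-spout"},
    "count-bolt": {"split-bolt"},
    "report-bolt": {"count-bolt"},
}


def wiring_order(graph):
    """Return components upstream-first; raises CycleError if there is a cycle."""
    return list(TopologicalSorter(graph).static_order())


order = wiring_order(topology)
print(order)

# Feeding the last bolt back into the spout creates a cycle: no longer a DAG.
cyclic = {**topology, "sentence-spout": {"report-bolt"}}
try:
    wiring_order(cyclic)
    is_dag = True
except CycleError:
    is_dag = False
print(is_dag)
```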
Storm Sample - Word Count
https://docs.microsoft.com/en-us/azure/hdinsight/storm/apache-storm-develop-java-topology
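The idea behind a word count topology can be sketched without Storm: a spout emits sentences, one bolt splits them into words, and several counting tasks keep per-word totals. This is a plain-Python stand-in, not the code from the linked sample. The sentences, the number of tasks and the task_for helper are assumptions; routing by a hash of the word is meant to resemble what Storm calls a fields grouping, so the same word always reaches the same counter.

```python
import zlib
from collections import Counter
from typing import Iterable, Iterator, List

SENTENCES = ["the cow jumped over the moon", "the moon is bright"]


def sentence_spout() -> Iterator[str]:
    """Hypothetical spout: emits one sentence per tuple."""
    yield from SENTENCES


def split_bolt(sentences: Iterable[str]) -> Iterator[str]:
    """Splits each sentence into individual word tuples."""
    for sentence in sentences:
        yield from sentence.split()


def task_for(word: str, num_tasks: int) -> int:
    """Route a word to a counting task; the same word always maps to the same task."""
    return zlib.crc32(word.encode("utf-8")) % num_tasks


NUM_COUNT_TASKS = 2
count_tasks: List[Counter] = [Counter() for _ in range(NUM_COUNT_TASKS)]
for word in split_bolt(sentence_spout()):
    count_tasks[task_for(word, NUM_COUNT_TASKS)][word] += 1

# Merge the per-task counters only for display.
totals = sum(count_tasks, Counter())
print(totals.most_common(2))
```

Because each word is owned by exactly one counting task, the tasks can run in parallel without sharing state.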
Why is Storm ideal for Real-Time Processing?
• Fast – Benchmarked at processing one million 100-byte messages per second
per node.
• Scalable – With parallel calculations that run across a cluster of machines.
• Fault-tolerant – When workers die, Storm will automatically restart them. If a
node dies, the worker will be restarted on another node.
• Reliable – Storm guarantees that each unit of data (tuple) will be processed at
least once or exactly once. Messages are only replayed when there are failures.
• Easy to operate – Standard configurations are suitable for production on day
one. Once deployed, Storm is easy to operate.
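The "at least once" guarantee and replay-on-failure can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not Storm's actual acknowledgement mechanism: process_at_least_once, flaky_handler and the max_attempts limit are all hypothetical.

```python
from collections import deque
from typing import Callable, Iterable, List


def process_at_least_once(messages: Iterable[str],
                          handler: Callable[[str], None],
                          max_attempts: int = 3) -> List[str]:
    """Run handler on every message, replaying a message only when it fails."""
    pending = deque((msg, 1) for msg in messages)
    acked: List[str] = []
    while pending:
        msg, attempt = pending.popleft()
        try:
            handler(msg)
            acked.append(msg)  # success: the message is acknowledged
        except Exception:
            if attempt < max_attempts:
                pending.append((msg, attempt + 1))  # failure: queue a replay
    return acked


seen: List[str] = []
failed_once = set()


def flaky_handler(msg: str) -> None:
    """Fails the first time it sees "b", simulating a worker dying mid-tuple."""
    seen.append(msg)
    if msg == "b" and msg not in failed_once:
        failed_once.add(msg)
        raise RuntimeError("simulated worker failure")


acked = process_at_least_once(["a", "b", "c"], flaky_handler)
print(acked, seen)
```

Every message ends up acknowledged, but "b" reaches the handler twice. That is the practical meaning of "at least once": processing logic has to tolerate an occasional duplicate.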
Storm Use Cases @Twitter
Storm Use Cases @Twitter contd.
• Discovery of emerging topics/stories.
• Online learning of tweet features for search result ranking.
• Real-time analytics for ads.
• Internal log processing.
References
http://storm.apache.org/index.html
https://docs.microsoft.com/en-us/azure/hdinsight/storm/apache-storm-develop-java-topology
https://www.tutorialspoint.com/apache_storm/index.htm
https://github.com/apache/storm
Thank you all for your time!
