APACHE STORM
Viet-Dung TRINH (Bill), 03/2016
Saltlux – Vietnam Development Center
Agenda
•  Overview
•  Core Storm Concepts
•  Components of Storm Cluster
•  Example
Overview
•  Apache Storm is a free and open source distributed real-time
computation system.
•  Storm makes it easy to reliably process unbounded streams of
data, doing for real-time processing what Hadoop did for batch
processing.
•  Storm is fast: on the order of a million tuples processed per
second per node, according to the project's own benchmark
•  Can be used with any programming language
Overview (cont)
•  Use cases:
•  Real-time analytics,
•  Online machine learning,
•  Continuous computation
•  …
•  Integration: works with any queueing system and any database
system. Queueing examples:
•  Kafka
•  Kestrel
•  RabbitMQ / AMQP
•  JMS
•  Amazon Kinesis
Core Storm Concepts
•  Topology
•  Tuple
•  Stream
•  Spout
•  Bolt
•  Stream grouping
Core Storm Concepts: Topology (cont)
•  Topology: a graph of computation that consists of NODEs
and EDGEs.
•  Nodes: represent some individual computations.
•  Edges: represent the data being passed between nodes.
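A minimal sketch of the graph idea, in plain Python rather than the Storm API. The node names (`source`, `transform`, `aggregate`) and the helper functions are hypothetical and chosen only for illustration; each key is a node and each list entry is an edge to a downstream node.

```python
# Hypothetical topology: node name -> list of nodes it sends data to.
topology = {
    "source": ["transform"],      # edge: source -> transform
    "transform": ["aggregate"],   # edge: transform -> aggregate
    "aggregate": [],              # no outgoing edges
}


def edges(graph):
    """List every (sender, receiver) edge in the graph."""
    return [(src, dst) for src, dsts in graph.items() for dst in dsts]


def downstream(graph, start):
    """Return all nodes that data leaving `start` can reach."""
    seen, stack = [], list(graph[start])
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.append(node)
            stack.extend(graph[node])
    return seen


print(edges(topology))
print(downstream(topology, "source"))
```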
Core Storm Concepts: Tuple (cont)
•  Nodes in a topology send data to one another in the form of tuples
•  Tuple: an ordered list of values, where each value is
assigned a name
•  The process of sending a tuple is called emitting a tuple
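To make "ordered list of named values" concrete, here is a small sketch using Python's `namedtuple`. It is an analogy, not Storm's own tuple class; the field names are borrowed from the commit example later in the deck, and the sample values are made up.

```python
from collections import namedtuple

# Hypothetical tuple type: two named fields in a fixed order.
Commit = namedtuple("Commit", ["commit_id", "email"])

t = Commit(commit_id="b20ea50", email="nathan@example.com")

# Each value can be read by position or by its field name.
print(t[0])       # first value in the ordered list
print(t.email)    # value assigned the name "email"
print(t._fields)  # the field names, in order
```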
Core Storm Concepts: Stream (cont)
•  Stream: an unbounded sequence of tuples flowing between two
nodes in a topology.
•  A topology can contain any number of streams
Core Storm Concepts: Spout (cont)
•  Spout: the source of a stream in a topology
•  Reads data from an external data source and emits tuples into
the topology.
Core Storm Concepts: Bolt (cont)
•  Bolt: accepts a tuple from its input stream, performs some
computation or transformation (filtering, aggregation, join)
on the tuple, and optionally emits one or more new tuples
Core Storm Concepts: Stream Grouping
•  Defines how tuples are sent between instances of spouts
and bolts.
•  The two most common groupings: shuffle grouping and fields
grouping
•  SHUFFLE GROUPING: tuples are distributed at random across
the instances of the receiving bolt.
•  FIELDS GROUPING: ensures that tuples with the same
value for a particular field name are always emitted to the
same bolt instance.
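The difference between the two groupings can be sketched as two routing functions. This is an illustration in plain Python, not Storm's implementation: `NUM_TASKS`, both function names, and the use of a CRC32 hash are assumptions made for the example.

```python
import random
import zlib

NUM_TASKS = 4  # hypothetical number of instances of the receiving bolt


def shuffle_grouping(tup, rng=random):
    """Pick a receiving instance at random, ignoring the tuple's contents."""
    return rng.randrange(NUM_TASKS)


def fields_grouping(tup, field):
    """Pick a receiving instance from a stable hash of one named field."""
    key = str(tup[field]).encode("utf-8")
    return zlib.crc32(key) % NUM_TASKS


a = {"commit_id": "b20ea50", "email": "nathan@example.com"}
b = {"commit_id": "064874b", "email": "nathan@example.com"}

# Same email, so fields grouping routes both tuples to the same instance.
print(fields_grouping(a, "email") == fields_grouping(b, "email"))
# Shuffle grouping may send them to different instances.
print(shuffle_grouping(a), shuffle_grouping(b))
```

Fields grouping matters whenever a bolt keeps per-key state, such as a running count per email: every tuple for a given key has to reach the instance that holds that key's state.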
Components of Storm Cluster
•  Two kinds of nodes: Master and Worker
•  The master node runs a daemon called Nimbus
•  Each worker node runs a daemon called Supervisor
•  All coordination between Nimbus and the Supervisors is done
through ZooKeeper.
Example: GitHub Commit Feed
Example: GitHub Commit Feed (cont)
•  Each commit comes into the feed as a single string containing
the COMMIT_ID, followed by a SPACE, followed by the EMAIL.
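A feed line in this format can be split on its first space. The sketch below assumes the commit ID itself contains no spaces; the function name and the sample line are hypothetical.

```python
def parse_commit(line):
    """Split one feed line into (commit_id, email) at the first space."""
    commit_id, email = line.split(" ", 1)
    return commit_id, email


print(parse_commit("b20ea50 nathan@example.com"))
```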
Breaking Down the Problem
•  Component: reads from the live feed of
commits and produces a single
commit message
•  Component: accepts a single commit
message, extracts the developer’s
email from that commit, and produces
the email
•  Component: accepts the developer’s
email and updates an in-memory map
where the key is the email and the value is
the number of commits for that email.
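The three components can be chained in plain Python to show the data flow before any Storm code is involved. This is a single-process sketch under stated assumptions: the function names and the sample feed are hypothetical, and a list stands in for the live feed.

```python
from collections import Counter


def commit_feed(lines):
    """First component: read the feed and produce one commit message at a time."""
    for line in lines:
        yield line


def extract_email(commits):
    """Second component: take a commit message and produce the developer's email."""
    for commit in commits:
        _commit_id, email = commit.split(" ", 1)
        yield email


def count_commits(emails):
    """Third component: update an in-memory map of email -> number of commits."""
    counts = Counter()
    for email in emails:
        counts[email] += 1
    return counts


# Made-up sample data standing in for the live feed.
SAMPLE_FEED = [
    "b20ea50 nathan@example.com",
    "064874b andy@example.com",
    "28e4f8e andy@example.com",
]

counts = count_commits(extract_email(commit_feed(SAMPLE_FEED)))
print(dict(counts))
```

In a topology, the first function corresponds to the spout and the other two to bolts; the generator hand-offs play the role of the streams between them.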
Breaking Down the Problem (cont)
Tuples
•  Two types of tuple in the
topology
•  COMMIT: contains
commit_id and email
•  EMAIL: contains the developer’s
email
Spout
•  Listens to the real-time feed of
commits being made to the
repository
Bolts
•  1st Bolt: extracts developer’s
email
•  2nd Bolt: updates map of
emails to commit counts
References
[1]. Apache Storm, http://storm.apache.org
[2]. Sean T. Allen, Matthew Jankowski, Peter Pathirana,
Storm Applied, 2015
Thank you!
