SlideShare a Scribd company logo
1 of 75
Download to read offline
Storm
Distributed and fault-tolerant realtime computation




                                          Nathan Marz
                                            Twitter
Storm at Twitter




  Twitter Web Analytics
Before Storm



Queues        Workers
Example




 (simplified)
Example




Workers schemify tweets
 and append to Hadoop
Example




Workers update statistics on URLs by
incrementing counters in Cassandra
Example




Distribute tweets randomly
    on multiple queues
Example




Workers share the load of
  schemifying tweets
Example




Desire all updates for same
 URL go to same worker
Message locality

• Because:
 • No transactions in Cassandra (and no
    atomic increments at the time)
 • More effective batching of updates
Implementing message
        locality


• Have a queue for each consuming worker
• Choose queue for a URL using consistent hashing
Example




Workers choose queue to enqueue
   to using hash/mod of URL
Example




    All updates for same URL
guaranteed to go to same worker
Adding a worker
Adding a worker
                      Deploy




Reconfigure/redeploy
Problems

• Scaling is painful
• Poor fault-tolerance
• Coding is tedious
What we want
• Guaranteed data processing
• Horizontal scalability
• Fault-tolerance
• No intermediate message brokers!
• Higher level abstraction than message passing
• “Just works”
Storm
Guaranteed data processing
Horizontal scalability
Fault-tolerance
No intermediate message brokers!
Higher level abstraction than message passing
“Just works”
Use cases



  Stream      Distributed   Continuous
processing       RPC        computation
Storm Cluster
Storm Cluster




Master node (similar to Hadoop JobTracker)
Storm Cluster




Used for cluster coordination
Storm Cluster




 Run worker processes
Starting a topology
Killing a topology
Concepts

• Streams
• Spouts
• Bolts
• Topologies
Streams


Tuple   Tuple   Tuple   Tuple   Tuple   Tuple   Tuple




          Unbounded sequence of tuples
Spouts




Source of streams
Spout examples


• Read from Kestrel queue
• Read from Twitter streaming API
Bolts




Processes input streams and produces new streams
Bolts
• Functions
• Filters
• Aggregation
• Joins
• Talk to databases
Topology




Network of spouts and bolts
Tasks




Spouts and bolts execute as
many tasks across the cluster
Stream grouping




When a tuple is emitted, which task does it go to?
Stream grouping

• Shuffle grouping: pick a random task
• Fields grouping: consistent hashing on a
  subset of tuple fields
• All grouping: send to all tasks
• Global grouping: pick task with lowest id
Topology
shuffle      [“id1”, “id2”]




           shuffle
[“url”]


  shuffle

              all
Streaming word count




TopologyBuilder is used to construct topologies in Java
Streaming word count




Define a spout in the topology with parallelism of 5 tasks
Streaming word count




Split sentences into words with parallelism of 8 tasks
Streaming word count



Consumer decides what data it receives and how it gets grouped




Split sentences into words with parallelism of 8 tasks
Streaming word count




   Create a word count stream
Streaming word count




      splitsentence.py
Streaming word count
Streaming word count




  Submitting topology to a cluster
Streaming word count




  Running topology in local mode
Demo
Traditional data processing
Traditional data processing




   Intense processing (Hadoop, databases, etc.)
Traditional data processing




Light processing on a single machine to resolve queries
Distributed RPC




Distributed RPC lets you do intense processing at query-time
Game changer
Distributed RPC




Data flow for Distributed RPC
DRPC Example


Computing “reach” of a URL on the fly
Reach


Reach is the number of unique people
    exposed to a URL on Twitter
Computing reach
                Follower
                           Distinct
      Tweeter   Follower   follower

                Follower
                           Distinct
URL   Tweeter              follower   Count   Reach
                Follower

                Follower   Distinct
      Tweeter              follower
                Follower
Reach topology
Guaranteeing message
     processing




       “Tuple tree”
Guaranteeing message
     processing

• A spout tuple is not fully processed until all
  tuples in the tree have been completed
Guaranteeing message
     processing

• If the tuple tree is not completed within a
  specified timeout, the spout tuple is replayed
Guaranteeing message
     processing




      Reliability API
Guaranteeing message
     processing




“Anchoring” creates a new edge in the tuple tree
Guaranteeing message
     processing




 Marks a single node in the tree as complete
Guaranteeing message
     processing

• Storm tracks tuple trees for you in an
  extremely efficient way
Storm UI
Storm UI
Storm UI
Storm on EC2


https://github.com/nathanmarz/storm-deploy




          One-click deploy tool
Documentation
State spout (almost done)


       Synchronize a large amount of
  frequently changing state into a topology
State spout (almost done)




Optimizing reach topology by eliminating the database calls
State spout (almost done)




  Each GetFollowers task keeps a synchronous
     cache of a subset of the social graph
State spout (almost done)




This works because GetFollowers repartitions the social
 graph the same way it partitions GetTweeter’s stream
Future work

• Storm on Mesos
• “Swapping”
• Auto-scaling
• Higher level abstractions
Questions?


http://github.com/nathanmarz/storm
What Storm does
•   Distributes code and configurations

•   Robust process management

•   Monitors topologies and reassigns failed tasks

•   Provides reliability by tracking tuple trees

•   Routing and partitioning of streams

•   Serialization

•   Fine-grained performance stats of topologies

More Related Content

What's hot

Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Simplilearn
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
Guido Schmutz
 
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
DataWorks Summit
 

What's hot (20)

Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
Hadoop Architecture | HDFS Architecture | Hadoop Architecture Tutorial | HDFS...
 
Kafka Streams State Stores Being Persistent
Kafka Streams State Stores Being PersistentKafka Streams State Stores Being Persistent
Kafka Streams State Stores Being Persistent
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?
 
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache HadoopTez Shuffle Handler: Shuffling at Scale with Apache Hadoop
Tez Shuffle Handler: Shuffling at Scale with Apache Hadoop
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Cassandra and Riak at BestBuy.com
Cassandra and Riak at BestBuy.comCassandra and Riak at BestBuy.com
Cassandra and Riak at BestBuy.com
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
Introduction to Stream Processing
Introduction to Stream ProcessingIntroduction to Stream Processing
Introduction to Stream Processing
 
Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developers
 
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
Cassandra at Instagram 2016 (Dikang Gu, Facebook) | Cassandra Summit 2016
 
Data Streaming in Big Data Analysis
Data Streaming in Big Data AnalysisData Streaming in Big Data Analysis
Data Streaming in Big Data Analysis
 
kafka
kafkakafka
kafka
 
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
Reduce Storage Costs by 5x Using The New HDFS Tiered Storage Feature
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Kafka: Internals
Kafka: InternalsKafka: Internals
Kafka: Internals
 

Similar to Storm: distributed and fault-tolerant realtime computation

Handling Data in Mega Scale Web Systems
Handling Data in Mega Scale Web SystemsHandling Data in Mega Scale Web Systems
Handling Data in Mega Scale Web Systems
Vineet Gupta
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
Chandler Huang
 

Similar to Storm: distributed and fault-tolerant realtime computation (20)

Storm
StormStorm
Storm
 
Jan 2012 HUG: Storm
Jan 2012 HUG: StormJan 2012 HUG: Storm
Jan 2012 HUG: Storm
 
Building Big Data Streaming Architectures
Building Big Data Streaming ArchitecturesBuilding Big Data Streaming Architectures
Building Big Data Streaming Architectures
 
Apache Storm
Apache StormApache Storm
Apache Storm
 
Cleveland HUG - Storm
Cleveland HUG - StormCleveland HUG - Storm
Cleveland HUG - Storm
 
Storm presentation
Storm presentationStorm presentation
Storm presentation
 
Hadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming ArchitectureHadoop Ecosystem and Low Latency Streaming Architecture
Hadoop Ecosystem and Low Latency Streaming Architecture
 
Learning Stream Processing with Apache Storm
Learning Stream Processing with Apache StormLearning Stream Processing with Apache Storm
Learning Stream Processing with Apache Storm
 
Hortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
Hortonworks Technical Workshop: Real Time Monitoring with Apache HadoopHortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
Hortonworks Technical Workshop: Real Time Monitoring with Apache Hadoop
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 
Low Latency Streaming Data Processing in Hadoop
Low Latency Streaming Data Processing in HadoopLow Latency Streaming Data Processing in Hadoop
Low Latency Streaming Data Processing in Hadoop
 
Apache Storm Internals
Apache Storm InternalsApache Storm Internals
Apache Storm Internals
 
Mhug apache storm
Mhug apache stormMhug apache storm
Mhug apache storm
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark Streaming
 
Handling Data in Mega Scale Web Systems
Handling Data in Mega Scale Web SystemsHandling Data in Mega Scale Web Systems
Handling Data in Mega Scale Web Systems
 
Big data on Azure for Architects
Big data on Azure for ArchitectsBig data on Azure for Architects
Big data on Azure for Architects
 
High Performance Systems in Go - GopherCon 2014
High Performance Systems in Go - GopherCon 2014High Performance Systems in Go - GopherCon 2014
High Performance Systems in Go - GopherCon 2014
 
Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
From Device to Data Center to Insights
From Device to Data Center to InsightsFrom Device to Data Center to Insights
From Device to Data Center to Insights
 
Big Data Technologies - Hadoop
Big Data Technologies - HadoopBig Data Technologies - Hadoop
Big Data Technologies - Hadoop
 

More from nathanmarz

Runaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop itRunaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop it
nathanmarz
 

More from nathanmarz (16)

Demystifying Data Engineering
Demystifying Data EngineeringDemystifying Data Engineering
Demystifying Data Engineering
 
The inherent complexity of stream processing
The inherent complexity of stream processingThe inherent complexity of stream processing
The inherent complexity of stream processing
 
Using Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems EasyUsing Simplicity to Make Hard Big Data Problems Easy
Using Simplicity to Make Hard Big Data Problems Easy
 
The Epistemology of Software Engineering
The Epistemology of Software EngineeringThe Epistemology of Software Engineering
The Epistemology of Software Engineering
 
Your Code is Wrong
Your Code is WrongYour Code is Wrong
Your Code is Wrong
 
Runaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop itRunaway complexity in Big Data... and a plan to stop it
Runaway complexity in Big Data... and a plan to stop it
 
ElephantDB
ElephantDBElephantDB
ElephantDB
 
Become Efficient or Die: The Story of BackType
Become Efficient or Die: The Story of BackTypeBecome Efficient or Die: The Story of BackType
Become Efficient or Die: The Story of BackType
 
The Secrets of Building Realtime Big Data Systems
The Secrets of Building Realtime Big Data SystemsThe Secrets of Building Realtime Big Data Systems
The Secrets of Building Realtime Big Data Systems
 
Clojure at BackType
Clojure at BackTypeClojure at BackType
Clojure at BackType
 
Cascalog workshop
Cascalog workshopCascalog workshop
Cascalog workshop
 
Cascalog at Strange Loop
Cascalog at Strange LoopCascalog at Strange Loop
Cascalog at Strange Loop
 
Cascalog at Hadoop Day
Cascalog at Hadoop DayCascalog at Hadoop Day
Cascalog at Hadoop Day
 
Cascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User GroupCascalog at May Bay Area Hadoop User Group
Cascalog at May Bay Area Hadoop User Group
 
Cascalog
CascalogCascalog
Cascalog
 
Cascading
CascadingCascading
Cascading
 

Recently uploaded

Recently uploaded (20)

DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 

Storm: distributed and fault-tolerant realtime computation