Apache Storm 
© Hortonworks Inc. 2013 
Page 1 
Justin Leet 
jleet@hortonworks.com
What is Storm? 
• Real-time stream processing framework 
• Scalable 
–Up to 1 million tuples per second per node 
• Fault tolerant 
–Tasks are reassigned on failure 
• Guaranteed processing 
–At-least-once processing 
–Exactly-once processing with some more work 
• Relatively language agnostic 
–Primarily JVM based 
–Thrift API for defining and submitting topologies 
–JSON-based protocol for defining components in other languages 
Motivation 
• Process large amounts of incoming data in real time 
• Classic use case is processing streams of tweets 
–Calculate trending users 
–Calculate reach of a tweet 
• Data cleansing and normalization 
• Personalization and recommendation 
• Log processing 
Lambda Architecture 
Source: http://swaroopch.com/2013/01/12/big-data-nathan-marz/ 
• Most useful when 
– Batch & speed layers do essentially the same computation 
– Sample use case: KPI dashboard 
• Less useful when 
– Batch & speed layers do different computations 
– Sample use case: Real-time model scoring
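As a rough illustration of the KPI dashboard case, the sketch below answers a query by adding a precomputed batch view to a speed-layer view. The class name `LambdaQuerySketch`, the method `query` and the sample counts are hypothetical; this is plain Java, not a Storm API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: both layers compute the same KPI (a count per key),
// so a query can simply add the two views together.
final class LambdaQuerySketch {
    static long query(Map<String, Long> batchView, Map<String, Long> speedView, String key) {
        // The batch view covers data up to the last batch run;
        // the speed view covers whatever arrived since then.
        return batchView.getOrDefault(key, 0L) + speedView.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        Map<String, Long> batchView = new HashMap<>();
        batchView.put("page_views", 1000L);
        Map<String, Long> speedView = new HashMap<>();
        speedView.put("page_views", 42L);
        System.out.println(query(batchView, speedView, "page_views")); // prints 1042
    }
}
```

The merge is this simple only because both layers compute the same thing, which is the situation the slide calls most useful.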
Basic Concepts 
Tuple: The most fundamental data structure; a named list of values, each of which can be of any data type 
Streams: Unbounded sequences of tuples 
Spouts: Generate streams 
Bolts: Contain data processing, persistence and alerting logic. Can also emit tuples for downstream bolts 
Tuple Tree: The first tuple and all the tuples that were emitted by the bolts that processed it 
Topology: Group of spouts and bolts wired together into a workflow
Architecture 
Nimbus (management server): 
• Similar to the Hadoop JobTracker 
• Distributes code around the cluster 
• Assigns tasks 
• Handles failures 
Supervisor (worker nodes): 
• Similar to the Hadoop TaskTracker 
• Runs bolts and spouts as ‘tasks’ 
ZooKeeper: 
• Cluster coordination 
• Nimbus HA 
• Stores cluster metrics 
• Consumption-related metadata for Trident topologies
Relationship Between Supervisors, Workers, Executors & Tasks 
Each supervisor machine in Storm has a set of predefined ports; each worker process is assigned to one of them. 
Source: http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
Tuple Routing 
Stream groupings provide various ways to control tuple routing to bolts. 
Grouping type | What it does | When to use 
Shuffle grouping | Sends each tuple to a randomly chosen task of the bolt, spreading tuples evenly | Atomic, per-tuple operations, e.g. math operations 
Fields grouping | Sends tuples to a bolt task based on one or more fields in the tuple | Segmentation of the incoming stream; counting tuples of a certain type 
All grouping | Sends a copy of each tuple to all instances of the receiving bolt | Sending a signal to all bolts, e.g. clear cache or refresh state; sending a tick tuple to signal bolts to save state 
Custom grouping | Implement your own grouping so tuples are routed based on custom logic | Maximum flexibility to change processing sequence, logic etc. based on factors such as data types, load, seasonality 
Direct grouping | Source decides which bolt task will receive the tuple | Depends 
Global grouping | Sends tuples generated by all instances of the source to a single target instance (specifically, the task with the lowest ID) | Global counts 
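To make the fields and global rows more concrete, here is a minimal plain-Java sketch of how a grouping might pick a target task. `GroupingSketch` and its methods are hypothetical names, and hashing modulo the task count is an illustrative assumption rather than Storm's exact implementation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of how a grouping could choose a target task index.
final class GroupingSketch {
    // Fields grouping: equal grouping-field values always map to the same task.
    static int fieldsGrouping(List<Object> groupingValues, int numTasks) {
        return Math.floorMod(groupingValues.hashCode(), numTasks);
    }

    // Global grouping: every tuple goes to the task with the lowest ID.
    // Throws NoSuchElementException if taskIds is empty.
    static int globalGrouping(int[] taskIds) {
        return Arrays.stream(taskIds).min().getAsInt();
    }

    public static void main(String[] args) {
        List<Object> key = Arrays.<Object>asList("#storm");
        // Both calls return the same index because the key is the same.
        System.out.println(fieldsGrouping(key, 4) == fieldsGrouping(key, 4)); // prints true
        System.out.println(globalGrouping(new int[] {7, 3, 9})); // prints 3
    }
}
```

Because the target depends only on the field values, per-key state such as a counter can live in a single task.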
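Topology creation example 
Get Tweet → Find Hashtags → Count Hashtags → Report Findings 
• Get Tweet: Kafka Spout "reader" (registered as "spout" in the code below) 
• Find Hashtags: Bolt "normalizer". Removes non-alphanumeric characters, extracts hashtag values and emits them. 
• Count Hashtags: Bolt "enumerator". Keeps track of how many instances of each hashtag have occurred. 
• Report Findings: Bolt "reporter". Regularly creates a report and uploads it to Amazon S3. 
// kafkaSpout is assumed to be a Kafka spout configured elsewhere 
TopologyBuilder builder = new TopologyBuilder(); 
builder.setSpout("spout", kafkaSpout); 
// 2 executors; shuffle grouping spreads tweets evenly across them 
builder.setBolt("normalizer", new HashTagNormalizer(), 2).shuffleGrouping("spout"); 
// fields grouping sends the same hashtag to the same task, so each count lives in one place 
builder.setBolt("enumerator", new HashTagEnumerator(), 2).fieldsGrouping("normalizer", new Fields("hashtag")); 
// global grouping routes all counts to the single reporter task 
builder.setBolt("reporter", new ResultsReporter(), 1).globalGrouping("enumerator"); 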
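The slide does not show the bolt bodies. As a rough sketch of what the "normalizer" and "enumerator" steps could do, the plain-Java example below extracts and counts hashtags. `HashtagPipelineSketch`, the regular expression and the lower-casing are assumptions made for illustration, not the deck's `HashTagNormalizer` or `HashTagEnumerator`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-ins for the normalizer and enumerator logic.
final class HashtagPipelineSketch {
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    // "normalizer": extract hashtag values without the '#', lower-cased (an assumption).
    static List<String> normalize(String tweet) {
        List<String> tags = new ArrayList<>();
        Matcher m = HASHTAG.matcher(tweet);
        while (m.find()) {
            tags.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return tags;
    }

    // "enumerator": keep a running count per hashtag.
    static Map<String, Integer> enumerate(List<String> tweets) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tweet : tweets) {
            for (String tag : normalize(tweet)) {
                counts.merge(tag, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList("Learning #Storm and #kafka", "#storm rocks!");
        System.out.println(enumerate(tweets).get("storm")); // prints 2
    }
}
```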
What happens on failure? 
• Run everything with monitoring 
–E.g. daemontools or monit 
–Restarts Nimbus and Supervisors on failure 
• Nimbus 
–Stateless (state is kept in ZooKeeper or on disk) 
–Single point of failure, sort of 
– Workers still function, but can’t be reassigned when a node fails 
– Supervisors continue as normal 
• Supervisor 
–Stateless 
• Entire node 
–Nimbus reassigns that machine’s tasks after a timeout 
Guaranteed Processing 
• Tuples from Spout are tagged with a message ID 
• Each of these tuples can result in a tuple tree 
• Once every tuple in the tuple tree is processed, the 
original tuple is considered to be processed. 
• Requires two pieces from the user 
–Explicitly anchoring an emitted tuple to the input tuple(s) 
–Ack or fail every tuple. 
• If a tuple isn’t processed quickly enough, a timeout value 
will cause a failure. 
• Spouts like the Kafka spout can replay tuples on failure, 
either as explicitly indicated by bolts or from timeouts. 
–At least once processing! 
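The bookkeeping behind these bullets can be sketched in plain Java as a per-message counter of tuples that have not been acked yet. `AckTrackerSketch` and its methods are hypothetical, and this is a deliberate simplification: it is not how Storm's acker is implemented, and timeouts are left out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified tuple-tree tracker (not Storm's actual acker).
final class AckTrackerSketch {
    // message ID -> number of tuples in its tree that are not acked yet
    private final Map<Long, Integer> pending = new HashMap<>();

    // The spout emits a root tuple tagged with a message ID.
    void emitFromSpout(long messageId) {
        pending.put(messageId, 1);
    }

    // A bolt emitted `count` new tuples anchored to a tuple of this tree.
    void anchor(long messageId, int count) {
        pending.computeIfPresent(messageId, (id, left) -> left + count);
    }

    // Ack one tuple; returns true once the whole tree is processed.
    boolean ack(long messageId) {
        Integer left = pending.get(messageId);
        if (left == null) {
            return false; // unknown, already completed, or already failed
        }
        if (left == 1) {
            pending.remove(messageId);
            return true;
        }
        pending.put(messageId, left - 1);
        return false;
    }

    // Explicit fail (a timeout would do the same): forget the tree.
    // Returns true if the spout should replay this message.
    boolean fail(long messageId) {
        return pending.remove(messageId) != null;
    }
}
```

A replayed message goes through the whole tree again, which is why the guarantee is at-least-once rather than exactly-once.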
What is Trident? 
• Provides exactly-once processing semantics in Storm 
• Core concept is to process a group of tuples as a batch rather than one tuple at a time, as core Storm does 
• Higher-level API for defining topologies 
• Under the covers, all Trident topologies are automatically converted into spouts and bolts 
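One way batching helps with exactly-once updates is to store the ID of the last applied batch next to the state, so a replayed batch can be detected and skipped. The sketch below illustrates that idea in plain Java. `TransactionalCountSketch` is a hypothetical name, and the sketch assumes batches are applied in order and that a replayed batch carries the same ID and contents.

```java
// Hypothetical sketch of an idempotent, batch-at-a-time state update.
final class TransactionalCountSketch {
    private long count = 0;
    private long lastTxId = -1; // ID of the last batch applied to count

    // Apply a batch's partial count. Returns false if the batch was a replay.
    boolean applyBatch(long txId, long batchCount) {
        if (txId == lastTxId) {
            return false; // already applied; skip so nothing is counted twice
        }
        count += batchCount;
        lastTxId = txId;
        return true;
    }

    long count() {
        return count;
    }

    public static void main(String[] args) {
        TransactionalCountSketch state = new TransactionalCountSketch();
        state.applyBatch(1, 5);
        state.applyBatch(1, 5); // replay of batch 1 is ignored
        state.applyBatch(2, 3);
        System.out.println(state.count()); // prints 8
    }
}
```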
Parallelism 
• Three basic variables: number of slots, number of workers, number of tasks 
–No general rule for choosing values; profile and adjust 
• Can set the number of executors (threads) 
• Can set the number of tasks 
–Tasks are NOT parallel within an executor 
–More than one task per executor is useful for rebalancing while the topology is running 
• Number of workers 
–Increase when bottlenecked on CPU and each worker has many tuples to process 
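To see why extra tasks help with rebalancing, the sketch below spreads a fixed number of tasks over a varying number of executors. `TaskAssignmentSketch` and the round-robin rule are illustrative assumptions, not Storm's scheduler.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: spread task IDs 0..numTasks-1 over executors round-robin.
final class TaskAssignmentSketch {
    static List<List<Integer>> assign(int numTasks, int numExecutors) {
        List<List<Integer>> executors = new ArrayList<>();
        for (int e = 0; e < numExecutors; e++) {
            executors.add(new ArrayList<>());
        }
        for (int t = 0; t < numTasks; t++) {
            executors.get(t % numExecutors).add(t);
        }
        return executors;
    }

    public static void main(String[] args) {
        // 4 tasks on 2 executors: each thread runs its 2 tasks one after the other.
        System.out.println(assign(4, 2)); // prints [[0, 2], [1, 3]]
        // After a rebalance to 4 executors the same 4 tasks each get their own thread.
        System.out.println(assign(4, 4)); // prints [[0], [1], [2], [3]]
    }
}
```

The task count stays fixed while the executor count changes, so starting with more tasks than executors leaves room to add threads later.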
Patterns – Streaming Joins 
• Combine two or more data streams 
• Unlike a database join, a streaming join has infinite input and unclear semantics 
• Different types of joins for different use cases 
• Partition input streams the same way 
–Fields grouping 
builder.setBolt("join", new MyJoiner(), parallelism) 
    .fieldsGrouping("1", new Fields("joinfield1", "joinfield2")) 
    .fieldsGrouping("2", new Fields("joinfield1", "joinfield2")) 
    .fieldsGrouping("3", new Fields("joinfield1", "joinfield2")); 
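The slide leaves `MyJoiner` undefined. As one possible shape for the join logic, the plain-Java sketch below does an inner join of two streams on a key, buffering every value seen so far. `StreamJoinSketch` and its methods are hypothetical, and the buffers are never expired, which a real streaming join would have to handle because the input is infinite.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical two-stream inner join keyed by the join field.
final class StreamJoinSketch {
    private final Map<String, List<String>> left = new HashMap<>();
    private final Map<String, List<String>> right = new HashMap<>();

    // A value arrives on the left stream; returns the joined pairs it produces.
    List<String> onLeft(String key, String value) {
        left.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        List<String> joined = new ArrayList<>();
        for (String r : right.getOrDefault(key, Collections.emptyList())) {
            joined.add(value + "|" + r);
        }
        return joined;
    }

    // A value arrives on the right stream; returns the joined pairs it produces.
    List<String> onRight(String key, String value) {
        right.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        List<String> joined = new ArrayList<>();
        for (String l : left.getOrDefault(key, Collections.emptyList())) {
            joined.add(l + "|" + value);
        }
        return joined;
    }
}
```

This only works if both streams send a given key to the same task, which is what partitioning the inputs the same way with fields grouping provides.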
Patterns – Batching 
• For efficiency 
–E.g. Elasticsearch bulk API 
• Hold on to tuples in an instance variable 
• Process the tuples as one batch 
• Ack all the held tuples 
• When emitting, consider a multi-anchored tuple to ensure reliability 
–Anchor to all batched tuples so that all of them are replayed on failure 
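The hold, process, ack sequence can be sketched in plain Java as follows. `BatchingSketch`, `bulkWriter` and `acker` are hypothetical names standing in for a bolt, a bulk request such as the Elasticsearch bulk API, and the collector's ack call. The size-based flush is an assumption, since the slide does not say when to flush.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical batching helper: buffer tuples, write them in bulk, then ack them.
final class BatchingSketch<T> {
    private final int batchSize;
    private final Consumer<List<T>> bulkWriter; // e.g. one bulk request per batch
    private final Consumer<T> acker;            // acks a single tuple
    private final List<T> buffer = new ArrayList<>();

    BatchingSketch(int batchSize, Consumer<List<T>> bulkWriter, Consumer<T> acker) {
        this.batchSize = batchSize;
        this.bulkWriter = bulkWriter;
        this.acker = acker;
    }

    // Called once per incoming tuple.
    void execute(T tuple) {
        buffer.add(tuple);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Process the whole batch, then ack every tuple in it.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        bulkWriter.accept(new ArrayList<>(buffer));
        // Reached only if the bulk write did not throw, so nothing is acked early.
        for (T tuple : buffer) {
            acker.accept(tuple);
        }
        buffer.clear();
    }
}
```

Tuples held in the buffer stay un-acked, so the batch needs to flush well within the tuple timeout or they will be replayed.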
Patterns – Streaming Top N 
• Simplest way is to have a bolt that does a global grouping on the stream and maintains an in-memory list of the top N items 
– Doesn’t scale because the whole stream goes through one task 
• Alternative: compute many top N’s across partitions of the stream 
• Merge each partition’s top N to get the global top N 
• Use fields grouping to get the partitioning 
builder.setBolt("rank", new RankObjects(), parallelism) 
    .fieldsGrouping("objects", new Fields("value")); 
builder.setBolt("merge", new MergeObjects()) 
    .globalGrouping("rank"); 
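The rank-then-merge idea can be sketched in plain Java as below. `TopNSketch`, `topN` and `merge` are hypothetical stand-ins for what `RankObjects` and `MergeObjects` might compute. The sketch assumes each key is counted in exactly one partition, which is what fields grouping on the key provides.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of partitioned top N followed by a global merge.
final class TopNSketch {
    // "rank" step: the N highest-count entries of one partition, highest first.
    static List<Map.Entry<String, Long>> topN(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    // "merge" step: union the per-partition top-N lists and take the top N again.
    static List<Map.Entry<String, Long>> merge(List<List<Map.Entry<String, Long>>> partials, int n) {
        Map<String, Long> combined = new HashMap<>();
        for (List<Map.Entry<String, Long>> partial : partials) {
            for (Map.Entry<String, Long> entry : partial) {
                // Keys do not overlap across partitions, so a plain put is enough.
                combined.put(entry.getKey(), entry.getValue());
            }
        }
        return topN(combined, n);
    }
}
```

The merge task sees at most N entries per partition instead of the whole stream, which is what lets this variant scale.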

Cleveland HUG - Storm

  • 1. Apache Storm © Hortonworks Inc. 2013 Page 1 Justin Leet jleet@hortonworks.com
  • 2. What is Storm? • Real time stream processing framework • Scalable –Up to 1 million tuples per second per node • Fault Tolerant –Tasks reassigned on failure • Guaranteed Processing –At least once processing –Exactly once processing with some more work • Relatively language agnostic –Primarily JVM based – Thrift API for defining and submitting topologies –JSON based protocol for defining components in other languages © Hortonworks Inc. 2013 Page 2
  • 3. Motivation • Process large amount of incoming data real time • Classic use case is processing streams of tweets –Calculate trending users –Calculate reach of a tweet • Data cleansing and normalization • Personalization and recommendation • Log processing © Hortonworks Inc. 2013 Page 3
  • 4. Lambda Architecture © Hortonworks Inc. 2013 Page 4 Source: http://swaroopch.com/2013/01/12/big-data-nathan-marz/ • Most useful when – Batch & speed layers do essentially the same computation – Sample use case: KPI dashboard • Less useful when – When batch & speed layers do different computation – Sample use case: Real-time model scoring
  • 5. Basic Concepts © Hortonworks Inc. 2013 Tuple: Most fundamental data structure and is a named list of values that can be of any datatype Page 5 Streams: Groups of tuples Spouts: Generate streams. Bolts: Contain data processing, persistence and alerting logic. Can also emit tuples for downstream bolts Tuple Tree: First tuple and all the tuples that were emitted by the bolts that processed it Topology: Group of spouts and bolts wired together into a workflow
  • 6. Architecture © Hortonworks Inc. 2013 Nimbus(Management server) • Similar to job tracker • Distributes code around cluster • Assigns tasks • Handles failures Supervisor(Worker nodes): • Similar to task tracker • Run bolts and spouts as ‘tasks’ ZooKeeper: • Cluster co-ordination • Nimbus HA • Stores cluster metrics • Consumption related metadata for Trident topologies
  • 7. Relationship Between Supervisors, Workers, Executors & Tasks supervisor © Hortonworks Inc. 2013 Each supervisor machine in storm has specific Predefined ports to which a worker process is assigned Page 7 Source: http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/
  • 8. Tuple Routing Fields grouping provides various ways to control tuple routing to bolts. © Hortonworks Inc. 2013 Page 8 Grouping type What it does When to use Shuffle Grouping Sends tuple to a bolt in random round robin sequence - Doing atomic operations eg. math operations. Fields Grouping Sends tuples to a bolt based on one or or more field's in the tuple - Segmentation of the incoming stream. - Counting tuples of a certain type. All grouping Sends a single copy of each tuple to all instances of a receiving bolt - Send some signal to all bolts like clear cache or refresh state etc. - Send ticker tuple to signal bolts to save state etc. Custom grouping Implement your own field grouping so tuples are routed based on custom logic - Used to get max flexibility to change processing sequence, logic etc. based on different factors like data types, load, seasonality etc. Direct grouping Source decides which bolt will receive tuple - Depends. Global grouping Global Grouping sends tuples generated by all instances of the source to a single target instance (specifically, the task with lowest ID) - Global counts.
  • 9. Topology creation example
    Pipeline: Get Tweet → Find Hashtags → Count Hashtags → Report Findings
    • Kafka spout "reader"
    • Bolt "normalizer": Removes non-alphanumeric characters, extracts hashtag values, and emits them
    • Bolt "enumerator": Keeps track of how many instances of each hashtag have occurred
    • Bolt "reporter": Regularly creates a report and uploads it to Amazon S3

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("spout", kafkaSpout);
    builder.setBolt("normalizer", new HashTagNormalizer(), 2).shuffleGrouping("spout");
    builder.setBolt("enumerator", new HashTagEnumerator(), 2).fieldsGrouping("normalizer", new Fields("hashtag"));
    builder.setBolt("reporter", new ResultsReporter(), 1).globalGrouping("enumerator");
  • 10. What happens on failure?
    • Run everything with monitoring
      – E.g. daemontools or monit
      – Restarts Nimbus and Supervisors on failure
    • Nimbus
      – Stateless (state is kept in either ZooKeeper or on disk)
      – A single point of failure, sort of
        – Workers still function, but can't be reassigned when a node fails
        – Supervisors continue as normal
    • Supervisor
      – Stateless
    • Entire node
      – Nimbus reassigns the tasks on that machine after a timeout
  • 11. Guaranteed Processing
    • Tuples from a spout are tagged with a message ID
    • Each of these tuples can result in a tuple tree
    • Once every tuple in the tuple tree is processed, the original tuple is considered processed
    • Requires two things from the user
      – Explicitly anchoring each emitted tuple to its input tuple(s)
      – Acking or failing every tuple
    • If a tuple isn't processed within the timeout, it is treated as failed
    • Spouts like the Kafka spout can replay tuples on failure, whether the failure was indicated explicitly by a bolt or caused by a timeout
      – At least once processing!
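The anchoring and acking rules above can be modeled in a few lines. This is a simplified sketch, not Storm's acker code: `AckTrackerSketch` and its methods are hypothetical, and the XOR bookkeeping is an assumption borrowed from how Storm's documentation describes its acker, not something these slides state. Each tuple ID is XORed into a per-root value once when it is anchored and once when it is acked, so the value returns to zero exactly when every tuple in the tree has been acked.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of tuple-tree tracking; not Storm's implementation.
class AckTrackerSketch {
    // Root message ID -> XOR of every tuple ID anchored or acked so far.
    private final Map<Long, Long> ackVals = new HashMap<>();

    // A spout emits the root tuple of a new tree.
    void emitRoot(long rootId, long tupleId) {
        ackVals.put(rootId, tupleId);
    }

    // A bolt emits a new tuple anchored to an existing tree.
    void anchor(long rootId, long newTupleId) {
        ackVals.computeIfPresent(rootId, (id, val) -> val ^ newTupleId);
    }

    // A bolt acks a tuple. Returns true once the whole tree is processed.
    boolean ack(long rootId, long tupleId) {
        Long current = ackVals.get(rootId);
        if (current == null) {
            return false; // unknown, failed, or already completed tree
        }
        long updated = current ^ tupleId;
        if (updated == 0L) {
            ackVals.remove(rootId); // the spout could now be told the message succeeded
            return true;
        }
        ackVals.put(rootId, updated);
        return false;
    }

    // A bolt fails a tuple (or a timeout fires): the tree is dropped so the spout can replay it.
    boolean fail(long rootId) {
        return ackVals.remove(rootId) != null;
    }

    boolean isPending(long rootId) {
        return ackVals.containsKey(rootId);
    }
}
```

In this model, a bolt that forgets to ack leaves the tree pending until `fail` is called, which mirrors the timeout-then-replay behaviour that gives at-least-once processing.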
  • 12. What is Trident?
    • Provides exactly-once processing semantics in Storm
    • The core concept is to process a group of tuples as a batch, rather than one tuple at a time as core Storm does
    • Higher-level API for defining topologies
    • Under the covers, all Trident topologies are automatically converted into spouts and bolts
  • 13. Parallelism
    • Three basic variables: # slots, # workers, # tasks
      – No general way to choose them beyond profiling and adjusting
    • Can set the number of executors (threads)
    • Can set the number of tasks
      – Tasks do NOT run in parallel within an executor
      – More than one task per executor is useful for rebalancing while the topology is running
    • Number of workers
      – Increase when bottlenecked on CPU and each worker has many tuples to process
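The rebalancing point is easier to see with numbers. The sketch below is plain Java, not Storm's scheduler: `ParallelismSketch.assignTasks` is a hypothetical helper, and the round-robin split is an assumption chosen only to show that a fixed set of tasks can be spread over a changing number of executors.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of spreading a component's tasks over its executors; not Storm's scheduler.
class ParallelismSketch {

    // Returns one list of task IDs per executor. Tasks in the same list run
    // one after another on that executor's thread, not in parallel.
    static List<List<Integer>> assignTasks(int numTasks, int numExecutors) {
        if (numTasks < 1 || numExecutors < 1) {
            throw new IllegalArgumentException("need at least one task and one executor");
        }
        // An executor with no task would have nothing to run, so cap at numTasks.
        int executors = Math.min(numExecutors, numTasks);
        List<List<Integer>> assignment = new ArrayList<>();
        for (int e = 0; e < executors; e++) {
            assignment.add(new ArrayList<>());
        }
        for (int task = 0; task < numTasks; task++) {
            assignment.get(task % executors).add(task); // round-robin (assumed)
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Deployed with 4 tasks on 2 executors.
        System.out.println(assignTasks(4, 2)); // [[0, 2], [1, 3]]
        // After rebalancing to 4 executors, the same 4 tasks each get a thread.
        System.out.println(assignTasks(4, 4)); // [[0], [1], [2], [3]]
    }
}
```

Because the task count never changes, a fields grouping keeps sending each key to the same task before and after the rebalance; only the threads running those tasks change.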
  • 14. Patterns – Streaming Joins
    • Combine two or more data streams
    • Unlike a database join, a streaming join has infinite input and unclear semantics
    • Different types of joins for different use cases
    • Partition the input streams the same way, using fields grouping

    builder.setBolt("join", new MyJoiner(), parallelism)
        .fieldsGrouping("1", new Fields("joinfield1", "joinfield2"))
        .fieldsGrouping("2", new Fields("joinfield1", "joinfield2"))
        .fieldsGrouping("3", new Fields("joinfield1", "joinfield2"));
  • 15. Patterns – Batching
    • For efficiency
      – E.g. the Elasticsearch bulk API
    • Hold on to tuples in an instance variable
    • Process the tuples
    • Ack all the held tuples
    • When emitting, consider a multi-anchored tuple to ensure reliability
      – Anchor to the batched tuples to ensure all of them are replayed
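The hold, process, ack sequence can be sketched as follows. This is a stand-alone model, not a Storm bolt: `BatchingSketch`, `bulkSink`, and `acker` are hypothetical names, tuples are reduced to strings, and the fixed batch size is an assumption (a real bolt might also flush on a timer).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-alone model of a batching bolt; not Storm's bolt API.
class BatchingSketch {
    private final int batchSize;
    private final Consumer<List<String>> bulkSink; // e.g. one bulk request per batch
    private final Consumer<String> acker;          // called once per input tuple
    private final List<String> pending = new ArrayList<>(); // the "instance variable"

    BatchingSketch(int batchSize, Consumer<List<String>> bulkSink, Consumer<String> acker) {
        this.batchSize = batchSize;
        this.bulkSink = bulkSink;
        this.acker = acker;
    }

    // Called for each incoming tuple: hold it, and flush once the batch is full.
    void execute(String tuple) {
        pending.add(tuple);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    // Process the whole batch in one call, then ack every tuple in it.
    // If bulkSink throws, nothing is acked and the tuples stay pending.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<String> batch = new ArrayList<>(pending);
        bulkSink.accept(batch);
        for (String tuple : batch) {
            acker.accept(tuple);
        }
        pending.clear();
    }

    int pendingCount() {
        return pending.size();
    }
}
```

Acking only after the bulk call returns is the design choice that matters: tuples in a batch that was never written remain unacked, so they can be replayed rather than silently lost.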
  • 16. Patterns – Streaming Top N
    • The simplest way is to have a bolt that does a global grouping on the stream and maintains an in-memory list of the top N items
      – Doesn't scale, because the whole stream goes through one task
    • Alternative: compute many top N's across partitions of the stream
    • Merge each partition's top N to get the global top N
    • Use fields grouping to get the partitioning

    builder.setBolt("rank", new RankObjects(), parallelism)
        .fieldsGrouping("objects", new Fields("value"));
    builder.setBolt("merge", new MergeObjects())
        .globalGrouping("rank");
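The rank-then-merge idea can be tried in plain Java. This sketch is not the `RankObjects` or `MergeObjects` code from the slide, which the deck does not show: `TopNSketch` and its methods are hypothetical, and it assumes each partition already holds per-key counts and that fields grouping has placed every key in exactly one partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of partitioned top N followed by a global merge; not Storm code.
class TopNSketch {

    // "rank" step: the top N entries of one partition, highest count first.
    static Map<String, Long> topN(Map<String, Long> counts, int n) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Long.compare(b.getValue(), a.getValue()); // descending count
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey()); // stable tie-break
        });
        Map<String, Long> top = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : entries.subList(0, Math.min(n, entries.size()))) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }

    // "merge" step: combine the per-partition results and rank again.
    // Assumes keys are disjoint across partitions, as fields grouping would ensure.
    static Map<String, Long> merge(List<Map<String, Long>> partitionTops, int n) {
        Map<String, Long> combined = new HashMap<>();
        for (Map<String, Long> partial : partitionTops) {
            combined.putAll(partial);
        }
        return topN(combined, n);
    }

    public static void main(String[] args) {
        Map<String, Long> p0 = new HashMap<>();
        p0.put("a", 5L);
        p0.put("b", 1L);
        p0.put("c", 3L);
        Map<String, Long> p1 = new HashMap<>();
        p1.put("d", 4L);
        p1.put("e", 2L);
        System.out.println(merge(Arrays.asList(topN(p0, 2), topN(p1, 2)), 2)); // {a=5, d=4}
    }
}
```

The merge task only ever sees N entries per partition, which is why this scales where a single global-grouping bolt does not.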

Editor's Notes

  1. Nimbus is similar to the JobTracker: it distributes code around the cluster, assigns tasks, and handles failures. The Supervisor is similar to the TaskTracker. ZooKeeper handles coordination between Nimbus and the Supervisors: they find each other through ZooKeeper, which is a hugely common use case for it. Communication between workers uses ZeroMQ.
  2. The number of tasks for a spout/bolt is always the same throughout the lifetime of a topology, but the number of executors (threads) for a spout/bolt can change over time. This allows a topology to scale to more or fewer resources without redeploying the topology or violating the constraints of Storm (such as a fields grouping guaranteeing that the same value goes to the same task). A supervisor has one or more worker processes. Each worker process has one or more executor threads, and each executor runs one or more tasks of the same component (spout or bolt). NOTE: Running multiple tasks per executor thread does not give any performance benefit; it exists so that processing capacity can be expanded later as the cluster grows. NOTE: Having fewer tasks than executors does not make sense. Typically #tasks = #executors. Source: http://www.michael-noll.com/blog/2012/10/16/understanding-the-parallelism-of-a-storm-topology/