Scaling Apache Storm 
P. Taylor Goetz, Hortonworks 
@ptgoetz
About Me 
Member of Technical Staff / Storm Tech Lead 
@ Hortonworks 
Apache Storm PMC Chair 
@ Apache
Volunteer Firefighter since 2004
1M+ messages / sec. on a 10-15 
node cluster 
How do you get there?
How do you fight fire?
Put the wet stuff on the red stuff. 
Water, and lots of it.
When you're dealing with big fire, you 
need big water.
Static Water Sources 
Lakes 
Streams 
Reservoirs, Pools, Ponds
Data Hydrant 
Active source 
Under pressure
How does this relate to Storm?
Little’s Law 
L=λW 
The long-term average number of customers in a stable system L 
is equal to the long-term average effective arrival rate, λ, multiplied 
by the average time a customer spends in the system, W; or 
expressed algebraically: L = λW. 
http://en.wikipedia.org/wiki/Little's_law
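As a rough worked example of Little's Law applied to a topology: L is the number of tuples in flight, λ is the arrival rate, and W is the time a tuple spends in the system. The 50 ms latency below is an assumed figure for illustration, not a measurement from the talk, and the class and method names are hypothetical.

```java
// Little's Law: L = lambda * W.
final class LittlesLaw {
    /** Average number of tuples in flight for a given arrival rate and latency. */
    static long tuplesInFlight(long arrivalsPerSecond, long latencyMillis) {
        return arrivalsPerSecond * latencyMillis / 1000;
    }

    public static void main(String[] args) {
        long rate = 1_000_000; // 1M messages/sec, the throughput quoted in the talk
        long latencyMs = 50;   // assumed average time a tuple spends in the system
        System.out.println(tuplesInFlight(rate, latencyMs) + " tuples in flight");
    }
}
```

At a fixed arrival rate, doubling the time spent in the system doubles the number of tuples that must be held somewhere.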
Batch vs. Streaming
Batch Processing 
Operates on data at rest 
Velocity is a function of 
performance 
Poor performance costs you time
Stream Processing 
Data in motion 
At the mercy of your data source 
Velocity fluctuates over time 
Poor performance….
Poor performance bursts the pipes. 
Buffers fill up and eat memory 
Timeouts / Replays 
“Sink” systems overwhelmed
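To make "buffers fill up" concrete, here is a small arithmetic sketch: when arrivals outpace processing, the surplus accumulates every second. The rates are hypothetical, and the class and method names are made up for illustration.

```java
final class Backlog {
    /** Tuples queued after the given time when arrivals outpace processing. */
    static long queuedAfter(long arrivalsPerSecond, long processedPerSecond, long seconds) {
        long surplus = Math.max(0L, arrivalsPerSecond - processedPerSecond);
        return surplus * seconds;
    }

    public static void main(String[] args) {
        // Assumed: 1M tuples/sec arriving, 900K tuples/sec processed, for one minute.
        System.out.println(queuedAfter(1_000_000, 900_000, 60) + " tuples queued");
    }
}
```

A 10% shortfall looks small, but under these assumed rates it leaves millions of tuples sitting in memory after a minute.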
What can developers do?
Keep tuple processing code tight
public class MyBolt extends BaseRichBolt {

    public void prepare(Map stormConf,
                        TopologyContext context,
                        OutputCollector collector) {
        // initialize task (setup code)
    }

    public void execute(Tuple input) {
        // process input, QUICKLY! (called for every tuple)
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // declare output (setup code)
    }

}
Worry about this: execute(), which is called for every tuple.
Not this: prepare() and declareOutputFields(), which are setup code.
Know your latencies
Operation                                    Time (ns)     Time (ms)  Notes
L1 cache reference                                 0.5
Branch mispredict                                    5
L2 cache reference                                   7                14x L1 cache
Mutex lock/unlock                                   25
Main memory reference                              100                20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy                     3,000
Send 1K bytes over 1 Gbps network               10,000          0.01
Read 4K randomly from SSD*                     150,000          0.15
Read 1 MB sequentially from memory             250,000          0.25
Round trip within same datacenter              500,000          0.5
Read 1 MB sequentially from SSD*             1,000,000          1     4X memory
Disk seek                                   10,000,000         10     20x datacenter roundtrip
Read 1 MB sequentially from disk            20,000,000         20     80x memory, 20X SSD
Send packet CA->Netherlands->CA            150,000,000        150
https://gist.github.com/jboner/2841832
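One way to read these latency figures is as a per-tuple budget: if a single thread pays a given cost once per tuple, that cost caps how many tuples it can handle per second. The sketch below only does that division; the class and method names are hypothetical.

```java
final class LatencyBudget {
    static final long NANOS_PER_SECOND = 1_000_000_000L;

    /** Upper bound on tuples/sec for one thread that pays this cost once per tuple. */
    static long maxTuplesPerSecond(long costNanos) {
        return NANOS_PER_SECOND / costNanos;
    }

    public static void main(String[] args) {
        System.out.println("main memory reference: " + maxTuplesPerSecond(100));
        System.out.println("datacenter round trip: " + maxTuplesPerSecond(500_000));
        System.out.println("disk seek:             " + maxTuplesPerSecond(10_000_000));
    }
}
```

A bolt that makes one blocking round trip per tuple is limited to a few thousand tuples per second per thread, which is why remote calls in the hot path usually need caching, batching, or much more parallelism.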
Use a Cache 
Guava is your friend.
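A minimal sketch of the caching idea, using only the JDK so it runs without dependencies. The slide recommends Guava; this LinkedHashMap-based LRU cache is a stand-in and not Guava's API. The class name and capacity are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Small LRU cache: keeps the most recently used entries and evicts the eldest.
final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true orders entries by last access
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");    // touch "a" so "b" becomes the eldest entry
        cache.put("c", 3); // exceeds capacity, so "b" is evicted
        System.out.println(cache.keySet());
    }
}
```

LinkedHashMap is not thread-safe, so this sketch assumes the cache is only touched from one thread. A library cache adds features such as time-based expiry that this version leaves out.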
Expose your knobs and gauges. 
DevOps will appreciate it.
Externalize Configuration 
Hard-coded values require 
recompilation/repackaging. 
conf.setNumWorkers(3); 
builder.setSpout("spout", new RandomSentenceSpout(), 5); 
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout"); 
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word")); 
Values from external config. 
No repackaging! 
// props is assumed to be a java.util.Properties loaded at startup
conf.setNumWorkers(Integer.parseInt(props.getProperty("num.workers")));
builder.setSpout("spout", new RandomSentenceSpout(), Integer.parseInt(props.getProperty("spout.parallelism")));
builder.setBolt("split", new SplitSentence(), Integer.parseInt(props.getProperty("split.parallelism"))).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), Integer.parseInt(props.getProperty("count.parallelism"))).fieldsGrouping("split", new Fields("word"));
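One way to read such values, sketched with java.util.Properties. Property names such as num.workers follow the slide; the helper names, the inline sample text, and the default values are assumptions for illustration. In practice the text would come from a file outside the topology jar.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class TopologySettings {
    /** Parses properties-format text (key=value lines). */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Reads an integer property, falling back to a default when it is absent. */
    static int intProp(Properties props, String key, int defaultValue) {
        String raw = props.getProperty(key);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        Properties props = parse("num.workers=3\nspout.parallelism=5\n");
        System.out.println(intProp(props, "num.workers", 1));
        System.out.println(intProp(props, "count.parallelism", 12)); // absent, so the default is used
    }
}
```

Falling back to a default keeps the topology deployable when a key is missing, while still letting operations override every value without a rebuild.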
What can DevOps do?
How big is your hose?
Find out!
Performance testing is essential! 
How to deal with small pipes? 
(i.e. When your output is more like a garden hose.)
Parallelize 
Slow sinks
Parallelism == Manifold 
Take input from one big pipe and 
distribute it to many smaller pipes 
The bigger the size difference, the 
more parallelism you will need
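A back-of-the-envelope sketch of sizing the manifold: divide the incoming rate by what one sink-writing task can sustain and round up. Both rates are hypothetical and would come from performance testing; the names are made up for illustration.

```java
final class SinkParallelism {
    /** Smallest number of parallel writers whose combined rate covers the input rate. */
    static long writersNeeded(long inputPerSecond, long perWriterPerSecond) {
        if (perWriterPerSecond <= 0) {
            throw new IllegalArgumentException("per-writer rate must be positive");
        }
        // Ceiling division: any remainder needs one more writer.
        return (inputPerSecond + perWriterPerSecond - 1) / perWriterPerSecond;
    }

    public static void main(String[] args) {
        // Assumed: 1M tuples/sec in, each writer sustains 30K writes/sec.
        System.out.println(writersNeeded(1_000_000, 30_000) + " writers");
    }
}
```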
Sizeup 
Initial assessment
Every fire is different.
Every streaming use case is different.
Sizeup — Fire 
What are my water 
sources? What GPM 
can they support? 
How many lines (hoses) 
do I need? 
How much water will I 
need to flow to put this 
fire out?
Sizeup — Storm 
What are my input 
sources? 
At what rate do they 
deliver messages? 
What size are the 
messages? 
What's my slowest data 
sink?
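The sizeup questions translate into simple arithmetic: message rate times message size gives the inbound data volume, and comparing the source rate with the slowest sink shows where the bottleneck is. The figures are hypothetical, as are the class and method names.

```java
final class Sizeup {
    /** Inbound volume in bytes per second for a given message rate and size. */
    static long bytesPerSecond(long messagesPerSecond, long bytesPerMessage) {
        return messagesPerSecond * bytesPerMessage;
    }

    /** True when the slowest sink cannot keep up with the source. */
    static boolean sinkIsBottleneck(long sourcePerSecond, long slowestSinkPerSecond) {
        return slowestSinkPerSecond < sourcePerSecond;
    }

    public static void main(String[] args) {
        // Assumed: 100K messages/sec of 1 KiB each, slowest sink accepts 20K writes/sec.
        System.out.println(bytesPerSecond(100_000, 1_024) + " bytes/sec inbound");
        System.out.println("sink is bottleneck: " + sinkIsBottleneck(100_000, 20_000));
    }
}
```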
There is no magic bullet.
But there are good starting points.
Numbers 
Where to start.
1 Worker / Machine / Topology 
Keep unnecessary network transfer to a minimum
1 Acker / Worker 
Default in Storm 0.9.x
1 Executor / CPU Core
Optimize Thread/CPU usage
(for CPU-bound use cases)
Multiply by 10x-100x for I/O bound use cases
Example
10 Worker Nodes
16 Cores / Machine
10 * 16 = 160 “Parallelism Units” available
Subtract # Ackers: 160 - 10 = 150 Units
(10 * 16) - 10 = 150 “Parallelism Units” available (* 10-100 if I/O bound)
Distribute these among the tasks in the topology. Higher for slow tasks, lower for fast tasks.
Example 
150 “Parallelism Units” available 
Emit    Calculate    Persist
10      40           100
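The arithmetic from this example can be sketched as follows. The 1:4:10 weights are simply the ratio implied by the 10 / 40 / 100 split; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ParallelismBudget {
    /** Starting budget: one executor per core, minus the ackers. */
    static int units(int nodes, int coresPerNode, int ackers) {
        return nodes * coresPerNode - ackers;
    }

    /** Splits the budget in proportion to per-component weights (integer division). */
    static Map<String, Integer> distribute(int units, Map<String, Integer> weights) {
        int total = 0;
        for (int w : weights.values()) {
            total += w;
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            result.put(e.getKey(), units * e.getValue() / total);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("emit", 1);       // fast
        weights.put("calculate", 4);
        weights.put("persist", 10);   // slow sink gets the most
        System.out.println(distribute(units(10, 16, 10), weights));
    }
}
```

Integer division can leave a few units unassigned when the weights do not divide the budget evenly; those can be handed to the slowest component.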
Watch Storm’s “capacity” metric 
This tells you how hard components are working. 
Adjust parallelism unit distribution accordingly.
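For reference, the Storm UI derives a bolt's capacity roughly as (tuples executed * average execute latency) / measurement window, so a value near 1.0 means the bolt is busy essentially the whole time. The sketch below only reproduces that arithmetic with made-up numbers; the 0.8 threshold is an assumed rule of thumb, not a Storm setting, and the names are hypothetical.

```java
final class BoltCapacity {
    /** Fraction of the window the bolt spent executing tuples. */
    static double capacity(long executed, double avgExecuteLatencyMs, long windowMs) {
        return executed * avgExecuteLatencyMs / windowMs;
    }

    /** Flags components that are close to saturation. */
    static boolean needsMoreParallelism(double capacity, double threshold) {
        return capacity >= threshold;
    }

    public static void main(String[] args) {
        // Assumed: 540,000 tuples executed at 1.0 ms each over a 10-minute window.
        double c = capacity(540_000, 1.0, 600_000);
        System.out.println("capacity = " + c);
        System.out.println("add parallelism: " + needsMoreParallelism(c, 0.8));
    }
}
```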
This is just a starting point. 
Test, test, test. Measure, measure, measure.
Internal 
Messaging 
Handling backpressure.
Internal Messaging (Intra-worker)
Key Settings 
topology.max.spout.pending 
Spout/Bolt API: Controls how many tuples are in-flight (not ack’ed) 
Trident API: Controls how many batches are in flight (not committed)
When reached, Storm will temporarily stop emitting data from Spout(s) 
WARNING: Default is “unset” (i.e. no limit)
Spout/Bolt API: Start High (~1,000) 
Trident API: Start Low (~1-5)
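These starting points can be written down as plain key/value configuration. The sketch uses a java.util.Map so it runs without Storm on the classpath; Storm's own Config class is itself a map keyed by these strings. The values come from the slide (1,000 and the upper end of 1-5); the class and helper names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class SpoutPendingDefaults {
    static final String MAX_SPOUT_PENDING = "topology.max.spout.pending";

    /** Suggested starting point: high for the Spout/Bolt API, low for Trident. */
    static Map<String, Object> startingConf(boolean trident) {
        Map<String, Object> conf = new HashMap<>();
        // Tuples in flight for the Spout/Bolt API, batches in flight for Trident.
        conf.put(MAX_SPOUT_PENDING, trident ? 5 : 1000);
        return conf;
    }

    public static void main(String[] args) {
        System.out.println("spout/bolt: " + startingConf(false));
        System.out.println("trident:    " + startingConf(true));
    }
}
```

Setting the value explicitly matters because the default is unset, which means nothing limits how many tuples the spouts emit ahead of the bolts.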
Key Settings 
topology.message.timeout.secs 
Controls how long a tuple tree (Spout/Bolt API) or batch (Trident API) has to 
complete processing before Storm considers it timed out and fails it. 
Default value is 30 seconds.
Q: “Why am I getting tuple/batch failures for no apparent reason?” 
A: Timeouts due to a bottleneck. 
Solution: Look at the “Complete Latency” metric. Increase timeout and/or 
increase component parallelism to address the bottleneck.
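A small sketch of the check described here: compare the observed complete latency with the configured timeout and flag topologies that are using up most of their headroom. The 50% warning fraction is an assumption for illustration, not a Storm default, and the names are hypothetical.

```java
final class TimeoutHeadroom {
    /** Fraction of the timeout consumed by the observed complete latency. */
    static double fractionUsed(double completeLatencyMs, int timeoutSecs) {
        return completeLatencyMs / (timeoutSecs * 1000.0);
    }

    /** True when the complete latency has consumed at least warnFraction of the timeout. */
    static boolean atRisk(double completeLatencyMs, int timeoutSecs, double warnFraction) {
        return fractionUsed(completeLatencyMs, timeoutSecs) >= warnFraction;
    }

    public static void main(String[] args) {
        int timeoutSecs = 30; // default topology.message.timeout.secs
        System.out.println(atRisk(24_000, timeoutSecs, 0.5)); // assumed 24 s complete latency
        System.out.println(atRisk(3_000, timeoutSecs, 0.5));  // assumed 3 s complete latency
    }
}
```

Complete latency is an average, so individual tuple trees can exceed the timeout well before the average does; that is the reason for warning at a fraction of the timeout rather than at the timeout itself.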
Turn knobs slowly, one at a time.
Don't mess with settings you don't 
understand.
Storm ships with sane defaults 
Override only as necessary
Hardware 
Considerations
Nimbus 
Generally light load 
Can co-locate the Storm UI service
m1.xlarge (or equivalent) should suffice 
Save the big metal for Supervisor/Worker machines…
Supervisor/Worker Nodes 
Where hardware choices have the most impact.
CPU Cores 
More is usually better 
The more you have the more 
threads you can support (i.e. 
parallelism) 
Storm potentially uses a lot of 
threads
Memory 
Highly use-case specific 
How many workers (JVMs) per 
node? 
Are you caching and/or holding 
in-memory state? 
Tests/metrics are your friends
Network 
Use bonded NICs if necessary 
Keep nodes “close”
Other performance considerations
Don’t “Pancake!” 
Separate concerns.
CPU Contention 
I/O Contention 
Disk Seeks (ZooKeeper)
Keep this guy happy. 
He has big boots and a shovel.
ZooKeeper Considerations 
Use dedicated machines, preferably 
bare-metal if an option 
Start with 3 node ensemble 
(can tolerate 1 node loss) 
I/O is ZooKeeper’s main bottleneck 
Dedicated disk for ZK storage 
SSDs greatly improve performance
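The "3 nodes tolerate 1 loss" figure follows from ZooKeeper needing a majority of the ensemble to stay available. A tiny sketch of that rule, with hypothetical names:

```java
final class ZkEnsemble {
    /** Servers that may fail while a majority of the ensemble is still up. */
    static int tolerableFailures(int ensembleSize) {
        return (ensembleSize - 1) / 2;
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 5; n++) {
            System.out.println(n + " nodes tolerate " + tolerableFailures(n) + " failure(s)");
        }
    }
}
```

An even-sized ensemble tolerates no more failures than the next smaller odd size, which is why ensembles are usually sized at 3, 5, or 7.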
Recap 
Know/track your latencies and code appropriately 
Externalize configuration 
Scaling is a matter of balancing the I/O and CPU requirements of your use case
Dev + DevOps + Ops coordination and collaboration is essential
Thanks! 
P. Taylor Goetz, Hortonworks 
@ptgoetz

Scaling Apache Storm - Strata + Hadoop World 2014

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

Scaling Apache Storm - Strata + Hadoop World 2014

  • 1. Scaling Apache Storm P. Taylor Goetz, Hortonworks @ptgoetz
  • 2. About Me Member of Technical Staff / Storm Tech Lead @ Hortonworks Apache Storm PMC Chair @ Apache
  • 3. About Me Member of Technical Staff / Storm Tech Lead @ Hortonworks Apache Storm PMC Chair @ Apache Volunteer Firefighter since 2004
  • 4. 1M+ messages / sec. on a 10-15 node cluster How do you get there?
  • 5. How do you fight fire?
  • 8. Put the wet stuff on the red stuff. Water, and lots of it.
  • 10. When you're dealing with big fire, you need big water.
  • 11. Static Water Sources Lakes Streams Reservoirs, Pools, Ponds
  • 13. Data Hydrant Active source Under pressure
  • 17. How does this relate to Storm?
  • 18. Little’s Law: L = λW. The long-term average number of customers in a stable system, L, is equal to the long-term average effective arrival rate, λ, multiplied by the average time a customer spends in the system, W. http://en.wikipedia.org/wiki/Little's_law
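As a rough illustration of how Little's Law maps onto a topology, the sketch below treats tuples as the "customers". The class name, method name, and the 50 ms figure are assumptions made for the example, not values from the talk.

```java
// Sketch: Little's Law (L = λW) applied to tuples in flight.
final class LittlesLawSketch {

    /** L = λ * W, with W in milliseconds so the arithmetic stays in whole numbers. */
    static long averageInFlight(long arrivalsPerSecond, long millisInSystem) {
        return arrivalsPerSecond * millisInSystem / 1000;
    }

    public static void main(String[] args) {
        // Hypothetical: 1,000,000 tuples/sec, each spending 50 ms in the topology.
        System.out.println(averageInFlight(1_000_000, 50)); // prints 50000
    }
}
```

Read the other way, if the time in the system grows while the arrival rate stays fixed, the number of tuples held in buffers grows with it.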
  • 20. Batch Processing Operates on data at rest Velocity is a function of performance Poor performance costs you time
  • 21. Stream Processing Data in motion At the mercy of your data source Velocity fluctuates over time Poor performance….
  • 22. Poor performance bursts the pipes. Buffers fill up and eat memory Timeouts / Replays “Sink” systems overwhelmed
  • 24. Keep tuple processing code tight
        public class MyBolt extends BaseRichBolt {

            public void prepare(Map stormConf,
                                TopologyContext context,
                                OutputCollector collector) {
                // initialize task
            }

            public void execute(Tuple input) {
                // process input - QUICKLY!
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // declare output
            }

        }
        Worry about this!
  • 25. Keep tuple processing code tight (same MyBolt skeleton as slide 24) Not this.
  • 26. Know your latencies
        Operation                                       Time (ns)    Time (ms)   Notes
        L1 cache reference                                    0.5
        Branch mispredict                                       5
        L2 cache reference                                      7                14x L1 cache
        Mutex lock/unlock                                      25
        Main memory reference                                 100                20x L2 cache, 200x L1 cache
        Compress 1K bytes with Zippy                        3,000
        Send 1K bytes over 1 Gbps network                  10,000    0.01
        Read 4K randomly from SSD*                        150,000    0.15
        Read 1 MB sequentially from memory                250,000    0.25
        Round trip within same datacenter                 500,000    0.5
        Read 1 MB sequentially from SSD*                1,000,000    1           4x memory
        Disk seek                                      10,000,000    10          20x datacenter round trip
        Read 1 MB sequentially from disk               20,000,000    20          80x memory, 20x SSD
        Send packet CA->Netherlands->CA               150,000,000    150
        https://gist.github.com/jboner/2841832
  • 27. Use a Cache Guava is your friend.
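The slide recommends Guava. The sketch below is not Guava: it is a small least-recently-used cache built on the JDK's `LinkedHashMap`, used here only to show the idea of keeping a slow lookup out of the per-tuple path. The class name and the loader function are hypothetical, and the class is not thread-safe, so it assumes one cache per bolt instance.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch: a bounded LRU cache that loads missing values on demand.
final class LruCacheSketch<K, V> {
    private final Map<K, V> entries;
    private final Function<K, V> loader;

    LruCacheSketch(int maxEntries, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder = true keeps the least recently used entry first.
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries; // evict once the bound is exceeded
            }
        };
    }

    /** Returns the cached value, calling the (slow) loader only on a miss. */
    V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            value = loader.apply(key);
            entries.put(key, value);
        }
        return value;
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        // The lambda stands in for an expensive lookup such as a database read.
        LruCacheSketch<String, Integer> cache = new LruCacheSketch<>(2, String::length);
        System.out.println(cache.get("storm")); // prints 5
    }
}
```

With the latency table in mind, a hit costs roughly a memory reference, while a miss that goes over the network costs a datacenter round trip or more.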
  • 28. Expose your knobs and gauges. DevOps will appreciate it.
  • 29. Externalize Configuration
        Hard-coded values require recompilation/repackaging.
            conf.setNumWorkers(3);
            builder.setSpout("spout", new RandomSentenceSpout(), 5);
            builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
            builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
        Values from external config. No repackaging!
            conf.setNumWorkers(props.get("num.workers"));
            builder.setSpout("spout", new RandomSentenceSpout(), props.get("spout.parallelism"));
            builder.setBolt("split", new SplitSentence(), props.get("split.parallelism")).shuffleGrouping("spout");
            builder.setBolt("count", new WordCount(), props.get("count.parallelism")).fieldsGrouping("split", new Fields("word"));
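One way the external values might be read is sketched below with `java.util.Properties`. The slide does not say what type `props` is, so the use of `Properties`, the helper name `intSetting`, and the key names are all assumptions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

// Sketch: reading integer topology settings from an external properties source.
final class TopologySettingsSketch {

    /** Returns the integer stored under key, or fallback when the key is absent. */
    static int intSetting(Properties props, String key, int fallback) {
        String raw = props.getProperty(key);
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        // In practice this would be loaded from a file kept outside the topology jar.
        props.load(new StringReader("num.workers=3\nspout.parallelism=5\n"));
        System.out.println(intSetting(props, "num.workers", 1));       // prints 3
        System.out.println(intSetting(props, "split.parallelism", 8)); // prints 8 (fallback)
    }
}
```

The parsed values could then be handed to calls such as `conf.setNumWorkers(...)`, so that changing parallelism means editing a file and resubmitting rather than rebuilding.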
  • 31. How big is your hose?
  • 33. Performance testing is essential!
  • 34. How to deal with small pipes? (i.e. When your output is more like a garden hose.)
  • 36. Parallelism == Manifold Take input from one big pipe and distribute it to many smaller pipes The bigger the size difference, the more parallelism you will need
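The manifold idea can be put into numbers with a simple ceiling division: how many small pipes does it take to carry what one big pipe delivers? The sketch below is a back-of-the-envelope estimate with hypothetical rates, and it ignores skew and overhead.

```java
// Sketch: how many parallel tasks are needed to keep up with an input rate.
final class ManifoldSketch {

    /** Smallest task count whose combined rate covers the input rate. */
    static long tasksNeeded(long inputPerSecond, long perTaskPerSecond) {
        return (inputPerSecond + perTaskPerSecond - 1) / perTaskPerSecond; // ceiling division
    }

    public static void main(String[] args) {
        // Hypothetical: 100,000 msgs/sec in, each sink writer handles 2,500 msgs/sec.
        System.out.println(tasksNeeded(100_000, 2_500)); // prints 40
        // A slower writer (1,000 msgs/sec) needs proportionally more tasks.
        System.out.println(tasksNeeded(100_000, 1_000)); // prints 100
    }
}
```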
  • 38. Every fire is different.
  • 40. Every streaming use case is different.
  • 41. Sizeup — Fire What are my water sources? What GPM can they support? How many lines (hoses) do I need? How much water will I need to flow to put this fire out?
  • 42. Sizeup — Storm What are my input sources? At what rate do they deliver messages? What size are the messages? What's my slowest data sink?
  • 43. There is no magic bullet.
  • 44. But there are good starting points.
  • 46. 1 Worker / Machine / Topology Keep unnecessary network transfer to a minimum
  • 47. 1 Acker / Worker Default in Storm 0.9.x
  • 48. 1 Executor / CPU Core Optimize Thread/CPU usage
  • 49. 1 Executor / CPU Core (for CPU-bound use cases)
  • 50. 1 Executor / CPU Core Multiply by 10x-100x for I/O bound use cases
  • 51. Example 10 Worker Nodes 16 Cores / Machine 10 * 16 = 160 “Parallelism Units” available
  • 52. Example 10 Worker Nodes 16 Cores / Machine 10 * 16 = 160 “Parallelism Units” available. Subtract # Ackers: 160 - 10 = 150 Units.
  • 53. Example 10 Worker Nodes 16 Cores / Machine (10 * 16) - 10 = 150 “Parallelism Units” available
  • 54. Example 10 Worker Nodes 16 Cores / Machine (10 * 16) - 10 = 150 “Parallelism Units” available (* 10-100 if I/O bound) Distribute this among tasks in topology. Higher for slow tasks, lower for fast tasks.
  • 55. Example 150 “Parallelism Units” available Emit Calculate Persist 10 40 100
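The arithmetic of this example can be reproduced in a few lines. The sketch below computes the unit budget and splits it by relative weight. The 1:4:10 weights are an assumption chosen so the result matches the 10 / 40 / 100 split on the slide, and the class and method names are made up for the example.

```java
// Sketch: computing a parallelism budget and dividing it among components.
final class ParallelismBudgetSketch {

    /** One unit per core across the cluster, minus one acker per worker. */
    static int units(int nodes, int coresPerNode, int ackers) {
        return nodes * coresPerNode - ackers;
    }

    /** Splits a budget in proportion to relative weights (integer division rounds down). */
    static int[] split(int budget, int... weights) {
        int total = 0;
        for (int w : weights) {
            total += w;
        }
        int[] shares = new int[weights.length];
        for (int i = 0; i < weights.length; i++) {
            shares[i] = budget * weights[i] / total;
        }
        return shares;
    }

    public static void main(String[] args) {
        int budget = units(10, 16, 10);           // (10 * 16) - 10 = 150
        int[] shares = split(budget, 1, 4, 10);   // Emit, Calculate, Persist
        System.out.println(budget + " -> " + java.util.Arrays.toString(shares));
        // prints 150 -> [10, 40, 100]
    }
}
```

Slow components get the larger weights. The weights themselves are a first guess to be corrected by measurement.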
  • 56. Watch Storm’s “capacity” metric This tells you how hard components are working. Adjust parallelism unit distribution accordingly.
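As far as the Storm UI describes it, a bolt's capacity is the number of tuples executed multiplied by the average execute latency, divided by the length of the measurement window. A value near 1.0 means the bolt is busy almost all of the time. The sketch below reproduces that calculation with hypothetical numbers. Treat the formula as this edit's reading of the UI, not as something stated in the talk.

```java
// Sketch: the "capacity" calculation for a bolt over one measurement window.
final class CapacitySketch {

    /** Fraction of the window the bolt spent executing tuples. */
    static double capacity(long executed, double avgExecuteLatencyMs, long windowMs) {
        return executed * avgExecuteLatencyMs / windowMs;
    }

    public static void main(String[] args) {
        // Hypothetical: 1,200,000 tuples at 0.4 ms each over a 10-minute window.
        double c = capacity(1_200_000, 0.4, 600_000);
        System.out.printf("capacity = %.2f%n", c); // roughly 0.8: busy, with little headroom
    }
}
```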
  • 57. This is just a starting point. Test, test, test. Measure, measure, measure.
  • 60. Key Settings topology.max.spout.pending Spout/Bolt API: Controls how many tuples are in-flight (not ack’ed) Trident API: Controls how many batches are in flight (not committed)
  • 61. Key Settings topology.max.spout.pending When reached, Storm will temporarily stop emitting data from Spout(s) WARNING: Default is “unset” (i.e. no limit)
  • 62. Key Settings topology.max.spout.pending Spout/Bolt API: Start High (~1,000) Trident API: Start Low (~1-5)
  • 63. Key Settings topology.message.timeout.secs Controls how long a tuple tree (Spout/Bolt API) or batch (Trident API) has to complete processing before Storm considers it timed out and fails it. Default value is 30 seconds.
  • 64. Key Settings topology.message.timeout.secs Q: “Why am I getting tuple/batch failures for no apparent reason?” A: Timeouts due to a bottleneck. Solution: Look at the “Complete Latency” metric. Increase timeout and/or increase component parallelism to address the bottleneck.
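The two settings interact, and Little's Law gives a rough way to see how. If the pending limit is full and tuples are acked at a given rate, the average time in flight is about the pending count divided by that rate. The sketch below holds the two keys in a plain map (Storm's own `Config` class is not used here) and flags combinations likely to time out. The method names and sample rates are assumptions, and the estimate is deliberately crude.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: relating topology.max.spout.pending to topology.message.timeout.secs.
final class KeySettingsSketch {

    /** Builds a settings map using the two keys named on the slides. */
    static Map<String, Object> settings(int maxSpoutPending, int timeoutSecs) {
        Map<String, Object> conf = new HashMap<>();
        conf.put("topology.max.spout.pending", maxSpoutPending);
        conf.put("topology.message.timeout.secs", timeoutSecs);
        return conf;
    }

    /** W = L / λ: rough seconds a tuple spends in flight when the pending limit is full. */
    static double secondsInFlight(int maxSpoutPending, double ackedPerSecond) {
        return maxSpoutPending / ackedPerSecond;
    }

    static boolean likelyToTimeOut(int maxSpoutPending, double ackedPerSecond, int timeoutSecs) {
        return secondsInFlight(maxSpoutPending, ackedPerSecond) > timeoutSecs;
    }

    public static void main(String[] args) {
        System.out.println(settings(1000, 30));
        // Hypothetical bottleneck: only 20 tuples/sec acked with 1,000 pending.
        System.out.println(likelyToTimeOut(1000, 20.0, 30));  // prints true (about 50 s in flight)
        System.out.println(likelyToTimeOut(1000, 500.0, 30)); // prints false (about 2 s in flight)
    }
}
```

This is why the slide offers two fixes: raising the timeout increases the allowance, while adding parallelism at the bottleneck raises the ack rate.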
  • 65. Turn knobs slowly, one at a time.
  • 66. Don't mess with settings you don't understand.
  • 67. Storm ships with sane defaults Override only as necessary
  • 69. Nimbus Generally light load Can collocate Storm UI service m1.xlarge (or equivalent) should suffice Save the big metal for Supervisor/Worker machines…
  • 70. Supervisor/Worker Nodes Where hardware choices have the most impact.
  • 71. CPU Cores More is usually better The more you have the more threads you can support (i.e. parallelism) Storm potentially uses a lot of threads
  • 72. Memory Highly use-case specific How many workers (JVMs) per node? Are you caching and/or holding in-memory state? Tests/metrics are your friends
  • 73. Network Use bonded NICs if necessary Keep nodes “close”
  • 76. Don’t “Pancake!” Separate concerns. CPU Contention I/O Contention Disk Seeks (ZooKeeper)
  • 77. Keep this guy happy. He has big boots and a shovel.
  • 78. ZooKeeper Considerations Use dedicated machines, preferably bare-metal if an option Start with 3 node ensemble (can tolerate 1 node loss) I/O is ZooKeeper’s main bottleneck Dedicated disk for ZK storage SSDs greatly improve performance
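The "3 nodes tolerate 1 loss" figure follows from ZooKeeper needing a majority of the ensemble to stay available. A minimal sketch of that arithmetic, with hypothetical names, is:

```java
// Sketch: majority quorum arithmetic for a ZooKeeper ensemble.
final class EnsembleSketch {

    /** Smallest strict majority of the ensemble. */
    static int quorum(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** How many nodes can be lost while a majority remains. */
    static int tolerableFailures(int ensembleSize) {
        return ensembleSize - quorum(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 5; n++) {
            System.out.println(n + " nodes tolerate " + tolerableFailures(n) + " failure(s)");
        }
        // prints: 3 nodes tolerate 1, 4 nodes tolerate 1, 5 nodes tolerate 2
    }
}
```

Note that a fourth node adds no extra fault tolerance over three, which is one reason odd ensemble sizes are the usual choice.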
  • 79. Recap Know/track your latencies and code appropriately Externalize configuration Scaling is a factor of balancing the I/O and CPU requirements of your use case Dev + DevOps + Ops coordination and collaboration is essential
  • 80. Thanks! P. Taylor Goetz, Hortonworks @ptgoetz