11.
Little’s Law
L=λW
The long-term average number of customers in a stable system L
is equal to the long-term average effective arrival rate, λ, multiplied
by the average time a customer spends in the system, W; or
expressed algebraically: L = λW.
http://en.wikipedia.org/wiki/Little's_law
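For example (illustrative numbers, not from the deck): if tuples arrive at λ = 1,000 tuples/sec and each spends W = 0.05 s in the topology, then on average L = λW = 1,000 × 0.05 = 50 tuples are in the system at any given time.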
21.
Expose your knobs and gauges.
DevOps will appreciate it.
22.
Externalize Configuration
Hard-coded values require
recompilation/repackaging.
conf.setNumWorkers(3);
builder.setSpout("spout", new RandomSentenceSpout(), 5);
builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
Values from external config.
No repackaging!
conf.setNumWorkers(Integer.parseInt(props.getProperty("num.workers")));
builder.setSpout("spout", new RandomSentenceSpout(), Integer.parseInt(props.getProperty("spout.parallelism")));
builder.setBolt("split", new SplitSentence(), Integer.parseInt(props.getProperty("split.parallelism"))).shuffleGrouping("spout");
builder.setBolt("count", new WordCount(), Integer.parseInt(props.getProperty("count.parallelism"))).fieldsGrouping("split", new Fields("word"));
29.
Parallelism == Manifold
Take input from one big pipe and distribute it to many smaller pipes.
The bigger the size difference, the more parallelism you will need.
34.
Sizeup — Fire
What are my water sources? What GPM can they support?
How many lines (hoses) do I need?
How much water will I need to flow to put this fire out?
35.
Sizeup — Storm
What are my input sources?
At what rate do they deliver messages?
What size are the messages?
What's my slowest data sink?
46.
Example
10 Worker Nodes
16 Cores / Machine
(10 * 16) - 10 = 150 “Parallelism Units” available
47.
Example
10 Worker Nodes
16 Cores / Machine
(10 * 16) - 10 = 150 “Parallelism Units” available (* 10-100 if I/O bound)
Distribute these among the tasks in your topology: higher for slow tasks, lower for fast tasks.
48.
Example
150 “Parallelism Units” available
Emit: 10    Calculate: 40    Persist: 100
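As a sketch of that split when building the topology (the component names and spout/bolt classes here are hypothetical, mirroring the earlier word-count example):
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("emit", new EmitSpout(), 10);           // fast source: few units
builder.setBolt("calculate", new CalculateBolt(), 40)    // CPU-bound work: more units
       .shuffleGrouping("emit");
builder.setBolt("persist", new PersistBolt(), 100)       // slow I/O sink: most units
       .shuffleGrouping("calculate");
10 + 40 + 100 uses all 150 units, weighted toward the slowest stage.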
49.
Watch Storm’s “capacity” metric
This tells you how hard components are working.
Adjust parallelism unit distribution accordingly.
50.
This is just a starting point.
Test, test, test. Measure, measure, measure.
53.
Key Settings
topology.max.spout.pending
Spout/Bolt API: Controls how many tuples are in-flight (not ack’ed)
Trident API: Controls how many batches are in flight (not committed)
54.
Key Settings
topology.max.spout.pending
When reached, Storm will temporarily stop emitting data from Spout(s)
WARNING: Default is “unset” (i.e. no limit)
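For example, the limit can be set on the topology Config (the value 1000 is purely illustrative; tune it for your use case):
Config conf = new Config();
conf.setMaxSpoutPending(1000); // caps un-acked tuples per spout task; default is unset (no limit)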
56.
Key Settings
topology.message.timeout.secs
Controls how long a tuple tree (Spout/Bolt API) or batch (Trident API) has to
complete processing before Storm considers it timed out and fails it.
Default value is 30 seconds.
57.
Key Settings
topology.message.timeout.secs
Q: “Why am I getting tuple/batch failures for no apparent reason?”
A: Timeouts due to a bottleneck.
Solution: Look at the “Complete Latency” metric. Increase timeout and/or
increase component parallelism to address the bottleneck.
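A small example of raising the timeout while you address the bottleneck (60 seconds is an illustrative value, not a recommendation):
Config conf = new Config();
conf.setMessageTimeoutSecs(60); // default is 30; raise if "Complete Latency" approaches the timeout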
62.
Nimbus
Generally light load
Can collocate Storm UI service
m1.xlarge (or equivalent) should suffice
Save the big metal for Supervisor/Worker machines…
63.
Supervisor/Worker Nodes
Where hardware choices have the most impact.
64.
CPU Cores
More is usually better
The more cores you have, the more threads you can support (i.e. parallelism)
Storm potentially uses a lot of threads
65.
Memory
Highly use-case specific
How many workers (JVMs) per node?
Are you caching and/or holding in-memory state?
Tests/metrics are your friends
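One knob worth knowing here: per-worker JVM heap can be set per topology via worker childopts (the 2 GB value below is illustrative only):
Config conf = new Config();
conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx2g"); // JVM options applied to this topology's workers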
66.
Network
Use bonded NICs if necessary
Keep nodes “close”
69.
Don’t “Pancake!”
Separate concerns.
CPU Contention
I/O Contention
Disk Seeks (ZooKeeper)
70.
Keep this guy happy.
He has big boots and a shovel.
71.
ZooKeeper Considerations
Use dedicated machines, preferably bare metal if that's an option
Start with a 3-node ensemble (can tolerate the loss of 1 node)
I/O is ZooKeeper's main bottleneck
Use a dedicated disk for ZK storage
SSDs greatly improve performance
72.
Recap
Know/track your latencies and code appropriately
Externalize configuration
Scaling is a matter of balancing the I/O and CPU requirements of your use case
Dev + DevOps + Ops coordination and collaboration is essential