
Scaling Apache Storm - Strata + Hadoop World 2014

Scaling Apache Storm: Cluster Sizing and Performance Optimization

Slides from my presentation at Strata + Hadoop World 2014


  1. Scaling Apache Storm. P. Taylor Goetz, Hortonworks (@ptgoetz)
  2. About Me: Member of Technical Staff / Storm Tech Lead @ Hortonworks; Apache Storm PMC Chair @ Apache.
  3. About Me: Member of Technical Staff / Storm Tech Lead @ Hortonworks; Apache Storm PMC Chair @ Apache; volunteer firefighter since 2004.
  4. 1M+ messages/sec. on a 10-15 node cluster. How do you get there?
  5. How do you fight fire?
  6. Put the wet stuff on the red stuff. Water, and lots of it.
  7. When you're dealing with big fire, you need big water.
  8. Static Water Sources: lakes, streams, reservoirs, pools, ponds.
  9. Hydrant: active source, under pressure.
  10. How does this relate to Storm?
  11. Little's Law: L = λW. The long-term average number of customers in a stable system, L, equals the long-term average effective arrival rate, λ, multiplied by the average time a customer spends in the system, W. http://en.wikipedia.org/wiki/Little's_law
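      A quick worked example (the numbers are illustrative, not from the talk): at λ = 1,000,000 tuples/sec and an average in-system time of W = 0.05 s, the topology holds L = 1,000,000 × 0.05 = 50,000 tuples in flight on average. If a slow bolt or a stalled sink pushes W up, L grows in proportion and internal buffers start to fill.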
  12. Batch vs. Streaming
  13. Batch Processing: operates on data at rest; velocity is a function of performance; poor performance costs you time.
  14. Stream Processing: data in motion; at the mercy of your data source; velocity fluctuates over time; poor performance…
  15. Poor performance bursts the pipes: buffers fill up and eat memory; timeouts/replays; "sink" systems overwhelmed.
  16. What can developers do?
  17. Keep tuple processing code tight. Worry about this! (the execute() method)

          public class MyBolt extends BaseRichBolt {

              public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
                  // initialize task
              }

              public void execute(Tuple input) {
                  // process input -- QUICKLY!
              }

              public void declareOutputFields(OutputFieldsDeclarer declarer) {
                  // declare output
              }
          }
  18. Keep tuple processing code tight. Not this. (Same MyBolt code as above; prepare() and declareOutputFields() run once per task, not once per tuple.)
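      A minimal sketch of that split in practice (not from the deck; RiskScorer and its load path are hypothetical): pay one-time costs in prepare(), and let execute() touch only pre-built state.

          import backtype.storm.task.OutputCollector;
          import backtype.storm.task.TopologyContext;
          import backtype.storm.topology.OutputFieldsDeclarer;
          import backtype.storm.topology.base.BaseRichBolt;
          import backtype.storm.tuple.Fields;
          import backtype.storm.tuple.Tuple;
          import backtype.storm.tuple.Values;
          import java.util.Map;

          public class ScoringBolt extends BaseRichBolt {
              private OutputCollector collector;
              private RiskScorer scorer; // hypothetical, expensive-to-build model

              public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
                  this.collector = collector;
                  // One-time setup: open connections, load models, warm caches.
                  this.scorer = RiskScorer.loadFrom((String) stormConf.get("model.path"));
              }

              public void execute(Tuple input) {
                  // Hot path: no I/O, no heavy allocation -- score and emit.
                  collector.emit(input, new Values(scorer.score(input.getString(0))));
                  collector.ack(input);
              }

              public void declareOutputFields(OutputFieldsDeclarer declarer) {
                  declarer.declare(new Fields("score"));
              }
          }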
  19. Know your latencies (https://gist.github.com/jboner/2841832):
          L1 cache reference .......................... 0.5 ns
          Branch mispredict ........................... 5 ns
          L2 cache reference .......................... 7 ns            (14x L1 cache)
          Mutex lock/unlock ........................... 25 ns
          Main memory reference ....................... 100 ns          (20x L2 cache, 200x L1 cache)
          Compress 1K bytes with Zippy ................ 3,000 ns
          Send 1K bytes over 1 Gbps network ........... 10,000 ns       (0.01 ms)
          Read 4K randomly from SSD* .................. 150,000 ns      (0.15 ms)
          Read 1 MB sequentially from memory .......... 250,000 ns      (0.25 ms)
          Round trip within same datacenter ........... 500,000 ns      (0.5 ms)
          Read 1 MB sequentially from SSD* ............ 1,000,000 ns    (1 ms, 4x memory)
          Disk seek ................................... 10,000,000 ns   (10 ms, 20x datacenter roundtrip)
          Read 1 MB sequentially from disk ............ 20,000,000 ns   (20 ms, 80x memory, 20x SSD)
          Send packet CA -> Netherlands -> CA ......... 150,000,000 ns  (150 ms)
  20. Use a Cache. Guava is your friend.
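      For example (a sketch, not from the deck; User and userService are hypothetical stand-ins for whatever your bolt looks up), a loader-backed Guava cache built once in prepare() keeps slow lookups off the per-tuple path:

          import com.google.common.cache.CacheBuilder;
          import com.google.common.cache.CacheLoader;
          import com.google.common.cache.LoadingCache;
          import java.util.concurrent.TimeUnit;

          LoadingCache<String, User> users = CacheBuilder.newBuilder()
              .maximumSize(100000)                     // bound memory use
              .expireAfterWrite(10, TimeUnit.MINUTES)  // tolerate slightly stale data
              .build(new CacheLoader<String, User>() {
                  public User load(String id) throws Exception {
                      return userService.fetch(id);    // the slow lookup, paid only on a miss
                  }
              });

      In execute(), call users.getUnchecked(id) instead of making one remote call per tuple.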
  21. Expose your knobs and gauges. DevOps will appreciate it.
  22. Externalize Configuration. Hard-coded values require recompilation/repackaging:

          conf.setNumWorkers(3);
          builder.setSpout("spout", new RandomSentenceSpout(), 5);
          builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
          builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));

      Values from external config need no repackaging:

          conf.setNumWorkers(Integer.parseInt(props.getProperty("num.workers")));
          builder.setSpout("spout", new RandomSentenceSpout(), Integer.parseInt(props.getProperty("spout.parallelism")));
          builder.setBolt("split", new SplitSentence(), Integer.parseInt(props.getProperty("split.parallelism"))).shuffleGrouping("spout");
          builder.setBolt("count", new WordCount(), Integer.parseInt(props.getProperty("count.parallelism"))).fieldsGrouping("split", new Fields("word"));
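      Where props comes from: a minimal sketch (the file name and keys are illustrative) loading a java.util.Properties file at submit time, so a config edit plus resubmit replaces a recompile:

          import java.io.FileInputStream;
          import java.util.Properties;

          // topology.properties:
          //   num.workers=3
          //   spout.parallelism=5
          //   split.parallelism=8
          //   count.parallelism=12
          Properties props = new Properties();
          props.load(new FileInputStream(args[0])); // config path passed on the command line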
  23. What can DevOps do?
  24. How big is your hose?
  25. Find out!
  26. Performance testing is essential!
  27. How to deal with small pipes? (i.e., when your output is more like a garden hose)
  28. Parallelize slow sinks.
  29. Parallelism == Manifold. Take input from one big pipe and distribute it to many smaller pipes. The bigger the size difference, the more parallelism you will need.
  30. Sizeup: initial assessment.
  31. Every fire is different.
  33. Every streaming use case is different.
  34. Sizeup (Fire): What are my water sources? What GPM can they support? How many lines (hoses) do I need? How much water will I need to flow to put this fire out?
  35. Sizeup (Storm): What are my input sources? At what rate do they deliver messages? What size are the messages? What's my slowest data sink?
  36. There is no magic bullet.
  37. But there are good starting points.
  38. Numbers: where to start.
  39. 1 Worker / Machine / Topology. Keep unnecessary network transfer to a minimum.
  40. 1 Acker / Worker. Default in Storm 0.9.x.
  41. 1 Executor / CPU Core. Optimize thread/CPU usage.
  42. 1 Executor / CPU Core (for CPU-bound use cases).
  43. 1 Executor / CPU Core. Multiply by 10x-100x for I/O-bound use cases.
  44. Example: 10 worker nodes, 16 cores/machine. 10 * 16 = 160 "parallelism units" available.
  45. Example: 10 worker nodes, 16 cores/machine. 10 * 16 = 160 "parallelism units" available. Subtract # ackers: 160 - 10 = 150 units.
  46. Example: 10 worker nodes, 16 cores/machine. (10 * 16) - 10 = 150 "parallelism units" available.
  47. Example: 10 worker nodes, 16 cores/machine. (10 * 16) - 10 = 150 "parallelism units" available (* 10-100 if I/O bound). Distribute this among tasks in the topology: higher for slow tasks, lower for fast tasks.
  48. Example: 150 "parallelism units" available. Emit: 10, Calculate: 40, Persist: 100.
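      Expressed as parallelism hints (a sketch; the component names and classes are hypothetical, the 10/40/100 split is the one above):

          Config conf = new Config();
          conf.setNumWorkers(10); // 1 worker per machine (slide 39)
          conf.setNumAckers(10);  // 1 acker per worker (slide 40)

          TopologyBuilder builder = new TopologyBuilder();
          builder.setSpout("emit", new EmitSpout(), 10);
          builder.setBolt("calculate", new CalculateBolt(), 40).shuffleGrouping("emit");
          builder.setBolt("persist", new PersistBolt(), 100).shuffleGrouping("calculate");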
  49. Watch Storm's "capacity" metric. This tells you how hard components are working. Adjust parallelism unit distribution accordingly.
  50. This is just a starting point. Test, test, test. Measure, measure, measure.
  51. Internal Messaging: handling backpressure.
  52. Internal Messaging (Intra-worker)
  53. Key Settings: topology.max.spout.pending. Spout/Bolt API: controls how many tuples are in flight (not acked). Trident API: controls how many batches are in flight (not committed).
  54. Key Settings: topology.max.spout.pending. When the limit is reached, Storm will temporarily stop emitting data from spout(s). WARNING: the default is unset (i.e., no limit).
  55. Key Settings: topology.max.spout.pending. Spout/Bolt API: start high (~1,000). Trident API: start low (~1-5).
  56. Key Settings: topology.message.timeout.secs. Controls how long a tuple tree (Spout/Bolt API) or batch (Trident API) has to complete processing before Storm considers it timed out and fails it. Default value is 30 seconds.
  57. Key Settings: topology.message.timeout.secs. Q: "Why am I getting tuple/batch failures for no apparent reason?" A: timeouts due to a bottleneck. Solution: look at the "Complete Latency" metric; increase the timeout and/or increase component parallelism to address the bottleneck.
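      Both are per-topology settings (the values shown are just the starting points from these slides):

          Config conf = new Config();
          conf.setMaxSpoutPending(1000);  // Spout/Bolt API starting point; ~1-5 for Trident
          conf.setMessageTimeoutSecs(30); // the default; raise it if Complete Latency approaches it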
  58. Turn knobs slowly, one at a time.
  59. Don't mess with settings you don't understand.
  60. Storm ships with sane defaults. Override only as necessary.
  61. Hardware Considerations
  62. Nimbus: generally light load; can collocate the Storm UI service. An m1.xlarge (or equivalent) should suffice. Save the big metal for Supervisor/Worker machines…
  63. Supervisor/Worker Nodes: where hardware choices have the most impact.
  64. CPU Cores: more is usually better. The more you have, the more threads you can support (i.e., parallelism), and Storm potentially uses a lot of threads.
  65. Memory: highly use-case specific. How many workers (JVMs) per node? Are you caching and/or holding in-memory state? Tests/metrics are your friends.
  66. Network: use bonded NICs if necessary. Keep nodes "close."
  67. Other performance considerations
  68. Don't "Pancake!" Separate concerns.
  69. Don't "Pancake!" Separate concerns: CPU contention, I/O contention, disk seeks (ZooKeeper).
  70. Keep this guy happy. He has big boots and a shovel.
  71. ZooKeeper Considerations: use dedicated machines, preferably bare metal if an option. Start with a 3-node ensemble (can tolerate the loss of 1 node). I/O is ZooKeeper's main bottleneck, so use a dedicated disk for ZK storage; SSDs greatly improve performance.
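      A zoo.cfg sketch along those lines (hostnames and paths are illustrative); the point is that dataDir, and ideally dataLogDir, get their own disks:

          tickTime=2000
          initLimit=10
          syncLimit=5
          clientPort=2181
          # dataDir on a dedicated disk (SSD if available); dataLogDir on its own device
          dataDir=/data/zookeeper
          dataLogDir=/txlog/zookeeper
          server.1=zk1:2888:3888
          server.2=zk2:2888:3888
          server.3=zk3:2888:3888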
  72. Recap: Know/track your latencies and code appropriately. Externalize configuration. Scaling is a matter of balancing the I/O and CPU requirements of your use case. Dev + DevOps + Ops coordination and collaboration is essential.
  73. Thanks! P. Taylor Goetz, Hortonworks (@ptgoetz)
