Scaling Apache Storm - Hadoop Summit 2014
Slides from my Hadoop Summit presentation on scaling Apache Storm. Presented at Hadoop Summit 2014, San Jose.

  • Welcome. Thank you for sticking around. Since you’re here to find out about scaling Storm, I’m assuming you’re already using Storm and know the basics. If not, bear with me. At 3:30 today there will be a birds-of-a-feather session where we can dig deeper, cover some basics, and answer any specific questions you may have.
  • I’ve also been a volunteer firefighter for about the last 10 years.
  • The machines in that cluster didn’t even break a sweat. In fact, that was a problem: we couldn’t get them to break a sweat. And in the next release of Storm we’ll have a fix to better exhaust all available resources.
  • Are there any firefighters here today? Please stand up. Give these guys a round of applause. These are the guys who come save you when you set your house on fire. # of alarms. The challenges of fighting large fires bear similarities to dealing with large volumes of streaming data.
  • What do you do?
  • You put the wet stuff on the red stuff.
  • Shout-out to the San Jose Fire Dept. Lucky for us, we didn’t need them at the conference. But if we did, these are the guys that would show up to help.
  • That water has to come from somewhere.
  • Where do you get it? If you haven’t heard any of these terms this week, you haven’t been listening. What do all these things have in common?
  • They are all static water sources. The water is at rest. To move it to where you need it, you need to do some work. That means you need a pumper to set it in motion.
  • If you are lucky, you have one of these nearby. The water is under pressure, so there’s less work involved in pushing it where it needs to go. It also doesn’t run out (unless you pump the entire city water supply dry — which is rare). With a small static source like a pool or pond, you could easily run it dry.
  • So you take water from one or more sources, and distribute it to one or more outputs.
  • Once you have a water source established, you need to distribute it to a number of outlets.
  • That responsibility falls on this guy. He’s known as a pump operator or engineer. He has to make sure the firefighters at the other end of the lines have enough water flowing at all times. He uses various valves to distribute the flow, and has gauges that give him feedback on the status of the various outputs. The engineer has a big job: he has to constantly monitor his gauges and adjust for situational dynamics. When you’re in a burning building, the last thing you want is to lose water pressure. If your hose goes limp, you have no choice but to back out and wait for the water supply to return.
  • Little’s law deals with queueing theory. In simple terms, the system is stable if the output rate can keep up with the input. With firefighting, this means that your pumper has enough volume and velocity to keep water flowing to all the lines in use. If it doesn’t, someone’s hose goes limp and you lose water. And if you’re inside a burning building, that’s one of the last things you want to have happen. With Storm the concept is inverted: if your input rate exceeds the output rate, the system becomes unstable and you end up with back pressure.
  • Scaling batch vs. real time: the constraints are very different.
  • It’s very important to know your latencies, since they can add up quickly. HTTP example: if you make an HTTP request for every tuple and it takes 100 ms to complete, how many can you process sequentially in one second? Ten.
  • Use Storm’s built-in metrics system and feed it to Ganglia, Graphite, etc. It doesn’t matter which; just use something. Don’t hard-code configuration related to scaling, parallelism, etc. (see the sketch below).
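    On the second point, a minimal sketch of keeping parallelism out of the compiled code (the component names, the MySpout/MySinkBolt classes, and the argument layout are hypothetical):

        import backtype.storm.Config;
        import backtype.storm.StormSubmitter;
        import backtype.storm.topology.TopologyBuilder;

        public class SubmitExample {
            public static void main(String[] args) throws Exception {
                // Parallelism arrives at submit time, not as a hard-coded constant.
                int sinkParallelism = Integer.parseInt(args[0]);
                TopologyBuilder builder = new TopologyBuilder();
                builder.setSpout("spout", new MySpout(), 2);               // hypothetical spout
                builder.setBolt("sink", new MySinkBolt(), sinkParallelism) // hypothetical bolt
                       .shuffleGrouping("spout");
                StormSubmitter.submitTopology("my-topology", new Config(), builder.createTopology());
            }
        }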
  • Are you dealing with a garden hose, or a large-diameter firehose?
  • Performance-test every component in the system, outside of Storm. Only when you know the limits can you begin to address them and scale around them.
  • Water hammer. In firefighting, if the pump operator suddenly increases pressure, the guys at the end of the hose are going to lift off the ground, and they won’t be happy about it.

Presentation Transcript

  • Scaling Apache Storm P. Taylor Goetz, Hortonworks @ptgoetz
  • About Me Member of Technical Staff / Storm Tech Lead @ Hortonworks Storm Committer / PPMC Member / Release Mgr. @ Apache
  • About Me Member of Technical Staff / Storm Tech Lead @ Hortonworks Storm Committer / PPMC Member / Release Mgr. @ Apache Volunteer Firefighter since 2004
  • 1M+ messages / sec. on a 10-15 node cluster How do you get there?
  • How do you fight fire?
  • Put the wet stuff on the red stuff. Water, and lots of it.
  • When you're dealing with big fire, you need big water.
  • Water Sources Lakes Streams Reservoirs, Pools, Ponds
  • Data Hydrant You heard it here first.
  • How does this relate to Storm?
  • Little’s Law: L = λW. The long-term average number of customers in a stable system, L, equals the long-term average effective arrival rate, λ, multiplied by the average time a customer spends in the system, W. http://en.wikipedia.org/wiki/Little's_law
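    To make the formula concrete (these numbers are illustrative, not from the talk): if tuples arrive at λ = 10,000 tuples/sec and each spends W = 5 ms (0.005 s) in the topology, the average number in flight is L = 10,000 × 0.005 = 50 tuples. If processing slows so that W grows to 50 ms, L grows tenfold to 500, and those extra in-flight tuples sit in buffers.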
  • Batch vs. Streaming
  • Batch Processing Typically operates on data at rest Velocity is a function of performance Poor performance costs you time
  • Stream Processing At the mercy of your data source Velocity fluctuates over time Poor performance….
  • Poor performance bursts the pipes. Buffers fill up and eat memory Timeouts / Replays “Sink” systems overwhelmed
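    When input does outrun output, Storm’s stock reliability knobs can bound the damage. A minimal sketch (the values are illustrative assumptions, not recommendations):

        import backtype.storm.Config;

        public class ReliabilityKnobs {
            public static Config make() {
                Config conf = new Config();
                // Cap un-acked tuples in flight per spout task; once the cap
                // is hit, the spout stops emitting instead of flooding buffers.
                conf.setMaxSpoutPending(1000);
                // Fail (and replay) any tuple not fully acked within 30 seconds.
                conf.setMessageTimeoutSecs(30);
                return conf;
            }
        }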
  • What can developers do?
  • public class MyBolt extends BaseRichBolt {
        public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
            // initialize task
        }
        public void execute(Tuple input) {
            // process input — QUICKLY!
        }
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // declare output
        }
    }
    Keep tuple processing code tight. Worry about execute(), which runs for every tuple, not one-time setup like prepare() and declareOutputFields().
  • Know your latencies (https://gist.github.com/jboner/2841832):
    L1 cache reference                       0.5 ns
    Branch mispredict                        5 ns
    L2 cache reference                       7 ns            (14x L1 cache)
    Mutex lock/unlock                        25 ns
    Main memory reference                    100 ns          (20x L2 cache, 200x L1 cache)
    Compress 1K bytes with Zippy             3,000 ns
    Send 1K bytes over 1 Gbps network        10,000 ns       (0.01 ms)
    Read 4K randomly from SSD*               150,000 ns      (0.15 ms)
    Read 1 MB sequentially from memory       250,000 ns      (0.25 ms)
    Round trip within same datacenter        500,000 ns      (0.5 ms)
    Read 1 MB sequentially from SSD*         1,000,000 ns    (1 ms, 4x memory)
    Disk seek                                10,000,000 ns   (10 ms, 20x datacenter round trip)
    Read 1 MB sequentially from disk         20,000,000 ns   (20 ms, 80x memory, 20x SSD)
    Send packet CA->Netherlands->CA          150,000,000 ns  (150 ms)
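    Worked through against the speaker notes above: at 100 ms per HTTP round trip, one executor processing tuples sequentially handles 1 s / 100 ms = 10 tuples/sec, while a main-memory reference at 100 ns is roughly a million times faster. That gap is why the next slide reaches for a cache.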
  • Use a Cache Guava is your friend.
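    A minimal sketch of a caching bolt (the class, the key layout, and the slowExternalLookup() helper are hypothetical; the cache calls are Guava’s standard CacheBuilder API):

        import java.util.Map;
        import java.util.concurrent.TimeUnit;
        import backtype.storm.task.OutputCollector;
        import backtype.storm.task.TopologyContext;
        import backtype.storm.topology.OutputFieldsDeclarer;
        import backtype.storm.topology.base.BaseRichBolt;
        import backtype.storm.tuple.Tuple;
        import com.google.common.cache.Cache;
        import com.google.common.cache.CacheBuilder;

        public class CachingLookupBolt extends BaseRichBolt {
            private transient Cache<String, String> cache;
            private OutputCollector collector;

            public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
                this.collector = collector;
                // Bound the cache so it can't eat the worker's heap.
                cache = CacheBuilder.newBuilder()
                        .maximumSize(10000)
                        .expireAfterWrite(10, TimeUnit.MINUTES)
                        .build();
            }

            public void execute(Tuple input) {
                String key = input.getString(0);
                String value = cache.getIfPresent(key);
                if (value == null) {
                    value = slowExternalLookup(key); // hypothetical expensive call
                    cache.put(key, value);
                }
                collector.ack(input);
            }

            private String slowExternalLookup(String key) {
                return key; // stand-in for an HTTP or database call
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // no output streams in this sketch
            }
        }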
  • DevOps will appreciate it. Expose your knobs and gauges.
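    A sketch of a bolt exposing a gauge through Storm’s built-in metrics API (the metric name and the 60-second bucket are assumptions):

        import java.util.Map;
        import backtype.storm.metric.api.CountMetric;
        import backtype.storm.task.OutputCollector;
        import backtype.storm.task.TopologyContext;
        import backtype.storm.topology.OutputFieldsDeclarer;
        import backtype.storm.topology.base.BaseRichBolt;
        import backtype.storm.tuple.Tuple;

        public class InstrumentedBolt extends BaseRichBolt {
            private transient CountMetric processed;
            private OutputCollector collector;

            public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
                this.collector = collector;
                processed = new CountMetric();
                // Report the count every 60 seconds to any registered metrics consumer.
                context.registerMetric("tuples-processed", processed, 60);
            }

            public void execute(Tuple input) {
                processed.incr();
                collector.ack(input);
            }

            public void declareOutputFields(OutputFieldsDeclarer declarer) { }
        }

    Wiring in a consumer is one line on the topology config, e.g. conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1); a consumer that forwards to Ganglia or Graphite plugs in the same way.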
  • What can DevOps do?
  • How big is your hose?
  • Find out!
  • Performance testing is essential!
  • How to deal with small pipes? (i.e. When your output is more like a garden hose.)
  • Parallelize Slow sinks
  • Parallelism == Manifold Take input from one big pipe and distribute it to many smaller pipes The bigger the size difference, the more parallelism you will need
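    As a sketch (component names, the spout/bolt classes, and the 10x ratio are illustrative assumptions):

        import backtype.storm.topology.TopologyBuilder;

        public class ManifoldTopology {
            public static TopologyBuilder build() {
                TopologyBuilder builder = new TopologyBuilder();
                // One big pipe in...
                builder.setSpout("firehose", new FirehoseSpout(), 4);  // hypothetical spout
                builder.setBolt("parse", new ParseBolt(), 16)          // hypothetical bolt
                       .shuffleGrouping("firehose");
                // ...many small pipes out: this sink runs ~10x slower than
                // parsing, so it gets ~10x the executors.
                builder.setBolt("db-writer", new SlowSinkBolt(), 160)  // hypothetical bolt
                       .shuffleGrouping("parse");
                return builder;
            }
        }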
  • Sizeup Initial assessment
  • Every fire is different.
  • Every Storm use case is different.
  • Sizeup — Fire What are my water sources? What GPM can they support? How many lines (hoses) will I need? How much water will I need to flow to put this fire out?
  • Sizeup — Storm What are my input sources? At what rate do they deliver messages? What size are the messages? What's my slowest data sink?
  • There is no magic bullet.
  • But there are good starting points.
  • Numbers Where to start.
  • 1 Worker / Machine / Topology Keep unnecessary network transfer to a minimum
  • 1 Acker / Worker Default in Storm 0.9.x
  • 1 Executor / CPU Core Optimize Thread/CPU usage
  • 1 Executor / CPU Core (for CPU-bound use cases)
  • 1 Executor / CPU Core Multiply by 10x-100x for I/O bound use cases
  • Example 10 Worker Nodes 16 Cores / Machine 10 * 16 = 160 “Parallelism Units” available
  • Example 10 Worker Nodes 16 Cores / Machine 10 * 16 = 160 “Parallelism Units” available Subtract # Ackers: 160 - 10 = 150 Units.
  • Example 10 Worker Nodes 16 Cores / Machine (10 * 16) - 10 = 150 “Parallelism Units” available
  • Example 10 Worker Nodes 16 Cores / Machine (10 * 16) - 10 = 150 “Parallelism Units” available (* 10-100 if I/O bound) Distribute this among tasks in the topology, as sketched below. Higher for slow tasks, lower for fast tasks.
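    The same arithmetic expressed against the topology config, as a sketch (the per-component split is one plausible allocation, not a prescription):

        import backtype.storm.Config;

        public class ExampleSizing {
            public static Config make() {
                Config conf = new Config();
                conf.setNumWorkers(10); // 1 worker / machine / topology
                conf.setNumAckers(10);  // 1 acker / worker
                // (10 machines * 16 cores) - 10 ackers = 150 units left to
                // spread across parallelism hints, e.g. 10 spout executors,
                // 40 for a fast parse bolt, 100 for the slow sink.
                return conf;
            }
        }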
  • This is just a starting point. Test, test, test. Measure, measure, measure.
  • Internal Messaging Handling backpressure.
  • Internal Messaging (Intra-worker)
  • Turn knobs slowly, one at a time.
  • Don't mess with settings you don't understand.
  • Storm ships with sane defaults Override only as necessary
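    For example, the internal queue sizes are knobs that exist but rarely need touching. A sketch of overriding them (the keys are standard 0.9.x Config constants; the values are purely illustrative and should stay powers of two):

        import backtype.storm.Config;

        public class MessagingOverrides {
            public static Config make() {
                Config conf = new Config();
                // Per-executor incoming and outgoing disruptor queues.
                conf.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE, 16384);
                conf.put(Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE, 16384);
                // Per-worker receive and transfer queues.
                conf.put(Config.TOPOLOGY_RECEIVE_BUFFER_SIZE, 16);
                conf.put(Config.TOPOLOGY_TRANSFER_BUFFER_SIZE, 32);
                return conf;
            }
        }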
  • Hardware Considerations
  • Minimum Hardware Requirements
  • CPU Cores More is usually better The more you have the more threads you can support (i.e. parallelism) Storm potentially uses a LOT of threads
  • Memory Highly use-case specific How many workers (JVMs) per node? Are you caching and/or holding in-memory state? Tests/metrics are your friends
  • Network Use bonded NICs if necessary Keep nodes “close”
  • Other performance considerations
  • Don’t “Pancake!” Separate concerns.
  • Keep this guy happy. He has big boots and a shovel. He will hurt you if you piss him off.
  • Shameless Plug http://www.packtpub.com/storm-distributed-real-time-computation-blueprints/book
  • Thanks! Questions? Storm BoF Session — 3:30 Room 230A