Mathematics of Batch Processing
 


Slides for a presentation I gave at Cloud Connect on March 18th, 2010. Based upon an article I wrote http://nathanmarz.com/blog/hadoop-mathematics/

Presentation Transcript

  • The Mathematics of Batch Processing Nathan Marz BackType
  • Motivation A story about Timmy the software engineer
  • Timmy @ Big Data Inc. [Diagram labels: New Data, Processed Output, Hadoop Workflow]
  • Timmy @ Big Data Inc. • Business requirement: 15 hour turnaround on processing new data • Current turnaround is 10 hours • Plenty of extra capacity!
  • Timmy @ Big Data Inc. • Company increases data collection rate by 10% Surprise! Turnaround time explodes to 30 hours!
  • Timmy @ Big Data Inc. Fix it ASAP! We’re losing customers!
  • Timmy @ Big Data Inc. We need 2 times more machines!
  • Timmy @ Big Data Inc. We don’t even have that much space in the datacenter!
  • Timmy @ Big Data Inc. [Diagram: Data Center containing Rack 1 and Rack 2]
  • Timmy @ Big Data Inc. [Diagram: Data Center containing Rack 1, Rack 2, and a New Rack]
  • Timmy @ Big Data Inc. • Turnaround drops to 6 hours!! ?? ??
  • False Assumptions • Will take 10% longer to process 10% more data • 50% more machines only creates 50% more performance
  • What is a batch processing system? while (true) { processNewData() }
  • “Hours of Data” • Assume constant rate of new data • Measure amount of data in terms of hours
  • Questions to answer • How does a 10% increase in data cause my turnaround time to increase by 200%? • Why doesn’t the speed of my workflow double when I double the number of machines? • How many machines do I need for my workflow to perform well and be fault-tolerant?
  • Example • Workflow that runs in 10 hours • 10 hours of data processed each run
  • Example • Suppose you extend workflow with a component that will take 2 hours on 10 hour dataset • Workflow runtime may increase by a lot more than 2 hours!
  • Example • Will it increase by 3 hours? • Will it increase by 50 hours? • Will it get longer and longer each iteration forever?
  • Example • Increased runtime of workflow that operates on 10 hours of data to 12 hours • Next run, there will be 12 hours of data to process • Because there is more data, the next run will take longer
  • Example • Which means next iteration will have even more data • And so on... Does the runtime ever stabilize? If so, when?
  • Math Runtime = Overhead + (Hours of Data) x (Time to process one hour of data) T = O + H x P Runtime for a single run of a workflow
  • Overhead (O) • Fixed time in workflow – Job startup time – Time spent independent of amount of data Runtime = Overhead + (Hours of Data) x (Time to process one hour of data) T = O + H x P
  • Time to Process One Hour of Data (P) • How long it takes to process one hour of data, minus overhead • P=1 -> Each hour adds one hour to runtime • P=2 -> Each hour adds two hours to runtime • P = 0.5 -> Each hour adds 30 minutes to runtime Runtime = Overhead + (Hours of Data) x (Time to process one hour of data) T = O + H x P
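
The single-run formula T = O + H x P can be sketched in a few lines of Python. This is only an illustration: the function name `runtime` and the 1-hour overhead are assumptions, while the P values are the ones listed on the slide.

```python
def runtime(overhead, hours_of_data, p):
    """Runtime of one workflow run, T = O + H x P (all quantities in hours)."""
    return overhead + hours_of_data * p


# Hypothetical overhead of 1 hour; the P values come from the slide.
for p in (1.0, 2.0, 0.5):
    extra = runtime(1.0, 11, p) - runtime(1.0, 10, p)
    print(f"P={p}: one more hour of data adds {extra} hours of runtime")
```

Each extra hour of data adds exactly P hours to the run, regardless of the overhead.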
  • Stable Runtime Runtime = Overhead + (Hours of Data) x (Time to process one hour of data) $T = O + H \times P$ Stabilizes when: Runtime (T) = Hours of data processed (H), so $T = O + T \times P$
  • Stable Runtime $T = O + T \times P$, so $T = \frac{O}{1 - P}$: Stable Runtime = Overhead / (1 - (Time to process one hour of data))
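
One way to check the stable-runtime formula is to simulate the batch loop directly, feeding each run's duration back in as the hours of data for the next run. The sketch below assumes the next run starts immediately and that data arrives at a constant rate; the names `simulate` and `stable_runtime` and the values O = 2, P = 0.8 are made up for illustration.

```python
def simulate(overhead, p, first_batch_hours, runs):
    """Iterate T = O + H x P, using each runtime as the next run's H."""
    hours = first_batch_hours
    history = []
    for _ in range(runs):
        t = overhead + hours * p
        history.append(t)
        hours = t  # data that accumulated while this run was executing
    return history


def stable_runtime(overhead, p):
    """Closed form T = O / (1 - P); a stable runtime exists only when P < 1."""
    if p >= 1:
        raise ValueError("no stable runtime: every run falls further behind")
    return overhead / (1 - p)


history = simulate(overhead=2.0, p=0.8, first_batch_hours=1.0, runs=60)
print(history[-1], stable_runtime(2.0, 0.8))
```

With P < 1 the simulated runtimes approach O / (1 - P); with P >= 1 each run is longer than the last and the loop never settles.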
  • Stable Runtime • Linearly proportional to Overhead (O) • Depends non-linearly on P through the factor 1 / (1 - P) – Diminishing returns on each new machine
  • Double # machines • Why doesn’t the speed of my workflow double when I double the number of machines?
  • Double # machines • Old runtime: $T_{old} = \frac{O}{1 - P}$ • New runtime: $T_{new} = \frac{O}{1 - P/2}$ • Ratio: $\frac{T_{new}}{T_{old}} = \frac{1 - P}{1 - P/2}$
  • Double # machines [Plot: New runtime / Old runtime versus Time to process one hour of data (P)]
  • Double # machines • P = 0.9 (54 minutes / hour of data) -> Runtime decreases by about 82% • P = 0.2 (12 minutes / hour of data) -> Runtime decreases by about 11%
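
The two figures above follow from the ratio (1 - P) / (1 - P/2). A small sketch, assuming (as the slide's formula does) that doubling the machines halves P and leaves the overhead unchanged; the function name is hypothetical.

```python
def runtime_ratio_after_doubling(p):
    """New runtime / old runtime when P is halved: (1 - P) / (1 - P/2)."""
    if not 0 <= p < 1:
        raise ValueError("P must be in [0, 1) for a stable runtime to exist")
    return (1 - p) / (1 - p / 2)


for p in (0.9, 0.2):
    decrease = 1 - runtime_ratio_after_doubling(p)
    print(f"P={p}: stable runtime decreases by {decrease:.0%}")
```

The closer P is to 1, the more a doubling helps; when P is already small, the overhead dominates and extra machines buy very little.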
  • Increase in Data • Why does a 10% increase in data cause my turnaround to increase by 200%?
  • Increase in Data • Old runtime: $T_{old} = \frac{O}{1 - P}$ • New runtime: $T_{new} = \frac{O}{1 - 1.1 P}$ • Ratio: $\frac{T_{new}}{T_{old}} = \frac{1 - P}{1 - 1.1 P}$
  • Increase in Data [Plot: New runtime / Old runtime versus Time to process one hour of data (P)]
  • Increase in Data • Less “extra capacity” -> more dramatic deterioration in performance • The same effect can also be caused by: • Increase in hardware/software failures • Sharing the cluster
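
The deterioration can be tabulated from the ratio (1 - P) / (1 - 1.1 P). The sketch below assumes that 10% more data per hour scales P by 1.1 and leaves the overhead alone; the function name, the `growth` parameter, and the sample P values are illustrative choices, not from the talk.

```python
def runtime_ratio_after_data_increase(p, growth=1.1):
    """New runtime / old runtime when P scales by `growth`."""
    if growth * p >= 1:
        return float("inf")  # the workflow can no longer keep up with new data
    return (1 - p) / (1 - growth * p)


for p in (0.2, 0.5, 0.8, 0.87):
    ratio = runtime_ratio_after_data_increase(p)
    print(f"P={p}: stable runtime grows by a factor of {ratio:.2f}")
```

At low P the runtime barely moves, while at a P near 0.87 the same 10% increase roughly triples it, which is the kind of jump in the Timmy story (10 hours to 30 hours).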
  • Real life example • How does optimizing out 30% of my workflow runtime cause the runtime to decrease by 80%?
  • Real life example • 30 hour workflow • Remove bottleneck causing 10 hours of overhead • Runtime dropped to 6 hours
  • Real life example • $30 = \frac{O}{1 - P}$ • $6 = \frac{O - 10}{1 - P}$ • $O = 12.5$, $P = 0.58$
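
The two equations above can be solved by dividing one by the other, which eliminates (1 - P): 30 / 6 = O / (O - 10). A minimal sketch of that algebra, with a hypothetical function name:

```python
def solve_overhead_and_p(runtime_before, runtime_after, overhead_removed):
    """Solve T1 = O / (1 - P) and T2 = (O - removed) / (1 - P) for O and P."""
    # Dividing the equations gives T1 / T2 = O / (O - removed).
    overhead = overhead_removed * runtime_before / (runtime_before - runtime_after)
    p = 1 - overhead / runtime_before
    return overhead, p


overhead, p = solve_overhead_and_p(30, 6, 10)
print(f"O = {overhead}, P = {p:.2f}")
```

With the slide's numbers this gives O = 12.5 and P of roughly 0.58, matching the values shown.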
  • Takeaways • You should measure the O and P values of your workflow to avoid disasters • When P is high: – Expand cluster – OR: Optimize code that touches data • When P is low: – Optimize overhead (e.g., reduce job startup time)
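
The talk does not say how to measure O and P. One possible approach, sketched here as an assumption, is to record (hours of data, runtime) for several runs that processed different amounts of data and fit the line T = O + H x P by ordinary least squares. The function name and the measurements below are made up for illustration.

```python
def fit_overhead_and_p(runs):
    """Least-squares fit of T = O + H x P to (hours_of_data, runtime) pairs."""
    n = len(runs)
    mean_h = sum(h for h, _ in runs) / n
    mean_t = sum(t for _, t in runs) / n
    var_h = sum((h - mean_h) ** 2 for h, _ in runs)
    if var_h == 0:
        raise ValueError("need runs that processed different amounts of data")
    p = sum((h - mean_h) * (t - mean_t) for h, t in runs) / var_h
    overhead = mean_t - p * mean_h
    return overhead, p


# Hypothetical measurements, generated from O = 2 and P = 0.5.
measurements = [(4, 4.0), (8, 6.0), (12, 8.0)]
overhead, p = fit_overhead_and_p(measurements)
print(f"O = {overhead}, P = {p}, stable runtime = {overhead / (1 - p)}")
```

Note that a workflow already running at its stable runtime processes the same amount of data every run, so the fit needs runs over deliberately varied batch sizes.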
  • Questions? Nathan Marz BackType nathan.marz@gmail.com Twitter: @nathanmarz http://nathanmarz.com/blog