
Mathematics of Batch Processing

Slides for a presentation I gave at Cloud Connect on March 18th, 2010. Based upon an article I wrote http://nathanmarz.com/blog/hadoop-mathematics/


  1. The Mathematics of Batch Processing (Nathan Marz, BackType)
  2. Motivation: a story about Timmy the software engineer
  3–11. Timmy @ Big Data Inc. Diagram: New Data, Hadoop Workflow, Processed Output (the same diagram repeated as animation frames across slides 3–11)
  12–14. Timmy @ Big Data Inc. • Business requirement: 15 hour turnaround on processing new data • Current turnaround is 10 hours • Plenty of extra capacity!
  15–16. Timmy @ Big Data Inc. • Company increases data collection rate by 10% Surprise! Turnaround time explodes to 30 hours!
  17. Timmy @ Big Data Inc. Fix it ASAP! We’re losing customers!
  18. Timmy @ Big Data Inc. We need 2 times more machines!
  19. Timmy @ Big Data Inc. We don’t even have that much space in the datacenter!
  20–21. Timmy @ Big Data Inc. Data Center diagram: Rack 1, Rack 2 (slide 21 adds a New Rack)
  22. Timmy @ Big Data Inc. • Turnaround drops to 6 hours!!
  23. False Assumptions • Processing 10% more data will take only 10% longer • Adding 50% more machines will create 50% more performance
  24. What is a batch processing system? while (true) { processNewData() }
  25. “Hours of Data” • Assume constant rate of new data • Measure amount of data in terms of hours
  26–28. Questions to answer • How does a 10% increase in data cause my turnaround time to increase by 200%? • Why doesn’t the speed of my workflow double when I double the number of machines? • How many machines do I need for my workflow to perform well and be fault-tolerant?
  29. Example • Workflow that runs in 10 hours • 10 hours of data processed each run
  30. Example
  31–32. Example • Suppose you extend the workflow with a component that will take 2 hours on a 10 hour dataset • Workflow runtime may increase by a lot more than 2 hours!
  33. Example
  34–36. Example • Will it increase by 3 hours? • Will it increase by 50 hours? • Will it get longer and longer each iteration forever?
  37. Example
  38–40. Example • Increased the runtime of a workflow that operates on 10 hours of data to 12 hours • Next run, there will be 12 hours of data to process • Because there is more data, the run will take longer
  41–44. Example • Which means next iteration will have even more data • And so on... Does the runtime ever stabilize? If so, when?
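The feedback loop in this example can be sketched as a short simulation: the amount of data the next run must process equals the previous run's runtime (assuming a constant data arrival rate). The values O = 4 and P = 0.8 below are hypothetical, chosen only to make the convergence visible; they are not from the deck.

```python
def simulate_runs(overhead, per_hour, initial_hours, runs=200):
    """Iterate T = O + H * P, where the next run's hours of data H
    equal the previous run's runtime T (constant data arrival rate)."""
    history = []
    t = overhead + initial_hours * per_hour
    for _ in range(runs):
        history.append(t)
        t = overhead + t * per_hour  # next run processes t hours of data
    return history

# Hypothetical workflow: O = 4 hours, P = 0.8, starting with 12 hours of data.
history = simulate_runs(4.0, 0.8, 12.0)
# When P < 1, the runtime climbs toward the fixed point O / (1 - P) = 20 hours.
```

With P ≥ 1 the same loop grows without bound: each hour of data adds at least an hour of runtime, so the workflow never catches up.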
  45–46. Math: runtime for a single run of a workflow: Runtime = Overhead + (Hours of Data) × (Time to process one hour of data), i.e. T = O + H × P
  47. Overhead (O) • Fixed time in workflow – Job startup time – Time spent independent of amount of data
  48–51. Time to Process One Hour of Data (P) • How long it takes to process one hour of data, minus overhead • P = 1 → each hour of data adds one hour to runtime • P = 2 → each hour adds two hours to runtime • P = 0.5 → each hour adds 30 minutes to runtime
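The single-run formula T = O + H × P can be written directly. The numbers below are arbitrary illustrations, not values from the deck:

```python
def runtime(overhead, hours_of_data, per_hour):
    """Runtime of one workflow run: T = O + H * P."""
    return overhead + hours_of_data * per_hour

# With P = 0.5, each extra hour of data adds 30 minutes of runtime:
extra = runtime(1.0, 11.0, 0.5) - runtime(1.0, 10.0, 0.5)  # 0.5 hours
```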
  52–54. Stable Runtime: T = O + H × P stabilizes when Runtime (T) = Hours of data processed (H), giving T = O + T × P
  55–57. Stable Runtime: solving T = O + T × P gives T = O / (1 - P), i.e. Stable Runtime = Overhead / (1 - Time to process one hour of data)
  58. Stable Runtime • Linearly proportional to Overhead (O) • Non-linearly proportional to P – Diminishing returns on each new machine
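The closed form T = O / (1 - P) makes both properties checkable: stable runtime is linear in O but highly non-linear in P. The values below are illustrative, not from the deck:

```python
def stable_runtime(overhead, per_hour):
    """Fixed point of T = O + T * P; only exists for P < 1."""
    if per_hour >= 1.0:
        raise ValueError("P >= 1: the workflow never catches up")
    return overhead / (1.0 - per_hour)

# Linear in O: doubling overhead doubles the stable runtime.
# Non-linear in P: shaving P from 0.9 to 0.8 halves the runtime,
# while shaving it from 0.2 to 0.1 saves only ~11%.
```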
  59. Double # machines • Why doesn’t the speed of my workflow double when I double the number of machines?
  60–62. Double # machines: Old runtime = O / (1 - P); New runtime = O / (1 - P/2); New runtime / Old runtime = (1 - P) / (1 - P/2)
  63. Double # machines [graph: New runtime / Old runtime as a function of Time to process one hour of data (P)]
  64–65. Double # machines • P = 0.9 (54 minutes / hour of data) → Runtime decreases by 80% • P = 0.2 (12 minutes / hour of data) → Runtime decreases by 10%
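A small helper reproduces the two data points on this slide. It assumes doubling the machines exactly halves P, which is the idealization the deck uses:

```python
def doubling_speedup(p):
    """Ratio of new to old stable runtime when machines double (P -> P/2)."""
    return (1.0 - p) / (1.0 - p / 2.0)

# P = 0.9: ratio ~0.18, so runtime drops ~82% (the slide rounds to 80%).
# P = 0.2: ratio ~0.89, so runtime drops ~11% (the slide rounds to 10%).
```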
  66. Increase in Data • Why does a 10% increase in data cause my turnaround to increase by 200%?
  67–69. Increase in Data: Old runtime = O / (1 - P); New runtime = O / (1 - 1.1·P); New runtime / Old runtime = (1 - P) / (1 - 1.1·P)
  70. Increase in Data [graph: New runtime / Old runtime as a function of Time to process one hour of data (P)]
  71–72. Increase in Data • Less “extra capacity” → more dramatic deterioration in performance • The same effect can also be caused by an increase in hardware/software failures, or by sharing the cluster
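The same ratio for a data-rate increase shows why the effect blows up as spare capacity shrinks. The helper below is a sketch; tying it back to Timmy's 10-to-30-hour jump assumes his cluster was running near P ≈ 0.87, a value inferred from the story rather than stated in it:

```python
def data_growth_ratio(p, factor=1.1):
    """Ratio of new to old stable runtime when the data rate grows
    by `factor`, so P becomes factor * P."""
    new_p = factor * p
    if new_p >= 1.0:
        return float("inf")  # the workflow can no longer keep up
    return (1.0 - p) / (1.0 - new_p)

# Near P ~ 0.87, a 10% data increase roughly triples the turnaround
# (10 hours -> ~30 hours); past P = 1/factor, it diverges entirely.
```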
  73. Real life example • How does optimizing out 30% of my workflow runtime cause the runtime to decrease by 80%?
  74–76. Real life example • 30 hour workflow • Removed a bottleneck causing 10 hours of overhead • Runtime dropped to 6 hours
  77. Real life example: 30 = O / (1 - P); 6 = (O - 10) / (1 - P); solving gives O = 12.5, P ≈ 0.58
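The two measurements on this slide pin down O and P, and the algebra can be checked directly:

```python
# Two observations of the stable runtime give two equations in O and P:
#   30 = O / (1 - P)          (original 30-hour workflow)
#    6 = (O - 10) / (1 - P)   (after removing 10 hours of overhead)
# Subtracting the second from the first: 24 = 10 / (1 - P).
one_minus_p = 10.0 / 24.0
p = 1.0 - one_minus_p        # ~0.583
o = 30.0 * one_minus_p       # 12.5

# Sanity check: both observed runtimes are reproduced.
assert abs(o / one_minus_p - 30.0) < 1e-9
assert abs((o - 10.0) / one_minus_p - 6.0) < 1e-9
```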
  78–81. Takeaways • You should measure the O and P values of your workflow to avoid disasters • When P is high: expand the cluster, or optimize the code that touches data • When P is low: optimize overhead (e.g., reduce job startup time)
  82. Questions? Nathan Marz, BackType, nathan.marz@gmail.com, Twitter: @nathanmarz, http://nathanmarz.com/blog
