Mathematics of Batch Processing

Slides from a presentation I gave at Cloud Connect on March 18th, 2010, based on an article I wrote: http://nathanmarz.com/blog/hadoop-mathematics/

  1. The Mathematics of Batch Processing. Nathan Marz, BackType
  2. Motivation. A story about Timmy the software engineer
  3. Timmy @ Big Data Inc. New Data -> Hadoop Workflow -> Processed Output (animated diagram, repeated across slides 3-11)
  12. Timmy @ Big Data Inc. • Business requirement: 15-hour turnaround on processing new data
  13. Timmy @ Big Data Inc. • Business requirement: 15-hour turnaround on processing new data • Current turnaround is 10 hours
  14. Timmy @ Big Data Inc. • Business requirement: 15-hour turnaround on processing new data • Current turnaround is 10 hours • Plenty of extra capacity!
  15. Timmy @ Big Data Inc. • Company increases data collection rate by 10%
  16. Timmy @ Big Data Inc. • Company increases data collection rate by 10% • Surprise! Turnaround time explodes to 30 hours!
  17. Timmy @ Big Data Inc. "Fix it ASAP! We're losing customers!"
  18. Timmy @ Big Data Inc. "We need 2 times more machines!"
  19. Timmy @ Big Data Inc. "We don't even have that much space in the datacenter!"
  20. Timmy @ Big Data Inc. Data Center: Rack 1, Rack 2
  21. Timmy @ Big Data Inc. Data Center: Rack 1, Rack 2, New Rack
  22. Timmy @ Big Data Inc. • Turnaround drops to 6 hours!?
  23. False Assumptions • It will take 10% longer to process 10% more data • 50% more machines creates only 50% more performance
  24. What is a batch processing system? while (true) { processNewData() }
  25. "Hours of Data" • Assume constant rate of new data • Measure amount of data in terms of hours
  26. Questions to answer • How does a 10% increase in data cause my turnaround time to increase by 200%?
  27. Questions to answer • How does a 10% increase in data cause my turnaround time to increase by 200%? • Why doesn't the speed of my workflow double when I double the number of machines?
  28. Questions to answer • How does a 10% increase in data cause my turnaround time to increase by 200%? • Why doesn't the speed of my workflow double when I double the number of machines? • How many machines do I need for my workflow to perform well and be fault-tolerant?
  29. Example • Workflow that runs in 10 hours • 10 hours of data processed each run
  30. Example (diagram)
  31. Example • Suppose you extend the workflow with a component that will take 2 hours on a 10-hour dataset
  32. Example • Suppose you extend the workflow with a component that will take 2 hours on a 10-hour dataset • Workflow runtime may increase by a lot more than 2 hours!
  33. Example (diagram)
  34. Example • Will it increase by 3 hours?
  35. Example • Will it increase by 3 hours? • Will it increase by 50 hours?
  36. Example • Will it increase by 3 hours? • Will it increase by 50 hours? • Will it get longer and longer each iteration, forever?
  37. Example (diagram)
  38. Example • Increased the runtime of a workflow that operates on 10 hours of data to 12 hours
  39. Example • Increased the runtime of a workflow that operates on 10 hours of data to 12 hours • Next run, there will be 12 hours of data to process
  40. Example • Increased the runtime of a workflow that operates on 10 hours of data to 12 hours • Next run, there will be 12 hours of data to process • Because there is more data, it will take longer to run
  41. Example • Which means the next iteration will have even more data
  42. Example • Which means the next iteration will have even more data • And so on...
  43. Example • Which means the next iteration will have even more data • And so on... Does the runtime ever stabilize?
  44. Example • Which means the next iteration will have even more data • And so on... Does the runtime ever stabilize? If so, when?
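The convergence question on slide 44 can be checked numerically. Writing O for the fixed overhead and P for the time to process one hour of data (as defined on the slides that follow), each run must process the data that accumulated during the previous run, so the runtime evolves as T_next = O + P × T. A minimal sketch, with an assumed split of O and P (chosen so the original workflow runs in 10 hours on 10 hours of data; the slides do not give the split), and assuming the new component adds 0.2 hours of processing per hour of data with no extra overhead:

```python
# Illustrative simulation of iterated batch runs: T_next = O + P * T.
# O and P are ASSUMED values satisfying O + 10*P = 10 (the original
# 10-hour workflow); the new component is assumed to add 0.2 to P.
O = 4.0          # assumed fixed overhead (hours)
P = 0.6 + 0.2    # assumed per-hour cost after adding the new component

T = 12.0  # first run after the extension: 10 + 2 hours
for _ in range(200):
    T = O + P * T

print(round(T, 6))  # converges to O / (1 - P) = 4 / 0.2 = 20 hours
```

Because P < 1, each iteration shrinks the distance to the fixed point by a factor of P, so the runtime stabilizes rather than growing forever; with these assumed numbers a 2-hour extension ends up costing 10 extra hours per run.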
  45. Math • Runtime for a single run of a workflow: Runtime = Overhead + (Hours of Data) × (Time to process one hour of data)
  46. Math • Runtime for a single run of a workflow: Runtime = Overhead + (Hours of Data) × (Time to process one hour of data) • T = O + H × P
  47. Overhead (O) • Fixed time in workflow: job startup time, and any time spent independent of the amount of data
  48. Time to Process One Hour of Data (P) • How long it takes to process one hour of data, minus overhead
  49. Time to Process One Hour of Data (P) • How long it takes to process one hour of data, minus overhead • P = 1 -> each hour of data adds one hour to runtime
  50. Time to Process One Hour of Data (P) • How long it takes to process one hour of data, minus overhead • P = 1 -> each hour of data adds one hour to runtime • P = 2 -> each hour adds two hours to runtime
  51. Time to Process One Hour of Data (P) • How long it takes to process one hour of data, minus overhead • P = 1 -> each hour of data adds one hour to runtime • P = 2 -> each hour adds two hours to runtime • P = 0.5 -> each hour adds 30 minutes to runtime
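The single-run formula T = O + H × P from the preceding slides is easy to sanity-check. A small sketch, using assumed illustrative values (1 hour of overhead, 10 hours of data) and the three P values from slide 51:

```python
def runtime(overhead, hours_of_data, per_hour_cost):
    """One run of the workflow: T = O + H * P."""
    return overhead + hours_of_data * per_hour_cost

# Assumed illustrative values: O = 1 hour, H = 10 hours of data.
print(runtime(1.0, 10.0, 1.0))   # P = 1   -> 11.0 hours
print(runtime(1.0, 10.0, 2.0))   # P = 2   -> 21.0 hours
print(runtime(1.0, 10.0, 0.5))   # P = 0.5 -> 6.0 hours
```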
  52. Stable Runtime • T = O + H × P
  53. Stable Runtime • T = O + H × P • Stabilizes when Runtime (T) = Hours of data processed (H)
  54. Stable Runtime • T = O + H × P • Stabilizes when Runtime (T) = Hours of data processed (H) • T = O + T × P
  55. Stable Runtime • T = O + T × P
  56. Stable Runtime • T = O + T × P • T = O / (1 - P)
  57. Stable Runtime • T = O + T × P • T = O / (1 - P) • Stable Runtime = Overhead / (1 - (Time to process one hour of data))
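The closed form above is the fixed point of T = O + T × P, which only exists for P < 1 (at P ≥ 1 the workflow can never catch up with incoming data). A sketch with assumed illustrative values:

```python
def stable_runtime(overhead, per_hour_cost):
    """Fixed point of T = O + T * P, i.e. T = O / (1 - P); requires P < 1."""
    if per_hour_cost >= 1:
        raise ValueError("P >= 1: the workflow can never catch up")
    return overhead / (1 - per_hour_cost)

O, P = 2.0, 0.5              # assumed illustrative values
T = stable_runtime(O, P)     # 2 / 0.5 = 4.0 hours
print(T)
# At the fixed point the defining equation T = O + T * P holds.
print(O + T * P)
```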
  58. Stable Runtime • Linearly proportional to Overhead (O) • Non-linearly proportional to P: diminishing returns on each new machine
  59. Double # machines • Why doesn't the speed of my workflow double when I double the number of machines?
  60. Double # machines • Old runtime = O / (1 - P)
  61. Double # machines • Old runtime = O / (1 - P) • New runtime = O / (1 - P/2)
  62. Double # machines • Old runtime = O / (1 - P) • New runtime = O / (1 - P/2) • New runtime / Old runtime = (1 - P) / (1 - P/2)
  63. Double # machines (chart: New runtime / Old runtime as a function of Time to process one hour of data, P)
  64. Double # machines • P = 0.9 (54 minutes / hour of data) -> runtime decreases by 80%
  65. Double # machines • P = 0.9 (54 minutes / hour of data) -> runtime decreases by 80% • P = 0.2 (12 minutes / hour of data) -> runtime decreases by 10%
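The two figures above fall out of the ratio (1 − P) / (1 − P/2) from slide 62; the exact values are about 82% and 11%, which the slide rounds to 80% and 10%. A sketch:

```python
def doubling_ratio(p):
    """New runtime / old runtime when doubling machines halves P."""
    return (1 - p) / (1 - p / 2)

for p in (0.9, 0.2):
    decrease = (1 - doubling_ratio(p)) * 100
    print(f"P = {p}: runtime decreases by about {decrease:.0f}%")
```

The asymmetry is the point: doubling machines pays off enormously when P is near 1 (the cluster is barely keeping up), and barely at all when P is small (overhead dominates).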
  66. Increase in Data • Why does a 10% increase in data cause my turnaround to increase by 200%?
  67. Increase in Data • Old runtime = O / (1 - P)
  68. Increase in Data • Old runtime = O / (1 - P) • New runtime = O / (1 - 1.1P)
  69. Increase in Data • Old runtime = O / (1 - P) • New runtime = O / (1 - 1.1P) • New runtime / Old runtime = (1 - P) / (1 - 1.1P)
  70. Increase in Data (chart: New runtime / Old runtime as a function of Time to process one hour of data, P)
  71. Increase in Data • Less "extra capacity" -> more dramatic deterioration in performance
  72. Increase in Data • Less "extra capacity" -> more dramatic deterioration in performance • The same effect can also come from an increase in hardware/software failures, or from sharing the cluster
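Timmy's numbers from the opening story follow from the ratio on slide 69: for turnaround to triple (10 to 30 hours, a 200% increase) on 10% more data, solving (1 − P) / (1 − 1.1P) = 3 for P gives P = 2/2.3. A sketch (the algebraic solution is worked in the comments; the scenario framing is the deck's, the code is illustrative):

```python
def growth_ratio(p, growth=1.1):
    """New runtime / old runtime when the data rate grows by `growth`x."""
    return (1 - p) / (1 - growth * p)

# Timmy's scenario: turnaround tripled on 10% more data.
# (1 - P) = 3 * (1 - 1.1 P)  =>  2.3 P = 2  =>  P = 2 / 2.3
p = 2 / 2.3
print(round(p, 3))                 # ~0.87: only ~8 minutes of slack per hour of data
print(round(growth_ratio(p), 6))   # 3.0
```

In other words, the 10-hour turnaround against a 15-hour requirement looked like plenty of headroom, but with P ≈ 0.87 the cluster had almost no real slack.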
  73. Real life example • How does optimizing out 30% of my workflow runtime cause the runtime to decrease by 80%?
  74. Real life example • 30-hour workflow
  75. Real life example • 30-hour workflow • Remove a bottleneck causing 10 hours of overhead
  76. Real life example • 30-hour workflow • Remove a bottleneck causing 10 hours of overhead • Runtime dropped to 6 hours
  77. Real life example • 30 = O / (1 - P) • 6 = (O - 10) / (1 - P) • Solving: O = 12.5, P ≈ 0.58
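The pair of equations on slide 77 solves by elimination: subtracting the second from the first gives 24 = 10 / (1 − P), hence P = 7/12 ≈ 0.58 and O = 12.5. A sketch:

```python
# Two stable-runtime observations from the slide:
#   30 = O / (1 - P)          before the fix
#    6 = (O - 10) / (1 - P)   after removing 10 hours of overhead
# Subtracting: 24 = 10 / (1 - P)  =>  1 - P = 10/24
one_minus_p = 10 / 24
P = 1 - one_minus_p       # 7/12, about 0.583
O = 30 * one_minus_p      # 12.5
print(round(O, 2), round(P, 2))
# Sanity check: the post-fix stable runtime comes out to ~6 hours.
print(round((O - 10) / (1 - P), 6))
```

The answer to the slide's question: with these O and P values, overhead was a large share of the stable runtime, and cutting overhead shrinks the stable runtime by much more than the overhead itself, because less data accumulates between runs.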
  78. Takeaways
  79. Takeaways • You should measure the O and P values of your workflow to avoid disasters
  80. Takeaways • You should measure the O and P values of your workflow to avoid disasters • When P is high: expand the cluster, OR optimize the code that touches the data
  81. Takeaways • You should measure the O and P values of your workflow to avoid disasters • When P is high: expand the cluster, OR optimize the code that touches the data • When P is low: optimize the overhead (e.g., reduce job startup time)
  82. Questions? Nathan Marz, BackType • nathan.marz@gmail.com • Twitter: @nathanmarz • http://nathanmarz.com/blog
