Monte Carlo and Network CMG'14

When sizing network capacity, several factors, such as traffic, Quality of Service (QoS), and Total Cost of Ownership (TCO), are usually taken into account. Generally, it boils down to jointly minimizing cost and maximizing traffic, subject to the constraints of protocol and QoS requirements. The stochastic nature of network traffic and link-saturation queueing add uncertainty to an already complex optimization problem. In this paper, we examine the sources of traffic demand variability and explore Monte Carlo methodology as an efficient way to solve these problems.


  1. 1. Sources of Traffic Demand Variability and Use of Monte Carlo for Network Capacity Planning. Performance and Capacity 2014 by CMG, November 05, 2014. Alex Gilgur & Brian Eck. Views and opinions expressed in this presentation are views and opinions of its authors. If found to be in contradiction with views and policies of Google, Inc., the latter take precedence. Select images are reproduced with permission from Google, Inc.
  2. 2. Moore’s Law in Reverse: Drinking from a firehose? http://www.kpcb.com/internet-trends $
  3. 3. What does it cost to own a network? “... ‘THERE IS AS YET INSUFFICIENT DATA FOR A MEANINGFUL ANSWER.’” “Matter and energy had ended and with it, space and time... “All collected data had come to a final end. Nothing was left to be collected. “But all collected data had yet to be completely correlated and put together in all possible relationships. “A timeless interval was spent in doing that. “And it came to pass that AC learned how to reverse the direction of entropy. “But there was now no man to whom AC might give the answer of the last question.” Isaac Asimov, “The Last Question”, 1956.
  4. 4. What does it cost to own a network? We don’t have the time for all this! Guesstimate!
  5. 5. What does it cost to own a network? Ahah! But how sure are you? It depends on: ● number of servers ● topology ● policies ● traffic patterns ● network protocols
  6. 6. What does a network cost? What is the confidence interval of your “guesstimate” of Total Cost of Ownership of a network? Network Cost Demand Topology Policies Construction Node & Link Reliability The Fishbone Diagram Hardware & Software
  7. 7. Sizing the Network Network Cost Demand Topology Policies Construction Node & Link Reliability Hardware & Software Network SIZE Network Cost Network size is where we bring value
  8. 8. Network SIZE Topology Demand Node & Link Reliability Demand Fishbone
  9. 9. Demand Fishbone Demand Usage QoS Topology Destination Source Guarantees Latency Flow
  10. 10. Demand Variability ● Noise & Gaps in data ● Non-stationarity & Outliers ● Variation by O & D Nodes o Node A o Node Z ● Variation by QoS o latency o Pr{delivery} ● Variation within QoS o other factors ● Distribution: Bursty Wide Amplitude Complex Patterns Congestion Control
  11. 11. Demand Forecastability: Noise & Gaps ● Noise & Gaps in data ● Non-stationarity & Outliers ● Variation by O & D Nodes o Node A o Node Z ● Variation by QoS o latency o Pr{delivery} ● Variation within QoS o other factors ● Distribution: o “from feast to famine” o Bursts o Congestion Control
  12. 12. Demand Forecastability: Non-Stationarity ● Noise & Gaps in data ● Non-stationarity & Outliers ● Variation by O & D Nodes o Node A o Node Z ● Variation by QoS o latency o Pr{delivery} ● Variation within QoS o other factors ● Distribution: Bursty Wide Amplitude Complex Patterns Congestion Control
  13. 13. Demand Variability: Non-stationarity ● Noise & Gaps in data ● Non-stationarity & Outliers ● Variation by O & D Nodes o Node A o Node Z ● Variation by QoS o latency o Pr{delivery} ● Variation within QoS o other factors ● Distribution: Bursty Wide Amplitude Complex Patterns Congestion Control
  14. 14. Demand Variability: QoS Variation SC1 SC2 ● Noise & Gaps in data ● Non-stationarity & Outliers ● Variation by O & D Nodes o Node A o Node Z ● Variation by QoS o latency o Pr{delivery} ● Variation within QoS o other factors ● Distribution: Bursty Wide Amplitude Complex Patterns Congestion Control
  15. 15. Demand Variability: Other Factors ● Noise & Gaps in data ● Non-stationarity & Outliers ● Variation by O & D Nodes o Node A o Node Z ● Variation by QoS o latency o Pr{delivery} ● Variation within QoS o other factors ● Distribution: Bursty Wide Amplitude Complex Patterns Congestion Control
  16. 16. Demand Variability: Signal Distribution ● Noise & Gaps in data ● Non-stationarity & Outliers ● Variation by O & D Nodes o Node A o Node Z ● Variation by QoS o latency o Pr{delivery} ● Variation within QoS o other factors ● Distribution Bursty Wide Amplitude Complex Patterns Congestion Control
  17. 17. Demand Predictability ● Not all forecasting tools were created equal: ○ Non-Gaussian distributions ○ Non-stationarity ○ Congestion Control “All models are wrong. Some models are useful” - G.E.P. Box ● Time series analysis (TSA) is not the only way to forecast Demand: ○ Explanatory variables: ■ Timestamp is one of them ■ Power ■ CPU ■ Business Metrics Forecast
  18. 18. From Demand to Capacity Demand QoS Topology Capacity
  19. 19. QoS = what’s important to user 1. QoS = 1 / Latency 2. QoS = “Goodput” = Throughput * Pr{delivery} 1. Low Latency 2. High Probability of: a. Delivery b. Accuracy
  20. 20. Find shortest path from Node 1 to Node 2 Routing for Low Latency: SPF: “Travelling Salesman” 4 = Node 4 2 = “Latency of this link = 2 units” Cost = Latency QoS = 1/Cost = 1/Latency
  21. 21. Find shortest path from Node 1 to Node 2 IF Node 4 is down Cost = Latency QoS = 1/Cost = 1/Latency Find shortest path from Node 1 to Node 2 4 = Node 4 2 = “Latency of this link = 2 units” Routing for Low Latency: SPF: “Travelling Salesman”
  22. 22. Routing for Low Latency: SPF: “Travelling Salesman”. Find shortest path from Node 1 to Node 2 IF Node 4 is down ... and Link 3-5 is losing packets. Cost = Latency; QoS = 1/Cost = 1/Latency. Legend: 4 = Node 4; 2 = “Latency of this link = 2 units”.
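The three routing slides above (all links up, Node 4 down, Node 4 down with Link 3-5 losing packets) can be sketched with Dijkstra's shortest-path-first algorithm, using latency as the link cost. The topology and latencies in `LINKS` are hypothetical, since the slide's diagram did not survive in the transcript, and treating a lossy link as unusable is a simplifying assumption of this sketch.

```python
import heapq

def shortest_path(graph, source, target, down_nodes=frozenset(), bad_links=frozenset()):
    """Dijkstra SPF with latency as cost. Returns (total_latency, path)."""
    best = {source: 0.0}          # lowest latency found so far per node
    prev = {}                     # predecessor on the best path
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            path = [node]
            while node in prev:   # walk predecessors back to the source
                node = prev[node]
                path.append(node)
            return cost, path[::-1]
        if cost > best.get(node, float("inf")):
            continue              # stale heap entry
        for nbr, latency in graph.get(node, {}).items():
            # Skip failed nodes and links we refuse to use.
            if nbr in down_nodes or frozenset((node, nbr)) in bad_links:
                continue
            new_cost = cost + latency
            if new_cost < best.get(nbr, float("inf")):
                best[nbr] = new_cost
                prev[nbr] = node
                heapq.heappush(heap, (new_cost, nbr))
    return float("inf"), []       # target unreachable

# Hypothetical undirected topology: (node_a, node_b) -> latency units.
LINKS = {(1, 4): 2, (4, 2): 2, (1, 3): 2, (3, 5): 1, (5, 2): 2, (3, 2): 5}
GRAPH = {}
for (a, b), latency in LINKS.items():
    GRAPH.setdefault(a, {})[b] = latency
    GRAPH.setdefault(b, {})[a] = latency

print(shortest_path(GRAPH, 1, 2))                                  # all up
print(shortest_path(GRAPH, 1, 2, down_nodes={4}))                  # Node 4 down
print(shortest_path(GRAPH, 1, 2, down_nodes={4},
                    bad_links={frozenset((3, 5))}))                # and Link 3-5 avoided
```

Each failure pushes the flow onto a longer path, which is why capacity has to be planned for alternative paths as well as primary ones.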
  23. 23. QoS = what’s important to user 1. QoS = 1 / Latency 2. QoS = “Goodput” = Throughput * Pr{delivery} 1. Low Latency 2. High Probability of: a. Delivery b. Accuracy
  24. 24. “Travelling Salesman” Non-linear optimization Routing for “Goodput”: Nonlinear optimization
  25. 25. “Travelling Salesman” Non-linear optimization Routing for “Goodput”: Nonlinear optimization
  26. 26. Non-linear optimization Routing for “Goodput”: Can it be simplified? Assume: ● No Queueing ○ No Blocking Redefine: Can be pseudo-linearized
  27. 27. Routing As a Process SPF
  28. 28. SPF Routing As a Process Draining
  29. 29. SPF Routing As a Process
  30. 30. SPF Routing As a Process Draining
  31. 31. SPF Routing As a Process
  32. 32. SPF Routing As a Process Draining
  33. 33. SPF Routing As a Process
  34. 34. SPF Routing As a Process Draining
  35. 35. SPF Routing As a Process
  36. 36. “Whack-a-Mole!” Routing is updated all the time via: ● Protocol (e.g., TCP) ● SDN Control We need to accommodate each Flow’s: ● Primary Paths ● Alternative Paths
  37. 37. Network Demand & Throughput Link Throughput Demand Topology Node & Link Reliability Link Size
  38. 38. From Demand to Capacity: Demand_i, Throughput_j, Connex Traversal Time (Latency), Concurrency_j, Capacity
  39. 39. To account for Queueing & StatMux, …: Demand_i, Throughput_j, Link Traversal Time (Latency), Concurrency_j, Erl^-1 (N, P_B), Capacity, QoS P_B
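A minimal sketch of the chain on the two slides above, under two assumptions that the transcript does not state outright: that concurrency is obtained from throughput and traversal time via Little's law, and that Erl^-1 denotes inverting the Erlang-B blocking formula to find the smallest capacity that keeps the blocking probability at or below P_B. Function names and units are hypothetical.

```python
def concurrency(throughput_per_s, traversal_time_s):
    """Little's law: items in flight = arrival rate * time in system."""
    return throughput_per_s * traversal_time_s

def erlang_b(servers, offered_load):
    """Erlang-B blocking probability, computed with the standard recurrence
    B(0) = 1, B(n) = A*B(n-1) / (n + A*B(n-1))."""
    blocking = 1.0
    for n in range(1, servers + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    return blocking

def min_servers(offered_load, max_blocking):
    """Inverse Erlang-B: smallest server count with blocking <= max_blocking."""
    n = 0
    while erlang_b(n, offered_load) > max_blocking:
        n += 1
    return n

# Hypothetical numbers: 500 sessions/s, 20 ms traversal time, P_B = 1%.
load = concurrency(500.0, 0.02)      # offered load in Erlangs
print(load, min_servers(load, 0.01))
```

The recurrence avoids the factorials of the textbook formula, so it stays numerically stable for large loads.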
  40. 40. For Long-Haul Networks, it reduces to…: Demand, Throughput, Concurrency for Flow_i, Connex Traversal Time (Latency), Erl^-1 (N, P_B), QoS P_B, Capacity. L_Propagation >> L_Queueing
  41. 41. For Long-Haul Networks, it reduces to…: Demand, Throughput, Capacity, Bandwidth, Fill Factor. L_Propagation >> L_Queueing, so Latency ~ const and Concurrency = const * Throughput. Can’t forget the stochastic element.
  42. 42. We can forecast demand Demand: ● A1 -> Z1 : X11 Gbps ● A1 -> Z2 : X12 Gbps ● A2 -> Z3 : X23 Gbps Throughput on each Link Capacity for each Link
  43. 43. We can forecast demand Demand: ● A1 -> Z1 : X11 Gbps ● A1 -> Z2 : X12 Gbps ● A2 -> Z3 : X23 Gbps Throughput on each Link Capacity for each Link Throughput is combinatorial
  44. 44. Demand is NOT Deterministic Demand: ● A1 -> Z1 : X11 Gbps ● A1 -> Z2 : X12 Gbps ● A2 -> Z3 : X23 Gbps Throughput on each Link Neither is Throughput
  45. 45. From Deterministic Demand to Throughput. Demand: N1_N4: 100 Gbps; N2_N4: 200 Gbps. Throughput to find: L12 = ? L24 = ? L43 = ? L31 = ? L141 = ? Resulting throughput: L12 = 100 G, L21 = 200 G, L24 = 300 G, L14 = 300 G, L41 = 0, L43 = 0, L31 = 0. [Figure: network of nodes N1, N2, N3, N4 joined by the links listed above, with per-link weights and 100 G / 200 G flow labels.]
  46. 46. From Gaussian Demand to Throughput. Demand: N1_N4: N (100, 10) Gbps; N2_N4: N (200, 15) Gbps. Throughput to find: L12 = ? L24 = ? L43 = ? L31 = ? L141 = ? Resulting throughput: L12 = N (100, 10) G, L21 = N (200, 15) G, L24 = N (300, 18) G, L14 = N (300, 18) G, L41 = 0, L43 = 0, L31 = 0. [Figure: network of nodes N1, N2, N3, N4 joined by the links listed above, with per-link weights.]
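The N (300, 18) figure on the slide above is what the sum of the two Gaussian demands gives if the flows are independent: means add, and standard deviations add in quadrature. Independence is an assumption of this sketch; correlated flows would need a covariance term. The function name is hypothetical.

```python
import math

def sum_independent_gaussians(*flows):
    """Combine (mean, std_dev) pairs of independent Gaussian flows on one link."""
    mean = sum(m for m, _ in flows)
    std_dev = math.sqrt(sum(s ** 2 for _, s in flows))
    return mean, std_dev

# N1_N4 ~ N(100, 10) Gbps and N2_N4 ~ N(200, 15) Gbps sharing a link.
mean, std_dev = sum_independent_gaussians((100, 10), (200, 15))
print(mean, round(std_dev, 2))  # sqrt(10^2 + 15^2) = sqrt(325), about 18.03
```

This closed form exists only because the demands are Gaussian; for a generic distribution there is usually no such shortcut, which motivates the Monte-Carlo approach that follows.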
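  47. 47. From Generic Random Demand to Throughput. Demand: N1_N4: G (100, ...) Gbps; N2_N4: G (200, ...) Gbps. Throughput to find: L12 = ? L24 = ? L43 = ? L31 = ? L141 = ? [Figure: network of nodes N1, N2, N3, N4 joined by the links listed above, with per-link weights; the resulting throughput is left as an open question.]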
  48. 48. Monte-Carlo
  49. 49. Monte-Carlo
  50. 50. Monte-Carlo
  51. 51. Every Demand VALUE is a REALIZATION of a RANGE of possible values Demand Forecast Replace point estimates with probability distributions
  52. 52. Link Throughput: Monte-Carlo Forecasting. Replace point estimates with probability distributions. Slice the timeline. For each timestamp, for each Flow: roll the dice N times. For each timestamp, for each of the N dice rolls: Throughput = sum (Flows).
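The dice-rolling procedure above can be sketched for a single time slice and a single link as follows. The two flow distributions (one Gaussian, one skewed gamma) and the clipping of negative draws to zero are illustrative assumptions, not the authors' fitted models.

```python
import random
import statistics

def simulate_link_throughput(flow_samplers, n_rolls=1000, seed=42):
    """For each of n_rolls dice rolls, draw every flow once and sum them."""
    rng = random.Random(seed)
    rolls = []
    for _ in range(n_rolls):
        # Demand cannot be negative, so clip each draw at zero.
        rolls.append(sum(max(0.0, draw(rng)) for draw in flow_samplers))
    return rolls

# Hypothetical flows sharing one link, in Gbps.
FLOWS = [
    lambda rng: rng.gauss(100.0, 10.0),        # roughly symmetric demand
    lambda rng: rng.gammavariate(4.0, 50.0),   # skewed demand, mean 200
]

rolls = simulate_link_throughput(FLOWS)
p95 = statistics.quantiles(rolls, n=20)[18]    # 19 cut points; the last is the 95th percentile
print(round(statistics.mean(rolls), 1), round(p95, 1))
```

Repeating this per timestamp yields a throughput distribution for every slice of the timeline, so the forecast becomes a range instead of a point.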
  53. 53. Monte Carlo works with any Transfer Function Monte Carlo Throughput on each Link Demand (A-Z) Capacity for each Link
  54. 54. Use Case (a case study) ● Hundreds of links ● Thousands of demand flows forecasted o 95th percentile o Unspecified Prediction Intervals ● Establish optimal Inventory Size & Policies o Account for Demand Predictability ● Estimate demand variability effect on: o Network Size o TCO Forecast
  55. 55. Approach: Quantify Demand Distributions (use Biases); use Monte-Carlo to forecast Throughput Distributions; use Monte-Carlo to compute Capacity Prediction Intervals; use Monte-Carlo to optimize Inventory Size & Policies. Biases = Forecast - Observed. Biases != Residuals.
  56. 56. Quantify Demand Ranges & Prepare MC “Forecasts”. Start. For each Time Slice, for each Flow: Compute: Bias = Projected - Observed. Build: Bias Distribution. Roll the dice N = 100 times. Apply the rolled-out numbers to the baseline forecast for each flow. Save the N Demand scenarios.
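One way to read the flowchart above for a single flow and time slice is sketched below. It assumes the bias is additive and that the bias distribution is the empirical one, resampled with replacement from past forecast errors; the slide does not say how the distribution was built, so both choices are assumptions. Since Bias = Projected - Observed, a plausible observed value is the baseline minus a sampled bias.

```python
import random

def demand_scenarios(baseline, past_forecasts, past_observed, n=100, seed=0):
    """Generate n demand scenarios by applying resampled biases to a baseline forecast."""
    # Bias = Projected - Observed, one value per historical forecast.
    biases = [f - o for f, o in zip(past_forecasts, past_observed)]
    rng = random.Random(seed)
    # Roll the dice n times: each roll picks one historical bias.
    return [baseline - rng.choice(biases) for _ in range(n)]

# Hypothetical history for one flow, in Gbps.
scenarios = demand_scenarios(
    baseline=120.0,
    past_forecasts=[100.0, 110.0, 90.0],
    past_observed=[95.0, 115.0, 90.0],
)
print(len(scenarios), min(scenarios), max(scenarios))
```

A relative (percentage) bias would work the same way with division in place of subtraction, and may suit flows whose size changes a lot over time.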
  57. 57. Run the Pseudo-Random Demands through MC. Input: F flows * N forecasts. Map (Map1, Map2, ..., MapN-1, MapN): Compute Capacities (N). Reduce: Analyze the N Capacity Forecasts. Output: Capacity Forecasts for each Link; for L links: Capacity Prediction Intervals.
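The map and reduce stages above can be sketched in plain Python. The transfer function used here (sum the demands routed over each link, then divide by a fill factor) and the fixed routing table are assumptions for illustration; as the deck notes, Monte Carlo works with any transfer function. All names are hypothetical.

```python
from collections import defaultdict
import statistics

def map_capacity(scenario, routes, fill_factor=0.8):
    """Map step: turn one demand scenario into a required capacity per link."""
    load = defaultdict(float)
    for flow, demand in scenario.items():
        for link in routes[flow]:
            load[link] += demand          # throughput = sum of flows on the link
    return {link: t / fill_factor for link, t in load.items()}

def reduce_intervals(capacity_runs, lower=5, upper=95):
    """Reduce step: per-link prediction interval across all scenarios."""
    per_link = defaultdict(list)
    for run in capacity_runs:
        for link, capacity in run.items():
            per_link[link].append(capacity)
    intervals = {}
    for link, caps in per_link.items():
        cuts = statistics.quantiles(caps, n=100)   # 99 percentile cut points
        intervals[link] = (cuts[lower - 1], cuts[upper - 1])
    return intervals

# Hypothetical routing and N = 100 demand scenarios, in Gbps.
ROUTES = {"A1->Z1": ["L12", "L24"], "A2->Z1": ["L24"]}
SCENARIOS = [{"A1->Z1": 100.0 + i, "A2->Z1": 200.0} for i in range(100)]

intervals = reduce_intervals(map_capacity(s, ROUTES) for s in SCENARIOS)
print(intervals)
```

Because each scenario is mapped independently, the map step parallelizes trivially across workers; only the reduce step needs all N results.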
  58. 58. What does it cost to own a network?
  59. 59. In Conclusion ● Range forecasting is cool! ● Network Demand varies in many ways ● For WAN, it is OK to use throughput o still it’s better to use concurrency ● Demand ≠ Throughput o Demand -> Throughput -> Capacity ● Monte-Carlo is a model o Therefore it is wrong o But it is useful
  60. 60. Acknowledgements ● Google’s NetOps Division ● Google’s NetCap & ODS Teams ● Josep Ferrandiz ● Mike Perka ● Leonid Kats ● C. Steven Gunn ● Matthew Mathis ● Kevin J. Mitchell ● Linda Eck ● Sophia Shtilman ● Leora Gilgur
  61. 61. agilgur@google.com brianeck@google.com THANK YOU!!!
  62. 62. Backup Slides
  63. 63. Biases != Residuals. Why? How good are forecasts at predicting demand N days from “now” ???
  64. 64. H/W Availability: Fault Trees. Reliability Function: Failure is a memoryless (Poisson) process.
      F(C|t) = F((1 OR 2)|t) = 1 - R(1|t) * R(2|t)
      F(D|t) = F((3 AND 4 AND 5)|t) = F(3|t) * F(4|t) * F(5|t)
      F(E|t) = F((7 AND 8)|t) = F(7|t) * F(8|t)
      F(F|t) = F((6 OR E)|t) = 1 - (1 - F(7|t) * F(8|t)) * R(6|t)
      F(B|t) = F((C OR D OR F)|t) = 1 - R(1|t) * R(2|t) * (1 - F(3|t) * F(4|t) * F(5|t)) * (1 - F(7|t) * F(8|t)) * R(6|t)
      ⇒ R(A|t) = R(1|t) * R(2|t) * (1 - F(3|t) * F(4|t) * F(5|t)) * (1 - F(7|t) * F(8|t)) * R(6|t)
      Fault-tree gates shown: C, D, F, E, B. There’s got to be a cleaner way!
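The closed-form result above can be checked numerically with a short sketch. It assumes the eight components fail independently and, for the memoryless case, that each has reliability R(t) = exp(-t / MTBF); the MTBF value below is hypothetical.

```python
import math

def reliability(t, mtbf):
    """Reliability of a component with exponentially distributed time to failure."""
    return math.exp(-t / mtbf)

def system_reliability(r):
    """R(A|t) for the slide's fault tree, given component reliabilities r[1]..r[8]."""
    f = {k: 1.0 - v for k, v in r.items()}   # failure probabilities
    r_c = r[1] * r[2]                        # C fails if 1 OR 2 fails
    r_d = 1.0 - f[3] * f[4] * f[5]           # D fails only if 3 AND 4 AND 5 fail
    r_e = 1.0 - f[7] * f[8]                  # E fails only if 7 AND 8 fail
    r_f = r[6] * r_e                         # F fails if 6 OR E fails
    return r_c * r_d * r_f                   # B fails if C OR D OR F fails

# Hypothetical: every component has a 10,000-hour MTBF; evaluate at t = 1,000 hours.
components = {k: reliability(1000.0, 10000.0) for k in range(1, 9)}
print(round(system_reliability(components), 4))
```

Redundant groups (the AND gates) barely reduce reliability, while every series element (the OR gates) multiplies it down, so components 1, 2, and 6 dominate the result.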
  65. 65. Fault Trees and Monte-Carlo. Fault-tree gates shown: C, D, F, E, B.
      Main loop:
        clock.start()
        for each component: component.update (time = clock)
        clock.set (min (next_update_time))
      Component:
        state = (run, fail)
        rule = (AND, OR, NONE)
        mtbf
        mttr
        next_update_time
        elements: Component
        fail()
        run()
        update(time)
      run(): if rule == NONE: state = run; else: //apply rule to elements return;
      fail(): if rule == NONE: state = fail; else: //apply rule to elements return;
      update (time): if time ≥ next_update_time: if state == fail: run(); next_update_time +=Exp(mtbf); else: fail(); next_update_time +=Exp(mttr); return;
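A runnable sketch in the spirit of the pseudocode above. It simplifies the slide's single Component class into leaves (which alternate between running and failed with exponentially distributed times) and gates (which apply an AND or OR rule to their elements), and it estimates the fraction of time the top event is in the failed state. The tree wiring follows the gate definitions on the previous slide; the MTBF, MTTR, and horizon values are hypothetical.

```python
import random

class Leaf:
    """Basic component that alternates between running and failed states."""
    def __init__(self, mtbf, mttr, rng):
        self.mtbf, self.mttr, self.rng = mtbf, mttr, rng
        self.failed = False
        self.next_update = rng.expovariate(1.0 / mtbf)   # time of first failure

    def update(self, now):
        if now >= self.next_update:
            self.failed = not self.failed
            # After failing, wait Exp(mttr) for repair; after repair, Exp(mtbf).
            mean = self.mttr if self.failed else self.mtbf
            self.next_update += self.rng.expovariate(1.0 / mean)

class Gate:
    """Fault-tree gate: OR fails if any element fails, AND if all fail."""
    def __init__(self, rule, elements):
        self.rule, self.elements = rule, elements

    @property
    def failed(self):
        states = [e.failed for e in self.elements]
        return any(states) if self.rule == "OR" else all(states)

def build_slide_tree(rng, mtbf=1000.0, mttr=10.0):
    leaf = {k: Leaf(mtbf, mttr, rng) for k in range(1, 9)}
    c = Gate("OR", [leaf[1], leaf[2]])
    d = Gate("AND", [leaf[3], leaf[4], leaf[5]])
    e = Gate("AND", [leaf[7], leaf[8]])
    f = Gate("OR", [leaf[6], e])
    return Gate("OR", [c, d, f]), list(leaf.values())

def simulate_unavailability(top, leaves, horizon):
    """Event-driven loop: jump the clock to the next state change."""
    clock, downtime = 0.0, 0.0
    while clock < horizon:
        nxt = min(min(l.next_update for l in leaves), horizon)
        if top.failed:
            downtime += nxt - clock      # state is constant between events
        clock = nxt
        for l in leaves:
            l.update(clock)
    return downtime / horizon

top, leaves = build_slide_tree(random.Random(7))
print(simulate_unavailability(top, leaves, horizon=1_000_000.0))
```

Unlike the closed-form expression, this version extends to repair times, non-exponential distributions, or deeper trees by changing only the leaf and gate definitions.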
  66. 66. Probability distributions. Simplest - Uniform: least relevant to anything real; convenient building block for any distribution. Most standard - Gaussian: mathematically the simplest; does not describe the IT world. Most relevant - Poisson & Exponential: relatively simple mathematically; accurately describe times between arrivals and service times for a memoryless process. F(x) = Pr (X ≤ x) - CDF; f (x) = F’(x) - PDF.
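The remark that the uniform distribution is a building block for any distribution can be illustrated with inverse-transform sampling: draw U uniformly on [0, 1) and return the x for which F(x) = U. For the exponential distribution, F(x) = 1 - exp(-x / mean), so x = -mean * ln(1 - U). The function name and the mean value are hypothetical.

```python
import math
import random
import statistics

def exponential_from_uniform(mean, rng):
    """Inverse-transform sampling: map a uniform draw through the inverse CDF."""
    u = rng.random()                     # uniform on [0, 1)
    return -mean * math.log(1.0 - u)     # 1 - u lies in (0, 1], so log is defined

rng = random.Random(123)
samples = [exponential_from_uniform(5.0, rng) for _ in range(20000)]
print(round(statistics.mean(samples), 2))  # should land close to the requested mean of 5
```

The same recipe works for any distribution whose CDF can be inverted, analytically or numerically, which is how an empirical bias distribution can be sampled as well.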
