Isolating Events from the Fail Whale


Published on

QCon NYC 2013

Published in: Technology


  1. @Twitter | QCon NY 2013: Isolating Events from the Fail Whale
     Arun Kejariwal, Bryce Yan (@arun_kejariwal, @bryce_yan)
     Capacity Engineering @ Twitter
     June 2013
  2. Delivering Best User Experience
     • Performance
       - Real time!
       - Latency tolerance of end users has nosedived
       - Average, p99, p999
       - Variability on large clusters
       - Tolerate faults when using commodity hardware
     • Availability
       - Anytime, Anywhere, Any Device
     • Organic Growth
       - Over 200M monthly active users
     • Events
       - Planned, Unplanned
     [1] Xu et al., NSDI 2013
  3. High Performance, Availability
     • Capacity Planning
       - Throw hardware at the problem
       - Operationally inefficient
       - Even otherwise
         o How much?
         o What kind? (Inventory management etc.)
       - Reactive approach
       - Degraded user experience
         o Impacts the bottom line
       - Overall goal
       - Deliver best user experience
       - Minimal operational footprint
         o Factor in organic growth and lead times for provisioning additional capacity
  4. Capacity Planning is Non-trivial
     • Behavioral response is unpredictable
     • Multiplier Effect
       - # Retweets x Followers of each retweeter
       - Large fan-out
  5. Capacity Planning is Non-trivial (cont'd)
     • Unforeseen events
       - Power failure
         o "Hurricane Sandy takes data centers offline with flooding, power outages"
       - Network issues
         o "Amazon's compute cloud has a networking hiccup"
     • Evolving product development landscape
       - New features
       - New products
       - New partners
         o "Twitter Arrives on Wall Street, Via Bloomberg"
     [3] Ballani et al., NSDI 2013
     14 June 2013
  6. Capacity Planning is Non-trivial (cont'd)
     • New hardware platforms
       - Purchase pipeline
       - How much and when to buy: cost-performance trade-off
  7. Events
     • Planned
       - Still, the traffic pattern is subject to, say,
         o Nature of the event
         o Behavioral response
         o Community effect
         o Demographics
  8. Events (cont'd)
     • Unplanned
       - Intensity of the event
       - Population density
     [Image captions: Japan Tsunami; New Zealand Earthquake; Hurricane Sandy; Flash Crash; Egyptian Revolution; Iran's Disputed Election; Boston Explosion; Remembering Steve Jobs]
  9. Events (cont'd)
     • Unplanned (transient)
       - Duration
       - Type of the transient event
     [Image caption: White House Rumor: AP account being hacked]
  10. Events (cont'd)
     • Black Swans (à la Nassim Taleb)
       - Planned events, but…
     [Image captions: Superbowl'13 Blackout; Zidane in "Action"; "Hand of God"; Usain Bolt's 100m World Record]
  11. Events (cont'd)
     • Events timeline
     [Figure: events timeline; x-axis: Time]
  12. Events' Impact
     • Differ in characteristics
       - Tweets
       - Photos
       - Vines
       - Now, Music
     • Consequently, tax different services
       - Different capacity requests
  13. Capacity Modeling Overview
  14. Capacity Modeling
     • Takes core drivers as inputs to generate usage demand
       - Forecasts the amount of work based on core driver projections
     • Relates the work metric to a primary resource to identify the capacity threshold
       - Primary resources
         o Computing power (CPU, RAM)
         o Storage (disk I/O, disk space)
         o Network (network bandwidth)
     • Generates hardware demand based on the limiting primary resource
  15. Core Drivers
     • Underlying business metrics that drive demand for more capacity
       - Active Users
       - Tweets per second (TPS)
       - Favorites per second (FPS)
       - Requests per second (RPS)
     • Normalized by Active Users to isolate user engagement
     • Project user engagement and Active Users independently
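The normalization described on this slide can be sketched as follows. This is a minimal illustration, not the presenters' model: the function names, the choice of a straight-line fit for both series, and the sample numbers are all assumptions made for the example. The idea is to divide a core driver by Active Users to get per-user engagement, project each series on its own, and multiply the projections back together.

```python
def linear_fit(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_core_driver(active_users, driver_totals, horizon):
    """Forecast a core driver as (projected users) x (projected engagement).

    active_users and driver_totals are equally spaced historical series
    (hypothetical inputs); horizon is the number of future periods.
    """
    # Normalize by Active Users to isolate per-user engagement.
    engagement = [d / u for d, u in zip(driver_totals, active_users)]
    # Project user growth and engagement independently (linear is an assumption).
    u_slope, u_icpt = linear_fit(active_users)
    e_slope, e_icpt = linear_fit(engagement)
    n = len(active_users)
    return [
        (u_slope * t + u_icpt) * (e_slope * t + e_icpt)
        for t in range(n, n + horizon)
    ]


# Made-up history: users grow by 10 per period, engagement by 0.1 per period.
users = [100, 110, 120, 130]
photos = [200, 231, 264, 299]
print(forecast_core_driver(users, photos, horizon=2))
```

Keeping the two projections separate means a change in user growth and a change in per-user behavior can each be revised without refitting the other.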
  16. Core Drivers (cont'd)
     [Figure: Active Users aka User Growth; y-axis: Active User Count; x-axis: Time; series: Active Users, Linear (Active Users)]
     [Figure: Normalized Core Drivers for Engagement; y-axis: Per Active User Values; x-axis: Time; series: Favorites, Retweets, Poly. (Favorites), Linear (Retweets)]
  17. Core Drivers (cont'd)
     [Figure: User Growth: Active Users; x-axis: Time; series: Active Users, Linear (Active Users)]
     [Figure: Engagement: Photos/Active User; x-axis: Time; series: Photos, Linear (Photos)]
     [Figure: Core Driver: Photos per Day; x-axis: Time; series: Photos, Photos Forecast]
  18. Capacity Threshold
     • Primary resource scalability threshold
       - Determined by load testing
         o Synthetic load
         o Replaying production traffic
         o Real-time production traffic
       - Test systems may be
         o Isolated replicas of production
         o Staging systems in production
         o Production systems
     [Figure: Average Response Times vs CPU; y-axis: Service Response Time; x-axis: CPU (0 to 100); threshold marked with an X]
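One way to read a threshold off load-test results like those plotted on this slide is sketched below. The response-time limit, the sample measurements, and the function name are hypothetical; the slide does not say how the threshold point was actually chosen.

```python
def capacity_threshold(samples, max_response_ms):
    """Return the highest CPU level reached before response time exceeds the limit.

    samples: iterable of (cpu_percent, avg_response_ms) pairs from a load test.
    Returns None if even the lowest CPU level violates the limit.
    """
    threshold = None
    for cpu, response_ms in sorted(samples):
        if response_ms > max_response_ms:
            break  # response time has degraded past the acceptable limit
        threshold = cpu
    return threshold


# Made-up measurements: response time climbs sharply past 60% CPU.
load_test = [(20, 50), (40, 55), (60, 70), (80, 160), (90, 400)]
print(capacity_threshold(load_test, max_response_ms=100))
```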
  19. Hardware Demand
     • Core driver → capacity threshold → scaling formula → server count
     • Example
       - Core driver: Requests per Second
       - Per-server request throughput determined by capacity threshold
       - Scaling formula for sizing
         o Number of Servers = RPS / Per-Server Threshold
     [Figure: y-axis: Core Driver (RPS) / Server Count; x-axis: Time; series: RPS (Actuals), RPS (Forecast), # Servers (Actuals), # Servers (Forecast)]
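The scaling formula above can be applied to a forecast as in this short sketch. Rounding up to a whole server is an assumption added here (the slide gives only the division), and the RPS figures and per-server threshold are made up.

```python
import math


def servers_needed(rps, per_server_threshold):
    """Number of Servers = RPS / Per-Server Threshold, rounded up to a whole server."""
    if per_server_threshold <= 0:
        raise ValueError("per_server_threshold must be positive")
    return math.ceil(rps / per_server_threshold)


# Hypothetical RPS forecast and a per-server threshold taken from load testing.
rps_forecast = [100_000, 110_000, 121_000]
print([servers_needed(rps, 2_500) for rps in rps_forecast])
```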
  20. Deep Dive and Superbowl 2013
  21. Events: High Level Methodology
     • Goal
       - Handle traffic "spike"
     • Predict expected traffic based on historical and temporal statistical analysis
       - Statistical metrics
         o Average
         o Standard deviation
         o Max
     • Limitations
       - Changing usage patterns
         o Organic growth, behavioral, cultural
       - Event driven
         o How would a game turn out?
  22. Statistical Time Series Analysis
     • Time window
       - Week over Week (WoW)
       - Month over Month (MoM)
       - Year over Year (YoY)
     • Data distribution
       - Normal, Log Normal, Multi-modal
       - Has implications for model selection
     • Forecasting
       - Regression model
         o Linear, Spline
       - ARIMA
         o Trending, Seasonal, Residuals
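To make the trend, seasonal, and residual components concrete, here is a minimal additive decomposition using a centered moving average. It is a simpler stand-in for the ARIMA modeling named on the slide, not an ARIMA implementation; the function name, the weekly period, and the synthetic data are assumptions.

```python
def decompose_additive(series, period=7):
    """Split a series into trend + seasonal + residual (additive, odd period)."""
    if period % 2 == 0:
        raise ValueError("this sketch only handles odd periods")
    n = len(series)
    if n < 2 * period:
        raise ValueError("need at least two full periods of data")
    half = period // 2

    # Trend: centered moving average over one full period.
    trend = [None] * n
    for i in range(half, n - half):
        trend[i] = sum(series[i - half:i + half + 1]) / period

    # Seasonal: average detrended value for each position in the period.
    buckets = [[] for _ in range(period)]
    for i in range(n):
        if trend[i] is not None:
            buckets[i % period].append(series[i] - trend[i])
    raw = [sum(b) / len(b) for b in buckets]
    offset = sum(raw) / period  # center the pattern around zero
    pattern = [r - offset for r in raw]
    seasonal = [pattern[i % period] for i in range(n)]

    # Residual: whatever the trend and seasonal components do not explain.
    residual = [
        None if trend[i] is None else series[i] - trend[i] - seasonal[i]
        for i in range(n)
    ]
    return trend, seasonal, residual


# Synthetic daily data: linear growth plus a repeating weekly pattern.
weekly = [3, -1, -2, 0, 1, -3, 2]
data = [100 + 2 * i + weekly[i % 7] for i in range(28)]
trend, seasonal, residual = decompose_additive(data)
print(trend[3], seasonal[:7])
```

Large residuals on a real series are the interesting part for event planning, since they mark traffic that neither growth nor the regular cycle accounts for.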
  23. Superbowl 2013: Capacity Planning
     • Assess capacity requirement based on 2011 and 2012 Superbowl traffic patterns
     • Core driver selection
       - RPS (Reads)
       - TPS (Writes)
     • What time granularity to use?
       - Avg TPS (Tweets per sec)
       - 1s/10s/15s/30s Max TPS
       - 1 min/5 min/10 min Max TPS
       - 1 hr Max TPS
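The effect of time granularity on "Max TPS" can be seen in a small sketch. It assumes that an N-second Max TPS means the highest average rate over any sliding N-second window, which is one plausible reading of the slide rather than a confirmed definition; the per-second counts are made up.

```python
def max_windowed_tps(per_second_counts, window_s):
    """Highest average tweets-per-second over any sliding window of window_s seconds."""
    if window_s <= 0 or window_s > len(per_second_counts):
        raise ValueError("window_s must be between 1 and the series length")
    window_sum = sum(per_second_counts[:window_s])
    best = window_sum
    for i in range(window_s, len(per_second_counts)):
        # Slide the window forward by one second.
        window_sum += per_second_counts[i] - per_second_counts[i - window_s]
        best = max(best, window_sum)
    return best / window_s


# Made-up per-second tweet counts with a one-second spike.
counts = [10, 10, 50, 10, 10, 10]
for window in (1, 2, 6):
    print(window, max_windowed_tps(counts, window))
```

The one-second spike dominates the 1s figure and is progressively averaged away as the window widens, which is why the choice of granularity changes the capacity target.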
  24. Superbowl 2013: Capacity Planning (cont'd)
     • Which metric to use?
     [Figure: candidate metrics over Time; annotation: Highly correlated]
  25. Superbowl 2013: Capacity Planning (cont'd)
     • Which metric to use?
       - Time sensitive: correlation may change YoY
     [Figure: candidate metrics over Time; annotation: Highly correlated]
  26. Superbowl 2013: Capacity Planning (cont'd)
     • Approaches
       - TPS_Superbowl (denote by T_n)
       - d-Day historical window
         o TPS_{n-1}, TPS_{n-2}, …, TPS_{n-d}
       - Ratio Analysis
         o R_n = T_n / Max(TPS_{n-1}, TPS_{n-2}, …, TPS_{n-d})
       - Distribution Analysis
         o α_n = (T_n - AVG(TPS_{n-1}, TPS_{n-2}, …, TPS_{n-d})) / STDEV(TPS_{n-1}, TPS_{n-2}, …, TPS_{n-d})
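Both measures are straightforward to compute from a d-day window of history, as sketched here. The function names and sample values are hypothetical, and using the sample standard deviation (`statistics.stdev`) for STDEV is an assumption; the slide does not say whether the sample or population form was used.

```python
from statistics import mean, stdev


def ratio_analysis(t_n, history):
    """R_n: event-day TPS divided by the maximum TPS over the d-day window."""
    return t_n / max(history)


def distribution_analysis(t_n, history):
    """alpha_n: how many standard deviations event-day TPS sits above the window mean."""
    return (t_n - mean(history)) / stdev(history)


# Made-up values: three days of history and an event-day peak.
window = [8, 10, 12]
event_tps = 15
print(ratio_analysis(event_tps, window))
print(distribution_analysis(event_tps, window))
```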
  27. Superbowl 2013: Capacity Planning (cont'd)
     • Ratio Analysis (R_n)
       - 1s Max TPS

              14 Day   28 Day   45 Day
       2011   0.791    0.791    1.007
       2012   1.062    0.858    0.580
  28. Superbowl 2013: Capacity Planning (cont'd)
     • Distribution Analysis (α_n)
       - AVG (μ), STDEV (σ)
         o μ increased YoY (expected)
         o σ also increased YoY
       - 1s Max TPS

              T_n/μ    (T_n - μ)/σ
       2011   1.448    1.746
       2012   1.517    2.756

     • TPS during Superbowl has been moving right YoY
     [Figure: TPS distributions for 2011 and 2012, with μ marked]
  29. Superbowl 2013: Capacity Planning (cont'd)
     • Distribution Analysis
       - YoY movement of TPS_Superbowl further into the right tail
       - Expectation: progressive moves would be smaller
       - Overestimate α
         o Handle unplanned events
         o Business decision
  30. Superbowl 2013: Capacity Planning (cont'd)
     • Historical component
       - Determine extent of movement (α_expected) of TPS_Superbowl into right tail
     • Temporal component
       - Current μ_c
       - Current σ_c
     • Capacity planning
       - Plan capacity corresponding to μ_c + α_expected * σ_c
       - Scenario Analysis (à la Global Macro Hedge Funds)
       - α_expected
         o α_{n-1} (same as last year)
         o α_{n-1} + (α_{n-1} + α_{n-2})/2 (extrapolate from last two years)
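The planning rule and its two scenarios can be written out as a short sketch. The α values below are the 2011 and 2012 figures from the distribution analysis table on slide 28; μ_c and σ_c are placeholder numbers, since the deck does not give the current values. The second scenario is coded exactly as it is transcribed on the slide.

```python
def planned_capacity(mu_c, sigma_c, alpha_expected):
    """Capacity target: mu_c + alpha_expected * sigma_c."""
    return mu_c + alpha_expected * sigma_c


def alpha_scenarios(alpha_prev, alpha_prev2):
    """Candidate values for alpha_expected, as listed on the slide."""
    return {
        "same_as_last_year": alpha_prev,
        "extrapolated": alpha_prev + (alpha_prev + alpha_prev2) / 2,
    }


# alpha_{n-1} = 2.756 (2012) and alpha_{n-2} = 1.746 (2011) come from slide 28.
scenarios = alpha_scenarios(alpha_prev=2.756, alpha_prev2=1.746)

# Placeholder current mean and standard deviation of 1s Max TPS (not from the deck).
mu_c, sigma_c = 10_000, 2_000
for name, alpha in scenarios.items():
    print(name, planned_capacity(mu_c, sigma_c, alpha))
```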
  31. Superbowl 2013: Capacity Planning (cont'd)
     • Capacity planning
       - 1s Max TPS
         o α_{n-1}: 20K+
         o α_{n-1} + (α_{n-1} + α_{n-2})/2: 22K+
  32. Superbowl 2013: Capacity Planning (cont'd)
     • Validation
       - 1s Max TPS
         o α_observed < α_expected
       - Twitter was highly available during Superbowl 2013
       - Over-allocation concerns?
         o Minimal
         o Limited to a few services
       - Seamlessly handled the traffic spike due to the Superbowl 2013 Blackout
  33. Join the Flock
     • We are hiring!