Isolating Events from the Fail Whale

QCon NYC 2013

Transcript

  • 1. Isolating Events from the Fail Whale
    Arun Kejariwal, Bryce Yan (@arun_kejariwal, @bryce_yan)
    Capacity Engineering @ Twitter
    June 2013
  • 2. Delivering Best User Experience
    • Performance
      - Real time! Latency tolerance of end-users has nosedived [2]
      - Average, p99, p999
      - Variability on large clusters [1]
      - Tolerate faults when using commodity hardware
    • Availability
      - Anytime, Anywhere, Any Device
    • Organic Growth
      - Over 200M monthly active users [3]
    • Events
      - Planned, Unplanned
    [1] Xu et al. NSDI 2013 - https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final77.pdf
    [2] http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/Berkeley-Latency-Mar2012.pdf
    [3] https://twitter.com/twitter/status/281051652235087872
  • 3. High Performance, Availability
    • Capacity Planning
      - Throw hardware at the problem
        o Operationally inefficient
        o Even otherwise: how much? what kind? (inventory management etc.)
      - Reactive approach
        o Degraded user experience
        o Impacts the bottom line
      - Overall goal
        o Deliver the best user experience with a minimal operational footprint
        o Factor in organic growth and lead times for provisioning additional capacity
  • 4. Capacity Planning is Non-trivial
    • Behavioral response is unpredictable
    • Multiplier Effect
      - # Retweets x Followers of each retweeter → large fan-out
  • 5. Capacity Planning is Non-trivial (cont’d)
    • Unforeseen events
      - Power failure: “Hurricane Sandy takes data centers offline with flooding, power outages” [1]
      - Network issues: “Amazon's compute cloud has a networking hiccup” [2] [3]
    • Evolving product development landscape
      - New features
      - New products
      - New partners
      - “Twitter Arrives on Wall Street, Via Bloomberg” [4]
    [1] http://arstechnica.com/information-technology/2012/10/hurricane-sandy-takes-data-centers-offline-with-flooding-power-outages/
    [2] http://www.zdnet.com/amazons-compute-cloud-has-a-networking-hiccup-7000005776/
    [3] Ballani et al. NSDI 2013 - https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final186.pdf
    [4] http://dealbook.nytimes.com/2013/04/04/twitter-arrives-on-wall-street-via-bloomberg/
  • 6. Capacity Planning is Non-trivial (cont’d)
    • New hardware platforms
      - Purchase pipeline
      - How much and when to buy – cost/performance trade-off
  • 7. Events
    • Planned
      - Still, the traffic pattern is subject to, say:
        o Nature of the event
        o Behavioral response: community effect, demographics
  • 8. Events (cont’d)
    • Unplanned
      - Intensity of the event
      - Population density
    • Examples pictured: Japan Tsunami, New Zealand Earthquake, Hurricane Sandy, Flash Crash, Egyptian Revolution, Iran’s Disputed Election, Boston Explosion, Remembering Steve Jobs
  • 9. Events (cont’d)
    • Unplanned (transient)
      - Duration
      - Type of the transient event
    • Example pictured: White House rumor (AP account being hacked) [1]
    [1] http://finance.yahoo.com/news/stocks-briefly-drop-recover-fake-172814328.html
  • 10. Events (cont’d)
    • Black Swans (à la Nassim Taleb)
      - Planned events, but…
    • Examples pictured: Superbowl ’13 Blackout, Zidane in “Action”, “Hand of God”, Usain Bolt’s 100m World Record
  • 11. Events (cont’d)
    • Events timeline (chart: the above events plotted over time)
  • 12. Events’ Impact
    • Differ in characteristics
      - Tweets
      - Photos
      - Vines
      - Now, Music
    • Consequently, tax different services
      - Different capacity requests
  • 13. Capacity Modeling Overview
  • 14. Capacity Modeling
    • Takes core drivers as inputs to generate usage demand
      - Forecasts the amount of work based on core driver projections
    • Relates the work metric to a primary resource to identify the capacity threshold
      - Primary resources:
        o Computing power (CPU, RAM)
        o Storage (disk I/O, disk space)
        o Network (network bandwidth)
    • Generates hardware demand based on the limiting primary resource
  • 15. Core Drivers
    • Underlying business metrics that drive demand for more capacity
      - Active Users
      - Tweets per second (TPS)
      - Favorites per second (FPS)
      - Requests per second (RPS)
    • Normalized by Active Users to isolate user engagement
    • Project user engagement and Active Users independently (see the sketch below)
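    A minimal sketch of this decomposition, assuming hypothetical daily series and
    simple linear fits (the actual core drivers, data, and models are Twitter-internal):

    ```python
    # Sketch: normalize a core driver by Active Users, project user growth and
    # engagement independently, then recombine the projections.
    # All series and growth rates below are hypothetical.
    import numpy as np

    days = np.arange(120)                           # historical days
    active_users = 1.0e8 + 2.0e5 * days             # hypothetical user-growth series
    photos = active_users * (0.30 + 0.0005 * days)  # hypothetical photos/day

    # Normalize by Active Users to isolate engagement (photos per active user).
    engagement = photos / active_users

    # Project user growth and engagement independently (linear fits here).
    user_fit = np.polyfit(days, active_users, 1)
    eng_fit = np.polyfit(days, engagement, 1)

    # Recombine the two projections to forecast the core driver 90 days out.
    future = np.arange(120, 210)
    photos_forecast = np.polyval(user_fit, future) * np.polyval(eng_fit, future)
    print(f"Forecast photos/day at day {future[-1]}: {photos_forecast[-1]:,.0f}")
    ```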
  • 16. Core Drivers (cont’d)
    • Charts: “Active Users aka User Growth” (Active User count over time with a linear fit) and “Normalized Core Drivers for Engagement” (per-Active-User Favorites and Retweets over time, with polynomial and linear fits)
  • 17. Core Drivers (cont’d)
    • Charts: “User Growth: Active Users” (linear fit), “Engagement: Photos/Active User” (linear fit), and “Core Driver: Photos per Day” (actuals plus forecast), each plotted over time
  • 18. Capacity Threshold
    • Primary resource scalability threshold
      - Determined by load testing
        o Synthetic load
        o Replaying production traffic
        o Real-time production traffic
      - Test systems may be
        o Isolated replicas of production
        o Staging systems in production
        o Production systems
    • Chart: “Average Response Times vs CPU” (service response time vs. CPU utilization from 0 to 100%, with the threshold marked where response time starts to degrade); see the sketch below
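    A minimal sketch of turning load-test measurements into a capacity threshold,
    assuming hypothetical (CPU utilization, average response time) samples and a
    hypothetical latency budget; real thresholds are service-specific:

    ```python
    # Sketch: the capacity threshold is the highest tested CPU utilization at
    # which the service still meets its latency budget (the knee marked "X" on
    # the "Average Response Times vs CPU" curve). All numbers are hypothetical.
    cpu_util = [10, 20, 30, 40, 50, 60, 70, 80, 90]        # percent CPU under load
    avg_latency_ms = [12, 12, 13, 14, 16, 20, 28, 45, 90]  # measured response times

    LATENCY_BUDGET_MS = 25  # hypothetical budget on average response time

    threshold = max(cpu for cpu, lat in zip(cpu_util, avg_latency_ms)
                    if lat <= LATENCY_BUDGET_MS)
    print(f"Capacity threshold: keep CPU at or below {threshold}%")
    ```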
  • 19. Hardware Demand
    • Core driver → capacity threshold → scaling formula → server count
    • Example (see the sketch below)
      - Core driver: Requests per Second (RPS)
      - Per-server request throughput determined by the capacity threshold
      - Scaling formula for sizing: Number of Servers = RPS / Per-Server Threshold
    • Chart: core driver (RPS) and server count over time, showing actuals and forecasts for both
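    A minimal sketch of the sizing formula above, with hypothetical numbers for the
    forecast RPS and the per-server threshold:

    ```python
    # Sketch: Number of Servers = RPS / Per-Server Threshold, rounded up.
    # Both inputs below are hypothetical placeholders.
    import math

    forecast_rps = 180_000        # forecast core driver (requests per second)
    per_server_threshold = 1_200  # RPS one server can sustain at the capacity threshold

    servers_needed = math.ceil(forecast_rps / per_server_threshold)
    print(f"Servers needed: {servers_needed}")
    ```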
  • 20. Deep Dive and Superbowl 2013
  • 21. Events: High-Level Methodology
    • Goal
      - Handle the traffic “spike”
    • Predict expected traffic based on historical and temporal statistical analysis
      - Statistical metrics: average, standard deviation, max
    • Limitations
      - Changing usage patterns
        o Organic growth, behavioral, cultural
      - Event driven
        o e.g., how would a game turn out?
  • 22. Statistical Time Series Analysis
    • Time window
      - Week over Week (WoW)
      - Month over Month (MoM)
      - Year over Year (YoY)
    • Data distribution
      - Normal, log-normal, multi-modal
      - Has implications for model selection
    • Forecasting (see the sketch below)
      - Regression models: linear, spline
      - ARIMA
      - Trend, seasonal, and residual components
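    A minimal illustration of trend-plus-seasonality forecasting on a hypothetical
    daily TPS series, using a plain linear trend and weekly seasonal means (the talk
    also mentions spline regression and ARIMA, which would typically come from a
    statistics library rather than be hand-rolled):

    ```python
    # Sketch: decompose a hypothetical daily series into a linear trend and a
    # weekly seasonal component, then extrapolate both to forecast ahead.
    import numpy as np

    rng = np.random.default_rng(0)
    days = np.arange(365)
    tps = 4000 + 3.0 * days + 300 * np.sin(2 * np.pi * days / 7) + rng.normal(0, 50, 365)

    # Trend: ordinary least-squares linear fit.
    slope, intercept = np.polyfit(days, tps, 1)
    trend = slope * days + intercept

    # Seasonality: mean residual for each day of the week.
    residual = tps - trend
    weekly = np.array([residual[days % 7 == d].mean() for d in range(7)])

    # Forecast the next 28 days: extrapolated trend plus repeating weekly pattern.
    future = np.arange(365, 365 + 28)
    forecast = slope * future + intercept + weekly[future % 7]
    print(f"Forecast for day {future[-1]}: {forecast[-1]:,.0f} TPS")
    ```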
  • 23. Superbowl 2013: Capacity Planning
    • Assess capacity requirements based on the 2011 and 2012 Superbowl traffic patterns
    • Core driver selection
      - RPS (reads)
      - TPS (writes)
    • What time granularity to use?
      - Avg TPS (Tweets per sec)
      - 1s/10s/15s/30s Max TPS
      - 1 min/5 min/10 min Max TPS
      - 1 hr Max TPS
  • 24. Superbowl 2013: Capacity Planning (cont’d)
    • Which metric to use?
      - Chart: the candidate TPS metrics over time; they are highly correlated
  • 25. Superbowl 2013: Capacity Planning (cont’d)
    • Which metric to use?
      - Time sensitive – the correlation may change YoY
      - Chart: the metrics over time, again highly correlated
  • 26. Superbowl 2013: Capacity Planning (cont’d)
    • Approaches (see the sketch below)
      - TPS_Superbowl (denote by T_n); d-day historical window: TPS_{n-1}, TPS_{n-2}, …, TPS_{n-d}
      - Ratio Analysis: R_n = T_n / MAX(TPS_{n-1}, TPS_{n-2}, …, TPS_{n-d})
      - Distribution Analysis: α_n = (T_n - AVG(TPS_{n-1}, TPS_{n-2}, …, TPS_{n-d})) / STDEV(TPS_{n-1}, TPS_{n-2}, …, TPS_{n-d})
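    A minimal sketch of the two analyses above, run over a hypothetical 28-day
    window of daily 1s Max TPS values and a hypothetical Superbowl-day peak T_n
    (the real traffic numbers are Twitter-internal):

    ```python
    # Sketch: Ratio Analysis R_n and Distribution Analysis alpha_n over a
    # d-day historical window. All values are hypothetical.
    import numpy as np

    rng = np.random.default_rng(1)
    window = rng.normal(8000, 1500, 28)  # TPS_{n-1} ... TPS_{n-d}, hypothetical
    t_n = 20_000                         # T_n: 1s Max TPS on Superbowl day, hypothetical

    # Ratio Analysis: R_n = T_n / MAX(window)
    r_n = t_n / window.max()

    # Distribution Analysis: alpha_n = (T_n - AVG(window)) / STDEV(window)
    alpha_n = (t_n - window.mean()) / window.std(ddof=1)

    print(f"R_n = {r_n:.3f}, alpha_n = {alpha_n:.3f}")
    ```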
  • 27. Superbowl 2013: Capacity Planning (cont’d)
    • Ratio Analysis (R_n), 1s Max TPS

          Year   14 Day   28 Day   45 Day
          2011   0.791    0.791    1.007
          2012   1.062    0.858    0.580
  • 28. Superbowl 2013: Capacity Planning (cont’d)
    • Distribution Analysis (α_n)
      - AVG (μ), STDEV (σ)
      - μ increased YoY (expected)
      - σ also increased YoY
    • 1s Max TPS

          Year   T_n / μ   (T_n - μ) / σ
          2011   1.448     1.746
          2012   1.517     2.756

    • Chart: 2011 and 2012 TPS distributions; TPS during the Superbowl has been moving right YoY
  • 29. Superbowl 2013: Capacity Planning (cont’d)
    • Distribution Analysis
      - YoY movement of TPS_Superbowl further into the right tail
        o Expectation: progressive moves would be smaller
      - Overestimate α
        o Handle unplanned events
        o Business decision
  • 30. Superbowl 2013: Capacity Planning (cont’d)
    • Historical component
      - Determine the extent of movement (α_expected) of TPS_Superbowl into the right tail
    • Temporal component
      - Current μ_c
      - Current σ_c
    • Capacity planning (see the sketch below)
      - Plan capacity corresponding to μ_c + α_expected * σ_c
      - Scenario Analysis (à la Global Macro Hedge Funds) for α_expected:
        o α_{n-1} (same as last year)
        o α_{n-1} + (α_{n-1} - α_{n-2})/2 (extrapolate from the last two years)
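    A minimal sketch of the scenario analysis above. The α values are the 2011 and
    2012 figures from slide 28; μ_c and σ_c are hypothetical placeholders for the
    current mean and standard deviation of 1s Max TPS:

    ```python
    # Sketch: planned capacity = mu_c + alpha_expected * sigma_c, evaluated for
    # the two alpha_expected scenarios on this slide.
    alpha_2011, alpha_2012 = 1.746, 2.756  # (T_n - mu)/sigma, from slide 28
    mu_c, sigma_c = 9_000.0, 4_000.0       # hypothetical current mean / stdev of 1s Max TPS

    scenarios = {
        "same as last year": alpha_2012,
        "extrapolate from last two years": alpha_2012 + (alpha_2012 - alpha_2011) / 2,
    }
    for name, alpha_expected in scenarios.items():
        capacity = mu_c + alpha_expected * sigma_c
        print(f"{name}: plan for ~{capacity:,.0f} peak TPS")
    ```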
  • 31. Superbowl 2013: Capacity Planning (cont’d)
    • Capacity planning, 1s Max TPS
      - α_{n-1} → 20K+
      - α_{n-1} + (α_{n-1} - α_{n-2})/2 → 22K+
  • 32. Superbowl 2013: Capacity Planning (cont’d)
    • Validation, 1s Max TPS
      - α_observed < α_expected
      - Twitter was highly available during Superbowl 2013
      - Over-allocation concerns? Minimal, and limited to a few services
      - Seamlessly handled the traffic spike due to the Superbowl 2013 Blackout
  • 33. Join the Flock
    • We are hiring!
      - https://twitter.com/JoinTheFlock
      - https://twitter.com/jobs