Isolating Events from the Fail Whale

QCon NYC 2013
Presentation Transcript
• Slide 1: Isolating Events from the Fail Whale
  Arun Kejariwal, Bryce Yan (@arun_kejariwal, @bryce_yan)
  Capacity Engineering @ Twitter, June 2013
  @Twitter | QCon NY 2013
• Slide 2: Delivering the Best User Experience
  • Performance
    - Real time! The latency tolerance of end users has nosedived [2]
    - Average, p99, p999; variability on large clusters [1]
    - Tolerate faults when using commodity hardware
  • Availability: anytime, anywhere, any device
  • Organic growth: over 200M monthly active users [3]
  • Events: planned and unplanned
  [1] Xu et al., NSDI 2013 - https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final77.pdf
  [2] http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/Berkeley-Latency-Mar2012.pdf
  [3] https://twitter.com/twitter/status/281051652235087872
• Slide 3: High Performance, Availability
  • Capacity planning
    - Throwing hardware at the problem is operationally inefficient
    - Even otherwise: how much? what kind? (inventory management, etc.)
    - A reactive approach degrades user experience and impacts the bottom line
  • Overall goal
    - Deliver the best user experience with a minimal operational footprint
    - Factor in organic growth and lead times for provisioning additional capacity
• Slide 4: Capacity Planning Is Non-trivial
  • Behavioral response is unpredictable
  • Multiplier effect: # retweets × followers of each retweeter, i.e., large fan-out
• Slide 5: Capacity Planning Is Non-trivial (cont'd)
  • Unforeseen events
    - Power failure: "Hurricane Sandy takes data centers offline with flooding, power outages" [1]
    - Network issues: "Amazon's compute cloud has a networking hiccup" [2] [3]
  • Evolving product development landscape
    - New features, new products, new partners
    - "Twitter Arrives on Wall Street, Via Bloomberg" [4]
  [1] http://arstechnica.com/information-technology/2012/10/hurricane-sandy-takes-data-centers-offline-with-flooding-power-outages/
  [2] http://www.zdnet.com/amazons-compute-cloud-has-a-networking-hiccup-7000005776/
  [3] Ballani et al., NSDI 2013 - https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final186.pdf
  [4] http://dealbook.nytimes.com/2013/04/04/twitter-arrives-on-wall-street-via-bloomberg/
• Slide 6: Capacity Planning Is Non-trivial (cont'd)
  • New hardware platforms
    - Purchase pipeline
    - How much and when to buy: a cost/performance trade-off
• Slide 7: Events
  • Planned: still, the traffic pattern is subject to, say,
    - The nature of the event and the behavioral response
    - Community effect and demographics
• Slide 8: Events (cont'd)
  • Unplanned: impact depends on the intensity of the event and the population density
  • Examples: Japan tsunami, New Zealand earthquake, Hurricane Sandy, the Flash Crash, the Egyptian revolution, Iran's disputed election, the Boston explosion, remembering Steve Jobs
• Slide 9: Events (cont'd)
  • Unplanned (transient): duration depends on the type of the transient event
  • Example: White House rumor after the AP account was hacked [1]
  [1] http://finance.yahoo.com/news/stocks-briefly-drop-recover-fake-172814328.html
• Slide 10: Events (cont'd)
  • Black swans (à la Nassim Taleb): planned events, but…
  • Examples: Superbowl '13 blackout, Zidane in "action", the "Hand of God", Usain Bolt's 100m world record
• Slide 11: Events (cont'd)
  • Events timeline [figure: the above events plotted along a time axis]
• Slide 12: Events' Impact
  • Events differ in characteristics: Tweets, photos, Vines, and now Music
  • Consequently, they tax different services, yielding different capacity requests
• Slide 13: Capacity Modeling Overview
• Slide 14: Capacity Modeling
  • Takes core drivers as inputs to generate usage demand
    - Forecasts the amount of work based on core-driver projections
  • Relates the work metric to a primary resource to identify the capacity threshold
    - Primary resources: computing power (CPU, RAM), storage (disk I/O, disk space), network (bandwidth)
  • Generates hardware demand based on the limiting primary resource
• Slide 15: Core Drivers
  • Underlying business metrics that drive demand for more capacity
    - Active users, Tweets per second (TPS), favorites per second (FPS), requests per second (RPS)
  • Normalized by active users to isolate user engagement
  • Project user engagement and active users independently
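The decomposition on this slide, projecting user growth and engagement independently and then recombining them into a core driver, can be sketched in a few lines of Python. All numbers below are illustrative, not Twitter data:

```python
# Sketch of the core-driver decomposition: fit user growth and per-user
# engagement separately, then recombine. Illustrative numbers only.

def linear_fit(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def project(xs, ys, x_future):
    """Extrapolate the linear fit to a future point."""
    a, b = linear_fit(xs, ys)
    return a + b * x_future

weeks = [0, 1, 2, 3]
active_users = [200e6, 202e6, 204e6, 206e6]   # user growth (linear trend)
photos_per_user = [0.30, 0.31, 0.32, 0.33]    # engagement (linear trend)

# Core driver (photos per day) at week 10 = projected users × projected engagement
week = 10
core_driver = project(weeks, active_users, week) * \
              project(weeks, photos_per_user, week)
```

The slide's figures also show a polynomial fit for favorites; a linear fit is used here purely to keep the sketch short.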
• Slide 16: Core Drivers (cont'd)
  [Figures: active-user count over time with a linear fit ("Active Users aka User Growth"); per-active-user favorites (polynomial fit) and retweets (linear fit) over time ("Normalized Core Drivers for Engagement")]
• Slide 17: Core Drivers (cont'd)
  [Figures: user growth (active users, linear fit); engagement (photos per active user, linear fit); core driver (photos per day, actuals and forecast)]
• Slide 18: Capacity Threshold
  • Primary-resource scalability threshold, determined by load testing with
    - Synthetic load
    - Replayed production traffic
    - Real-time production traffic
  • Test systems may be
    - Isolated replicas of production
    - Staging systems in production
    - Production systems
  [Figure: average service response time vs. CPU utilization (0-100%), with the threshold marked at the knee of the curve]
• Slide 19: Hardware Demand
  • Core driver + capacity threshold + scaling formula → server count
  • Example
    - Core driver: requests per second (RPS)
    - Per-server request throughput determined by the capacity threshold
    - Scaling formula for sizing: number of servers = RPS / per-server threshold
  [Figure: core driver (RPS) and server count over time, actuals and forecast]
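The sizing formula above is a one-liner; here it is as code, with a `headroom` parameter that is my addition (not on the slide) for deliberate over-provisioning:

```python
import math

def servers_needed(rps, per_server_threshold, headroom=1.0):
    # Slide 19's sizing formula: server count = RPS / per-server threshold.
    # `headroom` > 1.0 over-provisions; round up, since servers are discrete.
    return math.ceil(rps * headroom / per_server_threshold)

servers_needed(120_000, 1_500)       # -> 80 servers
servers_needed(120_000, 1_500, 1.2)  # -> 96 servers with 20% headroom
```

Rounding up matters: at 120,001 RPS the same threshold already requires an 81st server.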
• Slide 20: Deep Dive and Superbowl 2013
• Slide 21: Events: High-Level Methodology
  • Goal: handle the traffic "spike"
  • Predict expected traffic based on historical and temporal statistical analysis
    - Statistical metrics: average, standard deviation, max
  • Limitations
    - Changing usage patterns: organic growth, behavioral, cultural
    - Event-driven: how will a game turn out?
• Slide 22: Statistical Time Series Analysis
  • Time window: week over week (WoW), month over month (MoM), year over year (YoY)
  • Data distribution: normal, log-normal, multi-modal; has implications for model selection
  • Forecasting
    - Regression models: linear, spline
    - ARIMA: trend, seasonal, and residual components
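The regression and ARIMA models named on the slide need a stats library; as a minimal stand-in for the WoW idea, a seasonal-naive forecast with linear drift can be written directly (a toy sketch, not the deck's method):

```python
def wow_forecast(series, period=7):
    """Seasonal-naive forecast with linear drift: predict the next value as
    the value one period (e.g. one week) ago plus the average per-step growth.
    A toy stand-in for the regression / ARIMA models on the slide."""
    drift = (series[-1] - series[0]) / (len(series) - 1)
    return series[-period] + period * drift
```

On a purely linear series this reproduces the trend exactly; on real traffic it only captures the weekly shape plus growth, which is why the slide reaches for richer models.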
• Slide 23: Superbowl 2013: Capacity Planning
  • Assess capacity requirements based on the 2011 and 2012 Superbowl traffic patterns
  • Core-driver selection: RPS (reads), TPS (writes)
  • What time granularity to use?
    - Avg TPS (Tweets per second)
    - 1 s / 10 s / 15 s / 30 s max TPS
    - 1 min / 5 min / 10 min max TPS
    - 1 hr max TPS
• Slide 24: Superbowl 2013: Capacity Planning (cont'd)
  • Which metric to use? [Figure: TPS at the candidate granularities over time; the series are highly correlated]
• Slide 25: Superbowl 2013: Capacity Planning (cont'd)
  • Which metric to use? Time sensitive: the correlation may change YoY [Figure: the same series in a different year]
• Slide 26: Superbowl 2013: Capacity Planning (cont'd)
  • Approaches, given TPS during the Superbowl (denote it Tn) and a d-day historical window TPSn-1, TPSn-2, …, TPSn-d:
    - Ratio analysis: Rn = Tn / Max(TPSn-1, TPSn-2, …, TPSn-d)
    - Distribution analysis: αn = (Tn - AVG(TPSn-1, TPSn-2, …, TPSn-d)) / STDEV(TPSn-1, TPSn-2, …, TPSn-d)
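Both statistics on this slide are direct to compute; a sketch with made-up window values (the deck does not publish its raw TPS data):

```python
from statistics import mean, stdev

def ratio_and_alpha(t_n, history):
    """Slide 26's two statistics over a d-day window of (say) daily max TPS:
    R_n = T_n / max(window); alpha_n = (T_n - mean(window)) / stdev(window),
    i.e. how many standard deviations the event sits above the window."""
    return t_n / max(history), (t_n - mean(history)) / stdev(history)

# Illustrative numbers, not the deck's data:
r, alpha = ratio_and_alpha(6000, [3000, 3500, 4000, 4500, 5000])
```

Note that αn is just a z-score of the event-day TPS against the window, which is what makes the right-tail reasoning on the following slides possible.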
• Slide 27: Superbowl 2013: Capacity Planning (cont'd)
  • Ratio analysis (Rn), 1 s max TPS:

            14-day   28-day   45-day
      2011  0.791    0.791    1.007
      2012  1.062    0.858    0.580
• Slide 28: Superbowl 2013: Capacity Planning (cont'd)
  • Distribution analysis (αn), with AVG (μ) and STDEV (σ) over the window
    - μ increased YoY (expected); σ also increased YoY
  • 1 s max TPS:

            Tn/μ    (Tn - μ)/σ
      2011  1.448   1.746
      2012  1.517   2.756

  • TPS during the Superbowl has been moving right (into the tail) YoY
  [Figure: 2011 and 2012 TPS distributions]
• Slide 29: Superbowl 2013: Capacity Planning (cont'd)
  • Distribution analysis
    - YoY movement of Superbowl TPS further into the right tail
    - Expectation: progressive moves will be smaller
    - Overestimate α to handle unplanned events; a business decision
• Slide 30: Superbowl 2013: Capacity Planning (cont'd)
  • Historical component: determine the expected extent of movement (αexpected) of Superbowl TPS into the right tail
  • Temporal component: current μc, current σc
  • Capacity planning: plan capacity corresponding to μc + αexpected × σc
  • Scenario analysis (à la global-macro hedge funds) for αexpected:
    - αn-1 (same as last year)
    - αn-1 + (αn-1 + αn-2)/2 (extrapolate from the last two years)
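The two scenarios and the capacity target on this slide can be sketched directly, here seeded with the deck's 2011/2012 α values from slide 28 (the μc and σc inputs would come from current traffic, which the deck does not publish):

```python
def capacity_target(mu_c, sigma_c, alpha_expected):
    """Slide 30: plan capacity for mu_c + alpha_expected * sigma_c."""
    return mu_c + alpha_expected * sigma_c

def alpha_scenarios(alpha_prev, alpha_prev2):
    """The slide's two scenarios for alpha_expected."""
    return {
        "same_as_last_year": alpha_prev,                          # alpha_{n-1}
        "extrapolated": alpha_prev + (alpha_prev + alpha_prev2) / 2,
    }

# 2012 alpha = 2.756, 2011 alpha = 1.746 (from slide 28)
scenarios = alpha_scenarios(2.756, 1.746)
```

Planning against the larger (extrapolated) scenario is the "overestimate α" business decision from the previous slide.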
• Slide 31: Superbowl 2013: Capacity Planning (cont'd)
  • Capacity planning, 1 s max TPS:
    - αn-1 → 20K+
    - αn-1 + (αn-1 + αn-2)/2 → 22K+
• Slide 32: Superbowl 2013: Capacity Planning (cont'd)
  • Validation, 1 s max TPS: αobserved < αexpected
  • Twitter was highly available during Superbowl 2013
  • Over-allocation concerns? Minimal, and limited to a few services
  • Seamlessly handled the traffic spike due to the Superbowl 2013 blackout
• Slide 33: Join the Flock
  • We are hiring!
    - https://twitter.com/JoinTheFlock
    - https://twitter.com/jobs