@Twitter | QCon NY 2013 1
Isolating Events from the Fail Whale
Arun Kejariwal, Bryce Yan
(@arun_kejariwal, @bryce_yan)
Capacity Engineering @ Twitter
June 2013
Delivering Best User Experience
•  Performance
  Real time!
  Latency tolerance of end-users has nosedived
  Average, p99, p999
  Variability on large clusters
  Tolerate faults when using commodity hardware
•  Availability
  Anytime, Anywhere, Any Device
•  Organic Growth
  Over 200M monthly active users
•  Events
  Planned, Unplanned
[1] Xu et al. NSDI 2013 - https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final77.pdf
[2] http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/Berkeley-Latency-Mar2012.pdf
[3] https://twitter.com/twitter/status/281051652235087872
High Performance, Availability
•  Capacity Planning
  Throw hardware at the problem
  Operationally inefficient
  Even otherwise
o  How much?
o  What kind? (Inventory management etc.)
  Reactive approach
  Degraded user experience
o  Impacts the bottom line
  Overall goal
  Deliver best user experience
  Minimal operational footprint 
o  Factor in organic growth and lead times for provisioning additional capacity
Capacity Planning is Non-trivial
•  Behavioral response is unpredictable
•  Multiplier Effect
  # Retweets x Followers of each retweeter
Large fan-out
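As a rough illustration of the multiplier effect, the sketch below estimates how many timelines a single tweet could reach: the author's followers plus the followers of each retweeter. The function name and follower counts are hypothetical, and overlap between audiences is ignored, so this is an upper bound rather than a description of Twitter's delivery path.

```python
def estimated_fanout(author_followers, retweeter_followers):
    """Upper-bound count of timeline deliveries for one tweet.

    retweeter_followers holds one follower count per retweeter
    (hypothetical numbers; audience overlap is ignored).
    """
    return author_followers + sum(retweeter_followers)


# One tweet from an account with 1,000 followers, retweeted by three accounts.
deliveries = estimated_fanout(1_000, [50_000, 2_000, 300])
print(deliveries)  # 53300
```

A single retweet by a large account dominates the total, which is why the behavioral response is hard to plan for.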
Capacity Planning is Non-trivial (cont’d)
•  Unforeseen events
  Power failure
  “Hurricane Sandy takes data centers offline with flooding, power outages” [1]
  Network issues
  “Amazon's compute cloud has a networking hiccup” [2] [3]
•  Evolving product development landscape
  New features
  New products
  New partners
  “Twitter Arrives on Wall Street, Via Bloomberg” [4]
[1] http://arstechnica.com/information-technology/2012/10/hurricane-sandy-takes-data-centers-offline-with-flooding-power-outages/
[2] http://www.zdnet.com/amazons-compute-cloud-has-a-networking-hiccup-7000005776/
[3] Ballani et al. NSDI 2013 - https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final186.pdf
[4] http://dealbook.nytimes.com/2013/04/04/twitter-arrives-on-wall-street-via-bloomberg/
14 June 2013
Capacity Planning is Non-trivial (cont’d)
•  New hardware platforms
  Purchase pipeline
  How much and when to buy – Cost performance trade-off
Events
•  Planned
  Still, the traffic pattern is subject to, say,
  Nature of the event 
  Behavioral response
  Community effect
  Demographics
Events (cont’d)
•  Unplanned
  Intensity of the event
  Population density
  Examples: Japan Tsunami, New Zealand Earthquake, Hurricane Sandy, Flash Crash, Egyptian Revolution, Iran’s Disputed Election, Boston Explosion, Remembering Steve Jobs
Events (cont’d)
•  Unplanned (transient)
  Duration
  Type of the transient event
  Example: White House Rumor: AP account being hacked [1]
[1] http://finance.yahoo.com/news/stocks-briefly-drop-recover-fake-172814328.html
Events (cont’d)
•  Black Swans (à la Nassim Taleb)
  Planned events, but…
  Examples: Superbowl’13 Blackout, Zidane in “Action”, “Hand of God”, Usain Bolt’s 100m World Record
Events (cont’d)
•  Events timeline
[Figure: events timeline; x-axis: Time]
Events’ Impact
•  Differ in characteristics
  Tweets
  Photos
  Vines
  Now, Music
•  Consequently, tax different services
  Different capacity requests
Capacity Modeling Overview
Capacity Modeling
•  Takes core drivers as inputs to generate usage demand
  Forecasts the amount of work based on core driver projections
•  Relates the work metric to a primary resource to identify the capacity threshold
  Primary resources
  Computing power (CPU, RAM)
  Storage (disk I/O, disk space)
  Network (network bandwidth)
•  Generate hardware demand based on the limiting primary resource
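The last step above can be sketched as follows: size the fleet separately for each primary resource, then take the largest answer, since that resource is the limiting one. The function name, the resource names, and the per-server limits are hypothetical; the slides do not give actual thresholds.

```python
import math


def servers_needed(forecast_work, per_server_limits):
    """Return (server_count, limiting_resource).

    per_server_limits maps a primary resource to the amount of work one
    server can absorb before that resource hits its capacity threshold.
    """
    demand = {
        resource: math.ceil(forecast_work / limit)
        for resource, limit in per_server_limits.items()
    }
    limiting = max(demand, key=demand.get)
    return demand[limiting], limiting


# Hypothetical thresholds, in requests per second per server.
limits = {"cpu": 1_200, "disk_io": 2_500, "network": 4_000}
print(servers_needed(30_000, limits))  # (25, 'cpu')
```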
Core Drivers
•  Underlying business metrics that drive demand for more capacity
  Active Users
  Tweets per second (TPS)
  Favorites per second (FPS)
  Requests per second (RPS)
•  Normalized by Active Users to isolate user engagement
•  Project user engagement and Active Users independently
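A minimal sketch of projecting the two factors independently and multiplying them back together. It assumes a constant per-period increment for each factor, which is a simplification rather than the fitted trend lines used in the talk; all names and numbers are hypothetical.

```python
def project_core_driver(active_users, per_user_rate, user_growth, rate_growth, periods):
    """Project a core driver as (active users) x (engagement per active user).

    Each factor is extrapolated on its own with a constant per-period
    increment (a simple linear assumption, not a fitted model).
    """
    forecast = []
    for k in range(1, periods + 1):
        users = active_users + user_growth * k
        rate = per_user_rate + rate_growth * k
        forecast.append(users * rate)
    return forecast


# Hypothetical inputs: 200 users growing by 10 per period,
# 2.0 photos per user per day growing by 0.5 per period.
print(project_core_driver(200, 2.0, 10, 0.5, 3))  # [525.0, 660.0, 805.0]
```

Keeping the factors separate lets a change in user growth be modeled without touching the engagement trend, and vice versa.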
Core Drivers (cont’d)
[Figure: Active Users aka User Growth. Active User Count vs. Time; series: Active Users, Linear (Active Users).]
[Figure: Normalized Core Drivers for Engagement. Per Active User Values vs. Time; series: Favorites, Retweets, Poly. (Favorites), Linear (Retweets).]
Core Drivers (cont’d)
[Figure: User Growth: Active Users vs. Time; series: Active Users, Linear (Active Users).]
[Figure: Engagement: Photos/Active User vs. Time; series: Photos, Linear (Photos).]
[Figure: Core Driver: Photos per Day vs. Time; series: Photos, Photos Forecast.]
Capacity Threshold
•  Primary resource scalability threshold
  Determined by load testing
  Synthetic load
  Replaying production traffic
  Real-time production traffic
  Test systems may be
  Isolated replicas of production
  Staging systems in production
  Production systems
[Figure: Average Response Times vs CPU. Axes: CPU and Service Response Time, with ticks from 0.00 to 100.00 and one point marked X.]
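The slides do not say how the threshold is read off the response-time curve. One plausible rule, assumed here purely for illustration, is to take the highest CPU utilisation at which average response time still meets a latency target. The function name, the sample data, and the 50 ms target are all hypothetical.

```python
def capacity_threshold(samples, max_response_ms):
    """Return the highest CPU utilisation whose response time, and that of
    every lower utilisation sampled, stays within max_response_ms.

    samples is a list of (cpu_percent, avg_response_ms) pairs from a load test.
    Returns None if even the lowest sample violates the target.
    """
    threshold = None
    for cpu, response_ms in sorted(samples):
        if response_ms > max_response_ms:
            break
        threshold = cpu
    return threshold


# Hypothetical load-test results: response time climbs sharply past 70% CPU.
load_test = [(10, 20), (30, 22), (50, 25), (70, 40), (80, 95), (90, 300)]
print(capacity_threshold(load_test, max_response_ms=50))  # 70
```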
Hardware Demand
•  Core driver → capacity threshold → scaling formula → server count
•  Example
  Core driver: Requests per Second
  Per server request throughput determined by capacity threshold
  Scaling formula for Sizing
  Number of Servers = (RPS) / Per Server Threshold
[Figure: Core Driver (RPS) / Server Count vs. Time; series: RPS (Actuals), RPS (Forecast), # Servers (Actuals), # Servers (Forecast).]
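The scaling formula on this slide can be applied to a whole RPS forecast to obtain a server-count forecast. The sketch below rounds up to whole servers; the forecast values and the 800 RPS per-server threshold are hypothetical.

```python
import math


def server_counts(rps_forecast, per_server_threshold):
    """Apply Number of Servers = RPS / Per Server Threshold to each period, rounding up."""
    return [math.ceil(rps / per_server_threshold) for rps in rps_forecast]


# Hypothetical RPS forecast for three periods and a threshold of 800 RPS per server.
counts = server_counts([40_000, 46_000, 53_000], 800)
print(counts)  # [50, 58, 67]

# Servers to add in each later period relative to the one before it.
print([b - a for a, b in zip(counts, counts[1:])])  # [8, 9]
```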
Deep Dive and Superbowl 2013
Events: High Level Methodology
•  Goal
  Handle traffic “spike”
•  Predict expected traffic based on historical and temporal statistical analysis
  Statistical Metrics
  Average
  Standard deviation
  Max
•  Limitations
  Changing usage patterns
  Organic growth, behavioral, cultural 
  Event driven
  How will the game turn out?
Statistical Time Series Analysis
•  Time window
  Week over Week (WoW)
  Month over Month (MoM)
  Year over Year (YoY)
•  Data Distribution
  Normal, Log Normal, Multi-modal
  Has implications on model selection
•  Forecasting
  Regression model
  Linear, Spline
  ARIMA
  Trending, Seasonal, Residuals
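As a minimal sketch of the simplest option listed above, a linear regression forecast, the code below fits a least-squares trend line to an evenly spaced history and extends it. It covers only the trend; the seasonal and residual components that ARIMA-style models handle are out of scope here. Function names and data are hypothetical.

```python
def fit_line(ys):
    """Ordinary least-squares fit of y = intercept + slope * t for t = 0, 1, 2, ..."""
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope


def forecast(ys, steps):
    """Extend the fitted trend line `steps` periods past the end of the history."""
    intercept, slope = fit_line(ys)
    n = len(ys)
    return [intercept + slope * (n + k) for k in range(steps)]


# Hypothetical weekly peak TPS values with a steady upward trend.
history = [100, 110, 120, 130, 140]
print(forecast(history, 2))  # [150.0, 160.0]
```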
Superbowl 2013: Capacity Planning
•  Assess capacity requirement based on 2011, 2012 Superbowl traffic patterns

•  Core driver selection
  RPS (Reads)
  TPS (Writes)

•  What time granularity to use?
  Avg TPS (Tweets per sec)
  1s/10s/15s/30s Max TPS
  1 min/5 min/10 min Max TPS
  1 hr Max TPS
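To see why the granularity matters, the sketch below computes peak TPS from per-second tweet counts at several window sizes. It assumes that "N-second Max TPS" means the highest average rate over consecutive, non-overlapping N-second windows; the slides do not define this, and the counts are hypothetical.

```python
def max_tps(per_second_counts, window_s):
    """Peak average TPS over consecutive, non-overlapping windows of window_s seconds."""
    peaks = []
    for start in range(0, len(per_second_counts) - window_s + 1, window_s):
        window = per_second_counts[start:start + window_s]
        peaks.append(sum(window) / window_s)
    return max(peaks)


# Hypothetical per-second tweet counts containing a short burst.
counts = [10, 12, 11, 90, 14, 10, 11, 13]
print(max_tps(counts, 1))  # 90.0
print(max_tps(counts, 2))  # 50.5
print(max_tps(counts, 4))  # 30.75
```

Coarser windows average the burst away, so a plan sized on a 1 hr maximum can badly underestimate the 1 s peak the services must absorb.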
Superbowl 2013: Capacity Planning (cont’d)
•  Which metric to use?
[Figure: metrics plotted over Time, annotated “Highly correlated”]
Superbowl 2013: Capacity Planning (cont’d)
•  Which metric to use?
  Time sensitive – correlation may change YoY
[Figure: metrics plotted over Time, annotated “Highly correlated”]
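One way to check whether two candidate metrics are highly correlated, and to re-check each year, is the Pearson correlation coefficient. The slides do not say which measure was used, so this choice is an assumption, and the two series below are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical daily values of two candidate metrics.
avg_tps = [100, 120, 90, 150, 130]
max_tps_1s = [210, 250, 185, 310, 265]
r = pearson(avg_tps, max_tps_1s)
print(round(r, 3))
```

A value close to 1 means either metric carries nearly the same information for planning; recomputing it on each year's data shows whether that still holds.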
Superbowl 2013: Capacity Planning (cont’d)
•  Approaches
  $TPS_{Superbowl}$ (denote by $T_n$)
  d-Day historical window
  $TPS_{n-1}, TPS_{n-2}, \ldots, TPS_{n-d}$
  Ratio Analysis
  $R_n = T_n / \max(TPS_{n-1}, TPS_{n-2}, \ldots, TPS_{n-d})$
  Distribution Analysis
  $\alpha_n = (T_n - \mathrm{AVG}(TPS_{n-1}, TPS_{n-2}, \ldots, TPS_{n-d})) / \mathrm{STDEV}(TPS_{n-1}, TPS_{n-2}, \ldots, TPS_{n-d})$
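Both measures can be computed directly from the event-day value and the d-day window. The sketch below assumes STDEV means the sample standard deviation (as in spreadsheet software); the slides do not specify this. Function names and numbers are hypothetical.

```python
import statistics


def ratio_analysis(t_n, window):
    """R_n: event-day TPS divided by the maximum of the d-day historical window."""
    return t_n / max(window)


def distribution_analysis(t_n, window):
    """alpha_n: how many standard deviations the event-day TPS sits above the window mean.

    Uses the sample standard deviation (an assumption).
    """
    return (t_n - statistics.mean(window)) / statistics.stdev(window)


# Hypothetical 3-day window of daily 1s Max TPS and an event-day value.
window = [10_000, 12_000, 14_000]
t_n = 18_000
print(round(ratio_analysis(t_n, window), 3))  # 1.286
print(distribution_analysis(t_n, window))     # 3.0
```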
Superbowl 2013: Capacity Planning (cont’d)
•  Ratio Analysis ($R_n$)
  1s Max TPS
      Year   14 Day   28 Day   45 Day
      2011   0.791    0.791    1.007
      2012   1.062    0.858    0.580
Superbowl 2013: Capacity Planning (cont’d)
•  Distribution Analysis ($\alpha_n$)
  AVG (μ), STDEV (σ)
  μ increased YoY (expected)
  σ also increased YoY
  1s Max TPS
      Year   $T_n/\mu$   $(T_n - \mu)/\sigma$
      2011   1.448       1.746
      2012   1.517       2.756
[Figure: 2011 and 2012 curves with μ marked. Caption: TPS during Superbowl has been moving right YoY.]
Superbowl 2013: Capacity Planning (cont’d)
•  Distribution Analysis
  YoY movement of $TPS_{Superbowl}$ further into the right tail
  Expectation: Progressive moves would be smaller

  Overestimate α
  Handle unplanned events
  Business decision
Superbowl 2013: Capacity Planning (cont’d)
•  Historical component
  Determine extent of movement ($\alpha_{expected}$) of $TPS_{Superbowl}$ into right tail
•  Temporal component
  Current $\mu_c$
  Current $\sigma_c$
•  Capacity planning
  Plan capacity corresponding to $\mu_c + \alpha_{expected} \cdot \sigma_c$
  Scenario Analysis (à la Global Macro Hedge Funds)
  $\alpha_{expected}$
o  $\alpha_{n-1}$ (same as last year)
o  $\alpha_{n-1} + (\alpha_{n-1} + \alpha_{n-2})/2$ (extrapolate from last two years)
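The planning rule and its two scenarios can be sketched as below. The scenario formulas are transcribed as written on the slide; the function names and all input values (current mean, current standard deviation, and the two prior alphas) are hypothetical.

```python
def planned_capacity(mu_c, sigma_c, alpha_expected):
    """Capacity target: current mean plus alpha_expected current standard deviations."""
    return mu_c + alpha_expected * sigma_c


def alpha_scenarios(alpha_prev, alpha_prev2):
    """The two scenarios for alpha_expected, as written on the slide."""
    return {
        "same_as_last_year": alpha_prev,
        "extrapolated": alpha_prev + (alpha_prev + alpha_prev2) / 2,
    }


# Hypothetical current statistics; the alpha values are also made up.
mu_c, sigma_c = 10_000.0, 1_000.0
for name, alpha in alpha_scenarios(2.0, 1.0).items():
    print(name, planned_capacity(mu_c, sigma_c, alpha))
# same_as_last_year 12000.0
# extrapolated 13500.0
```

Which scenario to provision for is the business decision noted earlier: the larger alpha buys headroom for unplanned events at the cost of more hardware.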
Superbowl 2013: Capacity Planning (cont’d)
•  Capacity planning
  1s Max TPS
  $\alpha_{n-1}$ → 20K+
  $\alpha_{n-1} + (\alpha_{n-1} + \alpha_{n-2})/2$ → 22K+
Superbowl 2013: Capacity Planning (cont’d)
•  Validation
  1s Max TPS
  $\alpha_{observed} < \alpha_{expected}$
  Twitter was highly available during Superbowl 2013
  Over-allocation concerns?
  Minimal 
  Limited to few services
  Seamlessly handled traffic spike due to the Superbowl 2013 Blackout
Join the Flock
•  We are hiring!
  https://twitter.com/JoinTheFlock
  https://twitter.com/jobs

More Related Content

Similar to Twitter QCon NY 2013: Isolating Events from the Fail Whale

Data analytics introduction
Data analytics introductionData analytics introduction
Data analytics introductionamiyadash
 
Gunjan insight student conference v2
Gunjan insight student conference v2Gunjan insight student conference v2
Gunjan insight student conference v2Gunjan Kumar
 
Community Structure, Interaction and Evolution Analysis of Online Social Netw...
Community Structure, Interaction and Evolution Analysis of Online Social Netw...Community Structure, Interaction and Evolution Analysis of Online Social Netw...
Community Structure, Interaction and Evolution Analysis of Online Social Netw...Symeon Papadopoulos
 
A Real-time System for Detecting Landslide Reports on Social Media using Arti...
A Real-time System for Detecting Landslide Reports on Social Media using Arti...A Real-time System for Detecting Landslide Reports on Social Media using Arti...
A Real-time System for Detecting Landslide Reports on Social Media using Arti...ferda ofli
 
The STDM Development: Strategic Choices and Design Features
The STDM Development: Strategic Choices and Design FeaturesThe STDM Development: Strategic Choices and Design Features
The STDM Development: Strategic Choices and Design FeaturesGLTN_STDM
 
Spark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike FreedmanSpark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike FreedmanSpark Summit
 
Real Time Analytics: Algorithms and Systems
Real Time Analytics: Algorithms and SystemsReal Time Analytics: Algorithms and Systems
Real Time Analytics: Algorithms and SystemsArun Kejariwal
 
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Data Con LA
 
Analysis of Twitter Data During Hurricane Sandy
Analysis of Twitter Data During Hurricane SandyAnalysis of Twitter Data During Hurricane Sandy
Analysis of Twitter Data During Hurricane SandyCatherine Graham
 
22 - CSIRO - Water Data Management-Sep-17
22 - CSIRO - Water Data Management-Sep-1722 - CSIRO - Water Data Management-Sep-17
22 - CSIRO - Water Data Management-Sep-17indiawrm
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at TwitterPrasad Wagle
 
Mitigating User Experience from 'Breaking Bad': The Twitter Approach [Velocit...
Mitigating User Experience from 'Breaking Bad': The Twitter Approach [Velocit...Mitigating User Experience from 'Breaking Bad': The Twitter Approach [Velocit...
Mitigating User Experience from 'Breaking Bad': The Twitter Approach [Velocit...Piyush Kumar
 
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...C4Media
 
Task Time Series CoronaWhy De
Task Time Series CoronaWhy DeTask Time Series CoronaWhy De
Task Time Series CoronaWhy DeIsaac Godfried
 
Backups and Disaster Recovery for Nonprofits
Backups and Disaster Recovery for NonprofitsBackups and Disaster Recovery for Nonprofits
Backups and Disaster Recovery for NonprofitsCommunity IT Innovators
 
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...Data Con LA
 
Recommending Sequences RecTour 2017
Recommending Sequences RecTour 2017Recommending Sequences RecTour 2017
Recommending Sequences RecTour 2017Gunjan Kumar
 

Similar to Twitter QCon NY 2013: Isolating Events from the Fail Whale (20)

Data analytics introduction
Data analytics introductionData analytics introduction
Data analytics introduction
 
Gunjan insight student conference v2
Gunjan insight student conference v2Gunjan insight student conference v2
Gunjan insight student conference v2
 
Community Structure, Interaction and Evolution Analysis of Online Social Netw...
Community Structure, Interaction and Evolution Analysis of Online Social Netw...Community Structure, Interaction and Evolution Analysis of Online Social Netw...
Community Structure, Interaction and Evolution Analysis of Online Social Netw...
 
A Real-time System for Detecting Landslide Reports on Social Media using Arti...
A Real-time System for Detecting Landslide Reports on Social Media using Arti...A Real-time System for Detecting Landslide Reports on Social Media using Arti...
A Real-time System for Detecting Landslide Reports on Social Media using Arti...
 
The STDM Development: Strategic Choices and Design Features
The STDM Development: Strategic Choices and Design FeaturesThe STDM Development: Strategic Choices and Design Features
The STDM Development: Strategic Choices and Design Features
 
Spark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike FreedmanSpark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike Freedman
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Real Time Analytics: Algorithms and Systems
Real Time Analytics: Algorithms and SystemsReal Time Analytics: Algorithms and Systems
Real Time Analytics: Algorithms and Systems
 
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
 
Analysis of Twitter Data During Hurricane Sandy
Analysis of Twitter Data During Hurricane SandyAnalysis of Twitter Data During Hurricane Sandy
Analysis of Twitter Data During Hurricane Sandy
 
22 - CSIRO - Water Data Management-Sep-17
22 - CSIRO - Water Data Management-Sep-1722 - CSIRO - Water Data Management-Sep-17
22 - CSIRO - Water Data Management-Sep-17
 
Druid @ branch
Druid @ branch Druid @ branch
Druid @ branch
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
 
Advanced Analytics in Banking, CITI
Advanced Analytics in Banking, CITIAdvanced Analytics in Banking, CITI
Advanced Analytics in Banking, CITI
 
Mitigating User Experience from 'Breaking Bad': The Twitter Approach [Velocit...
Mitigating User Experience from 'Breaking Bad': The Twitter Approach [Velocit...Mitigating User Experience from 'Breaking Bad': The Twitter Approach [Velocit...
Mitigating User Experience from 'Breaking Bad': The Twitter Approach [Velocit...
 
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...
 
Task Time Series CoronaWhy De
Task Time Series CoronaWhy DeTask Time Series CoronaWhy De
Task Time Series CoronaWhy De
 
Backups and Disaster Recovery for Nonprofits
Backups and Disaster Recovery for NonprofitsBackups and Disaster Recovery for Nonprofits
Backups and Disaster Recovery for Nonprofits
 
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
 
Recommending Sequences RecTour 2017
Recommending Sequences RecTour 2017Recommending Sequences RecTour 2017
Recommending Sequences RecTour 2017
 

More from Arun Kejariwal

Anomaly Detection At The Edge
Anomaly Detection At The EdgeAnomaly Detection At The Edge
Anomaly Detection At The EdgeArun Kejariwal
 
Serverless Streaming Architectures and Algorithms for the Enterprise
Serverless Streaming Architectures and Algorithms for the EnterpriseServerless Streaming Architectures and Algorithms for the Enterprise
Serverless Streaming Architectures and Algorithms for the EnterpriseArun Kejariwal
 
Sequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time SeriesSequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time SeriesArun Kejariwal
 
Sequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time SeriesSequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time SeriesArun Kejariwal
 
Model Serving via Pulsar Functions
Model Serving via Pulsar FunctionsModel Serving via Pulsar Functions
Model Serving via Pulsar FunctionsArun Kejariwal
 
Designing Modern Streaming Data Applications
Designing Modern Streaming Data ApplicationsDesigning Modern Streaming Data Applications
Designing Modern Streaming Data ApplicationsArun Kejariwal
 
Correlation Analysis on Live Data Streams
Correlation Analysis on Live Data StreamsCorrelation Analysis on Live Data Streams
Correlation Analysis on Live Data StreamsArun Kejariwal
 
Deep Learning for Time Series Data
Deep Learning for Time Series DataDeep Learning for Time Series Data
Deep Learning for Time Series DataArun Kejariwal
 
Correlation Analysis on Live Data Streams
Correlation Analysis on Live Data StreamsCorrelation Analysis on Live Data Streams
Correlation Analysis on Live Data StreamsArun Kejariwal
 
Live Anomaly Detection
Live Anomaly DetectionLive Anomaly Detection
Live Anomaly DetectionArun Kejariwal
 
Finding bad apples early: Minimizing performance impact
Finding bad apples early: Minimizing performance impactFinding bad apples early: Minimizing performance impact
Finding bad apples early: Minimizing performance impactArun Kejariwal
 
Statistical Learning Based Anomaly Detection @ Twitter
Statistical Learning Based Anomaly Detection @ TwitterStatistical Learning Based Anomaly Detection @ Twitter
Statistical Learning Based Anomaly Detection @ TwitterArun Kejariwal
 
Days In Green (DIG): Forecasting the life of a healthy service
Days In Green (DIG): Forecasting the life of a healthy serviceDays In Green (DIG): Forecasting the life of a healthy service
Days In Green (DIG): Forecasting the life of a healthy serviceArun Kejariwal
 
Techniques for Minimizing Cloud Footprint
Techniques for Minimizing Cloud FootprintTechniques for Minimizing Cloud Footprint
Techniques for Minimizing Cloud FootprintArun Kejariwal
 
A Tool for Practical Garbage Collection Analysis In the Cloud
A Tool for Practical Garbage Collection Analysis In the CloudA Tool for Practical Garbage Collection Analysis In the Cloud
A Tool for Practical Garbage Collection Analysis In the CloudArun Kejariwal
 

More from Arun Kejariwal (16)

Anomaly Detection At The Edge
Anomaly Detection At The EdgeAnomaly Detection At The Edge
Anomaly Detection At The Edge
 
Serverless Streaming Architectures and Algorithms for the Enterprise
Serverless Streaming Architectures and Algorithms for the EnterpriseServerless Streaming Architectures and Algorithms for the Enterprise
Serverless Streaming Architectures and Algorithms for the Enterprise
 
Sequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time SeriesSequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time Series
 
Sequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time SeriesSequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time Series
 
Model Serving via Pulsar Functions
Model Serving via Pulsar FunctionsModel Serving via Pulsar Functions
Model Serving via Pulsar Functions
 
Designing Modern Streaming Data Applications
Designing Modern Streaming Data ApplicationsDesigning Modern Streaming Data Applications
Designing Modern Streaming Data Applications
 
Correlation Analysis on Live Data Streams
Correlation Analysis on Live Data StreamsCorrelation Analysis on Live Data Streams
Correlation Analysis on Live Data Streams
 
Deep Learning for Time Series Data
Deep Learning for Time Series DataDeep Learning for Time Series Data
Deep Learning for Time Series Data
 
Correlation Analysis on Live Data Streams
Correlation Analysis on Live Data StreamsCorrelation Analysis on Live Data Streams
Correlation Analysis on Live Data Streams
 
Live Anomaly Detection
Live Anomaly DetectionLive Anomaly Detection
Live Anomaly Detection
 
Finding bad apples early: Minimizing performance impact
Finding bad apples early: Minimizing performance impactFinding bad apples early: Minimizing performance impact
Finding bad apples early: Minimizing performance impact
 
Velocity 2015-final
Velocity 2015-finalVelocity 2015-final
Velocity 2015-final
 
Statistical Learning Based Anomaly Detection @ Twitter
Statistical Learning Based Anomaly Detection @ TwitterStatistical Learning Based Anomaly Detection @ Twitter
Statistical Learning Based Anomaly Detection @ Twitter
 
Days In Green (DIG): Forecasting the life of a healthy service
Days In Green (DIG): Forecasting the life of a healthy serviceDays In Green (DIG): Forecasting the life of a healthy service
Days In Green (DIG): Forecasting the life of a healthy service
 
Techniques for Minimizing Cloud Footprint
Techniques for Minimizing Cloud FootprintTechniques for Minimizing Cloud Footprint
Techniques for Minimizing Cloud Footprint
 
A Tool for Practical Garbage Collection Analysis In the Cloud
A Tool for Practical Garbage Collection Analysis In the CloudA Tool for Practical Garbage Collection Analysis In the Cloud
A Tool for Practical Garbage Collection Analysis In the Cloud
 

Recently uploaded

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 

Recently uploaded (20)

08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Twitter QCon NY 2013: Isolating Events from the Fail Whale

  • 1. @Twitter | QCon NY 2013 1 Isolating Events from the Fail Whale Arun Kejariwal, Bryce Yan (@arun_kejariwal, @bryce_yan) Capacity Engineering @ Twitter June 2013
  • 2. Delivering Best User Experience
      • Performance
          - Real time!
          - Latency tolerance of end-users has nosedived
          - Average, p99, p999
          - Variability on large clusters
          - Tolerate faults when using commodity hardware
      • Availability
          - Anytime, Anywhere, Any Device
      • Organic Growth
          - Over 200M monthly active users
      • Events
          - Planned, Unplanned
      References
          [1] Xu et al. NSDI 2013 - https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final77.pdf
          [2] http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/Berkeley-Latency-Mar2012.pdf
          [3] https://twitter.com/twitter/status/281051652235087872
  • 3. High Performance, Availability
      • Capacity Planning
          - Throw hardware at the problem
          - Operationally inefficient
          - Even otherwise
              o How much?
              o What kind? (Inventory management etc.)
          - Reactive approach
          - Degraded user experience
              o Impacts the bottom line
          - Overall goal
          - Deliver best user experience
          - Minimal operational footprint
              o Factor in organic growth and lead times for provisioning additional capacity
  • 4. Capacity Planning is Non-trivial
      • Behavioral response is unpredictable
      • Multiplier Effect
          - # Retweets x Followers of each retweeter
          - Large fan-out
  • 5. Capacity Planning is Non-trivial (cont’d)
      • Unforeseen events
          - Power failure
          - “Hurricane Sandy takes data centers offline with flooding, power outages” [1]
          - Network issues
          - “Amazon's compute cloud has a networking hiccup” [2]
      • Evolving product development landscape
          - New features
          - New products
          - New partners
          - “Twitter Arrives on Wall Street, Via Bloomberg” [4]
      References
          [1] http://arstechnica.com/information-technology/2012/10/hurricane-sandy-takes-data-centers-offline-with-flooding-power-outages/
          [2] http://www.zdnet.com/amazons-compute-cloud-has-a-networking-hiccup-7000005776/
          [3] Ballani et al. NSDI 2013 - https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final186.pdf
          [4] http://dealbook.nytimes.com/2013/04/04/twitter-arrives-on-wall-street-via-bloomberg/
      14 June 2013
  • 6. Capacity Planning is Non-trivial (cont’d)
      • New hardware platforms
          - Purchase pipeline
          - How much and when to buy – Cost performance trade-off
  • 7. Events
      • Planned
          - Still, traffic pattern subject to, say,
          - Nature of the event
          - Behavioral response
          - Community effect
          - Demographics
  • 8. Events (cont’d)
      • Unplanned
          - Intensity of the event
          - Population density
      Examples shown: Japan Tsunami; New Zealand Earthquake; Hurricane Sandy; Flash Crash; Egyptian Revolution; Iran’s Disputed Election; Boston Explosion; Remembering Steve Jobs
  • 9. Events (cont’d)
      • Unplanned (transient)
          - Duration
          - Type of the transient event
      Example shown: White House Rumor: AP account being hacked [1]
      References
          [1] http://finance.yahoo.com/news/stocks-briefly-drop-recover-fake-172814328.html
  • 10. Events (cont’d)
      • Black Swans (à la Nassim Taleb)
          - Planned events, but…
      Examples shown: Superbowl’13 Blackout; Zidane in “Action”; “Hand of God”; Usain Bolt’s 100m World Record
  • 11. Events (cont’d)
      • Events timeline
      [Figure: events timeline; x-axis: Time]
  • 12. Events’ Impact
      • Differ in characteristics
          - Tweets
          - Photos
          - Vines
          - Now, Music
      • Consequently, tax different services
          - Different capacity requests
  • 13. Capacity Modeling Overview
  • 14. Capacity Modeling
      • Takes core drivers as inputs to generate usage demand
          - Forecasts the amount of work based on core driver projections
      • Relates the work metric to a primary resource to identify the capacity threshold
          - Primary resources
          - Computing power (CPU, RAM)
          - Storage (disk I/O, disk space)
          - Network (network bandwidth)
      • Generate hardware demand based on the limiting primary resource
  • 15. Core Drivers
      • Underlying business metrics that drive demand for more capacity
          - Active Users
          - Tweets per second (TPS)
          - Favorites per second (FPS)
          - Requests per second (RPS)
      • Normalized by Active Users to isolate user engagement
      • Project user engagement and Active Users independently
  • 16. Core Drivers (cont’d)
      [Figure: Active Users aka User Growth; y-axis: Active User Count, x-axis: Time; series: Active Users, Linear (Active Users)]
      [Figure: Normalized Core Drivers for Engagement; y-axis: Per Active User Values, x-axis: Time; series: Favorites, Retweets, Poly. (Favorites), Linear (Retweets)]
  • 17. Core Drivers (cont’d)
      [Figure: User Growth: Active Users; x-axis: Time; series: Active Users, Linear (Active Users)]
      [Figure: Engagement: Photos/Active User; x-axis: Time; series: Photos, Linear (Photos)]
      [Figure: Core Driver: Photos per Day; x-axis: Time; series: Photos, Photos Forecast]
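The decomposition on the Core Drivers slides (project Active Users and per-user engagement independently, then multiply them to get the core driver) can be sketched as follows. This is a minimal sketch, assuming simple linear trends for both series; the helper names `fit_line` and `forecast_core_driver` and all sample numbers are hypothetical and do not come from the talk.

```python
from typing import Sequence, Tuple


def fit_line(ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * t + intercept over t = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean


def forecast_core_driver(
    active_users: Sequence[float],
    per_user_rate: Sequence[float],
    steps_ahead: int,
) -> float:
    """Project users and engagement independently, then multiply the projections."""
    if len(active_users) != len(per_user_rate):
        raise ValueError("both series must cover the same periods")
    t = len(active_users) - 1 + steps_ahead  # time index being forecast
    u_slope, u_icpt = fit_line(active_users)
    e_slope, e_icpt = fit_line(per_user_rate)
    return (u_slope * t + u_icpt) * (e_slope * t + e_icpt)


# Hypothetical history: active users (millions) and photos per active user per day.
users = [100.0, 110.0, 120.0, 130.0]
photos_per_user = [0.10, 0.12, 0.14, 0.16]
photos_per_day = forecast_core_driver(users, photos_per_user, steps_ahead=2)
print(f"Forecast photos per day (millions): {photos_per_day:.2f}")
```

Keeping the two projections separate means a change in user growth and a change in engagement can be modeled with different curve types, as the slides do with linear and polynomial trend lines.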
  • 18. Capacity Threshold
      • Primary resource scalability threshold
          - Determined by load testing
          - Synthetic load
          - Replaying production traffic
          - Real-time production traffic
          - Test systems may be
          - Isolated replicas of production
          - Staging systems in production
          - Production systems
      [Figure: Average Response Times vs CPU; y-axis: Service Response Time, x-axis: CPU (0 to 100)]
  • 19. Hardware Demand
      • Core driver → capacity threshold → scaling formula → server count
      • Example
          - Core driver: Requests per Second
          - Per server request throughput determined by capacity threshold
          - Scaling formula for Sizing
          - Number of Servers = (RPS) / Per Server Threshold
      [Figure: y-axis: Core Driver (RPS) / Server Count, x-axis: Time; series: RPS (Actuals), RPS (Forecast), # Servers (Actuals), # Servers (Forecast)]
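The scaling formula on the Hardware Demand slide (Number of Servers = RPS / Per Server Threshold) can be sketched as below. Rounding up to a whole server is an assumption added here, and the forecast values and the threshold of 1,000 RPS per server are made-up illustration numbers, not figures from the talk.

```python
import math
from typing import List, Sequence


def servers_needed(rps: float, per_server_threshold: float) -> int:
    """Servers required to serve `rps` when each server sustains `per_server_threshold` RPS."""
    if per_server_threshold <= 0:
        raise ValueError("per-server threshold must be positive")
    # Round up: a fractional server still has to be provisioned as a whole one.
    return math.ceil(rps / per_server_threshold)


def server_forecast(rps_forecast: Sequence[float], per_server_threshold: float) -> List[int]:
    """Apply the scaling formula to every point of an RPS forecast."""
    return [servers_needed(rps, per_server_threshold) for rps in rps_forecast]


# Hypothetical RPS forecast and load-tested per-server threshold.
forecast = [90_000, 120_000, 150_500]
counts = server_forecast(forecast, per_server_threshold=1_000)
print(counts)
```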
  • 20. Deep Dive and Superbowl 2013
  • 21. Events: High Level Methodology
      • Goal
          - Handle traffic “spike”
      • Predict expected traffic based on historical and temporal statistical analysis
          - Statistical Metrics
          - Average
          - Standard deviation
          - Max
      • Limitations
          - Changing usage patterns
          - Organic growth, behavioral, cultural
          - Event driven
          - How would a game turn out?
  • 22. Statistical Time Series Analysis
      • Time window
          - Week over Week (WoW)
          - Month over Month (MoM)
          - Year over Year (YoY)
      • Data Distribution
          - Normal, Log Normal, Multi-modal
          - Has implications for model selection
      • Forecasting
          - Regression model
          - Linear, Spline
          - ARIMA
          - Trending, Seasonal, Residuals
  • 23. Superbowl 2013: Capacity Planning
      • Assess capacity requirement based on 2011, 2012 Superbowl traffic patterns
      • Core driver selection
          - RPS (Reads)
          - TPS (Writes)
      • What time granularity to use?
          - Avg TPS (Tweets per sec)
          - 1s/10s/15s/30s Max TPS
          - 1 min/5 min/10 min Max TPS
          - 1 hr Max TPS
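To make the granularity question concrete, the sketch below computes a "Max TPS" figure at several window sizes from per-second tweet counts. It assumes that "N-second Max TPS" means the highest average tweets per second over any sliding window of N seconds; the slide does not define the metric, so this reading, the function name `max_tps`, and the sample data are all assumptions.

```python
from typing import Sequence


def max_tps(per_second_counts: Sequence[int], window_s: int) -> float:
    """Highest average TPS over any sliding window of `window_s` seconds."""
    n = len(per_second_counts)
    if not 0 < window_s <= n:
        raise ValueError("window must be between 1 and the series length")
    window_sum = sum(per_second_counts[:window_s])
    best = window_sum
    for i in range(window_s, n):
        # Slide the window one second forward.
        window_sum += per_second_counts[i] - per_second_counts[i - window_s]
        best = max(best, window_sum)
    return best / window_s


# Hypothetical per-second tweet counts with a one-second spike.
counts = [10, 10, 50, 10, 10, 10]
for window in (1, 2, 3, len(counts)):
    print(f"{window}s Max TPS: {max_tps(counts, window):.2f}")
```

Short windows preserve the spike while long windows average it away, which is why the choice of granularity changes the capacity number that comes out of the analysis.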
  • 24. Superbowl 2013: Capacity Planning (cont’d)
      • Which metric to use?
      [Figure: time series; x-axis: Time; annotation: “Highly correlated”]
  • 25. Superbowl 2013: Capacity Planning (cont’d)
      • Which metric to use?
          - Time sensitive – correlation may change YoY
      [Figure: time series; x-axis: Time; annotation: “Highly correlated”]
  • 26. Superbowl 2013: Capacity Planning (cont’d)
      • Approaches
          - $\mathrm{TPS}_{\text{Superbowl}}$ (denote by $T_n$)
          - d-Day historical window
          - $\mathrm{TPS}_{n-1}, \mathrm{TPS}_{n-2}, \ldots, \mathrm{TPS}_{n-d}$
          - Ratio Analysis
          - $R_n = T_n / \max(\mathrm{TPS}_{n-1}, \mathrm{TPS}_{n-2}, \ldots, \mathrm{TPS}_{n-d})$
          - Distribution Analysis
          - $\alpha_n = (T_n - \mathrm{AVG}(\mathrm{TPS}_{n-1}, \mathrm{TPS}_{n-2}, \ldots, \mathrm{TPS}_{n-d})) / \mathrm{STDEV}(\mathrm{TPS}_{n-1}, \mathrm{TPS}_{n-2}, \ldots, \mathrm{TPS}_{n-d})$
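The two approaches on the slide above can be sketched directly from their definitions. This is a minimal sketch: it assumes STDEV means the sample standard deviation (the slide does not say which variant was used), and the history values and event-day value are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def ratio_analysis(t_n: float, history: Sequence[float]) -> float:
    """R_n: event-day TPS divided by the max TPS in the d-day historical window."""
    return t_n / max(history)


def distribution_analysis(t_n: float, history: Sequence[float]) -> float:
    """alpha_n: how many standard deviations the event-day TPS sits above the window mean."""
    # Assumption: sample standard deviation (statistics.stdev), not population.
    return (t_n - mean(history)) / stdev(history)


# Hypothetical d = 3 day window of daily Max TPS, and a hypothetical event-day value.
history = [8.0, 10.0, 12.0]
t_n = 15.0
r_n = ratio_analysis(t_n, history)
alpha_n = distribution_analysis(t_n, history)
print(f"R_n = {r_n:.3f}, alpha_n = {alpha_n:.3f}")
```

The ratio depends only on the single largest day in the window, while the distribution metric uses the whole window, so the two can react quite differently to the choice of d.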
  • 27. Superbowl 2013: Capacity Planning (cont’d)
      • Ratio Analysis ($R_n$)
          - 1s Max TPS
                    14 Day   28 Day   45 Day
            2011    0.791    0.791    1.007
            2012    1.062    0.858    0.580
  • 28. Superbowl 2013: Capacity Planning (cont’d)
      • Distribution Analysis ($\alpha_n$)
          - AVG (μ), STDEV (σ)
          - μ increased YoY (expected)
          - σ also increased YoY
          - 1s Max TPS
                    $T_n/\mu$   $(T_n - \mu)/\sigma$
            2011    1.448       1.746
            2012    1.517       2.756
      [Figure: TPS distributions for 2011 and 2012 with μ marked; caption: TPS during Superbowl has been moving right YoY]
  • 29. Superbowl 2013: Capacity Planning (cont’d)
      • Distribution Analysis
          - YoY movement of $\mathrm{TPS}_{\text{Superbowl}}$ further into the right tail
          - Expectation: Progressive moves would be smaller
          - Overestimate α
          - Handle unplanned events
          - Business decision
  • 30. Superbowl 2013: Capacity Planning (cont’d)
      • Historical component
          - Determine extent of movement ($\alpha_{\text{expected}}$) of $\mathrm{TPS}_{\text{Superbowl}}$ into right tail
      • Temporal component
          - Current $\mu_c$
          - Current $\sigma_c$
      • Capacity planning
          - Plan capacity corresponding to $\mu_c + \alpha_{\text{expected}} \cdot \sigma_c$
          - Scenario Analysis (à la Global Macro Hedge Funds)
          - $\alpha_{\text{expected}}$
              o $\alpha_{n-1}$ (same as last year)
              o $\alpha_{n-1} + (\alpha_{n-1} + \alpha_{n-2})/2$ (extrapolate from last two years)
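The planning rule on the slide above (capacity = current mean plus the expected α times the current standard deviation, evaluated under two scenarios) can be sketched as follows. The α inputs are the 1s Max TPS values reported for 2011 and 2012 in the Distribution Analysis slide; the current mean and standard deviation are hypothetical, so the printed capacities are illustrative only and are not the figures from the talk. The second scenario follows the slide's formula exactly as written.

```python
from typing import Dict


def planned_capacity(mu_c: float, sigma_c: float, alpha_expected: float) -> float:
    """Capacity target: mu_c + alpha_expected * sigma_c."""
    return mu_c + alpha_expected * sigma_c


def alpha_scenarios(alpha_prev: float, alpha_prev2: float) -> Dict[str, float]:
    """The two alpha_expected scenarios listed on the slide."""
    return {
        # alpha_{n-1}: same as last year
        "same_as_last_year": alpha_prev,
        # alpha_{n-1} + (alpha_{n-1} + alpha_{n-2}) / 2, as written on the slide
        "extrapolated": alpha_prev + (alpha_prev + alpha_prev2) / 2,
    }


# alpha values for 2012 and 2011 as reported in the talk (1s Max TPS).
scenarios = alpha_scenarios(alpha_prev=2.756, alpha_prev2=1.746)

# Hypothetical current mean and standard deviation of 1s Max TPS.
mu_c, sigma_c = 10_000.0, 2_000.0
plan = {name: planned_capacity(mu_c, sigma_c, a) for name, a in scenarios.items()}
for name, capacity in plan.items():
    print(f"{name}: alpha = {scenarios[name]:.3f}, capacity = {capacity:.0f} TPS")
```

Planning for the larger scenario deliberately overestimates α, which matches the talk's stated preference for leaving headroom for unplanned events.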
  • 31. Superbowl 2013: Capacity Planning (cont’d)
      • Capacity planning
          - 1s Max TPS
          - $\alpha_{n-1}$ → 20K+
          - $\alpha_{n-1} + (\alpha_{n-1} + \alpha_{n-2})/2$ → 22K+
  • 32. Superbowl 2013: Capacity Planning (cont’d)
      • Validation
          - 1s Max TPS
          - $\alpha_{\text{observed}} < \alpha_{\text{expected}}$
          - Twitter was highly available during Superbowl 2013
          - Over-allocation concerns?
          - Minimal
          - Limited to few services
          - Seamlessly handled traffic spike due to the Superbowl 2013 Blackout
  • 33. Join the Flock
      • We are hiring!
          - https://twitter.com/JoinTheFlock
          - https://twitter.com/jobs