@Twitter | QCon NY 2013 1
Isolating Events from the Fail Whale
Arun Kejariwal, Bryce Yan
(@arun_kejariwal, @bryce_yan)
Capacity Engineering @ Twitter
June 2013
Delivering Best User Experience
•  Performance
  Real time!
  Latency tolerance of end-users has nosedived
  Average, p99, p999
  Variability on large clusters
  Tolerate faults when using commodity hardware
•  Availability
  Anytime, Anywhere, Any Device
•  Organic Growth
  Over 200M monthly active users
•  Events
  Planned, Unplanned
[1] Xu et al. NSDI 2013 - https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final77.pdf
[2] http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/Berkeley-Latency-Mar2012.pdf
[3] https://twitter.com/twitter/status/281051652235087872
High Performance, Availability
•  Capacity Planning
  Throw hardware at the problem
  Operationally inefficient
  Even otherwise
o  How much?
o  What kind? (Inventory management etc.)
  Reactive approach
  Degraded user experience
o  Impacts the bottom line
  Overall goal
  Deliver best user experience
  Minimal operational footprint 
o  Factor in organic growth and lead times for provisioning additional capacity
Capacity Planning is Non-trivial
•  Behavioral response is unpredictable
•  Multiplier Effect
  # Retweets x Followers of each retweeter
Large fan-out
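A rough sketch of the multiplier effect, assuming fan-out can be approximated by summing the follower counts of every retweeter. The function name and the follower counts are hypothetical, and overlap between follower sets is ignored, so the result is only an upper bound.

```python
def retweet_fan_out(retweeter_follower_counts):
    """Upper bound on timelines reached through retweets of one tweet.

    Sums the followers of each retweeter; users who follow several
    retweeters are counted more than once.
    """
    return sum(retweeter_follower_counts)


# Hypothetical follower counts: one retweeter with a large audience dominates.
followers = [150, 12_000, 430, 2_500_000]
print(retweet_fan_out(followers))  # 2512580
```

A single retweeter with a very large audience can dominate the total, which is one reason the behavioral response is hard to predict.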
Capacity Planning is Non-trivial (cont’d)
•  Unforeseen events
  Power failure
  “Hurricane Sandy takes data centers offline with flooding, power outages” [1]
  Network issues
  “Amazon's compute cloud has a networking hiccup” [2] [3]
•  Evolving product development landscape
  New features
  New products
  New partners
  “Twitter Arrives on Wall Street, Via Bloomberg” [4]
[1] http://arstechnica.com/information-technology/2012/10/hurricane-sandy-takes-data-centers-offline-with-flooding-power-outages/
[2] http://www.zdnet.com/amazons-compute-cloud-has-a-networking-hiccup-7000005776/
[3] Ballani et al. NSDI 2013 - https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final186.pdf
[4] http://dealbook.nytimes.com/2013/04/04/twitter-arrives-on-wall-street-via-bloomberg/
14 June 2013
Capacity Planning is Non-trivial (cont’d)
•  New hardware platforms
  Purchase pipeline
  How much and when to buy – Cost performance trade-off
Events
•  Planned
  Still, the traffic pattern is subject to, say:
  Nature of the event 
  Behavioral response
  Community effect
  Demographics
Events (cont’d)
•  Unplanned
  Intensity of the event
  Population density
  Examples: Japan Tsunami, New Zealand Earthquake, Hurricane Sandy, Flash Crash, Egyptian Revolution, Iran’s Disputed Election, Boston Explosion, Remembering Steve Jobs
Events (cont’d)
•  Unplanned (transient)
  Duration
  Type of the transient event
White House Rumor: AP account being hacked [1]
[1] http://finance.yahoo.com/news/stocks-briefly-drop-recover-fake-172814328.html
Events (cont’d)
•  Black Swans (à la Nassim Taleb)
  Planned events, but…
  Examples: Superbowl ’13 Blackout, Zidane in “Action”, “Hand of God”, Usain Bolt’s 100m World Record
Events (cont’d)
•  Events timeline
[Figure: timeline of events; x-axis: Time]
Events’ Impact
•  Differ in characteristics
  Tweets
  Photos
  Vines
  Now, Music
•  Consequently, tax different services
  Different capacity requests
Capacity Modeling Overview
Capacity Modeling
•  Takes core drivers as inputs to generate usage demand
  Forecasts the amount of work based on core driver projections
•  Relates the work metric to a primary resource to identify the capacity threshold
  Primary resources
  Computing power (CPU, RAM)
  Storage (disk I/O, disk space)
  Network (network bandwidth)
•  Generates hardware demand based on the limiting primary resource
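To make the "limiting primary resource" idea concrete, here is a minimal sketch. It assumes demand and per-server thresholds are expressed in the same units per resource; the resource names and all numbers are hypothetical and not taken from the talk.

```python
import math


def limiting_resource(demand, per_server_threshold):
    """Return (resource, servers, per_resource) for the most constrained resource.

    demand: forecast usage per resource, e.g. {"cpu_cores": 4200.0}
    per_server_threshold: usable capacity of one server per resource
    """
    per_resource = {
        name: math.ceil(demand[name] / per_server_threshold[name])
        for name in demand
    }
    # The resource that needs the most servers dictates hardware demand.
    limiting = max(per_resource, key=per_resource.get)
    return limiting, per_resource[limiting], per_resource


# Hypothetical forecast demand and per-server thresholds.
demand = {"cpu_cores": 4200.0, "disk_iops": 900_000.0, "network_gbps": 310.0}
threshold = {"cpu_cores": 16.0, "disk_iops": 5_000.0, "network_gbps": 1.0}

name, servers, detail = limiting_resource(demand, threshold)
print(name, servers, detail)
```

With these made-up inputs the network is the bottleneck, so the server count is sized for network bandwidth even though CPU and disk would need fewer machines.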
Core Drivers
•  Underlying business metrics that drive demand for more capacity
  Active Users
  Tweets per second (TPS)
  Favorites per second (FPS)
  Requests per second (RPS)
•  Normalized by Active Users to isolate user engagement
•  Project user engagement and Active Users independently
Core Drivers (cont’d)
[Figure: Active Users aka User Growth. Active user count over time, with a linear trend line.]
[Figure: Normalized Core Drivers for Engagement. Per-active-user values of Favorites and Retweets over time, with a polynomial trend line for Favorites and a linear trend line for Retweets.]
Core Drivers (cont’d)
[Figure: User Growth: Active Users over time, with a linear trend line.]
[Figure: Engagement: Photos/Active User over time, with a linear trend line.]
[Figure: Core Driver: Photos per Day over time. Series: Photos, Photos Forecast.]
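The three panels above suggest a simple composition: project Active Users and per-user engagement independently, then multiply them to forecast the core driver. The sketch below assumes linear trends for both series (as in the trend lines on the slide); the helper names and the sample data are hypothetical.

```python
def fit_line(ys):
    """Ordinary least-squares line through (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_core_driver(active_users, per_user_engagement, horizon):
    """Forecast = projected active users x projected per-user engagement."""
    user_slope, user_intercept = fit_line(active_users)
    eng_slope, eng_intercept = fit_line(per_user_engagement)
    t = len(active_users) - 1 + horizon  # time index of the forecast point
    projected_users = user_slope * t + user_intercept
    projected_engagement = eng_slope * t + eng_intercept
    return projected_users * projected_engagement


# Hypothetical history: users in millions, photos per active user per day.
users = [100, 110, 120, 130]
photos_per_user = [1.0, 1.1, 1.2, 1.3]
print(forecast_core_driver(users, photos_per_user, horizon=2))
```

Keeping the two projections separate means a change in engagement (say, a new feature) can be modeled without refitting the user-growth trend.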
Capacity Threshold
•  Primary resource scalability threshold
  Determined by load testing
  Synthetic load
  Replaying production traffic
  Real-time production traffic
  Test systems may be
  Isolated replicas of production
  Staging systems in production
  Production systems
[Figure: Average Response Times vs CPU. Axes: Service Response Time and CPU (ticks from 0.00 to 100.00); one point on the curve is marked with an X.]
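One way to turn load-test results into a capacity threshold is to take the highest CPU level reached before response time first exceeds an acceptable limit. This is a sketch under that assumption; the function name, the latency limit, and the samples are hypothetical, and the talk does not state how the threshold was actually picked.

```python
def capacity_threshold(samples, max_response_ms):
    """Highest CPU level observed before latency first exceeds the limit.

    samples: (cpu_percent, avg_response_ms) pairs from a load test.
    Returns None if even the lowest CPU level violates the limit.
    """
    threshold = None
    for cpu, latency in sorted(samples):
        if latency > max_response_ms:
            break
        threshold = cpu
    return threshold


# Hypothetical load-test measurements: latency climbs sharply past 70% CPU.
load_test = [(10, 20.0), (30, 22.0), (50, 25.0), (70, 40.0), (90, 120.0)]
print(capacity_threshold(load_test, max_response_ms=50.0))  # 70
```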
Hardware Demand
•  Core driver → capacity threshold → scaling formula → server count
•  Example
  Core driver: Requests per Second
  Per server request throughput determined by capacity threshold
  Scaling formula for Sizing
  Number of Servers = (RPS) / Per Server Threshold
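The scaling formula can be sketched directly. Rounding up is an assumption added here (a fractional server cannot be provisioned), and the RPS forecast and per-server threshold are hypothetical numbers.

```python
import math


def server_count(rps, per_server_rps):
    """Number of Servers = RPS / Per Server Threshold, rounded up."""
    if per_server_rps <= 0:
        raise ValueError("per-server threshold must be positive")
    return math.ceil(rps / per_server_rps)


# Hypothetical RPS forecast and a per-server threshold of 450 RPS.
forecast_rps = [180_000, 195_000, 212_000]
print([server_count(rps, 450) for rps in forecast_rps])  # [400, 434, 472]
```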
[Figure: Core Driver (RPS) and Server Count over time. Series: RPS (Actuals), RPS (Forecast), # Servers (Actuals), # Servers (Forecast).]
Deep Dive and Superbowl 2013
Events: High Level Methodology
•  Goal
  Handle traffic “spike”
•  Predict expected traffic based on historical and temporal statistical analysis
  Statistical Metrics
  Average
  Standard deviation
  Max
•  Limitations
  Changing usage patterns
  Organic growth, behavioral, cultural 
  Event driven
  How would a game turn out?
Statistical Time Series Analysis
•  Time window
  Week over Week (WoW)
  Month over Month (MoM)
  Year over Year (YoY)
•  Data Distribution
  Normal, Log Normal, Multi-modal
  Has implications on model selection
•  Forecasting
  Regression model
  Linear, Spline
  ARIMA
  Trending, Seasonal, Residuals
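The trend, seasonal, and residual split can be illustrated with a classical additive decomposition. This is a simplified stand-in, not ARIMA and not necessarily the method used in the talk: it assumes an odd seasonal period (for example 7 for daily data with a weekly cycle) and a series covering at least two full periods.

```python
def decompose(series, period):
    """Additive decomposition: series = trend + seasonal + residual.

    Trend is a centered moving average over one period (period must be odd);
    seasonal is the mean detrended value at each position in the period.
    Entries near the edges, where the window does not fit, are None.
    """
    if period % 2 == 0:
        raise ValueError("this sketch only handles odd periods")
    half = period // 2
    n = len(series)
    trend = [None] * n
    for i in range(half, n - half):
        trend[i] = sum(series[i - half:i + half + 1]) / period
    buckets = [[] for _ in range(period)]
    for i in range(n):
        if trend[i] is not None:
            buckets[i % period].append(series[i] - trend[i])
    seasonal = [sum(b) / len(b) for b in buckets]
    residual = [
        None if trend[i] is None else series[i] - trend[i] - seasonal[i % period]
        for i in range(n)
    ]
    return trend, seasonal, residual


# Hypothetical daily series: linear growth plus a weekly pattern.
weekly = [3, -1, -2, 0, 4, -3, -1]
daily = [10 * day + weekly[day % 7] for day in range(21)]
trend, seasonal, residual = decompose(daily, period=7)
print(seasonal)
```

Large residuals on specific days are candidates for event-driven traffic that the trend and seasonal components do not explain.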
Superbowl 2013: Capacity Planning
•  Assess capacity requirement based on 2011 and 2012 Superbowl traffic patterns

•  Core driver selection
  RPS (Reads)
  TPS (Writes)

•  What time granularity to use?
  Avg TPS (Tweets per sec)
  1s/10s/15s/30s Max TPS
  1 min/5 min/10 min Max TPS
  1 hr Max TPS
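The choice of granularity matters because a longer window smooths out short spikes. The sketch below interprets "N-second Max TPS" as the highest average tweet rate over any sliding N-second window; that interpretation, the function name, and the sample counts are assumptions.

```python
def max_tps(per_second_counts, window_s):
    """Highest average TPS over any sliding window of window_s seconds."""
    if window_s <= 0 or window_s > len(per_second_counts):
        raise ValueError("window must fit inside the series")
    window_sum = sum(per_second_counts[:window_s])
    best = window_sum
    for i in range(window_s, len(per_second_counts)):
        # Slide the window forward by one second.
        window_sum += per_second_counts[i] - per_second_counts[i - window_s]
        best = max(best, window_sum)
    return best / window_s


# Hypothetical per-second tweet counts with a one-second spike.
counts = [1, 1, 10, 1, 1, 1]
for window in (1, 2, 6):
    print(window, max_tps(counts, window))
```

In this toy series the 1-second peak is 10 TPS, while the 6-second figure is only 2.5 TPS, so capacity planned from coarse windows can miss a brief spike.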
Superbowl 2013: Capacity Planning (cont’d)
•  Which metric to use?
[Figure: metrics plotted over time; highly correlated]
Superbowl 2013: Capacity Planning (cont’d)
•  Which metric to use?
  Time sensitive – correlation may change YoY
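A quick way to check whether two candidate metrics move together, and whether that still holds from one year to the next, is a Pearson correlation computed per year. The sketch below uses hypothetical series; the talk does not say which correlation measure was used.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical samples of two metrics taken at the same instants, per year.
by_year = {
    2011: ([100, 120, 180, 150], [900, 1100, 1700, 1400]),
    2012: ([200, 260, 390, 310], [1500, 1400, 2600, 1900]),
}
for year, (metric_a, metric_b) in by_year.items():
    print(year, round(pearson(metric_a, metric_b), 3))
```

If the coefficient drops noticeably from one year to the next, a metric that was a good proxy last year may no longer be one.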
[Figure: metrics plotted over time; highly correlated]
Superbowl 2013: Capacity Planning (cont’d)
•  Approaches
  TPS_Superbowl (denoted by T_n)
  d-day historical window
  TPS_{n-1}, TPS_{n-2}, …, TPS_{n-d}
  Ratio Analysis
  R_n = T_n / Max(TPS_{n-1}, TPS_{n-2}, …, TPS_{n-d})
  Distribution Analysis
  α_n = (T_n - AVG(TPS_{n-1}, TPS_{n-2}, …, TPS_{n-d})) / STDEV(TPS_{n-1}, TPS_{n-2}, …, TPS_{n-d})
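Both measures can be sketched in a few lines. The code assumes STDEV means the sample standard deviation (the slide does not say sample or population), and the history values are hypothetical.

```python
import statistics


def ratio_analysis(t_n, history):
    """R_n: event-day TPS relative to the max over the d-day window."""
    return t_n / max(history)


def distribution_analysis(t_n, history):
    """alpha_n: how many standard deviations T_n sits above the window mean."""
    mu = statistics.mean(history)
    sigma = statistics.stdev(history)  # assumption: sample standard deviation
    return (t_n - mu) / sigma


# Hypothetical d-day history of daily max TPS and an event-day value.
history = [10, 12, 14]
t_n = 18
print(ratio_analysis(t_n, history))
print(distribution_analysis(t_n, history))
```

R_n compares the event against the single worst recent day, while α_n places it within the whole distribution, so α_n is less sensitive to one outlier in the window.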
Superbowl 2013: Capacity Planning (cont’d)
•  Ratio Analysis (R_n)
  1s Max TPS
  Year   14 Day   28 Day   45 Day
  2011   0.791    0.791    1.007
  2012   1.062    0.858    0.580
Superbowl 2013: Capacity Planning (cont’d)
•  Distribution Analysis (α_n)
  AVG (μ), STDEV (σ)
  μ increased YoY (expected)
  σ also increased YoY
  1s Max TPS
  Year   T_n/μ   (T_n - μ)/σ
  2011   1.448   1.746
  2012   1.517   2.756
  TPS during Superbowl has been moving right YoY
[Figure: series for 2011 and 2012]
Superbowl 2013: Capacity Planning (cont’d)
•  Distribution Analysis
  YoY movement of TPS_Superbowl further into the right tail
  Expectation: Progressive moves would be smaller

  Overestimate α
  Handle unplanned events
  Business decision
Superbowl 2013: Capacity Planning (cont’d)
•  Historical component
  Determine extent of movement (α_expected) of TPS_Superbowl into right tail

•  Temporal component
  Current μ_c
  Current σ_c

•  Capacity planning
  Plan capacity corresponding to μ_c + α_expected * σ_c
  Scenario Analysis (à la Global Macro Hedge Funds)
  α_expected
o  α_{n-1} (same as last year)
o  α_{n-1} + (α_{n-1} + α_{n-2})/2 (extrapolate from last two years)
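The planning rule and the two scenarios can be sketched as below. The scenario expressions follow the slide as transcribed; the α values are the 1s Max TPS figures from the distribution analysis table (2011: 1.746, 2012: 2.756), while μ_c and σ_c are made-up placeholders, not Twitter figures.

```python
def alpha_same_as_last_year(alpha_prev, alpha_prev2):
    """Scenario 1: alpha_expected = alpha_{n-1}."""
    return alpha_prev


def alpha_extrapolated(alpha_prev, alpha_prev2):
    """Scenario 2, as transcribed from the slide:
    alpha_expected = alpha_{n-1} + (alpha_{n-1} + alpha_{n-2}) / 2
    """
    return alpha_prev + (alpha_prev + alpha_prev2) / 2


def planned_tps(mu_c, sigma_c, alpha_expected):
    """Capacity target: mu_c + alpha_expected * sigma_c."""
    return mu_c + alpha_expected * sigma_c


alpha_2012, alpha_2011 = 2.756, 1.746   # from the table above
mu_c, sigma_c = 10_000.0, 2_000.0       # hypothetical current mean and std dev

for scenario in (alpha_same_as_last_year, alpha_extrapolated):
    alpha = scenario(alpha_2012, alpha_2011)
    print(scenario.__name__, alpha, planned_tps(mu_c, sigma_c, alpha))
```

Treating each scenario as a function makes it easy to add further cases, and the choice among them remains a business decision about how much headroom to pay for.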
Superbowl 2013: Capacity Planning (cont’d)
•  Capacity planning
  1s Max TPS
  α_{n-1} → 20K+
  α_{n-1} + (α_{n-1} + α_{n-2})/2 → 22K+
Superbowl 2013: Capacity Planning (cont’d)
•  Validation
  1s Max TPS
  α_observed < α_expected
  Twitter was highly available during Superbowl 2013
  Over-allocation concerns?
  Minimal 
  Limited to a few services
  Seamlessly handled traffic spike due to the Superbowl 2013 Blackout
Join the Flock
•  We are hiring!
  https://twitter.com/JoinTheFlock
  https://twitter.com/jobs

More Related Content

Similar to Isolating Events from the Fail Whale

Data analytics introduction
Data analytics introductionData analytics introduction
Data analytics introduction
amiyadash
 
Gunjan insight student conference v2
Gunjan insight student conference v2Gunjan insight student conference v2
Gunjan insight student conference v2
Gunjan Kumar
 
Community Structure, Interaction and Evolution Analysis of Online Social Netw...
Community Structure, Interaction and Evolution Analysis of Online Social Netw...Community Structure, Interaction and Evolution Analysis of Online Social Netw...
Community Structure, Interaction and Evolution Analysis of Online Social Netw...
Symeon Papadopoulos
 
A Real-time System for Detecting Landslide Reports on Social Media using Arti...
A Real-time System for Detecting Landslide Reports on Social Media using Arti...A Real-time System for Detecting Landslide Reports on Social Media using Arti...
A Real-time System for Detecting Landslide Reports on Social Media using Arti...
ferda ofli
 
The STDM Development: Strategic Choices and Design Features
The STDM Development: Strategic Choices and Design FeaturesThe STDM Development: Strategic Choices and Design Features
The STDM Development: Strategic Choices and Design Features
GLTN_STDM
 
Spark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike FreedmanSpark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike Freedman
Spark Summit
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
Dr. Shikha Mehta
 
Real Time Analytics: Algorithms and Systems
Real Time Analytics: Algorithms and SystemsReal Time Analytics: Algorithms and Systems
Real Time Analytics: Algorithms and Systems
Arun Kejariwal
 
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Data Con LA
 
Analysis of Twitter Data During Hurricane Sandy
Analysis of Twitter Data During Hurricane SandyAnalysis of Twitter Data During Hurricane Sandy
Analysis of Twitter Data During Hurricane Sandy
Catherine Graham
 
22 - CSIRO - Water Data Management-Sep-17
22 - CSIRO - Water Data Management-Sep-1722 - CSIRO - Water Data Management-Sep-17
22 - CSIRO - Water Data Management-Sep-17
indiawrm
 
Druid @ branch
Druid @ branch Druid @ branch
Druid @ branch
Biswajit Das
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
Prasad Wagle
 
Advanced Analytics in Banking, CITI
Advanced Analytics in Banking, CITIAdvanced Analytics in Banking, CITI
Advanced Analytics in Banking, CITI
Innovation Enterprise
 
Mitigating User Experience from 'Breaking Bad': The Twitter Approach [Velocit...
Mitigating User Experience from 'Breaking Bad': The Twitter Approach [Velocit...Mitigating User Experience from 'Breaking Bad': The Twitter Approach [Velocit...
Mitigating User Experience from 'Breaking Bad': The Twitter Approach [Velocit...
Piyush Kumar
 
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...
C4Media
 
Task Time Series CoronaWhy De
Task Time Series CoronaWhy DeTask Time Series CoronaWhy De
Task Time Series CoronaWhy De
Isaac Godfried
 
Backups and Disaster Recovery for Nonprofits
Backups and Disaster Recovery for NonprofitsBackups and Disaster Recovery for Nonprofits
Backups and Disaster Recovery for Nonprofits
Community IT Innovators
 
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA
 
Recommending Sequences RecTour 2017
Recommending Sequences RecTour 2017Recommending Sequences RecTour 2017
Recommending Sequences RecTour 2017
Gunjan Kumar
 

Similar to Isolating Events from the Fail Whale (20)

Data analytics introduction
Data analytics introductionData analytics introduction
Data analytics introduction
 
Gunjan insight student conference v2
Gunjan insight student conference v2Gunjan insight student conference v2
Gunjan insight student conference v2
 
Community Structure, Interaction and Evolution Analysis of Online Social Netw...
Community Structure, Interaction and Evolution Analysis of Online Social Netw...Community Structure, Interaction and Evolution Analysis of Online Social Netw...
Community Structure, Interaction and Evolution Analysis of Online Social Netw...
 
A Real-time System for Detecting Landslide Reports on Social Media using Arti...
A Real-time System for Detecting Landslide Reports on Social Media using Arti...A Real-time System for Detecting Landslide Reports on Social Media using Arti...
A Real-time System for Detecting Landslide Reports on Social Media using Arti...
 
The STDM Development: Strategic Choices and Design Features
The STDM Development: Strategic Choices and Design FeaturesThe STDM Development: Strategic Choices and Design Features
The STDM Development: Strategic Choices and Design Features
 
Spark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike FreedmanSpark Streaming and IoT by Mike Freedman
Spark Streaming and IoT by Mike Freedman
 
Shikha fdp 62_14july2017
Shikha fdp 62_14july2017Shikha fdp 62_14july2017
Shikha fdp 62_14july2017
 
Real Time Analytics: Algorithms and Systems
Real Time Analytics: Algorithms and SystemsReal Time Analytics: Algorithms and Systems
Real Time Analytics: Algorithms and Systems
 
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...
 
Analysis of Twitter Data During Hurricane Sandy
Analysis of Twitter Data During Hurricane SandyAnalysis of Twitter Data During Hurricane Sandy
Analysis of Twitter Data During Hurricane Sandy
 
22 - CSIRO - Water Data Management-Sep-17
22 - CSIRO - Water Data Management-Sep-1722 - CSIRO - Water Data Management-Sep-17
22 - CSIRO - Water Data Management-Sep-17
 
Druid @ branch
Druid @ branch Druid @ branch
Druid @ branch
 
Extracting Insights from Data at Twitter
Extracting Insights from Data at TwitterExtracting Insights from Data at Twitter
Extracting Insights from Data at Twitter
 
Advanced Analytics in Banking, CITI
Advanced Analytics in Banking, CITIAdvanced Analytics in Banking, CITI
Advanced Analytics in Banking, CITI
 
Mitigating User Experience from 'Breaking Bad': The Twitter Approach [Velocit...
Mitigating User Experience from 'Breaking Bad': The Twitter Approach [Velocit...Mitigating User Experience from 'Breaking Bad': The Twitter Approach [Velocit...
Mitigating User Experience from 'Breaking Bad': The Twitter Approach [Velocit...
 
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...
Have Your Cake and Eat It Too -- Further Dispelling the Myths of the Lambda A...
 
Task Time Series CoronaWhy De
Task Time Series CoronaWhy DeTask Time Series CoronaWhy De
Task Time Series CoronaWhy De
 
Backups and Disaster Recovery for Nonprofits
Backups and Disaster Recovery for NonprofitsBackups and Disaster Recovery for Nonprofits
Backups and Disaster Recovery for Nonprofits
 
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
Data Con LA 2018 - Enabling real-time exploration and analytics at scale at H...
 
Recommending Sequences RecTour 2017
Recommending Sequences RecTour 2017Recommending Sequences RecTour 2017
Recommending Sequences RecTour 2017
 

More from Arun Kejariwal

Anomaly Detection At The Edge
Anomaly Detection At The EdgeAnomaly Detection At The Edge
Anomaly Detection At The Edge
Arun Kejariwal
 
Serverless Streaming Architectures and Algorithms for the Enterprise
Serverless Streaming Architectures and Algorithms for the EnterpriseServerless Streaming Architectures and Algorithms for the Enterprise
Serverless Streaming Architectures and Algorithms for the Enterprise
Arun Kejariwal
 
Sequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time SeriesSequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time Series
Arun Kejariwal
 
Sequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time SeriesSequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time Series
Arun Kejariwal
 
Model Serving via Pulsar Functions
Model Serving via Pulsar FunctionsModel Serving via Pulsar Functions
Model Serving via Pulsar Functions
Arun Kejariwal
 
Designing Modern Streaming Data Applications
Designing Modern Streaming Data ApplicationsDesigning Modern Streaming Data Applications
Designing Modern Streaming Data Applications
Arun Kejariwal
 
Correlation Analysis on Live Data Streams
Correlation Analysis on Live Data StreamsCorrelation Analysis on Live Data Streams
Correlation Analysis on Live Data Streams
Arun Kejariwal
 
Deep Learning for Time Series Data
Deep Learning for Time Series DataDeep Learning for Time Series Data
Deep Learning for Time Series Data
Arun Kejariwal
 
Correlation Analysis on Live Data Streams
Correlation Analysis on Live Data StreamsCorrelation Analysis on Live Data Streams
Correlation Analysis on Live Data Streams
Arun Kejariwal
 
Live Anomaly Detection
Live Anomaly DetectionLive Anomaly Detection
Live Anomaly Detection
Arun Kejariwal
 
Finding bad apples early: Minimizing performance impact
Finding bad apples early: Minimizing performance impactFinding bad apples early: Minimizing performance impact
Finding bad apples early: Minimizing performance impact
Arun Kejariwal
 
Velocity 2015-final
Velocity 2015-finalVelocity 2015-final
Velocity 2015-final
Arun Kejariwal
 
Statistical Learning Based Anomaly Detection @ Twitter
Statistical Learning Based Anomaly Detection @ TwitterStatistical Learning Based Anomaly Detection @ Twitter
Statistical Learning Based Anomaly Detection @ Twitter
Arun Kejariwal
 
Days In Green (DIG): Forecasting the life of a healthy service
Days In Green (DIG): Forecasting the life of a healthy serviceDays In Green (DIG): Forecasting the life of a healthy service
Days In Green (DIG): Forecasting the life of a healthy service
Arun Kejariwal
 
Techniques for Minimizing Cloud Footprint
Techniques for Minimizing Cloud FootprintTechniques for Minimizing Cloud Footprint
Techniques for Minimizing Cloud Footprint
Arun Kejariwal
 
A Tool for Practical Garbage Collection Analysis In the Cloud
A Tool for Practical Garbage Collection Analysis In the CloudA Tool for Practical Garbage Collection Analysis In the Cloud
A Tool for Practical Garbage Collection Analysis In the Cloud
Arun Kejariwal
 

More from Arun Kejariwal (16)

Anomaly Detection At The Edge
Anomaly Detection At The EdgeAnomaly Detection At The Edge
Anomaly Detection At The Edge
 
Serverless Streaming Architectures and Algorithms for the Enterprise
Serverless Streaming Architectures and Algorithms for the EnterpriseServerless Streaming Architectures and Algorithms for the Enterprise
Serverless Streaming Architectures and Algorithms for the Enterprise
 
Sequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time SeriesSequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time Series
 
Sequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time SeriesSequence-to-Sequence Modeling for Time Series
Sequence-to-Sequence Modeling for Time Series
 
Model Serving via Pulsar Functions
Model Serving via Pulsar FunctionsModel Serving via Pulsar Functions
Model Serving via Pulsar Functions
 
Designing Modern Streaming Data Applications
Designing Modern Streaming Data ApplicationsDesigning Modern Streaming Data Applications
Designing Modern Streaming Data Applications
 
Correlation Analysis on Live Data Streams
Correlation Analysis on Live Data StreamsCorrelation Analysis on Live Data Streams
Correlation Analysis on Live Data Streams
 
Deep Learning for Time Series Data
Deep Learning for Time Series DataDeep Learning for Time Series Data
Deep Learning for Time Series Data
 
Correlation Analysis on Live Data Streams
Correlation Analysis on Live Data StreamsCorrelation Analysis on Live Data Streams
Correlation Analysis on Live Data Streams
 
Live Anomaly Detection
Live Anomaly DetectionLive Anomaly Detection
Live Anomaly Detection
 
Finding bad apples early: Minimizing performance impact
Finding bad apples early: Minimizing performance impactFinding bad apples early: Minimizing performance impact
Finding bad apples early: Minimizing performance impact
 
Velocity 2015-final
Velocity 2015-finalVelocity 2015-final
Velocity 2015-final
 
Statistical Learning Based Anomaly Detection @ Twitter
Statistical Learning Based Anomaly Detection @ TwitterStatistical Learning Based Anomaly Detection @ Twitter
Statistical Learning Based Anomaly Detection @ Twitter
 
Days In Green (DIG): Forecasting the life of a healthy service
Days In Green (DIG): Forecasting the life of a healthy serviceDays In Green (DIG): Forecasting the life of a healthy service
Days In Green (DIG): Forecasting the life of a healthy service
 
Techniques for Minimizing Cloud Footprint
Techniques for Minimizing Cloud FootprintTechniques for Minimizing Cloud Footprint
Techniques for Minimizing Cloud Footprint
 
A Tool for Practical Garbage Collection Analysis In the Cloud
A Tool for Practical Garbage Collection Analysis In the CloudA Tool for Practical Garbage Collection Analysis In the Cloud
A Tool for Practical Garbage Collection Analysis In the Cloud
 

Recently uploaded

Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
Priyanka Aash
 
kk vathada _digital transformation frameworks_2024.pdf
kk vathada _digital transformation frameworks_2024.pdfkk vathada _digital transformation frameworks_2024.pdf
kk vathada _digital transformation frameworks_2024.pdf
KIRAN KV
 
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python CodebaseEuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
Jimmy Lai
 
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
siddu769252
 
UX Webinar Series: Aligning Authentication Experiences with Business Goals
UX Webinar Series: Aligning Authentication Experiences with Business GoalsUX Webinar Series: Aligning Authentication Experiences with Business Goals
UX Webinar Series: Aligning Authentication Experiences with Business Goals
FIDO Alliance
 
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
UX Webinar Series: Drive Revenue and Decrease Costs with Passkeys for Consume...
FIDO Alliance
 
BLOCKCHAIN TECHNOLOGY - Advantages and Disadvantages
BLOCKCHAIN TECHNOLOGY - Advantages and DisadvantagesBLOCKCHAIN TECHNOLOGY - Advantages and Disadvantages
BLOCKCHAIN TECHNOLOGY - Advantages and Disadvantages
SAI KAILASH R
 
Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024
Michael Price
 
Integrating Kafka with MuleSoft 4 and usecase
Integrating Kafka with MuleSoft 4 and usecaseIntegrating Kafka with MuleSoft 4 and usecase
Integrating Kafka with MuleSoft 4 and usecase
shyamraj55
 
NVIDIA at Breakthrough Discuss for Space Exploration
NVIDIA at Breakthrough Discuss for Space ExplorationNVIDIA at Breakthrough Discuss for Space Exploration
NVIDIA at Breakthrough Discuss for Space Exploration
Alison B. Lowndes
 
Zaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdfZaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdf
AmandaCheung15
 
It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...
Zilliz
 
Mastering OnlyFans Clone App Development: Key Strategies for Success
Mastering OnlyFans Clone App Development: Key Strategies for SuccessMastering OnlyFans Clone App Development: Key Strategies for Success
Mastering OnlyFans Clone App Development: Key Strategies for Success
David Wilson
 
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
leewayhertz.com-Generative AI tech stack Frameworks infrastructure models and...
alexjohnson7307
 
Opencast Summit 2024 — Opencast @ University of Münster
Opencast Summit 2024 — Opencast @ University of MünsterOpencast Summit 2024 — Opencast @ University of Münster
Opencast Summit 2024 — Opencast @ University of Münster
Matthias Neugebauer
 
Vulnerability Management: A Comprehensive Overview
Vulnerability Management: A Comprehensive OverviewVulnerability Management: A Comprehensive Overview
Vulnerability Management: A Comprehensive Overview
Steven Carlson
 
Sonkoloniya documentation - ONEprojukti.pdf
Sonkoloniya documentation - ONEprojukti.pdfSonkoloniya documentation - ONEprojukti.pdf
Sonkoloniya documentation - ONEprojukti.pdf
SubhamMandal40
 
Improving Learning Content Efficiency with Reusable Learning Content
Improving Learning Content Efficiency with Reusable Learning ContentImproving Learning Content Efficiency with Reusable Learning Content
Improving Learning Content Efficiency with Reusable Learning Content
Enterprise Knowledge
 
Computer HARDWARE presenattion by CWD students class 10
Computer HARDWARE presenattion by CWD students class 10Computer HARDWARE presenattion by CWD students class 10
Computer HARDWARE presenattion by CWD students class 10
ankush9927
 
Uncharted Together- Navigating AI's New Frontiers in Libraries
Uncharted Together- Navigating AI's New Frontiers in LibrariesUncharted Together- Navigating AI's New Frontiers in Libraries
Uncharted Together- Navigating AI's New Frontiers in Libraries
Brian Pichman
 

Recently uploaded (20)

Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
 
kk vathada _digital transformation frameworks_2024.pdf
kk vathada _digital transformation frameworks_2024.pdfkk vathada _digital transformation frameworks_2024.pdf
kk vathada _digital transformation frameworks_2024.pdf
 
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python CodebaseEuroPython 2024 - Streamlining Testing in a Large Python Codebase
EuroPython 2024 - Streamlining Testing in a Large Python Codebase
 
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
 
UX Webinar Series: Aligning Authentication Experiences with Business Goals
UX Webinar Series: Aligning Authentication Experiences with Business GoalsUX Webinar Series: Aligning Authentication Experiences with Business Goals
UX Webinar Series: Aligning Authentication Experiences with Business Goals
 

Isolating Events from the Fail Whale

  • 1. @Twitter | QCon NY 2013 1 Isolating Events from the Fail Whale Arun Kejariwal, Bryce Yan (@arun_kejariwal, @bryce_yan) Capacity Engineering @ Twitter June 2013
  • 2. Delivering Best User Experience
      • Performance
          - Real time!
          - Latency tolerance of end-users has nosedived
          - Average, p99, p999
          - Variability on large clusters
          - Tolerate faults when using commodity hardware
      • Availability
          - Anytime, Anywhere, Any Device
      • Organic Growth
          - Over 200M monthly active users
      • Events
          - Planned, Unplanned
      [1] Xu et al. NSDI 2013 - https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final77.pdf
      [2] http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/people/jeff/Berkeley-Latency-Mar2012.pdf
      [3] https://twitter.com/twitter/status/281051652235087872
  • 3. High Performance, Availability
      • Capacity Planning
          - Throw hardware at the problem
          - Operationally inefficient
          - Even otherwise
              o How much?
              o What kind? (Inventory management etc.)
          - Reactive approach
          - Degraded user experience
              o Impact bottom line
          - Overall goal
          - Deliver best user experience
          - Minimal operational footprint
              o Factor in organic growth and lead times for provisioning additional capacity
  • 4. Capacity Planning is Non-trivial
      • Behavioral response is unpredictable
      • Multiplier Effect
          - # Retweets x Followers of each retweeter
          - Large fan-out
  • 5. Capacity Planning is Non-trivial (cont’d)
      • Unforeseen events
          - Power failure
              o “Hurricane Sandy takes data centers offline with flooding, power outages” [1]
          - Network issues
              o “Amazon's compute cloud has a networking hiccup” [2]
      • Evolving product development landscape
          - New features
          - New products
          - New partners
              o “Twitter Arrives on Wall Street, Via Bloomberg” [4]
      [1] http://arstechnica.com/information-technology/2012/10/hurricane-sandy-takes-data-centers-offline-with-flooding-power-outages/
      [2] http://www.zdnet.com/amazons-compute-cloud-has-a-networking-hiccup-7000005776/
      [3] Ballani et al. NSDI 2013 - https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final186.pdf
      [4] http://dealbook.nytimes.com/2013/04/04/twitter-arrives-on-wall-street-via-bloomberg/
      14 June 2013
  • 6. Capacity Planning is Non-trivial (cont’d)
      • New hardware platforms
          - Purchase pipeline
          - How much and when to buy: cost-performance trade-off
  • 7. Events
      • Planned
          - Still, traffic pattern subject to, say,
              o Nature of the event
              o Behavioral response
              o Community effect
              o Demographics
  • 8. Events (cont’d)
      • Unplanned
          - Intensity of the event
          - Population density
      [Figure labels: Japan Tsunami; New Zealand Earthquake; Hurricane Sandy; Flash Crash; Egyptian Revolution; Iran’s Disputed Election; Boston Explosion; Remembering Steve Jobs]
  • 9. Events (cont’d)
      • Unplanned (transient)
          - Duration
          - Type of the transient event
      [Figure label: White House Rumor: AP account being hacked [1]]
      [1] http://finance.yahoo.com/news/stocks-briefly-drop-recover-fake-172814328.html
  • 10. Events (cont’d)
      • Black Swans (à la Nassim Taleb)
          - Planned events, but…
      [Figure labels: Superbowl’13 Blackout; Zidane in “Action”; “Hand of God”; Usain Bolt’s 100m World Record]
  • 11. Events (cont’d)
      • Events timeline
      [Figure: events timeline; axis: Time]
  • 12. Events’ Impact
      • Differ in characteristics
          - Tweets
          - Photos
          - Vines
          - Now, Music
      • Consequently, tax different services
          - Different capacity requests
  • 13. Capacity Modeling Overview
  • 14. Capacity Modeling
      • Takes core drivers as inputs to generate usage demand
          - Forecasts the amount of work based on core driver projections
      • Relates the work metric to a primary resource to identify the capacity threshold
          - Primary resources
              o Computing power (CPU, RAM)
              o Storage (disk I/O, disk space)
              o Network (network bandwidth)
      • Generates hardware demand based on the limiting primary resource
  • 15. Core Drivers
      • Underlying business metrics that drive demand for more capacity
          - Active Users
          - Tweets per second (TPS)
          - Favorites per second (FPS)
          - Requests per second (RPS)
      • Normalized by Active Users to isolate user engagement
      • Project user engagement and Active Users independently
  • 16. Core Drivers (cont’d)
      [Figure: Active Users aka User Growth; axes: Active User Count vs Time; series: Active Users, Linear (Active Users)]
      [Figure: Normalized Core Drivers for Engagement; axes: Per Active User Values vs Time; series: Favorites, Retweets, Poly. (Favorites), Linear (Retweets)]
  • 17. Core Drivers (cont’d)
      [Figure: User Growth: Active Users; x-axis: Time; series: Active Users, Linear (Active Users)]
      [Figure: Engagement: Photos/Active User; x-axis: Time; series: Photos, Linear (Photos)]
      [Figure: Core Driver: Photos per Day; x-axis: Time; series: Photos, Photos Forecast]
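The Core Drivers slides describe projecting Active Users and per-user engagement independently and then combining them into a core driver forecast. A minimal sketch of that idea follows, assuming a plain least-squares line for both trends (the slides also show a polynomial fit for Favorites, which is not reproduced here). The function names and the sample numbers are hypothetical and chosen only for illustration.

```python
def linear_fit(ys):
    """Least-squares line through the points (0, ys[0]), (1, ys[1]), ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def forecast_core_driver(active_users, driver_totals, horizon):
    """Forecast a core driver as (projected users) x (projected per-user rate)."""
    # Normalize by Active Users to isolate engagement.
    per_user = [d / u for d, u in zip(driver_totals, active_users)]
    # Project user growth and engagement independently.
    u_slope, u_icpt = linear_fit(active_users)
    e_slope, e_icpt = linear_fit(per_user)
    n = len(active_users)
    forecast = []
    for t in range(n, n + horizon):
        users = u_slope * t + u_icpt
        engagement = e_slope * t + e_icpt
        forecast.append(users * engagement)
    return forecast


# Hypothetical history: active users and photos per day over four periods.
active_users = [100.0, 110.0, 120.0, 130.0]
photos_per_day = [200.0, 231.0, 264.0, 299.0]
print(forecast_core_driver(active_users, photos_per_day, horizon=2))
```

Keeping the two projections separate means a change in user growth can be modeled without re-fitting the engagement trend, and vice versa.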
  • 18. Capacity Threshold
      • Primary resource scalability threshold
          - Determined by load testing
              o Synthetic load
              o Replaying production traffic
              o Real-time production traffic
          - Test systems may be
              o Isolated replicas of production
              o Staging systems in production
              o Production systems
      [Figure: Average Response Times vs CPU; axes: Service Response Time, CPU]
  • 19. Hardware Demand
      • Core driver → capacity threshold → scaling formula → server count
      • Example
          - Core driver: Requests per Second
          - Per server request throughput determined by capacity threshold
          - Scaling formula for Sizing
              o Number of Servers = (RPS) / Per Server Threshold
      [Figure: Core Driver (RPS) / Server Count vs Time; series: RPS (Actuals), RPS (Forecast), # Servers (Actuals), # Servers (Forecast)]
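The scaling formula on the Hardware Demand slide can be sketched as a small function. The slide gives only the division; rounding up to a whole server is an assumption added here, and the forecast values and per-server threshold below are hypothetical.

```python
import math


def servers_needed(rps, per_server_threshold):
    """Number of Servers = RPS / Per Server Threshold, rounded up (assumption)."""
    if per_server_threshold <= 0:
        raise ValueError("per_server_threshold must be positive")
    return math.ceil(rps / per_server_threshold)


# Hypothetical RPS forecast and a hypothetical per-server capacity threshold.
forecast_rps = [90_000, 120_000, 150_500]
per_server_rps = 1_500
server_counts = [servers_needed(r, per_server_rps) for r in forecast_rps]
print(server_counts)
```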
  • 20. Deep Dive and Superbowl 2013
  • 21. Events: High Level Methodology
      • Goal
          - Handle traffic “spike”
      • Predict expected traffic based on historical and temporal statistical analysis
          - Statistical Metrics
              o Average
              o Standard deviation
              o Max
      • Limitations
          - Changing usage patterns
              o Organic growth, behavioral, cultural
          - Event driven
              o How a game would turn out?
  • 22. Statistical Time Series Analysis
      • Time window
          - Week over Week (WoW)
          - Month over Month (MoM)
          - Year over Year (YoY)
      • Data Distribution
          - Normal, Log Normal, Multi-modal
          - Has implications on model selection
      • Forecasting
          - Regression model
              o Linear, Spline
          - ARIMA
              o Trending, Seasonal, Residuals
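One simple way to read the time windows listed above is as a ratio of each observation to the one a fixed lag earlier (7 days for Week over Week). The sketch below assumes daily samples; the function name and the data are hypothetical, and the slides do not say how the comparison was actually computed.

```python
def period_over_period(series, lag):
    """Ratio of each value to the value `lag` steps earlier."""
    if lag <= 0:
        raise ValueError("lag must be positive")
    return [series[i] / series[i - lag] for i in range(lag, len(series))]


# Hypothetical daily peak TPS for two consecutive weeks.
daily_peak_tps = [
    100, 110, 105, 120, 130, 90, 80,   # week 1
    110, 121, 105, 132, 156, 99, 80,   # week 2
]
wow = period_over_period(daily_peak_tps, lag=7)
print(wow)
```

The same function covers Month over Month or Year over Year by changing `lag`, provided the series is sampled at a fixed interval.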
  • 23. Superbowl 2013: Capacity Planning
      • Assess capacity requirement based on 2011 and 2012 Superbowl traffic patterns
      • Core driver selection
          - RPS (Reads)
          - TPS (Writes)
      • What time granularity to use?
          - Avg TPS (Tweets per sec)
          - 1s/10s/15s/30s Max TPS
          - 1 min/5 min/10 min Max TPS
          - 1 hr Max TPS
  • 24. Superbowl 2013: Capacity Planning (cont’d)
      • Which metric to use?
      [Figure: candidate metrics vs Time; annotation: Highly correlated]
  • 25. Superbowl 2013: Capacity Planning (cont’d)
      • Which metric to use?
          - Time sensitive: correlation may change YoY
      [Figure: candidate metrics vs Time; annotation: Highly correlated]
  • 26. Superbowl 2013: Capacity Planning (cont’d)
      • Approaches
          - $TPS_{Superbowl}$ (denote by $T_n$)
          - d-Day historical window
              o $TPS_{n-1}, TPS_{n-2}, \dots, TPS_{n-d}$
          - Ratio Analysis
              o $R_n = T_n / \max(TPS_{n-1}, TPS_{n-2}, \dots, TPS_{n-d})$
          - Distribution Analysis
              o $\alpha_n = (T_n - \mathrm{AVG}(TPS_{n-1}, TPS_{n-2}, \dots, TPS_{n-d})) / \mathrm{STDEV}(TPS_{n-1}, TPS_{n-2}, \dots, TPS_{n-d})$
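The two approaches on this slide, Ratio Analysis (Rn) and Distribution Analysis (αn), can be sketched directly from their definitions. The sketch assumes STDEV means the sample standard deviation (as in spreadsheet `STDEV`); the slide does not say which variant was used. The window values and event TPS are hypothetical.

```python
from statistics import mean, stdev


def ratio_analysis(t_event, window):
    """R_n: event TPS divided by the max TPS over the d-day historical window."""
    return t_event / max(window)


def distribution_analysis(t_event, window):
    """alpha_n: how many standard deviations the event TPS sits above the window mean."""
    # stdev() is the sample standard deviation; this choice is an assumption.
    return (t_event - mean(window)) / stdev(window)


# Hypothetical 1s Max TPS for the d = 5 days before the event, and the event itself.
window = [4000, 4200, 3900, 4100, 4300]
t_event = 5000

print(ratio_analysis(t_event, window))
print(distribution_analysis(t_event, window))
```

Rn is sensitive to a single outlier in the window, while αn also reflects how variable the window was, which may be why the slides examine both.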
  • 27. Superbowl 2013: Capacity Planning (cont’d)
      • Ratio Analysis ($R_n$)
          - 1s Max TPS

            Year   14 Day   28 Day   45 Day
            2011   0.791    0.791    1.007
            2012   1.062    0.858    0.580
  • 28. Superbowl 2013: Capacity Planning (cont’d)
      • Distribution Analysis ($\alpha_n$)
          - AVG ($\mu$), STDEV ($\sigma$)
          - $\mu$ increased YoY (expected)
          - $\sigma$ also increased YoY
          - 1s Max TPS

            Year   $T_n/\mu$   $(T_n - \mu)/\sigma$
            2011   1.448       1.746
            2012   1.517       2.756

      [Figure: TPS distribution with $\mu$ marked and the 2011 and 2012 Superbowl positions; caption: TPS during Superbowl has been moving right YoY]
  • 29. Superbowl 2013: Capacity Planning (cont’d)
      • Distribution Analysis
          - YoY movement of $TPS_{Superbowl}$ further into the right tail
          - Expectation: Progressive moves would be smaller
          - Overestimate $\alpha$
          - Handle unplanned events
          - Business decision
  • 30. Superbowl 2013: Capacity Planning (cont’d)
      • Historical component
          - Determine extent of movement ($\alpha_{expected}$) of $TPS_{Superbowl}$ into right tail
      • Temporal component
          - Current $\mu_c$
          - Current $\sigma_c$
      • Capacity planning
          - Plan capacity corresponding to $\mu_c + \alpha_{expected} \cdot \sigma_c$
          - Scenario Analysis (à la Global Macro Hedge Funds)
          - $\alpha_{expected}$
              o $\alpha_{n-1}$ (same as last year)
              o $\alpha_{n-1} + (\alpha_{n-1} + \alpha_{n-2})/2$ (extrapolate from last two years)
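The planning rule on this slide (capacity at μc + αexpected × σc, under two scenarios for αexpected) can be sketched as follows. The α inputs are the 2011 and 2012 values of (Tn – μ)/σ reported on the Distribution Analysis slide; the extrapolation formula is transcribed as it appears on the slide; the current μc and σc are hypothetical placeholders, since the slides do not give them.

```python
def alpha_scenarios(alpha_prev, alpha_prev2):
    """The two alpha_expected scenarios, transcribed from the slide."""
    return {
        # alpha_{n-1}: same as last year
        "same_as_last_year": alpha_prev,
        # alpha_{n-1} + (alpha_{n-1} + alpha_{n-2}) / 2: extrapolate from last two years
        "extrapolated": alpha_prev + (alpha_prev + alpha_prev2) / 2,
    }


def planned_capacity(mu_c, sigma_c, alpha_expected):
    """Capacity target: mu_c + alpha_expected * sigma_c."""
    return mu_c + alpha_expected * sigma_c


# alpha values from the slides: 2012 (n-1) and 2011 (n-2), 1s Max TPS.
scenarios = alpha_scenarios(alpha_prev=2.756, alpha_prev2=1.746)

# Hypothetical current mean and standard deviation of 1s Max TPS.
mu_c, sigma_c = 10_000.0, 2_000.0

for name, alpha in scenarios.items():
    print(name, planned_capacity(mu_c, sigma_c, alpha))
```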
  • 31. Superbowl 2013: Capacity Planning (cont’d)
      • Capacity planning
          - 1s Max TPS
              o $\alpha_{n-1}$ → 20K+
              o $\alpha_{n-1} + (\alpha_{n-1} + \alpha_{n-2})/2$ → 22K+
  • 32. Superbowl 2013: Capacity Planning (cont’d)
      • Validation
          - 1s Max TPS
          - $\alpha_{observed} < \alpha_{expected}$
          - Twitter was highly available during Superbowl 2013
          - Over-allocation concerns?
              o Minimal
              o Limited to few services
          - Seamlessly handled traffic spike due to the Superbowl 2013 Blackout
  • 33. Join the Flock
      • We are hiring!
          - https://twitter.com/JoinTheFlock
          - https://twitter.com/jobs