Benefits of Big Data: Handling Operations at Scale
Where it all began in 2000
More than 190 million reviews and opinions from travellers around the world
250,000+ attractions
680,000+ hotels
1,000,000+ restaurants
Operates in 45 countries
315 million unique visitors per month
Where we are now: World’s largest travel site
With nearly 280 million unique monthly visitors
North America: 22%
Europe: 44%
Middle East & Africa: 4%
LATAM: 8%
Asia Pacific: 22%
Source: comScore Media Metrix for TripAdvisor Sites, worldwide, August 2014
Traffic and Infrastructure
o From 500K to 1.5 million hits per minute
o > 1000 Production Servers (Real, not Virtual)
o Split across multiple data centres in the US
o >2 TB of Compressed Log data per day
Managed by a team of just 12 engineers
So where does Big Data fit in?
Big Data @ TripAdvisor
o 10 TB of Postgres Site Data
o 2.5 PB of Data in Hadoop
o ~160 Large Hadoop Nodes
o ~280 TB of Logs (last 7 months) on site
o Redshift and Tableau for Ad hoc exploration of data
o SSAS Data Cubes for static models
So what about Operations?
Challenges in Operations
o Traditional Ops tools don’t scale well
o A human can’t review 30K Graphs (30 metrics x 1000 Servers)
o Aggregate Chart data is a hack and not very helpful
o Tools record and present the data only, a human has to interpret it
Example: Release Day
Start Release -> Monitor 30K Cacti Charts -> Pray
Imagine 9+ releases a week with Cacti and Nagios…
The tools are not designed for this!
So what’s the Solution?
Solution: Better Analytics
o Change to more flexible technology
o Measure everything! (~700K per second)
o Interpret the data within 10 to 15 minutes
o Tune! Remove as many false positives as possible
o Alert – Page someone on unexpected changes
If it’s worth doing, it’s worth measuring
What we built
[Diagram components: Web Servers, Log Central Servers, File Servers]
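The speaker notes describe metrics being loaded into Graphite and Postgres at the end of this pipeline. As a minimal sketch of the Graphite side, the Python below formats one sample as a line of Graphite's plaintext protocol (`path value timestamp`, newline-terminated). The `servers.<host>.<metric>` naming scheme and the function name are assumptions for illustration; the slides do not say how metric paths were named.

```python
def to_graphite_line(host: str, metric: str, value: float, timestamp: int) -> str:
    """Format one sample as a Graphite plaintext-protocol line."""
    # Hypothetical naming scheme: servers.<host>.<metric>.
    # Dots in the host name are replaced so they do not create extra path levels.
    safe_host = host.replace(".", "_")
    return f"servers.{safe_host}.{metric} {value} {timestamp}\n"


# Example: one CPU sample from a web server, with a Unix timestamp in seconds.
line = to_graphite_line("web01", "cpu.user", 12.5, 1409529600)
print(line, end="")
```

Lines in this form can be written to a Graphite (carbon) listener over a plain TCP socket, which is part of what makes it easy to feed from an ETL job.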
Why Postgres?
Analysis
2–3 TB of metrics data
Anomaly Detection
90+ days aggregate data
Holding ~2 TB of data
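The split between raw metrics and longer-lived aggregate data can be illustrated with a small roll-up. The sketch below uses Python rather than SQL and groups raw samples into hourly buckets, matching the hourly resolution mentioned in the speaker notes. The `(timestamp, metric, value)` tuple layout and the choice of count, min, max, and average are assumptions; the slides do not say which aggregates were stored.

```python
from collections import defaultdict


def hourly_rollup(samples):
    """Aggregate (epoch_seconds, metric_name, value) samples into hourly buckets."""
    buckets = defaultdict(list)
    for ts, name, value in samples:
        hour_start = ts - ts % 3600  # truncate the timestamp to the start of its hour
        buckets[(hour_start, name)].append(value)
    return {
        key: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for key, values in buckets.items()
    }


raw = [(3600, "cpu", 10.0), (3660, "cpu", 30.0), (7200, "cpu", 50.0)]
rollup = hourly_rollup(raw)
print(rollup[(3600, "cpu")]["avg"])
```

Keeping only a handful of numbers per metric per hour is what lets months of history fit alongside a single day of raw data.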
Anomaly Detection
Capture what is happening now
Compare it against data 1 day ago
Compare it against what other Pools are doing
Look at historical variance over 9 days
Track statistically significant changes
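The comparison steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production logic: it treats "statistically significant" as a multiple of the standard deviation of the 9-day history, and it only flags a value that differs from both yesterday's value and the other pools. The function name, the `factor` default, and the `min_spread` floor are all hypothetical.

```python
from statistics import mean, pstdev


def is_anomalous(current, day_ago, peer_values, history, factor=3.0, min_spread=1e-9):
    """Flag `current` if it deviates from yesterday AND from the other pools.

    history: the same metric's values over the past 9 days (assumed one per day).
    factor:  how many standard deviations count as significant (x2 or x3 in the notes).
    """
    # Floor the spread so a perfectly flat history does not give a zero threshold.
    spread = max(pstdev(history), min_spread)
    threshold = factor * spread
    differs_from_yesterday = abs(current - day_ago) > threshold
    differs_from_peers = abs(current - mean(peer_values)) > threshold
    return differs_from_yesterday and differs_from_peers


history = [100, 102, 98, 101, 99, 100, 103, 97, 100]
print(is_anomalous(120, 101, [100, 99, 102], history))   # one pool jumps on its own
print(is_anomalous(120, 101, [119, 121, 120], history))  # every pool moved together
```

Requiring both conditions is a design choice made for this sketch: a change that every pool shares (for example a traffic spike) is not attributed to the pool under test.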
Release Day v2
Deploy new code to two pools -> Monitor key metrics -> Test for anomaly -> Rollout or Rollback
No praying needed!
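The red/green decision in this flow can be sketched as a simple gate. The Python below compares metrics from the pools running new code against a baseline and returns either "rollout" or "rollback". The metric names, the tolerances, and the use of relative change are assumptions for illustration; the slides name revenue per session and the rate of 500s as key metrics but do not give thresholds.

```python
# Hypothetical checks: (direction that is bad, allowed relative change).
CHECKS = {
    "http_500_rate": ("up", 0.50),          # bad if it rises by more than 50%
    "revenue_per_session": ("down", 0.05),  # bad if it falls by more than 5%
}


def release_decision(baseline, canary, checks=CHECKS):
    """Return ("rollout" | "rollback", list of metrics that regressed)."""
    failed = []
    for metric, (bad_direction, limit) in checks.items():
        base, now = baseline[metric], canary[metric]
        if base == 0:
            change = 0.0 if now == 0 else float("inf")
        else:
            change = (now - base) / base
        regressed = change > limit if bad_direction == "up" else change < -limit
        if regressed:
            failed.append(metric)
    return ("rollback" if failed else "rollout"), failed


baseline = {"http_500_rate": 0.002, "revenue_per_session": 1.20}
canary = {"http_500_rate": 0.006, "revenue_per_session": 1.19}
print(release_decision(baseline, canary))
```

Returning the list of failed metrics alongside the decision gives whoever is paged a starting point for the investigation.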
Results
Roll back 1 release per week
15 minutes vs 3 hours
Dramatic reduction in user facing issues
Increasing number of releases
Strict process on rolling back
Why no Hadoop?
[Graphic: Postgres vs Hadoop]
What’s Next?
Reduce false negatives
Alert on performance or behaviour changes
Visualise data through dashboards
Expand set of metrics we’re alerting on
Correlate system and application metrics
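One simple way to correlate a system metric with an application metric is a Pearson correlation over aligned samples. The sketch below is an assumption about what "correlate" could mean here, not a description of what was planned; the series values are made up for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / (spread_x * spread_y)


# Hypothetical per-minute samples from one host.
iowait_percent = [2.0, 3.0, 8.0, 15.0, 22.0]
response_time_ms = [110.0, 115.0, 160.0, 240.0, 310.0]
print(round(pearson(iowait_percent, response_time_ms), 3))
```

A coefficient near 1 or -1 suggests the two metrics move together and are worth showing on the same dashboard; it does not by itself show which one causes the other.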
BigDataInOperationsV8

Editor's Notes

  1. Believe it or not, this is where it all began for TripAdvisor back in 2000 – in a tiny office over a pizza place in Needham, Massachusetts.
  2. World’s largest travel site: ~315 million unique visitors a month; more than 190 million reviews and opinions; more than 4.4 million accommodations, restaurants, and attractions. Operates in 45 countries, including China as DaoDao.com.
  3. TripAdvisor is now the world’s largest travel site, with over 315 million unique visitors every month. We operate in 45 countries around the globe, and over 75% of our traffic comes from outside the U.S.
  4. From 500K to 1.5 million hits per minute, not including our CDN traffic (many Gbps). More than 1000 production servers (real, not virtual), split across multiple data centers in the US. More than 2 TB of compressed log data per day. Managed by a team of 10 software engineers!
  5. Traditional ops tools don’t scale well. Data is “trapped” in tools like Cacti. A human can’t review 30K graphs (30 metrics x 1000 servers), and 30 metrics per server is not enough! Automating the provisioning of a new server with tools like Cacti is challenging (magic numbers etc.). Aggregate chart data is a hack and not very helpful. The tools only record and present the data; a human has to interpret it.
  6. Imagine 9+ releases a week with Cacti and Nagios: start the release, manually monitor 30K Cacti charts, and pray! OK, the site is stable, but is it still making money? It is a high-stress, time-consuming process: the release engineer spends the whole time monitoring the progress of a release. Why? Because our tools are not designed for this!
  7. Change to more flexible technology that can scale out, not up; be provisioned using automation (Puppet/Chef); and be extended via plugins. Measure everything! We went from 30 metrics to 530 per server, and are adding more all the time. Interpret the data within 10 to 15 minutes: anomaly detection, what’s changed? Tune! Remove as many false positives as possible. Alert: page someone on unexpected changes.
  8. Collectd gathers key metrics from our servers. Scribe (because it was already there) streams the logs to a collector. Rsync ships the logs to our Boston-area data center. We ETL the data into Graphite and Postgres. Graphite provides all the graphing and dashboards; Postgres is more flexible and allows us to do our anomaly detection and custom reporting. Graphite was faster than Postgres for drawing graphs (Postgres was doing full table scans).
  9. We pushed a lot of analysis work to Postgres. It is way better than a time series database for the anomaly detection. With SSDs, our server can easily handle 2 to 3 TB of metrics data. It holds a day of raw metrics data (~1 TB) and over 90 days of aggregate data at an hourly resolution. It is currently holding ~2 TB of data.
  10. Capture what’s happening now. Compare it against data from 1 day ago. Compare it against what the other pools (collections of web servers) are doing now. Look at the historical variance over the past 9 days. If the change is greater by some statistically significant amount (x2, maybe x3, depending on the metric): alert!
  11. Deploy new code to two small pools first. Allow the code to soak for ~30 minutes. Monitor key metrics such as: revenue per session; rate of 500s; 50 additional system stats (CPU, memory, JVM metrics); internal error logging (more than before?). Red/green status on Jenkins: green, complete the rollout; red, roll back. Alert (page) on any major deviations.
  12. We roll back 1 release per week, typically because of higher error rates or commerce impact. Typical releases require 15 minutes of effort vs 3 hours. Dramatic reduction in user-facing issues. We are increasing the number of releases so we can push more features out on a daily basis. We are very strict on rolling back: no more heroic effort to “make it work”; dev will fix it for the next release.
  13. Latency! We can’t wait for the jobs to run to get the results. We are watching projects like Apache Phoenix. We didn’t need it YET. As we grow we will have to move to Hadoop or something like it; for now, Postgres is working well for us. Aggregates will be archived in Hadoop. As we expand our capacity planning, more historical data is critical. Did I mention we LOVE Postgres?
  14. Expand the set of metrics we’re alerting on: add Postgres-specific metrics and more network equipment. Reduce the level of false positives or noise by using more sophisticated algorithms. Move to Apache Kafka instead of Scribe: a pub/sub messaging framework that is low latency, HA, and scalable. Alert on performance/behavior changes on key site functionality. More dashboards for visualizing the data. Correlate system and application metrics, e.g. response time drops while IOWait increases on system X.