Monitoring @ Facebook
Ran Leibman
Production Engineer
Monitoring Tools, Components, & Mentality at Facebook
Who Am I ?
Agenda
• Problems in today’s monitoring, solutions & approaches
• Facebook Monitoring Architecture
• Dive into each component
• Show Use cases
The Problems
Problems - Nagios Checks
•Good for binary checks
•Monitoring based on a “point in time” when the script executes
•Can’t monitor on a time window (can do clowny checks with a
temp file but that is not so elegant …)
•Perf data
•Not all plugins implement this
•Data gathering is coupled with the way you want to alert
•Hard to aggregate perf data from multiple checks
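The "point in time" limitation can be illustrated with a small sketch. Everything here (`WindowCheck`, `point_in_time_check`, the thresholds) is hypothetical and is not Nagios API; it only contrasts a single reading with a check over a time window kept in memory.

```python
from collections import deque


class WindowCheck:
    """Alert on the share of bad samples inside a time window (hypothetical helper)."""

    def __init__(self, window_seconds, threshold, max_bad_ratio):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.max_bad_ratio = max_bad_ratio
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))
        # Drop samples that have fallen out of the window.
        while self.samples and self.samples[0][0] <= timestamp - self.window_seconds:
            self.samples.popleft()

    def is_alerting(self):
        if not self.samples:
            return False
        bad = sum(1 for _, value in self.samples if value > self.threshold)
        return bad / len(self.samples) > self.max_bad_ratio


def point_in_time_check(value, threshold):
    """What a single script run sees: only the latest reading."""
    return value > threshold


check = WindowCheck(window_seconds=600, threshold=4.0, max_bad_ratio=0.5)
loads = [1.0, 1.2, 9.0, 1.1, 1.0]  # one short spike in load, sampled once a minute
for minute, load in enumerate(loads):
    check.add(minute * 60, load)

spike_alert = point_in_time_check(9.0, 4.0)  # fires if the script happened to run during the spike
window_alert = check.is_alerting()           # only 1 of 5 samples is bad, so no alert
```

The window check needs state between runs, which is exactly what a stateless check script lacks without a temp file.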
Problems - Cron Scripts
•Difficult to put these scripts in “maintenance mode” while deploying
code or acknowledging an issue
•How do you know that your script actually runs?
•Who stopped crond!?
•Error handling?
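One common answer to "does my script actually run?" is to have the job publish its own outcome as metrics and to alert on staleness from somewhere else. The sketch below assumes a plain dict as the metric sink, and the key names `myjob.run_success` and `myjob.last_run_time` are made up for illustration.

```python
def run_job(job, now, metrics):
    """Run a job and record its outcome as metrics instead of failing silently."""
    try:
        job()
    except Exception:
        metrics["myjob.run_success"] = 0.0
        return False
    metrics["myjob.run_success"] = 1.0
    metrics["myjob.last_run_time"] = float(now)
    return True


def is_stale(metrics, now, max_age_seconds):
    """True when the job has never succeeded or last succeeded too long ago."""
    last = metrics.get("myjob.last_run_time")
    return last is None or now - last > max_age_seconds


def failing_job():
    raise RuntimeError("disk full")


metrics = {}
never_ran = is_stale(metrics, now=1000, max_age_seconds=300)
run_job(lambda: None, now=1000, metrics=metrics)
fresh = is_stale(metrics, now=1100, max_age_seconds=300)
crond_stopped = is_stale(metrics, now=2000, max_age_seconds=300)
failed = run_job(failing_job, now=2000, metrics=metrics)
```

Because the staleness check runs outside the job, it still fires when crond itself is down.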
Problems - Metrics R 2nd Class Citizens
•We don’t always treat our metric store the way we treat our
application data
•Storing metrics in clowny temp files or some unmaintained MySQL
•Metrics are DATA like any other and we should treat them as such!
•How & from where are we going to query them?
•What is the best way to store them?
•Retention?
Problems - Ops Ownership
•Usually only the ops team owns monitoring
•Even if developers want to add metrics and alerts, they face a
steep learning curve to get there
Facebook Monitoring Architecture
Operational Data Store - ODS
Operational Data Store - ODS
•“key —> float” type of metrics, associated with an entity
•system.load1, system.cpu-user, system.n_eth-txbyt
•chef.run_sucess
•chef.last_run_time
•myapp.request_num
•myapp.request_median
Entity / Key-Value
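The "key to float, associated with an entity" model can be sketched as a tiny in-memory store. This is an illustration of the data shape only, not the ODS API: `DataPoint`, `MetricStore`, and the host name `web001` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPoint:
    entity: str      # what the metric belongs to, e.g. a host
    key: str         # metric name, e.g. "system.load1"
    timestamp: int   # seconds since the epoch
    value: float     # every metric value is a float


class MetricStore:
    """Minimal time series store keyed by (entity, key)."""

    def __init__(self):
        self._series = {}  # (entity, key) -> list of (timestamp, value)

    def write(self, point):
        series = self._series.setdefault((point.entity, point.key), [])
        series.append((point.timestamp, point.value))

    def read(self, entity, key):
        return list(self._series.get((entity, key), []))


store = MetricStore()
store.write(DataPoint("web001", "system.load1", 1000, 0.42))
store.write(DataPoint("web001", "system.load1", 1060, 0.57))
store.write(DataPoint("web001", "myapp.request_num", 1000, 1250.0))
load_series = store.read("web001", "system.load1")
```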
Operational Data Store - ODS
•Use Gorilla in-memory TSDB for short term data (24 hours)
•Store permanent data in HBase on top of HDFS
•Aggregation
•by rack / cluster / tier
•by custom tags - app, tier name, etc …
•cross datacenter aggregations
Data Store
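Aggregating by rack, cluster, or a custom tag amounts to grouping per-entity values by a tag and reducing each group. The sketch below is a minimal version of that idea; the tag table, host names, and `aggregate_by` are assumptions for illustration, not how ODS implements it.

```python
from collections import defaultdict

# Hypothetical entity -> tags mapping.
ENTITY_TAGS = {
    "web001": {"rack": "r1", "cluster": "c1"},
    "web002": {"rack": "r1", "cluster": "c1"},
    "web003": {"rack": "r2", "cluster": "c1"},
}


def aggregate_by(tag, values, entity_tags, reducer=sum):
    """Group per-entity values by one tag and reduce each group."""
    groups = defaultdict(list)
    for entity, value in values.items():
        groups[entity_tags[entity][tag]].append(value)
    return {group: reducer(members) for group, members in groups.items()}


# One value of myapp.request_num per host at the same timestamp.
request_num = {"web001": 10.0, "web002": 20.0, "web003": 5.0}

by_rack = aggregate_by("rack", request_num, ENTITY_TAGS)
by_cluster = aggregate_by("cluster", request_num, ENTITY_TAGS)
busiest_per_rack = aggregate_by("rack", request_num, ENTITY_TAGS, reducer=max)
```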
Collecting Metrics
Operational Data Store - ODS
• API modules for every imaginable language
• Thrift Endpoint - https://thrift.apache.org
• Implements fb303 counters
• fb303 counters are collected from the service by FBAgent
• FBAgent submits the metrics to ODS
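The collection flow above (the service exposes counters, an agent pulls them and submits them) can be sketched without any Thrift machinery. `CounterService` and `Agent` are hypothetical stand-ins that only mimic the shape of the flow; they are not the fb303 or FBAgent interfaces.

```python
class CounterService:
    """A service that keeps named counters and exposes them for collection."""

    def __init__(self):
        self._counters = {}

    def increment(self, key, amount=1):
        self._counters[key] = self._counters.get(key, 0) + amount

    def get_counters(self):
        # Return a copy so the collector cannot mutate service state.
        return dict(self._counters)


class Agent:
    """Pulls counters from a local service and submits them to a metric sink."""

    def __init__(self, entity, submit):
        self.entity = entity
        self.submit = submit  # callable(entity, key, timestamp, value)

    def collect(self, service, timestamp):
        for key, value in sorted(service.get_counters().items()):
            self.submit(self.entity, key, timestamp, float(value))


submitted = []
service = CounterService()
service.increment("myapp.request_num")
service.increment("myapp.request_num")
service.increment("myapp.errors", 3)

agent = Agent("web001", lambda *point: submitted.append(point))
agent.collect(service, timestamp=1000)
```

The application only increments counters; it never needs to know where the metrics end up.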
Retention
Operational Data Store - ODS
•Metrics are A LOT of data!
•All ODS metrics are rolled up the same way
•Daily - 2 Weeks (depends on you)
•Weekly - 2 Weeks (1min)
•Monthly - 1 Month (1h)
•Yearly - until the end of time (1h)
•How do I solve the data loss??
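The data loss in question is easy to see with numbers. The sketch below assumes the rollup averages values per bucket (the slides do not say which function is used) and shows a one-minute spike nearly disappearing in an hourly point.

```python
def rollup_average(points, bucket_seconds):
    """Downsample (timestamp, value) points by averaging each time bucket."""
    buckets = {}
    for timestamp, value in points:
        start = timestamp - timestamp % bucket_seconds
        buckets.setdefault(start, []).append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


# One hour of 1-minute latency points: steady 10 ms with a single 600 ms spike.
raw = [(minute * 60, 10.0) for minute in range(60)]
raw[30] = (30 * 60, 600.0)

hourly = rollup_average(raw, bucket_seconds=3600)
raw_max = max(value for _, value in raw)
hourly_value = hourly[0]  # the spike is diluted to under 20 ms
```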
Aggregation
Operational Data Store
•Aggregate important metrics to save spikes
•p50, p90, p99
•top(N)
•count
•min, max
•Aggregate by cluster | rack | custom
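A minimal sketch of these aggregations, assuming a nearest-rank percentile definition (other definitions interpolate and give slightly different values). `percentile`, `summarize`, and `top_n` are illustrative helpers, not ODS functions.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(values):
    """Keep several aggregates per bucket so spikes are still visible after a rollup."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "p50": percentile(values, 50),
        "p90": percentile(values, 90),
        "p99": percentile(values, 99),
    }


def top_n(values_by_entity, n):
    """Entities with the largest values, biggest first."""
    ranked = sorted(values_by_entity.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


latencies_ms = [float(v) for v in range(1, 101)]  # 1.0 .. 100.0
summary = summarize(latencies_ms)
slowest = top_n({"web001": 12.0, "web002": 48.0, "web003": 31.0}, 2)
```

Storing `max` and `p99` next to the average is what keeps a short spike visible in coarse buckets.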
Scuba - Real-Time Deep Dive Log Monitoring
What is going on RIGHT NOW with my service?
Scuba - Real-Time Log Monitoring
•Started as a hackathon project! Today we can’t live without it
•Combine application logs from all servers & containers into a
single table
•Data is stored in memory
•Very small lag (<1min)
•Super fast queries (median of ~300ms)
•SQL-like query syntax
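To give a feel for the "single in-memory table plus SQL-like queries" idea, here is a sketch that uses Python's `sqlite3` with an in-memory database as a stand-in. SQLite is not Scuba, and the `app_logs` table, its columns, and the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # data lives in memory only
conn.execute(
    "CREATE TABLE app_logs ("
    "time INTEGER, host TEXT, endpoint TEXT, status INTEGER, latency_ms INTEGER)"
)

# Log rows from several hosts land in the same table.
rows = [
    (990, "web001", "/home", 200, 35),
    (991, "web002", "/feed", 500, 120),
    (995, "web001", "/feed", 502, 300),
    (998, "web003", "/home", 500, 80),
    (900, "web002", "/feed", 500, 90),  # older than the query window
]
conn.executemany("INSERT INTO app_logs VALUES (?, ?, ?, ?, ?)", rows)

now = 1000
query = """
SELECT endpoint, COUNT(*) AS errors
FROM app_logs
WHERE status >= 500 AND time >= ?
GROUP BY endpoint
ORDER BY errors DESC, endpoint
"""
# "Which endpoints are erroring in the last minute?"
errors_by_endpoint = conn.execute(query, (now - 60,)).fetchall()
```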
Logging to Scuba
Scuba - Real-Time Log Monitoring
• Libraries for every imaginable language
• PHP, Python, C++, Bash, etc …
• Scuba supports:
• Strings
• Ints
• Sets of strings
• Stacks of strings (usually for stack traces)
So… What’s the catch ?
Scuba - Real-Time Log Monitoring
• Strict quota policy
• by size
• by time
• Use sampling in order to reduce load
• Not good for pipelining of data
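Sampling works as long as each stored row remembers how many events it stands for. The sketch below uses deterministic "every Nth event" sampling to keep the example reproducible (real loggers more commonly sample at random); `SampledLogger` and the row fields are hypothetical, and the row shows the four column types listed earlier.

```python
class SampledLogger:
    """Keep 1 out of every `sample_rate` rows and tag each kept row with that rate."""

    def __init__(self, sample_rate):
        self.sample_rate = sample_rate
        self._seen = 0
        self.rows = []

    def log(self, row):
        self._seen += 1
        if self._seen % self.sample_rate == 0:
            # Store the rate so queries can scale counts back up.
            self.rows.append(dict(row, sample_rate=self.sample_rate))


def estimated_count(rows):
    """Each stored row represents `sample_rate` original events."""
    return sum(row["sample_rate"] for row in rows)


logger = SampledLogger(sample_rate=100)
for _ in range(1000):
    logger.log({
        "endpoint": "/feed",                         # string
        "latency_ms": 120,                           # int
        "tags": {"mobile", "logged_in"},             # set of strings
        "stack": ["render_feed", "handle_request"],  # stack of strings
    })

stored = len(logger.rows)
estimate = estimated_count(logger.rows)
```

The table holds 1% of the rows, yet counts computed from it still approximate the full traffic.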
Alarm System
What is an Alert?
Alarm System
•Creating an alert does not mean you’ll be notified!
•An alert is an event stating that something happened
•Can’t ssh to server
•p50 of request time is slower than 100ms 80% of the time in the last 10min
•the application tier in us-east is 50% down
•The alert should contain ALL relevant data about the event
•Alerts can be suppressed in case of maintenance
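The latency example above can be written as a small evaluation function. This is a sketch under assumptions: the input is one p50 value per minute for the last 10 minutes, and `Alert`, `evaluate_latency`, and the field names are invented here rather than taken from the alarm system.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    name: str
    entity: str
    data: dict = field(default_factory=dict)  # everything relevant to the event


def evaluate_latency(entity, p50_per_minute, limit_ms=100.0, min_ratio=0.8,
                     suppressed=frozenset()):
    """Emit an alert when p50 exceeded the limit in at least min_ratio of the window."""
    if entity in suppressed or not p50_per_minute:
        return None  # under maintenance, or nothing to evaluate
    slow = [value for value in p50_per_minute if value > limit_ms]
    ratio = len(slow) / len(p50_per_minute)
    if ratio < min_ratio:
        return None
    return Alert(
        name="p50_latency_slow",
        entity=entity,
        data={
            "slow_ratio": ratio,
            "worst_p50_ms": max(slow),
            "window_minutes": len(p50_per_minute),
        },
    )


window = [120.0, 130.0, 90.0, 150.0, 110.0, 105.0, 140.0, 95.0, 160.0, 125.0]
alert = evaluate_latency("myapp.us-east", window)
quiet = evaluate_latency("myapp.us-east", window, suppressed=frozenset({"myapp.us-east"}))
healthy = evaluate_latency("myapp.us-east", [50.0] * 10)
```

Note that the function only produces an event; whether anyone gets notified is a separate decision.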
Alarm System - Alert Structure
FBAR - Facebook Auto-Remediation
Automation Automation Automation !
FBAR
• Most alarms can be auto-remediated
without human intervention
• Code it once, never do it again
• Doing the work of 136,000 engineering
hours (29/04/2015)
•136,000 / 8 = 17,000 engineers a day !
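"Code it once, never do it again" can be sketched as a registry that maps alert names to remediation functions and falls back to a human when nothing matches or the fix fails. This is a hypothetical illustration of the idea, not FBAR's design; the alert names and handlers are made up.

```python
REMEDIATIONS = {}


def remediation(alert_name):
    """Decorator that registers a function as the fix for one alert type."""
    def register(func):
        REMEDIATIONS[alert_name] = func
        return func
    return register


@remediation("ssh_unreachable")
def reboot_host(alert):
    # A real handler would call an out-of-band management API here.
    return f"reboot requested for {alert['entity']}"


@remediation("disk_full")
def clean_tmp(alert):
    raise RuntimeError("cleanup failed")  # simulate a remediation that breaks


def handle(alert, notify):
    """Try the registered remediation; escalate to a human if it is missing or fails."""
    handler = REMEDIATIONS.get(alert["name"])
    if handler is None:
        notify(alert)
        return "escalated"
    try:
        return handler(alert)
    except Exception:
        notify(alert)
        return "escalated"


pages = []
fixed = handle({"name": "ssh_unreachable", "entity": "web001"}, pages.append)
unknown = handle({"name": "weird_latency", "entity": "web002"}, pages.append)
broken = handle({"name": "disk_full", "entity": "web003"}, pages.append)
```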
FBAR
Notifications & Subscriptions
What should I be paged about?
Notifications & Subscriptions
•Actionable alerts
•Impactful alerts
•Before you subscribe to an alert ask yourself:
•Can I automate this? (FBAR!)
•Is this actionable?
•Should an engineer wake up because of this?
Notifications & Subscriptions
Dashboards
What are they good for ?
Dashboards
• Awesome tool for debugging production issues
• Making your case:
• We need to fix this code path
• If we had the BLABLA tool it would reduce this by a factor of X
• Since we deployed the last release, engagement dropped by
10% in Western Europe
• Dashboards are cool to look at =) They’re the best way to get an
understanding of what is going on with the service
Cubism
based on Cubism.js
So, what did we have?
Use data!
Think before you alert
Surfacing problems you never thought
existed
Monitoring is not an “Ops Job”
Questions?
Ran Leibman
Production Engineer
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
