Lessons Learned Monitoring Production

As a growing company, Wix has tried many monitoring solutions; some worked better than others. In this talk we will go over the lessons we learned at Wix about what to monitor and how to monitor production systems, when to trigger alerts, and when not to.
We will go over some of the tools we use and also some of the tools we built to help us sleep better at night while doing 400 deployments to production every month.

http://www.youtube.com/watch?v=OLPA2KOWJ8I

Published in: Technology, Business


  1. Red Alert Or False Alarm • Monitoring Production Systems • Aviran Mordo, Head Of Back-End Engineering @ Wix • @aviranm • http://www.linkedin.com/in/aviran • http://www.aviransplace.com
  2. About Wix
  3. Wix in Numbers • 40,000,000 users – Adding over 1,000,000 new users each month • Static storage is over 200TB of data – Adding over 1TB of files every day • 3 Data centers + 2 Clouds (Google AE, Amazon) – Around 300 servers • 400 Deployments a month (Continuous Delivery) • Over 100,000,000 Server API calls per day • Over 450 people work at Wix – ~150 people in R&D
  7. End user monitoring
  8. Pingdom: Pros • 24/7 uptime monitoring • Different geo locations; Cons • No early warning – alerts only when the site is down • Does not tell you what the problem is • Does not monitor the API
  9. Keynote: Pros • Transaction monitoring from a real user's perspective • Supports Flash • Different geo locations; Cons • Flows must be recorded manually • Does not monitor internal servers
  10. Monitor Hardware and OS: Pros • Monitor machine health • Built-in integration with Graphite • Custom checks; Cons • Monitor at the OS level, not application level* • Does not know when there is a problem with the application (the
  11. Look inside the application
  12. Server Logs: Pros • Verbose and flexible; Cons • Too much information • Hard to read, not friendly to developers • Pinpointing the problem takes a long time • Server cluster need log
  13. Log collection • Client & Server logs are collected with Flume and Syslog-ng • Storm + Esper analyze log events and feed Graphite • Logs are stored in Hadoop+HBase for in-depth analysis
  14. Self Reporting Framework • Automatic method level performance reporting • Custom metering • Exception classifications • 4 severity levels (Recoverable, Warning, Error, Fatal) • Business Exceptions • System Exceptions
  15. App-Info
  16. App-Info Monitoring • Expose via API as JSON • Collect Metrics via Nagios / Graphite • Nagios alerts based on app-info metrics
  17. App-Info Monitoring: Pros • Detailed and easy view of a server • Almost no need to look at logs; Cons • Coarse-grained information for an overview • Too much information
  18. Graphite • All systems feed Graphite with metrics (Nagios, App-info, Storm) • Nagios queries Graphite and triggers alerts
  19. Graphite: Pros • Numerous formulas available • Share graphs • Easy to create new graphs; Cons • Not a dashboard (you can build a dashboard on top of it) • Design data schema (hierarchy) in
  21. New Relic: Pros • Easy to use – developer friendly • Service level overview (both cluster and single server) • Customizable dashboards • JVM profiler on production • Code instrumentation • Real User Monitoring
  22. New Relic: Cons • No distributed transaction trace for specific server • No exception classification • A lot of false alarms due to misbehaving bots • False alarms for low throughput services
  23. What’s Next
  25. We Are Hiring
  26. Aviran Mordo • @aviranm • http://www.linkedin.com/in/aviran • http://www.aviransplace.com • http://www.slideshare.net/aviranwix/monitoring-production
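
Slide 14 lists method level performance reporting and exception classification with four severity levels. The slides do not show how the framework is built, so the following is only a minimal sketch of the idea in Python. The names `metered`, `timings`, and `exception_counts` are hypothetical, and the default severities assigned to business and system exceptions are assumptions made for illustration.

```python
import time
from collections import defaultdict
from enum import Enum
from functools import wraps


class Severity(Enum):
    # The four levels named on the slide.
    RECOVERABLE = 1
    WARNING = 2
    ERROR = 3
    FATAL = 4


class BusinessException(Exception):
    # Assumed default severity; the slides do not specify one.
    severity = Severity.WARNING


class SystemException(Exception):
    # Assumed default severity; the slides do not specify one.
    severity = Severity.ERROR


# In-memory stores standing in for a real metrics backend (hypothetical).
timings = defaultdict(list)
exception_counts = defaultdict(int)


def metered(func):
    """Record the duration of every call and count exceptions by severity."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # Unclassified exceptions are counted as ERROR in this sketch.
            severity = getattr(exc, "severity", Severity.ERROR)
            exception_counts[(func.__name__, severity)] += 1
            raise
        finally:
            timings[func.__name__].append(time.perf_counter() - start)
    return wrapper
```

Decorating a function with `@metered` leaves its behavior unchanged; it only adds a timing sample per call and a counter entry per raised exception.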
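
Slide 16 says app-info metrics are exposed as JSON and that Nagios alerts on them. One way this could look is a small check that reads a metric from the JSON document and maps it to the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name `check_metric` and the JSON field used in the example are hypothetical; the slides do not show the actual app-info schema.

```python
import json

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_metric(app_info_json, metric, warn, crit):
    """Return (exit_code, message) for one numeric metric in an app-info document."""
    try:
        value = float(json.loads(app_info_json)[metric])
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing metric, or non-numeric value.
        return UNKNOWN, f"UNKNOWN - metric '{metric}' not available"
    if value >= crit:
        status = CRITICAL
    elif value >= warn:
        status = WARNING
    else:
        status = OK
    return status, f"{LABELS[status]} - {metric}={value:g}"
```

A wrapper script would fetch the JSON from the server, print the message, and exit with the returned code so Nagios can act on it.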
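
Slide 18 describes all systems feeding Graphite with metrics. As a rough illustration, the sketch below uses Carbon's plaintext protocol, where each metric is one line of the form `<path> <value> <timestamp>` sent over TCP (port 2003 by default). The host, the metric path, and the helper names are assumptions; the slides do not say which protocol or naming hierarchy Wix uses.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Render one Carbon plaintext line: path, value, Unix timestamp, newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metrics(lines, host="localhost", port=2003, timeout=2.0):
    """Send pre-formatted lines to a Carbon plaintext listener over TCP."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
```

For example, `send_metrics([format_metric("servers.web01.app_info.request_count", 42)])` would publish one data point under a hypothetical dotted path, which is also where the schema hierarchy mentioned on slide 19 comes into play.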
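
Slide 22 mentions false alarms for low throughput services. The talk does not say how Wix handles this; a common mitigation, sketched here under that caveat, is to evaluate an error-rate alert only when enough requests were seen in the window, since one failure out of two requests is a 50% error rate that says very little. The function name and both default thresholds are hypothetical.

```python
def should_alert(errors, requests, max_error_rate=0.05, min_requests=50):
    """Alert on error rate only when the sample is large enough to be meaningful."""
    if requests < min_requests:
        # Too few requests in the window: the rate is noise, so stay quiet.
        return False
    return errors / requests > max_error_rate
```

With these defaults, 1 error in 2 requests does not alert, while 10 errors in 100 requests does.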
