Lessons Learned Monitoring Production

As a growing company, Wix has tried many monitoring solutions; some worked better than others. In this talk we will go over the lessons we learned at Wix about what to monitor and how to monitor production systems, when to trigger alerts, and when not to trigger them.
We will go over some of the tools we use and also some of the tools we built to help us sleep better at night while doing 400 deployments to production every month.

http://www.youtube.com/watch?v=OLPA2KOWJ8I

  • Today I’m going to tell you how we grew our monitoring operations with the growth of the company
  • *Monitor applications via log parsing* Removing servers, changing the topology of servers
  • Logs usually don’t work

Transcript

  • 1. Red Alert Or False Alarm Monitoring Production Systems Aviran Mordo Head Of Back-End Engineering @ Wix @aviranm http://www.linkedin.com/in/aviran http://www.aviransplace.com 01:21
  • 2. About Wix
  • 3. Wix in Numbers • 40,000,000 users – Adding over 1,000,000 new users each month • Static storage is over 200TB of data – Adding over 1TB of files every day • 3 Data centers + 2 Clouds (Google AE, Amazon) – Around 300 servers • 400 Deployments a month (Continuous Delivery) • Over 100,000,000 Server API calls per day • Over 450 people work at Wix – ~ 150 people in R&D
  • 4.
  • 5.
  • 6.
  • 7. End user monitoring
  • 8. Pingdom Pros • 24/7 uptime monitoring • Different geo locations Cons • No early warning – alerts only when the site is down • Does not tell you what the problem is • Does not monitor the API
  • 9. Keynote Pros • Transaction monitoring from a real user's perspective • Supports Flash • Different geo locations Cons • Flows must be recorded manually • Does not monitor internal servers
  • 10. Monitor Hardware and OS Pros • Monitor machine health • Built-in integration with Graphite • Custom checks Cons • Monitor at the OS level, not application level* • Does not know when there is a problem with the application (the
  • 11. Look inside the application
  • 12. Server Logs Pros • Verbose and flexible Cons • Too much information • Hard to read, not developer friendly • Pinpointing the problem takes a long time • Server cluster need log
  • 13. Log collections • Client & Server logs are collected with Flume and Syslog-ng • Storm + Esper analyzes log events and feeds Graphite • Store in Hadoop+HBase for in-depth analysis
  • 14. Self Reporting Framework • Automatic method level performance reporting • Custom metering • Exception classifications • 4 severity levels (Recoverable, Warning, Error, Fatal) • Business Exceptions • System Exceptions
  • 15. App-Info
  • 16. App-Info Monitoring • Expose via API as JSON • Collect Metrics via Nagios / Graphite • Nagios alerts based on app-info metrics
  • 17. App-Info Monitoring Pros • Detailed and easy view of a server • Almost no need to look at logs Cons • Coarse-grained information for an overview • Too much information
  • 18. Graphite • All systems feed Graphite with metrics (Nagios, App-info, Storm) • Nagios queries Graphite and triggers alerts
  • 19. Graphite Pros • Numerous formulas available • Share graphs • Easy to create new graphs Cons • Not a dashboard (you can build a dashboard on top of it) • Design data schema (hierarchy) in
  • 20.
  • 21. New Relic Pros • Easy to use – developer friendly • Service level overview (both cluster and single server) • Customizable dashboards • JVM profiler on production • Code instrumentation • Real User Monitoring
  • 22. New Relic Cons • No distributed transaction trace for specific server • No exception classification • A lot of false alarms due to misbehaving bots • False alarms for low throughput services
  • 23. What’s Next
  • 24.
  • 25. We Are Hiring
  • 26. Aviran Mordo @aviranm http://www.linkedin.com/in/aviran http://www.aviransplace.com http://www.slideshare.net/aviranwix/monitoring-production
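
Slide 14 lists what the self-reporting framework does but not how it is built. The following is a minimal Python sketch of the idea only, not Wix's implementation: a decorator that records the duration of every call and counts failures by severity. The decorator name `reported`, the `classify` helper, the default severities on the two exception base classes, and the in-memory stores are all assumptions made for illustration; only the four severity levels and the business/system split come from the slide.

```python
import time
from collections import defaultdict
from enum import Enum
from functools import wraps


class Severity(Enum):
    # The four levels named on slide 14.
    RECOVERABLE = 1
    WARNING = 2
    ERROR = 3
    FATAL = 4


class BusinessException(Exception):
    """Hypothetical base class: the caller did something invalid."""
    severity = Severity.WARNING  # assumed default, not from the talk


class SystemException(Exception):
    """Hypothetical base class: our own infrastructure failed."""
    severity = Severity.ERROR  # assumed default, not from the talk


# In-memory stores standing in for a real metrics backend.
timings = defaultdict(list)          # method name -> durations in seconds
exception_counts = defaultdict(int)  # (method name, Severity) -> count


def classify(exc):
    """Map an exception to a severity; unclassified exceptions count as FATAL."""
    return getattr(exc, "severity", Severity.FATAL)


def reported(func):
    """Record the duration of every call and the severity of every failure."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            exception_counts[(func.__name__, classify(exc))] += 1
            raise  # reporting must not swallow the error
        finally:
            timings[func.__name__].append(time.perf_counter() - start)
    return wrapper


@reported
def save_site(name):
    """Hypothetical business method used to exercise the decorator."""
    if not name:
        raise BusinessException("site name is required")
    return f"saved {name}"
```

Because the decorator re-raises, application behavior is unchanged; the only side effect is that `timings` and `exception_counts` fill up and can be exposed or shipped to a metrics system.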
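
Slides 16 and 18 describe a pipeline: app-info is exposed as JSON, metrics are fed into Graphite, and Nagios alerts on them. A rough sketch of those three steps might look like the code below. The JSON field names and the `wix.app1` prefix are invented for the example, and the threshold logic is a generic Nagios-style check rather than anything shown in the talk. What is standard is Graphite's plaintext line format (`path value timestamp`) and the Nagios plugin exit codes 0 to 3.

```python
import json
import time

# Nagios plugin exit codes (standard convention).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def flatten(prefix, data):
    """Turn nested app-info JSON into (dotted.path, number) pairs.

    Non-numeric values (strings, booleans, lists) are skipped because
    Graphite stores numeric time series only.
    """
    for key, value in data.items():
        path = f"{prefix}.{key}"
        if isinstance(value, dict):
            yield from flatten(path, value)
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            yield path, value


def to_graphite_lines(prefix, app_info_json, timestamp=None):
    """Format metrics in Graphite's plaintext protocol: 'path value timestamp'.

    Each line would be newline-terminated and written to the Carbon
    listener; sending is left out to keep the sketch free of network I/O.
    """
    ts = int(time.time() if timestamp is None else timestamp)
    data = json.loads(app_info_json)
    return [f"{path} {value} {ts}" for path, value in flatten(prefix, data)]


def check_threshold(value, warn, crit):
    """Nagios-style check where higher values are worse."""
    if value is None:
        return UNKNOWN, "UNKNOWN - no data"
    if value >= crit:
        return CRITICAL, f"CRITICAL - value={value}"
    if value >= warn:
        return WARNING, f"WARNING - value={value}"
    return OK, f"OK - value={value}"


# Hypothetical app-info payload; the field names are assumptions.
SAMPLE_APP_INFO = (
    '{"jvm": {"heap_used_mb": 512}, "requests": {"errors": 7}, "version": "1.2.3"}'
)
```

Calling `to_graphite_lines("wix.app1", SAMPLE_APP_INFO)` yields one line per numeric field, and `check_threshold(7, warn=5, crit=10)` returns the WARNING code with a status message, which is the shape a Nagios plugin is expected to report.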
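
Slide 22 mentions false alarms for low throughput services. The talk does not say how Wix dealt with them, so the following is only one common way to reason about the problem, sketched in Python: an error-rate alert that refuses to trust a ratio computed from too few requests. The function name and both default thresholds are assumptions, and this is not a New Relic API.

```python
def should_alert(errors, requests, max_error_rate=0.05, min_requests=50):
    """Alert on error rate only when the window has enough traffic.

    With very little traffic a ratio is noisy: one failure out of two
    calls is a 50% error rate but says little about service health.
    Below min_requests we therefore compare against the absolute number
    of errors that would breach the rate at min_requests instead.
    """
    if requests < min_requests:
        return errors >= min_requests * max_error_rate
    return errors / requests > max_error_rate
```

With the defaults, a single failed call on a quiet service does not page anyone, while three failures in the same window still do, and busy services are judged on their actual error rate.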