Red Alert Or False Alarm
Monitoring Production Systems
Aviran Mordo
Head Of Back-End Engineering @ Wix
@aviranm
http://www...

Lessons Learned Monitoring Production

As a growing company, Wix has tried many monitoring solutions; some worked better than others. In this talk we will go over the lessons we learned at Wix about what to monitor and how to monitor production systems, when to trigger alerts, and when not to trigger them.
We will go over some of the tools we use and also some of the tools we built to help us sleep better at night while doing 400 deployments to production every month.

http://www.youtube.com/watch?v=OLPA2KOWJ8I

Published in: Technology, Business
  • Today I’m going to tell you how we grew our monitoring operations with the growth of the company
  • *Monitor applications via log parsing* Removing servers, changing the topology of servers
  • Logs usually don’t work
  • Transcript of "Lessons Learned Monitoring Production"

    1. Red Alert Or False Alarm
       Monitoring Production Systems
       Aviran Mordo, Head Of Back-End Engineering @ Wix
       @aviranm
       http://www.linkedin.com/in/aviran
       http://www.aviransplace.com
    2. About Wix
    3. Wix in Numbers
       • 40,000,000 users
         – Adding over 1,000,000 new users each month
       • Static storage is over 200TB of data
         – Adding over 1TB of files every day
       • 3 Data centers + 2 Clouds (Google AE, Amazon)
         – Around 300 servers
       • 400 Deployments a month (Continuous Delivery)
       • Over 100,000,000 Server API calls per day
       • Over 450 people work at Wix
         – ~ 150 people in R&D
    4. (no text on slide)
    5. (no text on slide)
    6. (no text on slide)
    7. End user monitoring
    8. Pingdom
       Cons
       • No early warning, only when the site is down
       • Don’t know what the problem is
       • Does not monitor API
       Pros
       • 24 / 7 uptime monitoring
       • Different geo locations
    9. Keynote
       Cons
       • Manually record flows
       • Does not monitor internal servers
       Pros
       • Transaction monitoring from real user perspective
       • Supports Flash
       • Different geo locations
    10. Monitor Hardware and OS
       Cons
       • Monitor at the OS level, not application level*
       • Does not know when there is a problem with the application (the ...
       Pros
       • Monitor machine health
       • Built-in integration with Graphite
       • Custom checks
    11. Look inside the application
    12. Server Logs
       Cons
       • Too much information
       • Hard to read, not friendly to developers
       • Pinpointing the problem takes a long time
       • Server cluster need log ...
       Pros
       • Verbose and flexible
    13. Log collections
       • Client & Server logs are collected with Flume and Syslog-ng
       • Storm + Esper analyzes log events and feeds Graphite
       • Store in Hadoop+HBase for in-depth analysis
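
The "analyze log events and feed Graphite" step on slide 13 can be illustrated with a minimal Python sketch. This is not Storm or Esper code; it is a plain stand-in that assumes events arrive as hypothetical `(timestamp, level)` pairs, counts them per fixed time window, and renders each count in Graphite's plaintext line format (`path value timestamp`). The `logs` metric prefix is also an assumption.

```python
from collections import Counter


def bucket_counts(events, window_seconds=60):
    """Count events per (window start, level); events are (timestamp, level) pairs."""
    counts = Counter()
    for ts, level in events:
        bucket = int(ts) - int(ts) % window_seconds  # start of the window
        counts[(bucket, level)] += 1
    return counts


def to_graphite_lines(counts, prefix="logs"):
    """Render counts as Graphite plaintext lines: '<path> <value> <timestamp>'."""
    return [
        f"{prefix}.{level.lower()}.count {n} {bucket}"
        for (bucket, level), n in sorted(counts.items())
    ]


if __name__ == "__main__":
    sample = [(120, "ERROR"), (130, "ERROR"), (185, "WARN")]
    for line in to_graphite_lines(bucket_counts(sample)):
        print(line)
```

A real stream processor would keep the windows open incrementally instead of counting a finished list, but the output shape it feeds to Graphite would be similar.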
    14. Self Reporting Framework
       • Automatic method level performance reporting
       • Custom metering
       • Exception classifications
       • 4 severity levels (Recoverable, Warning, Error, Fatal)
       • Business Exceptions
       • System Exceptions
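
The slides do not show how the self reporting framework is built, so the following Python sketch is only one way the ideas on slide 14 could fit together. It assumes a hypothetical `reported` decorator that times every call (method level performance reporting) and counts classified exceptions by category and severity. All class and variable names are illustrative, and the in-memory dictionaries stand in for whatever the real framework reports to.

```python
import functools
import time
from collections import defaultdict
from enum import Enum


class Severity(Enum):
    # The four severity levels named on the slide.
    RECOVERABLE = 1
    WARNING = 2
    ERROR = 3
    FATAL = 4


class ReportedException(Exception):
    """Base class: every exception carries a severity and a category."""
    category = "system"

    def __init__(self, message, severity=Severity.ERROR):
        super().__init__(message)
        self.severity = severity


class BusinessException(ReportedException):
    category = "business"


class SystemException(ReportedException):
    category = "system"


# In-memory stand-ins for the real reporting backend (assumption).
call_durations = defaultdict(list)   # function name -> list of seconds
exception_counts = defaultdict(int)  # (category, severity name) -> count


def reported(func):
    """Time every call and count classified exceptions, then re-raise them."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except ReportedException as exc:
            exception_counts[(exc.category, exc.severity.name)] += 1
            raise
        finally:
            call_durations[func.__name__].append(time.perf_counter() - start)
    return wrapper


@reported
def charge(amount):
    if amount <= 0:
        raise BusinessException("amount must be positive", Severity.WARNING)
    return amount
```

Separating business exceptions from system exceptions lets alerting ignore expected user errors while still counting them.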
    15. App-Info
    16. App-Info Monitoring
       • Expose via API as JSON
       • Collect Metrics via Nagios / Graphite
       • Nagios alerts based on app-info metrics
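
As a rough illustration of "expose via API as JSON", here is a minimal sketch using only the Python standard library. The `/app-info` path, the counter names, and the port are assumptions made for the example; the actual App-Info fields are not listed in the slides.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()
# Hypothetical counters that application code would increment.
COUNTERS = {"requests_total": 0, "errors_total": 0}


def app_info_snapshot():
    """Collect the current application state into a JSON-serializable dict."""
    return {
        "uptime_seconds": round(time.time() - START_TIME, 3),
        "counters": dict(COUNTERS),
    }


class AppInfoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/app-info":
            self.send_error(404)
            return
        body = json.dumps(app_info_snapshot()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Blocking helper; a collector can then poll http://127.0.0.1:<port>/app-info."""
    HTTPServer(("127.0.0.1", port), AppInfoHandler).serve_forever()
```

Because the endpoint returns plain JSON, the same data can be read by a person in a browser and polled by a metrics collector.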
    17. App-Info Monitoring
       Cons
       • Coarse-grained information for an overview
       • Too much information
       Pros
       • Detailed and easy view of a server
       • Almost no need to look at logs
    18. Graphite
       • All systems feed Graphite with metrics (Nagios, App-info, Storm)
       • Nagios queries Graphite and triggers alerts
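
Both directions on slide 18 can be sketched in a few lines of Python. Feeding uses Graphite's plaintext protocol (one `path value timestamp` line per metric, by default on Carbon's port 2003). The alerting side is shown as a Nagios-style check that evaluates the JSON returned by Graphite's render API (`format=json`) and maps it to the standard plugin exit codes. The host name and thresholds are placeholders, and the check assumes a query for a single target.

```python
import json
import socket
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def format_metric(path, value, timestamp=None):
    """Build one line of the Graphite plaintext protocol."""
    ts = int(timestamp if timestamp is not None else time.time())
    return f"{path} {value} {ts}\n"


def send_metric(path, value, host="graphite.example.com", port=2003):
    """Send a single metric to Carbon; host is a placeholder."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(format_metric(path, value).encode("ascii"))


def check_series(render_json, warn, crit):
    """Evaluate the latest non-null datapoint of the first series.

    render_json is the body of a render call with format=json:
    a list of {"target": ..., "datapoints": [[value, timestamp], ...]}.
    """
    series = json.loads(render_json)
    values = []
    if series:
        values = [v for v, _ts in series[0]["datapoints"] if v is not None]
    if not values:
        return UNKNOWN, "UNKNOWN - no datapoints"
    latest = values[-1]
    if latest >= crit:
        return CRITICAL, f"CRITICAL - value {latest}"
    if latest >= warn:
        return WARNING, f"WARNING - value {latest}"
    return OK, f"OK - value {latest}"
```

A Nagios plugin would fetch the render URL, pass the body to `check_series`, print the message, and exit with the returned code.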
    19. Graphite
       Cons
       • Not a dashboard (you can build dashboard on top of it)
       • Design data schema (hierarchy) in ...
       Pros
       • Numerous formulas available
       • Share graphs
       • Easy to create new graphs
    20. (no text on slide)
    21. New Relic
       Pros
       • Easy to use, developer friendly
       • Service level overview (both cluster and single server)
       • Customizable dashboards
       • JVM profiler on production
       • Code instrumentation
       • Real User Monitoring
    22. New Relic
       Cons
       • No distributed transaction trace for specific server
       • No exception classification
       • A lot of false alarms due to misbehaving bots
       • False alarms for low throughput services
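
The last bullet on slide 22 points at a general problem: on a low throughput service, a handful of failed requests can push the error rate over any percentage threshold. One common mitigation, shown here as a generic Python sketch and not as a New Relic feature, is to evaluate the error rate only once a minimum number of requests has been seen in the window. The function name and default thresholds are assumptions.

```python
def should_alert(requests, errors, max_error_rate=0.05, min_requests=100):
    """Alert on error rate only when the window has enough traffic to be meaningful."""
    if requests < min_requests:
        # Too few requests: 2 errors out of 4 is 50% but says very little.
        return False
    return errors / requests > max_error_rate
```

With these defaults, `should_alert(4, 2)` stays quiet while `should_alert(1000, 100)` fires. The trade-off is that a quiet service that is completely broken will not alert through this rule, so it needs a separate availability check.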
    23. What’s Next
    24. (no text on slide)
    25. We Are Hiring
    26. Aviran Mordo
       @aviranm
       http://www.linkedin.com/in/aviran
       http://www.aviransplace.com
       http://www.slideshare.net/aviranwix/monitoring-production
