Triage: real-world error logging for web applications

Notes from my talk at PyCon Australia 2012

  1. 1. Sunday, 19 August 12
  2. Triage
       Dealing with errors in production
       PyCon Australia 2012
       Luke Cawood / @lwcd
       Lars Yencken / @larsyencken
  3. 99designs
  5. [Architecture diagram: Balancer, Cache x2, App x4, Memcache, DB x2, Queue, Worker]
  6. [Architecture diagram: Balancer, Cache, App x4, Memcache, DB, Queue, Worker]
  7. Errors
  9. Hmmm....
  10. Triage
  11. Triage
        • Improve signal-to-noise ratio by aggregating similar errors
        • Allow for claiming, resolving and ranking errors in terms of importance
        • Integration with GitHub, build tools
        • Play with new tools and technology
        • Provide open source alternative to commercial products in this space
  12. Round 1 (Fight!)
  13. Round 1 (Fight!)
        • Errors continue to log directly to Mongo
        • Aggregation via incremental MapReduce
        • Deliver a prototype in one day
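
To make "incremental MapReduce" concrete, here is a minimal pure-Python sketch of the idea: only errors logged since the last run are mapped, and the stored result for each group is re-reduced together with the new values. It models the technique rather than MongoDB's own mapReduce command, and the field names (`type`, `file`, `line`, `time`) are assumptions for illustration, not Triage's schema.

```python
from collections import defaultdict


def map_error(error):
    """Map step: emit (group key, partial aggregate) for one raw error."""
    key = (error["type"], error["file"], error["line"])
    value = {"count": 1, "first_seen": error["time"], "last_seen": error["time"]}
    return key, value


def reduce_values(values):
    """Reduce step: merge partial aggregates.

    It must accept its own output as input, because stored results are
    re-reduced together with new values on every incremental run.
    """
    return {
        "count": sum(v["count"] for v in values),
        "first_seen": min(v["first_seen"] for v in values),
        "last_seen": max(v["last_seen"] for v in values),
    }


def incremental_map_reduce(aggregates, new_errors):
    """Fold only the errors logged since the last run into `aggregates`."""
    emitted = defaultdict(list)
    for error in new_errors:
        key, value = map_error(error)
        emitted[key].append(value)
    for key, values in emitted.items():
        if key in aggregates:
            values.append(aggregates[key])  # re-reduce the stored result
        aggregates[key] = reduce_values(values)
    return aggregates
```

Calling `incremental_map_reduce(aggregates, batch)` repeatedly with successive batches keeps one aggregate per group without rescanning old errors.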
  15. Scalability Fatality!
        • Worked fine during development
        • Production load caused the MapReduce to asplode!
        • (Not that we have a lot of errors, right?!)
  16. Round 2
  17. (sub)zeroMQ
        • Async error API using ZeroMQ pub/sub sockets
        • MessagePack as error format (fast, binary)
        • Aggregation in Python
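
A rough sketch of what an async error API over ZeroMQ pub/sub with MessagePack payloads could look like, using the pyzmq and msgpack packages. The event fields, the `example-app` project name and the choice to bind on the aggregator side are all assumptions for illustration; this is not Triage's actual wire format or client.

```python
import time
import traceback


def build_error_event(exc, project="example-app"):
    """Turn an exception into a plain dict that MessagePack can encode.

    The field names here are hypothetical.
    """
    frames = traceback.extract_tb(exc.__traceback__)
    last = frames[-1] if frames else None
    return {
        "project": project,
        "type": type(exc).__name__,
        "message": str(exc),
        "file": last.filename if last else None,
        "line": last.lineno if last else None,
        "timestamp": time.time(),
    }


def publish_error(socket, event):
    """Application side: fire-and-forget send on a connected PUB socket."""
    # Imported here so the module still loads where msgpack is not installed.
    import msgpack  # third-party: pip install msgpack

    socket.send(msgpack.packb(event, use_bin_type=True))


def consume_errors(endpoint, handle):
    """Aggregator side: bind a SUB socket and pass each decoded event to handle()."""
    import msgpack
    import zmq  # third-party: pip install pyzmq

    sub = zmq.Context.instance().socket(zmq.SUB)
    sub.bind(endpoint)
    sub.setsockopt(zmq.SUBSCRIBE, b"")  # empty prefix: receive every message
    while True:
        handle(msgpack.unpackb(sub.recv(), raw=False))
```

On the application side, a PUB socket created with `zmq.Context.instance().socket(zmq.PUB)` and connected to the same endpoint would be passed to `publish_error`. A PUB socket drops messages when no subscriber is connected, so the application never blocks on logging, at the cost of possibly losing errors.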
  18. Aggregation Method
        • Generate hash in Python based on error document
        • Query Mongo for error hash
        • Create or update error document based on outcome of query, incrementing counters etc. where appropriate
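
The three steps on this slide can be sketched as follows. A plain dict stands in for the Mongo collection so the example runs on its own, and the choice of fields that go into the hash (plus the digit normalisation) is an assumption: the slides do not say which parts of the error document Triage hashes.

```python
import hashlib
import json
import re


def error_hash(error):
    """Hash the fields that identify 'the same' error (hypothetical choice)."""
    # Replace digit runs so "user 42 not found" and "user 7 not found" group together.
    message = re.sub(r"\d+", "N", error["message"])
    identity = [error["type"], message, error["file"], error["line"]]
    return hashlib.sha256(json.dumps(identity).encode("utf-8")).hexdigest()


def aggregate(store, error):
    """Query by hash, then create or update the aggregated document."""
    key = error_hash(error)
    doc = store.get(key)  # stands in for querying Mongo by hash
    if doc is None:
        # First occurrence: create the aggregated document.
        store[key] = {
            "hash": key,
            "type": error["type"],
            "message": error["message"],
            "count": 1,
            "first_seen": error["time"],
            "last_seen": error["time"],
        }
    else:
        # Seen before: bump the counter and the last-seen time.
        doc["count"] += 1
        doc["last_seen"] = error["time"]
    return store[key]
```

How much of the error goes into the hash decides how coarse the grouping is. Note also that with more than one aggregator process the separate query and write can race.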
  22. Scalability Fatality 2
        • Multithreaded experiments
        • Mongo optimisations
        • There is no schema
        • The cake is a lie
        • Mongo ‘upsert’ rocks!
  23. Updating like a boss
        collection.update(criteria, document, upsert=False)
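
As an illustration of why upsert helps here, the sketch below folds "create or update" into a single call, so there is no separate query step. It uses `update_one`, the PyMongo 3+ spelling; the `collection.update(...)` signature on the slide is the older API, which was removed in PyMongo 4. The field names and the `record_error` helper are hypothetical, not taken from Triage.

```python
def record_error(collection, error_hash, error):
    """Create or update the aggregated document for one error in one call."""
    criteria = {"hash": error_hash}
    document = {
        # Applied on every occurrence.
        "$inc": {"count": 1},
        "$set": {"last_seen": error["time"]},
        # Applied only when the upsert inserts a new document.
        "$setOnInsert": {
            "type": error["type"],
            "message": error["message"],
            "first_seen": error["time"],
        },
    }
    # upsert=True: insert a document matching `criteria` if none exists yet.
    return collection.update_one(criteria, document, upsert=True)
```

With a real deployment, `collection` would be a PyMongo collection object such as `MongoClient().triage.errors` (database and collection names assumed). On insert, MongoDB also copies the equality fields from `criteria` into the new document, so `hash` is stored without being set explicitly.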
  29. Outcomes & future
  30. Outcomes
        • Getting the ‘right’ level of grouping is hard
        • What to do with errors that just won't go away?
        • Error occurrence count - what does this tell us?
  31. Future
        • Easier installation, package on PyPI
        • Better language support (plz halp)
        • Drop-in replacement for Airbrake etc.
        • Client-side logging (JavaScript)
        • Email-style filters & actions - ifttt.com
  32. Thanks
        • 99designs for research and development time
        • Contributors:
            • Luke Cawood - Project lead
            • Josh Benham - Developer
            • Jamison Lu - Developer
        • Additional assistance:
            • Lars Yencken - Operations
            • 99designs UX team
  33. Thanks for listening!
        https://github.com/lwc/triage
