Transcript of "Crash reports pycodeconf"
1. Processing Firefox Crash Reports with Python
   laura@mozilla.com (@lxt)
2. Overview
   • The basics
   • The numbers
   • Work process and tools
3. The basics
4. Socorro
   (Very Large Array at Socorro, New Mexico, USA. Photo by Hajor, 8 Aug 2004. Released under CC-BY-SA and/or GFDL. Source: http://en.wikipedia.org/wiki/File:USA.NM.VeryLargeArray.02.jpg)
5. “Socorro has a lot of moving parts” ... “I prefer to think of them as dancing parts”
6. Basic architecture (simplified)
   Components: Collector, Crashmover, HBase, Monitor, Processor, PostgreSQL, Middleware, Webapp, and cron jobs
7. Lifetime of a crash
   • Raw crash submitted by user via POST (metadata JSON + minidump)
   • Collected to disk by the collector (a web.py WSGI app)
   • Moved to HBase by the crashmover
   • Noticed in the queue by the monitor and assigned for processing
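The collector's first job above, persisting a raw crash to disk, can be sketched as a small function. This is an illustrative sketch only, not Socorro's actual collector code (the slides say the real one is a web.py WSGI app, on Python 2.6; Python 3 idioms are used here for brevity), and the on-disk layout and field names are invented:

```python
import json
import os
import uuid


def save_raw_crash(metadata, minidump, storage_dir):
    # Generate a crash ID so metadata and dump can be reunited later
    # by the processor.
    crash_id = uuid.uuid4().hex
    os.makedirs(storage_dir, exist_ok=True)
    # The metadata arrives as JSON alongside the binary minidump.
    with open(os.path.join(storage_dir, crash_id + ".json"), "w") as f:
        json.dump(metadata, f)
    with open(os.path.join(storage_dir, crash_id + ".dump"), "wb") as f:
        f.write(minidump)
    return crash_id
```

From here a crashmover-like process would pick the pair up and move it into HBase.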
8. Processing
   • Processor spins off minidump_stackwalk (MDSW)
   • MDSW reunites the raw crash with symbols to generate a stack
   • Processor generates a signature and pulls out other data
   • Processor writes the processed crash back to HBase and to PostgreSQL
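The signature-generation step can be sketched as picking the first interesting frame from the stack MDSW produced. The skip list and frame format here are invented for illustration; Socorro's real signature logic is considerably more involved:

```python
# Uninteresting wait/syscall frames to skip past (illustrative list only).
SKIP_FRAMES = ("KiFastSystemCallRet", "NtWaitForSingleObject",
               "WaitForSingleObjectEx")


def generate_signature(frames):
    """Return the first frame not on the skip list; fall back to the
    top frame, or a placeholder when the stack is empty."""
    for frame in frames:
        if frame not in SKIP_FRAMES:
            return frame
    if frames:
        return frames[0]
    return "EMPTY: no frames"
```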
9. Back-end processing
   A large number of cron jobs, e.g.:
   • Calculate aggregates: top crashers by signature, URL, domain
   • Process incoming builds from the FTP server
   • Match known crashes to Bugzilla bugs
   • Duplicate detection
   • Match up pairs of dumps (OOPP, content crashes, etc.)
   • Generate extracts (CSV) for engineers to analyze
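The first job above, top crashers by signature, is in essence a counting aggregate. A minimal sketch (the `signature` field name is assumed; the production job aggregates in the database, not in Python):

```python
from collections import Counter


def top_crashers_by_signature(processed_crashes, limit=10):
    # Count how many processed crashes share each signature and
    # return the most frequent ones, highest count first.
    counts = Counter(crash["signature"] for crash in processed_crashes)
    return counts.most_common(limit)
```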
10. Middleware
   • Moving all data access to go through a REST API (by end of year)
   • (Still some queries in the webapp)
   • Enables other front ends to the data, and lets us rewrite the webapp in Django in 2012
   • In upcoming versions (late 2011, 2012), each component will have its own API for status and health checks
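A per-component status endpoint of the kind the last bullet describes can be as small as a bare WSGI app. The path and payload below are assumptions for illustration, not Socorro's actual API:

```python
import json


def health_app(environ, start_response):
    # Answer /status with a small JSON health document; 404 otherwise.
    if environ.get("PATH_INFO") == "/status":
        body = json.dumps({"component": "processor", "ok": True}).encode("utf-8")
        start_response("200 OK", [("Content-Type", "application/json")])
        return [body]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

Any WSGI server (or a monitoring poller speaking plain HTTP) can consume this, which keeps health checks uniform across components.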
11. Webapp
   • Hard part here: how to visualize some of this data
   • Example: for nightly builds, moving to reporting in build time rather than clock time
   • Code is a bit crufty: rewrite in 2012
   • Currently KohanaPHP; will (likely) be Django
12. Implementation details
   • Python 2.6 mostly (except PHP for the webapp)
   • PostgreSQL 9.1, some stored procedures in PL/pgSQL
   • memcached for the webapp
   • Thrift for HBase access
   • For HBase we use CDH3
13. Scale
14. A different type of scaling:
   • Typical webapp: scale to millions of users without degradation of response time
   • Socorro: fewer than a hundred users, terabytes of data
15. Basic law of scale still applies: the bigger you get, the more spectacularly you fail
16. Some numbers
   • At peak we receive 2,300 crashes per minute
   • 2.5 million per day
   • Median crash size 150 KB; max size 20 MB (bigger submissions are rejected)
   • ~110 TB stored in HDFS (3× replication; ~40 TB of HBase data)
17. What can we do?
   • Does betaN have more (null signature) crashes than other betas?
   • Analyze differences between Flash version x and version y crashes
   • Detect duplicate crashes
   • Detect explosive crashes
   • Find “frankeninstalls”
   • Email victims of a malware-related crash
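Duplicate detection, one of the items above, can be approximated by flagging crashes from the same client with the same signature that arrive within a short window of each other. This heuristic and the field names are illustrative; Socorro's real detector uses richer criteria:

```python
def find_duplicates(crashes, window_seconds=60):
    # Walk crashes in submission order; flag any crash whose
    # (client, signature) pair was last seen within the window.
    duplicates = []
    last_seen = {}
    for crash in sorted(crashes, key=lambda c: c["timestamp"]):
        key = (crash["client_id"], crash["signature"])
        if key in last_seen and crash["timestamp"] - last_seen[key] <= window_seconds:
            duplicates.append(crash["crash_id"])
        last_seen[key] = crash["timestamp"]
    return duplicates
```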
18. Implementation scale
   • > 115 physical boxes (not cloud)
   • Now up to 8 developers, plus sysadmins, QA, and Hadoop ops/analysts
     • (Yes, we’re hiring: mozilla.org/careers)
   • Deploy approximately weekly, but could deploy continuously if needed
19. Managing complexity
20. Development process
   • Fork
   • Hard to install: use a VM (more in a moment)
   • Pull request with bugfix/feature
   • Code review
   • Lands on master
21. Development process, part 2
   • Jenkins polls GitHub master, picks up changes
   • Jenkins runs tests, builds a package
   • Package automatically picked up and pushed to dev
   • Wanted changes merged to the release branch
   • Jenkins builds the release branch; manual push to stage
   • QA runs acceptance tests on stage (Selenium/Jenkins + manual)
   • Manual push of the same build to production
22. Note: “manual” deployment means
   • Run a single script with the build as a parameter
   • It pushes the build out to all machines and restarts services where needed
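The single-script push could be sketched as below. The host names, remote paths, and restart command are placeholders, and the `run` callable is injectable so the push logic can be exercised without real machines:

```python
import subprocess


def deploy(build_path, hosts, run=subprocess.check_call):
    # Copy the given build to every host, then restart services there.
    for host in hosts:
        run(["scp", build_path, "%s:/data/releases/" % host])
        run(["ssh", host, "sudo /usr/local/bin/restart-socorro"])
```

Keeping the whole rollout behind one parameterized entry point is what makes the weekly "manual" deploy a single command.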
23. ABSOLUTELY CRITICAL: build all the machinery for continuous deployment, even if you don’t want to deploy continuously
24. Configuration management
   • Some releases involve a configuration change
   • These are controlled and managed through Puppet
   • Again, a single line to change config the same way every time
   • Config is controlled the same way on dev and stage; tested the same way; deployed the same way
25. Virtualization
   • You don’t want to install HBase
   • Use Vagrant to set up a virtual machine
   • Use Jenkins to build a new Vagrant VM with each code build
   • Use Puppet to configure the VM the same way as production
   • The tricky part is still getting the right data
26. Virtualization, part 2
   • VM work at github.com/rhelmer/socorro-vagrant
   • Also pulled in as a submodule of socorro
27. Upcoming
   • ElasticSearch implemented for better search, including faceted search; waiting on hardware before we pref it on
   • More analytics: automatic detection of explosive crashes, malware, etc.
   • Better queueing: looking at the Sagrada queue
   • Grand Unified Configuration System
28. Everything is open (source)
   • Fork: https://github.com/mozilla/socorro
   • Read/file/fix bugs: https://bugzilla.mozilla.org/
   • Docs: http://www.readthedocs.org/docs/socorro
   • Mailing list: https://lists.mozilla.org/listinfo/tools-socorro
   • Join us on IRC: irc.mozilla.org #breakpad
29. Questions?
   • Ask me (almost) anything, now or later
   • laura@mozilla.com