Processing Firefox Crash Reports with Python

laura@mozilla.com
@lxt
Overview



• The basics
• The numbers
• Work process and tools
The basics
Socorro




          Very Large Array at Socorro, New Mexico, USA. Photo taken by Hajor, 08.Aug.2004. Released under cc.by.sa
          and/or GFDL. Source: http://en.wikipedia.org/wiki/File:USA.NM.VeryLargeArray.02.jpg
“Socorro has a lot of moving parts”
...
“I prefer to think of them as dancing parts”
Basic architecture (simplified)

[Architecture diagram: Collector, Crashmover, cronjobs, HBase, Monitor, Middleware, PostgreSQL, Processor, Webapp]
Lifetime of a crash


 • Raw crash submitted by user via POST (metadata JSON + minidump; see the sketch after this list)

 • Collected to disk by collector (web.py WSGI app)
 • Moved to HBase by crashmover
 • Noticed in queue by monitor and assigned for processing
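
A minimal sketch of the submission step, assuming the requests library; the collector URL, metadata fields, and the multipart field name are assumptions, not Socorro's actual contract:

# Hypothetical sketch: submit a raw crash (metadata + minidump) to the collector.
import requests

metadata = {
    "ProductName": "Firefox",   # made-up metadata fields
    "Version": "7.0",
    "CrashTime": "1317334940",
}

with open("crash.dmp", "rb") as dump:
    response = requests.post(
        "https://crash-reports.example.com/submit",    # hypothetical collector URL
        data=metadata,                                  # metadata key/value pairs
        files={"upload_file_minidump": dump},           # assumed field name for the minidump
    )

print(response.status_code, response.text)  # the collector answers with a crash ID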
Processing


• Processor spins off minidump_stackwalk (MDSW); see the sketch after this list
• MDSW re-unites raw crash with symbols to generate a stack
• Processor generates a signature and pulls out other data
• Processor writes processed crash back to HBase and to PostgreSQL
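
A hedged sketch of the stackwalk step, assuming a Breakpad-style minidump_stackwalk binary on the PATH; flags, paths, and the symbols layout are assumptions:

# Illustrative only: run minidump_stackwalk over a dump and capture its output.
import subprocess

def stackwalk(dump_path, symbols_dir):
    """Return the raw stackwalk output for a minidump (paths are assumptions)."""
    proc = subprocess.Popen(
        ["minidump_stackwalk", dump_path, symbols_dir],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError("stackwalk failed: %r" % err)
    return out

# The processor then parses this output into frames and derives a signature.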
Back end processing
 Large number of cron jobs, e.g.:

  • Calculate aggregates: top crashers by signature, URL, domain (see the sketch after this list)

  • Process incoming builds from ftp server
  • Match known crashes to bugzilla bugs
  • Duplicate detection
  • Match up pairs of dumps (OOPP, content crashes, etc)
  • Generate extracts (CSV) for engineers to analyze
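
As an illustration of one such aggregate, a hedged sketch of "top crashers by signature" using psycopg2; the table and column names (reports, signature, date_processed) are assumptions, not Socorro's actual schema:

# Illustrative cron-style aggregate: top crash signatures over the last 7 days.
import psycopg2

conn = psycopg2.connect("dbname=breakpad user=processor")  # hypothetical DSN
cur = conn.cursor()
cur.execute("""
    SELECT signature, count(*) AS crash_count
    FROM reports                                    -- assumed table name
    WHERE date_processed > now() - interval '7 days'
    GROUP BY signature
    ORDER BY crash_count DESC
    LIMIT 20
""")
for signature, crash_count in cur.fetchall():
    print(signature, crash_count)
cur.close()
conn.close()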
Middleware


• Moving all data access to go through a REST API (by end of year); see the example after this list

• (Still some queries in webapp)
• Enables other front ends to the data, and lets us rewrite the webapp using Django in 2012

• In an upcoming version (late 2011, 2012) each component will have its own API for status and health checks
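
To illustrate what the REST layer enables, a hypothetical client call; the host, path, parameters, and response shape are all assumptions:

# Hypothetical example of a front end going through the middleware REST API
# instead of querying PostgreSQL directly.
import requests

resp = requests.get(
    "https://crash-stats.example.com/api/crashes/",    # made-up endpoint
    params={"product": "Firefox", "version": "7.0", "limit": 10},
)
resp.raise_for_status()
for crash in resp.json().get("hits", []):              # assumed response shape
    print(crash.get("signature"), crash.get("date_processed"))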
Webapp


• Hard part here: how to visualize some of this data
• Example: nightly builds, moving to reporting in build time rather than clock time

• Code a bit crufty: rewrite in 2012
• Currently KohanaPHP, will be (likely) Django
Implementation details


 • Python 2.6 mostly (except PHP for the webapp)
 • PostgreSQL 9.1, some stored procs in PL/pgSQL
 • memcache for the webapp
 • Thrift for HBase access (see the sketch below)
 • For HBase we use CDH3
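
A minimal sketch of HBase access over Thrift, here using the happybase client as a stand-in; Socorro has its own Thrift layer, and the table name, row key, and column family below are made up:

# Illustrative HBase read over Thrift via happybase.
import happybase

connection = happybase.Connection("hbase-thrift.example.com", port=9090)  # hypothetical host
table = connection.table("crash_reports")                                 # assumed table name

crash_id = b"0bba929f-8721-460c-dead-a43c20071025"   # made-up crash ID / row key
row = table.row(crash_id)
raw_dump = row.get(b"raw:dump")        # hypothetical column family:qualifier
meta_json = row.get(b"raw:meta_data")
connection.close()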
Scale
A different type of scaling:
• Typical webapp: scale to millions of users without
   degradation of response time
• Socorro: less than a hundred users, terabytes of data.
Basic law of scale still applies:
The bigger you get, the more spectacularly you fail
Some numbers



• At peak we receive 2300 crashes per minute
• 2.5 million per day
• Median crash size 150 KB, max size 20 MB (larger crashes are rejected)
• ~110 TB stored in HDFS (3x replication, ~40 TB of HBase data); see the back-of-envelope check below
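
A quick back-of-envelope check on those numbers (a sketch only: the 150 KB median stands in for the mean, so treat it as an order-of-magnitude estimate):

# Rough consistency check on the volume numbers above.
crashes_per_day = 2.5e6
median_crash_kb = 150.0

raw_per_day_gb = crashes_per_day * median_crash_kb / 1e6
print("approx raw data per day: %.0f GB" % raw_per_day_gb)     # ~375 GB/day of dumps

hdfs_total_tb = 110.0
replication = 3
print("approx unreplicated data: %.0f TB" % (hdfs_total_tb / replication))  # ~37 TB, close to the ~40 TB HBase figure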
What can we do?

• Does betaN have more (null signature) crashes than other betas?
• Analyze differences between crashes in Flash versions x and y
• Detect duplicate crashes
• Detect explosive crashes
• Find “frankeninstalls”
• Email victims of a malware-related crash
Implementation scale


• > 115 physical boxes (not cloud)
• Now up to 8 developers + sysadmins + QA + Hadoop ops/analysts

  • (Yes, we're hiring: mozilla.org/careers)

• Deploy approximately weekly, but could do continuous deployment if needed
Managing complexity
Development process


• Fork
• Hard to install: use a VM (more in a moment)
• Pull request with bugfix/feature
• Code review
• Lands on master
Development process - 2

• Jenkins polls github master, picks up changes
• Jenkins runs tests, builds a package
• Package automatically picked up and pushed to dev
• Wanted changes merged to release branch
• Jenkins builds release branch, manual push to stage
• QA runs acceptance on stage (Selenium/Jenkins + manual)
• Manual push same build to production
Note: “manual” deployment =

• Run a single script with the build as parameter (see the sketch below)

• Pushes it out to all machines and restarts where needed
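
A hypothetical sketch of what that single script might look like; the host list, paths, and restart command are invented for illustration and are not Socorro's actual deploy tooling:

# Hypothetical deploy helper: push a build to every host and restart services.
import subprocess
import sys

HOSTS = ["processor1", "processor2", "web1", "web2"]   # made-up host names

def deploy(build_path):
    for host in HOSTS:
        # copy the build over, then restart the affected services
        subprocess.check_call(["scp", build_path, "%s:/data/socorro/" % host])
        subprocess.check_call(["ssh", host, "sudo /etc/init.d/socorro restart"])

if __name__ == "__main__":
    deploy(sys.argv[1])   # usage: deploy.py socorro-build.tar.gz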
ABSOLUTELY CRITICAL

Build all the machinery for continuous deployment, even if you don’t want to deploy continuously
Configuration management


• Some releases involve a configuration change
• These are controlled and managed through Puppet
• Again, a single line to change config the same way every time

• Config controlled the same way on dev and stage; tested the same way; deployed the same way
Virtualization


 • You don’t want to install HBase
 • Use Vagrant to set up a virtual machine
 • Use Jenkins to build a new Vagrant VM with each code build
 • Use Puppet to configure the VM the same way as production
 • Tricky part is still getting the right data
Virtualization - 2



• VM work at github.com/rhelmer/socorro-vagrant

• Also pulled in as a submodule to socorro
Upcoming


• ElasticSearch implemented for better search, including faceted search; waiting on hardware to turn it on (see the example below)

• More analytics: automatic detection of explosive crashes, malware, etc.

• Better queueing: looking at Sagrada queue
• Grand Unified Configuration System
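
As a hedged illustration of faceted search, a sketch using the official Python client; the host, index, and field names are assumptions, and a terms aggregation stands in for facets:

# Illustrative faceted-style query: crash counts bucketed by signature.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])          # hypothetical host
body = {
    "query": {"term": {"product": "Firefox"}},
    "aggs": {"by_signature": {"terms": {"field": "signature", "size": 10}}},
}
result = es.search(index="crash_reports", body=body)   # assumed index name
for bucket in result["aggregations"]["by_signature"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])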
Everything is open (source)


  Fork: https://github.com/mozilla/socorro
  Read/file/fix bugs: https://bugzilla.mozilla.org/
  Docs: http://www.readthedocs.org/docs/socorro
  Mailing list: https://lists.mozilla.org/listinfo/tools-socorro
  Join us in IRC: irc.mozilla.org #breakpad
Questions?




• Ask me (almost) anything, now or later
• laura@mozilla.com
