Lifetime of a crash • Raw crash submitted by user via POST (metadata JSON + minidump) • Collected to disk by the collector (a web.py WSGI app) • Moved to HBase by the crashmover • Noticed in the queue by the monitor and assigned for processing
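The first step above (collect to disk) can be sketched roughly like this — a minimal illustration, not Socorro's actual collector code; the ID scheme and file layout here are invented for the example:

```python
# Sketch of the collect-to-disk step: accept a raw crash (metadata JSON +
# minidump bytes), assign it an ID, and persist both parts side by side so
# the crashmover can later pick them up and move them into HBase.
import json
import os
import uuid

def save_raw_crash(metadata, dump_bytes, storage_dir):
    """Write the metadata JSON and the minidump next to each other on disk."""
    crash_id = uuid.uuid4().hex  # hypothetical ID scheme for illustration
    os.makedirs(storage_dir, exist_ok=True)
    with open(os.path.join(storage_dir, crash_id + ".json"), "w") as f:
        json.dump(metadata, f)
    with open(os.path.join(storage_dir, crash_id + ".dump"), "wb") as f:
        f.write(dump_bytes)
    return crash_id
```

Keeping the two halves of the raw crash together under one ID is what lets the later pipeline stages re-associate them.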
Processing • Processor spins off minidump_stackwalk (MDSW) • MDSW reunites the raw crash with symbols to generate a stack • Processor generates a signature and pulls out other data • Processor writes the processed crash back to HBase and to PostgreSQL
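As a toy illustration of the signature-generation idea (this is not Socorro's real algorithm, and the skip list is invented): take the crashing thread's frames, drop boilerplate frames, and join the first few meaningful ones into one searchable string.

```python
# Toy signature generator: filter out uninteresting frames, then join the
# top few remaining frames into a single string that crashes can be
# grouped and searched by.
SKIP_FRAMES = {"KiFastSystemCallRet", "RaiseException"}  # hypothetical skip list

def generate_signature(frames, max_frames=3):
    """frames: list of function names from the crashing thread, top first."""
    meaningful = [f for f in frames if f not in SKIP_FRAMES]
    return " | ".join(meaningful[:max_frames])
```

A stable, deduplicating string like this is what makes aggregates such as "top crashers by signature" possible downstream.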
Back end processing Large number of cron jobs, e.g.: • Calculate aggregates: top crashers by signature, URL, and domain • Process incoming builds from the FTP server • Match known crashes to Bugzilla bugs • Duplicate detection • Match up pairs of dumps (OOPP, content crashes, etc.) • Generate extracts (CSV) for engineers to analyze
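The core of an aggregate job like "top crashers by signature" can be sketched in a few lines — a simplified stand-in for what would really be a SQL or HBase scan job:

```python
# Sketch of a top-crashers aggregate: count processed crashes per signature
# and report the N most frequent ones.
from collections import Counter

def top_crashers(crashes, n=10):
    """crashes: iterable of processed-crash dicts with a 'signature' key."""
    counts = Counter(c["signature"] for c in crashes)
    return counts.most_common(n)
```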
Middleware • Moving all data access to go through a REST API (by end of year) • (Still some queries in the webapp) • Enables other front ends to the data, and lets us rewrite the webapp in Django in 2012 • In an upcoming version (late 2011, 2012) each component will have its own API for status and health checks
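A per-component status/health-check endpoint of the kind described above might look like this tiny WSGI app — the endpoint path and payload fields are invented for the sketch:

```python
# Sketch of a component health-check endpoint as a minimal WSGI app:
# GET /status returns a small JSON document describing the component.
import json

def health_app(environ, start_response):
    if environ.get("PATH_INFO") == "/status":
        body = json.dumps({"service": "processor", "healthy": True}).encode()
        start_response("200 OK", [("Content-Type", "application/json")])
        return [body]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

Because each component exposes the same shape of endpoint, monitoring can poll them all uniformly.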
Webapp • The hard part here: how to visualize some of this data • Example: nightly builds, moving to reporting in build time rather than clock time • Code is a bit crufty: rewrite planned for 2012 • Currently KohanaPHP; will (likely) become Django
Implementation details • Python 2.6 mostly (except PHP for the webapp) • PostgreSQL 9.1, some stored procedures in PL/pgSQL • memcached for the webapp • Thrift for HBase access • For HBase we use CDH3
A different type of scaling: • Typical webapp: scale to millions of users without degrading response time • Socorro: fewer than a hundred users, but terabytes of data
Basic law of scale still applies: the bigger you get, the more spectacularly you fail
Some numbers • At peak we receive 2,300 crashes per minute • 2.5 million per day • Median crash size 150 KB; max size 20 MB (larger crashes are rejected) • ~110 TB stored in HDFS (3x replication; ~40 TB of HBase data)
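A quick back-of-envelope check on the numbers above: 2.5 million crashes per day at a ~150 KB median works out to a few hundred GB of raw crash data ingested daily.

```python
# Back-of-envelope: daily raw ingest from the figures on this slide.
crashes_per_day = 2.5e6
median_size_bytes = 150 * 1024          # 150 KB median crash size
daily_gb = crashes_per_day * median_size_bytes / 1024**3
# roughly 350-360 GB of raw crash data per day, before replication
```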
What can we do? • Does betaN have more (null signature) crashes than other betas? • Analyze differences between crashes on Flash versions x and y • Detect duplicate crashes • Detect explosive crashes • Find “frankeninstalls” • Email victims of a malware-related crash
Implementation scale • >115 physical boxes (not cloud) • Now up to 8 developers + sysadmins + QA + Hadoop ops/analysts • (Yes, we're hiring: mozilla.org/careers) • Deploy approximately weekly, but could deploy continuously if needed
Development process • Fork • Hard to install: use a VM (more in a moment) • Pull request with bugfix/feature • Code review • Lands on master
Development process - 2 • Jenkins polls GitHub master and picks up changes • Jenkins runs tests and builds a package • Package is automatically picked up and pushed to dev • Wanted changes are merged to the release branch • Jenkins builds the release branch; manual push to stage • QA runs acceptance tests on stage (Selenium/Jenkins + manual) • Manual push of the same build to production
Note: “manual” deployment = • Run a single script with the build as a parameter • It pushes the build out to all machines and restarts services where needed
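The shape of such a one-command deploy script can be sketched as below; the host list and remote commands are invented for illustration, and the command runner is injected so the push mechanism (e.g. ssh) stays pluggable:

```python
# Sketch of a single-script deploy: given a build identifier, push it to
# every host and restart services where needed.
def deploy(build, hosts, run):
    """run(host, command) executes a command on a host (e.g. via ssh)."""
    for host in hosts:
        run(host, "install-socorro-build %s" % build)  # hypothetical command
        run(host, "restart-socorro-services")          # hypothetical command
```

Making the whole rollout a single parameterized script is what keeps "manual" deployment a one-line operation.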
ABSOLUTELY CRITICAL: all the machinery for continuous deployment, even if you don’t want to deploy continuously
Configuration management • Some releases involve a configuration change • These are controlled and managed through Puppet • Again, a single-line change applies the config the same way every time • Config is controlled the same way on dev and stage; tested the same way; deployed the same way
Virtualization • You don’t want to install HBase yourself • Use Vagrant to set up a virtual machine • Use Jenkins to build a new Vagrant VM with each code build • Use Puppet to configure the VM the same way as production • The tricky part is still getting the right data
Virtualization - 2 • VM work at github.com/rhelmer/socorro-vagrant • Also pulled into socorro as a submodule
Upcoming • ElasticSearch implemented for better search, including faceted search; waiting on hardware to pref it on • More analytics: automatic detection of explosive crashes, malware, etc. • Better queueing: looking at the Sagrada queue • Grand Unified Configuration System
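As a sketch of the faceted-search idea, here is what a query body using ElasticSearch's (pre-1.0) facets API might look like: one filter query plus a terms facet that buckets matching crashes by signature. The field names are assumptions, not Socorro's real mapping.

```python
# Sketch of a faceted crash-search query body for the old ElasticSearch
# facets API: match crashes for one product, and bucket the hits by
# signature so the UI can show counts per facet value.
def faceted_crash_query(product, facet_size=20):
    return {
        "query": {"term": {"product": product}},
        "facets": {
            "by_signature": {
                "terms": {"field": "signature", "size": facet_size}
            }
        },
    }
```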
Everything is open (source) Fork: https://github.com/mozilla/socorro Read/file/fix bugs: https://bugzilla.mozilla.org/ Docs: http://www.readthedocs.org/docs/socorro Mailing list: https://lists.mozilla.org/listinfo/tools-socorro Join us in IRC: irc.mozilla.org #breakpad
Questions? • Ask me (almost) anything, now or later • firstname.lastname@example.org