Processing Firefox Crash Reports with Python

laura@mozilla.com
@lxt
Overview



• The basics
• The numbers
• Work process and tools
The basics
Socorro




          Very Large Array at Socorro, New Mexico, USA. Photo taken by Hajor, 08.Aug.2004. Released under cc.by.sa
          and/or GFDL. Source: http://en.wikipedia.org/wiki/File:USA.NM.VeryLargeArray.02.jpg
“Socorro has a lot of moving parts”
...
“I prefer to think of them as dancing parts”
Basic architecture (simplified)

[Architecture diagram: Collector, Crashmover, cronjobs, HBase, Monitor, Middleware, PostgreSQL, Processor, Webapp]
Lifetime of a crash


 • Raw crash submitted by user via POST (metadata JSON + minidump; see the sketch after this list)

 • Collected to disk by collector (web.py WSGI app)
 • Moved to HBase by crashmover
 • Noticed in queue by monitor and assigned for processing
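
A minimal sketch of the submission step, assuming the requests library; the collector URL, metadata fields, and the multipart field name are assumptions, not Socorro's actual contract:

# Hypothetical sketch: submit a raw crash (metadata + minidump) to the collector.
import requests

metadata = {
    "ProductName": "Firefox",   # made-up metadata fields
    "Version": "7.0",
    "CrashTime": "1317334940",
}

with open("crash.dmp", "rb") as dump:
    response = requests.post(
        "https://crash-reports.example.com/submit",    # hypothetical collector URL
        data=metadata,                                  # metadata key/value pairs
        files={"upload_file_minidump": dump},           # assumed field name for the minidump
    )

print(response.status_code, response.text)  # the collector answers with a crash ID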
Processing


• Processor spins off minidump_stackwalk (MDSW); see the sketch after this list
• MDSW re-unites raw crash with symbols to generate a stack
• Processor generates a signature and pulls out other data
• Processor writes processed crash back to HBase and to PostgreSQL
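
A hedged sketch of the stackwalk step, assuming a Breakpad-style minidump_stackwalk binary on the PATH; flags, paths, and the symbols layout are assumptions:

# Illustrative only: run minidump_stackwalk over a dump and capture its output.
import subprocess

def stackwalk(dump_path, symbols_dir):
    """Return the raw stackwalk output for a minidump (paths are assumptions)."""
    proc = subprocess.Popen(
        ["minidump_stackwalk", dump_path, symbols_dir],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError("stackwalk failed: %r" % err)
    return out

# The processor then parses this output into frames and derives a signature.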
Back end processing
 Large number of cron jobs, e.g.:

  • Calculate aggregates: top crashers by signature, URL, domain (see the sketch after this list)

  • Process incoming builds from ftp server
  • Match known crashes to bugzilla bugs
  • Duplicate detection
  • Match up pairs of dumps (OOPP, content crashes, etc)
  • Generate extracts (CSV) for engineers to analyze
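
As an illustration of one such aggregate, a hedged sketch of "top crashers by signature" using psycopg2; the table and column names (reports, signature, date_processed) are assumptions, not Socorro's actual schema:

# Illustrative cron-style aggregate: top crash signatures over the last 7 days.
import psycopg2

conn = psycopg2.connect("dbname=breakpad user=processor")  # hypothetical DSN
cur = conn.cursor()
cur.execute("""
    SELECT signature, count(*) AS crash_count
    FROM reports                                    -- assumed table name
    WHERE date_processed > now() - interval '7 days'
    GROUP BY signature
    ORDER BY crash_count DESC
    LIMIT 20
""")
for signature, crash_count in cur.fetchall():
    print(signature, crash_count)
cur.close()
conn.close()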
Middleware


• Moving all data access to go through a REST API (by end of year); see the example after this list

• (Still some queries in webapp)
• Enables other front ends to the data, and lets us rewrite the webapp using Django in 2012

• In an upcoming version (late 2011, 2012) each component will have its own API for status and health checks
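
To illustrate what the REST layer enables, a hypothetical client call; the host, path, parameters, and response shape are all assumptions:

# Hypothetical example of a front end going through the middleware REST API
# instead of querying PostgreSQL directly.
import requests

resp = requests.get(
    "https://crash-stats.example.com/api/crashes/",    # made-up endpoint
    params={"product": "Firefox", "version": "7.0", "limit": 10},
)
resp.raise_for_status()
for crash in resp.json().get("hits", []):              # assumed response shape
    print(crash.get("signature"), crash.get("date_processed"))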
Webapp


• Hard part here: how to visualize some of this data
• Example: nightly builds, moving to reporting in build time rather than clock time

• Code a bit crufty: rewrite in 2012
• Currently KohanaPHP, will be (likely) Django
Implementation details


 • Python 2.6 mostly (except PHP for the webapp)
 • PostgreSQL 9.1, some stored procs in PL/pgSQL
 • memcache for the webapp
 • Thrift for HBase access (see the sketch below)
 • For HBase we use CDH3
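
A minimal sketch of HBase access over Thrift, here using the happybase client as a stand-in; Socorro has its own Thrift layer, and the table name, row key, and column family below are made up:

# Illustrative HBase read over Thrift via happybase.
import happybase

connection = happybase.Connection("hbase-thrift.example.com", port=9090)  # hypothetical host
table = connection.table("crash_reports")                                 # assumed table name

crash_id = b"0bba929f-8721-460c-dead-a43c20071025"   # made-up crash ID / row key
row = table.row(crash_id)
raw_dump = row.get(b"raw:dump")        # hypothetical column family:qualifier
meta_json = row.get(b"raw:meta_data")
connection.close()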
Scale
A different type of scaling:
• Typical webapp: scale to millions of users without
   degradation of response time
• Socorro: less than a hundred users, terabytes of data.
Basic law of scale still applies:
The bigger you get, the more spectacularly you fail
Some numbers



• At peak we receive 2300 crashes per minute
• 2.5 million per day
• Median crash size 150 KB, max size 20 MB (larger crashes are rejected)
• ~110 TB stored in HDFS (3x replication, ~40 TB of HBase data); see the back-of-envelope check below
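
A quick back-of-envelope check on those numbers (a sketch only: the 150 KB median stands in for the mean, so treat it as an order-of-magnitude estimate):

# Rough consistency check on the volume numbers above.
crashes_per_day = 2.5e6
median_crash_kb = 150.0

raw_per_day_gb = crashes_per_day * median_crash_kb / 1e6
print("approx raw data per day: %.0f GB" % raw_per_day_gb)     # ~375 GB/day of dumps

hdfs_total_tb = 110.0
replication = 3
print("approx unreplicated data: %.0f TB" % (hdfs_total_tb / replication))  # ~37 TB, close to the ~40 TB HBase figure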
What can we do?

• Does betaN have more (null signature) crashes than other betas?
• Analyze differences between crashes in Flash versions x and y
• Detect duplicate crashes
• Detect explosive crashes
• Find “frankeninstalls”
• Email victims of a malware-related crash
Implementation scale


• > 115 physical boxes (not cloud)
• Now up to 8 developers + sysadmins + QA + Hadoop ops/analysts

  • (Yes, we're hiring: mozilla.org/careers)

• Deploy approximately weekly, but could do continuous deployment if needed
Managing complexity
Development process


• Fork
• Hard to install: use a VM (more in a moment)
• Pull request with bugfix/feature
• Code review
• Lands on master
Development process - 2

• Jenkins polls github master, picks up changes
• Jenkins runs tests, builds a package
• Package automatically picked up and pushed to dev
• Wanted changes merged to release branch
• Jenkins builds release branch, manual push to stage
• QA runs acceptance on stage (Selenium/Jenkins + manual)
• Manual push same build to production
Note: “manual” deployment =

• Run a single script with the build as parameter (see the sketch below)

• Pushes it out to all machines and restarts where needed
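
A hypothetical sketch of what that single script might look like; the host list, paths, and restart command are invented for illustration and are not Socorro's actual deploy tooling:

# Hypothetical deploy helper: push a build to every host and restart services.
import subprocess
import sys

HOSTS = ["processor1", "processor2", "web1", "web2"]   # made-up host names

def deploy(build_path):
    for host in HOSTS:
        # copy the build over, then restart the affected services
        subprocess.check_call(["scp", build_path, "%s:/data/socorro/" % host])
        subprocess.check_call(["ssh", host, "sudo /etc/init.d/socorro restart"])

if __name__ == "__main__":
    deploy(sys.argv[1])   # usage: deploy.py socorro-build.tar.gz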
ABSOLUTELY CRITICAL

Build all the machinery for continuous deployment, even if you don’t want to deploy continuously
Configuration management


• Some releases involve a configuration change
• These are controlled and managed through Puppet
• Again, a single line to change config the same way every time

• Config controlled the same way on dev and stage; tested the same way; deployed the same way
Virtualization


 • You don’t want to install HBase
 • Use Vagrant to set up a virtual machine
 • Use Jenkins to build a new Vagrant VM with each code build
 • Use Puppet to configure the VM the same way as production
 • Tricky part is still getting the right data
Virtualization - 2



• VM work at github.com/rhelmer/socorro-vagrant

• Also pulled in as a submodule to socorro
Upcoming


• ElasticSearch implemented for better search, including faceted search; waiting on hardware to turn it on (see the example below)

• More analytics: automatic detection of explosive crashes, malware, etc.

• Better queueing: looking at Sagrada queue
• Grand Unified Configuration System
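
As a hedged illustration of faceted search, a sketch using the official Python client; the host, index, and field names are assumptions, and a terms aggregation stands in for facets:

# Illustrative faceted-style query: crash counts bucketed by signature.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])          # hypothetical host
body = {
    "query": {"term": {"product": "Firefox"}},
    "aggs": {"by_signature": {"terms": {"field": "signature", "size": 10}}},
}
result = es.search(index="crash_reports", body=body)   # assumed index name
for bucket in result["aggregations"]["by_signature"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])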
Everything is open (source)


  Fork: https://github.com/mozilla/socorro
  Read/file/fix bugs: https://bugzilla.mozilla.org/
  Docs: http://www.readthedocs.org/docs/socorro
  Mailing list: https://lists.mozilla.org/listinfo/tools-socorro
  Join us in IRC: irc.mozilla.org #breakpad
Questions?




• Ask me (almost) anything, now or later
• laura@mozilla.com
